Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes

Abstract

Ortholog identification is a crucial first step in comparative genomics. Here, we present a rapid method of ortholog grouping which is effective enough to allow the comparison of many genomes simultaneously. The method takes as input all-against-all similarity data and classifies genes based on the traditional hierarchical clustering algorithm UPGMA. In the course of clustering, the method detects domain fusion or fission events, and splits clusters into domains if required. The subsequent procedure splits the resulting trees such that intra-species paralogous genes are divided into different groups so as to create plausible orthologous groups. As a result, the procedure can split genes into the domains minimally required for ortholog grouping. The procedure, named DomClust, was tested using the COG database as a reference. When comparing several clustering algorithms combined with the conventional bidirectional best-hit (BBH) criterion, we found that our method generally showed better agreement with the COG classification. By comparing the clustering results generated from datasets of different releases, we also found that our method showed relatively good stability in comparison to the BBH-based methods.

INTRODUCTION

Because of the rapid accumulation of data from various high-throughput technologies, comprehensive gene or protein classification is one of the central issues in bioinformatics ( 1 , 2 ). Although classification schemes based on functional roles, molecular interactions or reaction networks have recently become of increasing interest, classification schemes based on sequence or structural similarities are still of fundamental importance. Especially, the accumulation of complete genomes enhances the need for large-scale sequence comparison in light of comparative genomics.

To date, many schemes have been developed to classify proteins into homologous groups, or families. While motifs or profiles that characterize families are helpful for putting the existing knowledge to use ( 3 , 4 ), most of the automated schemes use some clustering techniques applied to precomputed pairwise similarities ( 5 – 11 ). Typically, the focus of these researches has been on either the grouping of weakly similar homologs, or the identification of the building blocks of proteins, termed domains, whose combinations generates the variety of naturally existing proteins.

On the other hand, it has long been recognized that orthology, a kind of homology derived from speciation, should be distinguished from paralogy, a kind of homology derived from duplication ( 12 ). The problem of distinguishing orthologs from paralogs has been formulated as a problem of fitting a gene tree to a species tree ( 13 , 14 ). Recently, the importance of ortholog identification has been widely recognized in the context of comparative genomics. Ortholog identification is crucial for function prediction, since gene functions are typically conserved between orthologs, whereas paralogs can share similar but different functions. Especially, exhaustive ortholog classification is an essential first step in the recently proposed methods of inferring functional linkages between proteins, such as the phylogenetic profile method ( 15 ), the domain fusion method ( 16 , 17 ) and the gene neighborhood method ( 18 ) [reviewed in ( 19 , 20 )]. Phylogenetic profiles can also be used to predict more specific functions when phenotypic traits of each organism are available ( 21 ).

Despite its importance, however, a scheme for large-scale ortholog grouping for multiple genome comparison is yet to be established. The conventional approach to the identification of orthologs between two genomes is the so-called bidirectional best-hit (BBH) criterion, where two genes, a and b , in the genomes A and B , respectively, are considered to be orthologs if a is the best-hit of b in genome A and vice versa. For three or more genomes, orthologous groups can be constructed by extending the BBH relationships with a clustering algorithm. The Clusters of Orthologous Groups (COGs) Database, a widely used curated database for ortholog grouping, was constructed basically through this approach ( 22 ). However, ortholog grouping is not a simple task; the overall COG construction process has included additional complex procedures such as the addition of species-specific paralogs, the splitting of proteins into multiple domains if required, as well as other case-by-case manual modifications ( 23 ). Although several methods have been developed recently to solve some aspects of the ortholog identification problem ( 24 – 31 ), none of them is suitable for any large-scale exhaustive classification comparable to COGs.

Here, we present an efficient algorithm for clustering many protein sequences at the domain level. Our algorithm, DomClust, which was originally developed for our comparative microbial genome database (MBGD) ( 32 ), is a natural extension of the traditional hierarchical clustering algorithm and is suitable for splitting genes into the domains minimally required for ortholog grouping.

MATERIALS AND METHODS

Calculation of sequence similarities

Our procedure takes the results as input of all-against-all pairwise protein sequence comparison containing similarity scores and the beginning and ending positions of the aligned segments. We use the following notations: sim( a , b ) = (score( a , b ), ali_ab , ali_ba ) is the similarity relationship between the sequences a and b , where score( a , b ) is the score between a and b and ali_ab = [ from , to ] is the segment on a that is locally aligned against b . We use square brackets to represent a segment and ali_ab.fr and ali_ab.to to refer to the individual endpoints.

Basically any sequence alignment program can be used, but here we used the following protocol: for each significant match found by the BLASTP program ( 33 ) with an adjusted E -value of ≤0.001, rigorous local alignment ( 34 ) was calculated using the JTT PAM250 scoring matrix ( 35 ). Here, we fixed the search space size to 10 ⁹ for every BLASTP search, and the resulting E -value was further adjusted with E × l_ql_s × 10 ⁻⁵ , where l_q and l_s are the lengths of the query and the subject sequences, respectively; this adjustment favors global matches between short sequences to short local matches between long sequences.

For measuring relatedness, either the similarity score or the distance can be used. While distance is generally more suitable for evolutionary analysis, similarity is a more appropriate measure of local alignment. Here, we used similarity.

Hierarchical clustering as graph contraction

The above data are used to construct a similarity graph, G = ( S , H ), where S is the set of protein sequences (vertices) and H is the set of homologous relationships (edges). Our clustering procedure is basically a successive contraction of this graph ( Figure 1 ) by the traditional hierarchical clustering method known as the unweighted pair-group method using arithmetic averages (UPGMA) ( 36 ). In each iteration, the procedure takes the best similarity edge, say sim( s₁ , s₂ ), and replaces the vertices s₁ and s₂ with a new vertex, s_M , representing the merged cluster. In addition, for each vertex s₃ ∈ S − { s₁ , s₂ }, it also replaces the edges sim( s₁ , s₃ ) and sim( s₂ , s₃ ) with sim( s_M , s₃ ), assigning score score( s_M , s₃ ) = avg ₁₂ [score( s₁ , s₃ ), score( s₂ , s₃ ,)], where avg _AB ( x _A , x _B ) ≡ ( x _A | A | + x _B | B |)/(|A| + | B |) is the group-average function for the quantity x , where | A | denotes the cardinality of the set A . The procedure is repeated until the best score becomes worse than the given cutoff, c .

While the usual UPGMA requires a complete similarity matrix, many edges are actually missing in our similarity graph G , because we consider only significant similarities. Thus, for each iteration with the best edge sim( s₁ , s₂ ), we must consider only a set of vertices S ( s₁ , s₂ ) ≡ { s₃ | sim( s₁ , s₃ ) ∈ H ∨ sim( s₂ , s₃ ) ∈ H }. When one of the two edges is missing, say sim( s₁ , s₃ ) ∉ H , we assign a fixed score (parameter), m < c , to score( s₁ , s₃ ) for calculating the average ( Figure 1 ). While maintaining the algorithmic logic, this modification reduces the computational cost of UPGMA from O(| S | ² ) to O(| H |). Hereafter, we call this simple method ‘gUPGMA’ to distinguish it from the normal UPGMA, where the ‘g’ stands for ‘graphical’ or ‘gap-filling’.

Note that the parameter m combined with the cutoff c can control the granularity of the clustering: m → c approaches the single linkage method, while m →−∞ approaches the complete linkage method. Through a validation test with the COG database as a reference (see below), we found that better results are obtained when m is close to c (data no shown). We set m = 0.95 c throughout the present work. See Supplementary Table S1 for list of parameters used in this work.

Domain splitting: an overview

An exhaustive sequence comparison often reveals a number of domain fusion or fission events ( 37 ). Although such events often provide valuable information for predicting functional linkage ( 16 , 17 ), they complicate the sequence grouping task. To treat these events, we added a process for domain splitting to the basic procedure outlined above ( Figure 2 ). In each iteration with the best edge sim( s₁ , s₂ ), a merged vertex was split into at most five vertices according to the aligned segment of the best scoring edge: the aligned segment itself and the left and right overhangs on either of the sequences (or clusters); we denoted them s_M , s_{L ₁} , s_{L ₂} , s_{R ₁} and s_{R ₂} . For each s₃ ∈ S ( s₁ , s₂ ), the edges sim( s₁ , s₃ ) and sim( s₂ , s₃ ) were reconnected to one or more of the new segments, say s_M , and the information in these edges, say sim( s_M , s₃ ) = (score( s_M , s₃ ), ali_M3 , ali_3M ), was updated appropriately, by averaging the alignment lengths as well as the scores over all of the relationships included in the merged edges.

Figure 2 shows a simplified example, which contains clusters constituted from six domains, A–F, each of which is 100 residues long. One can easily validate this artificial example, since all alignment boundaries are simply equivalent to the domain boundaries, but this assumption generally does not hold in real cases. For practical implementation, it is necessary to answer the following questions: (i) how do we determine the exact positions to be split on the sequences s₁ and s₂ , and (ii) how do we update the information in the edges sim( s₁ , s₃ ) and sim( s₂ , s₃ ) for each s₃ ∈ S ( s₁ , s₂ )?

The resulting structure is a directed acyclic graph representing overlapping trees, as shown in Figure 2C ; it is further processed for orthologous grouping (see below).

Some concepts: segment overlap, position mapping and alignment consistency

Before describing the procedure precisely, we will introduce some definitions. Let | a | denote the length of the segment a , i.e. | a | = a . to − a . fr + 1. For the two segments a and b , a ∪ b = [min( a . fr , b . fr ), max( a . to , b . to )] and a ∩ b = [max( a.fr,b.fr ), min( a.to,b.to )]; we define [ p,q ] = φ if p > q . The segments a and b are overlapped if | a ∩ b | ≥ L_ov ; we denote this by the predicate overlap( a,b ). We also introduce a partial order, ≺, between non-overlapping segments: a ≺ b if a.to < b.fr + L_ov . Here, the minimum overlap length, L_ov , between the segments a and b is determined by L_ov ( a,b ) = min{ r_ovl min(| a |,| b |), max( l_min , r_ov2 max(| a |,| b |))} with three parameters, 0 < r_ov2 < r_ov1 < 1 and l_min . We set ( r_ov1 , r_ov2 , l_min ) = (0.6, 0.3, 50).

The position mapping, T _ji ( x ), is defined as the position of the sequence j aligned against the position x of the sequence i . Although this mapping can be directly obtained from the alignment information, positions other than the endpoints of each alignment need to be mapped by interpolation since our algorithm remembers only the endpoints. Here, we used a simple linear interpolation:

\[ {T}_{ji}\left(x\right)\equiv \{\begin{array}{ll}(x-{f}_{ij})({t}_{ji}-{f}_{ji})/({t}_{ij}-{f}_{ij})+{f}_{ji}& ({f}_{ij}\le x\le {t}_{ij})\\ x-{f}_{ij}+{f}_{ji}& (x < {f}_{ij})\\ x-{t}_{ij}+{t}_{ji}& ({t}_{ij} < x)\end{array}, \]

where f _ij = ali _ij . fr and t _ij = ali _ij . to . We also denote the mapping of the segment S = [ S.fr , S.to ] by T _ji ( S ), which is equivalent to the segment [ T _ji ( S.fr ), T _ji ( S.to )]. For example, the equality ali _ji = T _ji ( ali _ij ) always holds by definition.

In gUPGMA, we consider the three sequences s₁ , s₂ and s₃ at once. Although we usually assume that all pairwise alignments between them are embedded in a common multiple alignment, sometimes this assumption does not hold (especially when there is an internal repetitive structure). To test this, we checked the equivalency of the overlapping patterns between the two segment pairs: { ali₃₁ , ali₃₂ } on s₃ and { ali₁₃ , T₁₂ ( ali₂₃ )} on s₁ ; we denote this by the predicate consistent( s₁ , s₂ , s₃ ). An overlapping pattern is defined by the order of the four endpoints of the segment pair along each sequence. There are 13 possible patterns, among which we defined an equivalency allowing some flexibility (see Supplementary Figure S1).

Determination of domain boundaries

We must determine up to four boundaries, i.e. the left and right boundaries on the sequences s₁ and s₂ ; we denote them

\( {B}_{1}=[{b}_{{\hbox{ L }}_{1}},{b}_{{\hbox{ R }}_{1}}] \)

and

\( {B}_{2}=[{b}_{{\hbox{ L }}_{2}},{b}_{{\hbox{ R }}_{2}}] \)

. The problem would be trivially solved as B₁ = ali₁₂ and B₂ = ali₂₁ if the alignment boundaries simply correspond to the domain boundaries. Although this strategy works in typical cases, because the best-scoring alignments usually have the highest reliability, it can also yield spurious splits when the alignment boundary leads to a partial covering of a domain. Moreover, our purpose here is to split genes into the domains minimally required for ortholog grouping. Therefore, the procedure checks each s₃ ∈ S ( s₁ , s₂ ) and extends the boundary when s₃ supports the extension ( Figure 3A ), and splits the sequence when some of the s₃ sequences supports the split ( Figure 3B ).

We will now focus on the boundary, b_{R ₁} . Starting from the initial boundary,

\( {b}_{{\hbox{ R }}_{1}}^{\prime }=al{i}_{12}.to \)

(the prime indicates a tentative extended boundary), the procedure examines ali₁₃ and ali₂₃ for each s₃ ∈ S ( s₁ , s₂ ) and extends the boundary to the right when the following conditions are met (Condition I; Figure 3A ): (i) sim( s₃ , s₁ ) ∈ H and sim( s₃ , s₂ ) ∈ H , (ii) consistent( s₁ , s₂ , s₃ ), (iii) overlap( ali₁₃ , ali₁₂ ) and overlap( ali₂₃ , ali₂₁ ) and (iv) ali₁₃ . to > ali₁₂ . to and ali₂₃ . to > ali₂₁ . to . Let S₁ ⊂ S ( s₁ , s₂ ) be a set of s₃ satisfying Condition I. Then, the boundary is updated as

\( {b}_{{\hbox{ R }}_{1}}^{\prime }={\hbox{ max }}_{{s}_{3}\in {S}_{\hbox{ I }}}\hbox{ \hspace{1em} }\left\{\mathrm{min}(al{i}_{13}.to,{T}_{12}\left(al{i}_{23}.to\right))\right\} \)

On the other hand, if there is a s₃ ∈ S ( s₁ , s₂ ) satisfying the following conditions (Condition II; Figure 3B ), we should split the sequence: (i) ali₁₂ ≺ ali₁₃ but not ali₂₁ ≺ ali₂₃ (typically sim( s₃ , s₂ ) ∉ H , but domain swapping or repetitive structures may yield other patterns), (ii) | ali₃₁ | ≥ r_cov | s₃ | with a coverage parameter, 0 ≤ r_cov ≤ 1 and (iii) score( s₁ , s₃ ) ≥ c_split with another cutoff score, c_split ≥ c . Only the first condition is essential, which suggests that ali₁₃ and ali₁₂ belong to different domains, while the other conditions are useful to prevent spurious splits from arising when the similarity of ali₁₃ is too weak for an orthologous match.

Let S_II ⊂ S ( s₁ , s₂ ) be a set of s₃ satisfying Condition II. We split the sequence at

\( {b}_{{\hbox{ R }}_{1}}={b}_{{\hbox{ R }}_{1}}^{\prime } \)

if either of the following conditions is satisfied: (i) | S_II | ≥ n_split and

\( \left|{s}_{{\hbox{ R }}_{1}}\right|=\left|{s}_{1}\right|-{b}_{{\hbox{ R }}_{1}}^{\prime }\ge {L}_{\hbox{ min }} \)

, where | S_II | denotes the number of sequences within the clusters contained in the set S_II , or (ii)

\( \left|{s}_{{\hbox{ R }}_{1}}\right|\ge {L}_{\hbox{ max }} > {L}_{\hbox{ min }} \)

regardless of S_II ; otherwise, we set

\( {b}_{{\hbox{ R }}_{1}}=\left|{s}_{1}\right| \)

. The first condition requires that at least n_split sequences satisfy Condition II, whereas the latter allows a sufficiently large (≥ L_max ) unaligned segment to be treated as an independent domain. Here, we set n_split = 1, L_min = 40 and L_max = 400.

Finally the segment s_M is constructed by merging the segments B₁ and B₂ ( Figure 3C ). More precisely, it consists of three parts: the aligned segment with the length l_MM = avg ₁₂ (| ali₁₂ |,| ali₂₁ |) and the left and right unaligned segments with the lengths

\( {l}_{\hbox{ ML }}=\hbox{ max }\left(al{i}_{12}.fr-{b}_{{\hbox{ L }}_{1}},al{i}_{21}.fr-{b}_{{\hbox{ L }}_{2}}\right) \)

and

\( {l}_{\hbox{ MR }}=\hbox{ max }\left({b}_{{\hbox{ R }}_{1}}-al{i}_{12}.to,{b}_{{\hbox{ R }}_{2}}-al{i}_{21}.to\right) \)

, respectively. Using these equations, the aligned region on s_M is expressed as [ l_ML + 1, l_ML + l_MM ]. Consequently, the correspondence between s₁ , s₂ and s_M can be represented as ali_M1 = ali_M2 = [ l_ML + 1, l_ML + l_MM ], ali_1M = ali₁₂ and ali_2M = ali₂₁ ( Figure 3C ), and can be used for mapping positions between the original and merged sequences (through the function T_M1 , etc.).

Updating the alignment information

Next, the alignment information between each s₃ ∈ S ( s₁ , s₂ ) and each new segment ( s_M , s_{L ₁} , etc.) is updated ( Figure 3C ). The procedure is rather straightforward, although somewhat tedious. Here, we briefly list the key points:

Determine the domain boundaries on s₃ (denoted by
\( {B}_{3}=\left[{b}_{{\hbox{ L }}_{3}},{b}_{{\hbox{ R }}_{3}}\right] \)
). If consistent( s₁ , s₂ , s₃ ) holds, set B₃ = avg ₁₂ ( T₃₁ ( B₁ ), T₃₂ ( B₂ )), but here we consider only the boundaries that are actually split. Otherwise, prioritize the sequence with better scoring alignment. For example, when score( s₁ , s₃ ) > score( s₂ , s₃ ), then B₃ = T₃₁ ( B₁ ).
Determine the endpoints of the alignment between s₃ and s_M . If consistent( s₁ , s₂ , s₃ ) holds, then merge the two alignments to construct the new alignment with ali_M3 = ( T_M1 ( ali₁₃ ) ∪ T_M2 ( ali₂₃ )) ∩ [1,| s_M |] and
\( al{i}_{3\hbox{ M }}=\left(al{i}_{31}\cup al{i}_{32}\right)\cap [{b}_{{\hbox{ L }}_{3}},{b}_{{\hbox{ R }}_{3}}] \)
. Otherwise, take the alignment with the better score for the new alignment. For example, when score( s₁ , s₃ ) > score( s₂ , s₃ ), then ali_M3 = T_M1 ( ali₁₃ ) ∩ [1,| s_M |] and
\( al{i}_{3\hbox{ M }}=al{i}_{31}\cap [{b}_{{\hbox{ L }}_{3}},{b}_{{\hbox{ R }}_{3}}] \)
(this includes the case sim( s₃ , s₂ ) ∉ H ).
Determine the endpoints of the other alignments. For example, the alignment between s₃ and
\( {s}_{{\hbox{ L }}_{1}} \)
is given by
\( al{i}_{{\hbox{ L }}_{1}3}=al{i}_{13}\cap [1,{b}_{{\hbox{ L }}_{1}}-1] \)
and
\( al{i}_{3{\hbox{ L }}_{1}}=al{i}_{31}\cap [1,{b}_{{\hbox{ L }}_{3}}-1] \)
, and the alignment between s₃ and
\( {s}_{{\hbox{ R }}_{1}} \)
is given by
\( {\mathrm{ali}}_{{\hbox{ R }}_{1}3}=(al{i}_{13}\cap [{b}_{{\hbox{ R }}_{1}}+1,|{s}_{1}\left|\right])-{b}_{{\hbox{ R }}_{1}} \)
(subtraction means translation) and
\( al{i}_{3{\hbox{ R }}_{1}}=al{i}_{31}\cap [{b}_{{\hbox{ R }}_{3}}+1,\left|{s}_{3}\right|] \)
.
Assign a score to each new alignment. As described previously, score( s_M , s₃ ) = avg ₁₂ (score( s₁ , s₃ ),score( s₂ , s₃ )), where the missing edges are assigned the score m ; on the other hand,
\( \hbox{ score }({s}_{{\hbox{ L }}_{1}},{s}_{3})=\hbox{ score }({s}_{1},{s}_{3}). \)
Do not split the score even when the alignment is split into multiple domains.

Phylogenetic-tree cutting

In the structure shown in Figure 2C , one can identify the clusters of homologous domains by traversing it from top to bottom. Also, one can trace the domain-splitting history of each gene by traversing it in reverse direction. One can split genes into the domains minimally required for ortholog grouping by specifying a set of internal nodes as roots.

To do this, we can fit the gene tree into the species tree ( 13 , 14 ) if we can safely assume that there are no horizontal transfers and know both the true gene tree and the true species tree. However, this assumption does not hold in practice, especially in microbial genomes, where a substantial number of horizontal transfers have occurred. Here, we took a simpler but more practical approach: each root node is recursively cut until the two clusters merged at that node share no or few intra-species paralogous genes ( Figure 4A ). More precisely, a root node with two child nodes, A and B , is cut when | Ph ( A ) ∩ Ph ( B )|/min(| Ph ( A )|,| Ph ( B )|) > p with a given cutoff parameter, p , where Ph ( A ) denotes the set of species contained in cluster A (phylogenetic pattern). Strictly speaking, the parameter p must be 0 according to the definition of orthologs, but actually we found that relaxed conditions, around p = 0.5, often generated a more plausible classification. Here, one can incorporate taxonomic relations by counting closely related species only once. In the validation test described below, we adopted this strategy using the set of related species that are assigned the same code in the COG database (see Supplementary Table S2).

After the set of root nodes is determined, the domain boundaries on each sequence are determined by mapping the boundary positions stored in each internal node on to its leaf nodes by applying the mapping functions T_1M and T_2M recursively.

Rejoining adjacent domains

Sometimes, the tree-cutting procedure violates the domain-splitting criteria given above. Therefore, the program reexamines the criteria for each adjacent pair of domains in the final step: two adjacent clusters are rejoined when (i) all members in one of the clusters are included in the other, and (ii) the average length of the included cluster is shorter than L_max . We also considered more aggressive rejoining at the final step: the two clusters A and B are joined when either of the following conditions is satisfied ( Figure 4B ): (i) |adj( A , B )| ≥ r_adj1 max(| A |,| B |), where adj( A , B ) is the set of adjacent segments belonging to the clusters A and B , or (ii) |adj( A , B )| ≥ r_adj2 min(| A |,| B |) and the average length of the smaller group is shorter than L_max ; the latter is a relaxed version of the above original criteria. Here we considered two parameters, 0 ≤ r_adj1 ≤ r_adj2 ≤ 1, and found the best values through the COG recovery test described below. This procedure could improve the clustering quality, since split genes found only in one or a few genomes often have abnormal split points, which are unlikely to correspond to meaningful domain boundaries.

Validation test

Our program, DomClust, was implemented in the C programming language. The program was validated using the COG database ( http://www.ncbi.nlm.nih.gov/COG/ ) as a reference. Here we mainly used the 2002 update of the COG database ( 23 ) (referred to as ‘COG02’; 43 genomes, 104 094 sequences, 3307 COGs) but we also used the 2003 update of the COG database ( 38 ) (referred to as ‘COG03’; 66 genomes, 185 898 sequences, 4873 COGs) for comparison. Orthologous groups were reconstructed from the set of sequences in order to create the COGs, which were obtained from the COG web site. Throughout the work, as the original definition of a COG, we considered only orthologous groups comprising genes of at least three phylogenetically distinct organisms (defined by Supplementary Table S2). After eliminating entries that did not satisfy this condition, we retained 3192 COGs in the COG02 set. From this set, we further eliminated groups containing too many paralogs to be considered orthologous groups, and groups conserved in only a few species, which are generally biased and less interesting. For this, we defined a ‘well-defined orthologous group’ (WDOG) somewhat arbitrarily as a group G such that | G |/| Ph ( G )| ≤ 2 and | Ph ( G )| ≥ 5. We extracted 2360 WDOGs.

To evaluate the agreement between the two grouping systems, we first identified a set of corresponding group pairs; for each reference (COG) group, the best compatible group was selected from the target groups. The compatibility between the reference ( R ) and the target ( T ) groups was evaluated using Jaccard's coefficient | R ∧ T | / (| R | + | T | − | R ∧ T |), where R ∧ T denotes a set of segment pairs overlapping between R and T . Here, we considered that the segments a and b are overlapped if | a ∩ b |/max(| a |,| b |) > 0.5. After the set of corresponding group pairs was identified, the total agreement was evaluated by the number of matching group pairs (MGPs). For this, we introduced the following MGPs: an exact MGP is a group pair satisfying | R ∧ T | = | R | = | T |, a phylo-MGP is a group pair satisfying | Ph ( R ∧ T )| = | Ph ( R )| = | Ph ( T )| and | R | ≥ | T |, and a m %-MGP is a group pair satisfying | R ∧ T |/| R | ≥ m /100 and | R ∧ T |/| T | ≥ m /100. Let M _Ex , M _Ph , M₈₀ and M₆₀ be the number of exact, phylo-, 80%-, 60%-MGPs, respectively. The total agreement was evaluated by M_Tot = M _Ex + M _Ph + M₈₀ + M₆₀ . Note that there are overlaps between the various types of MGP, e.g. an exact MGP is also any other type of MGP, and therefore is weighted 4 times as much as a proper 60%-MGP.

For the purpose of comparison, we also tested the following clustering methods: (i) single linkage clustering (SLink), (ii) ‘triangular linkage clustering’ (TriLink), which is our implementation of the original method described in the COG paper ( 22 ) and consists in the merging of triangles in the graph of the best hits when they share the same side and (iii) the TribeMCL algorithm ( 9 ) (the program available at http://micans.org/mcl ), which uses the equilibrium probabilities of a Markov chain defined on the similarity graph for evaluating transitive similarities, a method recently applied to the ortholog grouping problem ( 30 ).

For these methods, the BBH relationships were used as input. Here, we used a relaxed criterion: a gene pair ( a , b ) of the genomes A and B is considered to be in a BBH relationship when the genes satisfy score( a , b ) ≥ r_BH max {max _{y ∈ B} [score( a , y )],max _{x ∈ A} [score( b , x )]}, with one parameter, 0 ≤ r_BH ≤ 1 ( r _BH = 1 corresponds to the rigorous BBH). Intra-species homologs were also included when a gene pair ( a , b ) of the genome A satisfied score( a , b ) ≥ r_BH max {max _{x ∈ G − A} [score( a , x )],max _{x ∈ G − A} [score( b , x )]}, where G – A denotes all genomes except A . In addition, we also prepared BBH relationships from which alignments with low coverage, defined as max(| ali₁₂ |/| s₁ |,| ali₂₁ |/| s₂ |) < r_cov , were filtered out. During the evaluation, we systematically changed the parameters r _BH and r_cov together with the score cutoff c (for SLink and TriLink) or the inflation parameter I (for TribeMCL) to find the parameter set that gives the best M_Tot . On the other hand, all similarity relationships were used in DomClust, since the algorithm already includes such best-hit-first criteria.

Stability of the clustering methods

We evaluated the stability of the clustering methods by comparing two sets of clusters created from two different datasets, COG02 (old) and COG03 (new). For each method, the same parameters as those selected above (i.e. those giving the best M_Tot for the COG02 set) were used. After clustering, sequences only in the new dataset were eliminated. In the comparison, we examined which new clusters the members of each old cluster belonged to, and classified each old cluster ( C_old ) into the following six categories: (i) unchanged: there is a unique new cluster that consists of the same members as C_old ; (ii) group-merged: C_old is merged with other old clusters into a new cluster; (iii) group-divided: C_old is divided into multiple new clusters, each of which contain no member of the other old clusters; (iv) domain-merged: similar to group-merged, but allowing different domains (usually adjacent to one another) to fuse into one; (v) domain-divided: similar to group-divided, but allowing splits of a sequence into multiple domains; (vi) others, including more complex cases. We considered the ratio of ‘unchanged’ clusters as an indication of stability. See Supplementary data for the precise definition.

RESULTS

DomClust took about a minute or more to classify the entire COG02 dataset (| H | = 3.5 × 10 ⁶ ) on a 2.4 GHz Xeon machine. This was more than four times faster than TribeMCL with BBH relationships when using the selected set of parameters, even though TribeMCL used a smaller input ( Table 1 ). DomClust has been available on the MBGD server ( http://mbgd.genome.ad.jp/ ) ( 32 ), on which there are more than 200 classified microbial genomes (the DomClust program itself is available at http://mbgd.genome.ad.jp/domclust/ ), but here we focused on the validation of the method using this smaller dataset. At first, as an example of DomClust classification containing domain fusion or fission events, we show the orthologous groups of RNA polymerase beta (RpoB) and beta' (RpoC) subunits ( Figure 5 ). These subunits are fused into one gene in the genomes of two strains of Helicobacter pylori , 26695 (Hpy) and J99 (jHp), while in most archaea, each subunit is further divided into two genes. The history of domain splitting can be traced by traversing the tree in a bottom-up manner ( Figure 5A ): the algorithm first joined the fused genes of Hpy and jHp, and then it divided the cluster into two domains when joining the cluster with the RpoC ortholog of the Campylobacter jejuni (Cje) genome; the remainder domain was subsequently joined with the RpoB ortholog of Cje. By repeating this procedure, DomClust finally identified five orthologous domains ( Figure 5B ).

Evaluation of the clustering methods: the COG recovery test

To evaluate the DomClust algorithm, we performed COG recovery tests using as a reference the set of 2360 WDOGs from the COG02 dataset. First, we performed extensive tests to determine the parameters giving the best agreement with the COG, according to the total agreement value, M_Tot (Supplementary Table S1). Next, we compared the results with those of the other clustering methods (SLink, TriLink and TribeMCL) applied to the BBH similarity data ( Table 2 and Figure 6 ). For each method, the best parameter set was again selected using the agreement value M_Tot ( Table 2 ). In addition, we also tested the DomClust algorithm without a domain-splitting procedure, i.e. gUPGMA with the phylogenetic-tree cutting procedure (designated simply as gUPGMA).

DomClust recovered 1060 WDOGs (44.9%) exactly and 1429 WDOGs (60.6%) as phylo-MGPs. These values were greater than those of the other methods ( Figure 6 ). The resulting order of M_Tot values was SLink (4296) < TriLink (4702) < TribeMCL (5638) < gUPGMA (5827) < DomClust (6176). Note that the TriLink method taken from the original COG construction protocol did not recover the COG groups well, suggesting that the actual COG construction process was far more complex.

We tested the significance of the differences between the methods adjacent in the above inequality with the sign test, where the numbers of entries that showed better/worse agreement were counted, and the P- value was calculated using binomial distribution with a success probability of 0.5. As a result, we found all the differences highly significant ( P < 10 ⁻¹⁰ ), except the difference between TribeMCL and gUPGMA, which was only marginally significant ( P = 0.004) (see Supplementary Figure S2). The performance superiority of DomClust over gUPGMA indicates that the domain-splitting procedure was indeed effective. The difference between TribeMCL and gUPGMA was less clear, but the advantage of gUPGMA over TribeMCL appeared to be the larger M _Ph value: while the M _Ex values of TribeMCL (930) and gUPGMA (945) were almost the same, a clearer difference was observed between the M _Ph values of TribeMCL (1229) and gUPGMA (1305).

According to the definition, two distinct subfamilies that share the same phylogenetic pattern should always be paralogous. This is the key point of our phylogenetic-tree cutting procedure. However, in the COG database, the subfamilies A and B are sometimes merged even if Ph ( A ) ⊆ Ph ( B ). One of the examples is COG0593 (ATPase involved in DNA replication initiation), in which the DnaA orthologs conserved throughout the bacteria are merged with a distinct subfamily distributed among some gram-negative bacteria. In such a case, the counterpart of phylo-MGP (with the same phylogenetic pattern but with a smaller group size) can be more appropriate than the COG group. In this sense, we consider the M _Ph value to be a good indication of the classification performance. This feature of gUPGMA and DomClust also appears in the average number of inparalogs ( Table 2 ); the clusters generated by these methods contain fewer inparalogs than those generated by other methods.

Note that the phylogenetic pattern itself can play an important role in function prediction ( 15 ). Therefore, our results, in which the agreement of the phylogenetic patterns between the automatic and the manual classification was at most 60%, suggest that the limitation of this approach may be in the definition of orthologous groups. Note also that there is also an erroneous phylo-MGP pattern, i.e. the incorrect elimination of some lineage-specific paralogs. This pattern typically appears when a simple BBH strategy is used. From the nature of the algorithm, though, we can safely rule out this possibility for the gUPGMA and DomClust methods.

Validation of domain boundaries

During the test described above, DomClust identified 5377 split points in 4630 sequences, whereas the COG02 groups contained 3794 split points in 3117 sequences (here, we also counted the domains that were excluded from the final set of clusters due to the lack of orthologs in phylogenetically distinct organisms). The number of clusters containing genes that were split into multiple domains was 1302 (DomClust) or 971 (COG02); about 30% of the clusters contained such genes in either classification ( Table 2 ). There were 1913 sequences (41.3% for DomClust and 61.2% for COG) that were commonly split by both methods ( Table 3 ). The number of split points actually depended on the parameters, especially the rejoining parameters r_adj1 and r_adj2 . By reducing ( r_adj1 , r_adj2 ) from (1, 1) to (0.8, 0.95), the selected set, we could reduce the number of split points to 68.7% (5377/7822), while the number of sequences commonly split was only slightly reduced (1913/1978, 96.7%) ( Table 3 ). This means that many of the partially matched orthologs were probably lineage-specific fission (rather than fusion) genes, some of which might have resulted from frameshift mutations or errors ( 37 ); these were in many cases treated as a single cluster rather than split clusters in the COG database. However, it is very difficult to find clear-cut general criteria for determining whether the clusters should be split or not. Indeed, although we were able to reduce the number of split points by further reducing the rejoining parameters, we could not improve the agreement ratio substantially ( Table 3 ).

We also examined the positional differences between corresponding (i.e. nearest) boundaries on the 1913 sequences commonly split by the two methods. In total, 1498 split points (66.9% of the 2240 split points) were located within ±50 residues ( Table 3 ). We think that this correspondence is fairly good, considering that our main purpose was to group sequences correctly rather than to determine accurate domain boundaries.

Clustering stability

It is a desirable feature of clustering methods to avoid extreme changes between releases. To test the stability of each method, we compared the clustering results created from the COG02 and COG03 datasets with the same method and parameters ( Figure 7A ). First of all, the manually maintained COG database was extremely stable: about 75% of the clusters were ‘unchanged.’ This is probably because the actual COG updating process was incremental ( 38 ). The automated methods were generally less stable, but TriLink was extremely unstable. This instability seems to be related to the emergence of a ‘giant component’ in the COG03 clusters (Supplementary Figure S3), which is a common problem of SLink and related methods. On the contrary, SLink appeared to be relatively stable at first glance (note that very stringent conditions were set for SLink; see Table 2 ). However, when the total effect was evaluated based on the number of cluster members ( Figure 7B ), the ratio of unchanged entries in SLink was substantially reduced, indicating that a relatively large number of small clusters accounts for a large proportion of ‘unchanged’ clusters (see Supplementary Figure S3).

On the whole, gUPGMA showed relatively good stability. Here, the ‘group-merged’-to-‘group-divided’ ratio appeared to be balanced in contrast to SLink, TriLink and TribeMCL, where the ratios were biased toward ‘group-merged’ clusters (it is a logical consequence of the SLink algorithm that all changes must be ‘group-merged,’ although exceptions may arise due to the BBH selection step). DomClust showed a similar tendency but was less stable than gUPGMA, probably due to the additional complexity of the domain-splitting procedure (see below). In terms of the effects of domain splitting, the COG database was again extremely stable. Note that extreme stability can also be a drawback, in that an extremely stable method is unable to catch up with meaningful changes between different releases.

Change in orthologous-domain organization: an example

As an example of DomClust groups whose domain organizations were changed between different datasets, here we show the orthologous groups of ModE protein, a molybdate-responsive transcription regulator ( Figure 8 ) (in the following example, we used different parameters from the default ones to simplify the example). When using the COG02 dataset, DomClust split the Escherichia coli ModE protein into two domains belonging to the groups referred to here as Group A and Group B, respectively ( Figure 8A ); Group B, the C-terminal one, also included individual proteins (e.g. HI1370) annotated as molybdopterin-binding proteins, MopI. On the other hand, when using the COG03 dataset, DomClust did not split the ModE protein, but classified it into a different group (Group C) than that containing MopI (Group D) ( Figure 8B ). It is known that ModE consists of two domains: the N-terminal domain carrying a helix–turn–helix motif (LysR type) for binding to DNA and the C-terminal domain containing the molybdate-binding site, which is similar to MopI ( 39 , 40 ); indeed, the C-terminal domain contains two TOBE domains (Pfam ID: PF03459). Therefore, the domain splitting shown in Figure 8A is reasonable, but this does not mean that these are indeed reasonable ‘orthologous’ domains. So how did this organizational change occur?

Actually, in this case, the phylogenetic-tree cutting procedure was responsible for this change. When using the COG02 dataset, the algorithm directly generated two separate domains corresponding to Group A and Group B, respectively ( Figure 8C ), since in this case the criterion for tree cutting was not fulfilled at either of the root nodes (encircled in Figure 8C ). On the other hand, when using the COG03 dataset, the algorithm separated Group D from Group C at the node indicated by the blue horizontal bar in Figure 8D , because the phylogenetic pattern of the right subtree largely overlapped that of the left subtree ( Figure 8D ). As a result, the algorithm rejoined the domain in the remaining (overlapping) subtree with the adjacent domain and generated Group C.

This example clearly illustrates the complexity and difficulty of the orthologous-domain classification problem, which are quite different from those of the usual domain classification problem. Probably this complexity reduces the clustering stability of DomClust; nonetheless, we would like to emphasize that the stability of DomClust is not worse than that of the BBH-based methods tested here ( Figure 7 ).

DISCUSSION

DomClust as a tool for orthologous-domain classification: summary

We presented here a method useful for grouping orthologous domains in multiple genomes. To our knowledge, this is the first report of a fully automated method of large-scale orthologous grouping at the domain level, and it is also the first direct application of the traditional UPGMA to this kind of problem. Although most of the previous studies have relied on the BBH criterion, BBH pairs are not always orthologs. Two non-orthologous genes may become a BBH pair when both of their respective orthologs have been lost. Lineage-specific paralogs, or inparalogs ( 41 ), can complicate the problem, although it can be overcome by some elaborate methods such as Inparanoid ( 24 ). A more serious problem is probably the fact that the extension of pairwise ortholog relationships to multiple genomes by SLink or its relatives such as TriLink is generally not relevant, because of the non-transitive nature of orthologous relationships ( 42 ). UPGMA, on which our algorithm is based, can solve the problems naturally. Indeed, we have shown here that UPGMA-based methods have an advantage over the BBH-based ones in terms not only of validity, measured through the COG recovery test, but also of stability. The main feature of our algorithm is its integration of the domain-splitting procedure into UPGMA. A comparison of DomClust with simple gUPGMA with the COG recovery test has clearly corroborated this advantage.

DomClust as a tool for homologous-domain classification

Although DomClust was developed mainly for ortholog grouping, it is also applicable to general homologous-domain classification by simply omitting the phylogenetic-tree cutting procedure. To date, numerous methods have been developed to address this issue ( 1 , 43 ). Since homology is a transitive relationship, SLink may be the natural solution to the problem of grouping homologs. However, since SLink is known to chain different domains, such a method requires either implicit or explicit handling of multidomain structures. Clearly, DomClust belongs to the latter (explicit handling) class of methods, but first we note that gUPGMA without domain splitting already has some properties common to methods of the former (implicit handling) class: (i) ProtoMap ( 8 ) and CluSTr ( 44 ) use hierarchical clustering schemes that are recursive applications of SLink clustering with various cutoff values. They are similar to gUPGMA, but require a series of arbitrary cutoff values. A more seamless approach is taken by ProtoNet ( 5 ), a recent replacement of ProtoMap, whose clustering scheme ( 45 ) is probably very similar to gUPGMA, especially when using the geometric means of the E -values as merged scores. (ii) GeneRAGE ( 11 ) is based on SLink clustering, but it uses a simple criterion to detect multidomain proteins and splits clusters using this information. The criterion is based on a violation of transitivity in similarity relations, which can be represented as the incomplete triangle shown in Figure 1 . gUPGMA imposes a penalty on such a relationship by assigning it a bad score, which prevents multidomain proteins from connecting different domain clusters, although DomClust treats this issue more explicitly. (iii) The normalized cut algorithm ( 25 ) and TribeMCL ( 9 ) are more sophisticated methods which use a network flow model on a similarity graph. Although these algorithms are quite different from gUPGMA, all of these methods take both the similarity scores (as edge weights) and graph connectivity into consideration in the clustering processes.

Among the classification methods which explicitly treat the domain structure, such as ProDom ( 6 ), DOMO ( 7 ) and ADDA ( 10 ), the salient feature of the DomClust algorithm is again its integration of a domain splitting procedure into the hierarchical clustering scheme. Because of the greedy nature of the algorithm, it can be a cost-effective solution to the domain splitting problem, just as the progressive alignment method is a cost-effective solution to the problem of gap insertion in multiple alignment. In fact, DomClust was even much faster than TribeMCL in our COG recovery test ( Table 1 ). However, we must note some limitations. First, domain boundaries can become very ambiguous near the root node. This is partly because the procedure averages the boundaries at each clustering step. Moreover, the alignment qualities generally become worse between less similar sequences. Second, the procedure cannot handle repetitive structures correctly. This difficulty derives from step (ii) of the alignment updating procedure. These problems are partly due to the fact that DomClust does not calculate explicit sequence alignment, which generally requires much computation time. However, in practice these problems have relatively little effect on ortholog grouping.

Difficulty of the ortholog grouping problem

In the original definition by Fitch ( 12 ), orthology is a well-defined concept in terms of evolution, and knowledge of the correct evolutionary history (gene and species trees) is the only prerequisite for ortholog grouping. It is interesting to imagine that a hierarchical structure such as the one shown in Figure 5A might truly illustrate the evolutionary history of complex domain structures. Although this poses an interesting problem, unfortunately currently we cannot expect so much; the similarity score does not exactly reflect the evolutionary distance, and the UPGMA algorithm often fails to reconstruct an evolutionary tree unless the molecular clock assumption holds.

In fact, the definition of orthologous groups is still somewhat obscure. Although this is not stated in Fitch's original definition, many biologists strongly expect an orthologous group to be a group of functionally equivalent genes. In addition, for prokaryotic genomes, the original definition based on speciation has already been violated due to the existence of horizontal transfers. Domain fusion/fission events add further complexity to the problem. In this context, a similarity-based clustering method can be a practical solution to the problem of gene classification. In the COG database, many difficulties were overcome by manual modifications. However, COG grouping is not the only ‘correct’ classification; indeed, orthologous relationships may be different according to the set of organisms considered ( 24 , 32 ). Moreover, manual maintenance will soon become impractical, as the number of genomes increases. In this study, we incorporated many complex classification processes into a relatively simple framework with parameters adjusted using the COG database as a reference. But there is still room for improvement, in that a more rigorous scheme based on the original ortholog definition must be implemented, especially for eukaryotic genome comparison. We believe that our method is flexible enough to adapt to more rigorous schemes which might be developed in the future, and that it provides efficient ways of creating orthologous groups from huge numbers of genomes, which is the crucial first step in large-scale comparative genomics.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

Figure 1

Hierarchical clustering as graph contraction. The best similarity edge is indicated by the thick line, and the missing edges to which a fixed score is assigned are indicated by the broken lines.

Open in new tab Download slide

Figure 2

Overview of the domain splitting procedure. ( A ) A similarity graph that has 7 clusters (vertices) containing 12 sequences, which are constituted from six domains, A–F, each of which is 100 residues long. The numbers in brackets on each edge indicate the coordinates of the aligned segments. The best similarity edge, sim( s₁ , s₂ ), that is selected for merging is colored red. At the bottom is the schematic illustration of the alignment between s₁ and s₂ . ( B ) A similarity graph, after the two clusters have been merged and split. The newly created nodes are colored red. ( C ) The resulting clustering tree. The process of merging ABCD and BCDE, at the center of the figure, corresponds to the process shown in (A) and (B). In this case, CD is not actually split, because D is always adjacent to C, and neither is EF.

Open in new tab Download slide

Detail of the domain-splitting procedure. Here, s1 and s2 are the sequences to be merged, and s3 is a sequence similar to either s1 or s2 . The aligned segments on each sequence are indicated by rectangles which have the same pattern for corresponding segments. ( A ) Condition I: the domain boundaries will be extended. ( B ) Condition II: the sequence (in this case, s1 ) will be split. ( C ) Updating the alignment information. In this example, s1 and s2 (upper half) are merged and split into sM and s R 1 (lower half). The thick lines indicate the extended regions. See text for details.

Figure 3

Detail of the domain-splitting procedure. Here, s₁ and s₂ are the sequences to be merged, and s₃ is a sequence similar to either s₁ or s₂ . The aligned segments on each sequence are indicated by rectangles which have the same pattern for corresponding segments. ( A ) Condition I: the domain boundaries will be extended. ( B ) Condition II: the sequence (in this case, s₁ ) will be split. ( C ) Updating the alignment information. In this example, s₁ and s₂ (upper half) are merged and split into s_M and s_{R ₁} (lower half). The thick lines indicate the extended regions. See text for details.

Open in new tab Download slide

Figure 4

Post-processing of the clustering algorithm. ( A ) The phylogenetic-tree cutting procedure. The procedure recursively checks whether child clusters of each root node share intra-species paralogs. The three-letter code on each leaf indicates the species name, and the number at each internal node indicates the number of species shared by its child clusters (| Ph ( A ) ∩ Ph ( B )|). ( B ) Rejoining adjacent clusters. Two clusters, A and B , are joined when most of the segments (cutoff ratio r_adj1 ) belonging to each cluster are adjacent to each other [the segment set adj( A , B )], or almost all of the segments (cutoff ratio r_adj2 ) belonging to the smaller clusters are adjacent to the other cluster.

Open in new tab Download slide

Figure 5

Orthologous groups of RNA polymerase beta (RpoB) and beta' (RpoC) subunits. ( A ) Hierarchical clustering trees created by the DomClust program. Each domain is drawn in a different color. An abbreviated species name (taken from the COG database; see Supplementary Table S2) is shown on each leaf, which is colored according to the kingdom: salmon, bacteria; khaki, archaea; sky-blue, eukaryotes. ( B ) Schematic illustration of the gene structures of RpoB and RpoC in selected genomes.

Open in new tab Download slide

Table 1

Open in new tab

CPU time required for clustering the entire dataset

Methods	COG02 dataset			COG03 dataset
	Ta (s)	\| H \| b	T /\| H \| (× 10 ⁻⁶ )	T (s)	\| H \|	T /\| H \| (× 10 ⁻⁶ )
DomClust	65.2	3 500 593	18.6	281.3	12 250 884	23.0
TribeMCL c	277.5	846210	327.9	1231.5	2833190	434.7

Methods	COG02 dataset			COG03 dataset
	Ta (s)	\| H \| b	T /\| H \| (× 10 ⁻⁶ )	T (s)	\| H \|	T /\| H \| (× 10 ⁻⁶ )
DomClust	65.2	3 500 593	18.6	281.3	12 250 884	23.0
TribeMCL c	277.5	846210	327.9	1231.5	2833190	434.7

^a CPU time on a 2.4 GHz Xeon machine.

^b The number of pairwise similarities used as input.

^c BBH relationships (with the selected parameters, r _BH = 0.7 and r_cov = 0.4) were used as input.

Table 1

Open in new tab

CPU time required for clustering the entire dataset

Methods	COG02 dataset			COG03 dataset
	Ta (s)	\| H \| b	T /\| H \| (× 10 ⁻⁶ )	T (s)	\| H \|	T /\| H \| (× 10 ⁻⁶ )
DomClust	65.2	3 500 593	18.6	281.3	12 250 884	23.0
TribeMCL c	277.5	846210	327.9	1231.5	2833190	434.7

Methods	COG02 dataset			COG03 dataset
	Ta (s)	\| H \| b	T /\| H \| (× 10 ⁻⁶ )	T (s)	\| H \|	T /\| H \| (× 10 ⁻⁶ )
DomClust	65.2	3 500 593	18.6	281.3	12 250 884	23.0
TribeMCL c	277.5	846210	327.9	1231.5	2833190	434.7

^a CPU time on a 2.4 GHz Xeon machine.

^b The number of pairwise similarities used as input.

^c BBH relationships (with the selected parameters, r _BH = 0.7 and r_cov = 0.4) were used as input.

Figure 6

Comparison of the various clustering methods with the COG recovery test, where evaluation was done according to the numbers of exact, phylo-, 80%- and 60%-MGPs against the 2360 WDOG reference set.

Open in new tab Download slide

Table 2

Open in new tab

Summary of the selected clusters

Methods	Selected parameters a			No. of clusters b
	r _BH	r_cov	Cutoff c	Total	Split d	Maxsize e	Inparalogs f
SLink	1.0	0.8	80	2910	—	3901	1.60
TriLink	0.8	0.4	60	3134	—	1757	1.66
TribeMCL	0.7	0.4	1.5	3652	—	1100	1.52
gUPGMA	—	—	60	4126	—	473	1.39
DomClust	—	—	60	4412	1302	396	1.41
COG	—	—	—	3192	971	806	1.65

Methods	Selected parameters a			No. of clusters b
	r _BH	r_cov	Cutoff c	Total	Split d	Maxsize e	Inparalogs f
SLink	1.0	0.8	80	2910	—	3901	1.60
TriLink	0.8	0.4	60	3134	—	1757	1.66
TribeMCL	0.7	0.4	1.5	3652	—	1100	1.52
gUPGMA	—	—	60	4126	—	473	1.39
DomClust	—	—	60	4412	1302	396	1.41
COG	—	—	—	3192	971	806	1.65

^a The parameter set yielding the best agreement with the COG classification ( M_Tot ). Note that DomClust uses more parameters. See Supplementary Table S1 for a detailed list.

^b The number of clusters comprising genes of at least three phylogenetically distinct organisms.

^c Cutoff scores for SLink, TriLink, gUPGMA and DomClust, and the inflation parameter for TribeMCL.

^d The number of clusters containing genes that were split into multiple domains.

^e The maximum cluster size.

^f The average number of inparalogs, which is defined as the sum of cluster sizes divided by the sum of the numbers of organisms included in all clusters. The term ‘inparalog’ is adopted from ref. ( 41 ).

Table 2

Open in new tab

Summary of the selected clusters

Methods	Selected parameters a			No. of clusters b
	r _BH	r_cov	Cutoff c	Total	Split d	Maxsize e	Inparalogs f
SLink	1.0	0.8	80	2910	—	3901	1.60
TriLink	0.8	0.4	60	3134	—	1757	1.66
TribeMCL	0.7	0.4	1.5	3652	—	1100	1.52
gUPGMA	—	—	60	4126	—	473	1.39
DomClust	—	—	60	4412	1302	396	1.41
COG	—	—	—	3192	971	806	1.65

Methods	Selected parameters a			No. of clusters b
	r _BH	r_cov	Cutoff c	Total	Split d	Maxsize e	Inparalogs f
SLink	1.0	0.8	80	2910	—	3901	1.60
TriLink	0.8	0.4	60	3134	—	1757	1.66
TribeMCL	0.7	0.4	1.5	3652	—	1100	1.52
gUPGMA	—	—	60	4126	—	473	1.39
DomClust	—	—	60	4412	1302	396	1.41
COG	—	—	—	3192	971	806	1.65

^a The parameter set yielding the best agreement with the COG classification ( M_Tot ). Note that DomClust uses more parameters. See Supplementary Table S1 for a detailed list.

^b The number of clusters comprising genes of at least three phylogenetically distinct organisms.

^c Cutoff scores for SLink, TriLink, gUPGMA and DomClust, and the inflation parameter for TribeMCL.

^d The number of clusters containing genes that were split into multiple domains.

^e The maximum cluster size.

Table 3

Open in new tab

Total number of splitting points and the number of splitting points that agree with COG for each parameter

Parameters		No. of splitting points (No. of sequences)
r_adj1	r_adj2	Total	Agreement with COG a	Within 50 amino acids b
1.0	1.0	7822 (6665)	2418 (1978)	1562 (1474)
0.8	0.95	5377 (4630)	2240 (1913)	1498 (1410)
0.7	0.9	4374 (3799)	1998 (1716)	1337 (1260)

Parameters		No. of splitting points (No. of sequences)
r_adj1	r_adj2	Total	Agreement with COG a	Within 50 amino acids b
1.0	1.0	7822 (6665)	2418 (1978)	1562 (1474)
0.8	0.95	5377 (4630)	2240 (1913)	1498 (1410)
0.7	0.9	4374 (3799)	1998 (1716)	1337 (1260)

^a The number of sequences split by both DomClust and COG and the number of splitting points on these sequences in DomClust.

^b The number of splitting points in DomClust that were located within 50 residues of their corresponding splitting points in COG.

Table 3

Open in new tab

Total number of splitting points and the number of splitting points that agree with COG for each parameter

Parameters		No. of splitting points (No. of sequences)
r_adj1	r_adj2	Total	Agreement with COG a	Within 50 amino acids b
1.0	1.0	7822 (6665)	2418 (1978)	1562 (1474)
0.8	0.95	5377 (4630)	2240 (1913)	1498 (1410)
0.7	0.9	4374 (3799)	1998 (1716)	1337 (1260)

Parameters		No. of splitting points (No. of sequences)
r_adj1	r_adj2	Total	Agreement with COG a	Within 50 amino acids b
1.0	1.0	7822 (6665)	2418 (1978)	1562 (1474)
0.8	0.95	5377 (4630)	2240 (1913)	1498 (1410)
0.7	0.9	4374 (3799)	1998 (1716)	1337 (1260)

^a The number of sequences split by both DomClust and COG and the number of splitting points on these sequences in DomClust.

^b The number of splitting points in DomClust that were located within 50 residues of their corresponding splitting points in COG.

Figure 7

Stability of the clustering methods measured through a comparison of the clusters created from the COG02 and COG03 datasets. Five automatic classification methods and the original COG database (COG) were examined. ( A ) The ratios of the number of clusters created from the COG02 datasets classified into each category. ( B ) The ratios of the number of entries belonging to the clusters classified into each category. Note that ‘domain-merged’ and ‘domain-divided’ are counted only in DomClust and COG.

Open in new tab Download slide

Figure 8

Orthologous groups containing the E.coli ModE protein, an example of DomClust groups whose domain organizations were changed between the COG02 and COG03 datasets. These clusters were generated using the cutoff score of 80. ( A ) Domain organization of the ModE orthologs generated from the COG02 dataset. ( B ) Domain organization of the ModE orthologs generated from the COG03 dataset. The genes added to the COG03 database but not to the COG02 database are indicated by italic letters. ( C ) Hierarchical clustering tree for Group A and Group B. The root node of each group is encircled. The intraspecies paralogs appearing in both subtrees of the root node of Group B are indicated by bold green text. Here, Hin ( Haemophilus influenzae ) and Pmu ( Pasteurella multocida ) are counted only once, since they are closely related organisms. Consequently, the effective number of overlapping organisms is 2, just half of the total number of organisms in each subtree. This does not satisfy the phylogenetic-tree cutting criterion with the cutoff value of 0.5. ( D ) Hierarchical clustering tree for Group C and Group D. The horizontal bars indicate the points of tree cutting, which generated Group C and Group D. The intraspecies paralogs appearing in both subtrees of the original root node of Group D are indicated by bold green text. In this case, the effective number of overlapping organisms is 4, which exceeds the cutoff of the tree cutting.

Open in new tab Download slide

The author thanks Drs Naotake Ogasawara and Satoru Kuhara for helpful comments. This work was supported in part by Grant-in-Aids for Scientific Research from MEXT, Japan. Funding to pay the Open Access publication charges for this article was provided by Japan Science and Technology Agency.

Conflict of interest statement . None declared.

REFERENCES

Liu, J. and Rost, B.

2003

Domains, motifs and clusters in the protein universe

Curr. Opin. Chem. Biol

–11

Ouzounis, C.A., Coulson, R.M., Enright, A.J., Kunin, V., Pereira-Leal, J.B.

2003

Classification schemes for protein structure and function

Nature Rev. Genet

508

–519

Sonnhammer, E.L., Eddy, S.R., Durbin, R.

1997

Pfam: a comprehensive database of protein domain families based on seed alignments

Proteins

405

–420

Schultz, J., Milpetz, F., Bork, P., Ponting, C.P.

1998

SMART, a simple modular architecture research tool: identification of signaling domains

Proc. Natl Acad. Sci. USA

5857

–5864

Sasson, O., Vaaknin, A., Fleischer, H., Portugaly, E., Bilu, Y., Linial, N., Linial, M.

2003

ProtoNet: hierarchical classification of the protein space

Nucleic Acids Res

348

–352

Sonnhammer, E.L. and Kahn, D.

1994

Modular arrangement of proteins as inferred from analysis of homology

Protein Sci

482

–492

Gracy, J. and Argos, P.

1998

Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities

Bioinformatics

174

–187

Yona, G., Linial, N., Linial, M.

1999

ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space

Proteins

360

–378

Enright, A.J., Van Dongen, S., Ouzounis, C.A.

2002

An efficient algorithm for large-scale detection of protein families

Nucleic Acids Res

1575

–1584

Heger, A. and Holm, L.

2003

Exhaustive enumeration of protein domain families

J. Mol. Biol

328

749

–767

Enright, A.J. and Ouzounis, C.A.

2000

GeneRAGE: a robust algorithm for sequence clustering and domain detection

Bioinformatics

451

–457

Fitch, W.M.

1970

Distinguishing homologous from analogous proteins

Syst. Zool

–113

Goodman, M., Czelusniak, J., Moore, W.M., Romero-Herrera, A.E., Matsuda, G.

1979

Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences

Syst. Zool

132

–163

Page, R.D.M.

1994

Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas

Syst. Biol

–77

Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., Yeates, T.O.

1999

Assigning protein functions by comparative genome analysis: protein phylogenetic profiles

Proc. Natl Acad. Sci. USA

4285

–4288

Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., Eisenberg, D.

1999

Detecting protein function and protein–protein interactions from genome sequences

Science

285

751

–753

Enright, A.J., Iliopoulos, I., Kyrpides, N.C., Ouzounis, C.A.

1999

Protein interaction maps for complete genomes based on gene fusion events

Nature

402

–90

Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G.D., Maltsev, N.

1999

The use of gene clusters to infer functional coupling

Proc. Natl Acad. Sci. USA

2896

–2901

Eisenberg, D., Marcotte, E.M., Xenarios, I., Yeates, T.O.

2000

Protein function in the post-genomic era

Nature

405

823

–826

Galperin, M.Y. and Koonin, E.V.

2000

Who's your neighbor? New computational approaches for functional genomics.

Nat. Biotechnol

609

–613

Jim, K., Parmar, K., Singh, M., Tavazoie, S.

2004

A cross-genomic approach for systematic mapping of phenotypic traits to genes

Genome Res

109

–115

Tatusov, R.L., Koonin, E.V., Lipman, D.J.

1997

A genomic perspective on protein families

Science

278

631

–637

Tatusov, R.L., Natale, D.A., Garkavtsev, I.V., Tatusova, T.A., Shankavaram, U.T., Rao, B.S., Kiryutin, B., Galperin, M.Y., Fedorova, N.D., Koonin, E.V.

2001

The COG database: new developments in phylogenetic classification of proteins from complete genomes

Nucleic Acids Res

–28

Remm, M., Storm, C.E., Sonnhammer, E.L.

2001

Automatic clustering of orthologs and in-paralogs from pairwise species comparisons

J. Mol. Biol

314

1041

–1052

Abascal, F. and Valencia, A.

2002

Clustering of proximal sequence space for the identification of protein families

Bioinformatics

908

–921

Zmasek, C.M. and Eddy, S.R.

2002

RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs

BMC Bioinformatics

Storm, C.E. and Sonnhammer, E.L.

2002

Automated ortholog inference from phylogenetic trees and calculation of orthology reliability

Bioinformatics

–99

Cotter, P.J., Caffrey, D.R., Shields, D.C.

2002

Improved database searches for orthologous sequences by conditioning on outgroup sequences

Bioinformatics

–91

Cannon, S.B. and Young, N.D.

2003

OrthoParaMap: distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies

BMC Bioinformatics

Li, L., Stoeckert, C.J., Jr, Roos, D.S.

2003

OrthoMCL: identification of ortholog groups for eukaryotic genomes

Genome Res

2178

–2189

Storm, C.E. and Sonnhammer, E.L.

2003

Comprehensive analysis of orthologous protein domains using the HOPS database

Genome Res

2353

–2362

Uchiyama, I.

2003

MBGD: microbial genome database for comparative analysis

Nucleic Acids Res

–62

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.

1997

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

Nucleic Acids Res

3389

–3402

Smith, T.F. and Waterman, M.S.

1981

Identification of common molecular subsequences

J. Mol. Biol

147

195

–197

Jones, D.T., Taylor, W.R., Thornton, J.M.

1992

The rapid generation of mutation data matrices from protein sequences

Comput. Appl. Biosci

275

–282

Sneath, P.H.A. and Sokal, R.R.

Numerical taxonomy

1973

San Francisco, CA Freeman

Snel, B., Bork, P., Huynen, M.

2000

Genome evolution. Gene fusion versus gene fission

Trends Genet

–11

Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., et al.

2003

The COG database: an updated version includes eukaryotes

BMC Bioinformatics

McNicholas, P.M., Mazzotta, M.M., Rech, S.A., Gunsalus, R.P.

1998

Functional dissection of the molybdate-responsive transcription regulator, ModE, from Escherichia coli

J. Bacteriol

180

4638

–4643

Hall, D.R., Gourley, D.G., Leonard, G.A., Duke, E.M., Anderson, L.A., Boxer, D.H., Hunter, W.N.

1999

The high-resolution crystal structure of the molybdate-dependent transcriptional regulator (ModE) from Escherichia coli : a novel combination of domain folds

EMBO J

1435

–1446

Sonnhammer, E.L. and Koonin, E.V.

2002

Orthology, paralogy and proposed classification for paralog subtypes

Trends Genet

619

–620

Fitch, W.M.

2000

Homology a personal view on some of the problems

Trends Genet

227

–231

Heger, A. and Holm, L.

2000

Towards a covering set of protein family profiles

Prog. Biophys. Mol. Biol

321

–337

Kriventseva, E.V., Fleischmann, W., Zdobnov, E.M., Apweiler, R.

2001

CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins

Nucleic Acids Res

–36

Sasson, O., Linial, N., Linial, M.

2002

The metric space of proteins-comparative study of clustering algorithms

Bioinformatics

S14

–S21

© The Author 2006. Published by Oxford University Press. All rights reserved  The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected]

Download all slides

Month:	Total Views:
November 2016	1
January 2017	9
February 2017	11
March 2017	8
April 2017	10
May 2017	27
June 2017	8
July 2017	11
August 2017	8
September 2017	6
October 2017	4
November 2017	4
December 2017	25
January 2018	44
February 2018	47
March 2018	50
April 2018	44
May 2018	56
June 2018	32
July 2018	39
August 2018	49
September 2018	24
October 2018	13
November 2018	43
December 2018	32
January 2019	21
February 2019	21
March 2019	58
April 2019	54
May 2019	46
June 2019	40
July 2019	51
August 2019	37
September 2019	25
October 2019	15
November 2019	12
December 2019	18
January 2020	14
February 2020	13
March 2020	19
April 2020	14
May 2020	44
June 2020	25
July 2020	21
August 2020	22
September 2020	11
October 2020	16
November 2020	16
December 2020	36
January 2021	21
February 2021	14
March 2021	17
April 2021	18
May 2021	32
June 2021	20
July 2021	13
August 2021	7
September 2021	14
October 2021	22
November 2021	17
December 2021	4
January 2022	10
February 2022	12
March 2022	7
April 2022	12
May 2022	14
June 2022	18
July 2022	10
August 2022	6
September 2022	14
October 2022	18
November 2022	9
December 2022	8
January 2023	17
February 2023	4
March 2023	12
April 2023	7
May 2023	4
June 2023	6
July 2023	4
August 2023	16
September 2023	17
October 2023	11
November 2023	17
December 2023	13
January 2024	15
February 2024	16
March 2024	8
April 2024	15
May 2024	24
June 2024	12
July 2024	31
August 2024	17
September 2024	12
October 2024	32
November 2024	16
December 2024	9
January 2025	10
February 2025	17
March 2025	13
April 2025	19
May 2025	6

Article Contents

Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes

Abstract

INTRODUCTION

MATERIALS AND METHODS

Calculation of sequence similarities

Hierarchical clustering as graph contraction

Domain splitting: an overview

Some concepts: segment overlap, position mapping and alignment consistency

Determination of domain boundaries

Updating the alignment information

Phylogenetic-tree cutting

Rejoining adjacent domains

Validation test

Stability of the clustering methods

RESULTS

Evaluation of the clustering methods: the COG recovery test

Validation of domain boundaries

Clustering stability

Change in orthologous-domain organization: an example

DISCUSSION

DomClust as a tool for orthologous-domain classification: summary

DomClust as a tool for homologous-domain classification

Difficulty of the ortholog grouping problem

SUPPLEMENTARY DATA

REFERENCES

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes Open Access

Abstract

INTRODUCTION

MATERIALS AND METHODS

Calculation of sequence similarities

Hierarchical clustering as graph contraction

Domain splitting: an overview

Some concepts: segment overlap, position mapping and alignment consistency

Determination of domain boundaries

Updating the alignment information

Phylogenetic-tree cutting

Rejoining adjacent domains

Validation test

Stability of the clustering methods

RESULTS

Evaluation of the clustering methods: the COG recovery test

Validation of domain boundaries

Clustering stability

Change in orthologous-domain organization: an example

DISCUSSION

DomClust as a tool for orthologous-domain classification: summary

DomClust as a tool for homologous-domain classification

Difficulty of the ortholog grouping problem

SUPPLEMENTARY DATA

REFERENCES

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only

Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes