- Journal List
- BMC Bioinformatics
- v.7; 2006
- PMC1513609

# ROC and confusion analysis of structure comparison methods identify the main causes of divergence from manual protein classification

^{1}Chin-Hsien Tai,

^{2}Jean Garnier,

^{1,}

^{3}Jean-Francois Gibrat,

^{3}Byungkook Lee,

^{2}and Peter J Munson

^{}

^{1}

^{1}Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, DHHS, Bethesda, MD, USA

^{2}Laboratory of Molecular Biology, CCR, NCI, NIH, DHHS, Bethesda, MD, USA

^{3}Mathematique Informatique et Genome, INRA, Jouy-en-Josas, France

^{}Corresponding author.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

### Background

Current classification of protein folds are based, ultimately, on visual inspection of similarities. Previous attempts to use computerized structure comparison methods show only partial agreement with curated databases, but have failed to provide detailed statistical and structural analysis of the causes of these divergences.

### Results

We construct a map of similarities/dissimilarities among manually defined protein folds, using a score cutoff value determined by means of the Receiver Operating Characteristics curve. It identifies folds which appear to overlap or to be "confused" with each other by two distinct similarity measures. It also identifies folds which appear inhomogeneous in that they contain apparently dissimilar domains, as measured by both similarity measures. At a low (1%) false positive rate, 25 to 38% of domain pairs in the same SCOP folds do not appear similar. Our results suggest either that some of these folds are defined using criteria other than purely structural consideration or that the similarity measures used do not recognize some relevant aspects of structural similarity in certain cases. Specifically, variations of the "common core" of some folds are severe enough to defeat attempts to automatically detect structural similarity and/or to lead to false detection of similarity between domains in distinct folds. Structures in some folds vary greatly in size because they contain varying numbers of a repeating unit, while similarity scores are quite sensitive to size differences. Structures in different folds may contain similar substructures, which produce false positives. Finally, the common core within a structure may be too small relative to the entire structure, to be recognized as the basis of similarity to another.

### Conclusion

A detailed analysis of the entire available protein fold space by two automated similarity methods reveals the extent and the nature of the divergence between the automatically determined similarity/dissimilarity and the manual fold type classifications. Some of the observed divergences can probably be addressed with better structure comparison methods and better automatic, intelligent classification procedures. Others may be intrinsic to the problem, suggesting a continuous rather than discrete protein fold space.

## Background

A protein fold is often defined by the number, direction in space and connectivity (or topology) of its secondary structural elements[1] (alpha helices and beta strands). In two major fold databases, the definition of a fold is itself partially ambiguous. In SCOP[2], the definition is "same major number and direction of secondary structures with a same connectivity", without quantification of the term "major". In CATH[3], it is "overall shape and connectivity of the secondary structures", without a precise definition of "shape", although there is a degree of quantitation in this case since a structure comparison score is used to cluster domains in the same fold family. These "soft" definitions are required by the observed variations in the structures between proteins of identical biochemical function as amino acid sequence identities fall below 40%[4].

The situation is complicated by the presence of domains in protein structures. Their identification and delineation are not straightforward. Nevertheless, to have a better understanding of the effect of discrete classification as a description of the fold space, we analyzed the SCOP domain classification using two structure comparison methods applied directly to these domains. Numerous structure comparison methods exist [5-19] and some of them have been used to conduct such analyses. Shapiro & Brutlag[14], Ye & Godzik[17] and Kolodni et al[20] used the Receiver Operating Characteristic (ROC) curve[21,22] and Sierk & Pearson[23] a variant of it, mainly to compare their own method with other methods, using SCOP or CATH as the gold standard. Getz et al[24] devised an optimization algorithm to automatically classify new domains into existing SCOP folds or CATH topologies. They did not use the ROC curve, and only present the pairwise similarity score matrix. They also noted the existence of folds which are in twilight zone and difficult to classify. Hadley & Jones[25] and Day et al[26] compared 3 classifications: SCOP, CATH and FSSP ranging from completely manual to entirely automatic. They give the coverage, i.e. the percentage of pairs that are common between the 3 classifications, the percentages of pairs that are common to all 3 methods. Hadley and Jones[25] in their analysis briefly described a few examples of structural discrepancies between the automatic method FSSP and the manual and semi-manual SCOP and CATH classifications.

Here we use two structure comparison methods which are based on different principles and with which we are familiar. One, VAST[5,6], is based on only secondary structure elements in its first stage of comparison while the other, SHEBA[16], uses the amino acid sequence along with other structural properties of each residue in its initial step. We first construct the ROC curves using the SCOP fold definitions. We then generate the confusion matrix that results after setting a score cutoff value determined from the ROC curves. We analyze various aspects of this matrix to understand and extract the main properties of the fold space which cause the divergence with the automatic similarity assignment and the manual SCOP classification. Although some of the previous works[14,17,20,23-28] cover portions of what we describe here, none, as exhaustively analyzes and lists the fundamental mechanisms that produce the observed divergences.

## Results

### ROC curves

The ROC curves of each method show that both VAST and SHEBA are generally successful in detecting when two domains are in the same SCOP fold (Figure (Figure1).1). The ROC AUC (see Methods) is 0.93 for SHEBA, and 0.90 for VAST, indicating that SHEBA recognizes SCOP folds slightly better than does VAST. Also, the SHEBA ROC curve is above the VAST ROC curve at every point; there are no points of crossing, indicating that SHEBA is uniformly better than VAST at this recognition task. The ROC curve we present is actually an average of the curves obtained for each individual SCOP fold-recognition problem using a common cutoff value for all problems. For certain individual problems, VAST may dominate SHEBA or vice versa.

**ROC Curves**. ROC curves of VAST (dotted line) and SHEBA (solid line) obtained by plotting the True Positive Rate (

*TPR*, eq. 1, see Methods) against the False Positive Rate (

*FPR*, eq. 2, see Methods). Area Under the Roc Curve (AUC) for VAST is 0.90, AUC for

**...**

An optimal cutoff value for the binary decision of similarity can be determined from the ROC curve either by specifying the desired *FPR *(False Positive Rate, see Methods) or by specifying the desired *TPR *(True Positive Rate, see Methods). To reach a 1% *FPR*, the corresponding cutoff value is 2.5 for *Pcli *and 2.7 for *Zscore *(see Methods), with corresponding *TPR *values of 61.6% and 74.8% for VAST and SHEBA, respectively.

### Confusion matrix heat maps

Figure Figure22 shows the confusion matrix heat maps. An entry (*i*, *j*) of each matrix indicates the fraction of pairs of domains, one from each folds *i *and *j*, that are judged to be similar by the automatic similarity detection method using the cutoff value that produces 1% *FPR*. These maps constitute the basis for the analysis of the properties of the methods and the fold definition, conducted below. For high resolution heat maps of VAST and SHEBA, [See Additional file 1].

**Confusion matrix heat map**. Confusion matrix heat map for VAST with a

*Pcli*cutoff value of 2.5 and for SHEBA with a

*Zscore*cutoff value of 2.7. The cutoffs correspond to an overall average

*FPR*of 0.01, and result in an overall average

*TPR*of 0.616 and

**...**

Neither the VAST nor the SHEBA heat map is strictly symmetric (Figure (Figure2);2); the computed similarity measure depends on which domain is used as query and which as target. SHEBA gives a substantially more asymmetric heat map than VAST. Out of the total of *M*(M-1) *= 21,860,300 domain pairs (excluding identity pairs) on which the heat map is based, VAST and SHEBA have 11,007 and 189,551 asymmetric pairs, respectively. A domain pair similarity score is considered asymmetric if its similarity score exceeds the cutoff value in one comparison, but does not when the query and target structures are exchanged.

VAST uses an heuristic algorithm to find the maximal clique so the comparison of domain A with B may not select the same clique as the comparison of B with A when there are several near maximal cliques. The result is a slight asymmetry in the *Pcli *Score. The more noticeable asymmetry manifest by SHEBA is due to the *Zscore *computation which uses the average and the standard deviation of the distribution of *m*-scores between a fixed query domain and all other domains in the database, making the *m-score *distribution dependent on which domain, A or B, is declared the query domain. Since the average *m-score *similarity of a query domain A to the database may depend on parts of A which are not matched to B, the average similarity of B to the database might be quite different. Hence the *Zscore *becomes asymmetric.

### False negatives

The true positive rate varies with fold class, as illustrated in Figure Figure3.3. SCOP similarity detection differs widely among folds within a class and between the two methods. We now seek explanations of this variation.

**Distribution of true positive rates**. Distribution of fold specific true positive rates within each SCOP class (A to G) for VAST and SHEBA.

*TPR*

_{i }(eq. 4, see Methods) are obtained using same cutoff values as in Figure 2. The scale of the y axes for VAST

**...**

About 40% of the folds (216) achieve a fold specific true positive rate (*TPR*_{i}) above 0.85 for both methods. All classes are nearly proportionally represented in this set. For the exhaustive list of *TPR*_{i }obtained by each SCOP fold with VAST and SHEBA, [See Additional file 2].

To investigate why some domain pairs in the same SCOP fold are not detected as similar, we look at such domain pairs that belong to the same SCOP fold and for which the *Pcli *and *Zscore *values are below 1 and 1.6, respectively. These low cutoff values correspond to a *FPR *of 5%, and are chosen to exclude from consideration any borderline cases with computed similarity near but just below the original cutoffs. For all classes A to G, a total of 144 folds contain such extreme false negative domain pairs with 36 in A, 30 in B, 30 in C, 36 in D, 2 in E, 5 in F and 5 in G. The complete list of these folds is provided in Table Table11.

Detailed analysis of these false negative pairs highlights some common factors which explain the varying success of automated methods in detecting the similarity among domains in a SCOP fold. Most of the false negatives can be explained by structural variation within a fold and to a lesser extend by structures made of repeating units.

#### Structural variation of the common core

In many cases, the structure of the *common core*[29] of a fold varies significantly from one domain to the next in the same SCOP fold. We observe this phenomenon in folds across all SCOP classes. Many of the extreme false negative domain pairs described above are examples of such cases. Figure Figure44 shows the three domains, d1c5ch2 (a), d1akjd_ (b), and d1pama1 (c), from fold b.1. The similarity for pairs (a, b) and (b, c) is detected by both VAST and SHEBA, while it is not for the pair (a, c). The relative orientations between the beta sheets which form the beta sandwich, in domains (a) and (c) vary from those in domain (b). This variation is important enough, with regard to thresholds admissible by VAST and SHEBA, to make superposition of the structures (a) and (c) difficult, and to prevent a similarity detection. This results in a loss of transitivity for automatic similarity detection.

#### Structures made of repeating units

Automated similarity detection methods do not necessarily consider two structures similar if they contain the same simple structural motif but with a different number of repeats. The SCOP fold a.118 provides an extreme example. It is defined by domains that are comprised of repeated occurrences of a helix-loop-helix motif[30]. The number of occurrences of the helix-loop-helix motif varies greatly and is unspecified by the fold definition. Figure Figure55 shows three members in this fold and gives their pairwise similarity scores assigned by SHEBA and VAST. The d1qbkq_ (c) domain contains many repeats and is much larger (888 residues) than d1a17_ (a) and d1ku1a_ (b) domains (159 and 211 residues, respectively). Since both VAST and SHEBA look for global similarity, and since d1a17_ or d1kula_ would match at best only a small part of d1qbkq_, they yield the low Pcli and Zscore values. Size difference does not account for the low score between domains d1a17_ and d1ku1a_. Here, the reasons are that the helices of the repeated motifs vary in length and that the relative orientation of each motif varies between the structures. Thus, a multiple occurrence of locally similar motifs between two domains does not always produce a high global similarity score.

#### Decoration of the common core by many secondary structure elements

Occasionally two proteins in the same SCOP fold share a common core but are different in overall shape. An extreme example is shown in Figure Figure66 for the domain pair d1e9ga_ and d1enfa1 in the fold b.40. They both contain a beta barrel, but the beta barrel in domain d1e9ga_ is only a small part of its entire structure. A match between this domain and domain d1enfa1, based on the conserved common core is thus not found sufficient to consider them to be similar by the automatic pair-wise structure comparison methods.

#### Miscellaneous cases

Some folds, such as fold d.184 or a.138, are described in SCOP as including a variety of structures. We also note the existence of several ambiguous fold definitions leading necessarily to a low *TPR*_{i}. For instance, fold c.37 whose SCOP description is "3 layers: alpha/beta/alpha, parallel or mixed beta-sheets of variable sizes", can probably be split into at least 2 folds. We also spotted what appears to be a bookkeeping error by SCOP. Domains d1kkea2 and d1qiua2 of fold b.83 were not found to be similar either by VAST or SHEBA. The protein 1kke has two domains, which belong to two different folds. The N-terminal domain (residues 250–312) forms an extended structure belonging to the SCOP fold b.83 ("Triple beta-spiral"). The C-terminal domain (residues 313–455) forms a beta barrel belonging to the SCOP fold b.21 ("Virus attachment protein globular domain"). In SCOP and in the Astral database, the domain d1kkea1, which consists of the residues 250–312, is placed in the b.21 fold and d1kkea2, which consists of residues 313–455, is placed in the b.83 fold.

#### Differences between VAST and SHEBA

There are 27 folds with a *TPR*_{i }below 0.05 by VAST yet above 0.9 by SHEBA. They are a.16, a.37, a.38, a.97, a.115, a.121, a.130, a.158, a.159, b.76, c.107, d.6, d.83, d.88, d.101, d.118, d.175, f.10, f.14, f.17, g.14, g.22, g.24, g.38, g.49, g.50, g.53. These are mainly small folds with only 2 domains each. No fold has been identified with a *TPR*_{i }less than 0.05 by SHEBA but above 0.9 by VAST. Additionally, the class specific true positive rates reported in Figure Figure3,3, shows an important difference between VAST and SHEBA in the A class (0.37 for VAST and 0.74 for SHEBA).

Some of the differences observed between VAST and SHEBA are related to the calculation of the scoring function in VAST (see Appendix, Calculation of *Pcli*), and to the fact that structures sharing fewer than 3 secondary structure elements (SSEs) are often judged not significant by VAST. This latter factor also affects true positive rates computed by VAST in the A class, where the *TPR*_{i }averages only 0.2 for folds with 2,3 or 4 SSEs, but rises to 0.7 when the fold has about 9 or more SSEs (data not shown). But at least one case could not be explained by the issue of the *Pcli *calculation. Domains d1h8pa1, d1l6ja3, d1pmla_, d2hpqp_and d2pf1_1, of fold g.14 defined as a disulphide-rich fold, scored low in VAST similarity but surprisingly high by SHEBA. These domains have particularly small SSEs, distributed sparsely over the backbone of the structure. It is quite understandable that VAST which relies on the SSEs, finds low similarity among them. It was also observed that pairs for which SHEBA *Zscore *was high also had a higher level of sequence homology than those for which the SHEBA *Zscore *was low (data not shown). This indicates that SHEBA benefited by using the sequence homology in finding the initial alignment (see the Methods).

### False positives

The off-diagonal pixels in the heat maps, on Figure Figure2,2, represent fold pairs having a non-zero fold-specific false positive rate *FPR*_{i,j}. The confusion made by each method has different characteristics, shown by the difference in the distribution of the dark areas. There are a relatively small number of pixels between classes. In contrast, confusion within each class varies with the method and can be high.

The main confusion is within classes B, C and D, with respectively 37 folds out of 78 within B class, 80 folds out of 94 within class C, and 53 folds out of 139 within class D, involved in some type of confusion. VAST does not show a noticeable level of confusion within classes A, and F, although SHEBA does. The relatively high A-class confusion level for SHEBA is probably related to its use of the dynamic programming algorithm, without gap penalty, in finding the best alignment between a pair of superimposed structures[16].

Besides these global observations, more specific confusion trends can be determined by analyzing the predominant confusion patterns shown by the heat maps.

#### Intraclass confusion

Confused folds occur mainly near the diagonal of the sorted heat map, as a result of the hierarchal clustering and re-ordering of the folds within each fold class (see Methods).

Table Table22 reports a number of clusters of confused folds within classes A, B, C and D common to VAST and SHEBA.

Confused folds in the A class include helix bundles of either identical or a similar number of helices in similar relative orientations. Examples are reported in Table Table2,2, rows 1, 2 and 3. The close similarity of some domain pairs, for example d1m7ka_ and d1hs7a_ belonging to folds a.7 and a.47, respectively, indicates that these "confusions" appear to be cases wherein SCOP includes considerations other than purely structural similarity/dissimilarity.

Figure Figure77 illustrates clusters of confused folds within class B, which are found in both the VAST and SHEBA heat maps. There are two large clusters of confused folds in Figure Figure7.7. The darkest area covers the five beta-propeller folds (Table (Table2,2, row 4), with each fold containing different number of blades ranging from 4 to 8. These tend to be highly confused with each other, more by SHEBA than by VAST. Figure Figure88 shows domains d1gyha_ (a) and d1loqa2 (b) from folds b.67 and b.69 respectively. Since the 7 bladed beta-propeller domain can have up to 5 blades common with the 5 bladed beta-propeller domain, pairs of domains from these separate folds tend to have a high similarity scores.

**Confusion matrix for the B class**. Confusion matrix heat map for VAST and SHEBA showing confusion among some SCOP folds of the class B, mainly beta domains. Fold identifiers appear on the x and y axis. Grey scale from white to black for positive rates

**...**

**Similar structures in different SCOP folds**. Structures of domains (a) d1gyha_ of fold b.67 and (b) d1loqa2 of fold b.69, with 318 residues and 295 residues respectively. They correspond to beta propeller domains with respectively 5 and 7 four-stranded

**...**

The next cluster of five folds in Figure Figure77 (Table (Table2,2, row 5), includes all beta sandwich immunoglobulin-like folds, with 7, 8 or 9 strands in 2 sheets with a Greek-key topology. Their confusion is caused by the sharing of the motif of the beta sandwich of the common core. Others confused sets of folds in the B class also involve mainly beta sandwich folds (Table (Table2,2, rows 6, 9, and 12), or beta barrel folds (Table 2, rows 7, 8, 10 and 11). The confusion among domains of these clusters of folds is similarly caused by a common beta sandwich or beta barrel motif. In the B class, where folds defined by the beta barrel or the beta sandwich motifs are frequent, confusion among folds of either motif is frequent as well.

A large common confusion pattern among folds appears at the bottom right corner of the C class area of the heat map (Figure (Figure2).2). A highly confused set, Table Table22 row 17, from this large confused area consists of 3 layer alpha/beta/alpha folds with parallel beta sheets for some, and mixed beta sheets for other, and with 4, 5, 6 or 7 strands. The superposition of the domains d1a8p_2 (a) and d1a9xa2 (b), from folds c.25 and c.24 respectively (Figure (Figure9),9), illustrates that confusion is caused by the presence of a common sub-structure. Examination of other confused folds from this large confused set within the C class in Table Table2,2, rows 15, and 16, shows that they also involve 3-layer alpha/beta/alpha folds that share sub-structures of varying sizes that are similar. The large number of confusions in the C class can be attributed to the abundance of the 3-layer alpha/beta/alpha folding pattern, which get confused by a similar mechanism.

**Superposition of two structures**. Superposition by VAST of two structures from different 3 layers alpha/beta/alpha SCOP folds of class C. View of backbones of domains (a) d1a8p_2 and (b) d1a9xa2, from folds c.25 and c.24, respectively. The common parts

**...**

The C class also shows some small confused sets among folds with different architectures. For example, confused folds c.1 and c.6 (Table (Table2,2, row 13) correspond respectively to TIM beta-alpha barrel and variants having 7 strands or less. Confused folds c.8 and c.98 (Table (Table2,2, row 14) are described respectively as "3 layers: beta/beta/alpha; the central sheet is parallel, and the other one is anti-parallel" and "core: 3 layers, alpha/beta/alpha; parallel beta-sheet of 4 strands". The confusion occurs due to a common beta/alpha sub-structure.

Figure Figure1010 shows several clusters of varying size, common to VAST and SHEBA within the D class. They correspond to the sets of confused folds reported in Table Table2,2, row 23, 24, 25, 26, 27. Folds in these sets share a common core structure consisting of a 2-layer alpha/beta sandwich. An analysis of the clusters reported in Table Table22 for the D class shows that most of the confused folds are mainly variations of the 2-layer alpha/beta sandwich structures.

**Confusion matrix heat map for the D class**. Confusion matrix heat map for the D class for VAST and SHEBA showing clusters of confused SCOP folds. The fold identifiers appear on the x and y axis. Grey scale from white to black for positive rates from 0

**...**

We have noticed confusions involving distinct motifs such as between the beta sandwich fold b.4 and beta barrel fold b.107, (Table (Table2,2, row 9). Beta sandwich and beta barrel motif folds are generally well separated, but false positives due to proximity of some extreme members of these respective folds can happen. Figure Figure1111 reports a false positive between domains from the beta sandwich fold b.1 and beta barrel folds b.43, which are well distinguished on average. Domain d1pama1 (b) from fold b.1 is confused with domain d1ep3b1 (c) from fold b.43. Such confusion is caused by structural variation of the common core of the beta sandwich fold b.1 and barrel fold b.43 represented by prototypical domains d1tvda_ (a) and d1d2ea1 (d), respectively. Deformation causes the relative orientation between the beta sheets of domain (b) to become more similar to that of the barrel domain (c).

#### Interclass confusion

Finally, the heat maps also show off-diagonal grey or black pixels where members of a SCOP fold in one class are detected as similar to domains in another. Both heat maps present such confusion patterns. As apparent in Figure Figure2,2, the confusion between classes is very low for both methods. Nevertheless, it is still detectable between some classes, in particular, between classes B and D, C and D, C and E, and between D and G. For VAST, there is no additional noticeable confusion between classes. However SHEBA shows additional minor confusion of the class A with the classes B, C, D, F, and G, and between classes D and F.

The confusions involving the E-class ("Folds consisting of two or more domains belonging to different classes") are easily understandable. They all involve structures which contain a domain which shares similarity with another domain in a different class, mainly class C. Examples include fold e.24 confused with c.16, c.57, c.44, c.23 and c.5, (Table (Table2,2, row 28), and fold e.4 confused with folds c.48, c.2, c.32, c.33, c.34 and c.23, (Table (Table2,2, row 29).

Additionally, SHEBA confuses some folds from class A, with folds in classes D and F ("membrane proteins"). The most confused folds from the A and D classes, having more than 100 confused domain pairs, are: (a.118, d.211: 250 confused pairs), (a.60, d.58: 132), (a.1, d.58: 118), (a.77, d.58: 114), (a.6, d.58: 104), (a.4, d.95: 104). For confused folds a.118 and d.211, for example, even though VAST and SHEBA match a similar number of residues, the Sheba *Zscore *tends to be high while the VAST *Pcli *is below the cutoff value. A similar trend is observed between the A and F classes. The way the *Zscore *is computed has the tendency to increase the confusion, by over-emphasizing the significance of the match, compared to the number of matched residues when helices are matched.

## Discussion

The combined use of the ROC curve and the confusion matrix heat map has been the key in making this large scale analysis of protein classification. Several authors[14,17,20,23] have used the ROC curve to evaluate structure comparison methods using the CATH or SCOP protein classification database as the reference. In the most recent and comprehensive study, Kolodny *et al*[20] compared six different methods and found the highest true positive rate to be 50%, at 1% false positive rate, attained by the DALI method using the native DALI score and CATH as the reference. Our ROC analysis finds a true positive rate of 61.6% and 74.8% at 1% false positive rate, for the comparisons of VAST and SHEBA to SCOP, respectively. The differences between their result and ours might be explained by differences between the comparison methods (DALI, VAST, SHEBA), by differences between the definitions used for the false and true positive rates (they do not give explicit equations), and/or by the use of different databases of protein structures (CATH vs. SCOP). In particular, CATH groups domains into different numbers of folds than does SCOP, as noted by Hadley & Jones[25] and Day *et al*[26].

Aside from providing a global measure of the agreement, ROC curves are also useful because they provide a practical means to select a score cutoff value for deciding if a pair of structures is to be considered similar or not, by trading off true and false positive rates. Other approaches have used methods other than ROC analysis or have ignored that tradeoff entirely. In their comparison of several structure comparison methods with CATH, Sierk and Pearson[23] selected a decision level corresponding to the first 100 errors made by the program. Other approaches [24-28] do not use the ROC curve and often fail to properly acknowledge the obligatory trade off between false and true positive rates, making it difficult to compare the reported degree of agreement with others.

Although the ROC AUC varies somewhat by method, none of the reported values are high as desired. This raises a fundamental and important question: What mechanisms cause the automatic structural comparison methods to diverge so significantly from SCOP or CATH? To address this aspect of the problem, we need to descend from a global view of the database to a more detailed view of individual folds and finally of the domains comprising each fold. To investigate why structural comparison methods diverge from SCOP, we used the confusion matrix to distribute the 1% false positive comparisons to the individual fold pairs, resulting in a "false and true positive rates" map of the protein fold space. This can be distinguished from the map of the fold space constructed by Hou *et al.*[31,32] who applied multi-dimensional scaling to pair-wise similarity scores. The exploration of the fold space, guided by our map, leads directly and objectively to the areas or subsets of folds where divergence with structural comparison methods is most evident. In particular, it has allowed us to move from the areas of high false positive or negative rates to the corresponding properties of the fold space. False negative rates are seen to relate directly to the issues of core variation and repeated sub-structures within a fold, while false positive rates are linked to the sharing of a common sub-structure between folds. Since the mathematical quantities *FPR *and *TPR *are interdependent, so are the corresponding properties of the folds space.

In looking at a particular area of our heat map, we can calculate an index of how likely a method is to confuse those folds, as the ratio of the average of fold-specific false positive rates to the average fold-specific true positive rate in that area. A value near 1 indicates that the folds in this area cannot be distinguished by the structure comparison method, on the average. It is worth noting that this index is cutoff dependent, as expressed in terms of true and false positive rates, and can thus be obtained for more or less severe false positive rates. The index of confusion is related but distinct from the index of "gregariousness" in Harrison *et al.*[13] for the CATH folds (topology level), which is a property of a fold that measures the number of other folds that are similar to it as judged by comparing the score to that of an empirically established standard score distribution at a certain cutoff level. The substantial number of highly confused sets of folds listed in Table Table22 allows us to examine in detail the source of the discrepancy between SCOP and our structure comparison methods.

### Causes of false negatives and false positives

In the Results section we presented several examples of false negative and false positive cases related in one way or another to the common core. SCOP defines the common core of domains in the same fold to have the "same secondary structure elements in the same arrangement with the same topological connections" (Brenner et al[29]), leaving open the possibility for some variation such as differences in length, relative orientations and/or number of the SSEs which we call variation of the common core.

Variation of the common core of domains within a fold, considered insignificant by SCOP, may still be large enough to cause VAST and SHEBA to find the domains dissimilar, giving rise to false negatives as in Figures Figures44 and and5.5. False negatives may also occur when the common core is so small compared to the whole structure that the overall structural similarity is unrecognizable, as in Figure Figure6.6. The evidence of structural variations of the common core of proteins within the same fold was shown in the work by Chothia & Lesk[4]. When the percentage of sequence identity between domains decreases much below 40%, their common cores tend to diverge structurally. The analysis of the confusion matrix shows that some false negatives for folds reported in Table Table11 arise from such core structure variations.

When two domains share an apparent common core, but SCOP judges the core elements to be significantly different, SCOP places the domains in distinct folds. However, the automatic methods may find the domains similar, as in Figure Figure99 and and11,11, giving rise to false positives. Also, conversely to the case in Figure Figure5,5, when the repeats of a common motif are organized in a regular fashion in a domain, our methods may consider the domains similar, but SCOP may place them in distinct folds (see Figure Figure8).8). Table Table22 enumerates a number of false positive cases arising from closely related common cores in distinct SCOP folds.

VAST and SHEBA decide on the similarity on the basis of the largest fraction of matching secondary structural elements or residues. However, visual inspection may allow the overall context of the matching and mismatching parts to play a role. If only a small part matches, but the matching part appears to be the core of each structure, then the match may appear more meaningful. If the number of repeats in a structure appears to be an important property of the structure, structures with different numbers of repeats may be placed in different folds. If, on the other hand, the precise number of repeats is not important for a structure, structures with different numbers of repeats are all placed in the same fold. If almost all parts match, but some important part, perhaps one critical beta-strand or even an irregular loop, is missing or placed differently in one structure, it may be placed in a different fold, etc.

It is possible that the problem is rooted in part, in the way structural alignment is currently conceived. Analogous to sequence alignment methodology, structural alignment maximizes the match between two structures, at the residue or secondary structure level, to infer a similarity relationship. On the other hand, the concept of similarity implicitly defined by SCOP, is focused on the sharing of higher level (above SSEs) motifs. This is in contrast to similarity measures based on the residue or SSE-level matches as defined by many structure comparison methods. We have shown examples (beta propellers, or alpha-solenoids) where occurrence of a motif is more appropriate for inferring similarity than is the maximum residue or SSE-level structural match. Although not evaluated directly here, we suspect that the structural comparison methods agree with SCOP when these two concepts agree, i.e. when the motif in question coincides with the maximum residue or SSE-level structural match, but disagree otherwise. Automatic structural similarity measures might thus be improved either by incorporating higher level structural motifs such as barrels or sheets, rather than remaining at the level of residues, strands or helices, or by weighting matching residues according to their structural context or functional importance.

Problems encountered by structural comparison methods might also be a reflection of intrinsic properties of the protein fold space. We have reported examples which tend to support the idea of structural drift [33], i.e. a series of gradual steps which connect one fold with another, and showed areas where folds were highly confused. In such sets of folds, some structures within the same fold are too dissimilar to be detectable by structural comparison methods, while those in different folds are not always completely distinct. This raises questions about the fold definition. We have observed, for example, that distinction between beta barrel and two layer beta-sandwich domains can be surprisingly difficult. As the relative orientations of the strands in the two beta sheets in a barrel departs from orthogonality, and become more parallel, the distinction between barrel and two layer beta sandwich motifs becomes fuzzy. Drawing the proper separations within a set of domains in which such phenomenon is observed is not obvious and necessarily introduces some arbitrariness. Should such diverse folds be sub-divided into two, three or more folds? If this decision is taken at some point in time, with the then available structures, how stable and universal will this distinction remain over time? VAST and SHEBA are generally well able to a major part reproduce the fold classification of SCOP, consistent with the notion that protein folds are well-defined, discrete entities. However, despite many attempts, SCOP folds or CATH topologies continue to elude precise quantitative or computational definition. We suggest therefore, that for some parts of the fold space, folds are not well separated entities but more nearly a continuum of structural arrangements as also observed in [1,3,34-36], with some regions more populous than others. Here, apparent "folds" may arise as much from density fluctuations in regions where experimentally determined structures are sparse, as from thermodynamic stability wells which would partition the fold space. We speculate that the idea of continuum will become more apparent as a larger number of new structures are solved by structural genomics projects[31]. In any case, the classification of structures into folds is probably a valuable and practical way of describing the fold space. When the fold space is continuous, this necessitates some arbitrary classification decisions, which may in fact not be completely reproducible by any automated approach.

## Conclusion

The results of this comprehensive comparison of VAST and SHEBA with the SCOP classification demonstrate that these two methods in their present form can reproduce at best 75% of the SCOP fold classification (for 1% false positive rate). Our detailed study of over 20 million pairs of protein domains underlines the difficulties encountered by automatic methods analyzing a classification of protein structures. A major difficulty arises from structural variation, which naturally accompanies amino acid sequence divergence, within the core of a defined fold. When severe enough, this can produce false negatives. When common cores of different folds are too similar, false positives result. Another, though less common, difficulty also arises when a motif is repeated several times within a single domain and in variable numbers. When the defining "common core" corresponds to only a small part of a whole structure, when the core is decorated extensively, automatic recognition of its similarity to other fold members becomes difficult. These divergences suggest a continuous rather than a discrete protein fold space, further complicating the problem of automatic classification. Clearly, improved algorithms of comparison must be developed and/or other types of classifications must be considered, and will be considered in future work.

## Methods

### Structural comparison methods

VAST is a method to superimpose and compare protein 3D structures. It consists of a two stage procedure. The first stage is based on a high-level description of protein structures. Secondary Structure Elements (SSEs) are represented by vectors and an algorithm based on a maximum clique search which finds the best one-to-one correspondence of a set of vectors in a query structure to a set of vectors in a target structure. Special care is paid to the significance of the one-to-one correspondence found between the two 3D structures. The method calculates the probability of generating a similar one-to-one correspondence by chance, and then correspondences are ranked and selected according to the value of this probability. Results of the first stage are used as seeds for the second stage.

In the second stage proteins are described using the alpha carbons (CAs) of the residues. The algorithm, based on a Gibbs-Monte Carlo procedure, tries to extend alignment of the initial seed to CAs belonging to the connecting loops. Usually, one wants to find the alignment that includes the maximum number of CAs yet with the smallest root mean square deviation (RMSD) value possible. Unfortunately, there is a correlation between the number of CAs included in the alignment and the value of RMSD: the larger the number of residues the higher the resulting RMSD value. The algorithm, in this second stage intended to solve this problem by answering questions such as: which alignment, one having 60 superimposed residues with a RMSD of 2.0, or one having 80 superimposed residues with a RMSD of 2.5 is the best one? This question is settled by choosing the alignment least likely to occur by chance, based on a *Z*-score calculation with respect to random distributions of the RMSD (see Appendix for more details). Please note that the VAST program we use is homologous to the version that can be downloaded at NCBI[37] (they both descent from a common ancestor[5,6]). The original VAST source code includes S+ subroutines. In this version, these subroutines were re-implemented in C language and regular PDB files can be used as input. It can be downloaded at VAST INRA server [38]. No other changes were made from the original version.

SHEBA is a protein structure comparison program which performs pairwise protein structure alignment in two steps. The initial alignment is made by maximizing the weighted sum of scores for the sequence homology, secondary structural similarity, and the similarity of the environment profile. The environment profile includes the solvent accessibility and polarity of the atoms around a given residue. The alignment is then iteratively refined in the second step, in which a new alignment is obtained from the three-dimensionally superimposed structures based on the current alignment, using a dynamic programming procedure that maximizes the number of residue pairs for which the CAs distance is less than 3.5 Å.

For each pair of proteins compared, SHEBA computes the *m *score defined as the number of matched residues divided by the mean length of the two protein domains. When one protein is compared to each protein in a database of target proteins, SHEBA also computes the *Zscore *computed from the score *m *as

$Zscore=\frac{m-<m>}{\sigma \left(m\right)}$

where <*m*> and σ*(m) *are the mean and the standard deviation of the scores *m *between the same query domain and all other target domains in the database. SHEBA source code can be downloaded at SHEBA server[39].

### Analysis of structure comparison methods

We consider a set of protein structural domains, *D*, and a collection of *N *folds {*F*_{i}} of the SCOP classification. Structural similarity of a query domain *q *to a target domain *t *is declared when *q *and *t *are members of the same SCOP fold. In other words, given a query *q*, we say that *q *is *similar to *(or *SCOP-similar to*) another domain *t *if and only if *q *∈ *F*_{i }and *t *∈ *F*_{i}, for some SCOP fold *F*_{i}. Under this definition, structural similarity is an all-or-none phenomenon, as judged by SCOP, used as the reference.

Structure comparison methods are said to detect the structural similarity between a query domain *q *and a target domain *t *when the value of the computed similarity score *S*(*q*,*t*) (*Pcli *for VAST and *Zscore *for SHEBA) is above a pre-specified cutoff value. Formally, domain *q *is *detected as structurally similar *to *t *if and only if *S*(*q*, *t*) ≥ *c*, for some fixed cutoff value *c*.

We proceed as follow. First, the similarity scores *S*(*q*,*t*) for every *q *∈ *D *and every *t *∈ *D*, are calculated by VAST and SHEBA. Then, the overall accuracy of detection of structural similarity, compared to SCOP fold similarity is evaluated using ROC methodology, for structure comparison methods VAST and SHEBA, excluding similarity scores for *q*≡*t*. Finally, divergences between structural similarities measured by either structure comparison methods, and the SCOP classification are investigated using a heat map representation of the confusion matrix.

### ROC analysis

The four possible outcomes for a particular domain *q *evaluated against a particular target domain *t *are summarized in the Table Table33.

The True Positive Rate, *TPR(c)*, the overall rate that a domain *q *is correctly detected to be similar to another domain is calculated first by comparing that domain to all other domains in its fold, then averaging this rate over all domains in the dataset *D*. Formally,

$TPR\left(c\right)=\frac{1}{M}{\displaystyle \sum _{i=1}^{N}{\displaystyle \sum _{q\in {F}_{i}}\frac{{\displaystyle \sum _{t\in {F}_{i},t\ne q}\text{I}\left(S\left(q,t\right)\ge c\right)}}{\left({n}_{i}-1\right)}}}\text{}\left(1\right)$

where *M *is the total number of structural domains in the dataset *D*, *N *is the number of folds (we consider only folds with *n*_{i }>1), *n*_{i }is the number of domains within a fold *F*_{i}, and I(≺) is the indicator function, i.e. I(TRUE) = 1 and I(FALSE) = 0. *TPR(c) *is also the sensitivity of the method for SCOP.

Likewise, the False Positive Rate, the rate at which a domain *q *is falsely detected to be similar to another domain, is

$FPR\left(c\right)=\frac{1}{M}{\displaystyle \sum _{i=1}^{N}{\displaystyle \sum _{q\in {F}_{i}}\frac{{\displaystyle \sum _{t\notin {F}_{i}}\text{I}\left(S\left(q,t\right)\ge c\right)}}{M-{n}_{i}}}}\text{}\left(2\right)$

The specificity of the method for SCOP is [1 - *FPR*(*c*)].

### The ROC curve and the area under the ROC curve

The ROC curve is obtained by plotting the True Positive Rate *TPR*(*c*) against the False Positive Rate *FPR*(*c*), for the entire range of possible cutoff values, *c*. On this plot, the line through the origin with slope 1 would correspond to the performance of a similarity detection based on a random similarity score. A method which detects SCOP similarity better than randomly must show a ROC curve situated above this diagonal. The overall performance of either VAST or SHEBA in detecting SCOP fold similarity can be measured by the area under the ROC curve (*AUC*)[21,22], where a perfect detection method would yield *AUC *= 1 and a random prediction *AUC *= 0.5. The area is estimated using the trapezoid integration rule. Strictly speaking, we are presenting an average of the ROC curves for each individual fold-recognition problem, where the average is taken with the cutoff value, *c*, in common.

### Confusion matrix heat map

The performance of a similarity detection method can be studied within specific folds or fold pairs. Thus, we define a fold specific false positive rate between two different folds, *FPR*_{i,j}(*c*) as the rate at which query domains in *F*_{i }are detected to be similar to target domains in *F*_{j}. We estimate this rate from our data as

$FP{R}_{i,j}\left(c\right)={\displaystyle \sum _{q\in {F}_{i}}{\displaystyle \sum _{t\in {F}_{j}}I\left(S\left(q,t\right)\ge c\right)}}/{n}_{i}{n}_{j}\text{}\left(3\right)$

We see this as a confusion in the similarity detection.

When *i *= *j*, we define the fold-specific true positive rate for domains in the same fold *F*_{i}, estimated as

$TP{R}_{i}\left(c\right)={\displaystyle \sum _{q\in {F}_{i}}{\displaystyle \sum _{t\in {F}_{i},t\ne q}I\left(S\left(q,t\right)\ge c\right)}}/{n}_{i}\left({n}_{i}-1\right)\text{}\left(4\right)$

The confusion matrix, defined by *TPR*_{i}(*c*) for *i *= *j *and by *FPR*_{i,j}(*c*) otherwise, for a particular value of *c*, can be visualized graphically as a heat map, with values of the matrix coded in grey scale (1 = black, 0 = white). To emphasize underlying patterns of confusion amongst the folds, the order of rows and columns (corresponding to the folds of SCOP) is permuted within fold class using hierarchical cluster analysis of the columns of the VAST heat map for class C and of the SHEBA heat map for the other classes. The hierarchical clustering was based on correlation between columns, and used Ward's method, as implemented in the Matlab Statistics toolbox (The Mathworks, Natick, MA). The same ordering of folds was then applied to the rows and columns of the confusion matrix heat map for both methods. Visually, the main diagonal of the heat map shows the agreement between SCOP and the structure comparison/similarity method, while the off-diagonal shows areas of confusion made by each structure comparison method in detecting SCOP fold similarity. The overall, global properties based on the entire set of SCOP folds may be appreciated by viewing the entire heat map, while particular properties of subsets of folds may be viewed by zooming into particular areas. Sets of confused folds may be quickly recognized and identified in this manner.

The *TPR*(*c*)*(eq. 1) *can be computed as the average of the *TPR*_{i}*(c) *weighted by fold size, $\frac{1}{M}{\displaystyle \sum _{i=1}^{N}{n}_{i}TP{R}_{i}\left(c\right)}$. Likewise the *FPR*(*c*)*(eq. 2) *can be computed as the fold size-weighted average of the fold specific false positive rate, $\frac{1}{M}{\displaystyle \sum _{i=1}^{N}\frac{{n}_{i}}{M-{n}_{i}}}\left({\displaystyle \sum _{j=1;j\ne i}^{N}{n}_{j}FP{R}_{i,j}\left(c\right)}\right)$.

The true positive rate averaged over a subset of folds, *S*, is defined by

$TP{R}_{S}\left(c\right)=\frac{1}{{M}_{S}}{\displaystyle \sum _{i\in S}{n}_{i}\cdot TP{R}_{i},}$

where *M*_{s }is the total number of domains represented in set *S*. Likewise, the false positive rate averaged over a subset of folds is defined

$FP{R}_{S}\left(c\right)=\frac{1}{{M}_{S}}{\displaystyle \sum _{i\in S}\frac{{n}_{i}}{{M}_{S}-{n}_{i}}}\left({\displaystyle \sum _{j\in S;j\ne i}{n}_{j}FP{R}_{i,j}\left(c\right)}\right).$

A *fold confusion index *may be defined as the ratio *FPR*_{S}/*TPR*_{S }for a set *S *of folds. When the index has a value near 1.0, it means that domains in the same fold are no more distinguishable than domains in different folds, using that particular cutoff value.

An alternate, but straightforward definition of the *TPR(c) *could be obtained by counting all correctly detected similar-appearing pairs divided by the total number of pairs in the same SCOP folds. This alternate definition would in fact weight the fold-specific *TPR*_{i}*(c) (eq. 4) *according to the square of the fold size, thus over-weighting the domains in large folds compared to those in small folds. Our preferred definition, given in the equation above, weights each domain, not each domain pair, equally. The confusion matrix computed for this study and the corresponding MATLAB code is available at MSCL server [40].

### Datasets

The set of SCOP domains considered here are drawn from ASTRAL[41] version 1.63, with less that 40% pairwise sequence identity. The total number of domains in that sample is 4948, classified by SCOP into 740 folds. As more than one domain is required to evaluate the fold-specific *TPR*_{i}, we study only the reduced data set of 468 folds containing 2 or more domains, which together contain *M *= 4676 domains. The sum of the squares of their respective content $\sum _{i}{n}_{i}^{2}$ is 226,900. The folds fall into classes A (all alpha helix proteins), B (all beta sheet proteins), C (alpha and beta proteins, alpha/beta), D(alpha and beta proteins, alpha+beta), E(multi-domain proteins, alpha and beta), F(membrane and cell surface proteins and peptides) and G(small proteins) with 97, 78, 94, 139, 18, 17 and 25 folds each, respectively. The classes hold 844, 1091, 1330, 1070, 80, 67 and 194 domains each respectively.

All domain pairs drawn from the reduced dataset were compared by both VAST and SHEBA, corresponding to a total number of pairs of *M**(*M *- 1) = 21,860,300, excluding identity pairs. The calculations were made using the high-performance computational capabilities of the Biowulf PC/Linux cluster at the National Institutes of Health, Bethesda, Md.[42]. Data resulting from the computations were used to produce two large matrices containing the *Zscore*, the *Pcli *score, respectively. For pairs of domains for which VAST could not assign a quantitative measure of similarity the *Pcli *value was arbitrarily set to -10.

## Abbreviations

AUC Area Under the ROC Curve

CA Carbon Alpha

CATH Hierarchical classification of protein domain structures, which clusters proteins at four major levels, Class(**C**), Architecture(**A**), Topology(**T**) and Homologous superfamily (**H**).

DALI Distance mAtrix aLIgnment

FPR False Positive Rate

NCBI National Center for Biotechnology Information

PDB Protein Data Bank

RMSD Root Mean Square Deviation

ROC Receiver Operating Characteristic

SCOP Structural Classification of Proteins

SHEBA Structural Homology by Environment-Based Alignment

SSE Secondary Structure Element

TPR True Positive Rate

VAST Vector Alignment Search Tool

## Authors' contributions

VS, PM – execution of pairwise comparisons using VAST on Biowulf computer, application of ROC methodology, development of confusion matrix heat maps, statistical analysis JFG, JG – development of VAST program, interpretation of confused structure pairs CHT, BKL – development of SHEBA program, and pairwise comparisons, 3D structure visualization, interpretation of confused structure pairs

## Appendix. VAST statistics: calculation of Pcli

In the first stage of VAST we consider a "high" level description of proteins. Proteins are represented by their secondary structure elements (SSEs), more specifically by the endpoints of vectors going through these SSEs. The basic task of the algorithm is to find the best 3D common substructure.

A 3D common substructure is formally defined as a one-to-one correspondence between a subset of SSE vectors in the first protein and a subset of the SSE vectors in the second protein. This correspondence respects the type of SSE (i.e., helices are only paired with helices and strands with strands) and the topology. A correspondence {(*i*,*k*),(*j*,*l*)} of SSEs *i *and *j *in the first protein with SSEs *j *and *k *in the second, is said to respect the topology when *i*<*j *implies *k*<*l*. For instance, if SSEs 1 and 2 in the first protein and SSEs 4 and 7 in the second protein are paired 1–4 and 2–7, the correspondence respects the topology. The correspondence 1–7 and 2–4 does not.

### Computation of the score for a common 3D substructure

The problem of searching for 3D common substructures is next transformed into a graph theory problem. A "comparison" graph is formed whose vertices are made of pairs of vectors, one from each protein to be compared. Two such vertices are connected by an edge if the two vectors in the first protein have the same relative orientation and spacing, within some tolerance, as the two vectors in the second. Each edge is labeled by a score *s*, reflecting the quality of the superimposition of the 2 pairs of vectors. Finding the best 3D common substructure is solved by finding the clique (i.e., a subgraph for which each vertex is linked with all others) with the best overall score. The overall score is defined as a normalized sum of scores for all the edges within the clique. The *n *vertices of a clique (an *n*-clique) are labeled by *i *and *j*, and the overall score is defined:

$score\left(n-clique\right)=\frac{2}{n}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j>1}^{n}{S}_{ij}}}\text{}\left(1\right)$

### Computation of the score for a 2-clique

The 2-clique score, *s*, is computed from the RMSD of the superimposition of the two corresponding pairs of vectors. The RMSD is normalized to an observed distribution of such values obtained from a large sample of random pairings of 2 SSEs from pairs of proteins drawn from a representative set of proteins. This empirical distribution represents the behavior of random pairing of two secondary structures, when there is presumably no overall structural similarity between the two proteins. By definition, we set *s *= -log_{10 }(*P*) where *P *is the empirical cumulative distribution function (cdf) value associated with the particular RMSD value. For instance if the RMSD is found to be 5.8 Å when the 2 pairs of vectors are superimposed, the probability *P *that the RMSD is less than or equal to 5.8 can be read off the curve as, say, *P *= 0.2. The score is then defined as minus the log of this probability: *s *= -log_{10}(.2) = 0.7. Therefore the smaller the RMSD between the 2 pairs of SSEs the larger the resulting score.

### Computing the probability distribution for the best n-clique score

In the previous section, *P *represented the cdf value of the RMSD of a random 2-clique. Accordingly, *P *has a uniform distribution and its score (-log_{10}(*P*)/2.303) has a negative exponential distribution, after dividing by the natural logarithm of 10, ln(10) ≈ 2.303. Rewriting in terms of natural logarithms, we have *score*(*n *- *clique*) = $\frac{2}{2.303n}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j>i}^{n}-\mathrm{ln}\left({P}_{ij}\right)}}$. Since this is a sum of *n*(*n *- 1)/2 independent exponential variates, multiplied by the factor 2/2.303*n*, it follows a classical Gamma distribution with parameters α = *n*(*n *- 1)/2 and β = 2/2.303*n*.

For example, assume that we found a common 3D substructure between 2 proteins, and it is a 6-clique with a score of 9.6. In order to determine the significance of this score one must compare it with a distribution of scores for randomly generated 6-cliques. The mean of the 6-clique score distribution is given by α·β = (6-1)/2.303 ≈ 2.171, 5 times larger than the mean of a 2-clique score, 0.434. As *n *grows, the distributions become broader and more symmetric. Having calculated the score probability density distribution for a 6-clique, it is then easy to estimate the significance of the score obtained for our 6-clique found while comparing the 2 proteins. The corresponding cumulative probability distribution is calculated and used to compute the probability that a random score is = 9.6. This probability is written Q(*s*,*n*) = 1 - P(*s*,*n*) where *s *and *n *refer respectively to the score and the number of elements of the clique.

### Number of n-cliques that can be generated with a particular pair of proteins

For the sake of simplicity, we consider proteins having only one type of SSE, for instance, helices (when both proteins contain helices and strands, the problem of estimating *C*(*n*, *N*1, *N*2) can be formulated as a substring matching problem, for which a fast recursive algorithm exists). If we compare 2 proteins having 6 helices and we find that the best clique has 6 elements, there is only one possibility of generating this 6-clique. On the other hand, if both proteins have 12 helices, the number of 6-cliques that can be generated is given by $\left(\begin{array}{c}12\\ 6\end{array}\right)\left(\begin{array}{c}12\\ 6\end{array}\right)={\left(\frac{12!}{6!6!}\right)}^{2}$ = 853776 cliques. It is thus much more probable to observe, just by chance, a 6-clique with a score of 9.6 with 2 proteins having 12 helices rather than with 2 proteins having 6 helices. Let us call C(*n*, *N1*, *N2*) the number of n-cliques that can be generated with a pair of proteins having respectively N1 and N2 SSEs.

### Calculation of Pcli for the best clique

Because C(*n*, *N1*, *N2*) independent *n*-cliques can be generated, in the best *n*-clique for randomly paired proteins, we may observe a score larger than one would expect from the empirical distribution. The corrected significance, termed EPcli, is the probability that any random score in C(*n*, *N1*, *N2*) trials would exceed our observed score *s*. The probability of finding one value *s* *higher than observed value *s*, by chance alone is

$\begin{array}{l}EPcli=\mathrm{Pr}\left\{s*\ge s\text{foranytrial}\right\}\hfill \\ =1-{\left[1-Q\left(s,n\right)\right]}^{C\left(n,N1,N2\right)}\hfill \\ =1-\left[1-C\left(n,N1,N2\right)\cdot Q\left(s,n\right)+\text{higherorderterms}\right]\hfill \\ \approx C\left(n,N1,N2\right)\cdot Q\left(s,n\right)\hfill \end{array}\text{}\left(2\right)$

This approximation is valid when *C*(*n*, *N*1, *N*2)·*Q*(*s*, *n*)<<1. With the above definition of EPcli, notice that the smaller the value of EPcli the more significant the clique. Finally, we define

Pcli = -log_{10}(EPcli).

### Remark on the calculation of Pcli

Two types of problems occur. The first one is related to the number of elements of the clique with respect to the number of secondary structure elements (SSEs) found in the 2 domains being compared. To illustrate let us consider the 7-element clique that is generated when comparing domains d1amx_ and d1h6fa_ (fold b.2) having 14 and 19 SSEs, respectively. The score of this 7-clique, calculated according to Eq. 1 is 8.6. The probability of generating a 7-clique having a score *s *= 8.6 is given by P(*s*, *n*)= 10^{-5.3}. The number of 7-cliques that can be generated with the above two domains is C(*7, 14, 19*) = 10^{+7}. This leads to an approximation of EPcli > 1, and hence *Pcli*<0. The match found between the 2 domains is thus not significant. The second problem occurs with small proteins having few (3 or 4) SSEs. As shown in Eq. 1 the score of a clique depends on the number of elements. Small cliques will have relatively small scores and will appear more likely to have been generated by chance than large scores. This is analogous to the problem of detecting similarities for small peptides with sequence comparison methods: although a perfect match might be found, the resulting score, due to the size of the query sequence, will always be small and thus will appear not significant.

## Supplementary Material

**Additional File 1:**

• VAST and SHEBA heat maps • Complete heat map of VAST and SHEBA, obtained for a *Pcli *cutoff value of 2.5 and a *Zscore *cutoff value of 2.7, respectively. The cutoffs correspond to an overall average *FPR *of 0.01, and result in an overall average *TPR *of 0.616 and 0.748 for VAST and SHEBA respectively. The x (target folds) and y (query folds) axes of the heat maps are labeled by the SCOP folds, grouped into the different classes A, B, C, D, E, F and G. Each pixel within the heat maps represents a fold-specific true or false positive rate and takes value between 0 and 1. Diagonal and off-diagonal pixels correspond to fold-specific true positive rate *TPR*_{i}(c) (eq. 4, see Methods) and fold-specific false positive rate *FPR*_{i,j}(c) (eq. 3, see Methods) respectively.

^{(147K, pdf)}

**Additional File 2:**

• Fold-specific True Positive Rates (see Methods) at 1% False Positive Rate, for VAST and SHEBA, for 468 SCOP Folds in the order of the Heat Map. • Rows 1 to 7 correspond respectively to: the row number, the SCOP fold identifier, the number of domains within a fold, *TPR*_{i }value obtained by the fold with VAST, *TPR*_{i }value obtained by the fold with SHEBA, SCOP name of the fold, and SCOP description of the fold.

^{(165K, pdf)}

## Acknowledgements

We thank Mr. Steve Fellini and Ms. Susan Chacko for their help with the Biowulf cluster, and Mr. Antej Nuhanovic for his contribution to make a version of the heat map publicly available. This research was supported in part by the Intramural Research Program of the NIH, Center for Information Technology and the National Cancer Institute.

## References

- Richardson JS. The anatomy and taxonomy of protein structure. Advance protein chemistry. 1981;34:167–339. [PubMed]
- Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for investigation of sequences and structures. Journal of Molecular Biology. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159. [PubMed] [Cross Ref]
- Orengo C, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH-a hierarchic classification of protein domains structures. Structures. 1997;5:1093–1108. doi: 10.1016/S0969-2126(97)00260-8. [PubMed] [Cross Ref]
- Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. The EMBO journal. 1986;5:823–826. [PMC free article] [PubMed]
- Gibrat JF, Madej T, Bryant SS. Surprising similarities in structure comparison. Current Opinion in Structural Biology. 1996;6:377–385. doi: 10.1016/S0959-440X(96)80058-3. [PubMed] [Cross Ref]
- Madej T, Gibrat JF, Bryant SH. Threading a database of protein cores. Protein: Structure, Function, and Genetics. 1995;23:356–369. doi: 10.1002/prot.340230309. [PubMed] [Cross Ref]
- Ortiz AR, Strauss C, Olmea O. MAMMOTH (Matching Molecular Models Obtained from Theory): An automated method for model comparison. Protein Science. 2002;11:2606–2621. doi: 10.1110/ps.0215902. [PMC free article] [PubMed] [Cross Ref]
- Zemla A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Research. 2003;31:3370–3374. doi: 10.1093/nar/gkg571. [PMC free article] [PubMed] [Cross Ref]
- Goldsmith-Fischman S, Honig B. Structural genomics: computational methods for structure analysis. Protein Science. 2003;12:1813–1821. doi: 10.1110/ps.0242903. [PMC free article] [PubMed] [Cross Ref]
- Koehl P. Protein structure similarities. Current Opinion in Structural Biology. 2001;11:348–353. doi: 10.1016/S0959-440X(00)00214-1. [PubMed] [Cross Ref]
- Subbiah S, Laurents DV, Levitt M. Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Current Biology. 1993;3:141–148. doi: 10.1016/0960-9822(93)90255-M. [PubMed] [Cross Ref]
- Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering. 1998;11:739–747. doi: 10.1093/protein/11.9.739. [PubMed] [Cross Ref]
- Harrison A, Pearl F, Mott R, Thornton J, Orengo C. Quantifying the similarities within fold space. Journal of Molecular Biology. 2002;323:909–926, doi:10.1016/S0022-2836(02)00992-0. doi: 10.1016/S0022-2836(02)00992-0. [PubMed] [Cross Ref]
- Shapiro J, Brutlag D. FoldMiner: Structural motif discovery using an improved superposition algorithm. Protein Science. 2004;13:278–294. doi: 10.1110/ps.03239404. [PMC free article] [PubMed] [Cross Ref]
- Yang AS, Honig B. An integrated approach to the analysis and modeling of protein sequences and structures. Protein structural alignment and a quantitative measure for protein structural distance. Journal of Molecular Biology. 2000;301:665–678, doi:10.1006/jmbi.2000.3973. doi: 10.1006/jmbi.2000.3973. [PubMed] [Cross Ref]
- Jung J, Lee B. Protein structure alignment using environmental profiles. Protein Engineering. 2000;13:535–543. doi: 10.1093/protein/13.8.535. [PubMed] [Cross Ref]
- Ye Y, Godzik A. Database searching by flexible protein structure alignment. Protein Science. 2004;13:1841–1850. doi: 10.1110/ps.03602304. [PMC free article] [PubMed] [Cross Ref]
- Shindyalov I, Bourne PE. An alternative view of protein fold space. Proteins: Structure, Function and Genetics. 2000;38:247–260. doi: 10.1002/(SICI)1097-0134(20000215)38:3<247::AID-PROT2>3.0.CO;2-T. [PubMed] [Cross Ref]
- Holm L, Sander C. Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology. 1993;233:123–138, doi:10.1006/jmbi.1993.1489. doi: 10.1006/jmbi.1993.1489. [PubMed] [Cross Ref]
- Kolodny R, Koehl P, Levitt M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. Journal of Molecular Biology. 2005;346:1173–1188, doi:10.1016/j.jmb.2004.12.032. doi: 10.1016/j.jmb.2004.12.032. [PMC free article] [PubMed] [Cross Ref]
- Hanley JA, McNeil BJ. The meaning of the area under the Receiver Operationg Characteristic (ROC) Curve. Radiology. 1982;143:29–36. [PubMed]
- Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997;30:1145–1159. doi: 10.1016/S0031-3203(96)00142-2. [Cross Ref]
- Sierk ML, Pearson WR. Sensitivity and selectivity in protein structure comparison. Protein Science(2004) 2004;13:773–785. doi: 10.1110/ps.03328504. [PMC free article] [PubMed] [Cross Ref]
- Getz G, Vendruscolo M, Sachs D, Domany E. Automated Assignment of SCOP and CATH Protein Structure Classifications from FSSP. Proteins: Structure, Function and Genetics. 2002;46:405–415. doi: 10.1002/prot.1176. [PubMed] [Cross Ref]
- Hadley C, Jones D. A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structures. 1999;7:1099–1112. doi: 10.1016/S0969-2126(99)80177-4. [PubMed] [Cross Ref]
- Day R, Beck D, Armen R, Daggett V. A consensus view of fold space: combining SCOP, CATH, and Dali Domain Dictionnary. Protein Science. 2003;12:2150–2160. doi: 10.1110/ps.0306803. [PMC free article] [PubMed] [Cross Ref]
- Gerstein M, Levitt M. Comprehensive assessment of automatic structural alignment against a manual standard, the SCOP classification of proteins. Protein Science. 1998;7:445–456. [PMC free article] [PubMed]
- Novotny M, Madsen D, Kleywegt GJ. Evaluation of protein fold comparison servers. PROTEINS: Structure, Function and Bioinformatics. 2004;54:260–270. doi: 10.1002/prot.10553. [PubMed] [Cross Ref]
- Brenner SE, Chothia C, Hubbard TJP, Murzin AG. Understanding protein structure: using SCOP for fold interpretation. Methods in Enzymology. 1996;266:635–643. [PubMed]
- Kajava A. What curves alpha-solenoids ? Evidence for an alpha-helical toroid structure of Rpn1 and Rpn2 proteins of the 26 S proteasome. The Journal of Biological Chemistry. 2002;277:49791–49798. doi: 10.1074/jbc.M204982200. [PubMed] [Cross Ref]
- Hou J, Sims GE, Zhang C, Kim SH. A global representation of the protein fold space. PNAS. 2003;100:2386–2390. doi: 10.1073/pnas.2628030100. [PMC free article] [PubMed] [Cross Ref]
- Hou J, Jun SR, Zhang C, Kim SH. Global mapping of protein structure space and application in structure-based inference of protein function. PNAS. 2005;102:3651–3656. doi: 10.1073/pnas.0409772102. [PMC free article] [PubMed] [Cross Ref]
- Krishna SS, Grishin NV. Structural drift: a possible path to protein fold change. Bioinformatics. 2005;21:1308–1310. doi: 10.1093/bioinformatics/bti227. [PubMed] [Cross Ref]
- Domingues FS, Koppensteiner WA, Sippl MJ. The role of protein structure in genomics. FEBS Letters. 2000;476:98–102. doi: 10.1016/S0014-5793(00)01678-1. [PubMed] [Cross Ref]
- Holm L, Sander C. Touring protein fold space with Dali/FSSP. Nucleic Acids Research. 1998;26 [PMC free article] [PubMed]
- Efimov AV. Structural trees for protein superfamilies. PROTEINS: Structure, Function and Genetics. 1997;28:241–260. doi: 10.1002/(SICI)1097-0134(199706)28:2<241::AID-PROT12>3.0.CO;2-I. [PubMed] [Cross Ref]
- Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Ly G, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Marchler GH, Mullokandov M, Shoemaker BA, Simonyan V, JS JSS, Thiessen PA, Yamashita RA, Yin JJ, Zhang D, Bryant SH. CDD: a Conserved Domain Database for protein classification. Nucleic Acids Research. 2005;33:D19–26. doi: 10.1093/nar/gni020. [PMC free article] [PubMed] [Cross Ref]
- VAST INRA server http://www-mig.jouy.inra.fr
- SHEBA server http://lmbbi.nci.nih.gov
- MSCL server http://abs.cit.nih.gov/strcomp
- Chandonia JM, Hon G, Walker NS, Conte LL, Koehl P, Levitt M, Brenner SE. The ASTRAL compendium in 2004. Nucleic Acids Research. 2004;32:D189–D192. doi: 10.1093/nar/gkh034. [PMC free article] [PubMed] [Cross Ref]
- Biowulf cluster http://biowulf.nih.gov
- DeLano WL. The PyMOL Molecular Graphics System. (2002) DeLano Scientific, San Carlos, CA, USA.

**BioMed Central**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (1.1M) |
- Citation

- Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures.[PLoS Comput Biol. 2009]
*Pascual-García A, Abia D, Ortiz AR, Bastolla U.**PLoS Comput Biol. 2009 Mar; 5(3):e1000331. Epub 2009 Mar 27.* - Towards an automatic classification of protein structural domains based on structural similarity.[BMC Bioinformatics. 2008]
*Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ.**BMC Bioinformatics. 2008 Jan 31; 9:74. Epub 2008 Jan 31.* - Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment.[BMC Bioinformatics. 2007]
*Ferragina P, Giancarlo R, Greco V, Manzini G, Valiente G.**BMC Bioinformatics. 2007 Jul 13; 8:252. Epub 2007 Jul 13.* - An introduction to modeling structure from sequence.[Curr Protoc Bioinformatics. 2006]
*Petsko GA.**Curr Protoc Bioinformatics. 2006 Oct; Chapter 5:Unit 5.1.* - The family feud: do proteins with similar structures fold via the same pathway?[Curr Opin Struct Biol. 2005]
*Zarrine-Afsar A, Larson SM, Davidson AR.**Curr Opin Struct Biol. 2005 Feb; 15(1):42-9.*

- Touring Protein Space with Matt[IEEE/ACM transactions on computational biol...]
*Daniels NM, Kumar A, Cowen LJ, Menke M.**IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM. 2011 Apr 1;10.1109/TCBB.2011.70* - Exploring the limits of fold discrimination by structural alignment: A large scale benchmark using decoys of known fold[Computational Biology and Chemistry. 2011]
*Hollup SM, Sadowski MI, Jonassen I, Taylor WR.**Computational Biology and Chemistry. 2011 Jun; 35(3)174-188* - Protein domain assignment from the recurrence of locally similar structures[Proteins. 2011]
*Tai CH, Sam V, Gibrat JF, Garnier J, Munson PJ, Lee B.**Proteins. 2011 Mar; 79(3)853-866* - Generation of a consensus protein domain dictionary[Bioinformatics. 2011]
*Schaeffer RD, Jonsson AL, Simms AM, Daggett V.**Bioinformatics. 2011 Jan 1; 27(1)46-54* - Automatic structure classification of small proteins using random forest[BMC Bioinformatics. ]
*Jain P, Hirst JD.**BMC Bioinformatics. 11364*

- ROC and confusion analysis of structure comparison methods identify the main cau...ROC and confusion analysis of structure comparison methods identify the main causes of divergence from manual protein classificationBMC Bioinformatics. 2006; 7()206

Your browsing activity is empty.

Activity recording is turned off.

See more...