Proc Natl Acad Sci U S A. Dec 7, 1999; 96(25): 14318–14323.
PMCID: PMC24434
Biophysics

Detection of protein fold similarity based on correlation of amino acid properties

Abstract

An increasing number of proteins with weak sequence similarity have been found to assume a similar three-dimensional fold and often have similar or related biochemical or biophysical functions. We propose a method for detecting the fold similarity between two proteins with low sequence similarity based on their amino acid properties alone. The method, the proximity correlation matrix (PCM) method, is built on the observation that the physical properties of neighboring amino acid residues in sequence at structurally equivalent positions of two proteins of similar fold are often correlated even when amino acid sequences are different. The hydrophobicity is shown to be the most strongly correlated property for all protein fold classes. The PCM method was tested on 420 proteins belonging to 64 different known folds, each having at least three proteins with little sequence similarity. The method was able to detect fold similarities for 40% of the 420 sequences. Compared with sequence comparison and several fold-recognition methods, the method demonstrates good performance in detecting fold similarities among the proteins with low sequence identity. Applied to the complete genome of Methanococcus jannaschii, the method recognized the folds for 22 hypothetical proteins.

The tremendous explosion in the number of genome sequences during the past few years makes functional characterization of gene products overwhelming. The most common way of inferring the function of a new gene is based on sequence similarity with proteins of known function. Classical sequence comparison algorithms like SSEARCH (1), FASTA (2), or BLAST (3) were designed to assess the degree of sequence similarities between compared sequences. However, an increasing number of proteins with weak sequence similarity have been found to assume similar three-dimensional (3D) folds, referred to here as remote homologues, and often have similar or related biochemical or biophysical functions. (In this work remote homologues imply only structure similarity of proteins rather than their evolutionary relationship, because the latter is often difficult to establish reliably for strongly divergent sequences.) To detect such fold similarity a variety of 3D-threading methods have been developed; in these methods, the amino acid sequence of a new protein is compared with the 3D amino acid profiles of proteins with known structures (4–8).

Because 3D-threading methods require the knowledge of the 3D structure of one of the two compared proteins, they are effective only for finding the remote homologues of the proteins with known 3D structures. To overcome this limitation, sequence alignment was combined with alignment of structural properties predicted or derived from sequence [one-dimensional (1D) threading]. The alignment of the predicted secondary structure only (9) or the predicted secondary structure and solvent accessibility of proteins (10) was shown to be useful for fold recognition. Adding sequence information by using a sequence similarity matrix works better (11–14), though finding the optimal matrix remains a challenge. The matrices currently available were derived from the statistics of known protein sequences or structures (11–16) and, thus, may be biased toward the current databases (17).

Because the three-dimensional structure of a protein is determined by the physical and chemical properties of all residues, we make a simplifying assumption that the local interactions in proximity of each residue in the protein are similar to those of the corresponding residue in its remote homologues. We make a further assumption that, because sequentially adjacent residues are usually proximal to each other in structure, the sequential arrangement of physical properties of amino acids flanking a given residue is likely to be correlated to that of the corresponding residue in remote homologues. This hypothesis is the basis of our method, the proximity correlation matrix (PCM) method, for detecting fold similarities between two protein sequences.

Detection of protein fold similarities has two major applications: (i) fold recognition, where a query sequence is compared with those of the proteins of known fold, and (ii) fold classification, where protein sequences are clustered into groups with the same predicted fold even when the fold information is not available. Here we present the results of the first application of the PCM method. The method is tested on a number of proteins with known structures and known remote homologues, compared with PSI-BLAST (18) and several 1D-threading techniques (11–15), and applied to the complete genome of Methanococcus jannaschii (19).

Algorithm

Data Sets.

For query proteins representing 64 folds (Table 1), we looked for their remote homologues in a target set composed of 1,390 protein sequences with sequence identity among them not exceeding 25% [nonredundant set of FSSP database (20)]. Using structural classification of proteins (SCOP) (21), we chose the 64 protein fold families, each including at least three remote homologues in the target set. Four hundred and twenty of 1,390 proteins in the target set belong to these fold families. Protein domains with fewer than 90 residues as well as the composite fold domains, i.e., consisting of more than one polypeptide chain or sequentially distant parts of the same chain, were eliminated.

Table 1
The most-populated protein folds and their representative query proteins

Protein Representation.

Each amino acid residue in a protein is described in terms of two quantities: secondary structure conformation (helix, strand, or coil) and one of the five physical properties representing the five major clusters of amino acid indices summarized by Tomii and Kanehisa (22). They are hydrophobicity (23), volume (24), normalized frequencies of α-helix (25), normalized frequencies of β-sheet (25), and relative frequency of occurrence (26). Both real [assigned by DSSP (27)] and predicted [using the program PSIPRED by David Jones (28)] secondary structures are used for testing.

Proximity Correlation Matrix.

For an amino acid residue i we defined its proximity by a “window,” i.e., a short fragment of the protein sequence extending from position i to i − l in one direction and to i + l in the other. The size of the window, L = 2l + 1 (l = 1, 2, 3), is varied in different experiments. For two given fragments in the two sequences compared, each fragment represented by its middle position (i and j, respectively; see Fig. 1a), we defined the correlation of a physical property p as:

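$$\mathrm{corr}(i, j) = \frac{1}{L} \sum_{k=-l}^{l} \frac{(p_{i+k} - \bar{p}_i)(p_{j+k} - \bar{p}_j)}{\sigma_i \sigma_j} \tag{1}$$

where $\bar{p}_i$ and $\sigma_i$ are the average and SD, respectively, of the property in the fragment defined by the window centered at i.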
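Eq. 1 can be read as a Pearson correlation between the property values in the two windows. The sketch below follows that reading. The function names, the handling of incomplete windows at chain ends, and the treatment of flat windows (zero SD) are assumptions made for illustration; the text does not specify them.

```python
import math

def window(values, center, l):
    """Return the 2l + 1 values centered at `center`, or None near the chain ends."""
    if center - l < 0 or center + l >= len(values):
        return None
    return values[center - l:center + l + 1]

def window_corr(p_query, i, p_target, j, l=1):
    """Correlation of one property between windows of size L = 2l + 1."""
    a = window(p_query, i, l)
    b = window(p_target, j, l)
    if a is None or b is None:
        return 0.0  # assumption: incomplete windows score zero
    size = len(a)
    mean_a, mean_b = sum(a) / size, sum(b) / size
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / size)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / size)
    if sd_a == 0.0 or sd_b == 0.0:
        return 0.0  # assumption: a flat window carries no signal
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / size
    return cov / (sd_a * sd_b)
```

With per-residue hydrophobicity values as input, two windows whose values rise and fall together score near +1, and windows with opposite trends score near −1, regardless of the absolute values.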

Figure 1
Construction of a proximity correlation matrix. In each panel, the segment of amino acid sequence of a query protein (using a one-letter code) and the corresponding vector of properties are shown vertically. Those of a target protein are shown horizontally. ...

To reduce noise from chance correlation of physical properties between two randomly chosen short fragments we required that polypeptide chains must have the same secondary structure type in structurally aligned positions. In other words, we constrained the alignments between two sequences to the regions where their secondary structures match (Fig. 1b).

Finally, for a pair of sequences of lengths M and N, we composed a M × N proximity correlation matrix, where the matrix element, pcmij, is:

[Equation 2]

where SS(i) is the secondary structure conformation of residue i, and corr(i, j) is calculated by Eq. 1. This matrix is used to find the optimal alignment between the sequence pair (Fig. 1c).
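A minimal sketch of how such a matrix could be assembled, assuming the window correlation is kept where the secondary structure states agree. The value assigned to mismatched positions is not stated in this text, so the `mismatch` parameter (defaulting to 0.0) is a placeholder assumption, as are the function and argument names.

```python
def build_pcm(ss_query, ss_target, corr, mismatch=0.0):
    """Build an M x N matrix from two secondary structure strings.

    ss_query, ss_target: strings over {'H', 'E', 'C'} (helix, strand, coil).
    corr: callable (i, j) -> float, e.g. a window correlation as in Eq. 1.
    mismatch: score used where SS(i) != SS(j); hypothetical default.
    """
    return [
        [corr(i, j) if ss_query[i] == ss_target[j] else mismatch
         for j in range(len(ss_target))]
        for i in range(len(ss_query))
    ]
```

Passing the correlation as a callable keeps the masking step independent of which amino acid property or window size is used.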

Alignment.

The alignment procedure is based on the global alignment algorithm of Needleman and Wunsch (29), with no penalties for terminal gaps. Because it is difficult to estimate the dependence of the alignment score on the lengths of the aligned sequences, especially if internal gaps are introduced, we used a simplified procedure, which compares only the whole sequences or sequence fragments of approximately the same length. The query and target sequences are directly compared if the difference in their lengths is less than 50 residues. If the length of a target sequence is longer than the query by more than 50 residues, the former is sliced into overlapping fragments of the length of the query sequence with 50-residue overlap between two adjacent fragments.
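The length rule above can be sketched as follows. The text does not say how a length difference of exactly 50 residues or the final, shorter remainder of a long target is handled; here a difference of exactly 50 is compared directly and the last fragment is anchored to the end of the target, both of which are assumptions.

```python
def target_fragments(query_len, target, max_diff=50, overlap=50):
    """Return the target pieces to align against a query of length query_len."""
    if len(target) - query_len <= max_diff:
        return [target]  # compared directly
    step = query_len - overlap
    if step <= 0:
        raise ValueError("query must be longer than the overlap")
    fragments = []
    start = 0
    while start + query_len < len(target):
        fragments.append(target[start:start + query_len])
        start += step
    # assumption: the last fragment is taken from the end of the target
    fragments.append(target[-query_len:])
    return fragments
```

For a 100-residue query and a 260-residue target this yields fragments starting at positions 0, 50, 100, 150, and 160, each 100 residues long.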

For a pair of sequences q and t, the alignment score, Sqt, is calculated as:

$$S_{qt} = \sum_{(q_i,\, t_j)} pcm_{q_i t_j} - \sum_{k} (O + E\, x_k)$$

where the first term is the sum of correlation coefficients (Eq. 2) over all aligned positions qi and tj, and the second term is the sum of the penalties for opening (O = 3.0) and elongation (E = 0.3) of all gaps (insertions or deletions), each extending for xk positions.
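As an illustration of the score described above, the sketch below evaluates a given alignment rather than searching for the optimal one. It assumes each gap of length x costs O + E·x and that terminal gaps are free, as stated for the alignment procedure. The alignment representation (a list of index pairs with None at gap positions) and the function name are hypothetical.

```python
def alignment_score(pcm, alignment, gap_open=3.0, gap_extend=0.3):
    """Score an alignment given as (i, j) pairs; None marks a gap on that side."""
    matched = [k for k, (i, j) in enumerate(alignment)
               if i is not None and j is not None]
    if not matched:
        return 0.0
    score = 0.0
    run, run_side = 0, None  # length and side of the current internal gap
    # Only columns between the first and last matched pair are considered,
    # so terminal gaps are never penalized.
    for i, j in alignment[matched[0]:matched[-1] + 1]:
        if i is not None and j is not None:
            if run:
                score -= gap_open + gap_extend * run
                run, run_side = 0, None
            score += pcm[i][j]
        else:
            side = "query" if i is None else "target"
            if run and side != run_side:
                score -= gap_open + gap_extend * run
                run = 0
            run_side = side
            run += 1
    return score
```

With O = 3.0 and E = 0.3, a single internal gap costs as much as several well-correlated positions contribute, so gaps are introduced only when they recover a long correlated stretch.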

All possible alignments are evaluated with Z score:

$$Z_{qt} = \frac{S_{qt} - \bar{S}_q}{\sigma_s}$$

where $\bar{S}_q$ and $\sigma_s$ are the average score and SD, respectively, of the alignments of the query (q) with all the targets (t). We found that the optimal window size (L) varies with different folds in detecting fold homologues. Therefore, for a given pair of sequences we took the best Z score among those obtained with different window sizes.
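A short sketch of this normalization and of taking the best Z score over window sizes. It assumes a population SD and that Z scores are computed separately for each window size before the maximum is taken; neither detail is spelled out in the text, and the function names are illustrative.

```python
import statistics

def z_scores(scores):
    """Convert {target: alignment score} for one query into {target: Z score}."""
    values = list(scores.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population SD
    if sd == 0.0:
        return {target: 0.0 for target in scores}
    return {target: (s - mean) / sd for target, s in scores.items()}

def best_z_over_windows(scores_by_window):
    """scores_by_window maps window size L to {target: alignment score}."""
    best = {}
    for scores in scores_by_window.values():
        for target, z in z_scores(scores).items():
            if target not in best or z > best[target]:
                best[target] = z
    return best
```

Normalizing within each window size first keeps scores from windows of three, five, and seven residues on a comparable scale before the best one is chosen.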

The overall flowchart of the PCM procedure is shown in Fig. 2.

Figure 2
Detecting fold similarities with PCM: a flowchart of the overall procedure.

Results and Discussion

Remote Homologues.

Remote homologues in our test are defined as proteins with similar fold but sequence identity not more than 25%. In calculating sequence identity, only the structurally aligned positions, as indicated in the FSSP database (20), are considered. To judge whether two folds are similar to each other, we used both manual [SCOP (21)] and automated (FSSP) classifications of protein structures. SCOP, often referred to as the most reliable classification (30), involves expert judgment but provides no alignment information, whereas FSSP is objective but requires careful assessment to exclude proteins with the same local structural motif but different folds.

The extent of structural similarity in FSSP is provided by the DALI Z score (31). Although true remote homologues are found toward the top of the DALI list (ordered by the decreasing magnitude of Z score), the boundary between the true remote homologues and all other proteins is not well defined. We have observed that in most cases this boundary coincides with the transition from a “discrete” to a “continuous” spectrum of Z scores and is marked by a prominent gap between adjacent Z scores in the DALI list (Fig. 3). Therefore, as an alternative to the classical, hard-cutoff model, Zcutoff = β = const, we introduced a new, heuristic model, which can be formally described as:

[Equation 3]

The models were compared for their ability to find the true remote homologues (as indicated by SCOP) of 64 query proteins (Table 1) among those automatically detected in the FSSP database. The constants, β and ε, were optimized with the criterion:

[Equation 4]

where ΔT (or ΔF) stands for the incremental number of true (or false) structural homologues with Z > Zcutoff. With a higher cutoff we lose more true than false remote homologues (ΔT > ΔF), whereas with lower cutoff we include more false than true positives (ΔT < ΔF).

Figure 3
Structural homologues in FSSP ordered by DALI Z score (31). For most queries the heuristic cutoff, i.e., the first large gap from the bottom, ΔZ = Z_i − Z_{i+1} > ε (vertical lines), separates the true remote homologs ...

The optimal cutoffs, β = 6.5 and ε = 0.5, find 58% and 67% of all true remote homologues, respectively, with less than 5% of false positives in both cases. Moreover, the heuristic cutoff, ε, works consistently better than the hard cutoff, β, for getting true remote homologues from the FSSP database (Fig. 4). Therefore, for proteins not yet classified by SCOP, we used the FSSP data with cutoff ε = 0.5 to establish their remote homology.
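The two cutoff models can be contrasted with a small sketch. It follows the description of the heuristic cutoff as the first gap larger than ε found when scanning the Z-ordered list from the bottom; the function names and the choice to return an empty list when no such gap exists are assumptions.

```python
def hard_cutoff_hits(z_desc, beta=6.5):
    """Keep every hit whose Z score exceeds a fixed threshold."""
    return [z for z in z_desc if z > beta]

def heuristic_cutoff_hits(z_desc, eps=0.5):
    """Keep the hits above the first gap > eps, scanning upward from the bottom.

    z_desc must be sorted in decreasing order of Z score.
    """
    for i in range(len(z_desc) - 2, -1, -1):
        if z_desc[i] - z_desc[i + 1] > eps:
            return z_desc[:i + 1]
    return []  # assumption: no prominent gap means no confident hits

dali_z = [12.0, 9.1, 7.5, 6.0, 4.0, 3.8, 3.6, 3.5]
print(hard_cutoff_hits(dali_z))       # hits with Z > 6.5
print(heuristic_cutoff_hits(dali_z))  # hits above the first large gap
```

In this made-up list the hit at Z = 6.0 falls below the hard threshold but sits above the gap between 6.0 and 4.0, so only the heuristic model retains it.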

Figure 4
Cutoff optimization on FSSP database. The number of true remote homologues (T) and other proteins (F) is determined for each value of the hard (gray lines) and heuristic (black lines) cutoffs. The optimal values (in bold) are chosen where ΔT/ΔF ...

Correlation of Physical Properties in Remote Homologues.

For a pair of remote homologues in FSSP we calculated the correlation coefficient of amino acid properties within a window of three, five, or seven residues (l = 1, 2, or 3, respectively) for each structurally aligned position by using Eq. 1. The sum of the coefficients, a total correlation, is compared with those obtained for the pairs of other members of the same fold with shuffled sequences as well as those for the pairs of other proteins with limited fold similarity according to FSSP. Among the five tested amino acid properties, hydrophobicity and β-sheet frequency are the two best properties to distinguish between true remote homologues of the globin fold and other proteins (Fig. 5). However, in general, hydrophobicity is the best property to detect remote homology by PCM for all fold types. The results described below were obtained by using this property.

Figure 5
Distribution of the total correlation of physical properties in structural alignments of globin, 2FAL, and its true remote homologs (black lines) according to SCOP (21), proteins with limited structural similarity in FSSP (gray lines), and random sequences ...

Fold Recognition by PCM.

Using each of 420 proteins representing the 64 well populated folds as query protein, we searched for its remote homologues among 1,390 proteins in the target set. With real or predicted secondary structure, the PCM method finds 178 or 167 true remote homologues, respectively. They correspond to more than 40% of all remote homologues within the 64 selected fold families.

The cutoff value for PCM predictions has been determined by the optimal ratio of true remote homologues and false positives (Eq. 4). The heuristic cutoff (Eq. 3) performs better than the hard one, and we found that the optimal cutoff, ε = 0.9, is the same using PCM combined with either real or predicted secondary structure. The number of false positives with this cutoff is equal to 16% (8%) for PCM with predicted (real) secondary structure.

For several highly populated folds like globins, EF hand, periplasmic-binding proteins, and the Rossmann fold, PCM detected more than 70% of their remote homologues. In the most populated folds, α/β (TIM) barrels and immunoglobulins, which tolerate slight variations in size and topology, about 40% of remote homologues were recognized. For some queries, the true remote homologues were predicted with a Z score below the cutoff. For others, either the property correlation in structurally aligned regions is low, close to that in random sequences, or the secondary structure pattern is not conserved between remote homologues.

Comparison with 1D-Threading Methods.

We compared the PCM method with four different 1D-threading methods available on the Internet: PredictProtein (11, 12), FoldFit (14), “Gon+predSS” (13), and H3P2 (15). Predictions were obtained for the same 64 queries by using the default parameters and fold library (Table 2) of each method. Because these methods use different fold libraries and scores, strict comparison is not possible. Therefore, success of fold recognition is determined by a uniform performance criterion: finding at least one remote homologue among the top five proteins with the highest Z score. Before ranking, all predicted homologues with sequence identity more than 25% have been excluded. Because the identity of protein sequences is determined on the basis of structural alignment, pairs of proteins with low structural similarity (Z < 2.0 in FSSP) have been eliminated as well.
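The success criterion can be written down compactly. In the sketch below each prediction is a dictionary with hypothetical keys (`target`, `z`, `identity`, `dali_z`); the record layout and function name are illustrative and not taken from any of the compared servers.

```python
def top_five_success(predictions, true_homologues,
                     max_identity=25.0, min_dali_z=2.0, top_n=5):
    """Return True if a true remote homologue is among the top-ranked hits.

    predictions: list of dicts with keys 'target', 'z' (method score),
    'identity' (percent, from structural alignment), 'dali_z' (FSSP Z score).
    """
    kept = [p for p in predictions
            if p["identity"] <= max_identity and p["dali_z"] >= min_dali_z]
    kept.sort(key=lambda p: p["z"], reverse=True)
    return any(p["target"] in true_homologues for p in kept[:top_n])
```

Filtering before ranking matters: a close homologue with more than 25% identity would otherwise occupy a top-five slot without counting as a remote homologue.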

Table 2
Fold recognition by different methods

The results of fold recognition are summarized in Table 2. The PCM method using real secondary structure performs best and provides the highest number of correct predictions of remote homologues: 57 of 64 query proteins found correct remote homologues, including 39 cases in which the true remote homologues appear as the first choices. With predicted secondary structure, the PCM method is comparable to “Gon+predSS,” the next best performer (Table 2). Comparing these three, we found that in two cases (1HTP and 1PUT) “Gon+predSS” is better than both versions of PCM and worse in the other four (1COLA, 1KUH, 1LIT, and 1WHTB). For some query proteins the correct fold is recognized by only one method: 1GSA and 1PRTF by PCM, 1HTP and 1PUT by “Gon+predSS,” and 1ZYMA by PredictProtein. Combining the results of all of these methods (excluding PCM with real secondary structure), 57 of 64 queries found correct folds. Including additional properties of amino acids is likely to improve the PCM method further.

Comparison with PSI-BLAST.

An advanced sequence-comparison method, PSI-BLAST (18), was shown to be able to detect efficiently some remote homologues (32–36). We compared the PCM method with PSI-BLAST by using the same queries and target proteins for both methods, which allows us to compare the results directly (in contrast to comparison with 1D-threading, where each method uses its own fold library). All 420 remote homologues of the 64 most-populated folds were used as queries. PSI-BLAST predictions were obtained in three iterations. Two different e-value cutoffs, 10⁻³ and 10⁻⁴, that had been effective in other studies (32–36), were tested here. The other parameters were default.

The PCM method with predicted secondary structure predicts more false positives (≈16%) than PSI-BLAST (≈2%). However, when compared for a similar number of predicted true remote homologues, PSI-BLAST is more successful in detecting remote homologues with sequence identities greater than 15%, whereas PCM does better for sequences with lower identities (Fig. 6). Therefore, a combination of these methods may be more efficient for predicting larger numbers of remote homologues.

Figure 6
Distribution of remote homologues in the 64 query protein folds detected by PCM by using real or predicted secondary structure and PSI-BLAST with different cutoffs.

Fold Recognition in Methanococcus jannaschii Genome.

We used PCM to discover remote homologues of the 64 protein folds from all the predicted proteins of the M. jannaschii genome (19). The predicted secondary structure was used for these proteins, and the real secondary structures were used for the query proteins. All 420 remote homologues of the 64 most-populated folds were used as queries to maximize the number of fold assignments. The cutoff, ε = 0.9, was applied to PCM predictions.

Of the 64 tested folds, 29 were detected in the genome of M. jannaschii (Fig. 7). A fold is assigned to 75 proteins; 22 of them, listed in Table 3, are currently annotated as hypothetical proteins (Methanococcus jannaschii Genome Database: http://www.tigr.org/tdb/mdb/mjdb/mjdb.html).

Figure 7
Protein folds detected by PCM in the M. jannaschii genome and their population.
Table 3
PCM fold recognition of hypothetical proteins in the genome of M. jannaschii

Conclusions

We propose a new approach for detecting fold similarities between two proteins with weak or no sequence similarities by using the PCM of amino acid properties combined with predicted (or real) secondary structures of the proteins. The approach is based on our observation that physical properties of amino acid residues surrounding the corresponding residues in two proteins with the same fold are correlated along the sequences. Among the different properties tested in this work, hydrophobicity is shown to be the most strongly correlated property for all fold classes. In our future studies, we plan to incorporate the other properties that are correlated in some but not other fold classes.

The PCM method detects more than 40% of 420 remote homologues in the 64 selected folds. When the correct secondary structure is used, 89% of 64 query proteins, each representing a distinct fold, found at least one remote homologue among the top five choices. This number goes down to 73% when predicted secondary structure is used. As secondary structure prediction methods improve, the performance of PCM is expected to improve as well. A test application of the PCM method to the complete genome of M. jannaschii reveals its ability to infer fold information for hypothetical proteins as well as for others for which no fold information is available with existing methods.

Compared with PSI-BLAST, our method demonstrates better sensitivity in detecting remote homologues with a sequence identity of less than 15%. Combined with existing methods, such as PSI-BLAST and/or 1D-threading, the PCM method can provide additional fold information for proteins with low sequence similarities.

Acknowledgments

We acknowledge Dr. Paul Gordon for his initial testing of the proximity correlation approach and Dr. Chao Zhang for valuable discussions and reading the manuscript. We thank Drs. Inna Dubchak, Jim Bowie, and Steven Brenner for their critical reading and helpful remarks. We also thank Dr. David Jones for providing the PSIPRED program. This research was supported by the U.S. Department of Energy (DE-AC03-76SF00098) and National Science Foundation (DBI-9723352) and used the resources of the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory, Berkeley, CA.

Abbreviations

PCM, proximity correlation matrix
3D, three-dimensional
1D, one-dimensional
SCOP, structural classification of proteins

Footnotes

Alexandrov, N. N. & Solovyev, V. V., Proceedings of the Pacific Symposium on Biocomputing 1998, January 4–9, 1998, Hawaii, pp. 463–472.

References

1. Smith T F, Waterman M S. J Mol Biol. 1981;147:195–197.
2. Pearson W R, Lipman D J. Proc Natl Acad Sci USA. 1988;85:2444–2448.
3. Altschul S F, Gish W, Miller W, Myers E W, Lipman D J. J Mol Biol. 1990;215:403–410.
4. Bowie J U, Luthy R, Eisenberg D. Science. 1991;253:164–170.
5. Jones D T, Taylor W R, Thornton J M. Nature (London) 1992;358:86–89.
6. Eisenhaber F, Persson B, Argos P. Crit Rev Biochem Mol Biol. 1995;30:1–94.
7. Lemer C M, Rooman M J, Wodak S J. Proteins. 1995;23:337–355.
8. Sternberg M J, Bates P A, Kelley L A, MacCallum R M. Curr Opin Struct Biol. 1999;9:368–373.
9. Sheridan R P, Dixon J S, Venkataraghavan R. Int J Peptide Protein Res. 1985;25:132–143.
10. Russell R B, Copley R R, Barton G J. J Mol Biol. 1996;259:349–365.
11. Rost B. Proc Conf Intelligent Systems Mol Biol ISMB. 1995;95:314–321.
12. Rost B, Schneider R, Sander C. J Mol Biol. 1997;270:471–480.
13. Fischer D, Eisenberg D. Protein Sci. 1996;5:947–955.
14. Russell R B, Saqi M A S, Sayle R A, Bates P A, Sternberg M J E. Protein Eng. 1998;11:1–9.
15. Rice D, Eisenberg D. J Mol Biol. 1997;267:1026–1038.
16. Russell R B, Saqi M A S, Sayle R A, Bates P A, Sternberg M J E. J Mol Biol. 1997;269:423–439.
17. Gerstein M. Fold Des. 1998;3:497–512.
18. Altschul S F, Madden T L, Schaffer A A, Zhang J, Zhang Z, Miller W, Lipman D J. Nucleic Acids Res. 1997;25:3389–3402.
19. Bult C J, White O, Olsen G J, Zhou L, Fleischmann R D, Sutton G G, Blake J A, FitzGerald L M, Clayton R A, Gocayne J D, et al. Science. 1996;273:1058–1073.
20. Holm L, Sander C. Nucleic Acids Res. 1998;26:316–319.
21. Murzin A G, Brenner S E, Hubbard T J P, Chothia C. J Mol Biol. 1995;247:536–540.
22. Tomii K, Kanehisa M. Protein Eng. 1996;9:27–36.
23. Fauchere J L, Pliska V. Eur J Med Chem. 1983;18:369–375.
24. Zamyatin A A. Prog Biophys Mol Biol. 1972;24:107–123.
25. Chou P Y, Fasman G D. Adv Enzymol. 1978;47:45–148.
26. Jones D T, Taylor W R, Thornton J M. CABIOS. 1992;8:275–282.
27. Kabsch W, Sander C. Biopolymers. 1983;22:2577–2637.
28. Jones D T. J Mol Biol. 1999;292:195–202.
29. Needleman S B, Wunsch C D. J Mol Biol. 1970;48:443–453.
30. Gerstein M, Levitt M. Protein Sci. 1998;7:445–456.
31. Holm L, Sander C. Nucleic Acids Res. 1994;22:3600–3609.
32. Teichmann S A, Park J, Chothia C. Proc Natl Acad Sci USA. 1998;95:14658–14663.
33. Huynen M, Doerks T, Eisenhaber F, Orengo C, Sunyaev S, Yuan Y P, Bork P. J Mol Biol. 1998;280:323–326.
34. Salamov A A, Suwa M, Orengo C A, Swindells M B. Protein Sci. 1999;8:771–777.
35. Wolf Y I, Brenner S E, Bash P A, Koonin E V. Genome Res. 1999;9:17–26.
36. Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C. J Mol Biol. 1998;284:1201–1210.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences