Protein Sci. Apr 2004; 13(4): 884–892.
PMCID: PMC2280064

Prediction of functional sites by analysis of sequence and structure conservation

Abstract

We present a method for prediction of functional sites in a set of aligned protein sequences. The method selects sites which are both well conserved and clustered together in space, as inferred from the 3D structures of proteins included in the alignment. We tested the method using 86 alignments from the NCBI CDD database, where the sites of experimentally determined ligand and/or macromolecular interactions are annotated. In agreement with earlier investigations, we found that functional site predictions are most successful when overall background sequence conservation is low, such that sites under evolutionary constraint become apparent. In addition, we found that averaging of conservation values across spatially clustered sites improves predictions under certain conditions: that is, when overall conservation is relatively high and when the site in question involves a large macromolecular binding interface. Under these conditions it is better to look for clusters of conserved sites than to look for particular conserved sites.

Keywords: protein domains, prediction of functional residues, evolutionary conservation

Despite recent growth of the protein sequence and structure databases, there remains only a small fraction of proteins whose functions have been experimentally characterized. It is sometimes possible to infer the function of uncharacterized proteins by comparison to the sequences or structures of functionally annotated homologs. Common descent does not necessarily imply functional similarity, however (Hegyi and Gerstein 1999; Devos and Valencia 2000; Todd et al. 2001) and functional annotation transferred from one homologous protein to another can result in incorrect functional assignment. To verify functional assignments one must examine the common features conserved among homologs and attempt to identify functionally important sites.

Several investigators have considered the problem of functional site prediction using multiple sequence alignments (Casari et al. 1995; Andrade et al. 1997; Hannenhalli and Russell 2000; Li et al. 2003). Casari et al. (1995), for example, applied principal component analysis to a vector representation of protein sequences in a multidimensional “sequence space,” to derive subfamily-specific residues involved in protein function. Andrade et al. (1997) proposed a rigorous clustering algorithm based on a self-organizing map as a means to identify protein subfamilies and retrieve characteristic sequence patterns. As functional similarity can be inferred from clades in phylogenetic trees, some methods of functional site prediction use phylogenetic analysis to identify residues associated with functional divergence (Lichtarge et al. 1996; Sjolander 1998; Aloy et al. 2001; Madabushi et al. 2002; del Sol Mesa et al. 2003). The evolutionary trace (ET) method, for example, delineates invariant residues responsible for subgroup specificity by partitioning the dendrogram into an increasing number of subgroups of similar sequences with subsequent analysis of their three-dimensional (3D) structures (Lichtarge et al. 1996; Aloy et al. 2001; Madabushi et al. 2002).

Despite the efforts in this field, the accuracy of functional site predictions remains low, suggesting that it may be worthwhile to consider other aspects beyond sequence conservation. Use of structure information is one possibility, because knowledge of the protein structure is necessary for predicting many aspects of protein function (Teichmann et al. 2001). Given that functionally important surface regions often contain residues with specific characteristics, some methods attempt to identify functional sites on the basis of physicochemical properties of individual residues, their electrostatic contribution, and their location in the 3D structure (Jones and Thornton 1997; Tsai et al. 1997; Elcock 2001; Bartlett et al. 2002). Landgraf and colleagues (2001), for example, offered an automated method for functional site prediction by identifying 3D clusters of conserved residues using residue-specific (regional) and global similarity scores.

Here we present a method which is based on the assumption that the structural location of functional sites is conserved between homologous proteins and that functionally important residues tend to cluster together in space, forming three-dimensional residue clusters or surface patches. In the method considered here, each residue is assigned a score which depends on its own conservation in homologs and the conservation of residues in its spatial neighborhood, as judged from the analysis of known structures within a given protein family. We hypothesize that high-scoring sites are more likely to be involved in specific binding or catalysis, and that one may identify functionally important residues even in the absence of structural data on protein–ligand or macromolecular complexes.

We tested the method on a benchmark of 86 protein domain families, including families with a wide variety of functions and sequence diversity. To assess the accuracy of functional site predictions, we applied a rigorous receiver operating characteristic (ROC) test (see Materials and Methods). This gave us a means to compare different scoring schemes directly, by calculating the actual number of correctly predicted functional sites at a given level of false assignments. We show that including information about conserved structural features in some cases helps to make more accurate predictions, especially for DNA/RNA binding macromolecular interfaces. When sequence diversity is low, spatial averaging also helps to detect functional sites against the high background of sequence conservation.

Results

Functional site predictions based on sequence conservation and sequence conservation with spatial averaging

Functionally relevant residues in proteins are often conserved among all or a majority of members of a protein family. Accordingly, these residues can be identified from the analysis of positional conservation in multiple sequence alignments using different sequence conservation measures. Here, we employed information content and maximum likelihood estimates of the expected number of substitutions per position (substitution rate), as calculated by the PAML package (Yang 1997). We found that substitution rates performed better in terms of detecting functional sites than information content; the recognition rate at 5% false positives (R0.05) for the whole test set was 0.32 and 0.25 using PAML substitution rate and information content, respectively. This difference is especially pronounced for highly divergent domain families and could be due to the fact that the substitution rate calculated by PAML takes into account the phylogenetic history of the protein family.

To determine whether clustering of conserved residues in space and consideration of their solvent accessibility help to identify functional sites, we compared scoring functions based on sequence conservation alone and sequence conservation with spatial averaging (see Materials and Methods). Figure 1 shows the ROC30 statistic for the contact-based scoring function with an optimized distance cutoff (the distance cutoff yielding the best performance for each domain family) and with a fixed distance cutoff (less than 6 Å), plotted against ROC30 values obtained with a sequence-based scoring function. As can be seen from the figure, the contact-based scoring function with an optimized distance cutoff detects more functional sites than the sequence-based scoring function for 73% of domain families. Because the optimal distance cutoff is difficult to determine a priori for each domain family, in our work we used the 6 Å distance cutoff, which has been shown to yield the best performance.

Figure 1.
The ROC30 statistic for each domain family obtained with the contact-based scoring function (equation 1) and optimized distance interval cutoff is plotted vs. ROC30 values calculated with the original sequence-based scoring function (triangles). The ROC ...

Functional site predictions for different functional categories

Analyzing different functional categories, we found that conserved contacts and solvent accessibility are particularly useful for predicting DNA/RNA-binding and protein–protein binding interfaces. The difference in recognition accuracies can be represented by ROC plots (Fig. 2A,B), which show the fraction of false positives for any given recognition rate. For example, at 5% of false positives the structure-based scoring function detects about 20% of DNA/RNA-binding and 14% of protein–protein binding sites, whereas the sequence-based scoring function yields a recognition rate of 9%–10%. An improvement in the ROC30 statistic upon including structural information is also observed for DNA/RNA binding and protein–protein binding sites, as can be seen from Table 1. It was shown earlier that the level of conservation of DNA-binding and protein–protein binding sites and, as a consequence, detection accuracy, depends on the conservation of the entire protein sequence (Luscombe and Thornton 2002; Nooren and Thornton 2003). Given that the average sequence identity in our test families is about 30%, DNA/RNA-binding and protein–protein binding sites are also predicted with limited accuracy.

Table 1.
Average ROC30 values calculated with different scoring functions for different functional categories of test domains: catalytic, DNA/RNA-binding and protein-protein binding domains
Figure 2.
The fraction of correctly identified DNA/RNA binding sites (A), protein–protein binding sites (B), and catalytic sites (C) is plotted against the fraction of incorrectly identified functional sites for different scoring functions: the original ...

We found that the success rate in detection of catalytic sites is higher than for other types of functional sites, about 47% true positives recognized at the 5% false positive rate (Fig. 2C). The increased prediction accuracy for catalytic sites can be explained by the fact that catalytic sites apparently are under stronger selection pressure (not counting those cases where different functional groups could mediate the same catalytic mechanisms in homologous enzymes [Todd et al. 2002]), such that even families with a high degree of sequence diversity exhibit strong conservation of catalytic sites. As can be seen from Figure 2C, structure information does not seem to assist the prediction of catalytic sites. Examination of Table 1 shows that residue solvent exposure is also not a very important factor in predicting catalytic sites, which agrees with the previous observation that despite their polarity, catalytic residues have lower solvent exposure compared to other residues (Bartlett et al. 2002).

It should be noted that there is great variety among different catalytic domains. They can vary in terms of the type of enzymatic activity, the sizes of protein clefts, and interacting ligands. These factors apparently make it difficult to predict active sites using structure-based scoring function with the fixed distance cutoff. As a consequence, the sequence-based scoring function alone gives more reliable predictions for sufficiently diverse domain families where conserved active sites become more apparent. On the other hand, DNA/RNA binding and protein–protein binding sites very often are nonspecific and form contiguous patches on the surface of the protein. These factors apparently allow the contact-solvent-accessibility scoring function to improve detection of functional sites.

Statistical significance of functional site predictions

To compare the results obtained by our method to the outcome of random assignments, we performed a binomial test for each domain family. The number of trials in the binomial test was equal to the overall number of functional residues in a given domain alignment, and the probability of success was calculated as the number of functional residues in the alignment divided by the overall number of residues in the alignment. Using the contact-solvent-accessibility scoring function, we found that predictions of functional sites for 57% of domain families are significant with P-values <0.05 (P-value here denotes the probability of finding an equal or higher number of correctly predicted functional sites purely from the binomial distribution). Values for domains with annotated catalytic, DNA/RNA-binding, and protein–protein binding sites were 76%, 35%, and 20%, respectively. Sequence conservation scoring yielded significant predictions of catalytic sites for 65% of domains, DNA/RNA-binding sites for 24% of domains, and protein–protein interfaces for 20% of domains (50% overall). In all cases a site was predicted to be functional if it belonged to the top 5% of the most conserved sites in the domain alignment.
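The binomial test described above can be sketched in a few lines. This is only an illustration under the parameterization stated in the text (trials equal to the number of annotated functional residues, success probability equal to the fraction of functional residues in the alignment); the function names `binomial_tail` and `prediction_p_value` and the example counts are hypothetical and do not come from the original analysis.

```python
from math import comb

def binomial_tail(k, n_trials, p):
    """P(X >= k) for X ~ Binomial(n_trials, p)."""
    return sum(
        comb(n_trials, x) * p**x * (1 - p) ** (n_trials - x)
        for x in range(k, n_trials + 1)
    )

def prediction_p_value(n_correct, n_functional, n_sites):
    """Probability of at least n_correct hits arising by chance.

    Follows the text: trials = number of functional residues,
    success probability = functional residues / all residues.
    """
    p_success = n_functional / n_sites
    return binomial_tail(n_correct, n_functional, p_success)

# Hypothetical family: 100 aligned sites, 10 annotated functional sites,
# 4 of which were predicted correctly.
p_value = prediction_p_value(n_correct=4, n_functional=10, n_sites=100)
print(f"P = {p_value:.4f}")
```

With these made-up counts the tail probability falls below 0.05, so such a family would count toward the "significant" fraction reported above.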

These results are comparable to those of the 3D cluster analysis employed by Landgraf et al. (2001). Those investigators identified 36% of all interface residues at a threshold of less than 1% expected from reshuffled alignments and 67% at the less stringent threshold of 10%. An automated method based on the ET approach found the correct locations of catalytic residue clusters for 62 out of 80 enzymes (78% of clusters compared to 76% of catalytic domains with significant predictions found by our method) for multiple alignments with less than 30% identity (Aloy et al. 2001). Aloy et al. defined the predicted site/cluster to be correct if the overlap between the volume of predicted cluster and the volume of annotated functional site was more than 50%. Their method was considered to find a right prediction for a given protein if at least one of the predicted functional clusters was correct.

Conserved structural features help to predict functional residues for domain alignments with low sequence diversity

Our test set can be considered rather heterogeneous in terms of the sequence diversity of domain families (Table 2). For domain families with low sequence diversity, sequence and structure similarity is extensive and the degree of residue conservation is high for all positions in alignments. Sequence profiles based on low-diversity alignments perform relatively poorly in a database search (Panchenko and Bryant 2002), and we similarly found that functional residue identification is problematic in these cases. As shown in Figure 3, for low-diversity domain alignments (where the number of different amino acid types per column, Nobs, is less than 5 and average sequence identity is about 45%), the average recognition rate (R0.05) is less than 0.2, whereas for more diverse alignments (Nobs is greater than 15 and average sequence identity is about 20%), the average recognition rate is twice as high. In agreement with these results, Aloy et al. (2001) reported that for multiple alignments with sequence identity of more than 30%, their method of functional site prediction has very limited applications.

Table 2.
Names of 86 CDD families used together with the pdb codes of their first structures, average sequence identities of family alignments (average number of different amino acid types per column, Nobs), alignment lengths, and the overall numbers of functional ...
Figure 3.
The site recognition rate (R0.05) obtained with the sequence based scoring function is plotted for different sequence diversity ranges. Domain family diversity is calculated as the average number of different amino acid types per column in the CDD alignment. ...

We found that spatial averaging nonetheless improves functional site recognition for low-diversity alignments. As can be seen from Figure 4A, the site recognition rate increases for low-diversity families upon including the structure-based term in the scoring function. The improvement in accuracy exceeds 20% for this range of diversity, mostly affecting domain families with catalytic and DNA/RNA-binding sites. Moreover, including the solvent accessibility term in the scoring function improves the prediction accuracy for families with medium sequence diversity (Nobs between 5 and 15), as shown in Figure 4B. Diverse domain families with highly conserved functional sites, on average, show a decline in recognition rate when the structure-based scoring function is used. For example, the recognition rate for a very diverse family of metal-dependent phosphohydrolases (HDc; average percent identity 18%) drops from 100% recognition with the original sequence-based scoring to 50% with contact-based scoring. This family has a particularly conserved HD-motif, which suggests that the conservation signal is high enough to be detected by sequence-based scoring alone. Structure-based scoring in this case can flatten the overall signal by averaging the conservation measure over neighboring residues.

Figure 4.
Improvement in the site recognition rate upon including the structural term in the scoring function is plotted vs. the sequence diversity of domain families. The difference in recognition rate is calculated as the average recognition rate (R0.05) obtained ...

Discussion

In an attempt to identify functionally important sites, we present a method which quantifies the conservation of protein sites in terms of preserving amino acid types and local structural environments. First, the scoring function, which accounted for the local environment and/or surface exposure of protein sites, was found to perform better than sequence-based scoring alone in many cases, serving mainly as a filter to eliminate conserved but nonfunctional positions. The largest improvement was observed for predicting DNA/RNA binding sites. This observation is in agreement with previous studies, which similarly demonstrated that accounting for 3D clusters of conserved residues reduced the number of false positives identified (Landgraf et al. 2001).

Second, it was shown that the sequence divergence of domain alignments is a prerequisite for the successful functional prediction, and structurally conserved features help to discriminate functional and nonfunctional sites for families with low sequence diversity. Accordingly, to increase blind prediction accuracy we can formulate several rules based on these observations. The first: To predict functional residues for low-diversity families, whenever possible diversify them with more distantly related family representatives and, if not possible, use a structure-based scoring function. The second rule can be applied if the general function of the domain family is known: Whenever possible use contact-based and solvent accessibility-based scoring for predicting DNA/RNA binding and protein–protein binding sites; for catalytic sites use a contact-based scoring function for low-diversity families and the original sequence-based scoring function for all others. If a blind prediction of functional residues is being attempted, the simple strategy would be to apply these rules for initial family screening and then define functional residues as those having conservation scores among the top 7%, 6%, and 5% of conservation scores for catalytic, DNA/RNA binding, and protein–protein binding sites, respectively. These conservation score cutoffs correspond approximately to the error rate of 5% false positives.
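The two rules and the score cutoffs above can be written down as a small decision procedure. The sketch below is one possible reading, not the authors' implementation: the labels `"sequence"`, `"contact"`, and `"contact_solvent"`, the function names, the use of Nobs < 5 as the low-diversity threshold (taken from the Results), the choice of contact-based scoring as "the" structure-based function in the first rule, and the fallback to sequence-based scoring when the function is unknown and diversity is not low are all assumptions made for illustration.

```python
LOW_DIVERSITY_NOBS = 5  # assumption: "low diversity" means Nobs < 5, as in the Results

# Fraction of top-scoring sites called functional, per site type (from the text).
TOP_FRACTION = {"catalytic": 0.07, "dna_rna": 0.06, "protein_protein": 0.05}

def choose_scoring(n_obs, site_type=None):
    """Pick a scoring scheme from family diversity and (optionally) site type."""
    low_diversity = n_obs < LOW_DIVERSITY_NOBS
    if site_type in ("dna_rna", "protein_protein"):
        return "contact_solvent"
    if site_type == "catalytic":
        return "contact" if low_diversity else "sequence"
    # Function unknown: the first rule only covers low-diversity families;
    # falling back to sequence-based scoring otherwise is an assumption.
    return "contact" if low_diversity else "sequence"

def predict_sites(scores, fraction):
    """Return sorted indices of the top `fraction` of sites by score."""
    n_top = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_top])

# Hypothetical 100-position alignment whose score increases with position.
example_scores = list(range(100))
scheme = choose_scoring(n_obs=12, site_type="protein_protein")
predicted = predict_sites(example_scores, TOP_FRACTION["protein_protein"])
print(scheme, predicted)
```

Diversifying a low-diversity family with more distant homologs, which the first rule prefers whenever possible, happens before this step and is not modeled here.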

As we showed, spatial averaging does not always help the function prediction, and prediction accuracy still remains quite low. Madabushi et al. (2002) demonstrated that the number of clusters (or size of the largest cluster) of functional residues determined by the ET method was larger than the number of clusters predicted by random simulations for 98% of their test cases (at the significance level of 5%). It should be noted that this result does not imply that the ET method is able to correctly identify active sites for 98% of test proteins at the 5% significance level. Similarly to Landgraf et al. (2001), we showed that the accuracy of functional site prediction, in fact, was far from reaching 100%. Applying ROC analysis we found that 47% of active sites, 20% of DNA/RNA binding sites, and 14% of protein–protein interfaces can be predicted at a 5% false positive rate. We note that the limited accuracy of functional prediction can be caused by the differences in functional specificity among homologous family members as well as by the functional plasticity of protein molecules. Even proteins sharing the same evolutionary origin and functional activity may show variability in the physicochemical properties of functional residues and their location in a 3D structure (Todd et al. 2001, 2002; Lichtarge and Sowa 2002).

Materials and methods

A benchmark for evaluating the methods of functional sites prediction

We selected 86 domain alignments from the curated Conserved Domain Database (CDD), a current version of which is available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml (Marchler-Bauer et al. 2002). Multiple alignments in the CDD have been manually curated to reconcile sequence alignments with protein 3D structures and structure-structure alignments. Based on the crystal structures and experimental data from the literature, conserved functional sites have been annotated for each CDD domain by inspection of protein–ligand, protein–DNA/RNA, and protein–protein complexes for all structure representatives. Functionally important sites were defined as those residues making contacts with a ligand or a macromolecule. CDD alignments represent alignments of conserved core structures formed by presumably homologous sites, and positions outside the conserved cores are removed from the alignment, resulting in alignment lengths between 38 and 576 residues.

The selected test set covered a broad range of different functional categories including 37 domains with annotated catalytic sites, 17 domains with annotated DNA/RNA binding sites, 20 domains with annotated protein–protein binding sites, and domains from other functional groups (domains containing disulfide bonds and domains with less than two annotated functional sites were excluded). Names of CDD families used in the test set together with their sequence diversity, length, the number and the type of functional sites are listed in Table 2. By definition, CDD alignments have at least one structural family representative, whereas in our test set the number of structures per family ranged from 1 to 15, with three structures per family on average.

Calculation of sequence conservation

We used two different measures to estimate the level of conservation at each position in CDD alignments. The first measure, information content, was based on counting the number of different amino acid types per aligned column and inferring the relationships between amino acid types with the pseudocount method (Altschul et al. 1997), where pseudocount frequencies were calculated using the PAM70 amino acid substitution matrix. The second measure of evolutionary conservation of different sites, the substitution rate per site, was calculated using the PAML3.12 package (Yang 1997) with its implementation of the Jones, Taylor, and Thornton amino acid substitution model (Jones et al. 1992), where the variable substitution rates across sites were described with the γ-model. Phylogenetic trees required for this analysis were constructed by the neighbor-joining method (Saitou and Nei 1987) with the PHYLIP package (Felsenstein 1989).
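As a rough illustration of a column-wise conservation measure, the sketch below computes a plain Shannon-type information content and the average number of distinct residue types per column (Nobs). It is deliberately simplified: it assumes a uniform 20-letter background and uses raw observed frequencies, without the PAM70-based pseudocounts described above, and it does not attempt to reproduce the PAML substitution rates. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, gaps ignored.

    Simplification: uniform background, no pseudocounts.
    """
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy

def mean_n_obs(alignment):
    """Average number of distinct amino acid types per column (Nobs)."""
    distinct = [len({aa for aa in col if aa != "-"}) for col in zip(*alignment)]
    return sum(distinct) / len(distinct)

# Toy alignment: four sequences, four columns.
alignment = ["HDKL", "HDRL", "HDKV", "HEKI"]
info = [column_information(col) for col in zip(*alignment)]
n_obs = mean_n_obs(alignment)
print([round(x, 2) for x in info], n_obs)
```

In this toy case the invariant first column scores highest and the most variable last column lowest, which is the ordering any conservation measure of this kind should give.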

Scoring the clusters of conserved residues

For each position in the alignment, two regional conservation scores were calculated. The first one represented the average over conservation scores for residues located within a given distance from each position “i” of the alignment, namely,

$$C_{\mathrm{cont}}(i) = \frac{1}{n} \sum_{j=1}^{N} \Delta_{ij}\, C_j \qquad (1)$$

where Δij is equal to 1 if residues i and j are in contact, and 0 otherwise. Cj is the residue conservation score of residue j, N is the total number of positions in the alignment, and n is the number of residues in contact with residue “i.” Contacts were defined between the virtual Cβ atoms (points 2.4 Å away from Cα atom) of residues separated along the chain by at least five peptide bonds and having the distance less than a given distance cutoff (4, 5, 6, 7, 8, and 9 Å). It should be noted that contacts were calculated for all structural representatives of domain alignments, and only conserved contacts were used in the evaluation of Ccont. The contact between positions i and j was defined as conserved if aligned residues in these positions formed the contact in all structural representatives. For those residues which did not make any contacts, the original residue conservation value was assigned. Inter-residue contacts conserved between all structural representatives were shown to increase prediction accuracy for 60% of domain families (for families with more than one structure) compared to the scoring function based on one representative structure (data not shown).
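The contact-averaging step can be sketched as follows. This is a minimal illustration under several assumptions: each structure is given as a list of virtual Cβ coordinates indexed by alignment position (so sequence separation is approximated by the difference in position index), the score of a position is the plain mean of the conservation scores of its conserved contact partners, and positions without contacts keep their own score as stated above. The names `conserved_contacts` and `contact_scores` and all coordinates are made up.

```python
import math

def conserved_contacts(structures, cutoff=6.0, min_separation=5):
    """Pairs (i, j) closer than `cutoff` (in Å) in every structure.

    structures: list of coordinate lists, one (x, y, z) per aligned position.
    """
    n_pos = len(structures[0])
    pairs = set()
    for i in range(n_pos):
        for j in range(i + min_separation, n_pos):
            if all(math.dist(s[i], s[j]) < cutoff for s in structures):
                pairs.add((i, j))
    return pairs

def contact_scores(conservation, pairs):
    """Average conservation over contact partners; fall back to own score."""
    partners = {i: [] for i in range(len(conservation))}
    for i, j in pairs:
        partners[i].append(j)
        partners[j].append(i)
    scores = []
    for i, own in enumerate(conservation):
        if partners[i]:
            mean = sum(conservation[j] for j in partners[i]) / len(partners[i])
            scores.append(mean)
        else:
            scores.append(own)
    return scores

# Two hypothetical structures of a 7-position domain; positions 0 and 6
# are within 6 Å of each other in both.
struct_a = [(0, 0, 0), (10, 0, 0), (20, 0, 0), (30, 0, 0),
            (20, 10, 0), (10, 10, 0), (0, 4, 0)]
struct_b = struct_a[:6] + [(0, 5, 0)]
pairs = conserved_contacts([struct_a, struct_b])
conservation = [1.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.5]
regional = contact_scores(conservation, pairs)
print(pairs, regional)
```

Requiring the contact in every structure is what makes the contact "conserved"; adding a third structure in which positions 0 and 6 are far apart would leave no conserved contacts and return the original scores unchanged.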

The second regional conservation score gave emphasis to solvent accessible residues, because these residues are very often involved in the formation of functionally important interfaces:

equation M2
(2)

where Δsolv is equal to 1, if solvent accessibility of position “i” is greater than 0.05, and 0 otherwise. Reversing equation 2 and considering only buried residues in contact did not improve the prediction accuracy (data not shown). The cutoff threshold of 0.05 was derived from an analysis of homologous protein structures forming a conserved hydrophobic interior (Miller et al. 1987). Solvent-accessible area was calculated by the DSSP algorithm (Kabsch and Sander 1983), where solvent accessibility of residue “X” was defined as the ratio of its solvent-accessible area in protein structure to that for extended tripeptide Gly-X-Gly. The solvent accessibility of position “i” in a multiple alignment was calculated by averaging solvent accessibility values in a given position for all structural representatives.

Evaluation of prediction accuracy

To evaluate the accuracy of functional site predictions, we calculated the number of correctly predicted functional sites (true positives) and the number of incorrectly predicted functional sites (false positives) found at different thresholds of conservation score. True positives were identified as those functionally important sites which had scores higher than a given score threshold. False positives, in turn, were identified as sites with scores higher than a given threshold, but unrelated to the functional activity of a given domain family. To measure the performance of retrieval methods, the truncated receiver operating characteristic (ROC) has been widely used (Gribskov and Robinson 1996; Schaffer et al. 2001). A ROCn statistic was calculated as the sum of the number of true positives found at 1, 2, 3, . . . , n false positive levels (t_i) divided by the overall number of true positives (T): $\mathrm{ROC}_n = \left(\sum_{i=1}^{n} t_i\right) / (nT)$. Here, the total number of true positives (T) was calculated as the total number of annotated functionally important sites in a given domain family, whereas the total number of false positives was equal to the difference between the total number of sites in the alignment and the number of functional sites annotated for a family. Knowing the number of true positives detected and overall number of true positives, it is possible to calculate the fraction of true positives detected and, correspondingly, the fraction of false positives detected, and plot them in the order of decreasing score threshold (see Fig. 2). The false positive cutoff “n” was set to 30, which corresponds approximately to the first quarter of false positives detected. In those cases where the prediction performance was compared for different families with the different numbers of false positives, the R0.05 was used.
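The two accuracy measures can be sketched as below. The code follows the definitions in the paragraph above, with a few assumptions that the text does not settle: ties in score are broken by input order, a family with fewer than n nonfunctional sites has the missing false-positive levels counted as having found all true positives, and the R0.05 cutoff allows the integer part of 5% of the nonfunctional sites as false positives. The function names and the toy data are hypothetical.

```python
def _ranked(scores, is_functional):
    """Sites sorted by decreasing score (stable for ties)."""
    return sorted(zip(scores, is_functional), key=lambda p: p[0], reverse=True)

def roc_n(scores, is_functional, n=30):
    """Truncated ROC: sum of true positives seen at each of the first n
    false positives, divided by n * T."""
    total_true = sum(is_functional)
    if total_true == 0:
        raise ValueError("no annotated functional sites")
    true_seen = false_seen = area = 0
    for _, functional in _ranked(scores, is_functional):
        if functional:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break
    # Assumption: pad missing false-positive levels with the final count.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)

def recognition_rate(scores, is_functional, fp_rate=0.05):
    """Fraction of functional sites recovered at a given false positive rate."""
    total_true = sum(is_functional)
    total_false = len(is_functional) - total_true
    allowed_false = int(fp_rate * total_false)
    true_seen = false_seen = 0
    for _, functional in _ranked(scores, is_functional):
        if functional:
            true_seen += 1
        elif false_seen == allowed_false:
            break
        else:
            false_seen += 1
    return true_seen / total_true

# Toy family: six sites, three annotated as functional.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
functional = [True, False, True, False, False, True]
roc2 = roc_n(scores, functional, n=2)
rate = recognition_rate(scores, functional, fp_rate=0.5)
print(roc2, rate)
```

Because the recognition rate is normalized by the fraction of nonfunctional sites rather than by a fixed count, it stays comparable across families of different length, which is the reason given above for preferring R0.05 in cross-family comparisons.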

Acknowledgments

We thank John Spouge, Ben Shoemaker, and Michael Galperin for helpful suggestions, and the NIH Intramural Research Program for support.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Notes

Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03465504.

References

  • Aloy, P., Querol, E., Aviles, F.X., and Sternberg, M.J. 2001. Automated structure-based prediction of functional sites in proteins: Applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J. Mol. Biol. 311 395–408.
  • Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25 3389–3402.
  • Andrade, M.A., Casari, G., Sander, C., and Valencia, A. 1997. Classification of protein families and detection of the determinant residues with an improved self-organizing map. Biol. Cybern. 76 441–450.
  • Bartlett, G.J., Porter, C.T., Borkakoti, N., and Thornton, J.M. 2002. Analysis of catalytic residues in enzyme active sites. J. Mol. Biol. 324 105–121.
  • Casari, G., Sander, C., and Valencia, A. 1995. A method to predict functional residues in proteins. Nat. Struct. Biol. 2 171–178.
  • Chambers, J.M. 1998. Programming with data. A guide to the S language. Springer Verlag, New York.
  • del Sol Mesa, A., Pazos, F., and Valencia, A. 2003. Automatic methods for predicting functionally important residues. J. Mol. Biol. 326 1289–1302.
  • Devos, D. and Valencia, A. 2000. Practical limits of function prediction. Proteins 41 98–107.
  • Elcock, A.H. 2001. Prediction of functionally important residues based solely on the computed energetics of protein structure. J. Mol. Biol. 312 885–896.
  • Felsenstein, J. 1989. PHYLIP—Phylogeny inference package. Cladistics 5 164–166.
  • Gribskov, M. and Robinson, N.L. 1996. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. 20 25–33.
  • Hannenhalli, S.S. and Russell, R.B. 2000. Analysis and prediction of functional sub-types from protein sequence alignments. J. Mol. Biol. 303 61–76.
  • Hegyi, H. and Gerstein, M. 1999. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. J. Mol. Biol. 288 147–164.
  • Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8 275–282.
  • Jones, S. and Thornton, J.M. 1997. Prediction of protein–protein interaction sites using patch analysis. J. Mol. Biol. 272 133–143.
  • Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22 2577–2637.
  • Landgraf, R., Xenarios, I., and Eisenberg, D. 2001. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J. Mol. Biol. 307 1487–1502.
  • Li, L., Shakhnovich, E.I., and Mirny, L.A. 2003. Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases. Proc. Natl. Acad. Sci. 100 4463–4468.
  • Lichtarge, O. and Sowa, M.E. 2002. Evolutionary predictions of binding surfaces and interactions. Curr. Opin. Struct. Biol. 12 21–27.
  • Lichtarge, O., Bourne, H.R., and Cohen, F.E. 1996. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 257 342–358.
  • Luscombe, N.M. and Thornton, J.M. 2002. Protein–DNA interactions: Amino acid conservation and the effects of mutations on binding specificity. J. Mol. Biol. 320 991–1009.
  • Madabushi, S., Yao, H., Marsh, M., Kristensen, D.M., Philippi, A., Sowa, M.E., and Lichtarge, O. 2002. Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J. Mol. Biol. 316 139–154.
  • Marchler-Bauer, A., Panchenko, A., Shoemaker, B., Thiessen, P., Geer, L., and Bryant, S. 2002. CDD: A database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 30 281–283.
  • Miller, S., Janin, J., Lesk, A.M., and Chothia, C. 1987. Interior and surface of monomeric proteins. J. Mol. Biol. 196 641–656.
  • Nooren, I.M. and Thornton, J.M. 2003. Structural characterisation and functional significance of transient protein–protein interactions. J. Mol. Biol. 325 991–1018.
  • Panchenko, A.R. and Bryant, S.H. 2002. A comparison of position-specific score matrices based on sequence and structure alignments. Protein Sci. 11 361–370.
  • Saitou, N. and Nei, M. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4 406–425.
  • Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V., and Altschul, S.F. 2001. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29 2994–3005.
  • Sjolander, K. 1998. Phylogenetic inference in protein superfamilies: Analysis of SH2 domains. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6 165–174.
  • Teichmann, S.A., Murzin, A.G., and Chothia, C. 2001. Determination of protein function, evolution, and interactions by structural genomics. Curr. Opin. Struct. Biol. 11 354–363.
  • Todd, A.E., Orengo, C.A., and Thornton, J.M. 2001. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307 1113–1143.
  • ———. 2002. Plasticity of enzyme active sites. Trends Biochem. Sci. 27 419–426.
  • Tsai, C.J., Lin, S.L., Wolfson, H.J., and Nussinov, R. 1997. Studies of protein–protein interfaces: A statistical analysis of the hydrophobic effect. Protein Sci. 6 53–64.
  • Yang, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13 555–556.

Articles from Protein Science : A Publication of the Protein Society are provided here courtesy of The Protein Society