NIH Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
J Mol Biol. Author manuscript; available in PMC Dec 18, 2008.
Published in final edited form as:
PMCID: PMC2605514

Functional specificity lies within the properties and evolutionary changes of amino acids


The rapid increase in the amount of protein sequence data has created a need for automated identification of sites that determine functional specificity among related subfamilies of proteins. A significant fraction of subfamily specific sites are only marginally conserved, which makes it extremely challenging to detect those amino acid changes that lead to functional diversification. To address this critical problem we developed a method named SPEER (Specificity prediction using amino acids’ Properties, Entropy and Evolution Rate) to distinguish specificity determining sites from others. SPEER encodes the conservation patterns of amino acid types using their physico-chemical properties and the heterogeneity of evolutionary changes between and within the subfamilies. To test the method, we compiled a test set containing thirteen protein families with known specificity determining sites. Extensive benchmarking by comparing the performance of SPEER with other specificity site prediction algorithms has shown that it performs better in predicting several categories of subfamily specific sites.

Keywords: Functional divergence, subfamily specificity, physico-chemical properties, combined relative entropy, evolutionary rate


According to the neutral theory of molecular evolution, the majority of mutations are selectively neutral at the molecular level and do not affect the fitness of the organism1. As a consequence, many protein sites undergo random amino acid changes which are apparently not functional and are not conserved in evolution. Other sites are under more stringent evolutionary constraints that are reflected in the more prominent conservation of sequence and structural properties. It has been argued that changes in the conservation or evolutionary rate at a particular site reflect functional divergence after gene duplication2, 3. Indeed, after duplication of a gene, one copy evolves under relaxed evolutionary constraints which allow it to accumulate changes and develop new functions and specificities3, 4. Such mechanisms of functional diversification have recently been studied in proteins with promiscuous functions5–7, and two types of functional divergence have been distinguished8. Type I functional divergence is the result of a change in evolutionary rate where the site is conserved in one subfamily and variable in another. Type II divergence is a consequence of the rate change where purifying selection causes similar levels of conservation of different amino acid types for different protein subfamilies.

Various site specific conservation scores have been offered to distinguish conserved functionally important sites from the background of neutral changes9. Some of them are based on combinatorics and information theory, including different variations of Shannon entropy and frequency scores10–14. Others take into account amino acid stereochemical properties15–19 and amino acid substitution matrices20, 21. Since there is heterogeneity in evolutionary rates between sites, models which account for the difference in rates and amino acid substitution probabilities among different sites can be very valuable as well22, 23. It has also been shown that prediction of functional sites and site specific rate inference can be improved considerably if phylogenetic trees and evolutionary models are considered24–31. Other methods attempt to identify functional sites based not only on the sequence conservation but also on their location in the 3D structure26, 32–38.

Several computational methods have been developed which are exclusively designed to predict specificity determinants. Earlier algorithms applied principal component analysis to a vector representation of protein sequences39 or self-organizing maps to retrieve sequence patterns characteristic of subfamilies40. The evolutionary trace method, for example, identified invariant specific residues by partitioning the phylogenetic tree into subgroups of similar sequences and its later versions estimate the statistical significance of the predictions27, 41. Some more recent methods use multiple sequence alignments and various conservation scores like relative entropy, mutual entropy or “sequence harmony” to predict subfamily specific sites42–46. The majority of specificity determining methods require pre-defined grouping into subfamilies while several of them overcome this limitation by simultaneous identification of optimal groups and conserved positions47, 48. In the first approach the likelihood score is calculated for each position using the phylogenetic tree and a shuffling procedure47. The second approach uses a Bayesian-based model for identification of specificity determinants, and in this case the Bayes factors allow one to estimate the uncertainty level of the solution48.

It is extremely difficult to detect amino acid changes which lead to functional divergence. It is indeed much easier to distinguish globally conserved sites from the overall background rather than differentiate between the two types of conservation in various subfamilies. The reason is that specificity is determined by subtle changes in residue stereochemistry and the residue conservation score should be tuned to detect these changes. Moreover, in many cases sites responsible for specificity are located on flexible or disordered loops that are difficult to characterize5. Finally, experiments on specificity determinants are difficult and compiling a comprehensive dataset for testing these prediction methods is a major task.

Indeed, despite all efforts at predicting subfamily specific sites, accuracy remains very limited and some methods are tuned to predict only type I functional sites while others are biased toward the type II functional sites. In reality it is almost impossible to judge their performance using a few test families, which is the case for most of the studies. In our work we compiled a more comprehensive test set which consisted of 13 protein families with the pre-determined specificity sites. Using this test set we analyzed the site attributes which can distinguish between different subfamilies of the same family alignment. We developed a method named SPEER (Specificity prediction using amino acids’ Properties, Entropy and Evolution Rate) that encodes the specific conservation pattern of amino acid types together with their physico-chemical properties and the evolutionary rates between the subfamilies. We have also undertaken by far the most extensive benchmarking analysis in this field where the SPEER method has been compared to other available specificity site prediction methods. Comparison results suggest better performance of SPEER with respect to other methods. The prediction sensitivity provided by our combinatorial approach is good (close to 70% at 15% error rate) and our findings are encouraging for future investigations.

Results and Discussion

Characterization of subfamily specific sites

Subfamily specific sites (110 sites altogether) collected from thirteen families are categorized into three major classes: Type I, Type II and marginally conserved (MC) sites (Figure 1). As can be seen in this figure, about half of the subfamily specific sites are only marginally conserved, which reflects the lack of regularity in their conservation pattern and illustrates the difficulty of identifying them with prediction methods. The other half comprises Type I and Type II sites; these two types of conservation are shown to occur more frequently among subfamily specific sites than among non-subfamily specific sites.

Figure 1
Distribution of different categories of subfamily specific sites. Percentage of TypeI, TypeII, marginally conserved (MC) and absolutely conserved (AC) sites are shown in known subfamily specific sites (Sub site; white bar) and non-subfamily specific sites ...

We developed a scoring function (SPEER score) that represents a linear combination of Euclidean distances (ED score) based on amino acids’ physico-chemical properties, evolutionary rate (ER) and combined relative entropy (CRE). All three terms account for the variability of sites within the subfamilies in terms of their physico-chemical properties, evolutionary rates and amino acid types. Figure 2 shows the distribution of the three components of our combined scoring scheme (ED, CRE and ER scores) together with the combined SPEER score, calculated for subfamily specific sites and all other sites in the alignments. Although not every component score discriminates well between subfamily determinants and other sites on its own, the combined score clearly has the power to discriminate between these two site populations, which suggests that the proposed scoring schemes are complementary. Indeed, the correlation matrix calculated for the different scoring terms shows that the correlation coefficients are low (Table SM1 in Supplementary materials).

Figure 2
Distribution of three component scores i.e. ED score (a), CRE score (b) and ER (c) along with the combined SPEER score (d) are shown for subfamily specific sites (Sub site; white bar) and non-subfamily specific sites (Non sub site; black bar). X-axes ...

Prediction of subfamily specific sites

We have used multiple alignments of thirteen protein families to predict subfamily specific sites. The combined SPEER score was calculated for each gapless column of the alignment where no amino acid type was represented more than 80% of the time (see Methods). The prediction sensitivities at 1, 5 and 15% error rate together with the ROC statistics and their standard deviations are given in Table 1. Prediction sensitivities (at 1 and 15% error rate) for individual families are also provided in Table 2. As can be seen from these tables and the overall ROC curve (Figure 3), for the majority of families (62%, 8 out of 13) the SPEER method outperforms other methods such as SDP-pred44, SPEL47 and SH45. The difference in prediction performance between SPEER and other methods is also statistically significant as suggested by ROC500, ROCtotal and the Wilcoxon signed-rank test p-values (p-value <0.004). For 3 out of 13 families (cd00120, LacI and GST) other methods yield better predictions at certain error rates. SPEL and SDP-pred, overall, yield similar performance, although SPEL seems to show somewhat higher sensitivities at low error rates. The SH algorithm cannot be compared with other methods across all the families as it cannot make predictions for families with more than two subfamilies. It should be mentioned that this comparison does not take into account certain strong points of the other methods which are not directly associated with the problem being solved in the current study. For example, SPEL can simultaneously define subfamilies and predict specificity determinants, and SDP-pred takes full advantage of orthologous-paralogous groupings in defining the subfamily specific sites. We further examined the performance of the methods in predicting different types of subfamily specific sites (Figure 4). It is clear from the figure that SPEER performs well for all three categories including the most difficult Type I and marginally conserved (MC) sites, which pose a significant challenge for computational identification of subfamily determinants. SPEL also makes very good predictions for the MC category. We have also shown that, overall, the prediction accuracy depends on the level of conservation of physico-chemical properties within the subfamilies (Pearson correlation coefficient is 0.61) as well as between them (Pearson correlation coefficient is −0.66) (Figure SM1).

Figure 3
Comparison of prediction performances. ROC-curves for prediction of subfamily specific sites are shown for SPEER, SDP-pred44 and SPEL47 methods.
Figure 4
Comparison of prediction performances for different categories of subfamily specific sites. Percentage of sites (Y-axes) predicted by SPEER, SDP-pred and SPEL at 1, 5 and 15% error rates are shown for TypeI, TypeII and marginally conserved (MC) sites. ...
Table 1
Comparison of overall prediction sensitivities.
Table 2
Comparison of prediction sensitivities for individual families.

Examples of successful predictions

We illustrate the performance of the SPEER method on different examples (Figure 5). Figure 5a shows a representative structure of dihydropteroate synthase (1AJ0) taken from the pterin binding enzymes domain family (cd00423). This family includes two subfamilies, dihydropteroate synthase (DHPS) and cobalamin-dependent methyltransferases. DHPS catalyzes the condensation of p-aminobenzoic acid (pABA) in the biosynthesis of folate, which is an essential cofactor in both nucleic acid and protein biosynthesis. DHPS represents a very important subfamily as it can be targeted by sulfonamide drugs, which are substrate analogs of pABA. Both DHPS and cobalamin-dependent methyltransferases bind to pterin substrates while sulfonamide (pABA) acts as a specific ligand to DHPS. SPEER and SDP-pred methods successfully identified all four sites (Lys220, Arg221, Arg255 and His257; marked as space-filling model) for pABA/sulfonamide binding in DHPS49–51. In addition, SPEER was able to predict three further sites (Ile20, Gly187 and Gly189) that could be important in specific interaction and reside within 5 Å of the specific pABA ligand.

Figure 5
Examples of successful predictions.

Another example shows a representative structure of a novel NTPase from M. jannaschii (2MJP, chain A) belonging to the Maf_Ham1 domain family (cd00985, Figure 5b). Ham1-related protein is a novel NTPase that has been shown to hydrolyze nonstandard nucleotides, such as hypoxanthine/xanthine NTP. The Maf subfamily includes nucleotide binding proteins which have been implicated in inhibition of septum formation in eukaryotes, bacteria and archaea. Despite the fact that proteins from both subfamilies share structural similarities in the nucleotide binding cleft, the locations and nature of conserved residues differ, which could lead to adoption of different functions and ligand binding properties. Three such conserved residues (Ser9/Thr15, Ser11/Asn17 and Arg14/Lys20 in Maf and Ham1, respectively) could be important for binding to different nucleotides and therefore can be regarded as subfamily specific52, 53. SPEER successfully identified all three binding sites (shown in space filling model in Figure 5b) at 15% false positive rate. Additionally, we predict seven extra potential specificity determinants which have high scores and reside within 5 Å of the ligand (Glu23, Glu72, Gly75, Ser89, Phe149, His177 and Arg178) (Figure 5b).

The third example is the G protein α subunit (Gα, Gprotein), which controls important cellular signaling processes involving G protein coupled receptors through a regulated cycle of GTPase activity. Gα subunits can be divided into four main subtypes where each of the subtypes performs different biological functions through specific interactions with the effectors [e.g. cyclic GMP phosphodiesterase (PDE)] and regulators [e.g. Regulator of G protein signaling (RGS) domains]47. Figure 5c shows the predicted subfamily specific sites mapped onto the 3D structure representative (1fqj, cartoon representation) bound to PDE (purple ribbon) and RGS (black ribbon). Potential specificity determinant sites that reside within 5 Å or 10 Å of the effector and/or regulator molecules are marked in black and grey, respectively.

The LacI/PurR family is a large family of bacterial transcription factors (15 subfamilies) that are regulated by small molecules, such as sugars and nucleotides. In addition to available experimental and structural information, the LacI/PurR family has been widely used by researchers for prediction of subfamily specific sites. Specific sites predicted by SPEER are mapped onto a representative 3D structure (PDB code: 1WET) from the PurR subfamily complexed with the guanine (effector) and the ligand, DNA (Figure 5d). All the predicted sites are color coded based on their SPEER prediction score (see color scale in Figure 5) and the known subfamily specific sites are marked in space filling model.


The problem of identifying specificity determinants is both challenging and captivating as its solution would point to the evolutionary and physico-chemical mechanisms producing a wide variety of specific functional activities based on the same fold and overall function of a protein family. Since proteins with similar specificities use similar amino acids, specificity prediction methods look for the specific distribution patterns (that could be directly related to the biochemical function or be characteristic of a given subfamily) of amino acids across the subgroups or with respect to the overall family and try to identify those sites where such a subfamily specific distribution is observed.

In this paper we investigated the factors which can distinguish between different subfamilies of the same family. First we found that it is important to encode the conservation of amino acids’ properties within each subfamily and differences between subfamilies (ED term). Second, we showed that the conservation of subfamily specific features can be successfully described in terms of amino acid substitution rates (ER term) which are calculated from the phylogenetic trees and reflect the evolutionary history of family divergence. Finally, we noticed that amino acid properties can be very similar between different subfamilies at specificity determining sites, although their amino acid usage can vary. Consequently, the difference in amino acid usage between and within subfamilies should also be encoded explicitly (CRE term).

We note that variations of many measures employed in our cost function have been used previously8, 23, 28 for characterization and prediction of specificity determinants for selected families. Here we present a more general approach tested on a benchmark encompassing a diverse set of protein families, which showed that the simple combination of seemingly redundant but in fact complementary terms performs well in prediction. Comparison with other sensitive methods of specificity prediction showed that although SPEER in many cases yields better results, the methods’ sensitivities are still moderate. On the other hand, many examples of successful predictions have been found by our method. Considering the difficulty of the problem and the current state of the field, the prediction sensitivity provided by our combinatorial approach is acceptable and encouraging for future investigations. The present study therefore provides a platform for future efforts to understand protein subfamily specificity determination.

Materials and methods

Benchmark for prediction validation

We have performed an extensive analysis to collect reliable alignments of protein families, for which experimental evidence is available on subfamily specific sites. Our benchmark includes seven families that have been used for validation of previously published prediction methods and six families from the version 2.10 of the Conserved Domain Database (CDD54). Subfamily specific sites for six CDD families were assigned based on an extensive literature search (see Supplementary materials for details). A complete list of the test set families together with their subfamily specific site locations is provided in Table 3. Highly conserved positions within the overall family alignment (where any amino acid type was represented more than 80% of the time) were not regarded as subfamily specific and excluded from the analysis. The resulting test set covers a wide range of families with different functions, types of functional sites, number of subfamilies and sequence diversity (Table SM2). To our knowledge this is the most comprehensive benchmark used so far for validation of subfamily specific site prediction. These alignments and subfamily specific sites information can be obtained through email request or can be downloaded via ftp (ftp://ftp.ncbi.nih.gov/pub/chakraba/SPEER).
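The column filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_scorable_column`, the set of gap characters and the representation of a column as a string of one-letter residue codes are hypothetical choices and are not taken from the SPEER implementation.

```python
from collections import Counter

# Characters treated as gaps; this set is an assumption for the sketch.
GAP_CHARS = {"-", "."}


def is_scorable_column(column, max_fraction=0.80):
    """Return True if the column has no gaps and no residue type
    is represented more than `max_fraction` of the time."""
    if not column or any(res in GAP_CHARS for res in column):
        return False
    most_common_count = Counter(column).most_common(1)[0][1]
    return most_common_count / len(column) <= max_fraction
```

For example, a ten-sequence column with nine alanines would be skipped as too conserved, while a column with exactly 80% alanine would still be scored because the threshold is "more than 80%".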

Table 3
Description of the dataset.

All specificity determining sites were categorized into three groups: Type I, Type II and marginally conserved (MC). Type I functional sites were defined as those conserved in one subfamily and variable in another, while Type II sites were defined as those where different types of amino acids were conserved across different subfamilies. In this study we considered a site to be conserved in a subfamily if any amino acid type was represented more than 75% of the time. Sites that failed to satisfy the above criteria were marked as MC (the site is not conserved in any of the subfamilies). For families with more than two subgroups, sites were categorized into different types based on the category assigned to the majority of subfamily pairs.
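The categorization rules above translate fairly directly into code. The following Python sketch is an illustration, not the authors' implementation: the function names are hypothetical, each subfamily is assumed to be given as a string of residues from one gapless column, the label "same" for two subfamilies conserved to the same residue is an assumption (the text does not cover that case), and ties in the majority vote are broken by first occurrence.

```python
from collections import Counter
from itertools import combinations


def conserved_residue(residues, threshold=0.75):
    """Return the residue type seen in more than `threshold` of the
    subfamily, or None if the subfamily is not conserved at this site."""
    residue, count = Counter(residues).most_common(1)[0]
    return residue if count / len(residues) > threshold else None


def classify_pair(sub_a, sub_b, threshold=0.75):
    """Label one pair of subfamilies at a site."""
    a = conserved_residue(sub_a, threshold)
    b = conserved_residue(sub_b, threshold)
    if a is None and b is None:
        return "MC"        # neither subfamily is conserved
    if a is None or b is None:
        return "TypeI"     # conserved in one, variable in the other
    if a != b:
        return "TypeII"    # different residues conserved in each
    return "same"          # assumption: case not described in the text


def classify_site(subfamilies, threshold=0.75):
    """Assign the label given to the majority of subfamily pairs."""
    labels = [classify_pair(a, b, threshold)
              for a, b in combinations(subfamilies, 2)]
    return Counter(labels).most_common(1)[0][0]
```

With two subfamilies the majority vote reduces to the single pair label, so `classify_site(["AAAA", "GGGG"])` yields a Type II site and `classify_site(["AAAA", "AGTC"])` a Type I site.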

Cost function to distinguish subfamily specific sites

In our approach we devise a cost function which represents a linear combination of Euclidean distances based on amino acids’ physico-chemical properties, evolution rate and combined relative entropy. All three terms account for the variability of sites within the subfamilies in terms of their physico-chemical properties, evolutionary rates and amino acid types. The first and the third terms also approximate the variability of physico-chemical properties and amino acid types between the subfamilies.

Euclidean distance based on amino acids’ physico-chemical properties (ED)

Comparison of amino acids’ physico-chemical properties can be very useful to characterize subtle variations in stereochemistry of subfamily specific sites. Matrices/indices containing quantitative values for amino acid physico-chemical properties (such as hydrophobicity, polarity, charge etc) scaled between 0 and 1 were obtained from the UMBC AAIndex database55 (Table SM3). To quantify the variability between different amino acid properties within or between subfamilies, we employed different distance metrics that to various extents encoded the distance between subfamilies and conservation of properties within them (Figure SM1). We found that the ED-score performs best among the various metrics and is calculated as shown below. To quantify the difference between any two sequences i and j at a given site we use a weighted Euclidean distance:


Here xi and xj are the normalized values of the physico-chemical properties of amino acids at a given site from sequences i and j; Nm is the number of different amino acid property indices; wi and wj are the sequence weights of corresponding sequences56. The average variability of properties within the subfamilies in a given column referenced to the background variability of the whole column is estimated as follows:




Nps is the number of all possible pair combinations of residues within each subfamily, Nall is the overall number of residue combinations in a given column and Ns is the number of subfamilies. It should be mentioned that using sequence weights together with the reference distribution of all sites in the alignment attempts to decouple subfamily specificity from the overall phylogenetic similarity of proteins in the subfamilies. The ED score is positive and its low values correspond to the situation where amino acid properties are very well conserved within the subfamilies (low SED values) and vary in between them (large GED values). The ED score equals 0 if all residues are absolutely conserved within each subfamily but different in between. For absolutely conserved (AC) columns the ED scores become undefined and such columns are excluded from the prediction procedure. Alignment columns that contain gaps are also excluded from the prediction procedure.
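To make the ED term more concrete, here is a minimal Python sketch built only from the verbal definitions above. Several details are assumptions rather than the published formulas: the two sequence weights are multiplied into the Euclidean distance, SED is the per-subfamily mean pair distance averaged over subfamilies, GED is the mean pair distance over the whole column, and the ED score is taken as SED/GED (which matches the stated behaviour: zero for perfectly conserved but distinct subfamilies, undefined for absolutely conserved columns). The names `pair_distance`, `mean_pair_distance` and `ed_score` are hypothetical, and `props` is expected to map each residue to its list of normalized property values.

```python
import math
from itertools import combinations


def pair_distance(res_i, res_j, w_i, w_j, props):
    """Weighted Euclidean distance between the property vectors of
    two residues (weights multiplied in: an assumption)."""
    sq = sum((a - b) ** 2 for a, b in zip(props[res_i], props[res_j]))
    return w_i * w_j * math.sqrt(sq)


def mean_pair_distance(indices, column, weights, props):
    """Average pair distance over all residue pairs in `indices`."""
    pairs = list(combinations(indices, 2))
    if not pairs:
        raise ValueError("need at least two sequences to form a pair")
    total = sum(pair_distance(column[i], column[j],
                              weights[i], weights[j], props)
                for i, j in pairs)
    return total / len(pairs)


def ed_score(column, subfamily_of, weights, props):
    """Assumed form: ED = SED / GED. Returns None when the column is
    absolutely conserved (GED == 0), where the score is undefined."""
    groups = {}
    for idx, label in enumerate(subfamily_of):
        groups.setdefault(label, []).append(idx)
    sed = sum(mean_pair_distance(g, column, weights, props)
              for g in groups.values()) / len(groups)
    ged = mean_pair_distance(range(len(column)), column, weights, props)
    return None if ged == 0 else sed / ged
```

As a toy usage example with made-up two-dimensional property vectors (placeholders, not AAindex values), `ed_score("AAGG", [0, 0, 1, 1], [1, 1, 1, 1], {"A": [0.0, 1.0], "G": [1.0, 0.0]})` gives 0, the case where each subfamily is perfectly conserved but the two differ.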

Evolutionary rate

Functional divergence can be inferred from the changes in the evolutionary rate at a particular site, and the evolutionary rate in turn can be estimated using probabilistic evolutionary models. A maximum likelihood (ML) approach allows one to estimate evolutionary rates taking into account the topology and branch lengths of the phylogenetic tree as well as the rate heterogeneity over different sites in a protein family. In our study we used the ML approach implemented in the Rate4Site25 program to calculate the evolutionary rate at each site separately for each subfamily and then averaged it over all subfamilies. A low average ER value indicates that the site is slowly evolving in certain subfamilies.

Combined relative entropy (CRE)

Relative entropy or Kullback–Leibler divergence is a very important concept in information theory and has been successfully implemented to distinguish the distributions of amino acid types between two different subfamilies23, 28. We calculated relative entropy for each pair of subfamilies and took an average over these values at a given site:


Here pk(x) and pm(x) are the probabilities of finding amino acid type x in the subfamilies k and m respectively, and Nsp is the number of all possible combinations of subfamily pairs. The CRE is equal to zero if all distributions pk and pm are the same, while large values of CRE correspond to large differences between the amino acid distributions of the subfamilies. The relative entropy cannot be calculated if a particular type of amino acid is absent from a subfamily; this singularity is handled by adding pseudocounts to the calculation of the probabilities p, as in Ref. 23.
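The CRE term can be illustrated with a short Python sketch. It follows the description above (relative entropy per subfamily pair, averaged over all pairs, with pseudocounts), but two details are assumptions: each pair contributes the symmetric sum of the two Kullback–Leibler divergences, and a uniform pseudocount of 1 is added to every one of the 20 standard amino acid types. The function names are hypothetical, and each subfamily is assumed to be a string of standard residues from one gapless column.

```python
import math
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_probabilities(residues, pseudocount=1.0):
    """Amino acid probabilities in one subfamily, with a uniform
    pseudocount so that no probability is zero (assumed scheme)."""
    counts = Counter(residues)
    total = len(residues) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudocount) / total
            for aa in AMINO_ACIDS}


def relative_entropy(p, q):
    """Kullback-Leibler divergence of p from q (natural logarithm)."""
    return sum(p[aa] * math.log(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def combined_relative_entropy(subfamilies, pseudocount=1.0):
    """Average over subfamily pairs of the symmetrized relative
    entropy (the symmetrization is an assumption of this sketch)."""
    probs = [residue_probabilities(s, pseudocount) for s in subfamilies]
    pairs = list(combinations(probs, 2))
    total = sum(relative_entropy(p, q) + relative_entropy(q, p)
                for p, q in pairs)
    return total / len(pairs)
```

Identical subfamily distributions give a value of zero, and the value grows as the amino acid usage of the subfamilies diverges, which is the behaviour described for the CRE in the text.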

Normalization of scores and their statistical significance

As the background conservation levels may vary substantially between different protein families, we normalize each of the three scores by subtracting the mean value and dividing by the standard deviation of the score distribution obtained for all columns in a given alignment. In this pilot study we chose to weight the three component terms equally rather than assign arbitrary weights to them. Determination of differential weights for the ED, CRE and ER scores may require a much more detailed investigation and a larger dataset to deal with family-specific biases for individual terms. The linear combination of the three normalized scores is used to predict the specificity determinants. To calculate the statistical significance of our predictions we shuffled a given column of the alignment 100 times disregarding subfamily annotations (the procedure is similar to the one described in Ref. 43). Assuming this distribution to be normal, we estimate the probability that a site without specific functional constraints would have a score equal to or higher than the observed score (P-value). The P-value assigns statistical confidence to blind predictions, but ranking the predictions by P-value did not considerably improve the performance of our method (Table SM4). We think this is partially because the cost function employed in the study already uses the reference distribution of nonspecific sites (Eq 4).
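The normalization, the equal-weight combination and the shuffling test can be sketched as follows. This is a hedged illustration and not the SPEER code. The sign convention in `combined_scores` (CRE counted positively, ED and ER negatively, so that a higher combined value means a more specific site) is an assumption chosen to agree with the "equal to or higher than the observed score" wording; the use of the population standard deviation, the fixed random seed, and the caller-supplied `score_fn(column, subfamily_of)` are likewise assumptions, and all function names are hypothetical.

```python
import math
import random
import statistics


def z_scores(values):
    """Subtract the mean and divide by the standard deviation of the
    scores of all columns in one alignment (fails if all are equal)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def combined_scores(ed, cre, er):
    """Equal-weight linear combination of the normalized terms.
    Sign convention is an assumption: higher = more specific."""
    return [c - e - r
            for e, c, r in zip(z_scores(ed), z_scores(cre), z_scores(er))]


def shuffle_p_value(column, subfamily_of, score_fn, n_shuffles=100, seed=0):
    """Shuffle the residues of a column across sequences (ignoring
    subfamily labels), fit a normal distribution to the shuffled
    scores, and return P(score >= observed)."""
    rng = random.Random(seed)
    observed = score_fn(column, subfamily_of)
    residues = list(column)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        null.append(score_fn(residues, subfamily_of))
    mean = statistics.fmean(null)
    sd = statistics.pstdev(null)
    if sd == 0:
        return 1.0 if observed <= mean else 0.0
    z = (observed - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal
```

In practice the ER term cannot be recomputed cheaply for each shuffle because it depends on a phylogenetic rate estimate, so a `score_fn` built from the ED and CRE terms alone would be one pragmatic choice for experimenting with this procedure.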

Evaluation of prediction accuracy

We tested the performance of our method using the alignments of thirteen families (Table 3) by calculating the Receiver Operating Characteristics (ROC) curves and ROC statistics. For a given alignment, we estimated the sensitivity and error rate based on the number of true positives (known specificity sites) and false positives (non-specificity sites) found at each score cutoff. Sensitivity was defined as the fraction of true positives found at each score threshold over the overall number of true positives in the family alignment, and error rate was estimated as the fraction of false positives found at the same score threshold over all false positives in the alignment (the difference between the total number of sites in the alignment and the number of subfamily specific sites). True positives were defined as those sites annotated as being subfamily specific based on literature and previous studies. We have evaluated the method’s performance by estimating the sensitivity at 1, 5 and 15% false positive (error) rates and by calculating ROC statistics and their standard deviations57. A ROCn statistic was calculated as the sum of the numbers of true positives found at the 1, 2, 3, …, n false positive levels (ti) divided by n times the overall number of true positives (T): ROCn = (∑i=1,…,n ti)/(nT). To compare sets of ROC statistics produced by different methods we used the Wilcoxon signed rank test and calculated p-values under the null hypothesis that the medians of two distributions are equal58. We compared the performance of our method with three other independent methods which predict specificity determinants: SDP-pred44 (http://math.genebee.msu.ru/~psn/), the Sequence-Harmony server45 (www.ibi.vu.nl/programs/seqharmwww) and the SPEL program from Pei et al., 200647. We also analyzed and compared the performance of each method in predicting different types of subfamily specific sites (e.g. Type I, Type II and MC).
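The two evaluation measures defined above can be written compactly in Python. The sketch below is illustrative only: the function names are hypothetical, sites are ranked by descending score (assuming a higher score means a more confident prediction), tied scores are not treated specially, and when an alignment has fewer than n false positives the remaining levels are filled with the final true positive count, which is an assumption of this sketch.

```python
import math


def _ranked_labels(scores, is_specific):
    """Labels ordered from the highest-scoring site to the lowest."""
    order = sorted(zip(scores, is_specific),
                   key=lambda pair: pair[0], reverse=True)
    return [label for _, label in order]


def sensitivity_at_error_rate(scores, is_specific, error_rate):
    """Fraction of known specific sites recovered while accepting at
    most `error_rate` of the non-specific sites."""
    labels = _ranked_labels(scores, is_specific)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    allowed_fp = math.floor(error_rate * n_neg + 1e-9)
    tp = fp = 0
    for positive in labels:
        if positive:
            tp += 1
        else:
            fp += 1
            if fp > allowed_fp:
                break
    return tp / n_pos


def roc_n(scores, is_specific, n):
    """ROCn = (sum over i = 1..n of t_i) / (n * T), where t_i is the
    number of true positives ranked above the i-th false positive."""
    labels = _ranked_labels(scores, is_specific)
    total_pos = sum(labels)
    tp = fp = tp_sum = 0
    for positive in labels:
        if positive:
            tp += 1
        else:
            fp += 1
            tp_sum += tp
            if fp == n:
                break
    tp_sum += (n - fp) * tp  # fewer than n false positives (assumption)
    return tp_sum / (n * total_pos)
```

For a ranking of five sites with labels (specific, not, specific, not, not), `roc_n` with n = 2 sums t1 = 1 and t2 = 2 and divides by 2 × 2, giving 0.75.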

Supplementary Material



We thank Michael Galperin and Oishee Chakrabarti for helpful discussions and Thomas Madej for critically reading the manuscript. This work was supported by the Intramural Research Program of the National Library of Medicine at National Institutes of Health/DHHS.


Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


1. Kimura M. The neutral theory of molecular evolution. Cambridge: Cambridge University Press; 1983.
2. Gu X. Statistical methods for testing functional divergence after gene duplication. Mol Biol Evol. 1999;16:1664–1674.
3. Ohno S. Evolution by gene duplication. Berlin: Springer-Verlag; 1970.
4. Doolittle RF. Similar amino acid sequences: chance or common ancestry? Science. 1981;214:149–159.
5. Aharoni A, Gaidukov L, Khersonsky O, Gould SMcQ, Roodveldt C, Tawfik DS. The 'evolvability' of promiscuous protein functions. Nat Genet. 2005;37:73–76.
6. Glasner ME, Gerlt JA, Babbitt PC. Evolution of enzyme superfamilies. Curr Opin Chem Biol. 2006;10:492–497.
7. Yoshikuni Y, Ferrin TE, Keasling JD. Designed divergent evolution of enzyme function. Nature. 2006;440:1078–1082.
8. Gu X. Maximum-likelihood approach for gene family evolution under functional divergence. Mol Biol Evol. 2001;18:453–464.
9. Valdar WS. Scoring residue conservation. Proteins. 2002;48:227–241.
10. Lockless SW, Ranganathan R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999;286:295–299.
11. Cover TM, Thomas JA. Elements of information theory. New York: Wiley; 1991. (Wiley Series in Telecommunications).
12. Sander C, Schneider R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins. 1991;9:56–68.
13. Schneider TD. Information content of individual genetic sequences. J Theor Biol. 1997;189:427–441.
14. Baczkowski AJ, Joanes DN, Shamia GM. Range of validity of alpha and beta for a generalized diversity index H (alpha, beta) due to Good. Math Biosci. 1998;148:115–128.
15. Taylor WR. The classification of amino acid conservation. J Theor Biol. 1986;119:205–218.
16. Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJ. Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J Mol Biol. 1987;195:957–961.
17. Livingstone CD, Barton GJ. Identification of functional residues and secondary structure from protein multiple sequence alignment. Methods Enzymol. 1996;266:497–512.
18. Williamson RM. Information theory analysis of the relationship between primary sequence structure and ligand recognition among a class of facilitated transporters. J Theor Biol. 1995;174:179–188.
19. Mirny LA, Shakhnovich EI. Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol. 1999;291:177–196.
20. Tatusov RL, Altschul SF, Koonin EV. Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc Natl Acad Sci U S A. 1994;91:12091–12095.
21. Brooks DJ, Fresco JR, Lesk AM, Singh M. Evolution of amino acid frequencies in proteins over deep time: inferred order of introduction of amino acids into the genetic code. Mol Biol Evol. 2002;19:1645–1655.
22. Soyer OS, Goldstein RA. Predicting functional sites in proteins: site-specific evolutionary models and their application to neurotransmitter transporters. J Mol Biol. 2004;339:227–242.
23. Abhiman S, Daub CO, Sonnhammer EL. Prediction of function divergence in protein families using the substitution rate variation parameter alpha. Mol Biol Evol. 2006;23:1406–1413.
24. Mayrose I, Graur D, Ben-Tal N, Pupko T. Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Mol Biol Evol. 2004;21:1781–1791.
25. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 2002;18 Suppl 1:S71–S77.
26. Panchenko AR, Kondrashov F, Bryant SH. Prediction of functional sites by analysis of sequence and structure conservation. Protein Sci. 2004;13:884–892.
27. Lichtarge O, Bourne HR, Cohen FE. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996;257:342–358.
28. Sjolander K. Phylogenetic inference in protein superfamilies: analysis of SH2 domains. Proc Int Conf Intell Syst Mol Biol. 1998;6:165–174.
29. Aloy P, Querol E, Aviles FX, Sternberg MJ. Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J Mol Biol. 2001;311:395–408.
30. del Sol Mesa A, Pazos F, Valencia A. Automatic methods for predicting functionally important residues. J Mol Biol. 2003;326:1289–1302.
31. Krishnamurthy N, Brown D, Sjolander K. FlowerPower: clustering proteins into domain architecture classes for phylogenomic inference of protein function. BMC Evol Biol. 2007;7 Suppl 1:S12–S22.
32. Jones S, Thornton JM. Analysis of protein-protein interaction sites using surface patches. J Mol Biol. 1997;272:121–132.
33. Tsai CJ, Lin SL, Wolfson HJ, Nussinov R. Studies of protein-protein interfaces: a statistical analysis of the hydrophobic effect. Protein Sci. 1997;6:53–64.
34. Elcock AH. Prediction of functionally important residues based solely on the computed energetics of protein structure. J Mol Biol. 2001;312:885–896.
35. Bartlett GJ, Porter CT, Borkakoti N, Thornton JM. Analysis of catalytic residues in enzyme active sites. J Mol Biol. 2002;324:105–121.
36. Landgraf R, Xenarios I, Eisenberg D. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol. 2001;307:1487–1502.
37. Amitai G, Shemesh A, Sitbon E, Shklar M, Netanely D, Venger I, Pietrokovski S. Network analysis of protein structures identifies functional residues. J Mol Biol. 2004;344:1135–1146.
38. Rossi A, Marti-Renom MA, Sali A. Localization of binding sites in protein structures by optimization of a composite scoring function. Protein Sci. 2006;15:2366–2380.
39. Casari G, Sander C, Valencia A. A method to predict functional residues in proteins. Nat Struct Biol. 1995;2:171–178.
40. Andrade MA, Casari G, Sander C, Valencia A. Classification of protein families and detection of the determinant residues with an improved self-organizing map. Biol Cybern. 1997;76:441–450.
41. Mihalek I, Res I, Lichtarge O. Evolutionary trace report_maker: a new type of service for comparative analysis of proteins. Bioinformatics. 2006;22:1656–1657.
42. Hannenhalli SS, Russell RB. Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol. 2000;303:61–76.
43. Mirny LA, Gelfand MS. Using orthologous and paralogous proteins to identify specificity-determining residues in bacterial transcription factors. J Mol Biol. 2002;321:7–20.
44. Kalinina OV, Novichkov PS, Mironov AA, Gelfand MS, Rakhmaninova AB. SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins. Nucleic Acids Res. 2004;32:W424–W428.
45. Pirovano W, Feenstra KA, Heringa J. Sequence comparison by sequence harmony identifies subtype-specific functional sites. Nucleic Acids Res. 2006;34:6540–6548.
46. Donald JE, Shakhnovich EI. Predicting specificity-determining residues in two large eukaryotic transcription factor families. Nucleic Acids Res. 2005;33:4455–4465.
47. Pei J, Cai W, Kinch LN, Grishin NV. Prediction of functional specificity determinants from protein sequences using log-likelihood ratios. Bioinformatics. 2006;22:164–171.
48. Marttinen P, Corander J, Toronen P, Holm L. Bayesian search of functionally divergent protein subgroups and their function specific residues. Bioinformatics. 2006;22:2466–2474.
49. Achari A, Somers DO, Champness JN, Bryant PK, Rosemond J, Stammers DK. Crystal structure of the anti-bacterial sulfonamide drug target dihydropteroate synthase. Nat Struct Biol. 1997;4:490–497.
50. Smith AE, Matthews RG. Protonation state of methyltetrahydrofolate in a binary complex with cobalamin-dependent methionine synthase. Biochemistry. 2000;39:13880–13890.
51. Hampele IC, D'Arcy A, Dale GE, Kostrewa D, Nielsen J, Oefner C, Page MG, Schonfeld HJ, Stuber D, Then RL. Structure and function of the dihydropteroate synthase from Staphylococcus aureus. J Mol Biol. 1997;268:21–30.
52. Minasov G, Teplova M, Stewart GC, Koonin EV, Anderson WF, Egli M. Functional implications from crystal structures of the conserved Bacillus subtilis protein Maf with and without dUTP. Proc Natl Acad Sci U S A. 2000;97:6328–6333.
53. Hwang KY, Chung JH, Kim SH, Han YS, Cho Y. Structure-based identification of a novel NTPase from Methanococcus jannaschii. Nat Struct Biol. 1999;6:691–696.
54. Marchler-Bauer A, Anderson JB, DeWeese-Scott C, Fedorova ND, Geer LY, He S, Hurwitz DI, Jackson JD, Jacobs AR, Lanczycki CJ, et al. CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res. 2003;31:383–387.
55. Bulka B, desJardins M, Freeland SJ. An interactive visualization tool to explore the biophysical properties of amino acids and their contribution to substitution matrices. BMC Bioinformatics. 2006;7:329–338.
56. Henikoff S, Henikoff JG. Position-based sequence weights. J Mol Biol. 1994;243:574–578.
57. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001;29:2994–3005.
58. Sokal RR, Rohlf FJ. Biometry: the principles and practice of statistics in biological research. 3rd ed. New York: W.H. Freeman and Co; 1995.