# Evolutionary Plasticity of Protein Families: Coupling Between Sequence and Structure Variation

^{1}Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland

^{2}Department of Biology, Moscow State University, Moscow, Russia

## Abstract

In this work we examine how protein structural changes are coupled with sequence variation in the course of evolution of a family of homologs. The sequence–structure correlation analysis performed on 81 homologous protein families shows that the majority of them exhibit statistically significant linear correlation between the measures of sequence and structural similarity. We observed, however, that there are cases where structural variability cannot be mainly explained by sequence variation, such as protein families with a number of disulfide bonds. To understand whether structures from different families and/or folds evolve in the same manner, we compared the degrees of structural change per unit of sequence change (“the evolutionary plasticity of structure”) between those families with a significant linear correlation. Using rigorous statistical procedures we find that, with a few exceptions, evolutionary plasticity does not show a statistically significant difference between protein families. Similar sequence–structure analysis performed for protein loop regions shows that evolutionary plasticity of loop regions is greater than for the protein core.

**Keywords:** protein structural evolution, sequence variation, protein loops, sequence–structure correlation

## INTRODUCTION

A protein sequence folds into a unique, highly ordered conformation which maintains its specific function. As proteins evolve, their sequences change due to amino acid replacements, the majority of which are believed to be effectively neutral.^{1} Consequently, protein-specific function, structure, folding, and the protein–protein interaction network as a rule change gradually in the course of evolution. Indeed, the overall protein structural topology is so well preserved throughout evolution that proteins that diverged billions of years ago may still show remarkable structural resemblance and, in many cases, sequence conservation as well.^{2}

The fundamental question of whether protein structures evolve by divergence or by convergence has inspired many comparative studies of protein structures and networks of protein similarities.^{3}^{–}^{10}^{,}^{42} According to the convergent scenario, structural similarity can arise independently in two proteins because the number of possible topological arrangements is limited.^{11}^{,}^{12} Recently, graph-theoretical methods have shown that convergent models do not adequately describe the patterns of sequence and structural similarity observed in populations of real proteins.^{8}^{,}^{10} By contrast, the scale-free behavior and other important characteristic features of protein networks can be correctly reproduced using divergent models of structural evolution.^{7}^{–}^{10} In these models, new protein structures emerge, and existing structures change through the processes of duplication and subsequent divergence from a common ancestor.

The sequence and structural analysis of many commonly observed protein folds points to the dominant role of divergent mechanisms in protein structural evolution as well.^{13}^{–}^{17} It has been demonstrated, for example, that proteins from the TIM barrel, OB-fold, cupredoxin, and β-trefoil folds have common features in their topology, nature of ligands, and location of catalytic residues, which points to the plausibility of divergent scenarios for these and other protein folds comprising the protein universe. In a previous study, we likewise observed a significant linear correlation between sequence similarity and loop structural similarity for the aforementioned folds.^{18} Given that the loops do not contribute much to the protein core stability, we argued that the strong coupling between the changes in sequence and loop structure can only happen due to divergent evolution.

Chothia and Lesk first addressed the question of coupling between the structural and sequence changes in proteins, and found an exponential dependence of root-mean-square deviation on percent of sequence identity.^{2} Further studies that were performed on larger datasets of proteins showed similar results.^{5}^{,}^{19} Recently, however, it has been shown on a sample of 36 protein families that most of the structural variation in aligned regions of homologous proteins is linearly correlated with the changes in sequence which supports the “global” model of protein structure.^{20} According to this model, all residue–residue interactions, not just a few key residues, are important in determining the unique protein structure. In an attempt to solve the “fold recognition” problem and design structural models for new sequences, Koehl and Levitt performed an analysis of how structural changes between two protein folds correlate with the differences between the sequences that are compatible with these folds.^{21} They also found, on a benchmark of 12 protein families, that structural changes as measured by cRMS are linearly related to the changes in sequence.

In this article we study how protein structure changes in its conserved aligned core regions and unaligned loop regions as proteins diverge from a common ancestor. We performed a sequence–structure correlation analysis on a large number of families of homologous proteins and found a statistically significant linear correlation between measures of sequence and structural similarity for the great majority of these families. This finding allows us to address the next important question: how much sequence change a protein structure can tolerate, and whether this tolerance depends on the type of protein fold or on other sequence and structural characteristics. We call this quantity “the evolutionary plasticity of structure” (EPS), and estimate it by calculating the regression coefficients of linear sequence–structure dependencies for homologs.

## METHODS

### Test Set

Sets of homologous protein families were extracted from the CDD search database version 1.62 at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The CDD collection of protein domain alignments includes curated CDDs^{22} and preprocessed domain families imported from SMART and PFAM, 6222 protein domain families altogether.^{23} Upon import, the sequences from SMART/PFAM alignments with more than 75% identity with known structures were substituted by the most similar structure from the Protein Data Bank.^{24} Those families containing short sequence repeats and having average alignment length of less than 50 residues were excluded from the test set.

Each CDD family was decomposed into a set of pairwise structure–structure alignments. Structural alignments within CDD families were computed by the VAST algorithm,^{25} and were selected for analysis according to the following criteria: (a) the mutual overlap between the VAST alignment footprint and CDD footprint (the footprint for a given sequence was defined as a region between the first and the last residues aligned by VAST or CDD) was at least 80%; (b) X-ray resolution of both structures in a pair was better than 3.0 Å; (c) BLAST E-value calculated for VAST alignment was less than 0.01; (d) any discontinuous domain^{26} inconsistently aligned between VAST and CDD was disregarded.

In addition to the requirements imposed on structural pairs, we selected protein families based on the following criteria: (a) the protein family should contain at least 10 structurally aligned protein pairs; (b) proteins from a given family should span a wide range of sequence similarity, that is, should cover a range of at least 30% in sequence identity between the most diverged and least diverged structural pair; (c) not more than two protein family alignments from the same domain cluster were retained in the final test set; the redundancy between protein families was checked by using the procedure implemented in the CDART algorithm.^{27} Even though these protein families can belong to the same domain cluster, they come from different sources and have rather different alignments (Table I).
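The selection rules above can be sketched as a pair of filters. This is only an illustration: the `StructurePair` record and its field names are hypothetical, criterion (d) on discontinuous domains and the CDART redundancy check are omitted because they need data not described here, and "better than 3.0 Å" is read as a resolution value below 3.0.

```python
from dataclasses import dataclass

@dataclass
class StructurePair:
    """One pairwise structure-structure alignment (hypothetical record)."""
    percent_identity: float   # sequence identity over the aligned region, 0-100
    resolution_a: float       # X-ray resolution of the first structure, in angstroms
    resolution_b: float       # X-ray resolution of the second structure, in angstroms
    blast_evalue: float       # BLAST E-value computed for the VAST alignment
    footprint_overlap: float  # mutual VAST/CDD footprint overlap, 0-1

def pair_passes(pair: StructurePair) -> bool:
    """Pair-level criteria (a)-(c); criterion (d) is not modeled here."""
    return (pair.footprint_overlap >= 0.80
            and pair.resolution_a < 3.0
            and pair.resolution_b < 3.0
            and pair.blast_evalue < 0.01)

def family_passes(pairs: list[StructurePair]) -> bool:
    """Family-level criteria (a) and (b): enough pairs and a wide identity range."""
    kept = [p for p in pairs if pair_passes(p)]
    if len(kept) < 10:
        return False
    identities = [p.percent_identity for p in kept]
    return max(identities) - min(identities) >= 30.0
```

A family is kept only if at least 10 of its pairs survive the pair-level filter and those pairs span at least 30 percentage points of sequence identity.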

The final test set comprised 81 CDD families covering a wide range of functional and structural classes. The list of test families together with their length, number of protein pairs, and the PDB code of the first structure is shown in Table I. The test set for loop analysis contained 59 families, excluding 22 families that had a high fraction of pairs with missing coordinates in loop regions (see the next section).

### Measures of Structural and Sequence Similarity

To measure the quality of linear correlation between sequence and structural characteristics for homologous proteins from the same family, we first need to choose the most sensitive and reliable measures of sequence and structural similarity. Because most of the structural similarity measures (RMSD, AHM, LHM) are extensive and depend on the number of residues and protein size, the aforementioned structural measures should be divided by the radius of gyration (similar but not identical results were obtained with the normalization by the square root of the number of aligned residues). The radius of gyration for a protein pair was calculated for each of the two proteins in the pair based on the structurally aligned part and then was averaged. As a result, the normalized RMSD, AHM, and LHM quantities do not depend on the number of residues any more. Nonnormalized conventional measures of structural similarity yielded weaker sequence/structure correlation (not shown) so that in our further analysis we used only normalized structural similarity measures.
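A minimal sketch of this normalization follows, assuming an unweighted radius of gyration computed over the C-alpha coordinates of the structurally aligned residues. The function names are illustrative and the equal weighting of atoms is an assumption.

```python
import math

Point = tuple[float, float, float]

def radius_of_gyration(coords: list[Point]) -> float:
    """Unweighted radius of gyration of a set of C-alpha coordinates."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    mean_sq = sum(math.dist(p, centroid) ** 2 for p in coords) / n
    return math.sqrt(mean_sq)

def normalize_by_gyration(value: float,
                          aligned_a: list[Point],
                          aligned_b: list[Point]) -> float:
    """Divide a raw structural measure (RMSD, AHM, LHM) by the radius of
    gyration averaged over the aligned parts of the two structures."""
    mean_rg = 0.5 * (radius_of_gyration(aligned_a) + radius_of_gyration(aligned_b))
    return value / mean_rg
```

The radius is computed separately for each structure and then averaged, which follows the order of operations described in the paragraph above.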

The sequence similarity was measured as the BLAST bitscore^{28} divided by the alignment length (bitscore per residue). Structural similarity measures based on comparing the structures in the aligned regions comprised RMSD, fraction of conserved contacts (CC), and aligned Hausdorff measure (AHM), whereas the loop-based Hausdorff measure (LHM) quantified the difference in the loop regions. The fraction of conserved contacts was calculated as a fraction of identical residue contacts in both structures divided by the average number of contacts in both structures made by the aligned residues.^{29} The contacts were defined between residues separated along the chain by at least five peptide bonds and having C^{α} atoms less than 8 Å apart.
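The contact-based measure and the per-residue bitscore can be sketched as below. Two simplifications are assumptions rather than the authors' procedure: each coordinate list holds only the aligned residues, indexed by alignment position, and chain separation is approximated by the difference in that index.

```python
import math

def contact_set(coords, min_separation=5, cutoff=8.0):
    """Pairs (i, j) at least `min_separation` positions apart whose
    C-alpha atoms are closer than `cutoff` angstroms."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts

def fraction_conserved_contacts(coords_a, coords_b):
    """Contacts present in both structures, divided by the average
    number of contacts in the two structures."""
    contacts_a = contact_set(coords_a)
    contacts_b = contact_set(coords_b)
    mean_count = (len(contacts_a) + len(contacts_b)) / 2
    if mean_count == 0:
        return 0.0  # convention chosen here for pairs without any contacts
    return len(contacts_a & contacts_b) / mean_count

def bitscore_per_residue(bitscore, alignment_length):
    """BLAST bitscore divided by the alignment length."""
    return bitscore / alignment_length
```

Because both structures share the same alignment indexing, a contact is counted as identical when the same pair of alignment positions is in contact in both.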

The root-mean-squared deviation (RMSD) was calculated using the superposition algorithm due to McLachlan.^{30} Another measure that quantified the structural difference of proteins between the aligned regions and between the loops was based on the mathematical concept of Hausdorff distance.^{18}^{,}^{31} Let *A* = {*a*_{1}, …, *a*_{m}} and *B* = {*b*_{1}, …, *b*_{n}} be finite point sets in a Euclidean space. The Hausdorff distance between the sets *A* and *B* is then defined by:

$$d_H(A, B) = \max\left\{ \max_{a_i \in A} \min_{b_j \in B} d(a_i, b_j),\; \max_{b_j \in B} \min_{a_i \in A} d(a_i, b_j) \right\}$$

Here, the terms d(*a*_{i}, *b*_{j}) denote the Euclidean distance between the points. In other words, the Hausdorff distance between the sets *A* and *B* is the smallest distance such that every point *a*_{i} ∈ *A* is within this distance of some point *b*_{j} ∈ *B*, and vice versa. Hausdorff distance can be defined under the assumption that the structural alignment between two domains is known and the *C*^{α} atoms for both structures are in a common coordinate frame.
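The verbal definition above (every point of one set lies within the distance of some point of the other set, and vice versa) translates directly into a few lines of code. The sketch below is a brute-force illustration for small point sets that are already in a common coordinate frame; the function names are illustrative.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

For example, with A = {(0,0,0), (1,0,0)} and B = {(0,0,0), (4,0,0)}, the directed distance from A to B is 1 and from B to A is 3, so the Hausdorff distance is 3.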

The Hausdorff measure for loops (LHM) was calculated as follows:

$$\text{LHM} = \frac{1}{n_s - 1} \sum_{i=1}^{n_s - 1} h_i$$

Here “loop” is defined as a region between two consecutive aligned secondary structure elements and *n*_{s} is the number of aligned secondary structure elements: *h*_{i} = 0, if the *i*th loop regions do not have any unaligned residues; *h*_{i} = *d*_{H}(*A*_{i}, *B*_{i}), where *A*_{i} contains the set of *C*^{α} coordinates of nonaligned residues in the *i*th loop of the first structure in a pair, the last aligned residue from the preceding aligned region, and the first aligned residue from the following aligned region. Similarly, *B*_{i} is defined for the second structure in a pair. The sets (*A*_{i}, *B*_{i}) are defined to include two aligned residues so that the measure can be defined even if one of the sets of nonaligned residues is empty. In the calculation of LHM, those pairs where one or the other protein had more than 25% missing residues in nonaligned loops were excluded. In the case of AHM, instead of the coordinates for the *C*^{α} atoms in the loops, we use the coordinates for the *C*^{α} atoms in the aligned segments and average over the number of aligned segments.
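The per-loop terms can be sketched as follows. The code assumes that each loop is supplied as a pair of point sets that already include the two flanking aligned residues, and that the per-loop values are combined as a plain average over the loops supplied. The averaging rule is an assumption made for illustration, and the radius-of-gyration normalization is left out.

```python
import math

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point sets."""
    forward = max(min(math.dist(p, q) for q in b) for p in a)
    backward = max(min(math.dist(p, q) for q in a) for p in b)
    return max(forward, backward)

def loop_term(a_i, b_i):
    """h_i for one loop: zero when neither structure has unaligned
    residues, i.e. both sets hold only the two flanking residues."""
    if len(a_i) <= 2 and len(b_i) <= 2:
        return 0.0
    return hausdorff(a_i, b_i)

def loop_hausdorff_measure(loops):
    """Average of h_i over a list of (A_i, B_i) loop point sets."""
    terms = [loop_term(a_i, b_i) for a_i, b_i in loops]
    return sum(terms) / len(terms)
```

Including the flanking residues keeps every set non-empty, so the distance stays defined even when only one structure has an insertion in a given loop.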

Definitions of disulfide bonds were obtained from the PDB files of all protein structures for each family. Bonds formed outside of the structure–structure alignment footprint regions (see “Test set” section) were disregarded. The average number of disulfide bonds per family was calculated as the sum of the number of SS-bonds in each protein in a family divided by the number of proteins. The fraction of conserved disulfide bonds was calculated as a ratio between the number of identical SS-bonds in a protein pair and the average number of disulfide bonds within the footprint regions of two proteins.
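The two disulfide statistics can be sketched as below. How two bonds are judged identical is not spelled out above, so the sketch assumes that each bond is represented by the pair of alignment columns of its two cysteines and that bonds outside the footprint have already been dropped. Both points are assumptions.

```python
def average_disulfides_per_family(bond_counts):
    """Sum of SS-bond counts over the proteins of a family,
    divided by the number of proteins."""
    return sum(bond_counts) / len(bond_counts)

def fraction_conserved_disulfides(bonds_a, bonds_b):
    """Identical SS-bonds in a protein pair divided by the average
    number of SS-bonds in the two proteins.

    Each bond is a (column_1, column_2) tuple of alignment columns."""
    mean_count = (len(bonds_a) + len(bonds_b)) / 2
    if mean_count == 0:
        return 0.0  # convention chosen here for pairs without SS-bonds
    return len(set(bonds_a) & set(bonds_b)) / mean_count
```

With two bonds in one protein, three in the other, and one shared, the conserved fraction is 1 / 2.5 = 0.4.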

### Statistical Analysis

The statistical analyses described in this study used the Splus statistical package (version 6). To investigate the relationship between sequence and structural similarity we performed correlation and regression analyses. The Pearson linear correlation (ρ) and Spearman rank correlation coefficients were calculated, and the *p*-value under the null hypothesis that the correlation coefficient was equal to zero was estimated. Those families with *p*-values less than 0.01 were considered as having correlation coefficients significantly different from zero. To quantify how much the nonlinear terms improve the data fitting we included a quadratic term in the linear model and performed nonlinear regression analysis. The ratio of the squared linear correlation coefficient for the linear model (*R*_{l}^{2}) to the squared multiple correlation coefficient for the nonlinear model (*R*_{n}^{2}), *r*^{2} = *R*_{l}^{2}/*R*_{n}^{2}, in this case would indicate the relative improvement in the data fitting upon inclusion of the nonlinear term in the model. The higher this ratio is, the lower the contribution of nonlinear terms upon data fitting.
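The ratio of the linear to the quadratic goodness of fit can be illustrated with a small standard-library sketch. It is not the Splus procedure used in the study: the quadratic model is fitted here by solving the least-squares normal equations directly, and all function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def r_squared_quadratic(x, y):
    """Squared multiple correlation for y = c0 + c1*x + c2*x^2."""
    powers = [sum(v ** k for v in x) for k in range(5)]
    normal = [[powers[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum((v ** i) * w for v, w in zip(x, y)) for i in range(3)]
    c = solve(normal, rhs)
    my = sum(y) / len(y)
    ss_tot = sum((w - my) ** 2 for w in y)
    ss_res = sum((w - (c[0] + c[1] * v + c[2] * v * v)) ** 2 for v, w in zip(x, y))
    return 1.0 - ss_res / ss_tot

def r2_ratio(x, y):
    """Ratio of linear R^2 to quadratic R^2; values near 1 mean the
    quadratic term adds little."""
    return pearson(x, y) ** 2 / r_squared_quadratic(x, y)
```

For perfectly linear data the ratio is 1, while for data that follow a parabola the ratio drops below 1 because the quadratic model explains variance that the straight line cannot.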

The *F*-test has been used to test the null hypothesis that all regression coefficients are equal, with alternative hypothesis being that the regression coefficients are not all equal. The null hypothesis has been rejected, and therefore we employed multiple comparison procedures. First we checked which regression coefficients were different from each other by using the Tukey-Kramer method.^{32} For the purpose of illustrating the Tukey-Kramer method, the approximate method proposed by Gabriel can be applied, which computes the comparison intervals for all regression coefficients.^{32} According to Gabriel’s method, two regression coefficients are considered significantly different if and only if their comparison intervals do not overlap.
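The decision rule at the end of this paragraph (two coefficients differ if and only if their comparison intervals do not overlap) can be sketched as below. The sketch takes the comparison intervals as given and does not compute them, so it does not implement Gabriel's method itself. The family labels and interval values in the usage note are made up for illustration.

```python
def intervals_overlap(a, b):
    """True if the closed intervals a = (low, high) and b = (low, high) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def significantly_different_pairs(intervals):
    """List the pairs of families whose comparison intervals do not overlap.

    `intervals` maps a family label to its (low, high) comparison interval."""
    names = sorted(intervals)
    return [(p, q)
            for i, p in enumerate(names)
            for q in names[i + 1:]
            if not intervals_overlap(intervals[p], intervals[q])]
```

With hypothetical intervals `{"fam_a": (-0.30, -0.24), "fam_b": (-0.25, -0.15), "fam_c": (-0.13, -0.07)}`, only `fam_c` is separated from the other two families.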

## RESULTS

### The Quality of Sequence–Structure Correlation for Different Protein Families

Table II shows the accuracy of correlation obtained between the BLAST bitscore per residue and various measures of structural similarity (RMSD, CC, AHM, and LHM). As can be seen from this table, the linear correlation is strong for most of the families, and half of them have correlation coefficients better than 0.73–0.87, depending on the structural similarity measure used (Table II lists Pearson correlation coefficients; Spearman rank correlation coefficients give similar results). This result is consistent with the studies of Wood and Pearson,^{20} who showed on a smaller test set of 35 protein families that half of them have correlation coefficients greater than 0.878. Comparing different measures of structural similarity, one can see that normalized AHM tends to yield a stronger correlation than other quantities yielding 98% of families with statistically significant linear correlation coefficients (with *p*-value <0.01). In agreement with this observation, our previous studies showed that the AHM measure performs very well in distinguishing homologs from analogs.^{18} High accuracy of the AHM is due to the higher sensitivity of the Hausdorff measure to subtle dissimilarities between the aligned parts of protein structures. Based on this observation, we chose this quantity to characterize the structural change in the present analysis.

Figure 1(a–d) illustrates the high quality of linear correlation for four protein families: Picornavirus capsid protein (pfam00073), Pancreatic ribonuclease (cd00163), GLFV-dehydrogenase (pfam00208), and Alpha-amylase (smart00632), which all have Pearson linear correlation coefficients less than −0.87. As shown in Figure 1(e–f), not all families, however, exhibit such good correlation between sequence and structure changes. The Trypsin-like serine protease family (cd00190), for example, has a correlation coefficient of only −0.57 [Fig. 1(f)], while the Copper-binding proteins family (pfam00127) is more adequately described by the nonlinear regression model taking into account higher order quadratic terms (*r*^{2}-ratio being equal to 0.88) [Fig. 1(e)]. In the overall test set, among those with statistically significant correlation (79 families), 17 families had an *r*^{2}-ratio smaller than 0.9 indicating that, for these cases, adding the nonlinear term improves the performance of modeling by about 10%. It should be noted that alignments from different sources but belonging to the same protein family (see Methods, Table I) except for three cases exhibit consistent behavior with respect to the quality of linear correlation. Furthermore, random exclusion of duplicate families does not have any effect on the quality of linear correlation, nor on the results discussed below.

Figure 1. (**a**) Picornavirus capsid protein (pfam00073), (**b**) Pancreatic ribonuclease (cd00163), (**c**) GLFV-dehydrogenase (pfam00208), (**d**) Alpha-amylase (smart00632), (**e**) Copper binding proteins family ...

Although the correlation between protein sequence and structure is found to be statistically significant for the great majority of test families, there is still a high degree of variability in the magnitudes of the correlation coefficients among the families. There seems to be no strong relationship between the domain length (i.e., the average length of structure–structure alignments in a family) and the quality of linear correlation (ρ = −0.30, *p*-value = 0.01). No connection between correlation coefficients and contact density (ρ = −0.23, *p*-value = 0.04) or contact order^{33} (ρ = −0.27, *p*-value = 0.02) has been observed either.

One might hypothesize that changes in structure should not always be strongly coupled with changes in amino acid sequence, especially if protein stability is determined mainly by the set of strong interactions such as covalent disulfide bonds. Figures 2 and 3 show how the quality of linear correlation depends on the disulfide bond content in protein families. As can be seen from Figure 2, protein families having on average two or more disulfide bonds per family (Sample 1, 13 families) exhibit rather poor sequence–structure correlation, and proteins from the families with high correlation coefficients usually contain fewer than two disulfide bonds (Sample 2, 68 families). We should note that the difference between these two distributions is not caused by the difference in the family length (there is no significant correlation between the number of disulfide bonds per family and protein length).

Figure 2. Families with fewer (**a**) and more (**b**) than two disulfide bonds per family.

To test the difference between the two distributions of correlation coefficients (Sample 1 and Sample 2), we applied the Wilcoxon two-sample test, which showed that the two samples come from populations with different mean values (the null hypothesis was rejected with *p*-value = 0.0016). We found that the majority of S—S bonds in Sample 1 were well conserved among different family representatives (more than 75% conserved S—S bonds), except for three cases: Carboxylesterase (pfam00135, 72% conserved S—S bonds), Trypsin-like serine protease (smart00020, 71% conserved S—S bonds), and Papain family Cysteine protease (pfam00112, 63% conserved S—S bonds). Two of these families (pfam00135 and pfam00112) are also characterized by high sequence–structure correlation (ρ = −0.80, ρ = −0.84).

Figure 3 shows as well that the quality of sequence–structure correlation depends on the average number of disulfide bonds per family (the correlation coefficient is 0.44 with *p*-value of 0.001). Because not all disulfide bonds are conserved in protein families, we also calculated the fraction of conserved S—S bonds per family and showed in this figure those families that had the fraction of conserved S—S bonds higher than 0.5 (Fig. 3, crosses). A high fraction of conserved S—S bonds in a family points to the preservation of specific S—S bonds in evolution and can be used as a measure of reliability of their definition (correlation coefficient for data points shown by crosses is equal to 0.64 with *p*-value of 0.0007).

### The Evolutionary Plasticity of Structure Estimated for Different Protein Families

As we showed in the previous section, for the majority of families, the sequence–structure dependence can be quite well described by the linear regression. The regression coefficients (the slope of the regression line) in these cases would estimate the relative structural to sequence change in the evolution of a particular protein family or, in other words, “the evolutionary plasticity of structure” (EPS). This measure is discussed below in more detail. To compare regression coefficients for different protein families, first we excluded families with poor correlation (ρ_{RMSD} > −0.8 or ρ_{AHM} > −0.8) and large contribution of nonlinear terms ($r_{\text{RMSD}}^{2} < 0.9$ or $r_{\text{AHM}}^{2} < 0.9$). This filtering procedure resulted in 43 families with high linear correlation (these families are marked by asterisks in Table I). Figure 4 depicts the histogram of regression coefficients for this set of 43 protein families. As can be seen from this figure, the EPS varies by about a factor of 3 among different protein families. Likewise, Wood and Pearson^{20} reported a 3.9-fold change in their “structural mutation sensitivity” for a similar but smaller test set.
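As an illustration of how an EPS value could be obtained for one family, the sketch below fits an ordinary least-squares line to structural similarity as a function of sequence similarity and returns the slope only when the family passes a filter of the kind described above. The cutoffs of −0.8 for the correlation coefficient and 0.9 for the *r*^{2}-ratio follow the text. The function names are illustrative, the direction of the regression (structure on sequence) is an assumption, and the numbers in the usage note are made up.

```python
import math

def linear_fit(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept, pearson_r)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)

def eps_if_linear(x, y, r2_ratio, rho_cutoff=-0.8, ratio_cutoff=0.9):
    """Return the regression slope (EPS estimate), or None when the family
    is too weakly correlated or too nonlinear to be compared."""
    slope, _, rho = linear_fit(x, y)
    if rho > rho_cutoff or r2_ratio < ratio_cutoff:
        return None
    return slope
```

For a hypothetical family with bitscores per residue `[0.5, 1.0, 1.5, 2.0]` and normalized AHM values `[0.40, 0.31, 0.19, 0.10]`, the fitted slope is −0.204, which is in the range of the values reported for the test families.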

Although the regression coefficients vary between families, one needs to test whether this difference is statistically significant. To compare the slopes of the various families, we first tested the null hypothesis that all regression coefficients are equal (see Methods). This hypothesis is rejected with *P* < 0.0001. To determine which families have different structural tolerances, we employed multiple comparison methods and calculated the comparison intervals (95% confidence) for the regression coefficients of every protein family (Fig. 5). The comparison intervals are constructed such that two regression coefficients are significantly different if and only if their intervals do not overlap.^{32} As can be seen from Figure 5, there are apparently two groups of protein families that have significantly different regression coefficients and nonoverlapping comparison intervals, while the rest of the protein families do not exhibit a significant difference in slopes between each other.

The first group consists of several protein families that have the steepest slopes (highest EPS) and are positioned on the left side of the plot. These include GLFV-dehydrogenases (pfam00208, b = −0.27), Copper/zinc superoxide dismutase (pfam00080, b = −0.22), Protein tyrosine phosphatase (smart00194, b = −0.21), and Proteasome A-type and B-type (pfam00227, b = −0.21). The second group is formed by proteins with the smallest EPS, which are positioned on the right side of Figure 5; among them are Picornavirus capsid protein family (pfam00073, cd00205, b = −0.10), Beta/gamma-crystallins (smart00247, b = −0.11), IPT/TIG domain (pfam01833, b = −0.12), and Xylose isomerase (pfam00259, b = −0.12). Interestingly, some protein families characterized by the lowest EPS form large interaction interfaces with other proteins or cell components. For example, Picornavirus capsid proteins are packed in highly ordered icosahedral shells that are maintained through multiple interactions between the subunits, whereas crystallins, IPT/TIG, and Xylose isomerase domains also participate in macromolecular interactions.

Overall, we found that EPS values for the majority of protein families do not differ significantly between each other because their comparison intervals (see Methods) overlap. Because our test protein families spanned a wide range of structural folds (Table I) and functions, the previous observation implies that EPS, in general, depends neither on the structural class nor on the protein fold type. For example, the Glycosyl hydrolase family (smart00633) has an EPS of −0.18, whereas the aldo/keto reductase/K+ channel beta subunit family has an EPS of about −0.12, although both protein families have the TIM barrel fold. The superfolds, the most populated structural topologies (TIM barrels, beta trefoils, four-helical bundles, and others), show EPS values comparable to those of other folds (not shown).

### The Evolutionary Plasticity Is Different in Loop Regions Compared to the Protein Core Regions

The evolutionary relatedness between proteins can be successfully gauged from the comparison of their loop regions.^{18}^{,}^{34} Table II shows that, within the families of homologous proteins, structural changes in loops are strongly coupled with the evolutionary distance which, in this case, was measured by the normalized BLAST bitscore for the aligned regions. The sequence–structure dependence in loop regions for 71% of protein families (the test set for the loop analysis, see Methods) can be well described by a linear model and, for 88% of the protein families, the linear correlation coefficients are found to be statistically significant. Families with a particularly high sequence–LHM correlation include Xylose isomerase, Class I Histocompatibility antigen, Protein tyrosine phosphatase, IG-like plexins, and others. For some families, for example, Ribonuclease A, the sequence–structure correlation for loops is even higher than the correlation observed for aligned core regions. The linear sequence–structure correlation suggests that loop regions are, in general, under constant evolutionary pressure, which preserves their overall structure, and they therefore change gradually as proteins diverge.

To compare the EPS of aligned core regions with the EPS of loop regions, we computed the ratio of their regression coefficients (*b*^{core}/*b*^{loop}). The test set depicted in Figure 6 comprises 16 protein families with a good linear correlation for both LHM and AHM (with the requirement that both correlation coefficients are less than −0.8 and *r*^{2} > 0.9). Assuming equal plasticity of core regions and loops (the null hypothesis), we expect that, in half of the instances, *b*^{core}/*b*^{loop} ratios will fall below 1, and in half of the instances these ratios will be above 1 (8:8 ratio). However, we observed 15 cases where the *b*^{core}/*b*^{loop} ratio was less than 1. The probability of observing such a bias given the above assumption can be estimated from the binomial distribution as *p*(0.5, 0, 16) + *p*(0.5, 1, 16) = 0.00026. Thus, equal plasticity of core regions and loops is unlikely to be compatible with our observations. This suggests that loop regions have higher evolutionary plasticity of structure compared to the protein core and, as can be seen from Figure 6, for the majority of families (12 families), the ratio of regression coefficients for the core and loop regions lies between 0.2 and 0.6.
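The binomial estimate quoted above can be checked with a few lines of code. The argument order of `binomial_pmf` mirrors the *p*(probability, successes, trials) notation of the text, and the function name is illustrative.

```python
from math import comb

def binomial_pmf(p, k, n):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of seeing at most 1 of 16 ratios above 1 when both outcomes
# are equally likely: (1 + 16) / 2**16.
tail = binomial_pmf(0.5, 0, 16) + binomial_pmf(0.5, 1, 16)
print(f"{tail:.5f}")  # 0.00026
```

The sum equals 17/65536, which rounds to the 0.00026 given in the text.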

## DISCUSSION AND CONCLUSION

In this article, we study the structural evolution of homologous proteins in terms of their sequence–structure dependence. We showed that the protein structural variability for a great majority of protein families is linearly coupled with the sequence variability, which suggests that, typically, protein structure gradually changes as proteins diverge during evolution. However, when the protein structural core is stabilized by strong interactions such as disulfide bonds, the correlation between structural and sequence divergence is much weaker, if detectable at all. Protein families that have a large number of disulfide bonds (which are usually conserved) typically do not show a linear sequence–structure correlation, in contrast to families with fewer disulfide bonds. Apparently, during the evolution of these families, purifying selection preserves the disulfide contacts and has a much weaker effect in the rest of the protein molecule such that, in these cases, the structural variability cannot be explained predominantly by the changes in sequence.

Drawing an analogy with solid mechanics, the sequence–structure dependence curves can be viewed as stress–strain curves where the physical body undergoes geometrical deformation after applying a stress. In the case of protein evolution, amino acid substitutions introduce the stress on protein structure, and structure either adjusts to the change or breaks apart. The linear dependences of measures of structural similarity on sequence similarity observed for the majority of protein families in our test set allow us to compare “the evolutionary plasticity of structure” (EPS) between different families. The evolutionary plasticity of structure for a given family is defined, accordingly, as the degree of structural variation per unit of sequence variation. Low values of EPS (shallow slope of the regression line) correspond to the situation when protein structure is highly conserved within a family of homologs relative to sequence changes. This could be caused either by strong functional constraints imposed on the structure or by high structural stiffness, that is, the inability to accommodate large structural variations without breaking the molecule apart. High values of EPS (steep slope) correspond to the situation when large structural shifts (within a framework of a given protein fold) can occur upon minor sequence divergence as a result of relaxed functional constraints on the structure and/or high structural tolerance of a given fold.

The rigorous statistical analysis performed in this work suggests that, with several exceptions, the values of the EPS for protein structural cores do not significantly differ between protein families. Interestingly enough, despite the variability among protein families in functional constraints and types of structural folds, the proteins from different families respond similarly to the sequence drift in evolution. This observation is based on the evaluation of multiple comparison intervals for the EPS values rather than on direct comparison of sequence–structure correlation slopes as has been done by others.^{20} One could argue that this result could be an artifact caused by possible flaws in the analysis such as insufficient structural data and/or derivation of sequence and structure similarity measures. However, the observed high correlation between sequence and structural divergence within individual families suggests that the analysis described here is robust. Moreover, the observed EPS values were not found to be statistically different, even though the test set was designed in such a way (protein families with high linear correlation and sufficient number of sequences) to reduce the uncertainty of the EPS estimates.

It is commonly observed that the size of sequence space is much larger than the size of structure space, and the number of different structural folds is rather small, estimated to be several thousand.^{35–40} Moreover, certain protein topologies (so-called “superfolds”) are realized in evolution much more often than others, and this inequality in fold frequencies is sometimes attributed to specific physicochemical or geometrical properties of superfolds. Our results demonstrate that the gradual change of structure follows the same pattern in different protein families, suggesting that the role of intrinsic characteristics of superfolds in evolution might be exaggerated. In this respect we argue that the differences between common and rare folds may arise in evolution semirandomly, that is, via self-enhancing stochastic fluctuations in the abundance of essentially equal folds.^{7} In any case, until the existence and significance of differences in “evolutionary plasticity of structure” between protein families are conclusively demonstrated, there are probably no grounds to use their inequality as a working hypothesis in studies of protein structural evolution.

## Acknowledgments

We thank Stephen Bryant (NCBI), Eugene Koonin (NCBI), and Nick Grishin (University of Texas Southwestern Medical Center) for helpful discussions, and Lewis Geer for help with the CDART database.

## Footnotes

Grant sponsor: the NIH Intramural Research Program

## References
