Copyright © 2005 Panchenko and Madej; licensee BioMed Central Ltd.

Structural similarity of loops in protein families: toward the understanding of protein evolution

Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894, USA

Corresponding author. Anna R Panchenko: panch/at/ncbi.nlm.nih.gov; Thomas Madej: madej/at/ncbi.nlm.nih.gov

Received October 6, 2004; Accepted February 3, 2005.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Protein evolution and protein classification are usually inferred by comparing protein cores in their conserved aligned parts. Structurally aligned protein regions are separated by less conserved loop regions, where sequence and structure locally deviate from each other and do not superimpose well.

Results

Our results indicate that even longer protein loops can not be viewed as "random coils" and for the majority of protein families in our test set there exists a linear correlation between the measures of sequence similarity and loop structural similarity. Results suggest that distance matrices derived from the loop (dis)similarity measure may produce in some cases more reliable cluster trees compared to the distance matrices based on the conventional measures of sequence and structural (dis)similarity.

Conclusions

We show that by considering "dissimilar" loop regions rather than only conserved core regions it is possible to improve our understanding of protein evolution.

Background

Globular proteins are considered to be structurally similar if their regular secondary structure elements can be superimposed well and are connected in the same order.
The loop regions connecting secondary structures show less regularity in their conformations, even though short loops linking specific secondary structures can be classified into distinct classes [1-6]. The structures and sequences in loop regions may deviate from each other so that they do not superimpose well, and as a result loops are very often not aligned by structure-structure or sequence alignment methods. Loops apparently do not contribute much to protein stability but may be quite important for protein-specific function and for the interaction with other components of the cell. In our previous work we showed that a measure derived from the loop regions can distinguish homologous from analogous proteins with the same or higher accuracy compared to the conventional measures, which are based on comparing proteins in structurally aligned regions only [7]. Recently it has been observed that structural variation in the core of homologous proteins is linearly correlated with sequence changes [8,9]. As was also shown several years ago, the probability of insertion and deletion events, which occur predominantly in the loop regions, strongly depends on the evolutionary distance between two homologous proteins [10,11]. Based on these observations one might argue that more closely related proteins exhibit more similarity in the structure of their loop regions than distantly related proteins, and that structural loop (dis)similarity should correlate with evolutionary distance. To check this hypothesis we analyzed structural variation in the loop regions within different homologous protein families using a recently introduced measure of loop similarity [7]. This measure is based on the concept of the Hausdorff metric, which is used in mathematical topology to define a distance between two point sets of a metric space. It does not require an alignment or a one-to-one correspondence between the two point sets.
We show that there exists a linear correlation between the average structural change in the loop regions and the evolutionary distance, which allows us to use the loop (dis)similarity measure for inferring the phylogenetic history of homologous protein families.

Methods

Test set

To select sets of homologous proteins we used the Conserved Domain Database (CDD) version 1.62, which can be accessed at [12]. The CDD collection of protein domain alignments included curated CDDs [13] and preprocessed domain families imported from SMART and PFAM, altogether 6222 protein domain families [14]. Upon import, the sequences from SMART/PFAM alignments with more than 75% identity with known structures were substituted by the most similar structures from the Protein Data Bank [15]. Each CDD family was decomposed into a set of pairwise structure-structure alignments. Structural alignments were computed by the VAST algorithm [16], and only those structures which had more than 80% mutual overlap between the VAST alignment footprint and the CDD footprint were considered in the analysis. The footprint for a given sequence was defined as the region between the first and the last residues aligned by VAST or CDD. Those families containing short sequence repeats and having an average alignment length of less than 50 residues were excluded from the test set.

The structural pairs within the remaining CDD families were disregarded if at least one of the following conditions held true:

- at least one structure in a pair had an X-ray resolution greater than 3.0 Å
- the Blast E-value calculated for the VAST alignment exceeded 0.01
- at least one structure in a pair contained a chain discontinuous domain inconsistently aligned between VAST and CDD
- at least one structure in a pair contained more than 25% of its nonaligned loops with missing residues.
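The four exclusion conditions above can be read as a single predicate applied to each structural pair. The following is only a minimal sketch of that reading: the `StructurePair` record and all of its field names are hypothetical, introduced here for illustration, and the per-pair values are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class StructurePair:
    # All field names are hypothetical; each mirrors one criterion in the text.
    worst_resolution: float            # larger (poorer) X-ray resolution of the two structures, in Å
    blast_evalue: float                # Blast E-value calculated for the VAST alignment
    inconsistent_discontinuous: bool   # a chain discontinuous domain is aligned inconsistently between VAST and CDD
    max_missing_loop_fraction: float   # largest per-structure fraction of nonaligned loops with missing residues


def keep_pair(pair: StructurePair) -> bool:
    """Return True only if none of the four exclusion conditions holds."""
    if pair.worst_resolution > 3.0:
        return False
    if pair.blast_evalue > 0.01:
        return False
    if pair.inconsistent_discontinuous:
        return False
    if pair.max_missing_loop_fraction > 0.25:
        return False
    return True


# Illustrative values only, not data from the study.
good = StructurePair(2.1, 1e-8, False, 0.10)
low_resolution = StructurePair(3.4, 1e-8, False, 0.10)
print(keep_pair(good), keep_pair(low_resolution))
```

Writing the filter as one function with early returns keeps each criterion on its own line, so a threshold can be checked against the list above at a glance.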
To ensure that protein families span a wide range of sequence similarity, all families were examined and those whose sequence identity spanned a range of less than 30% were not considered in further analysis. The redundancy between protein families was checked by using the procedure implemented in the CDART algorithm [17], and not more than 2 protein families from the same CDD cluster were retained in the final test set. In the end, the test set comprised 59 CDD families with more than 10 structurally aligned pairs of homologs. This test set covered a wide range of functional and structural classes; the list of test families together with their length, number of protein pairs and correlation coefficients is shown in Table 1.
Measures of structural and sequence similarity

To measure the sequence similarity between homologous proteins from the same family we used a Blast bitscore normalized by the alignment length. Among structure similarity measures used in this paper, two of them, RMSD and alignment-based Hausdorff measure (AHM), were computed by comparing the proteins in structurally aligned regions, while the loop-based Hausdorff measure (LHM) quantified the difference in the loop regions. The root mean squared deviation (RMSD) was calculated using the superposition algorithm due to McLachlan [18].

The AHM and LHM measures were based on the mathematical concept of Hausdorff distance [19]. Let $A = \{a_1, \ldots, a_m\}$ and $B = \{b_1, \ldots, b_n\}$ be finite point sets in a Euclidean space. The Hausdorff distance between the sets $A$ and $B$ is then defined by:

$$d_H(A, B) = \max\left\{ \min_j d(a_1, b_j), \ldots, \min_j d(a_m, b_j),\ \min_i d(a_i, b_1), \ldots, \min_i d(a_i, b_n) \right\} \qquad (1)$$

Here the terms $d(a_i, b_j)$ denote the usual Euclidean distance between the points. In other words, the Hausdorff distance between the sets $A$ and $B$ is the smallest distance such that every point $a_i \in A$ is within this distance of some point $b_j \in B$ and vice versa. Hausdorff distance can be calculated under the assumption that the Cα atoms for both structures are in a common coordinate frame which is defined by the structural alignment between two domains.
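The definition in equation (1) translates almost directly into code. The sketch below is a brute-force illustration in plain Python, not the authors' implementation; the function names are our own, and the points are assumed to be coordinate tuples already placed in a common frame.

```python
import math


def directed_distance(A, B):
    """Largest distance from a point of A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Hausdorff distance between two finite, non-empty point sets."""
    if not A or not B:
        raise ValueError("both point sets must be non-empty")
    # Equation (1): the maximum over both directions of the nearest-neighbour distances.
    return max(directed_distance(A, B), directed_distance(B, A))


# Illustrative coordinates only.
A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
B = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(hausdorff(A, B))  # 3.0: the point (4, 0, 0) is 3 units from its nearest point in A
```

The example shows why both directions are needed: every point of A lies within 1 unit of B, but the point (4, 0, 0) of B is 3 units away from A, and the second direction is what captures it. The cost is proportional to the product of the set sizes, which is acceptable for loop-sized sets.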
The Hausdorff measure for loops (LHM) was calculated as an average of Hausdorff distances over all loops in the protein pair, where $n_s$ is the number of aligned secondary structure elements:

$$\mathrm{LHM} = \frac{1}{n_s - 1} \sum_{i=1}^{n_s - 1} h_i$$

The "loop" was defined as a region between two consecutive aligned secondary structure elements, and:

- $h_i = 0$, if the i-th loop regions do not have any unaligned residues;
- $h_i = d_H(A_i, B_i)$ otherwise, where $A_i$ contains the Cα coordinates of the non-aligned residues in the i-th loop of the first structure in a pair, of the last aligned residue from the preceding aligned region and of the first aligned residue from the following aligned region. Similarly, $B_i$ is defined for the second structure in a pair.

The sets $(A_i, B_i)$ are defined to include two aligned residues so that the measure can be defined even if one of the sets of non-aligned residues is empty. The Hausdorff measure for the structurally aligned regions (AHM) was defined similarly. In this case, instead of the sets that contain the coordinates of the Cα atoms in the loops, we use the coordinates of the Cα atoms in the aligned segments and average over the number of aligned segments.

The correlation analysis between the measures of sequence and structural similarity, the linear/nonlinear regression analyses and the cluster analysis were performed using Splus version 6. Pearson (ρ) and Spearman correlation coefficients were calculated to quantify the accuracy of linear correlation. The P-value under the null hypothesis that the correlation coefficient between two variables is equal to zero was estimated, and those families with P-values less than 0.01 were considered as having statistically significant correlation. The cluster analysis was done using complete linkage clustering [20], where the distance between two clusters was measured as the maximum distance between a point in one cluster and a point in another cluster.
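To make the loop-based measure defined earlier in this section concrete, the following sketch averages the per-loop values over all loops of a pair. It is an illustration under stated assumptions rather than the original code: the data layout (one tuple per loop holding the unaligned Cα coordinates of each structure and the two flanking aligned residues) and all function names are hypothetical.

```python
import math


def hausdorff(A, B):
    """Hausdorff distance between two finite, non-empty point sets."""
    forward = max(min(math.dist(a, b) for b in B) for a in A)
    backward = max(min(math.dist(a, b) for a in A) for b in B)
    return max(forward, backward)


def loop_value(unaligned_a, unaligned_b, flanks_a, flanks_b):
    """Per-loop term: 0 if neither structure has unaligned residues in the loop,
    otherwise the Hausdorff distance between the two loop point sets."""
    if not unaligned_a and not unaligned_b:
        return 0.0
    # Each set also holds the last aligned residue before the loop and the first
    # aligned residue after it, so it is never empty.
    set_a = [flanks_a[0], *unaligned_a, flanks_a[1]]
    set_b = [flanks_b[0], *unaligned_b, flanks_b[1]]
    return hausdorff(set_a, set_b)


def lhm(loops):
    """Average of the per-loop terms over all loops of a protein pair."""
    if not loops:
        raise ValueError("at least one loop is required")
    return sum(loop_value(*loop) for loop in loops) / len(loops)


# Illustrative coordinates only: the first loop has no unaligned residues,
# the second has one extra residue in the first structure.
flanks = ((0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
loops = [
    ([], [], flanks, flanks),
    ([(1.0, 3.0, 0.0)], [], flanks, flanks),
]
print(lhm(loops))
```

In the second loop the second structure has no unaligned residues, yet its set still contains the two flanking residues, which is exactly why the text includes them in the definition.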
The cluster trees based on p-distance and LHM were compared using the Phylip program [21] by generating 1000 bootstrap alignments from the structural alignments of a protein family and by calculating p-distance based cluster trees from the bootstrap alignments. The bootstrap support for the LHM based tree or different partitions of this tree was calculated by counting how many times the LHM topology occurs among the bootstrap cluster trees.

Results and discussion

Tables 1 and 2 show the accuracy of correlation obtained between the various measures of structural similarity (RMSD, AHM and LHM). As can be seen from these tables, the correlation quantified by the Pearson correlation coefficient is quite high for most of the families and half of the families have coefficients between -0.76 and -0.81 depending on the structural similarity measure used (Spearman rank correlation coefficients were shown to be very close to those reported in Tables 1 and 2). This result is consistent with the studies of Wood and Pearson who showed on a smaller test set of 35 protein families that half of them have correlation coefficients greater than 0.878 [8]. In their case the sequence-structure correlation was quantified, however, by using only the measures based on the structurally aligned regions of the proteins.
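The negative sign of the coefficients reported here follows from the direction of the two scales: the normalized bitscore grows as proteins become more similar, while RMSD, AHM and LHM grow as structures become more dissimilar. A minimal sketch of the Pearson coefficient, written in plain Python with made-up illustrative values (not data from the tables), shows the effect:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical pairs within one family: higher normalized bitscore, lower loop dissimilarity.
bitscores = [2.0, 1.5, 1.0, 0.5]
loop_dissimilarity = [0.5, 1.0, 1.5, 2.0]
print(pearson(bitscores, loop_dissimilarity))  # perfectly anticorrelated toy data, close to -1
```

A coefficient near -1 therefore indicates a strong sequence-structure relationship, and a value near zero indicates that loop structure varies independently of sequence similarity.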
The dependence of structural similarity on sequence similarity in some cases can be more accurately described by a nonlinear regression model that takes into account higher-order quadratic terms. To quantify how much the nonlinear terms improve the data fitting, we use the ratio of the squared correlation coefficients for the linear ($r_l^2$) and nonlinear ($r_{nl}^2$) models ($r_l^2 / r_{nl}^2$). In the overall test set only 12 families have an $r^2$ ratio smaller than 0.9 (with LHM used as the structural similarity measure), indicating that for these cases adding the nonlinear term improves the performance of modeling by about 10%.

As was shown previously, the evolutionary relatedness between proteins can be successfully gauged from the comparison of their loop regions [7]. Indeed, Table 2 and Figure 1
However, not all families exhibit such good correlation. One example of a protein family showing particularly low LHM correlation is the family of Actin depolymerisation factor/cofilin-like domains (ADF). The sequence-structure correlation for the loop regions of this family is not statistically significant (the Pearson correlation coefficient is close to zero), whereas the sequence-structure correlation for the protein core is very high (ρ = -0.85 with AHM). Indeed, different proteins of this family show distinctly different loop conformations, and an evolutionary analysis of the ADF family argued that the insertions present in the vertebrate ADF/cofilins (and not present in non-vertebrate cofilins) might be important for the nuclear function of mammalian cofilins [22]. Therefore, in this case the structural heterogeneity of the loop regions can be explained by the acquisition of a new distinct function by some members of this family. For some families, for example Trypsin-like serine protease (Tryp_SPc), neither the LHM (ρ = -0.31) nor the AHM (ρ = -0.55) similarity measure exhibits a good sequence-structure correlation (Figure 1(c)). Among the families with particularly high LHM correlation are Xylose isomerase (Xylose_isom), Class I Histocompatibility antigen (domains alpha 1 and 2, MHC_I), Protein tyrosine phosphatase (PTPc) and others.

Figure 1

To understand whether significant sequence-structure correlation for loop regions has an underlying biological meaning, we performed a cluster analysis of proteins from two diverse families, Ribonuclease A (RnaseA), and SH2 domain (SH2, ρAHM = -0.48, ρLHM = -0.78), using different measures of sequence and structural similarity.

Figure 2
The RnaseA family is a very interesting example to study, as it is characterized by considerably different catalytic efficiency and substrate preferences among family members, and the different aspects of its activity are not well understood. Although the cysteines that form disulfide bonds, the catalytic histidines and the lysine residues are mostly conserved in structure and sequence, there is great variability in sequence between other regions of RnaseA proteins [23,24]. We compared the obtained cluster trees (Figure 2)

SH2 domains represent phosphotyrosyl peptide binding modules which are found in many signaling proteins. The specificity of phosphate interaction with a protein has been attributed to the hydrophobic pocket, which is mostly formed by two loop regions [25]. Our analysis shows that indeed the loop regions have a much higher accuracy in clustering of functional subfamilies of SH2 domains. Comparing our cluster trees with the classification of Songyang et al. [26] and cluster trees of SH2 phosphotyrosyl binding sites [25] we can see from Figure 3
Conclusions

Here we have presented an analysis of how the structure of protein loops changes in evolution as homologous proteins diverge from each other. We showed that for the majority of protein families there exists a statistically significant linear correlation between measures of sequence similarity and average loop structural similarity. This in turn suggests that loops change in evolution via a stepwise insertion or deletion process, and clearly one cannot portray even longer loop regions as "irregular conformations" or "random coils". Indeed, our results imply that, in general, loops are under constant evolutionary constraints which, apparently, are weaker than those for a protein core but still strong enough to preserve the overall loop structure. Since loops do not contribute much to the protein core stability, these constraints predominantly arise from the importance of loops in interacting with ligands, other proteins and cells, as well as a possible role of loops in protein folding.

Modeling of insertion and deletion events in evolution poses many difficulties, and protein evolution is usually reconstructed based only on the aligned regions of proteins. We demonstrated that loop regions, which usually correspond to the non-aligned protein regions, can be very important in inferring the phylogenetic history of a protein family. Moreover, it was shown that sometimes sequence and structure similarity measures comparing proteins in their core are not sensitive enough to detect subtle (dis)similarities between the subfamilies. Loop-based measures, which emphasize the dissimilarities between different protein members, can shed light on the evolutionary relationships between homologous proteins.

Authors' contributions

AP and TM contributed equally to this paper.

Acknowledgements

We especially thank Stephen Bryant and Yuri Wolf for insightful discussions. This work has been supported by the NIH Intramural Research Program.