Proc Natl Acad Sci U S A. Mar 28, 2000; 97(7): 3288–3291.
Published online Mar 21, 2000. doi:  10.1073/pnas.97.7.3288

Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap


Quantitative analyses of biological sequences generally proceed under the assumption that individual DNA or protein sequence elements vary independently. However, this assumption is not biologically realistic because sequence elements often vary in a concerted manner resulting from common ancestry and structural or functional constraints. We calculated intersite associations among aligned protein sequences by using mutual information. To discriminate associations resulting from common ancestry from those resulting from structural or functional constraints, we used a parametric bootstrap algorithm to construct replicate data sets. These data are expected to have intersite associations resulting solely from phylogeny. By comparing the distribution of our association statistic for the replicate data against that calculated for empirical data, we were able to assign a probability that two sites covaried resulting from structural or functional constraint rather than phylogeny. We tested our method by using an alignment of 237 basic helix–loop–helix (bHLH) protein domains. Comparison of our results against a solved three-dimensional structure confirmed the identification of several sites important to function and structure of the bHLH domain. This analytical procedure has broad utility as a first step in the identification of sites that are important to biological macromolecular structure and function when a solved structure is unavailable.

Quantitative analyses of biological sequences are the cornerstone for studies in bioinformatics and molecular evolution. Such analyses generally proceed assuming that the sites in individual DNA or protein sequences vary independently, i.e., amino acid replacements at site X occur independently of those at site Y (1). Biochemical and biophysical studies show this assumption is not biologically realistic because sequence elements often change in a concerted manner (2–6). Nonrandom associations among sites within sequences arise from at least three sources: (i) chance, (ii) common ancestry (= phylogeny), and (iii) structural or functional constraints. (For simplicity, associations resulting from structure and function are considered to be equivalent.) Effectively discriminating among these underlying causes facilitates understanding the origin and magnitude of associations observed among sites in biological sequences and clarifying the role of such associations in evolution.

The first step in resolving questions about the origins of associations among sequence elements is to generate replicate data sets that vary according to specific underlying evolutionary models. For biological sequences, the typical model components are a reconstructed phylogeny and a nucleotide or amino acid substitution matrix. These components are relevant because sequence diversity has been generated by a process of descent with modification from a common ancestor.

Historical associations between sequences are represented by the reconstructed phylogeny. The topology of the evolutionary tree specifies the cladistic relationships among sequences, whereas the branch lengths reflect the amount of change that has occurred among sequences. The specific changes that occur in the various sequences are summarized by the substitution matrix. This matrix can consist of uniform substitution probabilities [e.g., Jukes–Cantor model for DNA substitution (7)], be partially parameterized [Kimura two-parameter model for DNA (8)], or completely parameterized [Jones–Taylor–Thornton (9) substitution matrix for proteins]. In combination, the phylogeny and substitution matrix provide the parameters necessary to generate stochastic data having historical relationships and substitution classes reflecting specific conditions. The parametric bootstrap procedure (10–12) uses this data-generation algorithm to create replicate data sets that can be used to investigate the underlying properties of aligned biological sequences.
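The data-generation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in this study: it uses a toy four-letter alphabet with uniform (Jukes–Cantor-style) substitution probabilities rather than the JTT matrix, and the tree representation and the function names (`equal_rates_same_prob`, `evolve`, `simulate`, `parametric_bootstrap`) are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"  # toy alphabet (assumption); the study itself uses the 20 amino acids


def equal_rates_same_prob(t, k):
    """Probability that a site is unchanged after branch length t
    under a k-state model with uniform substitution probabilities."""
    return 1.0 / k + (1.0 - 1.0 / k) * math.exp(-k * t / (k - 1.0))


def evolve(seq, t, rng):
    """Evolve a sequence along one branch; every site changes independently."""
    p_same = equal_rates_same_prob(t, len(ALPHABET))
    out = []
    for ch in seq:
        if rng.random() < p_same:
            out.append(ch)
        else:
            # Uniform choice among the other symbols.
            out.append(rng.choice([a for a in ALPHABET if a != ch]))
    return "".join(out)


def simulate(tree, n_sites, rng):
    """Simulate one replicate alignment.

    tree is a nested tuple (name, branch_length, [children]).
    Returns {leaf_name: sequence}.
    """
    root_seq = "".join(rng.choice(ALPHABET) for _ in range(n_sites))
    leaves = {}

    def walk(node, parent_seq):
        name, t, children = node
        seq = evolve(parent_seq, t, rng)
        if not children:
            leaves[name] = seq
        for child in children:
            walk(child, seq)

    walk(tree, root_seq)
    return leaves


def parametric_bootstrap(tree, n_sites, n_reps, seed=0):
    """Generate n_reps replicate alignments from the same tree and model."""
    rng = random.Random(seed)
    return [simulate(tree, n_sites, rng) for _ in range(n_reps)]
```

Because every site is evolved independently down the same tree, any association between columns in the replicates can come only from shared history or chance, which is the property the method relies on. A tree would be passed as, for example, `("root", 0.0, [("A", 0.1, []), ("B", 0.2, [])])`.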

Herein, we describe a general analytical method based on parametric bootstrap simulations for the discrimination of intersite associations resulting from stochastic and phylogenetic sources from those resulting from structural and functional associations. When a general substitution matrix (i.e., one derived from a broad survey of protein sequences rather than the specific data set being analyzed) is used, data generated with the parametric bootstrap procedure will have intersite associations arising only from shared evolutionary history. Therefore, an intersite association statistic calculated for data sets generated by using the parametric bootstrap will reflect only associations among aligned sequence sites resulting from phylogeny or chance. From the distribution of this statistic, one can calculate a threshold value above which the statistic will have a specific probability of resulting from causes other than phylogeny. Comparison of association statistic values calculated for the empirical data alignment against this parametric bootstrap threshold allows identification of pairs of sites having a specific probability of interaction resulting from structure or function.

To demonstrate the utility of this approach, we analyzed a set of 237 sequences containing the basic helix–loop–helix (bHLH) DNA binding and dimerization domain. The bHLH proteins have a well-described structure and are represented by a large number of diverse sequences (13, 14). Having a well-defined three-dimensional structure permits direct comparison of the physical structure of the molecule with numeric data of intersite associations. Thus, sites of known functional and structural importance can be compared against the association statistics involving these sites. The availability of a large number of bHLH sequences increases confidence in the results by reducing the effect of spurious associations.


Sequence Alignment and Phylogeny Construction.

An alignment of 237 bHLH domains was generated by using CLUSTAL W (15) and improved by eye. A phylogenetic tree was then derived by using the neighbor-joining algorithm (16) with mean pairwise distances. We used a substitution matrix generated from a broad collection of protein sequences using the Jones–Taylor–Thornton (JTT) algorithm (9). As a consequence, our model for amino acid substitution is not unduly influenced by the idiosyncrasies of a particular protein family. Further, the resulting model has broad generality because the JTT algorithm accounts for the underlying phylogeny of the sequences when calculating the probability of change between amino acids. Thus, data generated from a random ancestral sequence using this general substitution matrix and a specified phylogeny should have either chance or phylogeny as their only sources of observed association.

Alternatively, one could use a substitution matrix derived from the specific data set being analyzed. To demonstrate the effect a substitution matrix of this type would have on the parametric bootstrap analysis, we used the rind program (17) to calculate a maximum-likelihood substitution matrix based on the bHLH protein sequences. It is expected that a matrix of this type would reflect the biases resulting from phylogeny, structure, and function that are inherent in the empirical data being analyzed.

Calculation of Intersite Associations.

The next step is to accurately estimate the magnitude of association between pairs of amino acid sites. Because sequence elements are symbol variables with no underlying metric, conventional statistical procedures for estimating correlation among sites cannot be used (14). Thus, intersite associations were estimated by using the mutual information statistic (MI) from information theory (18, 19). Mutual information measures the extent of association between two positions in a sequence beyond that expected by chance. The mutual information $MI_{XY}$ between sites X and Y is calculated as:

$$MI_{XY} = \sum_{i} \sum_{j} P(X_i, Y_j) \, \log_n \frac{P(X_i, Y_j)}{P(X_i)\,P(Y_j)} \qquad [1]$$

where $P(X_i)$ is the probability of i at site X, $P(Y_j)$ is the probability of j at site Y, and $P(X_i, Y_j)$ is the joint probability of i at site X and j at site Y (X ≠ Y). The double summation runs over all possible symbols at those sites. This formula has the property that when symbols vary independently [i.e., $P(X_i)P(Y_j) = P(X_i, Y_j)$], so that knowledge of j at site Y does not reduce the uncertainty of i at site X, the mutual information is zero.

The minimum MI value of 0 also occurs for invariant sites. Generally, the less variable a site is, the smaller its associated MI values will be. The maximum MI value will occur when the variation at two sites is perfectly correlated. Using a base-20 logarithm (n = 20 in Eq. 1, corresponding to the 20 peptide-forming amino acids) scales the maximum possible MI value to unity, which will occur when the residues at these sites are uniformly distributed. The maximum MI value will decline as the distribution of residues at each site departs from uniformity.
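As a minimal sketch of the calculation in Eq. 1, the following Python function computes MI for two alignment columns. It assumes that the probabilities are estimated as observed frequencies in the columns and that every character (including any gap symbol) is treated as an ordinary symbol; neither choice is specified in the text, and the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_x, col_y, base=20):
    """Mutual information between two alignment columns (Eq. 1).

    col_x and col_y are equal-length sequences of symbols, one per aligned
    sequence. Probabilities are estimated as observed frequencies (assumption).
    base=20 scales the maximum possible value to 1 for amino acid data.
    """
    n = len(col_x)
    if n == 0 or n != len(col_y):
        raise ValueError("columns must be non-empty and of equal length")

    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))

    mi = 0.0
    # Only observed symbol pairs contribute; unobserved pairs have P = 0.
    for (a, b), c in count_xy.items():
        p_ab = c / n
        p_a = count_x[a] / n
        p_b = count_y[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b), base)
    return mi
```

With this sketch, an invariant column gives a value of zero against any other column, and two perfectly correlated columns in which all 20 amino acids are equally frequent give a value of one (up to floating-point rounding), matching the limits described above.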

Results and Discussion

MI Distributions.

Fig. 1 provides inverted cumulative frequency distributions of MI values calculated for the alignment of 237 bHLH domains and 1,000 parametric bootstrap replicates calculated using two different types of substitution models. Inverted cumulative distributions are calculated by subtracting from unity the cumulative frequency within a particular range of MI values. In this way, one achieves a distribution that declines in value as the independent variable increases.
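The inverted cumulative distribution can be illustrated with a short Python sketch. The evaluation points (`cutoffs`) and the function name are hypothetical, and the binning used for the published figure is not specified here.

```python
def inverted_cumulative(values, cutoffs):
    """For each cutoff, return 1 minus the cumulative frequency of values
    less than or equal to that cutoff, i.e., the fraction of values above it."""
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    return [1.0 - sum(v <= c for v in values) / n for c in cutoffs]
```

Evaluated at increasing cutoffs, the returned list never increases, which is the declining shape described above.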

Figure 1
Inverse cumulative frequency distribution of MI values for the alignment of 237 bHLH protein sequences and 1,000 parametric bootstrap replicates using either the JTT substitution matrix or the rind substitution matrix. MI values were calculated by using ...

The inverted cumulative frequency distribution of MI values for the parametric bootstrap replicates is then used to calculate a threshold for acceptability of a false-positive result, as described in the Fig. 1 legend. Setting a statistical acceptability threshold permits the identification, within a quantifiable error, of those intersite associations most probably arising from structural/functional causes. For example, any pair of amino acid sites within the bHLH domain alignment having an MI value >0.188 has a probability of <0.01 of resulting from phylogeny or chance and, consequently, a >0.99 probability of reflecting an association resulting from structural/functional constraints. These probabilities are reduced and increased by an order of magnitude (0.001 and 0.999, respectively) for any pair of sites having MI >0.250. Because these MI values have been calculated using a base-20 logarithm, the maximum possible MI value is unity, although the largest MI value calculated for any pair of sites in the bHLH domain was 0.413. The sites having MI values >0.188 are presented in Table 1.
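One way to turn the bootstrap distribution into a threshold is sketched below in Python. It assumes that the MI values from all replicates are pooled and that the threshold is the smallest pooled value exceeded by at most a fraction alpha of them; the exact procedure is given in the Fig. 1 legend and may differ. The function names are hypothetical.

```python
import math


def mi_threshold(null_mi_values, alpha):
    """Smallest pooled bootstrap MI value that is exceeded by at most
    a fraction alpha of the pooled bootstrap values."""
    if not null_mi_values:
        raise ValueError("null_mi_values must be non-empty")
    ordered = sorted(null_mi_values)
    n = len(ordered)
    # Number of bootstrap values allowed to lie above the threshold.
    allowed = int(math.floor(alpha * n + 1e-9))
    index = max(n - allowed - 1, 0)
    return ordered[index]


def significant_pairs(empirical_mi, threshold):
    """Site pairs whose empirical MI exceeds the threshold.

    empirical_mi maps (site_x, site_y) tuples to MI values.
    """
    return sorted(pair for pair, mi in empirical_mi.items() if mi > threshold)
```

For instance, calling `significant_pairs` with the threshold 0.188 reported above would return only those site pairs whose empirical MI exceeds that value.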

Table 1
MI values calculated for 237 bHLH domains and arranged by site number

Comparison Against Three-Dimensional Structure.

To gauge the efficacy of this algorithm, we compared the sites presented in Table 1 with the solved three-dimensional structure of a representative bHLH domain. Crystal structure studies have been carried out on the bHLH domains of six proteins: Max (20), E47 (21), MyoD (22), USF (23), PHO4 (24), and SREBP (25). As the bHLH domains in these molecules all have the same general organization of a DNA-binding, predominantly basic α-helix (b), an amphipathic α-helix contiguous with the basic region (H1), a variable length loop, and a second α-helix (H2), we used the bHLH domain of the Max protein as our representative bHLH structure. All site numbers refer to the Max structure as presented by Ferre-D'Amare et al. (20).

Each turn in an α-helix requires approximately 3.6 residues. Therefore, residues that are seven sites apart will lie on the same face of the helix. Also, residues that are three or four sites apart will lie approximately above or below each other. In the initial α-helix (b/H1), site pairs (30, 37), (30, 44), (38, 45), (41, 48), and (42, 49) would be on the same face of the helix and have significant MI values. In this same region, site pairs (37, 41), (38, 41), (38, 42), (41, 44), (41, 45), (42, 45), (44, 47), (44, 48), and (45, 49) would be spatially adjacent in the helix and have significant MI values. In H2, the site pairs (61, 65), (62, 65), (65, 68), (65, 69), (68, 72), and (69, 72) are spatially adjacent and have significant MI values. Site pairs (61, 68), (62, 69), and (65, 72) are on the same face of the helix and have significant MI values. In both helical regions, many of the same sites are involved in these interactions separated by three, four, and seven residues, prompting speculation that these sites are important to helical integrity.
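The geometric reasoning behind these separations can be checked with a small calculation. At 3.6 residues per turn, each residue advances the helix by 360/3.6 = 100°. The Python sketch below (function name hypothetical) reports the angular offset between two residues around the helix axis for a given separation in sequence.

```python
def angular_offset(separation, residues_per_turn=3.6):
    """Angular offset in degrees, in the range (-180, 180], between two
    residues `separation` sites apart in an ideal α-helix."""
    angle = (separation * 360.0 / residues_per_turn) % 360.0
    return angle if angle <= 180.0 else angle - 360.0
```

Under this idealized geometry, a separation of seven sites gives an offset of about -20°, so the two residues point in nearly the same direction two turns apart. Separations of three and four sites give offsets of about -60° and +40°, which places those residues roughly above or below each other on adjacent turns.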

Ferre-D'Amare et al. (20) identified several sites having important interactions within the molecule, with the dimerization partner, or with the DNA recognition sequence. Sites 47 and 57, which have a significant association at P ≤ 0.003, were identified as being important to the stability of the loop conformation. Sites 70 and 71 were shown to be involved in several packing interactions. Many associations involving these two sites [(37, 70), (59, 70), (59, 71), (62, 71), and (65, 70)] were significant at 0.007 ≤ P ≤ 0.009. However, many of the sites involved in the specific packing interactions identified in ref. 20 did not have significant MI values because of the lack of variability at one or both of the sites.

Effect of an Alternative Substitution Model.

In any numerical simulation of a physical process, the validity of the results depends on the assumptions of the underlying models. For phylogenetic analyses, the results are dependent on the confidence one has that the tree is a realistic description of the history of the data being analyzed. The parametric bootstrap also depends on the tree as the source of information about the level and distribution of sequence variation. The residue substitution matrix specifies the probabilities of specific amino acid changes that occur between sequences in the simulation. Biases in this matrix can affect the potential associations measured in the resulting simulated sequences. However, a matrix having no biases (i.e., a matrix of identical substitution probabilities) would ignore the biology of the substitution process.

As seen in Fig. 1, the distribution of MI values generated using the parametric bootstrap with the rind substitution matrix is much more similar to the distribution of the empirical MI values than the distribution generated using the JTT substitution matrix. The MI values for the two statistical thresholds (P < 0.01 and P < 0.001) are increased to 0.359 and 0.408, respectively, for the rind matrix distribution. Although there are empirical MI values greater than these thresholds, several of the significant associations identified above have MI values that fall below the rind thresholds. This reduction in sensitivity is the result of the specificity of the rind substitution matrix to the bHLH sequence data, which guarantees that any biases because of structural and functional constraints on substitution will be incorporated into the substitution matrix. For this reason, any analyses incorporating constraints on the evolution of sites in biological sequences should use a substitution matrix derived from a broad sample of sequences.

The way in which structural and functional constraints act on the evolutionary process will influence the variation seen in existing molecular sequences. These influences will be incorporated into the reconstructed phylogeny by the algorithm used to derive it. This leads to a certain level of circularity in the use of the parametric bootstrap to partition sources of association. However, the existence of empirical values greater than reasonable statistical thresholds for acceptance of false-positive results, and the divergence of the empirical and bootstrapped JTT MI distributions, lead us to believe that the problem of circularity is not insurmountable.

Statistical Identification of Structurally and Functionally Important Sites.

We used the parametric bootstrap algorithm to construct a statistical distribution that reflects the associations between sites in a biological sequence exclusively resulting from a specific phylogeny (and chance). This distribution was then used to calculate a threshold, above which the calculated statistic should (with a specific probability) reflect structural and functional associations. Several sites identified from the solved three-dimensional structure as being important to bHLH domain structure and function were found to correlate with predictions based on MI values. It is possible that pairs of sites with values less than the threshold could be exhibiting associations resulting from structure or function. However, based on the distribution from the parametric bootstrap replicates, the level of confidence one would have in making this assertion would be reduced.

Using this parametric bootstrap-based algorithm to differentiate phylogenetic and chance associations from those resulting from structure and function will be quite useful for any sequence analysis that requires knowledge of higher-level structure. For example, in phylogenetic studies, this approach allows one to construct a character weighting scheme so that the resulting analysis more closely reflects the primary assumption of intersite independence. For molecular function analyses, the statistical threshold permits identification of sites of possible importance in site-directed mutagenesis analyses. For protein structural analyses, the statistical threshold allows the identification of sites important to structure without having a solved structure. Comparison against a solved structure could identify sites important to secondary or tertiary structure that may not be obvious by inspection of the solved structure.


We thank Jim Rosinski, Brian Rhees, Jason Lowry, and two anonymous reviewers for comments that have strengthened this work. We also thank Walter Fitch for his advice and assistance. This work was supported by National Institutes of Health Grant GM-5-46472 (to W.R.A.).


Abbreviations: bHLH, basic helix–loop–helix; MI, mutual information.


This paper was submitted directly (Track II) to the PNAS office.

Article published online before print: Proc. Natl. Acad. Sci. USA, 10.1073/pnas.070154797.

Article and publication date are at www.pnas.org/cgi/doi/10.1073/pnas.070154797


1. Swofford D L, Olsen G J, Waddell P J, Hillis D M. In: Molecular Systematics. 2nd Ed. Hillis D M, Moritz C, Mable B K, editors. Sunderland, MA: Sinauer; 1996. pp. 407–514.
2. Chelvanayagam G, Eggenschwiler A, Knecht L, Gonnet G H, Benner S A. Protein Eng. 1997;10:307–316.
3. Pollock D D, Taylor W R. Protein Eng. 1997;10:647–657.
4. Thompson M J, Goldstein R A. Proteins. 1996;25:28–37.
5. Gobel U, Sander C, Schneider R, Valencia A. Proteins Struct Funct Genet. 1994;18:309–317.
6. Taylor W R, Hatrick K. Protein Eng. 1994;7:341–348.
7. Jukes T H, Cantor C. In: Mammalian Protein Metabolism. Munro H N, editor. Vol. 3. New York: Academic; 1969. pp. 21–132.
8. Kimura M. J Mol Evol. 1980;16:111–120.
9. Jones D T, Taylor W R, Thornton J M. Comput Appl Biosci. 1992;8:275–282.
10. Efron B, Tibshirani R J. An Introduction to the Bootstrap. New York: Chapman & Hall; 1993.
11. Goldman N. J Mol Evol. 1993;36:182–198.
12. Huelsenbeck J P, Hillis D M, Jones R. In: Molecular Zoology: Advances, Strategies, and Protocols. Ferraris J D, Palumbi S R, editors. New York: Wiley–Liss; 1996. pp. 19–45.
13. Atchley W R, Fitch W M. Proc Natl Acad Sci USA. 1997;94:5172–5176.
14. Atchley W R, Terhalle W, Dress A W. J Mol Evol. 1999;48:501–516.
15. Thompson J D, Higgins D G, Gibson T J. Nucleic Acids Res. 1994;22:4673–4680.
16. Saitou N, Nei M. Mol Biol Evol. 1987;4:406–425.
17. Bruno W. Mol Biol Evol. 1996;13:1368–1374.
18. Shannon C, Weaver W. The Mathematical Theory of Communication. Urbana, IL: Univ. of Illinois Press; 1949.
19. Applebaum D. Probability and Information: an Integrated Approach. New York: Cambridge Univ. Press; 1996.
20. Ferre-D'Amare A R, Prendergast G C, Ziff E B, Burley S K. Nature (London) 1993;363:38–45.
21. Ellenberger T, Fass D, Arnaud M, Harrison S C. Genes Dev. 1994;15:970–980.
22. Ma P C, Rould M A, Weintraub H, Pabo C O. Cell. 1994;77:451–459.
23. Ferre-D'Amare A R, Pognonec P, Roeder R G, Burley S K. EMBO J. 1994;13:180–189.
24. Shimizu T, Toumoto A, Ihara K, Shimizu M, Kyogoku Y, Ogawa N, Oshima Y, Hakoshima T. EMBO J. 1997;16:4689–4697.
25. Parraga A, Bellsolell L, Ferre-D'Amare A R, Burley S K. Structure. 1998;6:661–672.
