Copyright © 2005, The National Academy of Sciences

Evolution

Evolutionary population genetics of promoters: Predicting binding sites and functional phylogenies

Institut für Theoretische Physik, Universität zu Köln, Zülpicherstrasse 77, 50937 Cologne, Germany

* To whom correspondence should be addressed. E-mail: lassig/at/thp.uni-koeln.de.

Edited by Tomoko Ohta, National Institute of Genetics, Mishima, Japan

Received June 30, 2005; Accepted August 8, 2005.

Abstract

We study the evolution of transcription factor-binding sites in prokaryotes, using an empirically grounded model with point mutations and genetic drift. Selection acts on the site sequence via its binding affinity to the corresponding transcription factor. Calibrating the model with populations of functional binding sites, we verify this form of selection and show that typical sites are under substantial selection pressure for functionality: for cAMP response protein sites in Escherichia coli, the product of fitness difference and effective population size takes values 2NΔF of order 10. We apply this model to cross-species comparisons of binding sites in bacteria and obtain a prediction method for binding sites that uses evolutionary information in a quantitative way. At the same time, this method predicts the functional histories of orthologous sites in a phylogeny, evaluating the likelihood for conservation or loss or gain of function during evolution. We have performed, as an example, a cross-species analysis of E. coli, Salmonella typhimurium, and Yersinia pseudotuberculosis. Detailed lists of predicted sites and their functional phylogenies are available.

Regulatory interactions between genes are believed to provide an important mode of evolution, which accounts for a substantial part of the differentiation between species (1).
This is reflected by the sequence variability of regulatory DNA: there is ample case evidence of compensatory evolution at conserved function but also of rapid functional changes even between closely related species (2). Lacking a quantitative model of regulatory evolution, however, alignments of regulatory sequences and predictions of their functionality have proven notoriously difficult. A large body of existing work has focused on the identification of transcription factor-binding sites as the main functional elements of regulatory DNA. For factors with known binding specificity (given in the form of a position weight matrix), putative binding sites are identified from their conservation in cross-species comparisons. Different measures of conservation have been introduced, which involve, e.g., the sequence similarity of aligned loci or their independent high scoring in all species compared (3–7). These methods are powerful prediction tools for binding sites. From an evolutionary point of view, however, the conservation criteria are heuristic. Hence, it is difficult to quantify the statistical significance of the results, which depends on the number and evolutionary distance of the species compared. Sequence conservation tends to be too restrictive in cases where substantial sequence variation is compatible with the position weight matrix, in particular for distant species. Simple sequence similarity measures implicitly assume neutral evolution, whereas independent scoring of orthologous sites ignores the evolutionary link between the species altogether. Most importantly, none of these conservation measures allows a consistent statistical treatment of functional innovations in the evolution of binding sites. A more explicit model for regulatory DNA should address two issues: How does the sequence divergence between species depend on their evolutionary distance, and how does the specific biological function of binding sites enter? 
Answering these questions is a considerable task for experiment and theory, which must link the biophysics of binding sites with their population dynamics. In particular, it involves quantifying the selection by which the sequence evolution at functional loci is distinguished from that of neutral background DNA. As an important step in this direction, the notion of a fitness landscape for binding-site sequences has been introduced, where the fitness of a site depends on the binding energy of the corresponding factor (8). The evolutionary importance of the binding energy has also been highlighted in ref. 9, where it was shown that nucleotide substitution rates within functional sites in Escherichia coli depend on the energy difference induced by the substitution as predicted from the position weight matrix. The biophysics of factor-DNA binding imposes stringent constraints on the form of the fitness landscape (10) and has important consequences for bioinformatic binding site searches (11). Using such fitness landscapes, we have introduced a stochastic evolution model for functional loci, which is based on Kimura–Ohta point substitutions with rates governed by the fitness difference between the corresponding sequence states (12, 13). This model demonstrates the possibility of rapid adaptive formation of binding sites under positive selection and provides evolutionary constraints on eukaryotic promoter architecture. A similar evolutionary model (14) underlies a recently introduced method to identify conserved binding sites in multiple alignments (15). In this paper, we develop a quantitative evolutionary rationale for the cross-species analysis of regulatory sequences, which goes beyond the mere prediction of binding sites. For aligned regulatory DNA of orthologous genes, our method predicts sites together with their functional evolution. The method is based on the evolution model of refs. 
12 and 13 and uses a bioinformatic measurement of selection pressures for functionality, which is obtained from sequence data of verified functional sites. Typical functional loci for pleiotropic factors, as exemplified by the cAMP response protein (CRP) family in E. coli, are found to be under substantial selection, in contrast to nonfunctional loci, which evolve neutrally. For families of aligned loci, our method assigns likelihood values to different modes of evolution and associates them with functional histories: (i) neutral evolution of nonfunctional loci, (ii) evolution of functional loci under time-independent selection, and (iii) evolution under time-dependent selection, corresponding to loss or gain of function along a given branch of the phylogeny.

Theory

Evolution Models for Nonfunctional Sequence and Functional Loci. We consider genomic loci a = (a1, ..., al) consisting of l contiguous nucleotides and elementary substitution processes a → b, where a, b are any two sequence states differing by exactly one nucleotide. For nonfunctional (background) sequences, we use uniform nucleotide substitution rates μa→b depending on the nucleotide to be mutated and on its nearest sequence neighbors (16). Models of this type are neutral with respect to factor binding and have been shown to provide a good description of intergenic background DNA in E. coli (11).

A locus is defined as functional if binding of the corresponding factor at that locus affects the regulation of a gene. Functional loci are assumed to be under selection. This is described by a (Malthusian) fitness function F(a), which measures the contribution of a genotype a to the growth rate of the number of individuals carrying that genotype (and is therefore defined only up to an additive constant, the genotype-independent fitness).
Notice that this definition of a functional locus is weaker than that of a functional binding site, which is a functional locus with a sequence state a that is likely to actually bind the factor. A functional locus can lose its binding sequence due to deleterious mutations, and conversely, a nonfunctional locus can become a spurious binding site. According to the Kimura–Ohta theory (17–19), selection leads to modified substitution rates at functional loci,
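For orientation, the rate modification referred to here can be sketched numerically. The snippet assumes the standard Kimura–Ohta form for weak selection, in which the neutral rate μ is multiplied by x / (1 - exp(-x)) with x = 2N[F(b) - F(a)]. This form is consistent with the score identity log(Q/P0) = 2NΔF used later in the paper, but the function name and argument names are illustrative assumptions, not the authors' notation.

```python
import math


def substitution_rate(mu: float, two_n_delta_f: float) -> float:
    """Substitution rate a -> b under the assumed Kimura-Ohta form.

    mu            -- neutral substitution rate for a -> b
    two_n_delta_f -- scaled fitness difference x = 2N [F(b) - F(a)]
    """
    x = two_n_delta_f
    if abs(x) < 1e-8:
        # Neutral limit: x / (1 - exp(-x)) tends to 1 as x tends to 0.
        return mu
    if x < -700.0:
        # Strongly deleterious changes: the rate is numerically zero,
        # and exp(-x) would overflow a double.
        return 0.0
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small |x|.
    return mu * x / -math.expm1(-x)


if __name__ == "__main__":
    for x in (-10.0, 0.0, 10.0):
        print(x, substitution_rate(1.0, x))
```

With 2NΔF of order 10, the value reported for CRP sites, a substitution that gains this fitness difference proceeds at roughly 10 times the neutral rate, whereas the reverse substitution is slower than neutral by more than three orders of magnitude. The ratio of forward to backward rates is exp(2NΔF), which is what makes the stationary distribution a reweighted background distribution.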
Stationary Population Distributions and Evolutionary Scoring. For background sequences, we use a stationary distribution of the form (11)
If the background distribution is approximated by a factorized form,
The binding probability, however, is a strongly nonlinear function of the energy. Hence, the fitness effect of a substitution at one position depends on the nucleotides present at all other positions. This induces correlations between nucleotide frequencies at any two positions within functional loci (13), in addition to the short-ranged correlations already present in background sequences. These correlations prevent the factorization of the distributions P0(a) and Q(a). However, because the fitness F(a) depends on the sequence state a only via the binding energy E(a), we can project these distributions on the energy E as independent continuous variables, summing over all sequence states a with an (approximately) equal value of E. Denoting the projected ensembles for simplicity with the same symbols P0 and Q, Eq. 3 takes the form
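The projected relation can be written out as a sketch. It assumes that the stationary ensemble of functional loci is the background ensemble reweighted by an exponential factor in the scaled fitness, which is the form consistent with the identity log(Q/P0) = 2NΔF quoted in the Discussion. The normalization constant Z is a symbol introduced here for illustration.

```latex
Q(E) \;=\; \frac{1}{Z}\, P_0(E)\, \exp\!\bigl[\,2N F(E)\,\bigr],
\qquad
Z \;=\; \int \! dE \; P_0(E)\, \exp\!\bigl[\,2N F(E)\,\bigr],
\qquad\text{so that}\qquad
\log\frac{Q(E)}{P_0(E)} \;=\; 2N F(E) \;-\; \log Z .
```

In this form, the log ratio of the two energy distributions determines the fitness landscape up to an additive constant, which matches the statement that F is defined only up to a genotype-independent term.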
It is this simplification that enables us to infer the functional form of these distributions from bioinformatic frequency counts. The total distribution of energies in the noncoding part of the genome is
From a bioinformatic point of view, this is a hidden Markov model for the sequence composition of noncoding DNA. The two alternative distributions P0(E) and Q(E) have prior probabilities 1 - λ and λ, i.e., the parameter λ measures the overall fraction of the genome covered by functional loci. The relative likelihood between the distributions Q and P0 is described by the score function S(E) = log[Q(E)/P0(E)]. Comparing with Eq. 6, we obtain the identification
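The two-component description above lends itself to a short numerical sketch. It assumes the total energy distribution is the mixture (1 - λ)P0(E) + λQ(E) implied by the stated priors, and it computes the posterior probability that a locus is functional from its score S(E) = log[Q(E)/P0(E)]. The function names and the value of λ in the usage lines are hypothetical.

```python
import math


def score(q: float, p0: float) -> float:
    """Log-likelihood score S = log(Q / P0) for one energy value."""
    return math.log(q / p0)


def prob_functional(s: float, lam: float) -> float:
    """Posterior probability of the functional component.

    Equals lam * Q / ((1 - lam) * P0 + lam * Q), rewritten in terms of
    the score s = log(Q / P0) and the prior fraction lam.
    """
    prior_log_odds = math.log(lam / (1.0 - lam))
    return 1.0 / (1.0 + math.exp(-(s + prior_log_odds)))


if __name__ == "__main__":
    lam = 0.01  # hypothetical prior fraction of functional loci
    for s in (0.0, 5.0, 10.0):
        print(s, prob_functional(s, lam))
```

Because functional loci are rare (small λ), a locus needs a score well above zero before it is more likely functional than not: the posterior reaches 1/2 only at S = log[(1 - λ)/λ].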
Time-Dependent Distributions and Cross-Species Scores. It is straightforward to generalize the probabilistic analysis to pairs of species separated by an evolutionary time t. Defining the conditional transition probabilities
The ensembles
Time-Dependent Selection and Functional Switching. Here we generalize the evolutionary model to include loss or gain of function at the level of individual loci. Consider a rooted phylogeny consisting of two species at evolutionary distances t1 and t2 from their last common ancestor, i.e., at distance t = t1 + t2 from each other. We assume that an initially functional locus can lose function at a small rate ν- (with ν-t << 1), and conversely, an initially nonfunctional locus can gain function at a comparable rate ν+ (such that the average fraction λ of functional loci is maintained). There are now four alternative evolutionary histories involving at most one functional switch: evolution under time-independent neutrality or selection, and time-dependent selection leading to functionality at t1 and nonfunctionality at t2, or vice versa (see Fig. 2).
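The condition that the two switching rates maintain the functional fraction λ can be made explicit. As a sketch, assume that the functional status of a locus behaves as a two-state Markov process with loss rate ν- and gain rate ν+. The fraction λ is then stationary when gains balance losses:

```latex
\frac{d\lambda}{dt} \;=\; \nu_{+}\,(1-\lambda) \;-\; \nu_{-}\,\lambda \;=\; 0
\qquad\Longrightarrow\qquad
\nu_{+} \;=\; \nu_{-}\,\frac{\lambda}{1-\lambda} .
```

Under this assumption the per-locus gain rate is smaller than the loss rate by the factor λ/(1 - λ) whenever functional loci are rare, while the total numbers of gains and losses per unit time across the genome are equal.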
Site Prediction and Quality Measures. For a given pair of aligned loci with energies (E1, E2), the hidden Markov model (Eq. 14) determines the probability of belonging to each of its four ensembles. The probability of conserved functionality is
The corresponding probabilities
Analogous definitions apply to the single-species case.

Results

Selection Pressure for CRP Sites in E. coli. Scanning the genome of E. coli (obtained from the NCBI database, accession no. NC_000913) produces sequence counts of n = 520,729 loci in 4,244 intergenic regions. We use a relatively large window size of l = 22, taking into account core binding motifs as well as informative flanking positions. Our CRP position weight matrix qi(a) (i = 1, ..., l; a = A, C, G, T) is obtained from 48 experimentally verified binding sites in the DPInteract database (23). For each sequence count a = (a1, ..., al), we hence infer the CRP-binding energy E(a) from Eq. 5 (in units of ε0 and with E = 0 set to the point of maximal binding). The resulting energy histogram is shown in Fig. 3a, and the scaled fitness landscape 2N(F(E) - F0) is shown as the orange line in Fig. 3b.
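The scanning step can be illustrated with a small sketch. It assumes the common additive energy model in which each position contributes -log[qi(a)/p0(a)] relative to a background frequency p0(a), shifted so that the best-binding sequence has E = 0. The toy three-position matrix, the uniform background, and all function names are hypothetical; a real matrix would need pseudocounts so that no frequency is zero, and this sketch scans one strand only.

```python
import math
from typing import Dict, List, Sequence

Column = Dict[str, float]


def energy_matrix(pwm: Sequence[Column], background: Column) -> List[Column]:
    """Per-position energies, shifted so the best nucleotide has energy 0."""
    matrix = []
    for column in pwm:
        eps = {nt: -math.log(column[nt] / background[nt]) for nt in "ACGT"}
        best = min(eps.values())
        matrix.append({nt: value - best for nt, value in eps.items()})
    return matrix


def binding_energy(site: str, matrix: Sequence[Column]) -> float:
    """Additive binding energy of one site of length len(matrix)."""
    return sum(column[nt] for column, nt in zip(matrix, site))


def scan(region: str, matrix: Sequence[Column]) -> List[float]:
    """Energies of all overlapping windows in one intergenic region."""
    length = len(matrix)
    return [
        binding_energy(region[start:start + length], matrix)
        for start in range(len(region) - length + 1)
    ]


if __name__ == "__main__":
    toy_pwm = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    ]
    uniform = {nt: 0.25 for nt in "ACGT"}
    energies = scan("TTACGTT", energy_matrix(toy_pwm, uniform))
    print(energies)
```

Collecting the energies of all windows in all intergenic regions gives the kind of energy histogram described in the text, from which the two-component model is then fitted.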
With this fitness landscape, the hidden Markov model (Eq. 7) thus gives an excellent description of the CRP-binding energy statistics in intergenic DNA of E. coli. For an individual locus, the model predicts the probability ρQ(E) of being functional, given its binding energy E. This probability is indicated by the color shading in Fig. 3a.

Evolution Between E. coli and Salmonella typhimurium. The Salmonella genome is also obtained from NCBI (accession no. NC_003197). Our alignment of the two genomes contains 135,534 pairs of loci in well aligned intergenic regions flanked by orthologous genes (for details, see Supporting Text and refs. 24 and 25). The average identity between aligned sequences is 93%, which measures the evolutionary distance t between the two species. The CRP-binding energies E1 in E. coli and E2 in S. typhimurium are inferred using the same position weight matrix, which is justified because the factor itself is highly conserved (3, 9). The resulting dot plot of energy pairs (E1, E2) is shown in Fig. 4a.
Fig. 4b.

Functional Histories. For an individual pair of aligned loci with energies (E1, E2), the hidden Markov model (Eq. 14) predicts the probabilities of conserved neutrality and conserved function. A fraction of the functional binding sites in one species is predicted to lose binding ability during evolution due to deleterious mutations, although their loci remain functional, i.e., under conserved selection. From E. coli to Salmonella or vice versa, we estimate this fraction to be ≈5%, by using the energy transition probabilities.

We have extended our analysis to include a third species, Yersinia pseudotuberculosis (NCBI accession no. NC_006155). Dot plots of energies (E1, E2, E3) for triplets of aligned loci and their probabilistic scoring are reported in Supporting Text and Figs. 6 and 7, which are published as supporting information on the PNAS web site. As expected, we find a further improvement of the detection error tradeoff for prediction of loci with conserved function; see Fig. 5. This is due in part to the alignment, which introduces a bias toward conserved loci. We also find candidate loci with loss or gain of function, such as the fourth malE-malK locus in E. coli, which we predict to be nonfunctional in both S. typhimurium and Y. pseudotuberculosis. Three other verified sites in that region are conserved in all three species. Similar candidates for functional switches are the second verified binding sites for dadAX, tsx, and araB.

Discussion

Binding Sites in Bacteria Evolve Under Substantial Selection. Our evolutionary model is based on stochastic substitutions in the space of sequences (a1, ..., al) of binding loci. These loci are treated as coherent population genetic units, taking into account that the evolution of any two nucleotides within a functional locus is correlated (13). This differs from standard bioinformatic approaches such as position weight matrices, which assume the nucleotides ai to be independent.
Our in silico measurement of the selection pressure is based on the sequence ensemble Q of functional loci and its background counterpart P0. Assuming that functional loci evolve under selection and background loci evolve neutrally, the log-likelihood score of these ensembles determines the effective fitness difference of sequence states at these two kinds of loci: S = log(Q/P0) = 2NΔF, where N is the effective population size. For CRP loci in E. coli, we obtain effective fitness differences 2NΔF of order 10 between strong factor binding and no binding. Because our method involves ensembles, this is an order-of-magnitude estimate for typical loci, which does not exclude that some sites may be under substantially stronger or weaker selection. We note, however, that our estimate also predicts the correct amount of energy conservation for functional loci found in our cross-species analysis.

A substantial level of selection explains well known evolutionary characteristics of regulatory sequences (2): they may be well conserved between distant species (if under constant selection for functionality) but can also show considerable variation even between closely related species (if under positive selection for change). At the level of selection found, binding-site gain by rapid adaptive evolution as discussed in ref. 13 is indeed possible. On the other hand, conservation will not be complete even under selection. A certain fraction (increasing with evolutionary distance) of initially functional sites will be lost because of deleterious mutations. This opens the possibility of compensatory changes involving different loci, as observed in ref. 27. It also indicates that the theory of promoter evolution should not stop at the level of individual binding sites. Selection couples not only the nucleotides within one locus but also the evolutionary fate of different loci.
Understanding the long-term dynamics of regulation ultimately requires a consistent population-genetic theory of entire promoters.

Improving Binding Site Searches Requires a Quantitative Evolutionary Rationale. The difficulty of predicting functional sites from their binding score in a single species is well known and has been called the “futility theorem” in ref. 28. It is caused by the coexistence of functional and nonfunctional loci in the twilight region of marginal binding. In the framework of our probabilistic model, this is quantified by tradeoff curves between false positives and false negatives. What is a computational dilemma may, however, reflect evolutionary design. If a sufficient reservoir of marginal binding seeds is present even in background sequences, a fully functional site can form by rapid adaptation as a response to new demand, ensuring the evolvability of regulatory interactions (13).

To overcome the futility theorem, we have introduced here a quantitative method that incorporates evolutionary information into binding-site searches. At the core of our model are the cross-species energy transition probabilities.

We have applied the method to comparative analysis of three bacterial species. We find a substantial improvement of the predictive quality already at the level of two species and a further improvement for three species. This confirms the results of a recent study of conserved sites in several Saccharomyces species, where the significance of the evolutionary information as a function of evolutionary distances is discussed in detail (15). Of course, elementary evolutionary steps other than point mutations are expected to become important in eukaryotes. Nevertheless, our general rationale of inferring selection pressures from site frequencies should remain applicable.

Putative Regulatory Innovations in Bacterial Phylogenies Can Be Traced by Comparative Sequence Analysis. Previous approaches have focused on the conservation of regulatory sequences as a sign of their functionality. Here we aim at a more comprehensive view, which incorporates functional changes into a quantitative statistical model. We emphasize again that in the presence of selection, there are two conceptually different modes of change: binding sites can lose or gain functionality due to deleterious or beneficial mutations at constant selection for binding, or they can respond to changes in the selection itself. Our model distinguishes these modes statistically by their energy transition probabilities and thus builds functional phylogenies for specific loci (as exemplified in Fig. 2).

In our comparative analysis of bacterial species, we find a large number of loci predicted to have conserved function but also some cases with evidence for gain or loss of function. It has been shown that changes in the gene regulation of orthologous genes can lead to phenotypic differences between E. coli and S. typhimurium (29). With caveats due to uncertainties in the rates of loss or gain of function, our findings provide at least a starting point for further targeted cross-species experiments. We can thus begin to quantify the role of regulatory innovations in molecular evolution.
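The tradeoff between false positives and false negatives mentioned in the Discussion can be sketched in a few lines. The snippet assumes that each locus carries a model posterior probability of being functional and that the expected error counts are computed from these posteriors alone, without a labeled test set. This is an illustrative construction with hypothetical names, not necessarily the quality measure defined in the paper.

```python
from typing import List, Sequence, Tuple


def tradeoff_curve(posteriors: Sequence[float]) -> List[Tuple[int, float, float]]:
    """Expected errors when the k top-ranked loci are called functional.

    Returns one tuple (k, expected false positives, expected false
    negatives) for each k = 1 .. len(posteriors).
    """
    ranked = sorted(posteriors, reverse=True)
    expected_functional = sum(ranked)
    curve = []
    false_pos = 0.0
    true_pos = 0.0
    for k, p in enumerate(ranked, start=1):
        # A called locus is a false positive with probability 1 - p.
        false_pos += 1.0 - p
        true_pos += p
        # Expected functional loci that were not called.
        false_neg = expected_functional - true_pos
        curve.append((k, false_pos, false_neg))
    return curve


if __name__ == "__main__":
    for row in tradeoff_curve([0.9, 0.5, 0.1]):
        print(row)
```

Posteriors that are pushed closer to 0 or 1 by additional species move the whole curve toward the origin, which is one way to read the reported improvement from one to two to three species.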
Acknowledgments

We thank Johannes Berg and Ulrich Gerland for a critical reading of the manuscript. V.M. acknowledges support through the STIPCO European network, Contract HPRN-CT-2002-00319 (grant to M.L.).

Notes

Author contributions: V.M. and M.L. designed research, performed research, analyzed data, and wrote the paper.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviation: CRP, cAMP response protein.

References

1. Ptashne, M. & Gann, A. (2002) Genes and Signals (Cold Spring Harbor Lab. Press, Woodbury, NY).
2. Wray, G. A., Hahn, M. W., Abouheif, H., Balhoff, J. P., Pizer, M., Rockman, M. V. & Romano, R. A. (2003) Mol. Biol. Evol. 20, 1377-1419.
3. Rajewsky, N., Socci, N. D., Zapotocky, M. & Siggia, E. D. (2002) Genome Res. 12, 298-308.
4. McCue, L. A., Thompson, W., Carmack, C. S. & Lawrence, C. E. (2002) Genome Res. 12, 1523-1532.
5. Lenhard, B., Sandelin, A., Mendoza, L., Engström, P., Jareborg, N. & Wasserman, W. W. (2003) J. Biol. 2, 13.
6. Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., Waterston, R., Cohen, B. A. & Johnston, M. (2003) Science 301, 71-76.
7. Kellis, M., Patterson, N., Endrizzi, M., Birren, B. & Lander, E. S. (2003) Nature 423, 241-254.
8. Gerland, U. & Hwa, T. (2002) J. Mol. Evol. 55, 386-400.
9. Brown, C. T. & Callan, C. G., Jr. (2004) Proc. Natl. Acad. Sci. USA 101, 2404-2409.
10. Gerland, U., Moroz, D. & Hwa, T. (2002) Proc. Natl. Acad. Sci. USA 99, 12015-12020.
11. Djordjevic, M., Sengupta, A. M. & Shraiman, B. I. (2003) Genome Res. 13, 2381-2390.
12. Berg, J. & Lässig, M. (2003) Biophysics (Moscow) 48, Suppl. 1, 36-44.
13. Berg, J., Willmann, S. & Lässig, M. (2004) BMC Evol. Biol. 4, 42.
14. Halpern, A. L. & Bruno, W. J. (1998) Mol. Biol. Evol. 15, 910-917.
15. Moses, A. M., Chiang, D. Y., Pollard, A. P., Iyer, N. I. & Eisen, M. B. (2004) Genome Biol. 5, R98.
16. Arndt, P. & Hwa, T. (2005) Bioinformatics 21, 2322-2328.
17. Kimura, M. (1962) Genetics 47, 713-719.
18. Kimura, M. & Ohta, T. (1969) Genetics 61, 763-771.
19. Ohta, T. & Tachida, H. (1990) Genetics 126, 219-229.
20. Berg, O. & von Hippel, P. (1987) J. Mol. Biol. 193, 723-750.
21. Fields, D., He, Y., Al-Uzri, A. & Stormo, G. (1997) J. Mol. Biol. 271, 178-194.
22. Stormo, G. D. & Fields, D. S. (1998) Trends Biochem. Sci. 23, 109-113.
23. Robison, K., McGuire, A. M. & Church, G. M. (1998) J. Mol. Biol. 284, 241-254.
24. Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J., Higgins, D. G. & Thompson, J. D. (2003) Nucleic Acids Res. 31, 3497-3500.
25. Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. (1998) Biological Sequence Analysis (Cambridge Univ. Press, Cambridge, U.K.).
26. Egan, J. P. (1975) Signal Detection Theory and ROC Analysis (Academic, New York).
27. Ludwig, M. Z. & Kreitman, M. (1995) Mol. Biol. Evol. 12, 1002-1011.
28. Wasserman, W. & Sandelin, A. (2004) Nat. Rev. Genet. 5, 276-287.
29. Winfield, M. D. & Groisman, E. A. (2004) Proc. Natl. Acad. Sci. USA 101, 17162-17167.