Copyright © 2006 by the Genetics Society of America

Population Genetics of Translational Robustness

*Section of Integrative Biology and Center for Computational Biology and Bioinformatics, University of Texas, Austin, Texas 78712 and †Program in Computation and Neural Systems, California Institute of Technology, Pasadena, California 91125

1Corresponding author: Integrative Biology, University of Texas, 1 University Station, C0930, Austin, TX 78712. E-mail: cwilke/at/mail.utexas.edu

Communicating editor: N. S. Wingreen

Received September 21, 2005; Accepted February 15, 2006.

Abstract

Recent work has shown that expression level is the main predictor of a gene's evolutionary rate and that more highly expressed genes evolve more slowly. A possible explanation for this observation is selection for proteins that fold properly despite mistranslation, in short, selection for translational robustness. Translational robustness leads to the somewhat paradoxical prediction that highly expressed genes are extremely tolerant to missense substitutions but nevertheless evolve very slowly. Here, we study a simple theoretical model of translational robustness that allows us to gain analytic insight into how this paradoxical behavior arises.

The increasing availability of whole-genome sequences from many different species has revealed a surprising fact: Different genes within the same organism evolve at dramatically different rates. For example, the evolutionary rates of the fastest- and slowest-evolving genes in Saccharomyces cerevisiae are separated by three orders of magnitude (Drummond et al. 2005). Because the dominant force shaping genomewide patterns of evolutionary rate is most likely stabilizing selection, the evolutionary rates of genes should correlate with quantities that measure how important or otherwise constrained a gene is (Kimura 1983; Ohta 1992).
A wide array of such quantities have been proposed, shown to correlate with evolutionary rate, and subsequently disputed. Examples include a gene's dispensability or essentiality (Hurst and Smith 1999; Hirsh and Fraser 2001; Jordan et al. 2002; Pal et al. 2003; Zhang and He 2005; Wall et al. 2005), its number of interaction partners (Fraser et al. 2002; Bloom and Adami 2003; Jordan et al. 2003; Hahn et al. 2004; Agrafioti et al. 2005), its length (Marais and Duret 2001), or its centrality in the protein interaction network (Hahn and Kern 2005). However, it seems that most importantly, the expression level (Pal et al. 2001; Rocha and Danchin 2004), or perhaps more accurately the frequency of translation events (Drummond et al. 2005, 2006), influences evolutionary rate. Drummond et al. (2005) have recently introduced a theory for why highly expressed genes evolve slowly. Translation is error prone, and inactivated or misfolded proteins resulting from mistranslation impose substantial costs on the cell (Bucciantini et al. 2002), costs that increase with expression level. One way in which the cost associated with a highly expressed gene is reduced is translational accuracy (Akashi 1994, 2001), whereby the gene is encoded with optimal codons whose translation is less error prone than the translation of other codons. Translational accuracy can explain why the rate of synonymous evolution dS is correlated with expression level, codon adaptation index, or protein abundance. However, it cannot explain why the rate of nonsynonymous evolution dN shows an even stronger correlation with these quantities (Drummond et al. 2005). Selection for translational accuracy can reduce the translational error rate by a factor of 4–9 (Precup and Parker 1987), but even optimally coded genes that are highly expressed may produce a large amount of erroneous polypeptides. Therefore, Drummond et al. 
(2005) suggest that highly expressed genes should be under additional selection to be tolerant to translation errors, that is, the polypeptides produced from these genes should fold properly even if they were erroneously translated. Recent work on protein biochemistry has shown that proteins can differ widely in their tolerance to missense substitutions and that properly chosen point mutations can dramatically increase the tolerance of a protein to additional substitutions (Bloom et al. 2005). Drummond et al.'s hypothesis, referred to as selection for translational robustness, predicts a constraint on the nonsynonymous rate of evolution, whereas selection for translational accuracy predicts primarily a constraint on the synonymous rate of evolution. We must assume that both selective constraints will operate on genes that are frequently translated.

Genes that are highly tolerant to translational missense errors must also, by definition, be highly tolerant to missense substitutions. However, the translational robustness hypothesis predicts that these genes will nevertheless be strongly conserved under evolution. An example of a gene that exhibits this paradoxical behavior is Rubisco, an extremely abundant protein that fixes carbon dioxide during photosynthesis. Rubisco is strongly conserved across phyla, but appears to tolerate most missense substitutions in the laboratory without loss of fold (Spreitzer 1993; Kellogg and Juliano 1997). The purpose of this article is to put the translational robustness hypothesis into precise mathematical terms and to demonstrate how highly expressed genes can evolve to be tolerant to missense substitutions and yet remain strongly conserved under evolution.

MATERIALS AND METHODS

Model: We consider the evolution of a gene encoding a protein of length L. Each site in the protein can be in one of two states, neutral or nonneutral. We denote the number of neutral sites in the protein by n (see Table 1 for definitions of variables).
A mutation at a neutral site of a folded protein leaves the protein folded, but changes the site from neutral to nonneutral. A mutation at a nonneutral site of a folded protein will usually cause misfolding and consequent loss of function, but with a small probability α, such a mutation will not affect folding but turn the site into a neutral one. For simplicity, we assume that once an amino acid sequence loses the ability to fold, it is impossible to mutate it back into a folded state. The rationale behind this assumption is that the likelihood of a mutation restoring fold to an unfolded amino acid sequence is so low as to be negligible. Our model is a reasonable abstraction of a thermodynamic framework of protein evolution that has recently been shown to have good predictive power for both simulated and real proteins (Bloom et al. 2005; Wilke et al. 2005). The key insight of this framework is that a protein's tolerance to substitutions is closely related to the protein's stability—more stable proteins can withstand more missense substitutions—and that therefore proteins can change from being highly fragile to being highly robust to mutations and vice versa through the accumulation of stabilizing or destabilizing mutations. In this sense, a mutation from a nonneutral to a neutral site in our model corresponds to a stabilizing mutation, and the opposite mutation corresponds to a destabilizing mutation. Thus, our model captures the following key aspects of protein biochemistry: (i) Homologous proteins can vary widely in their tolerance to mutations, and individual point mutations can increase or decrease this tolerance; (ii) mutations that increase a protein's tolerance are much rarer than mutations that decrease its tolerance; (iii) highly tolerant proteins are extremely rare, while moderately tolerant proteins are abundant; and (iv) nonfunctional mutant proteins are likely to be misfolded.
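The mutation rules described above can be sketched as a single step function. This is a minimal illustration, not the authors' simulation code: the function name `mutate` is hypothetical, and the sketch assumes that the mutated site is chosen uniformly at random among the L sites, which the text does not state explicitly.

```python
import random


def mutate(n, L, alpha, rng=random):
    """Apply one missense mutation to a folded protein with n neutral sites.

    Returns the new number of neutral sites, or None if the protein misfolds.
    Assumption (not stated in the text): the mutated site is drawn uniformly
    from the L sites, so a neutral site is hit with probability n / L.
    """
    if rng.random() < n / L:
        # Neutral site hit: the protein stays folded, but the site
        # changes from neutral to nonneutral.
        return n - 1
    # Nonneutral site hit: with small probability alpha the protein stays
    # folded and the site becomes neutral; otherwise the protein misfolds.
    if rng.random() < alpha:
        return n + 1
    return None  # misfolded; by assumption of the model this is irreversible
```

Calling `mutate(10, 100, 0.01)` returns 9 with probability 0.1, 11 with probability 0.9 × 0.01, and `None` otherwise, which makes explicit why mutations that increase tolerance are much rarer than mutations that decrease it.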
The gene is expressed at a level that leads to the synthesis of x polypeptides. For simplicity, we assume that the total number of polypeptides translated per gene is proportional to the gene's expression level and that the constant of proportionality is 1. Thus, x is also the expression level, measured in mRNA molecules per cell. Under translation, each site is mistranslated with probability τ (we neglect premature termination of the translation process). The probability that a single mRNA molecule is mistranslated and leads to a misfolded protein is 1 − [1 − τ(1 − α)]^(L−n) ≈ τ(L − n), where the approximation holds for
Simulations: We implemented a stochastic simulation of N sequences reproducing in discrete, nonoverlapping generations. We employed standard Wright–Fisher sampling, that is, the probability that a sequence in generation t + 1 is the offspring of a sequence at generation t is directly proportional to the latter's fitness. We measured the evolutionary rate along the line of descent from the most recent common ancestor (MRCA) of the final population backward in time, as described by Wilke (2004). Briefly, we let the simulated population evolve until the birth time of the population's MRCA, designated t_0, exceeded a fixed equilibration time t_equil plus a time window t_meas, t_0 > t_equil + t_meas. All quantities were measured on the equilibrated population during this latter time window. For all results reported, we chose t_equil = t_meas = 400,000, U = 0.001, τ = 0.001, and L = 100. At all parameter settings, we carried out 100 replicates and averaged results over all replicates.

ANALYTICAL RESULTS

Solution based on Sella–Hirsh theory: We can calculate the steady-state solution of our model using the analogy between evolutionary biology and statistical physics recently demonstrated by Sella and Hirsh (2005). The theory of Sella and Hirsh (2005) is applicable whenever the product of population size and mutation rate is much smaller than 1, NU ≪ 1. In this regime, the population is essentially homogeneous at all times and can be represented at any given point in time by a single sequence. We say that the population is in state i if the dominant sequence in the population is sequence i. The key insight of Sella and Hirsh (2005) is that the probability p_i to find the population in state i is proportional to a function F(i) (also called a Boltzmann factor) that depends only on the fitness of sequence i, the population size, and details of the mutation process. Thus, it follows that
As the fitness of a sequence in our model depends only on the sequence's number of neutral sites n, it is useful to lump all sequences with the same n into a single class and calculate the probability pn that the population is in a state represented by any sequence with n neutral sites. Since there are
With the formalism outlined in the previous paragraphs, we can calculate the expected number of neutral sites in the steady-state E[n] as
We can simplify these expressions in the special cases that A is either very large or very small. After inserting Equation 1 into Equation 3, we have
For
For
Approximate solution: Sella–Hirsh theory yields an exact solution for our model. However, the resulting expressions are somewhat unwieldy and do not lead to simple analytical expressions for intermediate A. Therefore, we now derive an approximate solution to our model. For small τ, we can approximate w_i ≈ exp[−Aτ(L − i)] and find ln(w_j/w_i) = Aτ(j − i). This approximation is equivalent to neglecting the small number of additional translation events required to replace polypeptides that misfold. The probability of fixation follows from Equation 7 as
In the limit A→0, we have π+ = π− = 1/N and n* = αL/(α + 1). [Note that this expression is identical to the result found through Sella–Hirsh theory (Equation 9), if we equate n* with E[n].] Therefore, in this limit,
Limitations on the number of neutral sites: Certain residues may never tolerate any substitutions, such as the active-site serine of a serine protease or the heme-binding histidines of hemoglobin. Under the assumption that m sites can never be neutral, we can write L = l + m, and for small τ the fitness wn is approximately
For Sella–Hirsh theory, if we assume that τm is negligibly small, the Boltzmann factor F′(n) gains a similar leading factor, which, as Equations 4 and 5 make clear, also drops out, this time through the normalization term. In the case of E[n], the result is only that L must again be replaced by l = L − m, while the expected value E[w] also gains a leading factor e^(−Aτm), just as in the approximate case. In short, when there are m never-neutral sites, the main effects are to lower the population's fitness by a factor e^(−Aτm) and to reduce the expected number of neutral sites E[n] and the evolutionary rate K roughly as though the evolving gene were shorter by m residues.

SIMULATION RESULTS

First, we studied the rate of evolution K as a function of the protein abundance A, for various population sizes (Figure 1
Second, we studied the effect of varying α on K(A) (Figure 2
Third, we studied the behavior of the expected fitness and the expected number of neutral sites for varying levels of A. The expected fitness is ~1 for both very low and very high A, but drops below 1 for the intermediate values of A for which K(A) starts to decay (Figure 3a
Finally, the calculation in the appendix predicts that the evolutionary rate should be independent of N if we plot it as a function of the expected number of neutral sites E[n] rather than a function of A. Figure 4
Throughout this study, we found good agreement among the numerical simulations, Sella–Hirsh theory, and our simple analytical approximation. Some discrepancies appeared between theory and simulations for the largest population size (N = 1000) and for very small α in conjunction with large A. We attribute the former to a violation of the condition

DISCUSSION

We have developed a simple model that describes the slowdown of the rate of evolution of highly expressed genes under selective pressure for translational robustness. We have studied the model with numerical simulations and have solved the model exactly using Sella–Hirsh theory. We have also developed a simple analytic approximation that is in excellent agreement with the predictions from Sella–Hirsh theory and is valid for the entire range of possible parameter values (as long as

The model abstracts a previous thermodynamic model of protein mutational tolerance introduced by Bloom et al. (2005) in which mutations may leave unperturbed or destabilize the protein's native structure (common) or stabilize it (uncommon). Increases in stability tend to increase the number of sites at which substitutions can be tolerated, so-called neutral sites, while decreases in stability usually cause misfolding or decrease the number of neutral sites. In this work, we have modeled neutral sites directly. In doing so, we allow only stepwise changes in neutral sites, sacrificing treatment of large stability changes that might radically alter the number of neutral sites and the potential stability dependence of mutational effects to gain analytical tractability.

Our results show a clear example of the paradox cited in the Introduction (Figure 4

This observation has an important corollary. When selection for translational robustness is weak, functional loss is likely the main determinant of fitness costs.
Thus, our results suggest that evolutionary conservation of sites in low-expression proteins may be more likely to indicate functional importance than similar conservation at sites in high-expression proteins. Our simple model produces an exponential decline in evolutionary rate with increasing expression level, whereas in yeast, a power law better describes this relationship (Drummond et al. 2005). Several possibilities may explain the discrepancy. First, our model assumes a symmetric binomial distribution of the number of neutral sites, but the distribution for real proteins may be skewed or heavy tailed. Second, the cost of additional misfolded proteins may not be independent of the number of already misfolded proteins. For example, misfolded proteins form toxic aggregates (Bucciantini et al. 2002), and aggregation is not a linear function of protein concentration. Finally, differences in protein structure and function between high- and low-expression proteins may play a role. Drummond et al. (2005) have previously examined differences between functionally and structurally similar paralogs and found a similar power-law relationship. However, more subtle but important differences may separate paralogs and influence their evolution. Our results here demonstrate that profound differences in protein evolutionary rates can arise even in the absence of functional and structural differences and when variables such as protein length, the translation error rate, and the underlying distribution of the number of neutral sites are held constant. In real genomes, of course, all these features vary and some, perhaps all, are under selection. The value of the model is its utility in explaining why highly expressed proteins evolve slowly across taxa (Drummond et al. 2005). 
Interestingly, our model reveals two evolutionary-rate regimes (Figures 1 Our model distinguishes between the number of polypeptides produced per gene, x, and the abundance of functional proteins, A, yet our approximate solution essentially equates these quantities with only minor accuracy loss. The approximation works for two reasons. First, misfolded polypeptides impose a negligible cost for low-abundance proteins, while for high-abundance proteins, misfolded polypeptides are rare because of selection for translational robustness. We expect these nontrivial results to hold for many organisms. Second, in our model, the number of translation events x, the primary determinant of the number of mistranslated proteins, is estimated accurately by A, a situation unlikely to hold for most organisms. Protein abundance reflects a balance of ongoing translation and turnover (Greenbaum et al. 2003), such that a high abundance can result either from moderate translational frequency and long protein half-life or from rapid translation and short half-life. Because half-lives can vary over orders of magnitude (Hargrove and Schmidt 1989), abundance and translation frequency may only weakly correlate in real organisms. Among protein abundance, mRNA expression level, and translation frequency, we hypothesize that the latter, even though difficult to measure, will best predict evolutionary rate. Bürger et al. (2005) recently studied a question closely related to this article, asking why phenotypic mutation rates (corresponding to the translational error rate in this article) are much higher than genotypic mutation rates. Within the framework of their model, Bürger et al. (2005) found very little pressure for reduction of phenotypic mutation rates below a certain threshold. 
Even though we keep the translational error rate constant in our model, we can consider a change in the number of neutral sites n as a change in the phenotypic mutation rate and thus compare our results to those of Bürger et al. (2005). In contrast to their conclusions, we find that the fraction of neutral sites, n/L, quickly rises to the maximum possible for highly expressed genes, thus reducing the phenotypic mutation rate to zero except when some sites cannot be made neutral. We believe that the differences in results are caused by differences in the way in which we and Bürger et al. (2005) treated costs of erroneously translated proteins in our models. Bürger et al. (2005) consider the total cost of protein synthesis, but do not consider additional penalties imposed by misfolded proteins, not only for their recognition and cleanup by the quality-control system but also for their innate toxicity (Bucciantini et al. 2002). Clearly, if we neglect these unique costs, then the only pressure to reduce the phenotypic mutation rate is to reduce the cost of synthesis for misfolded proteins, and this pressure will be weak if this cost is only a small proportion of the total cost of protein synthesis. In our model, on the other hand, we have focused exclusively on costs of misfolded proteins apart from their synthesis costs, implicitly assuming that the total cost of protein synthesis is approximately equal to the cost of synthesis of functional proteins and that the benefit of the functional proteins will pay for their synthesis. We believe that there is indeed a strong selective pressure to reduce the phenotypic mutation rate for highly expressed genes, but that it is cheaper for cells to evolve translationally robust genes than to evolve highly accurate transcription and translation machinery. Can translational robustness really be obtained cheaply? Drummond et al. 
(2005) have suggested that increased protein stability both confers mutational tolerance and constrains sequence evolution. Increasing protein stability, that is, decreasing the free energy of folding ΔG_f, provides a plausible mechanism for obtaining translational robustness for numerous reasons. First, increased stability leads to increased mutational tolerance and can be obtained by point mutations (Bloom et al. 2005). Second, the stability-increase mechanism is sufficiently general to encompass proteins of diverse functions and to operate in a wide range of organisms. Third, stability is free in the sense that obtaining a protein with lower ΔG_f requires only a chance mutation. While many researchers have noted an apparent trade-off between protein stability and enzymatic activity, it is crucial to emphasize that this trend may be statistical rather than intrinsic: Because both high activity and high stability are rare properties, mutations that improve both are exceedingly unlikely unless both are constrained (Giver et al. 1998). Selection for translational robustness provides precisely that dual constraint, and because many millions of mutations may be screened over evolutionary time, the very few resulting in highly expressed proteins with increased stability (conferring tolerance to translation errors) and uncompromised activity will be found. Finally, the very rarity of such stabilizing mutations provides a measure of the scarcity of highly stable proteins available for exploration by evolutionary drift. If increased stability is a dominant response to the need for mutational tolerance in highly expressed proteins, it will restrict drift and slow evolution relative to less-constrained low-expression proteins.

Acknowledgments

D.A.D. acknowledges, with gratitude, the support of Frances Arnold. C.O.W. was supported by National Institutes of Health (NIH) grant AI 065960 and D.A.D. was supported by NIH National Research Service award 5 T32 MH19138.
APPENDIX Here we derive an expression for the evolutionary rate K as a function of the number of neutral sites n* rather than the protein abundance A. For the remainder of this appendix, we drop the superscript from n* for simplicity. We begin by noting that Equation 16 implies
References