• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of pnasPNASInfo for AuthorsSubscriptionsAboutThis Article
Proc Natl Acad Sci U S A. Oct 9, 2007; 104(41): 16152–16157.
Published online Oct 3, 2007. doi:  10.1073/pnas.0705366104
PMCID: PMC2042177

Protein stability imposes limits on organism complexity and speed of molecular evolution


Classical population genetics a priori assigns fitness to alleles without considering molecular or functional properties of proteins that these alleles encode. Here we study population dynamics in a model where fitness can be inferred from physical properties of proteins under a physiological assumption that loss of stability of any protein encoded by an essential gene confers a lethal phenotype. Accumulation of mutations in organisms containing Γ genes can then be represented as diffusion within the Γ-dimensional hypercube with adsorbing boundaries determined, in each dimension, by loss of a protein's stability and, at higher stability, by lack of protein sequences. Solving the diffusion equation whose parameters are derived from the data on point mutations in proteins, we determine a universal distribution of protein stabilities, in agreement with existing data. The theory provides a fundamental relation between mutation rate, maximal genome size, and thermodynamic response of proteins to point mutations. It establishes a universal speed limit on rate of molecular evolution by predicting that populations go extinct (via lethal mutagenesis) when mutation rate exceeds approximately six mutations per essential part of genome per replication for mesophilic organisms and one to two mutations per genome per replication for thermophilic ones. Several RNA viruses function close to the evolutionary speed limit, whereas error correction mechanisms used by DNA viruses and nonmutant strains of bacteria featuring various genome lengths and mutation rates have brought these organisms universally ≈1,000-fold below the natural speed limit.

Keywords: biological evolution, genome sizes, lethal mutagenesis, mutation load, population genetics

Phenomenological approaches to study molecular evolution, developed since the pre-DNA era, assume selective advantage of certain alleles or the existence of a single-fitness peak in the genome space (1). Although mathematical genetics approaches have brought remarkable insights over the years, they lack a fundamental connection between the fitness (reproductive success) of organisms and molecular properties of proteins encoded by their genomes. On the other hand, our understanding of the molecular basis of folding, stability, and function of proteins has advanced significantly. In particular, statistical–mechanical studies provided key insights into sequence requirements for proteins to fold and to be stable in their native functional states (2). Here, we develop an evolutionary model where the fitness of an organism, in principle, can be directly inferred from its DNA sequences using a protein-folding model and a well defined physiological assumption of genotype–phenotype relationship. This physiological assumption is motivated by recent experiments that showed that knockout of any essential gene confers a lethal phenotype to an organism (3, 4). The number of such essential genes varies from organism to organism: all genes are essential in viruses, whereas in bacteria, essential genes can reach up to one-third of all genes. As a minimal functional requirement, most proteins [except intrinsically disordered ones (5, 6)] have to be stable in their native conformations. Therefore, here we assume the following fundamental genotype–phenotype relationship: for an organism to be viable, all of its essential genes must encode (at least minimally) stable proteins. Although this requirement is certainly minimal, it is necessary for essential genes and universal. There have been conflicting opinions as to whether greater stability of proteins confers selective advantage or disadvantage or is neutral (79). Here we assume that protein stability is essentially a physiologically neutral trait insofar as a protein possesses sufficient stability to stay in the folded state (9, 10). However, proteins accumulate mutations over the course of evolution. Although many mutations may be neutral with respect to protein stability (9), and some of them can be stabilizing, eventually accumulation of too many mutations will render the protein unstable and nonfunctional, and it will confer a lethal phenotype to its carrier organism.

Therefore, molecular evolution can be rendered as diffusion in a space of protein stabilities. To describe the space in which such random walks occur, we first turn to an elementary consideration of the thermodynamics of protein folding.


In their native conformations, proteins possess low effective energy (or enthalpy, because the difference between energy and enthalpy of a protein at normal pressure is negligible) and low conformational entropy, whereas in the denatured (unfolded) state, their effective energy and conformational entropy are both high. Effective energy of a protein is essentially a potential of mean force that includes direct intraprotein interactions as well as protein- and solvent-solvent contributions that change as proteins fold and unfold. Proteins unfold in a two-state manner at temperature Tf when free energy of the native state GF equals to free energy of the unfolded state GU (11):

equation image

where kB, EF, EU, and SU are the Boltzmann constant, free energy, energy, and entropy of folded and unfolded states, respectively; we assumed here that conformational entropy of the folded state is low compared with conformational entropy of the unfolded state (11, 12). Mutations affect mostly the compact native state by changing its energy EF (13, 14). A protein gets unfolded when its energy reaches value Emax, such that it loses stability at the temperature T at which the organism lives,

equation image

or assuming the conformational entropy of unfolding approximately constant for a standard size 100-aa domain (11)

equation image

[This is a simplified description; see supporting information (SI) Text for a more complete analysis.] Therefore, if the energy of the native state of a mutant protein exceeds Emax, the protein becomes unstable and dysfunctional, causing the death of the organism that carries its gene (see Fig. 1).

Fig. 1.
A schematic representation of the model. (Top) Illustration of the effect of sequence depletion as energy of the native state of a protein decreases. Middle and Bottom show the accessible range of native-state energies within which proteins can mutate. ...

Because of the effect of sequence depletion, the range of the possible native energies of proteins is limited from below as well; there are simply no sequences that can deliver energies below a certain threshold energy Emin (2, 15, 16) (see SI Text for an estimate of Emin).

At any given time t, the genotype of an organism that carries Γ genes can be fully characterized in this model by the native effective energies of the corresponding Γ proteins, i.e., as a point (E1, E2EΓ) in the Γ-dimensional space. We assume that changes of stability of different proteins upon mutations are independent and can be statistically described by the probability distribution W (EiWTEimut) = W (EiWT, Eimut). In vitro experiments on mutant proteins provide a reasonable approximation for the shape of W (see ref. 13 and Methods). For simplicity, we assume that the probability distribution W is the same for all proteins, because it is a statistical property directly following from the physics of interactions among amino acid residues in the protein structure. As each protein accumulates mutations in its sequence, its stability changes, and so is the position of the organism in the stability space E(t) (we use vector notation here to highlight the Γ-dimensional character of space of native energies of all proteins in a genome). By construction, our analysis excludes natively unstructured proteins, so they are not counted in Γ even if essential for the organism, because the effect of mutations in the unstructured proteins on the fitness of the organism cannot be easily incorporated in the model.

Our model is equivalent to the following Γ-dimensional diffusion problem. Population consisting of N noninteracting organisms represents N independent random walkers, each diffusing in Γ-dimensional space. Organisms replicate with rate b, and there is natural mutation-unrelated decay rate of genomes d (death rate due, e.g., to degradation of viral RNA). Introducing the joint (not normalized) distribution of genotypes in a population, i.e., number of organisms P(E, t) having genotype E = (E1, E2EΓ) at time t, we can write the replication-mutation balance diffusion equation

equation image

(See Methods for derivation and detailed discussion of its applicability.) In this equation, h is the average, and D is the mean square change of protein stability upon point mutation,

equation imageequation image

The first two terms in Eq. 4 are due to the mutational flux of organisms to and from the vicinity of point E, and the third term corresponds to replication of organisms. m is mutation rate per gene. Very importantly, Eq. 4 should be augmented by 2Γ boundary conditions P(E1Ei…, EΓ) = 0 for any EI > Emax and for any Ei < Emin corresponding to organismal death and sequence depletion conditions for each gene, as explained above. We assume that organism replication rate b does not depend on E, so the fitness landscape is flat, and there are only two phenotypes, the viable phenotype for Emin < EI < Emax, i = 1… Γ, and a lethal phenotype otherwise.

Eq. 4 provides a deterministic description of the evolution of distribution of genotypes. It is applicable in the regime when Nm > 1 (9, 17). In the opposite case, Nm < 1, the population becomes essentially monoclonal after each mutation, so the boundary conditions at the organism level would not be straightforward.

Because mutations in all proteins are independent, and boundary conditions are the same in each dimension (i.e., for each gene), one can write P(E) = [product]i=1Γ p(Ei), separate the variables, and reduce the problem to a product of Γ 1D diffusion problems for each gene:

equation image

Again, h and D are mean and mean square (de)stabilization effects upon point mutations. The analysis of the data on point mutations available in the ProTherm database (18) and in ref. 13 gives h = 1(kcal/mol); D ≈ 3(kcal/mol)2 (see Methods and SI Text). In the diffusion approximation, Eqs. 4 and 5, the exact distribution of protein stability changes upon mutations W is reduced to its first two moments. This approximation does not change any qualitative features of the model.

The long-time solution of Eq. 4 has the form P(E, t) = eλtP(E). The steady-state solution of Eq. 5 is

equation image

where A is the normalization constant. Converting E, Emin, and Emax into protein stabilities ΔG, ΔGmin, and ΔGmax using Eqs. 1 and 2 and assuming a standard unfolded state, we get the distribution of stabilities of all proteins:

equation image

The population growth rate λ is then

equation image

The population survives only when λ ≥ 0, which imposes an upper bound Γ* on the number of genes of an organism at a given mutational load:

equation image

Note that mΓ/b is simply the number of mutations per portion of the genome encoding essential genes per replication event. Even in the absence of the natural death process (i.e., when d = 0), the right-hand side (r.h.s.) of Eq. 8 establishes an absolute upper physical limit on this number for any unicellular asexually reproducing organism.

Eqs. 68 represent the main results of this work, and now we turn to their biological implications. First, Eq. 6a predicts a universal distribution of the stabilities of all existing protein domains. Stabilities of proteins vary in the range from 0 to ≈20 kcal/mol (18, 19), providing an estimate for ΔGmax − ΔGmin ≈ 20 kcal/mol. As can be seen from Fig. 2, Eq. 6a describes the distribution of protein stabilities quite well with parameters h and D derived from independent point mutation experiments. In particular, it reproduces a characteristic asymmetric distribution of protein stabilities. Previously, the distribution of stabilities of evolving proteins has been studied in lattice protein simulations by Taverna and Goldstein (10) and by Bloom et al. (9). The latter paper (9) also developed a master equation description of the protein stabilities. The results of these studies are qualitatively consistent with Eq. 6a in the limit of high stabilities (low values of E). However, unlike Eq. 6a, they predict p(Emax) ≠ 0 and apparently monotonic dependence of p on ΔG. This difference is partly due to the conversion of the integro-differential equation Eq. 11 to the differential equation Eq. 4, to make it amenable to the exact analytical treatment. Also, it may be partly due to the use of the 2D lattice model in simulations in refs. 9 and 10. A numerical solution of full integro-differential Eq. 11 yields small but nonzero values of p at Emax (or ΔG = 0). However, importantly, the full numeric solution of Eq. 11 still predicts a pronounced maximum of p(ΔG) at ΔG ≈ −5 kcal/mol (data not shown).

Fig. 2.
Distribution of stabilities of single-domain proteins and the prediction (smooth line) from analytical equation (6) with parameters h and D derived from mutation data on proteins as explained in the text and in SI Text. The stability data were collected ...

Interestingly, the experimental distribution of protein stabilities (Fig. 2) vanishes at ΔG = 0, in agreement with our differential equation analysis. This feature may be due to a number of reasons. First, there is a possible bias in the ProTherm database (e.g., aggregation impedes precise measurements of marginal values of ΔG; proteins from many different species of all kingdoms of life are mixed in the database, etc.). Second, there is a possibility (and likelihood) of an actual additional selection against barely stable proteins. For example, one can imagine that as protein folding rates decrease with protein stability (20, 21), marginally stable proteins would fold too slowly to perform their biochemical function. Therefore, although our theory provides a reliable quantitative description of the distribution in a broad range of protein stabilities, a complete quantitative understanding of its behavior in the region close to the ΔG = 0 threshold can be achieved only through an unbiased high-throughput experimental measurement of protein stabilities in a complete genome, possibly in parallel with a structural proteomics project.

Eqs. 7 and 8 establish an upper limit on the genome sizes of organisms at a given mutational load and replication and death rates. They predict that, if the genomes are too large or the mutation rate too high, populations will go extinct because of lethal mutagenesis. Our theory predicts a universal upper-limit threshold of a mutational load that a population can sustain, even without the death of organisms or the loss of parent genomes due to natural causes (e.g., degradation of parent RNA in viruses). With the estimated values of the parameters h, D, and EmaxEmin, the mutational threshold can be estimated (for d = 0) as approximately six mutations per essential portion of the genome per replication. This prediction is in good quantitative agreement with many lethal mutagenesis experiments on RNA viruses (22). The exponential decay of the distribution of protein stabilities at low E makes Eqs. 7 and 8 not very sensitive to the particular choice of the value of Emin. Moreover, the upper limit of six mutations per genome per generation is an estimate of the lethal mutagenesis threshold from above, because it assumes that movements of a genome in the protein stability space along each axis are independent, reflected in the separation of variables in Eq. 4. In fact, when mΓ/b>1 (more than one mutation per generation), the possibility exists that because mutations are (mostly) introduced upon genome replication, more than one mutation will be introduced in a genome at the same time. In the language of our diffusion equation, that means that the “random walker” advances along several dimensions at a time (see Methods for more detail). Because h > 0, these correlated moves in the stability space result in a faster drift toward Emax, slightly decreasing the threshold mutation rate (unpublished results).

Our theory predicts that the maximum essential genome size (number of essential genes) Γ* is smaller for organisms with higher mutation rate m. This prediction is in excellent agreement with observations for a broad range of organisms (see figure 1 in ref. 22, reproduced as SI Fig. 7). A further example is provided by data on the distribution of viral genome lengths. It is well known that the mutation rate in RNA viruses is much greater than that in dsDNA viruses (22, 23). Correspondingly, the theory predicts much longer genomes for dsDNA viruses than for RNA viruses, in harmony with observations (Fig. 3).

Fig. 3.
The distribution of number of genes per viral genome. The red histogram corresponds to RNA viruses, whereas the black histogram is for dsDNA viruses. The data are taken from National Center for Biotechnology Information Genome database, www.ncbi.nlm.nih.gov/genomes/static/vis.html ...

Another important prediction from the theory concerns thermophilic organisms. Stability of proteins at elevated environmental temperature requires that their native-state energy be lower than of mesophilic proteins, according to Eq. 1 (see also SI Text). Correspondingly, the range of possible native-state energies EmaxEmin in Eqs. 68 shrinks for thermophilic organisms, and the mutational meltdown threshold decreases. The magnitude of the effect can be estimated using a typical value of the entropic difference between unfolded and folded states for a typical 100-aa protein domain, 0.25 kcal/mol/K (11). An increase of environmental temperature by 60 K, typical of hyperthermophiles, results in a decrease of EmaxEmin from 20 to 8 kcal/mol, leading to a decrease in the mutational meltdown threshold (r.h.s. of Eq. 8) from 6 to 2. The implication of that is 2-fold. First, it suggests that hyperthermophilic (crenarchaeal) viruses (phages) can only be dsDNA, because RNA viruses are already in the mutational meltdown regime at this temperature. Indeed, all known crenarchaeal viruses are dsDNA (24), and so far no extremophilic virus with an RNA genome has been found.

Second, assuming the same mutation rate per nucleotide in mesophilic and thermophilic organisms (25), our theory predicts that organisms living at elevated temperatures should have shorter genomes. In Fig. 4, we plotted the number of genes in the genomes of 202 bacteria and archaea (see supporting information table 1 of ref. 14) as a function of their optimal growth temperature. It is clearly seen that prokaryotes living at 60°C or above systematically possess shorter genomes (≈2,000 genes) than mesophilic prokaryotes with optimal growth temperature of 20–40°C. The existence of mesophilic prokaryotes with short genomes does not contradict our theory, because Eq. 8 sets only an upper limit on the size of the essential proteome; if an organism functions with a smaller number of genes, it simply does not “feel” the pressure on the genome size because of protein stability requirements. In these cases, regulation of the genome size is governed by other biological mechanisms, such as the energetic cost of genome duplication and repair.

Fig. 4.
Distribution of the number of genes in 202 prokaryotic genomes for organisms with various optimal growth temperatures. Thermophilic and hyperthermophilic organisms systematically possess shorter genomes than mesophilic ones.

Discussion and Conclusions

In contrast to other approaches (8), we did not a priori assume here an existence of optimal stability of proteins that renders highest fitness. Such assumptions are often motivated by a circular argument that observed stabilities of proteins are not too high. Here we assume that greater stability of proteins is not functionally advantageous to organisms, so long as proteins possess at least minimal stability (ΔG = 0) to function, On the other hand, the sequence entropy factor favors less stable proteins; sequences that can deliver higher stability become more scarce (10, 26). The entropic factor in sequence space is the main reason why destabilizing mutations are statistically more frequent (i.e., that mutation drift parameter h in Eq. 4 is positive in our theory); mutated sequences of lower stability are more likely to be found than those stabilizing a protein (10, 26) [a more detailed analysis that explicitly takes into account sequence statistics, i.e., it considers the realistic dependence of h and D in Eq. 4 on E; ref. 26 supports this conclusion (data not shown)].

One would then expect that most proteins should be “marginally stable,” with the distribution of stabilities peaked close to ΔG = 0 value (9, 10). However, the distribution of stabilities of real proteins has a characteristic peak at a moderate (not too low) value ≈5 kcal/mol and a sharp asymmetric decrease at both lower and higher stabilities (see Fig. 2). This theory explains why naïve expectations of marginal stability of proteins are not borne out. Proteins evolve as part of organisms, and very low values of their stability would result in too-frequent mutational “falls from the ΔG = 0 cliff,” resulting in the death of the organism and subsequent elimination of the gene encoding the marginally stable protein. This is a clear manifestation of the effect of the constraints imposed by organismal evolution on the distribution of the molecular properties of proteins. An important lesson from this study is that a peaked distribution of a specific property of evolved proteins, such as stability ΔG, does not imply or require that proteins with the peak value of the property confer the highest fitness to their carrier organisms. Instead, the peaked distribution can arise from a balance between opposing factors (stability above lethal threshold and sequence entropy) in a locally flat fitness landscape with just two (lethal and viable) phenotypes.

Another key universal prediction of this model is lethal mutagenesis at high mutation rate, approximately six mutations per genome per replication for mesophiles and one to two for hyperthermophiles living close to 100°C. Because this number is per replication, a higher replication rate would make it possible to avoid lethal mutagenesis at a given absolute mutation rate. Lethal mutagenesis is often attributed to Eigen's error catastrophe (27, 28). However, this interpretation may be misleading. In fact, at high mutation rates, the quasispecies theory predicts a very different phenomenon, delocalization in sequence space. Because the quasispecies theory applies only to soft selection, extinction of populations via lethal mutagenesis cannot occur there (29, 30). The lethal mutagenesis predicted in the present work is also different from the Muller ratchet (31), because it can occur even for infinite populations upon exceeding a well defined mutation rate threshold, in contrast to the Muller ratchet mechanism (32). As Wilke and coworkers pointed our recently (33), there has been no clear understanding of lethal mutagenesis. Our findings represent a simple first-principle theory of this important phenomenon and may have direct implications for further development of therapies based on mutation-inducing drugs or radiation.

The results presented here are most directly applicable to RNA viruses whose mutation rates are close to the mutational meltdown threshold of Eq. 8. In DNA organisms, mutation rates are typically 2–4 orders of magnitude lower because of the action of error-correction mechanisms (23). However, highly pathogenic strains of bacteria often exhibit mutator phenotypes (34) whose mutation rates can approach the mutational load limit discovered in this work. As in the case of viruses, it was argued that higher mutation rates in some strains of bacteria may emerge to facilitate adaptation to their environment. However, the mutational meltdown threshold puts a physical limit on the mutational response of pathogens to rapidly changing host environments.

Our model is basic, and it does not consider many biologically relevant factors, such as functional selection, epistasis, or death of organisms because of environmental fluctuations. Organisms do not interact in the model and do not compete for resources. We consider hard selection whereby population size is not fixed.

We further simplified our consideration of protein thermodynamics by assuming a roughly equal length of all proteins (reflected in universal EmaxEmin values). It is known that protein lengths vary in a broad range. However, a two-state folding unit to which this theory is applicable is a protein domain. The range of variation of domain sizes is somewhat smaller than that for complete proteins, so we made our estimates for a typical 100-aa domain. The variation of domain lengths can be taken into account in further development of the theory.

Our analysis establishes the limits on the genome size that an organism can maintain at a given mutational load. In particular, the limitations set by this theory may be important for understanding the origins of life or de novo design of artificial life, because early or newly designed organisms probably could have an imperfect replication machinery and, thus elevated mutation rates.


Derivation of Mutation-Diffusion Eq. 4.

Assuming that replication and mutations are decoupled.

We consider the population of noninteracting N organisms. Each organism mutates and replicates independently, and therefore the mutation-diffusion equations will be derived for organisms. As a first and conceptually simplest step, we consider that mutations and replications are decoupled.

Consider a small time interval Δ t, such that the number of mutations occurring during this interval mΔ t is small. Now the probability density that an organism mutates from state (E′1, E′2E′N) to (E1, E2ES) is given by the expression:

equation image

Here we assumed that mutations in a gene occur independently. The first term in the r.h.s. of Eq. 9 accounts for situations when mutations do not occur; the probability of that for each gene is (1 − mΔ t), and δ(E′iEi) has the usual meaning of Dirac's delta-function. The second term in the r.h.s. describes the probability that a mutation actually occurred in a gene, and W(E′iEi) describes the probability that native state energy of a protein encoded by this gene changes from E′i to Ei. This quantity depends only on the distribution of energetic effects of point mutations in proteins (see below). Now consider the ensemble consisting of N organisms. The distribution of genotypes in the population is given by the function P(E1, E2EΓ), such that:

equation image

and the population size N is not fixed in this model; it describes hard selection, so that N itself depends on time

equation image

where the growth/decay rate λ is defined in the text.

The distribution of genotypes of all organisms at time t + Δ t is then given by the relation:

equation image

where the last term describes replication events: the probability that an organism replicates in a time interval Δ t is bΔ t, where b the is replication rate. Now, considering m Δt to be small enough to keep only first-order in mΔ t terms in Eq. 9, we obtain from Eqs. 9 and 10,

equation image

Now we can convert the integro-differential Eq. 11 into a differential diffusion equation following a standard procedure: expanding P in the Taylor series up to a second order:

equation image

Substituting Eq. 12 into Eq. 11, we arrive at Eq. 4 with properly defined parameters h and D and boundary conditions as outlined above.

The case when replication and mutations are coupled.

Consider now a scenario when replication and mutations are coupled, i.e., that mutations occur exclusively upon replication, with formation of the daughter genome on the parents' genome as a template. Now we have to distinguish between conservative (single-stranded viruses) and semiconservative replication (double-stranded organisms) (35).

Conservative replication.

Again, consider a small time interval Δt → 0. Introducing dimensionless quantity m=m/b and the probability of mutations per gene per replication and using otherwise the same notation as in Eq. 9:

equation image

Here, the first term corresponds to the situation that replication did not occur in the time interval Δt, the second term says that when replication occurs, one master genome is preserved (i.e., conservative replication), and the third term says that when replication occurred, it could be either without mutation (probability (1 − m) for each gene) or with mutations, and we assumed that more than one mutation in one gene is unlikely. Assuming that m [double less-than sign] 1, i.e., that the probability of a missense mutation per gene per replication is small [which holds under most biologically relevant conditions (23, 36)], we arrive at the same integro-differential equation, Eq. 11.

Semiconservative replication.

In this case, the master equation for T will be slightly different, because mutations can occur now in both organisms so that no “master genome” is preserved (35),

equation image

After straightforward algebra, we again arrive at the same final integro-differential equation, Eq. 11. However, now the effective mutation rate is doubled, −2m. This correction is intuitive because, in the case of semiconservative replication, both progenies can acquire mutations (35).

Although the final equations appear to be the same in both cases (coupled and decoupled mutations and replications), this is so only under the additional condition that the probability of a mutation upon replication per gene is low, m [double less-than sign] 1. The reason such requirement appears in the coupled case (and does not seem to be relevant in the decoupled case) is that, in the decoupled case, the mutations are smeared in time, so the probability that two genes mutate at the same time can always be neglected. In the case of coupled mutations and replications, mutations occur in “bursts,” i.e., upon replication only. For that reason, potentially mutations may become correlated between genes (which will lead to a more complex mutation-replication master equation). This factor will decrease the lethal mutagenesis threshold from six mutations per genome per generation to a slightly lower value (unpublished work). However, these correlations still remain insignificant when m [double less-than sign] 1, which is a realistic condition, as we noted above.

The key biological difference between coupled and uncoupled scenarios in conservative cases is that when mutation and replication are coupled, the master genome is always preserved in a population, because mutations are introduced in the daughter strand only. This condition may significantly alter the conditions for the lethal mutagenesis threshold, because a preserved wild-type genome can always produce a nonmutated wild-type progeny (33). However, the fully coupled conservative replication scenario does not appear realistic for most organisms, such as viruses and bacteria. In most viruses such as the HIV retrovirus, mutations and replication are decoupled, and parent genomes are not preserved as mutations are introduced, e.g., mostly at the reverse-transcription stage, after which original parent RNA is degraded. In bacteria, replication is semiconservative, i.e., mutations can be introduced in either strand by the action of an error-correction mechanism (35, 37), so that both progenies may carry mutations after replication, again eliminating the preservation of the parent genome.

Derivation of Energetic Mutation Parameters h and D from the Database.

The statistical data on how protein stability changes in response to mutations have been collected from the ProTherm database (18). In total, 2,804 and 2,418 mutations were considered for 105 and 71 proteins separately in two kinds of experimental measurements. In cases when multiple data were reported for the same mutant (e.g., measured by different techniques or evaluated at different conditions with subsequent extrapolation to normal), the average was taken over all data points corresponding to the same mutant. As noted in the text, mutations affect mostly the native states of proteins, so that to a good approximation,

equation image

The histogram in Fig. 5 represents the mutation free-energy probability distribution,

equation image

Using this distribution and the definitions provided in the text, we obtain the values of h and D quoted there. As a note of caution, we observe that possible biases in the ProTherm database can affect W, because some of the mutations have been performed with a specific biochemical purpose in mind.

Fig. 5.
Histogram of protein point mutation data obtained from the ProTherm database (15). Two kinds of unfolding methods, ΔΔG and ΔGH2O, are considered separately, each with 2,804 and 2,418 different point mutations. The two groups of ...

Supplementary Material

Supporting Information:


We are grateful to Boris Shakhnovich, Jesse Bloom, and Claus Wilke for many fruitful discussions. This work was supported by the National Institutes of Health.


right-hand side.


The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0705366104/DC1.


1. Gillespie JH. The Causes of Molecular Evolution. Oxford: Oxford Univ Press; 1991.
2. Shakhnovich EI. Chem Rev. 2006;106:1559–1588. [PMC free article] [PubMed]
3. Fraser AG, Kamath RS, Zipperlen P, Martinez-Campos M, Sohrmann M, Ahringer J. Nature. 2000;408:325–330. [PubMed]
4. Herring CD, Blattner FR. J Bacteriol. 2004;186:2673–2681. [PMC free article] [PubMed]
5. Dyson HJ, Wright PE. Nat Rev Mol Cell Biol. 2005;6:197–208. [PubMed]
6. Uversky VN. Protein Sci. 2002;11:739–756. [PMC free article] [PubMed]
7. Bloom JD, Labthavikul ST, Otey CR, Arnold FH. Proc NatlAcad Sci USA. 2006;103:5869–5874. [PMC free article] [PubMed]
8. DePristo MA, Weinreich DM, Hartl DL. Nat Rev Genet. 2005;6:678–687. [PubMed]
9. Bloom JD, Raval A, Wilke CO. Genetics. 2007;175:255–266. [PMC free article] [PubMed]
10. Taverna DM, Goldstein RA. Proteins. 2002;46:105–109. [PubMed]
11. Privalov PL. Adv Protein Chem. 1979;33:167–241. [PubMed]
12. Shakhnovich EI, Finkelstein AV. Biopolymers. 1989;28:1667–1680. [PubMed]
13. Sanchez IE, Tejero J, Gomez-Moreno C, Medina M, Serrano L. J Mol Biol. 2006;363:422–432. [PubMed]
14. Berezovsky IN, Zeldovich KB, Shakhnovich EI. PLoS Comput Biol. 2007;3:e52. [PMC free article] [PubMed]
15. Shakhnovich EI. Fold Des. 1998;3:R45–R58. [PubMed]
16. Meyerguz L, Grasso C, Kleinberg J, Elber R. Structure (London) 2004;12:547–557. [PubMed]
17. van Nimwegen E, Crutchfield JP, Huynen M. Proc Natl Acad Sci USA. 1999;96:9716–9720. [PMC free article] [PubMed]
18. Kumar MD, Bava KA, Gromiha MM, Prabakaran P, Kitajima K, Uedaira H, Sarai A. Nucleic Acids Res. 2006;34:D204–D206. [PMC free article] [PubMed]
19. Makhatadze GI, Privalov PL. Adv Protein Chem. 1995;47:307–425. [PubMed]
20. Sali A, Shakhnovich E, Karplus M. Nature. 1994;369:248–251. [PubMed]
21. Finkelstein AV, Badretdinov A. Fold Des. 1997;2:115–121. [PubMed]
22. Anderson JP, Daifuku R, Loeb LA. Annu Rev Microbiol. 2004;58:183–205. [PubMed]
23. Drake JW. Ann NY Acad Sci. 1999;870:100–107. [PubMed]
24. Ortmann AC, Wiedenheft B, Douglas T, Young M. Nat Rev Microbiol. 2006;4:520–528. [PubMed]
25. Jacobs KL, Grogan DW. J Bacteriol. 1997;179:3298–3303. [PMC free article] [PubMed]
26. Shakhnovich EI, Gutin AM. J Theor Biol. 1990;149:537–546. [PubMed]
27. Crotty S, Cameron CE, Andino R. Proc Natl Acad Sci USA. 2001;98:6895–6900. [PMC free article] [PubMed]
28. Swetina J, Schuster P. Biophys Chem. 1982;16:329–345. [PubMed]
29. Wilke CO. BMC Evol Biol. 2005;5:44. [PMC free article] [PubMed]
30. Takeuchi N, Hogeweg P. BMC Evol Biol. 2007;7 author reply, p 15.
31. Muller HJ. Mutat Res. 1964;106:2–9. [PubMed]
32. Lynch M, Burger R, Butcher D, Gabriel W. J Hered. 1993;84:339–344. [PubMed]
33. Bull JJ, Sanjuan R, Wilke CO. J Virol. 2007;81:2930–2939. [PMC free article] [PubMed]
34. Denamur E, Matic I. Mol Microbiol. 2006;60:820–827. [PubMed]
35. Tannenbaum E, Deeds EJ, Shakhnovich EI. Phys Rev E. 2004;69 061916. [PubMed]
36. Drake JW, Holland JJ. Proc Natl Acad Sci USA. 1999;96:13910–13913. [PMC free article] [PubMed]
37. Brumer Y, Shakhnovich EI. Phys Rev E. 2004;70 061912. [PubMed]

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
PubReader format: click here to try


Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...


  • Cited in Books
    Cited in Books
    PubMed Central articles cited in books
  • MedGen
    Related information in MedGen
  • PubMed
    PubMed citations for these articles
  • Substance
    PubChem Substance links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...