- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

# Modeling the site-specific variation of selection patterns along lineages

^{*}Bioinformatics Institute and Allan Wilson Centre for Molecular Ecology and Evolution, University of Auckland, Private Bag 92019, Auckland, New Zealand;

^{†}Department of Biology, University of Rochester, Rochester, NY 14627; and

^{‡}Section of Ecology, Behavior, and Evolution, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116

^{§}To whom correspondence should be addressed. E-mail: ude.dscu.liamoib@hnhoj.

## Abstract

The unambiguous footprint of positive Darwinian selection in protein-coding DNA sequences is revealed by an excess of nonsynonymous substitutions over synonymous substitutions compared with the neutral expectation. Methods for analyzing the patterns of nonsynonymous and synonymous substitutions usually rely on stochastic models in which the selection regime may vary across the sequence but remains constant across lineages for any amino acid position. Despite some work that has relaxed the constraint that selection patterns remain constant over time, no model provides a strong statistical framework to deal with switches between selection processes at individual sites during the course of evolution. This paper describes an approach that allows the site-specific selection process to vary along lineages of a phylogenetic tree. The parameters of the switching model of codon substitution are estimated by using maximum likelihood. The analysis of eight HIV-1 *env* homologous sequence data sets shows that this model provides a significantly better fit to the data than one that does not take into account switches between selection patterns in the phylogeny at individual sites. We also provide strong evidence that the strength and the frequency of occurrence of selection might not be estimated accurately when the site-specific variation of selection regimes is ignored.

**Keywords:**positive selection, codon-based model of nucleotide substitutions, phylogeny, maximum likelihood

An excess of nonsynonymous changes to synonymous changes in protein-coding DNA sequences is an unambiguous signature of adaptive molecular evolution (1). Such a pattern is best explained by a selective advantage in the past for substitutions that cause amino acid changes. Indeed, there are numerous examples where substitutions causing amino acid changes confer a selective advantage. For example, in the MHC, overdominant selection appears to be responsible for the excess of replacement substitutions in the antigen-recognition site (1). Similarly, positive selection has been detected in viral proteins subject to immune surveillance (2–4), in abalone sperm lysins (5), primate lysozymes (6), and regions involved in species-specific sperm–egg interaction (7).

A number of methods have been proposed to estimate the number of nonsynonymous and synonymous substitutions per site between two homologous sequences (e.g., refs. 8–10). These approaches aim to estimate the ratio between the number of nonsynonymous (and synonymous) substitutions per codon and the number of nonsynonymous (and synonymous) mutations per codon. Simulations have shown that some of these methods provide reliable average estimates of the nonsynonymous/synonymous rate ratio (10). However, these approaches assume that amino acids in the sequence evolve under the same selection pressure. This assumption is unrealistic because only a few amino acid sites are found to be responsible for adaptive evolution in almost all proteins that evolve under positive selection (1, 2, 11). As a result, these methods have little power in detecting positive selection when analyzing real data sets (e.g., refs. 12–14).

Nielsen and Yang (2) described a codon-based model of nucleotide substitutions that allows the selection process to vary among sites. This model is very similar in its structure to the one proposed earlier by Goldman and Yang (15). These authors describe the substitutions at the codon level as a continuous-time Markov chain, with a state space on the 61 sense codons. Under such a model, a (codon) site evolves under purifying, neutral, or positive selection. The nonsynonymous/synonymous substitution rate ratio varies across these different classes of selection. For example, an amino acid position that is evolving under strong purifying selection should have a small nonsynonymous/synonymous substitution rate ratio, whereas a position under positive selection should exhibit a nonsynonymous/synonymous rate ratio of >1. The parameters of the model explored by Nielsen and Yang (2), and later by Yang *et al.* (16), include the nonsynonymous/synonymous rate ratio for each selection class of substitution, frequency parameters that are interpreted as the prior probability that a site falls into any specific selection category, and parameters such as the equilibrium codon frequencies and transition/transversion rate ratio that are intended to describe the vagaries of the substitution process beyond the most basic ones of selection pressures on nonsynonymous and synonymous substitutions. All of these parameters can be estimated from alignments of protein-coding DNA sequences by using either a maximum-likelihood or a fully Bayesian approach. The analysis of real data sets suggests that this approach is more efficient than pairwise-based methods for detecting positive selection at the protein level (e.g., ref. 14).

In the widely used codon models described by Nielsen and Yang (2), and by Yang *et al.* (16), the selection process does not vary along lineages of the phylogeny. This constraint is probably unrealistic because several studies have demonstrated that switches between selection processes, or variation of the selection intensity, are likely to occur (e.g., refs. 4, 6, and 17–19). Yang and Nielsen (20) proposed a test for positive selection at some branches of the phylogeny, while simultaneously allowing selection to vary among sites. However, under their model, branches with a different pattern of positive selection must be specified *a priori*. This model is especially useful when one wants to test for an association between speciation (19) or gene-duplication (20, 21) events and positive selection. Similarly, Forsberg and Christiansen (22) described a codon-based model of substitution that allows nonoverlapping subtrees of the phylogeny to evolve under different selection patterns. Again, the parts of the tree that undergo different selection regimes must be specified *a priori*. The authors demonstrated that this model is well suited for analyzing how selection acts on variants of a parasite's genes in different hosts.

Whereas earlier studies (4, 6, 17–23) demonstrate that switches between selection patterns across lineages are likely to occur, little is known about the rates at which these changes happen. Indeed, current methods make it difficult to evaluate how frequently variable selection over time leaves an identifiable footprint in alignments of homologous protein-coding sequences. Moreover, the variation of selection processes during the course of evolution may show different patterns, depending on the site to be considered. For example, independent convergent/parallel evolution events might occur in distinct lineages at different sites. In this case, models that define lineages where molecular adaptation occurred *a priori* are not well suited.

In this paper, we generalize a codon-based model of DNA substitution to allow selection to change over time. Unlike previous approaches, our model does not constrain switches among selection categories to any particular lineage *a priori*. Instead, the switch between one selection regime and another at a given codon position is a stochastic process, mediated presumably by external forces. This approach is similar to the one followed by Tuffley and Steel (24) for modeling the site-specific variation of substitution rates. Our model has been implemented in the maximum-likelihood framework. Because the traditional model is a special case of our more general model, we could perform likelihood-ratio tests of the null hypothesis that the site-specific selection regime remains constant across lineages.

The analysis of eight HIV-1 *env* sequence data sets shows that our model provides a statistically significant better description of these data than does the traditional model. We also argue that the role played by negative selection is underestimated when site-specific selection process varies across lineages and is ignored. Finally, a site-by-site analysis demonstrates that our approach can be used to detect episodic adaptive events at individual amino acid position.

## Methods

**The Model.** Our codon-based substitution model combines two processes. As with other models [e.g., Nielsen and Yang (2)], we allow substitutions between codons to occur under one of three selection regimes (purifying, neutral, and positive selection). However, we also allow changes between selection classes (or switches) to occur according to a continuous-time Markov chain. In the following description, we will first describe the assumptions we make about the substitution process, and then detail the switching process. Finally, we will describe the combined substitution/switching process.

Substitutions between codons are modeled as a continuous-time Markov process. The instantaneous rate of change from one sense codon to another is described by a 61 × 61 rate matrix. Following examples in earlier work (15), we parameterize the rate matrix as:

where *Q _{x}*(

*ij*) is the rate of change from codon

*i*to codon

*j*when sequences evolve under selection pattern

*x*, ω

*is the nonsynonymous/synonymous rate ratio that characterizes selection process*

_{x}*x*, and π

*is the stationary frequency of codon*

_{j}*j*. For each

*i*and

*j*, π

*, π*

_{i}*, and*

_{j}*Q*(

_{x}*ij*) remain constant through time and π

*(*

_{i}Q_{x}*ij*) = π

*(*

_{j}Q_{x}*ji*). The matrix

**Q**

_{x}= {

*Q*(

_{x}*ij*)} therefore defines a stationary, homogeneous, and time-reversible Markov process of substitution between codons.

We consider two parameterizations of the substitution model. Both parameterizations have three classes of nonsynonymous/synonymous rate ratios. The first was described by Nielsen and Yang (2) and was later designated the “M2” model by Yang *et al.* (16). Under this model, nonsynonymous mutations are strongly deleterious and eliminated under the first selection pattern (*x* = 1), so that ω_{1} = 0. Rates of nonsynonymous and synonymous substitutions are equal under the second selection process (*x* = 2), so that ω_{2} = 1. Nonsynonymous mutations provide a selective advantage under the third selection process (*x* = 3), so that ω_{3} > 1. The second parameterization we consider was designated the “M3” model by Yang *et al.* (16). The nonsynonymous/synonymous rate ratio is less constrained under this model; the only constraint is that ω_{1} < ω_{2} < ω_{3}.

Substitutions between selection classes are governed by a three-state continuous-time Markov process with rate matrix

The element *R*(*xy*) on the *x*th row and *y*th column of **R** corresponds to the rate of switching between selection processes *x* and *y*. The parameter δ is the rate of interchange between selection patterns and *p _{x}* is the equilibrium frequency of selection process

*x*. The α and β parameters are two relative rate ratios with important biological meanings; values of α >1 indicate that switches between negative and positive selection occur more frequently than switches between negative selection and a neutral process of molecular evolution. The comparison of these relative rates might be relevant. For example, Messier and Stewart (6) have shown that adaptive episodes (i.e., positive selection) during the evolution of primate lysozyme were followed by episodes of negative selection. A value of α >1 and also greater than β would confirm this observation. For each

*x*and

*y, p*, and

_{x}, p_{y}*R*(

*xy*) remain constant through time, and

*p*(

_{x}R*xy*) =

*p*(

_{y}R*yx*). Hence, the process describing switches between selection classes follows a stationary, homogeneous, and time-reversible stochastic process.

The combined substitution and switching process may be formulated in terms of a single continuous-time Markov chain with stationary distribution (*p*_{1} π_{1},..., *p*_{1} π_{61}, *p*_{2} π_{1},..., *p*_{2} π_{61}, *p*_{3} π_{1},..., *p*_{3} π_{61}), and rate matrix

where **I** denotes the 61 × 61 identity matrix. It is straightforward to demonstrate that **S** is stationary, time-reversible, and homogeneous whenever **Q**_{1}, **Q**_{2}, **Q**_{3}, and **R** are. The matrix **S** is scaled so that time is measured by an expected number of codon substitutions per amino acid position. The transition probabilities from state *i* to state *j* on a branch of length *v* are contained in a 183 × 183 matrix designated **P*** _{v}*. The matrix

**P**

*is calculated by using the following matrix exponentiation:*

_{v}**P**

*=*

_{v}*e*

^{S}_{v}. The probability of observing each site pattern (given a phylogenetic tree and the values of the parameters) under the combined substitution/switch process can be calculated by using Felsenstein's pruning algorithm (25).

Note that the model described here has only three additional parameters (δ, α, and β) as compared with the traditional model. Moreover, the models are nested: the traditional model arises as the switching rate tends to zero (δ → 0). This nesting of the models allows us to apply likelihood ratio tests of the following null hypotheses: *H*_{1}, there is no switching among selection categories; and *H*_{2}, there is no bias in the switching pattern among selection categories. The likelihood ratio test statistic, minus twice the difference in the log likelihoods under the null hypothesis and the more general alternative hypothesis, asymptotically follows a 50:50 mixture of and distributions for *H*_{1} and a distribution for *H*_{2} [see Self and Liang (26)].

It is also worthwhile to note that the switching model of codon substitution is identical in its structure to the covarion model described by Tuffley and Steel (24). Whereas a covarion process takes into account site-specific variation of substitution rates in a phylogenetic tree, our model describes site-specific variation of selection forces along lineages.

Codon models with switches among categories are designated with “+S1” when α = β = 1 or with “+S2” when α and β are free to vary. For example, “M2+S1” designates the model of codon substitution described by Nielsen and Yang (2), allowing switching among categories under the constraint that α = β = 1, and “M3+S2” designates the M3 model first described by Yang *et al.* (16), while allowing for a potentially biased pattern of switching among selection categories. “M2+S” designates any switching model derived from M2. “M3+S” is defined in a similar manner.

**Detection of Positively Selected Sites.** Under M2 or M3, the posterior probability of a given selection class at some site can be estimated by using either an empirical Bayesian approach (2) or a fully Bayesian one (27). The category that maximizes the posterior probability is the most likely selection process to have acted at the corresponding site. For switching models, one cannot rely on a site remaining in any particular selection category through time. Instead, we calculate the expected fraction of time that selection process spends in a particular class. Therefore, these models can be used to detect sites in the alignment where positive selection is likely to have occurred in most of the lineages.

Let *d _{z}*(

*v, x, y*) be the amount of time that the process dwells on selection regime

*z*on a branch of length

*v*, with

*x*and

*y*being the selection patterns at both ends of this branch. We have

where *p _{xz}*(

*s*) is the probability of change from selection process

*x*to

*z*after

*s*codon substitution events;

*p*(

_{zy}*v*–

*s*) and

*p*(

_{xy}*v*) are defined in a similar manner. These probabilities can be derived from

**P**

*, or from the calculation of the matrix*

_{v}

**e**

^{R}^{v}. Conditional on the selection class

*x*at one end of a branch of length

*v*and class

*y*at the other end, the probability of finding the switch process in state

*z*is then

*Pr*(

*z*|

*x, y, v*) =

*E*[

*d*(

_{z}*v, x, y*)]/

*v*.

Of course, one cannot be certain that the selection regimes *x* and *y* really occur at either end of the branch. Therefore, the expected frequency of the selection regime *z* is a weighted average over all possibilities of states *x* and *y* at either end of the branch, conditional on the observed codon states at the tips of the tree.

By using this approach, we are able to calculate two important quantities: (*i*) the expected frequency of the positive selection regime on a branch-by-branch basis, for each site of the alignment and (*ii*) the probability that a site is under positive selection over the entire phylogenetic history of the group by taking the expected time of occurrence of the positive selection pattern for the entire tree and dividing that number by the expected number of codon substitutions at this site (i.e., the sum of the branch lengths).

**Data.** We analyzed partial *env* sequences from HIV-1 (C2-V5) that had been obtained in a longitudinal study (28). These data sets are derived from samples that were collected on average every 8 months from eight infected patients. Sequences were aligned collectively by using clustalx (29) and were then manually adjusted within subjects. Gaps were removed in a balanced manner to preserve codon alignment (30). An earlier study (4) of the eight homologous sequence alignments using traditional codon models of substitution showed that the broad genetic diversity of HIV-1 in infected individuals is a consequence of site-specific positive selection, a likely consequence of immune recognition.

**Parameter Estimation.** An initial phylogeny was estimated for each of the eight data sets by using maximum likelihood under the general time-reversible model of nucleotide substitution (a model that allows four states) with the program phyml (31). Among-site rate variation was modeled by using a discrete gamma distribution (32).

The tree topologies were considered fixed during the estimation of the codon model parameters. The equilibrium frequencies of codons (π* _{i}*) were estimated by using the observed frequencies of the nucleotides at the three codon positions. The other parameters of the model; branch lengths, transition/transversion rate ratio (κ), equilibrium frequencies of the selection classes (

*p*s), switching parameters (δ, α, and β), and nonsynonymous/synonymous rate ratios (ωs), were estimated by means of maximum likelihood by using a program written by S.G., which is available on request.

_{x}## Results

**Likelihood Analysis.** Table 1 shows the maximum values of the log likelihood (ln*L*), nonsynonymous/synonymous rate ratios (ω_{1}, ω_{2}, and ω_{3}) and equilibrium frequencies of the three selection classes (*p*_{1}, *p*_{2}, and *p*_{3}) obtained under the six models examined in this study, for the eight HIV-1 *env* data sets. We first tested the null hypothesis that switching between selection categories did not occur during the history of these sequences.

Under the hypothesis that M2 describes perfectly the substitution process, twice the difference of log likelihoods obtained under M2 and M2+S1 asymptotically follows a 50:50 mixture of and distributions (see *Methods*). The same holds true for the comparison M3 vs. M3+S1. For both the M2 and M3 models, the null hypothesis that the site-specific selection pattern remains constant across lineages is rejected for each of the eight data sets. The smallest difference is obtained with the patient 2 data set by comparing the log likelihood obtained under M3 and M3+S1. The probability of such a difference or more if sequences evolved under M3 is *p* 1.48 × 10^{–4}. Therefore, the site-specific variation of selection regime is likely to have played an important role during the evolution of the sequences analyzed in this study.

We now examine the second null hypothesis posed above, that the pattern of switching from one category to another is not strongly biased (i.e., we test the null hypothesis: α = β = 1). Here, we compare the simple switching model S1 with the S2 model, which allows α and β to freely vary. Under the hypothesis that sequences evolved under M2+S1, twice the difference of log likelihood obtained under M2+S1 and M2+S2 now asymptotically follows a distribution. This finding also holds for the comparison M3+S1 vs. M3+S2. The largest differences in log likelihood values are observed in patient 8 by comparing M2+S1 vs. M2+S2 and M3+S1 vs. M3+S2, and in patient 7, by comparing M2+S1 vs. M2+S2. These three differences are statistically significant at the 5% level. However, these are the only cases where switches between selection patterns are likely to be biased. For all of the other combinations of data sets and switching models analyzed here, we cannot reject the null hypothesis that α = β = 1.

It is also interesting to note that the differences of log likelihood obtained under M2 and M3 decrease when switches between selection regimes are taken into account. Under the hypothesis that sequences evolved under M2, twice the difference of log likelihood obtained under M2 and M3 asymptotically follows a distribution. This finding also holds true for the comparison M2+S1 vs. M3+S1 and M2+S2 vs. M3+S2. The null hypothesis is rejected at the 1% level for each or the eight data sets when comparing M2 with M3. This is no longer the case when comparing M2+S1 with M3+S1: the null hypothesis is rejected at the 5% level for only three data sets (P3, P7, and P9). The comparison M2+S2 vs. M3+S2 leads to the same conclusions. Hence, the description of the substitution process given by M2 is often not significantly worse than the one given by M3 if switches between selection patterns at individual sites are modeled. From a statistical point of view, it is satisfactory to note that, in most cases, taking into account switches between selection regimes allows one to put a limit on the number of free rate parameters that are needed to describe the distribution of nonsynonymous/synonymous rate ratios. The comparison of log likelihoods obtained under M2+S1 and M3 is perhaps even more convincing: M2+S1 (five rate free parameters) provides a better description of the data than M3 (six free rate parameters) for the eight data sets analyzed in this study.

**Parameter Estimates.** Estimates of ω_{3} obtained under M3+S are greater than those obtained under M3 for every data set analyzed. A similar tendency is observed with M2+S vs. M2. This result is expected because M2 and M3 define ω_{3} as a site-specific rate ratio. Therefore, the value of this parameter corresponds to an average of different nonsynonymous/synonymous rate ratios over lineages. This averaging effect is less important with M2+S and M3+S because these models accommodate site-specific variation in the nonsynonymous/synonymous rate ratio. This result shows that the strength of positive selection that acts on the eight HIV-1 *env* sequences is likely to be underestimated when the site-specific variation of selection processes across lineages is not taken into account.

Note also that the negative selection and neutral process equilibrium frequency estimates (*p*_{1} and *p*_{2}) vary greatly from M2 to M2+S, with *p*_{1} being always smaller under M2 than under M2+S. Sites at which only a few nonsynonymous substitutions occurred as compared with synonymous substitutions provide an explanation. Under M2, the probability for such sites to have been generated under a negative selection process is low because of the presence of nonsynonymous substitutions. Under M2+S, these sites have a higher probability of having been generated under a negative selection regime if the nonsynonymous substitutions are clumped. Such sites are also well described by models that allow intermediate values of the nonsynonymous/synonymous rate ratio (such as M3). This observation explains why the equilibrium frequencies of the different selection regimes obtained under M3 and M3+S are similar (Table 1). Note, however, that estimates of the frequency of negative selection given by M3 are still almost systematically smaller than those given by M3+S. Therefore, even if the tendency to underestimate this parameter is much stronger under M2, estimates obtained under M3 are likely to be biased, too, because the site-specific variation of selection patterns is not taken into account.

**Site-By-Site Analysis.** Codon-based models of substitution provide a suitable framework for the probabilistic detection of positively selected amino acid positions (2, 27). The calculation of the posterior probability of the positive selection regime at each site is straightforward if the substitution model does not allow switches between selection patterns at individual sites. If the model allows such switches, the expected frequency of the positive selection process given the data are based on the calculation of the expected time the substitution process dwells in positive selection (see *Methods*).

When considering the eight HIV-1 *env* data sets together, 18 sites evolved under positive selection according to M3 but not M3+S1. Among these sites, four are of particular interest because they are associated with posterior probability >0.95 under M3, while positive selection is not the most likely regime to have occurred at these sites according to M3+S1.

For each of these four sites, ancestral codon states were inferred by using a joint maximum-likelihood approach (33). This method relies on a stochastic model of sequence evolution and the ancestral states were inferred under M3+S1 (M3 gave very similar results). The branches where substitutions have probably occurred in the corresponding phylogenies were then identified from the comparison of ancestral codon states between two adjacent nodes. The great majority of substitutions inferred from the four sites at which M3 and M3+S1 strongly disagree are likely to be nonsynonymous substitutions. These substitutions are generally clumped on a few branches of the phylogeny instead of being scattered on the whole tree.

Fig. 1 illustrates such an uneven distribution. The positions of the substitutions were inferred from two conflicting sites found in the data set of patient 6. Under the assumption that only one substitution occurred on branches with distinct ancestral codon states at their two ends, substitutions inferred at these two sites are all nonsynonymous substitutions. Their estimated positions are clearly not homogeneously distributed on the phylogeny. Indeed, these substitutions did not occur at the beginning and at the end of the infection for this patient. The conflict between M3 and M3+S1 is best explained by this rather obvious variation of substitution processes across lineages.

**...**

It is also striking that substitutions are located here on the same closely related lineages of the tree, regardless of the site to be considered. A detailed investigation of the role played by these substitutions is outside the scope of this paper. However, it is worth noting that the two sites that show this peculiar pattern of substitutions are located in HIV T-helper epitope regions (34). These observations suggest that the substitutions inferred at the sites where M3 and M3+S1 strongly disagree may be of biological interest.

## Conclusions

We introduce a codon-based model of nucleotide substitution that accommodates variation of natural selection processes across a tree for protein-coding DNA sequences. More precisely, site-specific switches between different values of the nonsynonymous/synonymous rate ratio are explicitly taken into account. This model extends the widely used traditional model of substitution between codons (2, 16). Because the traditional model is a special case of our model, their fit to the data can be compared in a rigorous statistical framework by using likelihood ratio tests.

Whereas the simplest switching model has only one additional parameter as compared with the standard model, the increase in the likelihood is noticeable and statistically significant for each of the eight HIV-1 *env* gene sequence data sets to be considered. Hence, switches between selection patterns at individual sites are an important evolutionary feature here. This feature is recovered under two different hypotheses concerning the processes of substitution between codons (i.e., M2 and M3). This result demonstrates, at least for these data sets, the robustness of our model.

Yang and Nielsen (20) pointed out that “... averaging (the nonsynonymous/synonymous rate ratio) over sites is a more serious problem than averaging over lineages” and showed convincing results that confirm this assertion. However, our results indicate that ignoring the site-specific variation of selection processes may result in an underestimation of the strength of positive selection. Note that this trend should not change drastically the results of analyses that aim to identify sites that evolve under positive selection. However, more accurate estimates to measure the intensity of positive selection are obviously of great interest for deciphering the processes of molecular evolution.

M2 and M2+S models largely agree on the frequency with which positive selection acts on the HIV-1 *env* sequences. However, they clearly diverge when considering the negative selection (ω_{1} = 0) and neutral process (ω_{2} = 1) equilibrium frequencies. Indeed, because the M2 model does not account for the distribution of nonsynonymous and synonymous substitutions on the phylogeny, it tends to overestimate the expected frequency of the neutral process category. This discrepancy between estimates leads to a different interpretation of the role played by purifying selection in HIV-1 *env* sequences. According to M2, these sequences mainly evolve under a neutral process, whereas the most common evolutionary force is negative selection according to M2+S. Therefore, taking into account the site-specific variation of selection processes across lineages leads to a dramatically different assessment of the roles played by distinct selection regimes.

However, it must be stressed that ω_{1} = 0 and ω_{2} = 1 correspond to very stringent definitions of negative selection and neutrality, respectively. M3 is a more flexible model because its three selection classes correspond to strong purifying selection (0 < ω_{1} « 1), weak purifying or diversifying selection (ω_{2} 1), or strong positive selection (ω_{3} > 1). The first class of this model is well suited for describing sites at which a few nonsynonymous substitutions occurred. As a result, differences between equilibrium frequencies estimated under M3 and M3+S are smaller than those observed when comparing M2 and M2+S (Table 1). Nonetheless, it is worth to keep in mind that any model that does not allow site-specific switches between selection regimes will underestimate the role played by negative selection in the presence of sites at which nonsynonymous substitutions are clustered due to episodic adaptive events.

We would also like to warn readers against another potential pitfall related to the use of M2. This model poorly describes the amino acid positions where the number of synonymous substitutions is much larger than the number of nonsynonymous changes. Because these sites are better described by M2+S, the difference of log likelihood obtained under the two models will frequently lead to reject the hypothesis of a constant site-specific selection pattern. However, these sites could actually have been generated under a constant site-specific selection regime with a value of the nonsynonymous/synonymous rate ratio intermediate between 0 and 1. In this case, the log likelihoods obtained under M3 and M3+S will be similar. Hence, it is much safer to test the hypothesis of the constancy of the site-specific selection process by comparing M3 vs. M3+S instead of M2 vs. M2+S.

Probabilistic-based approaches have been developed during the last decades to identify codon positions that are likely to have evolved under diversifying selection. In this context, detecting site-specific variations of the intensity of positive selection or changes in selection regimes is worthwhile. A site-by-site analysis of the eight HIV-1 data sets actually reveals positions at which switches between selection processes have probably occurred. Hence, the model described in this paper is also suited to investigate switches between selection regimes at the single-site level.

**Software Availability.** A program that implements the models described in this paper is available on request from S.G.

## Acknowledgments

We thank G. Ewing, N. Galtier, E. Kassardjian, P. Meintjes, H. Philippe, H. Ross, S. Plön, M. Steel, D. Welch, and two anonymous reviewers for their suggestions. This work was partially supported by grants from the U.S. Public Health Service, a postdoctoral fellowship from the Allan Wilson Centre for Molecular Ecology and Evolution (to S.G.), and National Institutes of Health Grant GM-69801 (to J.P.H.).

## Notes

This paper was submitted directly (Track II) to the PNAS office.

## References

**,**167–170. [PubMed]

**,**929–936. [PMC free article] [PubMed]

**,**7–15. [PMC free article] [PubMed]

**,**11715–11720. [PMC free article] [PubMed]

**,**231–238. [PubMed]

**,**151–154. [PubMed]

**,**2509–2514. [PMC free article] [PubMed]

**,**150–174. [PubMed]

**,**418–426. [PubMed]

**,**32–43. [PubMed]

**,**543–567.

**,**111–112. [PubMed]

**,**372–382. [PubMed]

**,**523–529. [PubMed]

**,**725–736. [PubMed]

**,**431–449. [PMC free article] [PubMed]

**,**1271–1276. [PMC free article] [PubMed]

**,**5885–5890. [PMC free article] [PubMed]

*et al.*(2003) Science 302

**,**1960–1963. [PubMed]

**,**908–917. [PubMed]

**,**13413–13417. [PMC free article] [PubMed]

**,**1252–1259. [PubMed]

**,**558–562. [PubMed]

**,**63–91. [PubMed]

**,**368–376. [PubMed]

**,**605–610.

**,**661–672. [PubMed]

*et al.*(1999) J. Virol. 73

**,**10489–10502. [PMC free article] [PubMed]

**,**4876–4882. [PMC free article] [PubMed]

**,**696–704. [PubMed]

**,**306–314. [PubMed]

**,**313–320. [PubMed]

**National Academy of Sciences**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (294K)

- Extensive purifying selection acting on synonymous sites in HIV-1 Group M sequences.[Virol J. 2008]
*Ngandu NK, Scheffler K, Moore P, Woodman Z, Martin D, Seoighe C.**Virol J. 2008 Dec 23; 5:160. Epub 2008 Dec 23.* - Evolutionary models accounting for layers of selection in protein-coding genes and their impact on the inference of positive selection.[Mol Biol Evol. 2011]
*Rubinstein ND, Doron-Faigenboim A, Mayrose I, Pupko T.**Mol Biol Evol. 2011 Dec; 28(12):3297-308. Epub 2011 Jun 20.* - Not so different after all: a comparison of methods for detecting amino acid sites under selection.[Mol Biol Evol. 2005]
*Kosakovsky Pond SL, Frost SD.**Mol Biol Evol. 2005 May; 22(5):1208-22. Epub 2005 Feb 9.* - Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages.[Mol Biol Evol. 2002]
*Yang Z, Nielsen R.**Mol Biol Evol. 2002 Jun; 19(6):908-17.* - Maximum likelihood methods for detecting adaptive evolution after gene duplication.[J Struct Funct Genomics. 2003]
*Bielawski JP, Yang Z.**J Struct Funct Genomics. 2003; 3(1-4):201-12.*

- Analysis of hypoxia-inducible factor alpha polyploidization reveals adaptation to Tibetan plateau in the evolution of schizothoracine fish[BMC Evolutionary Biology. ]
*Guan L, Chi W, Xiao W, Chen L, He S.**BMC Evolutionary Biology. 14(1)192* - Subcellular Relocalization and Positive Selection Play Key Roles in the Retention of Duplicate Genes of Populus Class III Peroxidase Family[The Plant Cell. 2014]
*Ren LL, Liu YJ, Liu HJ, Qian TT, Qi LW, Wang XR, Zeng QY.**The Plant Cell. 2014 Jun; 26(6)2404-2419* - PSP: rapid identification of orthologous coding genes under positive selection across multiple closely related prokaryotic genomes[BMC Genomics. ]
*Su F, Ou HY, Tao F, Tang H, Xu P.**BMC Genomics. 14924* - Evolution of the bHLH Genes Involved in Stomatal Development: Implications for the Expansion of Developmental Complexity of Stomata in Land Plants[PLoS ONE. ]
*Ran JH, Shen TT, Liu WJ, Wang XQ.**PLoS ONE. 8(11)e78997* - Molecular Adaptation during a Rapid Adaptive Radiation[Molecular Biology and Evolution. 2013]
*Kapralov MV, Votintseva AA, Filatov DA.**Molecular Biology and Evolution. 2013 May; 30(5)1051-1059*

- Modeling the site-specific variation of selection patterns along lineagesModeling the site-specific variation of selection patterns along lineagesProceedings of the National Academy of Sciences of the United States of America. Aug 31, 2004; 101(35)12957PMC

Your browsing activity is empty.

Activity recording is turned off.

See more...