Proc Natl Acad Sci U S A. Apr 21, 2009; 106(16): 6700–6705.
Published online Apr 1, 2009. doi:  10.1073/pnas.0901855106
PMCID: PMC2672471
Evolution

Reliabilities of identifying positive selection by the branch-site and the site-prediction methods

Abstract

Natural selection operating in protein-coding genes is often studied by examining the ratio (ω) of the rates of nonsynonymous to synonymous nucleotide substitution. The branch-site method (BSM), based on a likelihood ratio test, is one such test for detecting positive selection on a predetermined branch of a phylogenetic tree. However, because the number of nucleotide substitutions involved is often very small, we conducted a computer simulation to examine the reliability of BSM in comparison with the small-sample method (SSM) based on Fisher's exact test. The results indicate that BSM often generates false positives compared with SSM when the number of nucleotide substitutions is ≈80 or smaller. Because the ω value is also used for predicting positively selected sites, we examined the reliabilities of the site-prediction methods, using nucleotide sequence data for the dim-light and color vision genes in vertebrates. The results showed that the site-prediction methods have a low probability of identifying experimentally determined functional changes of amino acids and often falsely identify other sites where amino acid substitutions are unlikely to be important. This low rate of predictability occurs because most of the current statistical methods are designed to identify codon sites with high ω values, which may not have anything to do with functional changes. The codon sites showing functional changes generally do not show a high ω value. To understand adaptive evolution, some form of experimental confirmation is necessary.

Keywords: branch-site method, small-sample method

In the current statistical methods of inferring positive selection using the ω value, it is assumed that ω > 1, ω = 1, and ω < 1 represent positive, neutral, and negative selection, respectively (1). One of the statistical methods using this approach is the branch-site method (BSM) (2, 3). In this method, the branches of a phylogenetic tree are divided into a predetermined (foreground) branch and other (background) branches and codon sites are grouped into a few classes with different ω values (see Methods). The log likelihood (lnL) for the selection model used (modified model A) is then compared with that for the null model of no positive selection (ω ≤ 1), and the likelihood ratio test (LRT) is conducted to determine whether positive selection is operating in the foreground branch. This method has been widely used (e.g., 4–7), and one of the recent applications is Bakewell et al.'s (5) large-scale analysis of orthologous gene trios from humans, chimpanzees, and macaques. In this case, however, the numbers of synonymous (cS) and nonsynonymous (cN) substitutions per gene per branch were so small that the applicability of the large-sample theory of LRT is questionable.

Another test that is applicable to this type of dataset is the small-sample method (SSM) using Fisher's exact test (8). In this method the ancestral nucleotide sequence at each interior node is inferred by the parsimony method, and cS and cN for the branch to be tested are counted by comparing the sequences at the 2 terminal nodes of the branch. Positive selection is inferred when the cN/cS ratio is significantly higher than the ratio expected under the assumption of no selection. When cS and cN are small, the probability of occurrence of 2 or more substitutions at the same nucleotide site is negligibly small, and therefore parsimony estimates of cS and cN should be quite accurate. This is true even if the substitution rate varies with codon site to some extent. SSM should then be applicable to the primate dataset, and the results can be compared with those of BSM.

The ω value has also been used for predicting the positively selected codon sites in protein-coding genes (9–12). However, simulation studies showed that the Bayesian methods for predicting such sites often give false positives (13, 14). In fact, Yokoyama et al.'s (15) experimental study showed that these methods are not useful for identifying adaptive sites. These authors engineered the ancestral proteins of the dim-light vision opsins (RH1) from vertebrates and experimentally determined the critical amino acid substitutions that affect the maximum absorption wavelength (λmax) of the opsin (rhodopsin) encoded. Because the spectral tuning of λmax and the environmental condition of species were well correlated, these amino acid changes were considered to be adaptive. However, the Bayesian methods could not identify any of these critical sites. Because the critical amino acid changes affecting λmax have also been identified in color vision genes (ref. 16 for review), we can extend this type of analysis to these genes as well.

In this article, we first examine the reliability of BSM in comparison with SSM by using a computer simulation. We are particularly interested in evaluating the false-positive rates of BSM and clarifying their causes. We then study the reliabilities of the Bayesian and other statistical methods for detecting positively selected sites by using real sequence data.

Results

Computer Simulation Mimicking the Primate Data.

Our computer simulation for studying the reliability of BSM was done by mimicking Bakewell et al.'s (5) analysis of genes from the human-chimpanzee-macaque trios (Fig. 1). These authors considered the human or chimpanzee lineage as the foreground branch and the remaining lineages as the background. They examined ≈14,000 (actually 13,888) orthologous genes with an average of ≈450 (actually 432) codons. The average numbers of synonymous substitutions per synonymous site (bS) for the human, chimpanzee, and macaque lineages were ≈0.006, ≈0.006, and ≈0.06, respectively. [In the following, we use the notation bS = (0.006, 0.006, 0.06) for this case.] The average ω over all codon sites in these lineages was ≈0.25 (17). The transition/transversion rate ratio (κ) was ≈4 (18). On the basis of this information, we generated 14,000 sets of the human-chimpanzee-macaque trio sequences by a computer simulation (Fig. 1; see Methods for details). Because the ω value used was 0.25 for all sites, there were no sites under positive selection. Therefore, any site with an estimate (ω̂) of ω > 1 must be caused by sampling or estimation errors. When we applied BSM to the 14,000 sets of genes considering the human lineage as the foreground branch, we obtained 32 genes showing positive selection at the 5% significance level (α = 0.05) by using the computer program PAML 4 (19) (Table 1). (The results of BSM were obtained by PAML 4 unless otherwise stated.) By contrast, SSM based on Fisher's exact test showed no genes suggesting positive selection at the same significance level.

Fig. 1.
Phylogenetic tree showing the simulation scheme. The foreground branch is shown by a bold line. In all simulations, κ is 4 and the number of codon sites (n) is 450. Not to scale.
Table 1.
False-positive cases (P < 0.05) obtained by BSM in a computer simulation with n = 450, κ = 4, ωF = ωB = 0.25, and bS = (0.006, 0.006, 0.06)

One might wonder why SSM did not detect any positive selection. The reason is that cS and cN were both too small to give any statistical significance. In the case of SSM, positive selection is suggested only when cN is significantly greater than cS, and for this to happen cN must be 9 or greater even when cS = 0 (Table S1). In practice, cN was always equal to or smaller than 7 except for 2 cases, in which the cN/cS was 8/2 and 9/3. This indicates that the statistical information of the dataset used is not enough to get a significant result and that all of the 32 cases obtained by BSM are false positives.
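The threshold quoted above can be illustrated with a one-sided Fisher's exact test on a 2 × 2 table of changed versus unchanged nonsynonymous and synonymous sites. The following Python sketch is not the authors' program. The site counts (945 nonsynonymous and 405 synonymous sites for a 450-codon gene) are assumed round numbers chosen only for illustration, the function name is hypothetical, and the substitution counts are treated as integers although parsimony estimates can be fractional.

```python
from math import comb

def ssm_fisher_p(c_n, c_s, n_sites, s_sites):
    """One-sided Fisher's exact P value for an excess of nonsynonymous changes.

    c_n, c_s         : integer counts of nonsynonymous and synonymous substitutions
    n_sites, s_sites : integer numbers of nonsynonymous and synonymous sites
    Returns P(X >= c_n), where X is the hypergeometric number of changes that
    fall on nonsynonymous sites when c_n + c_s changes are placed at random.
    """
    total_sites = n_sites + s_sites
    total_changes = c_n + c_s
    denom = comb(total_sites, total_changes)
    tail = 0
    for k in range(c_n, min(total_changes, n_sites) + 1):
        # k changes at nonsynonymous sites, the remainder at synonymous sites
        tail += comb(n_sites, k) * comb(s_sites, total_changes - k)
    return tail / denom

# Assumed (illustrative) site counts for a 450-codon gene; not taken from the paper.
N_SITES, S_SITES = 945, 405

for c_n, c_s in [(8, 0), (9, 0), (8, 2), (9, 3)]:
    print(c_n, c_s, ssm_fisher_p(c_n, c_s, N_SITES, S_SITES))
```

Under these assumed site counts, cN = 9 with cS = 0 falls below the 5% level whereas cN = 8 with cS = 0 does not, and the two largest simulated cases (8/2 and 9/3) are far from significant. The exact threshold depends on the actual numbers of sites, which are given in Table S1 rather than here.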

In the present simulation, we used the parsimony method for estimating cS and cN. However, because we recorded all mutations in the evolutionary process using a discrete-time model, we can use the true values of cS and cN for SSM. Table 1 shows that the parsimony estimates are often decimal, because there are 2 or more equally parsimonious pathways when 2 or more nucleotide differences exist between the 2 codons compared (21). In the computer simulation, we can identify which pathway was used so that the true numbers of substitutions are always integer. When these true numbers of cS and cN were used, virtually no changes in P values (type I error rates) occurred in SSM, and none of the genes showed positive selection at the 5% significance level. Some authors (22) have been critical of parsimony estimates of cS and cN and consequently of the methods based on parsimony estimates. In the present case, however, SSM is clearly more reliable than BSM.

Intuitively, one might expect that the P value for BSM becomes low when the cN/cS ratio is high, but Table 1 shows that this is not necessarily the case. This result was apparently obtained because LRT is seriously affected by sampling errors when the number of nucleotide substitutions is small and because the regularity conditions for the χ² approximation are not satisfied in this method (3). Fig. 2 shows that the P value for SSM was always >0.1, indicating that the small cS and cN values are not informative enough to generate a significant result in any replication.

Fig. 2.
Relationship between the P values for BSM and SSM. n = 450, κ = 4, ωF = ωB = 0.25, and 14,000 replications.

Table 1 also shows the estimates (ω̂2) of ω (= ω2) for the group of codons for which positive selection was inferred by BSM (see Methods). A surprising observation is that the ω̂2 values in the false-positive cases are all >70 and some are as high as 999, which is the maximum value that is printed by PAML 4 (19). Analyzing all 14,000 cases, we found that the ω̂2 value tends to be higher when P is small than when P is large (Fig. S1). The average ω̂2 value for 14,000 cases was 56.6. These ω̂2 values are obviously erroneous because the true value is 0.25 for all codons.

One might argue that there is no need to worry about this type of abnormal behavior of LRT because the observed false-positive rate (32/14,000 = 0.23%) is lower than the expected rate (5%) in large-sample tests. In the present case, however, BSM produces significant results when these results are not supposed to be obtained theoretically. This indicates that there is a computational problem in BSM. In addition, the false-positive rate in BSM can be >5% even under the condition of ω ≤ 1, as will be shown below.

False-Positive Rates when ω Varies with Branch.

So far we considered only the case where ω = 0.25 for both foreground and background branches. In reality, ω may be different between the foreground (ωF) and background (ωB) branches because of changes in functional constraints of the gene or some other factors. We have therefore considered several cases of different ωF and ωB values (Table 2, A). In this simulation, we generated 1,000 sets of genes for each case. In the first case of Table 2, A, we assumed ωF = 0.25 and ωB = 0.5. All other parameters were the same as before. In this case, 4 of the 1,000 genes showed positive selection in BSM, but none was detected by SSM. When we assumed ωF = 0.5 and ωB = 0.25, 0.5, or 1.0, BSM falsely detected positive selection with appreciable frequencies (0.9–1.6%), but SSM showed none. When ωF = 1 and ωB = 0.25, 0.5, or 1.0, the false-positive rate of BSM was ≈6% irrespective of the ωB value. This rate is slightly higher than the expected false-positive rate (5%) in large-sample tests. By contrast, the false-positive rates for SSM were still ≈1%. This low rate indicates that the number of nucleotide substitutions was still too small to be informative for detecting positive selection. Because SSM is based on Fisher's exact test and the errors introduced by parsimony estimation of nucleotide substitutions are minor, the false-positive rate of BSM is apparently inflated by sampling errors. To see the effect of gene size, a similar simulation was conducted by using 900 codons instead of 450 codons. However, the results obtained were essentially the same as those of Table 2, A.

Table 2.
Numbers (percent) of false positives obtained by BSM and SSM in a computer simulation with n = 450, κ = 4, and 1,000 replications

In the above computer simulation, the number of nucleotide substitutions per site was small because the human-chimpanzee-macaque trio was considered. However, BSM is also used for groups of species that are more genetically divergent. We therefore extended our simulation to the case of greater bS values [bS = (0.06, 0.06, 0.1)] for species 1, 2, and 3 (Fig. 1). Roughly speaking, bS = 0.06 for the foreground branch corresponds to a divergence time of ≈60 million years (MY) if humans and chimpanzees diverged ≈6 MY ago, whereas the total divergence time for the 3 species is ≈80 MY [corresponding to bS = (0.06 + 0.1)/2 = 0.08]. Therefore, these divergence times are similar to those of primates and placental mammals, respectively (23).

The results of this simulation are presented in Table 2, B. The false-positive rates for BSM were more or less the same as those in Table 2, A. These results indicate that a larger number of nucleotide substitutions (e.g., ≈80 substitutions when ωF = 1) does not improve the reliability of BSM. In the case of ωF = 1, however, SSM showed a higher false-positive rate when bS was large than when it was small, as expected from the increased number of nucleotide substitutions. Yet, the false-positive rates were still <5% and lower than those by BSM. In addition, there was no significant case when ωF = 0.25 or 0.5. Note that the true values of cS and cN gave essentially the same results in SSM (Table 2), indicating that the parsimony estimates of cS and cN are quite accurate. This accuracy was also confirmed by the small values (mostly <10%) of the average deviation (D) of parsimony estimates from the true values (Table 2).

These results were obtained from the simulation based on a discrete-time model with a time unit of bS = 0.0005 (see Methods). We also conducted a simulation with a continuous-time model, using the computer program “evolverNSbranches.exe” in PAML 4. The results of the simulation using this model were essentially the same as those for the discrete-time model (Tables S2 and S3 and Fig. S2). It should be noted that a similar computer simulation mimicking the sequence evolution of the human-chimpanzee-macaque trios was conducted by Bakewell et al. (5) and Suzuki (24). Suzuki examined the false-positive rates for the cases of ωF = 1 and ωB = 0.25 or 1 with 450 and 750 codons and obtained rates of 7–8% instead of the expected rate of 5% in large-sample tests. Bakewell et al.'s simulation for the cases of ωF = 1 with 400 and 1,000 codons also showed a false-positive rate of 6–8%. These results show excessively high false-positive rates although the rate for SSM was not computed in these studies.

Data Analysis for Evaluating the Accuracy of Site-Prediction Methods.

In addition to BSM, the ω approach is used for detecting positively selected sites. For this purpose, the Bayesian [M8 (9), and REL (11)] and likelihood [FEL (11)] methods are commonly used. We therefore examined the reliabilities of these methods, using real data. The data used here were the dim-light and color vision genes [RH1, RH1-like (RH2), short wavelength-sensitive type 1 (SWS1), SWS type 2 (SWS2), and middle and long wavelength-sensitive (M/LWS) genes] in vertebrates. In these genes potentially adaptive amino acid substitutions that affect the optimal light sensitivity measured by λmax have been experimentally identified (e.g., 15, 16, 25). Using this information, we compared statistically predicted sites of positive selection with experimentally determined adaptive sites.

The results are shown in Table 3. In RH1 genes, many sites were predicted by the Bayesian methods and one site by the likelihood method when squirrelfish species were used. However, none of these sites agreed with the experimentally determined adaptive sites. In addition, most of the predicted sites disappeared when all vertebrate species were used for the analysis, as reported in ref. 15. In RH2 genes, the Bayesian methods did not detect any sites, and one site detected by the likelihood method did not agree with the adaptive sites experimentally determined. In SWS1 and M/LWS genes a few adaptive sites were correctly identified by the statistical methods when closely related species were used. Yet, most of the adaptive sites could not be detected by these methods. In SWS2 genes none of the sites was predicted as positively selected. These results indicate that in most cases the current statistical methods for site-prediction with the ω value cannot detect the adaptive sites, and instead they often falsely identify other sites as positively selected.

Table 3.
Positively selected sites by the site-prediction methods and experimentally determined adaptive sites in dim-light and color vision genes in vertebrates

A different method called the DEPS method (28) was recently developed for predicting directional amino acid substitutions that may have changed protein function. This method does not rely on the ω value but uses the general pattern (baseline) of amino acid substitutions such as the JTT matrix. If a particular type of amino acid substitution occurs more frequently than the baseline matrix of amino acid substitutions, the amino acid substitution is assumed to be adaptive. The predicted sites by DEPS are also shown in Table 3. In RH1 genes, DEPS predicted 29 sites and 4 of them agreed with the experimentally determined adaptive sites. However, when only squirrelfish species were used, all of these predicted sites disappeared. In SWS1 and M/LWS genes, a few sites were correctly predicted as in the case of the ω based methods. In RH2 and SWS2 genes, however, none of the predicted sites agreed with the adaptive sites experimentally determined. (See Table S4 for the results obtained when the general time-reversible protein model was used.) These results indicate that DEPS also does not work well in predicting adaptive sites.

Why Does the Statistical Inference of Positive Selection Fail?

One obvious answer to this question is the effect of sampling errors (13, 14), but the major factor for the failure appears to be the inadequacy of the mathematical model of nucleotide or amino acid substitution used. In both Bayesian and likelihood methods, synonymous substitutions are assumed to be neutral in the codon substitution model and the rate of nonsynonymous substitution is ω times higher than the rate of synonymous substitution, ω being the same for all nonsynonymous substitutions occurring in the same codon. Furthermore, the current Bayesian and likelihood methods all attempt to identify codon sites where the ω̂ value is significantly >1, and these sites are regarded as being under positive selection. For this reason, the average ω̂ value for these predicted sites is >1 even when ω̂ was computed by the conservative Suzuki-Gojobori method (29) (Table 4).

Table 4.
Average ω̂ values for positively selected sites by site-prediction methods and experimentally determined adaptive sites in dim-light and color vision genes in vertebrates

However, the average ω̂ value for experimentally determined adaptive sites was much lower than that for the sites statistically inferred in all genes examined and was always <1 (Table 4). Why did this happen? The answer is that the functional change of a protein often occurs by replacement of a specific amino acid by another specific amino acid at 1 or a few codon positions (see ref. 30 for review). For example, the SWS1 gene appears to have encoded a violet-sensitive opsin in the ancestor of birds, but the opsin became sensitive to UV in zebra finch, budgerigar, and canary (25). This change was caused by a single amino acid change from serine to cysteine at position 90 in the ancestor of these birds, and other amino acid changes were unlikely to be important (16, 25). In this case, it is quite difficult to detect this site by the statistical methods because adaptive substitution occurs very rarely. In fact, none of the statistical methods predicted this site even when only bird sequences were used. By contrast, the codon sites where many amino acid substitutions occurred may be falsely predicted as positively selected sites because of a high ω value that is obtained by chance even if the substitutions were essentially neutral (31). Note that >90% of amino acid substitutions are conservative and do not change protein function appreciably (15, 30, 32). For this reason, it is not easy to predict the evolutionary changes of protein function statistically.

A similar problem occurs with the DEPS method as well. In this method the amino acid changes that occur more often than the baseline expected from a given substitution matrix are regarded as adaptive. However, there is no reason to believe that these changes are adaptive, if only specific amino acid substitutions that occur rarely are adaptive. Furthermore, the theoretical basis of this method is not well established, because their baseline substitution matrix is not neutral but includes adaptive and conservative substitutions. Note that the baseline matrix is usually constructed from empirical data including all kinds of amino acid substitutions.

Discussion

We have shown that BSM gives false prediction of positive selection when the number of nucleotide substitutions in the foreground branch is small. This is apparently caused by the inadequacy of the statistical model used in BSM. For example, in the case of bS = (0.006, 0.006, 0.06), 450 codons (n = 450), and ω = 0.25, the number of substitutions (cS + cN) in the foreground branch was ≈4 on average, but we have to estimate the 6 parameters, p0, p1, p2a, p2b, ω0, and ω2 from this small number (see Methods). (The number of independent parameters is 4 instead of 6, because p2a and p2b are computed from p0 and p1.) Obviously, the number of substitutions is insufficient for obtaining reliable estimates of the parameters.
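The order of magnitude of these substitution counts can be checked with simple arithmetic. The Python sketch below is an illustration rather than the simulation used in this study: it assumes that 30% of the 3n nucleotide sites are synonymous (a round figure not taken from the paper) and uses cS = bS × S and cN = ω × bS × N, where S and N are the numbers of synonymous and nonsynonymous sites. The function name and the `syn_fraction` parameter are hypothetical.

```python
def expected_counts(n_codons, b_s, omega, syn_fraction=0.30):
    """Rough expected numbers of substitutions on one branch.

    n_codons     : number of codons (n)
    b_s          : synonymous substitutions per synonymous site (bS)
    omega        : nonsynonymous/synonymous rate ratio on the branch
    syn_fraction : ASSUMED fraction of nucleotide sites that are synonymous
    Returns (c_s, c_n), the expected synonymous and nonsynonymous counts.
    """
    sites = 3 * n_codons
    s_sites = syn_fraction * sites   # synonymous sites (S)
    n_sites = sites - s_sites        # nonsynonymous sites (N)
    c_s = b_s * s_sites              # cS = bS * S
    c_n = omega * b_s * n_sites      # cN = omega * bS * N
    return c_s, c_n

# Primate-like foreground branch: n = 450, bS = 0.006, omega = 0.25
print(sum(expected_counts(450, 0.006, 0.25)))
# More divergent foreground branch: n = 450, bS = 0.06, omega = 1
print(sum(expected_counts(450, 0.06, 1.0)))
```

With the assumed 30% synonymous fraction, the first case gives a total close to 4 substitutions and the second a total close to 80, in line with the figures quoted in the text. A different synonymous fraction would shift both totals somewhat.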

In fact, the estimates (p̂0, p̂1, p̂2a, p̂2b, ω̂0, and ω̂2) of the parameters varied widely among different replications. For example, the estimates of the parameters for 5 randomly chosen nonsignificant and 5 randomly chosen significant replications in the case of bS = (0.006, 0.006, 0.06) and ωF = ωB = 1 are given in Table 5 (see also Table S5). Because p1 represents the proportion of class 1 sites that are assumed to be under no selection, p̂1 should be close to or equal to 1 at least in nonsignificant cases. However, the 5 nonsignificant cases showed that p̂1 varies from 0 to 0.97. The ω̂2 value also ranged from 1 to 50, although this value should be 1 theoretically because no selection was assumed. In significant cases, p̂1 was 0 except in case 2 and ω̂2 varied from 156 to 999. These wild variations of parameter estimates were apparently generated by sampling errors and the failure of the regularity conditions for the χ² approximation in LRT mentioned earlier. Note that p̂2 (sum of the estimates of the proportions of site classes 2a and 2b, which are assumed to be under positive selection for the foreground branch) was 1 in many cases (389 of 1,000 replications). This is unreasonable because no positive selection was assumed. Therefore, the results of BSM applied to primate data are not reliable.

Table 5.
Estimates for the 6 parameters in the BSM analysis of 5 nonsignificant and 5 significant cases when bS = (0.006, 0.006, 0.06) and ωF = ωB = 1

This unreliability of parameter estimates obtained by BSM is also revealed by their sensitivity to differences in the computational procedure. Table 1 shows the P and ω̂2 values obtained by the computer programs HyPhy (20) and PAML 4 (19). The likelihood maximization procedures in the 2 programs are somewhat different, and this difference alone gave very different conclusions about the statistical significance in cases 8 and 28. In these 2 cases, PAML 4 gave a small P value, whereas HyPhy gave P = 1. Very different results were also obtained by PAML 4 and HyPhy for the case of the continuous-time model (Table S2). Note also that if we estimate the frequency of each codon (π) from the data rather than by assuming π = 1/61, the results of BSM again change considerably in both PAML 4 and HyPhy (Table S6).

Another indication of the difficulty of obtaining reliable likelihood estimates of parameters by BSM is the fact that when multiple nonsynonymous substitutions occur in a codon the gene is often identified as positively selected even if no positive selection actually operates in the gene (24). We call this the Suzuki effect, and this effect causes an erroneous identification of positive selection when closely related species are studied. The well-known recommendation of the use of multiple initial ω values in PAML 4 is also a clear indication of difficulties of obtaining maximum likelihood estimates.

However, a more serious problem is the inadequacy of the ω approach, as was shown with respect to the statistical prediction of positively selected sites. If the ω approach is not applicable, BSM or any other method using ω would give questionable results. We have also shown that the prediction of adaptive amino acid substitution by the DEPS method (28) often disagrees with the experimentally determined adaptive sites.

What should we do if we want to study the adaptive significance of amino acid substitutions? The best way would be to use site-directed mutagenesis or similar techniques and study the functional or fitness change due to a specific amino acid substitution experimentally (16, 33). For example, it was experimentally shown that a high-virulence strain of West Nile virus in American crows is caused by a single amino acid substitution from threonine to proline at position 249 of NS3 helicase (34). A study combining experimental and statistical methods also identified natural selection in the digestive RNase genes in leaf-eating monkeys (35). In some proteins, however, this type of experimental study may be difficult to conduct. In such cases statistical tests of selection may be useful if the study takes the available biochemical data into account. For example, major histocompatibility complex (MHC) loci are known to be extraordinarily polymorphic, but the cause of this polymorphism was not known until Hughes and Nei (1, 36) showed that the ω value was significantly >1 at the antigen binding region (ABR) of the MHC molecules and that the ABR tends to include amino acid substitutions that cause charge changes of the molecules (37). From these observations, Hughes and Nei (1, 36) concluded that the MHC polymorphism must be caused by some kind of balancing selection. In general, statistical methods in combination with biological information may be useful for immune systems or antigenic genes.

In the above discussion we considered a model in which ω varies with codon site within a gene. Originally, however, ω was proposed to measure the direction and extent of selection operating on an entire gene. In this case ω is computed by using the average rates of synonymous and nonsynonymous substitutions for the entire nucleotide sequences of a gene (21). For this purpose, ω is still useful because the sampling error of this ω is generally small. The cN/cS value for the entire gene used in SSM is also useful for detecting positive selection when a new function of a gene evolves as a result of many nucleotide changes in the same direction [e.g., generating cationic proteins (38)].
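As an illustration of a whole-gene ω, the Python sketch below computes the ratio of nonsynonymous to synonymous distances from counts of differences and numbers of sites, applying the Jukes-Cantor correction for multiple hits to each proportion. This is one common way of obtaining such an estimate and is offered as an assumption, not as the procedure used in this study; the function names and the input values are hypothetical.

```python
from math import log

def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * log(1 - 4 * p / 3)

def gene_omega(c_n, c_s, n_sites, s_sites):
    """Whole-gene omega = dN / dS from difference counts and site counts.

    c_n, c_s         : numbers of nonsynonymous and synonymous differences
    n_sites, s_sites : numbers of nonsynonymous and synonymous sites
    """
    d_n = jukes_cantor(c_n / n_sites)
    d_s = jukes_cantor(c_s / s_sites)
    if d_s == 0:
        raise ValueError("omega is undefined when there is no synonymous change")
    return d_n / d_s

# Hypothetical gene: 9 nonsynonymous differences over 900 sites,
# 12 synonymous differences over 300 sites.
print(gene_omega(9, 12, 900, 300))
```

Because the estimate pools all sites of the gene, both distances rest on many more substitutions than a per-site or per-branch estimate, which is the reason given in the text for its smaller sampling error.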

The fitness of an individual is a complex character and is determined by a large number of genes particularly with respect to morphological characters. Therefore, even if some gene experiences a functional change, it may not necessarily affect the fitness of the individual. It is interesting to note that even the selective advantage of trichromatic color vision over the dichromatic vision has been disputed in New World monkeys (39, 40). It is important not to be overenthusiastic about statistical signatures of positive selection without biological confirmation.

Methods

Computer Simulation.

In generating DNA sequences we used the discrete-time model to compute the true cS and cN values for each evolutionary lineage. The 3 nucleotide sequences in Fig. 1 were generated by using the codon substitution model (19). The equilibrium frequencies were assumed to be the same for all 61 sense codons (π = 1/61) and κ = 4. The evolutionary time unit as measured by bS was 0.0005 in our simulation. To confirm the accuracy of our computation, we also generated sequences by using the continuous-time model “evolverNSbranches.exe” in PAML 4 (19).

Statistical Methods.

After generating sequences, we conducted the BSM analysis, using the program codeml.exe in PAML 4. In this method, the branches of a tree are divided into the foreground and the background branches. All codon sites are categorized into classes 0, 1, 2a, and 2b with proportions of p0, p1, p2a, and p2b, respectively. In the modified model A (Table S7), negative selection is assumed to operate on both the foreground and background branches (0 < ω0 < 1) in class 0. In class 1, no selection is assumed to occur for both the foreground and background branches (ω1 = 1). In class 2a, it is assumed that positive selection operates on the foreground branch (ω2 > 1), whereas negative selection operates on the background branches (ω = ω0). In class 2b, positive selection is assumed to operate on the foreground branch (ω = ω2), whereas no selection is assumed for the background branches (ω = ω1 = 1). The null model of no positive selection is the same as the selection model except that no selection is assumed on the foreground branch (ω2 = 1) in classes 2a and 2b. The lnLs for these models are computed, and positive selection is inferred for the foreground branch if the LRT statistic is greater than 3.84, the 5% critical value of the χ² distribution with 1 degree of freedom (3). In this study, we computed lnLs 3 times, using 3 different initial ω values (0.5, 1.5, and 2.5) as recommended. The highest lnL among the 3 trials was used for the computation of LRT. For each replication, we used the estimates (p̂0, p̂1, p̂2a, p̂2b, ω̂0, and ω̂2) of the 6 parameters for the highest lnL. We also conducted the BSM analysis, using the program “YangNielsenBranchSite2005.bf” in HyPhy to compare the results with those by PAML 4.
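The decision rule described above can be sketched in a few lines of Python. The sketch assumes that the lnL values have already been obtained from separate optimization runs, that the highest lnL is taken for the null model as well as for the selection model, and that the statistic is referred to a χ² distribution with 1 degree of freedom; the function name and the lnL values are hypothetical and are not output from PAML 4 or HyPhy.

```python
from math import erfc, sqrt

CHI2_1DF_5PCT = 3.84  # 5% critical value of chi-square with 1 degree of freedom

def branch_site_lrt(lnl_selection_runs, lnl_null_runs):
    """Likelihood ratio test from lnL values of repeated optimization runs.

    Takes the highest lnL among the runs of each model (assumption: the same
    rule is applied to both models), forms 2 * (lnL_selection - lnL_null),
    and converts it to a P value under chi-square with 1 degree of freedom.
    Returns (statistic, p_value, significant_at_5_percent).
    """
    lnl_selection = max(lnl_selection_runs)
    lnl_null = max(lnl_null_runs)
    # A negative difference can only arise from incomplete optimization.
    stat = max(0.0, 2.0 * (lnl_selection - lnl_null))
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value, stat > CHI2_1DF_5PCT

# Hypothetical lnL values from runs started at omega = 0.5, 1.5, and 2.5
stat, p_value, significant = branch_site_lrt(
    [-2010.4, -2008.1, -2009.0],   # selection model (modified model A)
    [-2010.5, -2010.5, -2010.6],   # null model (omega2 fixed at 1)
)
print(stat, p_value, significant)
```

The sketch makes plain how much rests on the optimizer: a single run that reaches a slightly higher lnL for the selection model is enough to move the statistic across 3.84.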

For SSM, the ancestral nucleotide sequence of humans and chimpanzees (or species 1 and 2) (Fig. 1) was inferred by the parsimony method. The cS and cN values and the numbers of synonymous and nonsynonymous sites in the human (or species 1) lineage were then estimated by using the modified Nei-Gojobori method (38) with the ratio of the numbers of transitions to transversions (R) = 2 (κ = 4). The test of neutrality was conducted by using Fisher's exact test (see Table S1).

Real Data Analysis.

We used the dim-light and color vision (RH1, RH2, SWS1, SWS2, and M/LWS) genes in vertebrates. We obtained information about the critical amino acid changes for λmax from the previous studies (15 for RH1 genes, 16 for other genes). The nucleotide sequences of these genes were obtained from GenBank (see SI Text for accession numbers). For RH1 genes, the sequences were provided by Shozo Yokoyama. Detailed procedures are presented in Tables 3 and 4.

Acknowledgments.

We thank Hiroshi Akashi, Sayaka Miura, Naoko Takezaki, and Koichiro Tamura for their help in our computer simulation and Saby Das, Hiroki Goto, Eddie Holmes, Austin Hughes, Sergei Kosakovsky Pond, Bing Li, Bruce Lindsay, Jongmin Nam, and Shozo Yokoyama for their comments on earlier versions of the manuscript. This work was supported by National Institutes of Health Grant GM020293 (to M. Nei) and KAKENHI Grant 17770007 (to Y.S.).

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0901855106/DCSupplemental.

References

1. Hughes AL, Nei M. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature. 1988;335:167–170.
2. Yang Z, Nielsen R. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol Biol Evol. 2002;19:908–917.
3. Zhang J, Nielsen R, Yang Z. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol Biol Evol. 2005;22:2472–2479.
4. Arbiza L, Dopazo J, Dopazo H. Positive selection, relaxation, and acceleration in the evolution of the human and chimp genome. PLoS Comput Biol. 2006;2:e38.
5. Bakewell MA, Shi P, Zhang J. More genes underwent positive selection in chimpanzee evolution than in human evolution. Proc Natl Acad Sci USA. 2007;104:7489–7494.
6. Kosiol C, et al. Patterns of positive selection in six mammalian genomes. PLoS Genet. 2008;4:e1000144.
7. Studer RA, Penel S, Duret L, Robinson-Rechavi M. Pervasive positive selection on duplicated and nonduplicated vertebrate protein coding genes. Genome Res. 2008;18:1393–1402.
8. Zhang J, Kumar S, Nei M. Small-sample tests of episodic adaptive evolution: A case study of primate lysozymes. Mol Biol Evol. 1997;14:1335–1338.
9. Yang Z, Nielsen R, Goldman N, Pedersen AM. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000;155:431–449.
10. Suzuki Y. New methods for detecting positive selection at single amino acid sites. J Mol Evol. 2004;59:11–19.
11. Kosakovsky Pond SL, Frost SD. Not so different after all: A comparison of methods for detecting amino acid sites under selection. Mol Biol Evol. 2005;22:1208–1222.
12. Massingham T, Goldman N. Detecting amino acid sites under positive selection and purifying selection. Genetics. 2005;169:1753–1762.
13. Suzuki Y, Nei M. Reliabilities of parsimony-based and likelihood-based methods for detecting positive selection at single amino acid sites. Mol Biol Evol. 2001;18:2179–2185.
14. Suzuki Y, Nei M. Simulation study of the reliability and robustness of the statistical methods for detecting positive selection at single amino acid sites. Mol Biol Evol. 2002;19:1865–1869.
15. Yokoyama S, Tada T, Zhang H, Britt L. Elucidation of phenotypic adaptations: Molecular analyses of dim-light vision proteins in vertebrates. Proc Natl Acad Sci USA. 2008;105:13480–13485.
16. Yokoyama S. Evolution of dim-light and color vision pigments. Annu Rev Genomics Hum Genet. 2008;9:259–282.
17. Gibbs RA, et al. Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007;316:222–234.
18. Rosenberg MS, Subramanian S, Kumar S. Patterns of transitional mutation biases within and among mammalian genomes. Mol Biol Evol. 2003;20:988–993.
19. Yang Z. PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–1591.
20. Kosakovsky Pond SL, Frost SD, Muse SV. HyPhy: Hypothesis testing using phylogenies. Bioinformatics. 2005;21:676–679.
21. Nei M, Kumar S. Molecular Evolution and Phylogenetics. New York: Oxford Univ Press; 2000.
22. Yang Z, Bielawski JP. Statistical methods for detecting molecular adaptation. Trends Ecol Evol. 2000;15:496–503.
23. Murphy WJ, Pringle TH, Crider TA, Springer MS, Miller W. Using genomic data to unravel the root of the placental mammal phylogeny. Genome Res. 2007;17:413–421.
24. Suzuki Y. False-positive results obtained from the branch-site test of positive selection. Genes Genet Syst. 2008;83:331–338.
25. Shi Y, Yokoyama S. Molecular analysis of the evolutionary significance of ultraviolet vision in vertebrates. Proc Natl Acad Sci USA. 2003;100:8308–8313.
26. Kosakovsky Pond SL, Frost SD. Datamonkey: Rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics. 2005;21:2531–2533.
27. Yang Z, Wong WS, Nielsen R. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol Biol Evol. 2005;22:1107–1118.
28. Kosakovsky Pond SL, Poon AF, Leigh Brown AJ, Frost SD. A maximum likelihood method for detecting directional evolution in protein sequences and its application to influenza A virus. Mol Biol Evol. 2008;25:1809–1824.
29. Suzuki Y, Gojobori T, Nei M. ADAPTSITE: Detecting natural selection at single amino acid sites. Bioinformatics. 2001;17:660–661.
30. Nei M. Selectionism and neutralism in molecular evolution. Mol Biol Evol. 2005;22:2318–2342.
31. Hughes AL, Friedman R. Codon-based tests of positive selection, branch lengths, and the evolution of mammalian immune system genes. Immunogenetics. 2008;60:495–506.
32. Perutz MF. Species adaptation in a protein molecule. Mol Biol Evol. 1983;1:1–28.
33. Jermann TM, Opitz JG, Stackhouse J, Benner SA. Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature. 1995;374:57–59.
34. Brault AC, et al. A single positively selected West Nile viral mutation confers increased virogenesis in American crows. Nat Genet. 2007;39:1162–1166.
35. Zhang J. Parallel adaptive origins of digestive RNases in Asian and African leaf monkeys. Nat Genet. 2006;38:819–823.
36. Hughes AL, Nei M. Nucleotide substitution at major histocompatibility complex class II loci: Evidence for overdominant selection. Proc Natl Acad Sci USA. 1989;86:958–962.
37. Hughes AL, Ota T, Nei M. Positive Darwinian selection promotes charge profile diversity in the antigen-binding cleft of class I major-histocompatibility-complex molecules. Mol Biol Evol. 1990;7:515–524.
38. Zhang J, Rosenberg HF, Nei M. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc Natl Acad Sci USA. 1998;95:3708–3713.
39. Regan BC, et al. Fruits, foliage and the evolution of primate colour vision. Philos Trans R Soc Lond B Biol Sci. 2001;356:229–283.
40. Hiramatsu C, et al. Importance of achromatic contrast in short-range fruit foraging of primates. PLoS ONE. 2008;3:e3356.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences