Copyright © 2007 by The National Academy of Sciences of the USA

Evolution

Simultaneous amino acid substitutions at antigenic sites drive influenza A hemagglutinin evolution

*Institute of Information Science, †Institute of Biomedical Sciences, and ‡Genomics Research Center, Academia Sinica, Nankang, Taipei 115, Taiwan; and §Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, IL 60637

¶To whom correspondence should be addressed. E-mail: whli/at/uchicago.edu

Contributed by Wen-Hsiung Li, February 15, 2007.

Author contributions: A.C.-C.S., M.-S.H., and W.-H.L. designed research; A.C.-C.S., T.-C.H., and W.-H.L. performed research; A.C.-C.S. and T.-C.H. analyzed data; and A.C.-C.S., M.-S.H., and W.-H.L. wrote the paper.

Received January 30, 2007.

Abstract

The HA1 domain of HA, the major antigenic protein of influenza A viruses, contains all of the antigenic sites of HA and is under continual immune-driven selection. To resolve controversies on whether only a few or many residue sites of HA1 have undergone positive selection, whether positive selection at HA1 is continual or punctuated, and whether antigenic change is punctuated, we introduce an approach to analyze 2,248 HA1 sequences collected from 1968 to 2005. We identify 95 substitutions at 63 sites from 1968 to 2005 and show that each substitution occurred very rapidly. The rapid substitution and the fact that 57 of the 63 sites are antigenic sites indicate that hitchhiking plays a minor role and that most of these sites, many more than previously found, have undergone positive selection. Strikingly, 88 of the 95 substitutions occurred in groups, and multiple mutations at antigenic sites sped up the fixation process.
Our results suggest that positive selection has been ongoing most of the time, not sporadic, and that multiple mutations at antigenic sites cumulatively enhance antigenic drift, indicating that antigenic change is less punctuated than recently proposed.

Keywords: amino acid switch, influenza virus, positive selection, virus evolution

Influenza virus A causes flu epidemics or even pandemics that can kill millions of people in 1 year (1–3). Based on the antigenic specificities of the HA or neuraminidase (NA) protein, the influenza A viruses have been divided into 16 HA (H1–H16) and nine NA (N1–N9) subtypes, respectively. The HA protein consists of two domains, HA1 and HA2, and HA1 contains all of the antigenic sites of HA. All subtypes of type A viruses are maintained in aquatic bird populations (2), but only H1N1 and H3N2 have been circulating in human populations. In circulating influenza viruses, antigenic drift is a major process that accumulates mutations at the antibody binding sites in the HA protein, enabling the virus to evade recognition by hosts' antibodies (4). Because such mutations in H3 HA occur often and new variants tend to replace older ones quickly, the evolution of the HA gene of H3 is much faster than that of other subtypes (4, 5).

In the study of influenza evolution, it is important to know what kind of selection pressure operates on HA1 because this knowledge can enhance our understanding of influenza virus evolution as well as vaccine strain prediction. The type of selection is usually inferred by comparing the rates of nonsynonymous (Ka) and synonymous (Ks) substitutions in the sequences under study. Conventionally, Ka > Ks suggests positive selection, Ka < Ks suggests purifying (negative) selection, and Ka = Ks means no selection [e.g., Li et al. (6)].
To avoid a large variance in estimation, the traditional methods for identifying positively selected sites require the use of many codon sites to compute average Ka and Ks, and, consequently, the results have usually been assigned to an amino acid residue region (7). But if positive selection had operated at only a few amino acid sites, these sites would not be identified if the average number of nonsynonymous substitutions was smaller than that of synonymous substitutions in the analyzed region (7–9). Therefore, a number of methods have been proposed to study selection on a site-by-site basis (7–14). Using the Ka/Ks method and a method to determine positive selection along individual branches of a phylogenetic tree, Fitch et al. (9) and Bush et al. (8) identified, respectively, 14 and 18 amino acid sites in H3 HA1 as having undergone positive selection. On the other hand, without assuming that the Ka/Ks ratio was the same for all positively selected codon sites, Suzuki and Gojobori (7) detected only three positively selected codon sites and argued that the inferences of Fitch et al. (9) might have contained some false positives. However, this large difference could be because the method of Suzuki and Gojobori is less powerful than that of Fitch et al. More recently, using codon usage bias to distinguish between diversifying and purifying selection at a codon site, Plotkin and Dushoff (15) identified 25 codon sites as diverse codons from a comparison of 525 viruses isolated from 1968 to 2000. Thus, it remains unsettled as to how many amino acid sites in HA1 have undergone positive selection in the recent past. In this study we try to resolve this issue by developing a new approach. We shall also address the issues of whether positive selection on HA1 is absent most of the time during evolution, that is, it is punctuated, and whether antigenic change in HA1 is punctuated (epochal) as recently proposed (16–18). 
As will be seen below, our analyses suggest that positive selection has been ongoing most of the time and that antigenic change accumulates over time with occasional large changes due to multiple mutations at antigenic sites in HA1.

Results

Frequency Diagrams of Amino Acid Residues. The frequency diagram at an amino acid residue site shows the frequency changes of the amino acids at the site over years or flu seasons. It can readily reveal the temporal dynamics of mutations at any specific site. Such diagrams clearly demonstrate that sites 156 and 145 of HA1 have undergone multiple substitutions from 1968 to 2005 (Fig. 1
Frequency Switches. Frequency diagrams also accentuate drastic changes in mutant frequencies, such as when a new allele quickly predominates over the other alleles at a residue site. We define a “frequency switch” as the replacement of one major amino acid by another between successive years. We further define an “effective switch” as a frequency switch in which the new dominant amino acid at the site later became fixed or almost fixed in the population for at least 1 year; otherwise, the switch is said to be an “ineffective switch.” We identified 95 effective switches and 35 ineffective switches (Fig. 2
Transition Times. Information on how fast the 95 substitutions occurred can be enlightening. For this purpose we computed the transition time from the first appearance of a mutation in the sample to its fixation (or almost fixation) in the population (19). The transition time highlights the dynamics of a sweeping allele. Because the yearly sample size before 1987 was comparatively small, for this part of the analysis we consider only the 51 fixations that occurred after 1986. The shortest transition time was 4 years, and the longest one was 32 years (Fig. 3
Comparison with Previous Results. The facts that almost all of the 95 substitutions occurred at antigenic sites and that the transition time for each substitution was short suggest that most of the 63 sites were under positive selection; this should be especially true for the 23 sites that have undergone at least two substitutions during the period. It is informative to compare these 63 sites with those of previous inferences. Twelve of the 14 positively selected sites inferred by using the Ka/Ks test of Fitch et al. (9) and 15 of the 18 positively selected sites inferred in Bush et al. (8) overlap with our 63 sites. These agreements between their results and ours suggest that the methods they used are considerably more powerful than the method used by Suzuki and Gojobori (7), which detected only three positively selected sites. Additionally, all of the 25 sites inferred from codon usage bias (15) overlap with our 63 sites. Furthermore, the 11 parallel replacements in HA1 identified by Wolf et al. (16) were identified by our analysis, except for V546A, which is in HA2 and thus not included in our analysis. These comparisons show that our method has a higher detection power than previous methods.

Simultaneous Multiple Fixations. A most striking observation from the frequency diagram analysis is the propensity of simultaneous (parallel) multiple amino acid fixations to occur in the same year (Fig. 4
Discussion

Hitchhiking or Positive Selection? Simultaneous fixations can derive from enhancement of antigenic drift, compensatory mutations to retain function, or hitchhiking (17). Note that antigenic enhancement and compensatory mutation both lead to positive selection. Compensatory mutations have been found to occur at receptor binding sites: a subneutralizing level of monoclonal antibody directed specifically against residue 155 resulted in resistant mutants with amino acid substitutions occurring at sites 190 and 226; all three sites are receptor binding sites (SI Fig. 6 and ref. 21). Hitchhiking may explain some, but not the majority, of the fixations because only eight of the 95 fixations were not in the known antigenic epitopes. If hitchhiking was prevalent, it should have occurred at many nonantigenic residue sites, because the number of nonantigenic residue sites is larger than that of antigenic ones (eight substitutions among 181 nonantigenic sites vs. 87 substitutions among 131 antigenic sites; P < 10⁻¹⁹).

A close inspection of the frequency changes for each case of multiple fixations gives insight into the mechanisms that govern their dynamics. First, we use the simultaneous fixations in 1973 as a case study. The first observed mutation was 144D, which appeared at a frequency of 0.09 in 1968. Its frequency increased to 0.2 in the next year, but it disappeared from the sample (sample frequency 0) in the next 2 years. So 144D might first have had a selective advantage for its frequency to increase to 0.2, but the advantage might not have been sufficient to carry it to fixation. Alternatively, 144D might have been out-competed by mutations at other sites, e.g., 63D, which was on the rise from 1968 to 1970. Interestingly, in 1972 the frequency of 144D suddenly increased to 0.94. This was unlikely to result from hitchhiking, because there was no other mutant with a frequency >0.88.
It would be more likely that a combination of 144D with mutants at other antigenic sites conferred a strong antigenic drift, possibly of epistatic nature, so that the frequency increased rapidly. The second mutation to appear was 78G, which increased in sample frequency from 0 in 1969 to 0.33 in 1970 and then to 0.86 in 1971. Because there was no other mutant on the rise from 1969 to 1971 with such a high frequency, we invoke selective advantage as the most likely explanation for the rapid frequency increase of 78G. Similar comments apply to the other mutations that became fixed in 1973. One important feature to note is that when several mutations appeared together they quickly became fixed in the population, suggesting a cumulative or epistatic effect.

Second, for a case of two simultaneous fixations, it is relatively simple to see what happened, so we consider the two fixations in 1989. Mutant 94Y was first observed in 1984 at a sample frequency of 0.13 but disappeared from the sample in 1985 and 1986, and 155H was first observed in 1985 at a sample frequency of 0.05 but disappeared from the sample in 1986. Apparently, neither mutant alone had a selective advantage large enough to carry it to a high frequency. Surprisingly, they reappeared together in 1987 at a sample frequency of 91% and quickly went to fixation together in 1989. A simple explanation is that the two mutations enhanced each other's antigenic effect to increase the selective advantage, so that they could go to fixation together. Another cluster of two fixations, in 1998, followed a similar pattern. One can examine the other cases of multiple fixations in a similar manner and find that in most cases the simplest explanation is that multiple mutations at antigenic sites increase antigenic effects, speeding up the fixation process and increasing the fixation probability.
We note that, until recently, many isolates of HA1 were selected for analysis because of their antigenic diversity and so do not represent an unbiased sample of the influenza A virus. However, the substitution patterns in the earlier years (e.g., 1970s) are similar to those in recent years (Fig. 4

Antigenic Change. We now consider amino acid replacements together with data on antigenic changes. Smith et al. (17) used 273 HA1 sequences (isolates) and hemagglutination inhibition data to infer antigenic clusters and amino acid differences between clusters. We incorporate their data into the temporal map of amino acid frequency (Fig. 4

The antigenic cluster model has led to the interesting inference that genetic change is gradual, whereas antigenic change is punctuated or epochal (17, 18). However, there is considerable variation in cross-immunity among isolates within a cluster, and the antigenic distance between two isolates in the same cluster can be considerably larger than that between two isolates from two adjacent clusters. For example, the three largest pairwise antigenic distances within the BE92 cluster are all >5 units (≈5.6, 5.5, and 5.5 units), whereas the shortest pairwise distance between BE92 and WU95 is only ≈1.2 units (17). Moreover, because only 273 HA1 sequences (isolates) were used in the study, the data did not include the antigenic changes in other available sequences. For these reasons, the current antigenic cluster model may imply a rather discrete nature of antigenic evolution. Indeed, if the transition from HK68 to EN72 occurred abruptly in 1972 as the model implies, then it is difficult to explain the rapid increases in frequency before 1972 for the mutations V78G, V242I, D275G, and N188D.
It is likely that the antigenic drift had already accumulated to a significant extent before 1972 because of the mutations V78G, V242I, D275G, and N188D, which were all at antigenic sites, and then the new mutations T122N, possibly G144D, T155Y, and R207K in 1972 together added a large antigenic change, completing the transition from HK68 to EN72 (Fig. 4

Concluding Remarks. From the analyses in this study, we propose the following scenario for the evolution of HA1. Positive selection appears to be ongoing most of the time. Occasionally, a single mutation at an antigenic site may confer sufficient antigenic drift for the virus to escape the existing herd immunity and become fixed in the population, e.g., mutants 67V and 213V became fixed in 1981 and 1982, respectively (Fig. 4

The present study provides some guidance for rational design of seasonal vaccines. If no new mutation occurs in a season and the frequency of a combination of certain existing mutations (e.g., 307R, 173K, 248T, 67I, 124G, 144V, and 2K in 1983) is on the rise, the combination is likely to be the prevailing strain in the next season (Fig. 4

Materials and Methods

Data Collection. All sequences of the H3N2 HA1 domain were downloaded from the National Center for Biotechnology Information on March 14, 2006. After removing those with a length shorter than 315 codons or without a record of the year of isolation, the number of sequences became 2,248; their years of isolation ranged from 1968 to 2005. We used Muscle (23) to align the amino acid sequences. After eliminating the gap-rich regions at the carboxyl end of the alignment, the final alignment ranges from position 1 to position 312. We clustered the sequences from the same year into one group and obtained 38 groups.

Amino Acid Frequency Diagram.
If the number of sequences with amino acid a_k at the jth position in year t is n(t, j, a_k), the amino acid frequency f(t, j, a_k) at the jth position in year t is given by

f(t, j, a_k) = n(t, j, a_k) / N(t),

where N(t) is the total number of sequences sampled in year t. Therefore, for site j and amino acid a_k we can show the amino acid frequency as a function of t in a diagram.

Frequency Switch. For the frequencies of two different amino acids at a given site j in year t, f(t, j, a_k) and f(t, j, a_m), we say there was a frequency switch between ja_k and ja_m (amino acids a_k and a_m at site j) between years t and t+1 if the following two conditions hold: (i) f(t, j, a_k) + f(t, j, a_m) > 0.7 and f(t+1, j, a_k) + f(t+1, j, a_m) > 0.7; and (ii) [n(t, j, a_k) − n(t, j, a_m)] and [n(t+1, j, a_k) − n(t+1, j, a_m)] have opposite signs and differ significantly in absolute value. The first condition checks whether these two amino acid residues formed the majority of amino acids at site j in years t and t+1. That is, either a_k or a_m has a frequency >0.35 and is the major allele in year t or year t+1 because any other amino acid at the site has a frequency <0.3 (= 1 − 0.7). The second condition checks whether the change in frequency from year t to t+1 was statistically significant. Thus, when the two conditions hold, a switch in the major amino acid at the site has occurred.

To test the second condition above, we used a 1-year contingency table. The values in the cells of the table are the numbers of the residues at that site in years t and t+1. To deal with small sample sizes, we used the one-tailed Fisher exact test to examine whether there is a positive association of the values in the 1-year contingency table. Moreover, because the numbers of sequences in some years were very small, we also used a 2-year contingency table for the test.

When a frequency switch between ja_k and ja_m is identified between t and t+1, we do not know whether the new dominant amino acid would be fixed or almost fixed in the population at a later time.
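The frequency definition and the two frequency-switch conditions can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the layout of the contingency table (rows are years t and t+1, columns are the two residues), and the significance level `alpha=0.05` are assumptions made for illustration; the text does not state the significance level used. The one-tailed Fisher exact test is computed directly from the hypergeometric distribution using only the standard library.

```python
from collections import Counter
from math import comb


def amino_acid_frequencies(residues):
    """Return f(t, j, a) for every residue a seen at one site j in one year t.

    `residues` holds one residue per sampled sequence, so len(residues) is N(t).
    """
    counts = Counter(residues)
    total = len(residues)
    return {aa: n / total for aa, n in counts.items()}


def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher exact P value for the 2 x 2 table [[a, b], [c, d]].

    Returns the probability that the top-left cell is >= a when all
    margins are held fixed (hypergeometric upper tail).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1))
    return tail / comb(n, col1)


def is_frequency_switch(n_t, n_t1, ak, am, majority=0.7, alpha=0.05):
    """Check conditions (i) and (ii) for residues ak and am at one site.

    n_t and n_t1 are Counters of residues at the site in years t and t+1.
    alpha is an assumed significance level, not a value from the paper.
    """
    total_t, total_t1 = sum(n_t.values()), sum(n_t1.values())
    # Condition (i): the two residues together exceed the majority threshold in both years.
    if (n_t[ak] + n_t[am]) / total_t <= majority:
        return False
    if (n_t1[ak] + n_t1[am]) / total_t1 <= majority:
        return False
    # Condition (ii), first part: the count difference changes sign between years.
    diff_t = n_t[ak] - n_t[am]
    diff_t1 = n_t1[ak] - n_t1[am]
    if diff_t * diff_t1 >= 0:
        return False
    # Condition (ii), second part: put the residue that was dominant in year t
    # in the top-left cell and test for a positive association.
    if diff_t > 0:
        table = (n_t[ak], n_t[am], n_t1[ak], n_t1[am])
    else:
        table = (n_t[am], n_t[ak], n_t1[am], n_t1[ak])
    return fisher_one_tailed(*table) < alpha
```

With hypothetical counts of nine sequences carrying residue A and one carrying B in year t, and the reverse in year t+1, `is_frequency_switch(Counter("AAAAAAAAAB"), Counter("ABBBBBBBBB"), "A", "B")` reports a switch, whereas a 6:4 split followed by a 4:6 split does not reach significance under the assumed `alpha`.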
Thus, we call a frequency switch from residue ja_k to ja_m an effective switch if the following condition holds: f(t+τ, j, a_m) < 0.99 for 0 ≤ τ ≤ T_f − 1, and f(t+T_f, j, a_m) ≥ 0.99, where T_f > 0. We say that site j underwent an effective switch from a_k to a_m at t and that a_m became fixed at t+T_f.

Transition Time. In population genetics, the time required for the fixation of an allele depends on the initial frequency of the allele, its selective advantage or disadvantage, and the size of the population (20, 24). Because we do not know the time the mutation occurred, we cannot calculate the precise fixation time of an amino acid substitution. Instead, we consider the transition time, which is defined as the time period from the first time the mutant amino acid was observed in the sample to the first time its frequency reached ≥99%. However, we add 1 year to this time because when the mutant was first observed in the sample its frequency was usually not low, owing to the small sample size. Moreover, in some cases, when a mutant amino acid was first observed, it was almost fixed in the sample. In this case, we assume that the transition time was 2 years; such cases occurred only before 1987. Because the yearly sample size before 1987 was <20, we calculate the transition times of the effective switches only from 1987 to 2005.
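The effective-switch condition and the transition-time definition given in Materials and Methods can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names and the example frequencies are hypothetical, and treating "almost fixed when first observed" as "frequency already at or above the 0.99 threshold in the first year observed" is an assumption.

```python
def effective_switch_fixation_year(freq_by_year, switch_year, threshold=0.99):
    """Return t + T_f for a switch at year t, or None if the switch is ineffective.

    freq_by_year maps year -> frequency of the new residue a_m at the site.
    The condition requires the frequency to be below the threshold at t itself
    and to reach it in some later year; the first such year is t + T_f.
    """
    if freq_by_year.get(switch_year, 0.0) >= threshold:
        return None
    for year in sorted(freq_by_year):
        if year > switch_year and freq_by_year[year] >= threshold:
            return year
    return None


def transition_time(freq_by_year, threshold=0.99):
    """Years from first observation of the mutant to frequency >= threshold, plus 1.

    Returns 2 when the mutant is already at or above the threshold in the first
    year it is observed (assumed reading of "almost fixed"), and None when the
    mutant is never observed or never reaches the threshold.
    """
    observed = sorted(year for year, f in freq_by_year.items() if f > 0)
    if not observed:
        return None
    first_seen = observed[0]
    fixed = [year for year in observed if freq_by_year[year] >= threshold]
    if not fixed:
        return None
    if fixed[0] == first_seen:
        return 2
    return fixed[0] - first_seen + 1


# Hypothetical frequencies of a mutant residue, for illustration only.
example_freq = {1990: 0.0, 1991: 0.10, 1992: 0.60, 1993: 0.995}
```

For the hypothetical series above, a switch detected between 1991 and 1992 gives `effective_switch_fixation_year(example_freq, 1991) == 1993` (so T_f = 2), and `transition_time(example_freq) == 3`, that is, 1993 − 1991 plus the 1-year correction.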
Acknowledgments

We thank Eddie Holmes, J. J. Emerson, Walter Fitch, and Feng-Chin Chen for suggestions. This work was supported by Taiwan Pandemic Influenza Vaccine Research and Development Program, Taiwan Centers for Disease Control; Thematic Project of Academia Sinica, Taiwan; the National Science Council of Taiwan; the Institute of Information Science and the Genomics Research Center, Academia Sinica, Taiwan; and the National Institutes of Health.

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0701396104/DC1.

References

1. Cox NJ, Subbarao K. Annu Rev Med. 2000;51:407–421.
2. Horimoto T, Kawaoka Y. Nat Rev Microbiol. 2005;3:591–600.
3. Hilleman MR. Vaccine. 2002;20:3068–3087.
4. Treanor J. N Engl J Med. 2004;350:218–220.
5. Hay AJ, Gregory V, Douglas AR, Lin YP. Philos Trans R Soc London Ser B. 2001;356:1861–1870.
6. Li WH, Wu CI, Luo CC. Mol Biol Evol. 1985;2:150–174.
7. Suzuki Y, Gojobori T. Mol Biol Evol. 1999;16:1315–1328.
8. Bush RM, Fitch WM, Bender CA, Cox NJ. Mol Biol Evol. 1999;16:1457–1465.
9. Fitch WM, Bush RM, Bender CA, Cox NJ. Proc Natl Acad Sci USA. 1997;94:7712–7718.
10. Suzuki Y. J Mol Evol. 2004;59:11–19.
11. Nielsen R, Yang Z. Genetics. 1998;148:929–936.
12. Yang Z, Nielsen R, Goldman N, Pedersen AM. Genetics. 2000;155:431–449.
13. Huelsenbeck JP, Dyer KA. J Mol Evol. 2004;58:661–672.
14. Yang Z, Swanson WJ. Mol Biol Evol. 2002;19:49–57.
15. Plotkin JB, Dushoff J. Proc Natl Acad Sci USA. 2003;100:7152–7157.
16. Wolf YI, Viboud C, Holmes EC, Koonin EV, Lipman DJ. Biol Direct. 2006;1:34.
17. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, Osterhaus AD, Fouchier RA. Science. 2004;305:371–376.
18. Koelle K, Cobey S, Grenfell B, Pascual M. Science. 2006;314:1898–1903.
19. Zanotto PM, Kallas EG, de Souza RF, Holmes EC. Genetics. 1999;153:1077–1089.
20. Li W-H. Molecular Evolution. Sunderland, MA: Sinauer; 1997.
21. Temoltzin-Palacios F, Thomas DB. J Exp Med. 1994;179:1719–1724.
22. Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y, St George K, Grenfell BT, Salzberg SL, Fraser CM, Lipman DJ, Taubenberger JK. PLoS Biol. 2005;3:e300.
23. Edgar RC. BMC Bioinformatics. 2004;5:113.
24. Nei M. Molecular Evolutionary Genetics. New York: Columbia Univ Press; 1987.