Copyright © 2006, Cold Spring Harbor Laboratory Press

The fate of laterally transferred genes: Life in the fast lane to adaptation or death

Department of Biology, McMaster University, Hamilton, Ontario, Canada L8S 4K1

1Corresponding author. E-mail Golding/at/McMaster.CA; fax (905) 522-6066.

Received September 28, 2005; Accepted February 21, 2006.

Freely available online through the Genome Research Open Access option.

Abstract

Large-scale genome arrangement plays an important role in bacterial genome evolution. A substantial number of genes can be inserted into, deleted from, or rearranged within genomes during evolution. Detecting or inferring gene insertions/deletions is of interest because such information provides insights into bacterial genome evolution and speciation. However, efficient inference of genome events is difficult because genome comparisons alone do not generally supply enough information to distinguish insertions, deletions, and other rearrangements. In this study, homologous genes from the complete genomes of 13 closely related bacteria were examined. The presence or absence of genes from each genome was cataloged, and a maximum likelihood method was used to infer insertion/deletion rates according to the phylogenetic history of the taxa. It was found that whole gene insertions/deletions in genomes occur at rates comparable to or greater than the rate of nucleotide substitution and that higher insertion/deletion rates are often inferred to be present at the tips of the phylogeny with lower rates on more ancient interior branches. Recently transferred genes are under faster and relaxed evolution compared with more ancient genes. Together, this implies that many of the lineage-specific insertions are lost quickly during evolution and that perhaps a few of the genes inserted by lateral transfer are niche specific.
Gene insertions and deletions, together with gene inversions and translocations, play important roles in shaping bacterial genomes (Itaya 1997; Brunder and Karch 2000; Tillier and Collins 2000; Liu et al. 2002; Kuwahara et al. 2004; Cerdeno-Tarraga et al. 2005), and gene insertions and deletions, in particular, are essential driving forces that influence gene content (Ochman and Jones 2000; Kunin and Ouzounis 2003; Mirkin et al. 2003). It is clear that a large number of insertions/deletions can be observed in many bacterial species (Mirkin et al. 2003; Hao and Golding 2004) even though they may be comparatively rare in some endosymbiotic bacteria (Silva et al. 2003). The portion of insertions and deletions in a genome, therefore, varies among different species (Garcia-Vallvé et al. 2000). Gene insertions and deletions can be inferred by examining the presence or absence of a gene (or a gene family) on a phylogenetic tree. In some recent studies, the parsimony method has been used to infer insertions/deletions (Daubin et al. 2003a, b; Mirkin et al. 2003; Hao and Golding 2004). Gene insertions have been distinguished as gene genesis (birth) or lateral gene transfers (LGT), and insertions/deletions have been tested with varying penalties for LGTs in different methodologies (Snel et al. 2002; Kunin and Ouzounis 2003; McLysaght et al. 2003). However, the inference of insertions/deletions is difficult because of the possibility of parallel deletions and insertions on multiple branches (Copley and Dhillon 2002; Snel et al. 2002; Stoebel 2005) and because of variable evolutionary rates of change on different branches (Hao and Golding 2004). Furthermore, the parsimony method is well known to underestimate the number of events in phylogeny reconstruction (Galtier and Boursot 2000; Dean et al. 2002; Felsenstein 2004). 
Likelihood analysis has been successfully used to reconstruct phylogenies using sequence data since its first application by Neyman (Neyman 1971; Felsenstein 1988, 1989, 2004; Gu 2001). Maximum likelihood analyses have also been applied to the study of genome content (Gu and Zhang 2004; Huson and Steel 2004), and the phyletic pattern of gene presence/absence has been used to reconstruct evolutionary history in a Markov analysis (Lake and Rivera 2004). In this study, a maximum likelihood method is used to infer insertion/deletion rates on the phylogeny of the Bacillaceae group of Gram-positive bacteria. An advantage of this group is the large number of genomes that have been completely sequenced. For the likelihood analysis, the insertion rate was assumed to be equal to the deletion rate on each branch, but insertion/deletion rates could vary among different branches or in different parts of the phylogeny. The results suggest that recently transferred genes are more common. If this is to be an evolutionarily stable situation, it suggests that many laterally transferred genes have a high propensity to be deleted quickly after transfer. The rates of insertion/deletion from the maximum likelihood analysis were compared to observed nucleotide substitution rates and found to be comparable or larger; the inferred rates increase at the tips of the phylogeny.

Results

The maximum likelihood analysis used the phylogeny of concatenated DNA sequences from the genes gmk, glpF, and pycA (Fig. 1).
The strains from Bacillus anthracis, Bacillus cereus, and Bacillus thuringiensis are closely related and have been suggested to form the B. cereus group (the Bc group) (Priest et al. 2004). Two different insertion/deletion rates were assumed on the two parts of the phylogeny (Case 2 in Fig. 2).

As can be observed in Figure 2, [...]

In addition to the above likelihood analysis, an analysis was performed with the assumption that genes cannot be regained after having been deleted. These results yield insertion/deletion rates that are similar to those under the assumption that genes can be regained after deletion (Table 2). However, in every case, the likelihood value is much lower. With a single constant insertion/deletion rate, the MLE is 0.48. In the case of two separate rates, the rate α is 4.48 among the Bc group and the rate β is 0.33. When the rate on the branch leading to the Bc group was also considered distinct, the rate γ was estimated at 1.08 and the rate β as 0.28. Both rates β and γ are much smaller than the rate α at 3.90. Finally, the rate on internal branches in the Bc group was separated from that on external branches in the likelihood estimation (boxed portion of the phylogeny in Fig. 2).

To examine the most closely related taxa, five strains from the Bc group were analyzed separately: B. anthracis Ames (Ba1), B. anthracis Ames “ancestor” (Ba2), B. anthracis Sterne (Ba3), B. thuringiensis (Bt), and B. cereus ZK (Bc1). The comparison of homologs shows that >96% of the genes present in all five strains share at least 90% sequence identity with each other in their protein sequences. Therefore, the substitutions between homologs among these five strains should be considered as relatively limited. All phyletic patterns of these five strains are shown in Table 4. Of the 5076 gene families, there are only 3956 present in all five strains.
Hence, 22.1% of the genes are not shared by all five strains, even though these five strains are believed to represent one species (Helgason et al. 2000) and to have diverged very recently.
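The 22.1% figure follows directly from the two counts quoted above. As a minimal check, using only those two numbers from the text (the variable names are illustrative):

```python
# Counts quoted in the text for the five Bc-group strains.
total_families = 5076   # gene families observed among the five strains
shared_by_all = 3956    # families present in all five strains

# Families missing from at least one of the five strains.
not_shared = total_families - shared_by_all
fraction_not_shared = not_shared / total_families

print(f"{not_shared} of {total_families} families "
      f"({fraction_not_shared:.1%}) are not shared by all five strains")
```

That is, 1120 of the 5076 gene families are absent from at least one strain.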
To determine the rates of evolution in the recently transferred genes, the tree lengths for the Bc-group-specific genes were measured (Fig. 4A).
The rates of nonsynonymous (Ka) and synonymous (Ks) substitutions were estimated for the genes present only within the Bc group and compared to genes that are more broadly distributed within the Bacillaceae group. Again, only changes that have occurred within the Bc group are measured with the genes categorized by their breadth of distribution. Both the Ks and Ka rates are elevated in the Bc group (Supplemental material), but the Ka values are most strongly affected. The Ka/Ks ratios for genes limited to the Bc group are shown in Figure 5A.
Discussion

To determine the patterns of LGT, it is useful to examine closely related but fully sequenced genomes. A complete genome sequence is necessary to eliminate the possibility of a hidden paralog or of a genome rearrangement masking a homolog. Closely related taxa help to determine the number of genes that might have been laterally transferred. To this end, we have examined the gene content from 13 completely sequenced genomes from the Bacillaceae group. The results demonstrate that LGT occurs rapidly and extensively between strains of the same species.

A phylogeny was constructed to measure the rate of LGT relative to nucleotide substitutions. The concatenated DNA sequences of gmk, glpF, and pycA genes rather than ribosomal RNA sequences were used to reconstruct the phylogeny in this study. It is difficult to reconstruct the phylogenetic relationship within the Bc group owing to their remarkably similar rRNA sequences (Ash et al. 1991) and the divergence of rRNA sequences in genomes with multiple rrn operons (Klappenbach et al. 2001; Acinas et al. 2004). The gmk, glpF, pycA, tpi, ilvD, pta, and pur genes have been studied in the past as tools for reconstructing the evolutionary history of the B. cereus group. It was found that gmk, glpF, pycA, and tpi genes strongly conform to the concatenated tree of all seven genes (Priest et al. 2004). (Note that the topological position of Oceanobacillus iheyensis in the concatenated phylogeny is different from that in a 16S rRNA-based phylogeny [Hao and Golding 2004].)

When maximum likelihood estimates of the rates of insertion/deletion are mapped onto this phylogeny, they suggest that more genes are coming in and going out at the tips of the phylogeny. This is clear even from the table of gene presence/absence (Table 1), where differences in gene content between taxa that are considered a single species are among the most common patterns observed.
If this is an evolutionarily stable situation, then most of the laterally transferred genes must be lost shortly after their insertion during evolution.

Genome annotation can be an error-prone task (Kyrpides and Ouzounis 1999). As a result, all of the predicted ORFs that are present in only one genome and that do not have homologs detectable by BLAST were removed from this study. Since many of these may be proper and functional genes (Siew and Fischer 2003, 2004), this method tends to further underestimate the events on external branches. Not surprisingly, a maximum likelihood estimation including the uniquely present ORFs further inflates the rates at the tips of the phylogeny (data not shown).

It has been suggested that B. anthracis, B. cereus, and B. thuringiensis are one species (Helgason et al. 2000). A close evolutionary relationship among these strains is inferred by comparing substitutions in the concatenated sequences (Fig. 1).

In the maximum likelihood estimation, the insertion rate is assumed to be equal to the deletion rate. This assumption was made to ensure that in the long term, genome sizes would not tend to zero or infinity. In the short term, this assumption is unlikely to be correct, and Thompson et al. (2005) have shown that even within closely related bacterial populations, the genome content may change. The model of gene insertion/deletion can be improved by assuming unequal rates of insertion/deletion, fixed numbers of genes that are not deleted, variation in indel rate among genes, and so on. But the increased rate at the tips of the phylogeny is not likely to be an artifact of genome size variation or the limitations of the likelihood model. Firstly, the members of the Bc group have larger genome sizes than the non-Bc group taxa, but, while the branch leading to the Bc group shows a higher indel rate, that rate is still much smaller than the rate within the Bc group.
Secondly, the seven members of the Bc group have similar genome sizes, and an estimation using only the members of the Bc group again shows (Table 3) higher indel rates at the tips of the phylogeny. Genome size variation is well known as a problem in genome content phylogeny reconstruction, but it has been noted that phylogeny, rather than phenotype or LGT, is the major quantitative determinant of gene content (Snel et al. 1999).

The more recently transferred genes have longer tree length (with P < 0.001 in a Wilcoxon rank test) (Fig. 4).

This study demonstrates that more recently transferred genes are under relaxed and faster evolution compared with the genes that have had a longer residence time. There are several possible reasons for this. It is possible that genes with a higher rate of evolution are more prone to being laterally transferred. This is unlikely, as genes with a slightly longer residence time should not then show the observed reduced rates of evolution. It has also been suggested that genes inserted into a new host will undergo amelioration of their sequence (Lawrence and Ochman 1997). In this process, the codon and base content bias of a new gene will mutate to more closely resemble the inherent bias within the new host. Amelioration is therefore a plausible explanation for the higher rates of evolution in recently transferred genes compared to the completely ameliorated/native genes (Fig. 4).

Alternatively, the genes that have been recently transferred might be adapting to a new and local environment found in the new host. In this regard, it should be noted that several genes have a very large Ka/Ks ratio, suggesting directional selection. The recently transferred genes might also be evolving quickly as they are not required in their new hosts, offer minimal selective advantage, and could be in the process of being lost. This is in accord with the observation that genes come and go rapidly within closely related genomes.
It is also in concordance with the very high tree lengths in Figure 4.

Methods

To gain a better concept of genome evolution in closely related bacteria, a group of bacteria with an abundance of completely sequenced congeneric species was selected. Thirteen complete Bacillaceae genome sequences were obtained from NCBI (http://www.ncbi.nlm.nih.gov/) to carry out the analysis. They are B. anthracis Ames, B. anthracis “Ames ancestor,” B. anthracis Sterne, B. thuringiensis, B. cereus ZK, B. cereus ATCC 10987, B. cereus ATCC 14579, Geobacillus kaustophilus, B. licheniformis, B. subtilis, B. clausii, B. halodurans, and Oceanobacillus iheyensis. It has been argued that B. anthracis, B. cereus, and B. thuringiensis might be one species (Helgason et al. 2000) as the ribosomal RNA sequences from these strains are remarkably similar (Ash et al. 1991). Therefore, the seven strains from B. anthracis, B. cereus, and B. thuringiensis are hereafter referred to as the B. cereus group (the Bc group) as suggested by Priest et al. (2004).

The evolutionary history of the Bc group has been reconstructed using the nucleotide sequences of the gmk, glpF, and pycA genes (Priest et al. 2004) because the rRNA sequences are too similar to provide reliable evolutionary relationships. In this study, the concatenated DNA sequences of these three genes from each genome were used to reconstruct a phylogeny using MrBayes (Huelsenbeck and Ronquist 2001) (200,000 generations sampled every 100 generations with a gamma distribution model and invariant class).

The method to identify members of a gene family has been described in Hao and Golding (2004). In short, potential homologs were measured according to sequence similarities, and all paralogs in each genome were clustered as a single gene family and only one member was retained for further analysis.
Non-annotated proteins in a genome were identified by carrying out a TBLASTN search against the DNA sequence of each genome and using all annotated proteins from other Bacillaceae genomes as query sequences. Potential taxon-specific genes were determined by searching against all completed bacterial genomes. (The GenBank accession numbers are given as Supplemental information at http://evol.biology.mcmaster.ca/~weilong/likelihood.) The phyletic patterns (gene presence or absence in each genome) of all genes were used for the maximum likelihood analysis.

The gene families present in the Bc group were used to conduct tree length and Ka/Ks ratio (ω) analyses using the PAML package (Yang 1997). The tree length was calculated as the sum of the branch lengths for only the taxa within the Bc group, using the maximum likelihood method from the PAML package. The tree length gives the expected number of substitutions per site along all branches in the phylogeny. Genes were categorized into four groups based on their presence/absence in different taxa (and hence on the inferred time period when the genes were transferred). The four groups are characterized by genes present only in the Bc group; genes present in Bc, Gk, Bl, and Bs; genes present in Bc, Gk, Bl, Bs, Bk, and Bh; and genes present in all 13 taxa. A single Ka/Ks ratio was assumed throughout the length of each sequence in this study. To avoid the effects of duplication during evolution (Gu et al. 2002; Zhang et al. 2003), any paralogs of gene families were excluded from the tree length and Ka/Ks ratio analyses. Protein sequences and their corresponding DNA sequences were extracted from the annotated genomes. Protein sequences were aligned using ClustalW (Thompson et al. 1994), and nucleotide sequence alignments were created from the protein alignments by replacing each amino acid with its corresponding codon.

To evaluate the likelihood of the observed phyletic patterns, a simple model of gene evolution was chosen.
This model assumes that individual genes are inserted or deleted at constant rates. This model does not include a consideration of increasing or decreasing numbers of genes but, rather, a constant number of gene places that may or may not be occupied at any one time. All events are assumed to be independent. Let ν be the rate of gene insertion, and let μ be the rate of gene deletion. Let t be the length of time that separates a taxon from its ancestor. Let P indicate the gene presence, and let A indicate its absence. Then the probability of gene presence or absence in the descendant taxon (d) can be calculated given knowledge of the state in the ancestral taxon (a).
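Under the assumptions just stated (two states, constant and independent insertion and deletion), the transition probabilities take the standard form for a two-state continuous-time Markov chain. The expressions below are a sketch derived from those assumptions, not a transcription of the published equations; the subscripts a and d mark the ancestral and descendant states.

```latex
\begin{align}
\Pr(P_d \mid P_a) &= \frac{\nu}{\nu+\mu} + \frac{\mu}{\nu+\mu}\, e^{-(\nu+\mu)t} \\
\Pr(A_d \mid P_a) &= \frac{\mu}{\nu+\mu} - \frac{\mu}{\nu+\mu}\, e^{-(\nu+\mu)t} \\
\Pr(P_d \mid A_a) &= \frac{\nu}{\nu+\mu} - \frac{\nu}{\nu+\mu}\, e^{-(\nu+\mu)t} \\
\Pr(A_d \mid A_a) &= \frac{\mu}{\nu+\mu} + \frac{\nu}{\nu+\mu}\, e^{-(\nu+\mu)t}
\end{align}
```

With the equal-rate assumption used in this study (ν = μ), these reduce to a probability of 1/2 + (1/2)e^{-2μt} that the state is unchanged along a branch and 1/2 − (1/2)e^{-2μt} that it has changed.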
All observed patterns were used to calculate the overall likelihood at the last common ancestral node by multiplying individual likelihoods together. The overall likelihood for a total of n patterns is therefore the product of the n individual pattern likelihoods.
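The calculation can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes equal insertion and deletion rates (as in the study), a single rate on every branch, a root prior of 1/2 for presence and absence (the stationary distribution under equal rates), and the standard pruning recursion for the per-pattern likelihood. The toy tree, branch lengths, taxon names, and patterns are hypothetical.

```python
import math

def transition(parent_state, child_state, rate, t):
    """Two-state transition probability with equal insertion and deletion rates."""
    same = 0.5 + 0.5 * math.exp(-2.0 * rate * t)
    return same if parent_state == child_state else 1.0 - same

def conditional(node, pattern, rate):
    """Pruning recursion: return [L(absent), L(present)] for the subtree at node."""
    if isinstance(node, str):                 # leaf: the observed state is known
        observed = pattern[node]
        return [1.0 if state == observed else 0.0 for state in (0, 1)]
    result = [1.0, 1.0]
    for child, t in node:                     # internal node: list of (child, branch length)
        child_like = conditional(child, pattern, rate)
        for state in (0, 1):
            result[state] *= sum(
                transition(state, c, rate, t) * child_like[c] for c in (0, 1)
            )
    return result

def log_likelihood(tree, patterns, rate):
    """Sum of per-pattern log-likelihoods, i.e. the log of their product."""
    total = 0.0
    for pattern in patterns:
        absent, present = conditional(tree, pattern, rate)
        total += math.log(0.5 * absent + 0.5 * present)   # assumed root prior (1/2, 1/2)
    return total

# Hypothetical toy data: three taxa, four phyletic patterns (1 = present, 0 = absent).
tree = [([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.3)]
patterns = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 0, "C": 0},
    {"A": 0, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 0},
]

# Crude grid search for the rate that maximizes the likelihood.
rates = [0.05 * k for k in range(1, 201)]
best_rate = max(rates, key=lambda r: log_likelihood(tree, patterns, r))
print(f"approximate MLE of the insertion/deletion rate: {best_rate:.2f}")
```

Working with log-likelihoods avoids numerical underflow when thousands of patterns are multiplied together. Allowing separate rates on different parts of the tree, as in the cases compared in the Results, would amount to passing a branch-specific rate to `transition`.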
Acknowledgments

This work was supported by an NSERC grant to G.B.G. The authors wish to thank R. Morton for his suggestions on earlier versions of this manuscript and to thank the reviewers for their helpful suggestions.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article is online at http://www.genome.org/cgi/doi/10.1101/gr.4746406.