Copyright © 2004, American Society for Microbiology

Genome-Wide Molecular Clock and Horizontal Gene Transfer in Bacterial Evolution

Department of Bioengineering and Bioinformatics, Moscow State University,1 Institute for Problems of Information Transmission RAS,4 State Scientific Center GosNIIGenetika, Moscow, Russia,5 Department of Pathology, F. E. Hebert School of Medicine, Uniformed Services University of the Health Sciences,2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland3

*Corresponding author. Mailing address: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894. Phone: (301) 435-5913. Fax: (301) 435-7794. E-mail: koonin@ncbi.nlm.nih.gov.

Received April 7, 2004; Accepted June 28, 2004.

Abstract

We describe a simple theoretical framework for identifying orthologous sets of genes that deviate from a clock-like model of evolution. The approach used is based on comparing the evolutionary distances within a set of orthologs to a standard intergenomic distance, which was defined as the median of the distribution of the distances between all one-to-one orthologs. Under the clock-like model, the points on a plot of intergenic distances versus intergenomic distances are expected to fit a straight line. A statistical technique to identify significant deviations from the clock-like behavior is described. For several hundred analyzed orthologous sets representing three well-defined bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and the Bacillus-Clostridium group, the clock-like null hypothesis could not be rejected for ~70% of the sets, whereas the rest showed substantial anomalies.
Subsequent detailed phylogenetic analysis of the genes with the strongest deviations indicated that over one-half of these genes probably underwent a distinct form of horizontal gene transfer, xenologous gene displacement, in which a gene is displaced by an ortholog from a different lineage. The remaining deviations from the clock-like model could be explained by lineage-specific acceleration of evolution. The results indicate that although xenologous gene displacement is a major force in bacterial evolution, a significant majority of orthologous gene sets in three major bacterial lineages evolved in accordance with the clock-like model. The approach described here allows rapid detection of deviations from this mode of evolution on the genome scale.

The classical molecular clock concept holds that the sequence of a given gene (protein) evolves at a constant rate as long as its biological function remains unchanged (5, 26, 56, 57). The actual evolutionary rates for the entire set of genes in a genome show a broad distribution, with the rates for the slowest genes differing from the rates for the fastest genes by at least 2 orders of magnitude (18, 21, 26, 28). However, the molecular clock hypothesis holds that, within the same set of orthologs (i.e., homologous genes related via vertical inheritance) (16, 44), the rate does not change, within some inevitable dispersion. The theoretical foundation of the molecular clock concept is the neutral theory of molecular evolution, which holds that, at least when no functional changes occur, the great majority of fixed substitutions in nucleotide and protein sequences are neutral, their rate being determined by the fixed, gene-specific functional constraints (13, 26, 39).
The modern reincarnation of neutralism, the near-neutral theory, predicts a greater dispersion of the molecular clock because the probability of fixation of slightly deleterious mutations critically depends on the effective population size, which is prone to major fluctuations (13, 39). Numerous tests of the molecular clock revealed both a general pattern conforming to the theory and numerous violations. It has been repeatedly shown that the molecular clock is overdispersed; i.e., the variation of evolutionary rates within a set of orthologs is greater than that predicted by a null model based on the Poisson distribution (8, 18, 19). Deviations from the molecular clock are thought to result from lineage-specific acceleration of evolution, which could be due either to functional changes entailing relaxation of purifying selection or positive selection or to increased mutational pressure caused, at least in part, by effective population size effects (5). These phenomena cause overdispersion of the molecular clock, which is manifested in unequal lengths of tree branches coming out of the same node, under the assumption that the topology of the phylogenetic tree for a given set of orthologs is known. Typically, the same species tree topology is assumed for all genes. This approach is likely to be valid for the multicellular eukaryotes, which, historically, have been the objects of the analyses that led to the molecular clock concept. However, recent comparative genomic studies strongly suggest that in addition to the regular pattern of vertical inheritance, evolution of prokaryotic genomes is dramatically affected by horizontal gene transfer (HGT) (6, 11, 12, 27, 30, 32, 38, 55). 
On many occasions, HGT seems to occur between evolutionarily distant organisms, although it has been argued that there could be a decreasing gradient of the HGT rate from closely related species to distantly related species; the existence of such a gradient could be one of the reasons why a species tree can be constructed at all, in spite of extensive HGT (20). There seem to be certain connections between the amount of HGT and the biology of prokaryotic genes. In particular, the so-called complexity hypothesis holds that HGT is much less common among genes that encode subunits of macromolecular complexes, such as those involved in translation, transcription, and replication, than in genes coding for metabolic enzymes (25). While this prediction might hold statistically, subsequent studies have shown that there are very few, if any, genes that are completely refractory to HGT. In particular, evidence of HGT has been obtained for several ribosomal proteins, translation factors, and the major RNA polymerase subunits (3, 4, 24, 34). From a comparative genomic perspective, HGT events have been classified into three categories: (i) acquisition of genes that are novel to a given phylogenetic lineage; (ii) acquisition of paralogs of genes preexisting in the given lineage; and (iii) xenologous gene displacement (XGD), in which the original gene from a given set of orthologs is displaced by a member of the same set of orthologs from a different lineage (30). Obviously, if an HGT event, particularly XGD, goes unnoticed in the course of phylogenetic analysis, an apparent gross violation of this molecular clock will be seen when evolutionary rates are measured in the affected set of orthologous genes on the basis of the assumed species tree topology. 
HGT is detected through anomalies in the topology of phylogenetic trees of individual sets of orthologs or by so-called surrogate approaches, which, in the case of HGT between distant species, are based primarily on the phyletic distribution of the homologs of a given gene (7, 30, 41). Simply put, unexpected phyletic patterns (e.g., the presence of orthologs of a given gene in all or nearly all sequenced bacterial genomes but in only one archaeon) suggest that there has been HGT (in this case from a bacterium to the archaeon) (27, 30). These patterns can be expressed either in terms of presence-absence only or, more quantitatively, by comparing the significance levels of taxon-specific best hits. The general validity of this approach seems to be supported by the biological plausibility of some of the trends in the apparent horizontal gene fluxes that were detected by phyletic pattern analysis. Thus, hyperthermophilic bacteria showed a clear preponderance of genes of possible archaeal origin compared to mesophiles (2, 29, 36), and probable gene transfer from eukaryotic hosts to some bacterial pathogens also has been inferred (17, 40). In principle, phylogenetic tree analysis is supposed to be a more precise indicator of probable HGT events than similarity-based surrogate methods because of inaccuracies in the latter resulting from the lack of exact correspondence between sequence similarity and phylogenetic affinity (31). A genome-wide phylogenetic analysis, aimed specifically at detection of horizontally transferred genes, has been described (43). However, it is well known that phylogenetic analysis is fraught with its own slew of artifacts, such as long branch attraction, particularly when fast methods, such as minimum evolution or neighbor joining, are employed (14).
In addition, phylogenetic analysis can be prohibitively expensive computationally when it is attempted on the genome scale, especially with powerful methods, such as complete maximum-likelihood analysis, and large sets of species. Therefore, surrogate, similarity-based methods have proved to be extremely useful, at least as a rapid, first-tier strategy that allows workers to delineate a set of HGT candidates.

We were interested in investigating a surrogate approach to genome-wide study of prokaryotic evolution, which combines a test of the validity and an analysis of the rate distribution of the molecular clock with detection of potential HGT events and lineage-specific acceleration of evolution. Using the Clusters of Orthologous Groups (COGs) database for proteins (46, 47), we analyzed the molecular clock behavior of COGs from three major bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and the low-G+C-content gram-positive bacteria. We found that clock-like evolution was dominant in all three groups, but we also detected many anomalies, some of which are best explained by XGD.

MATERIALS AND METHODS

Sequences. Three unequivocally identified and well-characterized bacterial lineages, the γ-Proteobacteria, the α-Proteobacteria, and the Bacillus-Clostridium group of gram-positive bacteria, were selected for the present study. The γ-proteobacterial set included six species: Escherichia coli K-12, Haemophilus influenzae, Pasteurella multocida, Salmonella enterica serovar Typhimurium LT2, Vibrio cholerae, and Yersinia pestis. The α-proteobacterial set included seven species: Agrobacterium tumefaciens C58 Cereon, Brucella melitensis, Caulobacter crescentus CB15, Mesorhizobium loti, Rickettsia conorii, Rickettsia prowazekii, and Sinorhizobium meliloti.
The Bacillus-Clostridium set included eight species: Bacillus halodurans, Bacillus subtilis, Clostridium acetobutylicum, Listeria innocua, Lactococcus lactis, Staphylococcus aureus N315, Streptococcus pneumoniae TIGR4, and Streptococcus pyogenes M1 GAS. For each of these species sets, we identified a set of COGs (46, 48) in which each of the relevant species was represented by exactly one protein (i.e., all species present, no paralogs). Additionally, the constituent proteins were required to have a sufficient alignable length (at least 60 amino acids in conserved blocks [see below]). This search resulted in 563 COGs for the γ-proteobacterial set, 274 COGs for the α-proteobacterial set, and 234 COGs for the Bacillus-Clostridium set; the overlap for the three sets comprised 114 COGs. Alignments. Multiple alignments of sequence families within the bacterial groups were produced by using the MAP program (23). Sequence families involving wider taxonomic sampling of proteins were aligned by using the T-Coffee program (37). Multiple alignments were aggressively filtered for potential incorrectly aligned positions; only conserved blocks with no gaps containing 10 or more positions were retained for further analysis (53). Evolutionary distances between genes and genomes and phylogenetic trees. Maximum-likelihood distances between individual protein sequences were computed for each of the COGs analyzed by using the PAML package, with the JTT substitution model corrected for observed amino acid frequencies and the α parameter of γ-distribution of intraprotein evolutionary rate variability set to 1.0; this estimate includes a correction for possible multiple substitutions in the same site (54). Additionally, multiple alignments consisting of 21 sequences each were produced for the 114 COGs in which all three groups were represented, and pairwise distances were similarly computed from this alignment. 
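The alignment filtering described above (retain only gap-free blocks of 10 or more positions, and require at least 60 retained positions) could be sketched roughly as follows. This is an illustrative sketch only: the function names `gap_free_blocks` and `filter_alignment` are hypothetical, the alignment is assumed to be a list of equal-length strings with `-` as the gap character, and no conservation criterion beyond the absence of gaps is applied, so it may be less strict than the procedure actually used (53).

```python
def gap_free_blocks(alignment, min_len=10, gap="-"):
    """Return (start, end) column ranges (end exclusive) of gap-free runs
    that span at least min_len alignment columns."""
    if not alignment:
        return []
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    blocks, start = [], None
    for col in range(n_cols + 1):
        # The extra iteration past the last column closes any open run.
        clean = col < n_cols and all(row[col] != gap for row in alignment)
        if clean and start is None:
            start = col
        elif not clean and start is not None:
            if col - start >= min_len:
                blocks.append((start, col))
            start = None
    return blocks


def filter_alignment(alignment, min_len=10, min_total=60):
    """Concatenate the retained blocks of every sequence; return None when
    fewer than min_total columns survive the filter."""
    blocks = gap_free_blocks(alignment, min_len)
    kept = ["".join(row[s:e] for s, e in blocks) for row in alignment]
    if not kept or len(kept[0]) < min_total:
        return None
    return kept
```

A COG whose filtered alignment comes back as `None` would simply be dropped from the analysis under this sketch.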
Phylogenetic trees for individual COGs were constructed by using the maximum-likelihood method implemented in the Tree-Puzzle package, with the expected likelihood weight determined for the tree topology involving split versus topologies in which the respective lineage remained monophyletic (42). To calculate intergenomic evolutionary distances, all intergenic distances obtained from the same pair of genomes in the set of 114 conserved COGs were pooled, and the median of the distribution of these distances was taken to represent the intergenomic distance (21, 52). Neighbor-joining and least-squares trees were reconstructed from the pairwise genome distance matrices by using the programs NEIGHBOR and FITCH of the PHYLIP package, respectively (15). Limited tree analysis. A comprehensive phylogenetic analysis of a protein family, even if it is limited to a set of orthologs from completely sequenced genomes, is rarely feasible. There are several reasons that usually preclude this type of analysis, as follows: in many cases, the sequences of orthologs from the most distant lineages, such as bacteria and archaea, are only weakly similar, which complicates the construction of an alignment suitable for building a phylogenetic tree; the large number of sequences hampers the use of advanced methods for reconstruction of the phylogeny; and the presence of in-paralogs of various ages hinders the interpretation of results. Therefore, we elected to perform a limited tree analysis for the cases where the split distance analysis indicated a likely split of a set of orthologs from a particular bacterial lineage into two subsets with distinct evolutionary histories; these groups are referred to below as left ({L}) and right ({R}) sets. For each sequence from the given COG, global pairwise alignment scores (calculated by using the ALIGN program with default parameters) (35) for alignments with the other COG members were obtained (sequences from closely related species were excluded). 
Members of the COG not belonging to either {L} or {R} (i.e., the sequences of orthologs outside the lineage analyzed) were arranged into two ordered lists, <HL> and <HR>, according to their similarity scores against {L} and {R}, respectively. Multiple alignments that included {L}, {R}, and the top five sequences from <HL> and <HR> were constructed by using the T-Coffee program (37). Maximum-likelihood trees were constructed by using the ProtML program of the MOLPHY package (1). For all internal branches separating the {L} and {R} subsets (see Fig. 4),
RESULTS AND DISCUSSION

Theory. (i) Molecular clock and deviations from clock-like evolution. The approach developed here is based on comparing the evolutionary distances between genes (proteins) in a given COG to the distances between the corresponding genomes. There is no single correct way to calculate the intergenomic distances. In principle, a reference gene believed to be a good clock, such as rRNA, or a set of genes believed to evolve under the same model, such as genes for ribosomal proteins (3, 22, 51), can be employed for this calculation. We chose one of the genome-wide approaches to evolutionary distance determination, in which the median of the distribution of the distances between orthologs from a given pair of genomes is used as a proxy for the intergenomic distance (51, 52) (see Materials and Methods). Under a perfect molecular clock and strict vertical inheritance, the evolutionary rates of all genes differ from each other only by proportionality constants, so the distance between any pair of genes (dAB) is proportional to the distance between genomes A and B (DAB):

$$d_{AB} = v D_{AB}$$

where v is the gene-specific proportionality constant (the relative evolutionary rate).
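As a minimal sketch of the two quantities introduced here, the code below pools the intergenic distances for each genome pair across COGs, takes the median as the intergenomic distance, and evaluates the clock-like expectation that an intergenic distance equals a gene-specific constant v times the intergenomic distance. The data layout (a dictionary of per-COG distance tables) and the function names are assumptions made for illustration, not part of the original analysis.

```python
from statistics import median


def intergenomic_distances(cog_distances):
    """Median intergenic distance per genome pair, pooled over COGs.

    cog_distances maps a COG identifier to a dict that maps a genome pair
    (a, b) to the intergenic distance measured in that COG.
    """
    pooled = {}
    for per_pair in cog_distances.values():
        for pair, d in per_pair.items():
            key = tuple(sorted(pair))  # treat (a, b) and (b, a) as the same pair
            pooled.setdefault(key, []).append(d)
    return {pair: median(values) for pair, values in pooled.items()}


def clock_prediction(v, genome_distance):
    """Intergenic distance expected under a perfect clock with relative rate v."""
    return v * genome_distance
```

With three COGs reporting distances 0.1, 0.3, and 0.2 for one genome pair, the intergenomic distance under this sketch is 0.2, and a gene with v = 2 would be expected to show an intergenic distance of 0.4 for that pair.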
Various events that could occur during evolution could distort this idealized picture. Accelerated evolution on one terminal branch of the tree would increase all distances involving the corresponding species; as a result, the points corresponding to this species on the scatter plot are located above the main line. Similar acceleration on an internal branch of the tree would result in an upward shift of the points corresponding to two or more species. We were particularly interested in cases of HGT of a gene encoding a protein of a given COG from an outside source, resulting in displacement of the original gene characteristic of the given lineage (i.e., XGD). In these cases, the evolutionary distance between the transferred and nontransferred sequences is determined by the distance between the transfer source and the recipient lineage (D*), rather than by the intragroup distances. The scatter plot is expected to consist of two distinct sets of points: (i) points corresponding to nontransferred genes that fit a line with the intercept at 0 and the slope v, like those obtained with the clock model, and (ii) points corresponding to the transferred gene (i.e., the distances between this gene and all other members of the COG; these points fit a horizontal line with the intercept corresponding to the distance to the transfer source) (Fig. 1B).

(ii) Statistical model. The accumulation of substitutions in protein sequences was treated as a Poisson stochastic process. The number of substitutions in a given gene, accumulated over a given time interval, is a Poisson-distributed variable with the variance proportional to the expected value:
(iii) Statistical analysis: vertical inheritance. Let us consider a set of N genomes {G}, each with a single ortholog in a given COG. One can measure the distance between orthologs in the given COG (dIJ) and the distance between the genomes (DIJ) for all N′ = N(N − 1)/2 pairs ([I,J]) from {G}. Minimizing the square error over all such pairs,
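The fitting equations for this step are not reproduced in the text. As a rough illustration only, the sketch below uses the ordinary (unweighted) least-squares estimate of a line through the origin, which is a standard result; whether the original analysis weighted the residuals (for example, to reflect the Poisson variance model) is not known from this text, so the estimator and the function name `fit_clock_slope` should be read as assumptions.

```python
def fit_clock_slope(pairs):
    """Least-squares fit of d = v * D through the origin.

    pairs is a list of (D, d) tuples: intergenomic and intergenic distance
    for each genome pair. Returns (v, sse), where sse is the sum of squared
    residuals; converting sse into a variance (choice of degrees of freedom)
    is left to the caller.
    """
    sxx = sum(big_d * big_d for big_d, _ in pairs)
    if sxx == 0:
        raise ValueError("at least one nonzero intergenomic distance is required")
    v = sum(big_d * d for big_d, d in pairs) / sxx
    sse = sum((d - v * big_d) ** 2 for big_d, d in pairs)
    return v, sse
```

For N genomes the input would contain N(N − 1)/2 pairs, one per genome pair.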
(iv) Statistical analysis: HGT. If genes in one of the lineages in a COG were acquired via HGT from a distant source, all intergenic distances in this COG fall into two groups: those that reflect vertical inheritance (DAD and DBC in Fig. 1B)
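One way to picture the two-group fit is sketched below. Pairs on the same side of a candidate split are fitted with a line through the origin (slope v), pairs that span the split are fitted with a horizontal line at D*, and the split with the smallest total squared error is kept. The unweighted least-squares estimators, the data layout, and the names `fit_split_model` and `best_split` are illustrative assumptions; the relative transfer distance follows the definition given in the text (D* divided by the largest cross-group intergenomic distance). Candidate splits are supplied by the caller, for instance one per branch of the species tree.

```python
from statistics import mean


def fit_split_model(gene_d, genome_d, subset):
    """Fit the two-mode model for one candidate split.

    gene_d and genome_d map a genome pair (a, b) to the intergenic and
    intergenomic distance; subset holds the genomes on one side of the split.
    Returns (v, d_star, d_t, sse).
    """
    within, cross = [], []
    for (a, b), d in gene_d.items():
        big_d = genome_d[(a, b)]
        if (a in subset) != (b in subset):
            cross.append((big_d, d))   # pair spans the split
        else:
            within.append((big_d, d))  # both genomes on the same side
    sxx = sum(big_d * big_d for big_d, _ in within)
    if not cross or sxx == 0:
        raise ValueError("split must leave usable within- and cross-group pairs")
    v = sum(big_d * d for big_d, d in within) / sxx
    d_star = mean(d for _, d in cross)            # level of the horizontal line
    d_t = d_star / max(big_d for big_d, _ in cross)
    sse = (sum((d - v * big_d) ** 2 for big_d, d in within)
           + sum((d - d_star) ** 2 for _, d in cross))
    return v, d_star, d_t, sse


def best_split(gene_d, genome_d, candidate_subsets):
    """Return ((v, d_star, d_t, sse), subset) for the split with minimal sse."""
    fits = [(fit_split_model(gene_d, genome_d, s), s) for s in candidate_subsets]
    return min(fits, key=lambda item: item[0][3])
```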
(v) Statistical analysis: baseline noise. If the pattern of a gene's inheritance is completely disjointed from the pattern of intergenomic relationships, neither equation 3a nor equation 3b adequately describes the relationships between the intergenomic and intergenic distances. In the absence of a clear dependence between these variables, the scatter plot for dAB versus DAB represents random scatter of points, and the following simple equation applies:

$$d_{IJ} = \bar{d} + e_{IJ}$$

where $\bar{d}$ is simply the mean intergenic distance over all pairs. The baseline variance of eIJ is:
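The expression for the baseline variance is not reproduced in the text. A plausible reading, offered here only as an assumption, is the ordinary sample variance of the intergenic distances around their mean, with N′ − 1 in the denominator; that choice matches the N′ − 1 degrees of freedom quoted for this variance in the Fisher's test described later. The function name `baseline_variance` is hypothetical.

```python
from statistics import mean


def baseline_variance(intergenic):
    """Mean intergenic distance and the variance of the deviations from it.

    intergenic is an iterable of intergenic distances, one per genome pair.
    The (n - 1) denominator is an assumption, not taken from the source.
    """
    values = list(intergenic)
    n = len(values)
    if n < 2:
        raise ValueError("at least two genome pairs are required")
    m = mean(values)
    return m, sum((d - m) ** 2 for d in values) / (n - 1)
```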
(vi) Statistical analysis of COG evolution. Each COG was analyzed with the three models described above.

(a) Noisy data, no clock-like evolution (equation 6a). The baseline variance of intergenic evolutionary distances ($u_N^2$) was calculated by using equation 6b.

(b) Simple molecular clock (equation 3a).

(c) Single significant deviation from molecular clock (equations 3a and 3b). The genomes were partitioned into two sets by breaking each branch of the species tree. For each of the possible splits, the residual variance ($u_T^2$) and the fit error ($s_T^2$) were calculated by using equation 5c. The split with the minimal $s_T^2$ was accepted. The relative evolutionary rate (v) was calculated by using equation 5a, and the distance to the transfer source was calculated by using equation 5b. Additionally, to detect the transfers originating from outside the group, the relative transfer distance ($D_T$) was calculated as $D^*/\max(D_{KL})$, with the maximum taken over all cross-group pairs.

Two statistical tests were performed for each COG. The first test aimed at discriminating between H0 (the data do not follow either of the two clock-like models) and H1 (the data fit either the simple-clock or single-transfer model). The ratio $F_C = u_N^2/\min(u_C^2, u_T^2)$ was subjected to Fisher's test with (N′ − 1, N′ − 3) degrees of freedom. If the value of $F_C$ exceeded the critical level at the 0.05 level of significance (1.94 to 2.64, depending on the group of species analyzed), H0 was rejected, and the data were considered to conform to the molecular clock model. The second test discriminated between H0 (the data fit the simple-clock model) and H1 (the data fit the single-transfer model). The ratio $F_T = s_C^2/s_T^2$ was subjected to Fisher's test with (N′ − 2, N′ − 3) degrees of freedom.
If the value of $F_T$ exceeded the critical level at the 0.05 level of significance (1.95 to 2.66 depending on the group of species analyzed), H0 was rejected, and the data were considered to conform significantly better to the single-transfer model. In this case, the value of v calculated by using equation 5a (rather than that calculated by using equation 4a) was used to describe the relative evolutionary rate for the COG in question; a $D_T$ value of >1 was considered to be an indication of HGT from an outside source.

Empirical results. (i) Molecular clock and deviations from clock-like evolution in bacteria. We applied the theory described above to an analysis of the evolution of three major bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and low-G+C-content gram-positive bacteria. For each of these groups, the COGs that contained a single representative from each species were selected for analysis (Table 1). Examples of relationships between the intergenomic and intergenic distances are shown in Fig. 2.
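The two variance-ratio tests described in the statistical analysis above could be combined into a per-COG decision along the following lines. This is a sketch under stated assumptions: the critical values are passed in by the caller (the text reports 1.94 to 2.64 for the first test and 1.95 to 2.66 for the second, depending on the lineage), the tests are applied sequentially, and the function name and the returned labels are hypothetical.

```python
def classify_cog(u_n2, u_c2, u_t2, s_c2, s_t2, crit_clock, crit_transfer):
    """Classify one COG from its fitted variances.

    u_n2: baseline variance; u_c2, u_t2: residual variances of the simple-clock
    and single-transfer models; s_c2, s_t2: fit errors of the same two models.
    crit_clock and crit_transfer are the F critical values for the two tests.
    """
    if min(u_c2, u_t2) <= 0 or s_t2 <= 0:
        raise ValueError("variances in the denominators must be positive")
    f_c = u_n2 / min(u_c2, u_t2)   # first test: any clock-like model vs noise
    if f_c <= crit_clock:
        return "no clock-like fit"
    f_t = s_c2 / s_t2              # second test: single transfer vs simple clock
    if f_t > crit_transfer:
        return "single-transfer model"
    return "simple clock"
```

Running the second test only when the first one rejects the noise model is a design choice of this sketch; the text states only that both tests were performed for each COG.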
(ii) Relative evolutionary rates in the three bacterial lineages. All three bacterial lineages analyzed displayed a wide range of relative evolutionary rates (i.e., evolutionary rates normalized by using the intergenomic rate). The difference between the COGs that evolved fastest and the COGs that evolved slowest reached almost 2 orders of magnitude for the γ-Proteobacteria and almost 1 order of magnitude within the set of 108 conserved COGs that fit one of the two evolutionary models (Table 2). The distributions of relative evolution rates were close to log normal in all three lineages (Fig. 3).
(iii) Apparent violations of molecular clock and anomalies in gene phylogenies. All COGs that fit the two-mode model better than the simple clock-like model were analyzed further by using the limited tree procedure (see Materials and Methods for details). Briefly, the purpose of this procedure was to investigate whether the two sets of species separated by the statistical analysis described above were also separated by a strongly supported internal branch in the phylogenetic tree for the given COG (Fig. 4).
Table 4 lists the top 10 COGs with the most pronounced deviation from the clock-like model (i.e., the greatest DT value) for each of the bacterial lineages analyzed. These COGs were further subjected to complete phylogenetic analysis, including statistical tests (see Materials and Methods), in order to determine the nature of the anomalies detected. Upon detailed inspection of the trees produced with the three methods, the cases were classified into those that most likely reflected HGT resulting in XGD, those that were best explained by accelerated evolution in a particular lineage, and those for which the evolutionary scenario remained uncertain. Of the 30 cases analyzed, only 5 did not appear to be resolved well enough to reach conclusions on the evolutionary mechanism; one case involved acceleration of evolution, and the remaining cases seemed to be best explained by invoking HGT, mostly resulting in XGD. The statistical tests comparing the likelihoods of tree topologies with different positions of the corresponding branches validated the anomalous clustering and thus supported the case for HGT (Table 4). In functional terms, the majority of the inferred cases of HGT involved metabolic enzymes, in agreement with the complexity hypothesis, which postulated that genes coding for proteins that are not involved in tight interactions with other proteins as subunits of macromolecular complexes are more readily subject to HGT (25). It is also noteworthy that 5 of the top 30 cases included aminoacyl-tRNA synthetases, a category of enzymes which appears to be particularly prone to HGT (49, 50). In COG0060 (isoleucyl-tRNA synthetase, IleS), both Rickettsia species and C. acetobutylicum are separated from their sister groups (α-Proteobacteria and gram-positive bacteria, respectively) and partition into the archaeon-eukaryote part of the tree (Fig. 5A).
COG1217 (predicted membrane GTPase involved in stress response, TypA) is another case of the C. acetobutylicum sequence being separated from the sequences of the rest of the Bacillus-Clostridium group. The C. acetobutylicum protein clusters with the protein from Fusobacterium nucleatum (an arrangement which is observed in phylogenetic trees of many genes [Wolf, unpublished data]) and is separated from its cognate group by the cyanobacterial, actinobacterial, and Deinococcus branches (Fig. 5B). In COG1282 (NAD/NADP transhydrogenase beta subunit, PntB) the Agrobacterium sequence clusters with γ-Proteobacteria and Neisseria, whereas the rest of the α-Proteobacteria cluster with Ralstonia (Fig. 5C). COG1526 (uncharacterized protein required for formate dehydrogenase activity, FdhD) exemplifies the frequently observed separation of Vibrio from the rest of the γ-Proteobacteria (Fig. 5D).

We also examined whether the presence of a molecular clock violation in a COG is correlated for the three bacterial lineages studied. In the set of 108 conserved COGs, the Pearson correlation coefficients were −0.02 between γ- and α-Proteobacteria, −0.11 between γ-Proteobacteria and gram-positive bacteria, and 0.17 between α-Proteobacteria and gram-positive bacteria. The distribution of the number of anomalies detected in each COG (range, 0 to 3) nearly perfectly agreed with the expectation under the independence hypothesis [P(χ2) > 0.5]. Thus, somewhat counterintuitively, HGT appeared to occur independently in different lineages, and there was no obvious HGT propensity that could be considered a characteristic of evolution of an entire COG. Nor did we detect any significant connection between the violations of the molecular clock detected and the relative evolution rates. The distributions of v values were very similar for split and nonsplit sets, and the difference between the means was insignificant for all three groups according to the t test (data not shown).
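The independence check described above can be illustrated with a small sketch. It assumes that each lineage is represented by a list of 0/1 flags (1 when a clock violation was detected in that COG), computes the Pearson correlation between two such lists, and derives the number of COGs expected to show 0, 1, 2, ... anomalies if the lineages behaved independently. The data representation and function names are assumptions made for illustration, and the chi-square comparison of observed and expected counts is left out.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def expected_anomaly_counts(flags_by_lineage):
    """Expected number of COGs with 0..k anomalies under independence.

    flags_by_lineage holds one 0/1 list per lineage, all of the same length
    (one entry per COG). The per-lineage anomaly frequency is used as the
    probability of an anomaly in that lineage.
    """
    n = len(flags_by_lineage[0])
    probs = [sum(flags) / n for flags in flags_by_lineage]
    dist = [1.0]  # probability of exactly k anomalies, built lineage by lineage
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, w in enumerate(dist):
            new[k] += w * (1 - p)
            new[k + 1] += w * p
        dist = new
    return [n * w for w in dist]
```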
General discussion and conclusions. Comparative genomics allows researchers to restate old questions of evolutionary biology on a grander scale and, perhaps, in a more biologically meaningful way. Thus, rather than analyzing the molecular clock for one or a few selected protein families, it is possible to address the issue at the level of complete genomes and to ask what fraction of the genes follow the clock-like model of evolution and which genes demonstrably deviate from it. Here we describe a simple theoretical framework that allowed us to classify orthologous sets of bacterial genes (COGs) into these two categories. A similar relative rate test was described by Syvanen in his analysis of the acceleration of evolution of rRNA in eukaryotes (45). We found that for several hundred COGs analyzed representing three well-defined bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and the Bacillus-Clostridium group, the clock-like null hypothesis could not be rejected for ~70% of the COGs, whereas the rest showed substantial anomalies. It should be noted that the null hypothesis employed here is, in fact, a soft clock, which, strictly speaking, does not require constancy of evolutionary rates for the genes analyzed. All that is required is that the rate distribution remains constant (i.e., all genes are allowed to accelerate or decelerate synchronously) (21). Notably, we also found that, within the set of 108 COGs that were represented by a single ortholog in all species analyzed from the three lineages, the relative evolutionary rates were strongly correlated among the lineages. This observation emphasizes the general validity of the soft genomic clock. We also analyzed the nature of the observed anomalies and found that, for the most conspicuous anomalies, the majority were most readily explained by HGT from phylogenetically distant lineages. 
Importantly, this is a conservative estimate because we analyzed only a special set of well-behaved COGs, which contained exactly one ortholog from each of the species included. HGT events have been classified into three broad categories: (i) acquisition of genes new to the recipient lineage, (ii) acquisition of paralogs of resident genes, and (iii) XGD, in which a resident gene is displaced by an ortholog from a different lineage (30). The present analysis was designed to identify only cases of XGD (although in some exceptional situations we obtained indications of paralog acquisition where the COG analyzed seemed to contain hidden paralogs). The results suggest that XGD occurs during evolution of ~10 to 15% of the bacterial genes. This relatively low fraction of HGT among single-ortholog bacterial genes is compatible with the notion that, at least within well-defined clades, such as the γ-Proteobacteria, the α-Proteobacteria, and the Bacillus-Clostridium group, these genes may be combined to produce organismal phylogenies, preferably after exclusion of the genes with detected HGT (9, 33, 51). However, the fraction of likely HGT detected here is considerably greater than that recently reported for single-ortholog gene sets from γ-Proteobacteria (33). It seems likely that the difference is a consequence of the criterion used for selection of orthologous gene sets in the latter study, in which only genes that are highly conserved within γ-Proteobacteria were examined. This criterion probably resulted in exclusion of orthologous sets deviating from the clock-like behavior. An unexpected finding of this study is the lack of significant correlation among the three bacterial lineages analyzed with respect to the deviations from the clock model and the probable occurrence of XGD. An observation of such a deviation in any one of the lineages was a poor predictor of deviations in other lineages. 
This result seems to be poorly compatible with the complexity hypothesis (25) and similar notions concerning the dependence of HGT on biological function and is more in line with the ideas on random scatter and lineage-specific trends of HGT events (55). It should be noted that we analyzed a limited gene set and only one form of HGT (XGD). Functional correlates are likely to emerge in larger-scale studies, but our present results indicate that these connections are far from being absolute.

The approach described here allows rapid identification of orthologous gene sets whose evolution significantly deviates from the soft clock model. Interpretation of these deviations as lineage-specific acceleration of evolution, XGD, or a combination of the two requires detailed phylogenetic analysis. Nevertheless, we believe that this methodology has its own advantages and could be useful in the study of genome-wide evolutionary trends. In particular, this approach allows workers to detect significant deviations from the clock-like model of evolution in a particular lineage without using any information on species outside that lineage, such as the (often unknown) source of the potential HGT. More practically, the procedure described here could be suitable for removing anomalous COGs from multigene sets employed for construction of organismal phylogenies.

Acknowledgments

We thank Eva Czabarka and Georgy Karev (National Center for Biotechnology Information) for helpful discussions on the statistical analysis of the data. P.S.N., M.S.G., and A.A.M. were partially supported by grants from the Howard Hughes Medical Institute (grant 55000309), the programs “Molecular and Cellular Biology” and “Origin and Evolution of the Biosphere” of the Russian Academy of Sciences, and the Fund for Support of Russian Science (MSG).

REFERENCES

1. Adachi, J., and M. Hasegawa. 1992. MOLPHY: programs for molecular phylogenetics. Institute of Statistical Mathematics, Tokyo, Japan.
2. Aravind, L., R. L. Tatusov, Y. I. Wolf, D. R. Walker, and E. V. Koonin. 1998. Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet. 14:442-444.
3. Brochier, C., E. Bapteste, D. Moreira, and H. Philippe. 2002. Eubacterial phylogeny based on translational apparatus proteins. Trends Genet. 18:1-5.
4. Brochier, C., H. Philippe, and D. Moreira. 2000. The evolutionary history of ribosomal protein RpS14: horizontal gene transfer at the heart of the ribosome. Trends Genet. 16:529-533.
5. Bromham, L., and D. Penny. 2003. The modern molecular clock. Nat. Rev. Genet. 4:216-224.
6. Brown, J. R. 2003. Ancient horizontal gene transfer. Nat. Rev. Genet. 4:121-132.
7. Clarke, G. D., R. G. Beiko, M. A. Ragan, and R. L. Charlebois. 2002. Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J. Bacteriol. 184:2072-2080.
8. Cutler, D. J. 2000. Understanding the overdispersed molecular clock. Genetics 154:1403-1417.
9. Daubin, V., N. A. Moran, and H. Ochman. 2003. Phylogenetics and the cohesion of bacterial genomes. Science 301:829-832.
10. Doolittle, R. F., and J. Handy. 1998. Evolutionary anomalies among the aminoacyl-tRNA synthetases. Curr. Opin. Genet. Dev. 8:630-636.
11. Doolittle, W. F. 1999. Lateral genomics. Trends Cell Biol. 9:M5-M8.
12. Doolittle, W. F. 1999. Phylogenetic classification and the universal tree. Science 284:2124-2129.
13. Fay, J. C., and C. I. Wu. 2001. The neutral theory in the genomic era. Curr. Opin. Genet. Dev. 11:642-646.
14. Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Mass.
15. Felsenstein, J. 1996. Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol. 266:418-427.
16. Fitch, W. M. 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19:99-106.
17. Gamieldien, J., A. Ptitsyn, and W. Hide. 2002. Eukaryotic genes in Mycobacterium tuberculosis could have a role in pathogenesis and immunomodulation. Trends Genet. 18:5-8.
18. Gillespie, J. H. 1991. The causes of molecular evolution. Oxford University Press, Oxford, United Kingdom.
19. Gillespie, J. H. 1984. The molecular clock may be an episodic clock. Proc. Natl. Acad. Sci. USA 81:8009-8013.
20. Gogarten, J. P., W. F. Doolittle, and J. G. Lawrence. 2002. Prokaryotic evolution in light of gene transfer. Mol. Biol. Evol. 19:2226-2238.
21. Grishin, N. V., Y. I. Wolf, and E. V. Koonin. 2000. From complete genomes to measures of substitution rate variability within and between proteins. Genome Res. 10:991-1000.
22. Hansmann, S., and W. Martin. 2000. Phylogeny of 33 ribosomal and six other proteins encoded in an ancient gene cluster that is conserved across prokaryotic genomes: influence of excluding poorly alignable sites from analysis. Int. J. Syst. Evol. Microbiol. 50:1655-1663.
23. Huang, X. 1994. On global sequence alignment. Comput. Appl. Biosci. 10:227-235.
24. Iyer, L. M., E. V. Koonin, and L. Aravind. 2004. Evolution of bacterial RNA polymerase: implications for large-scale bacterial phylogeny, domain accretion, and horizontal gene transfer. Gene 335:73-88.
25. Jain, R., M. C. Rivera, and J. A. Lake. 1999. Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. USA 96:3801-3806.
26. Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, United Kingdom.
27. Koonin, E. V. 2003. Horizontal gene transfer: the path to maturity. Mol. Microbiol. 50:725-727.
28. Koonin, E. V., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, D. M. Krylov, K. S. Makarova, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, I. B. Rogozin, S. Smirnov, A. V. Sorokin, A. V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin, and D. A. Natale. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.
29. Koonin, E. V., and M. Y. Galperin. 2002. Sequence-evolution-function. Computational approaches in comparative genomics. Kluwer Academic Publishers, New York, N.Y.
30. Koonin, E. V., K. S. Makarova, and L. Aravind. 2001. Horizontal gene transfer in prokaryotes: quantification and classification. Annu. Rev. Microbiol. 55:709-742.
31. Koski, L. B., and G. B. Golding. 2001. The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol. 52:540-542.
32. Lawrence, J. G., and H. Hendrickson. 2003. Lateral gene transfer: when will adolescence end? Mol. Microbiol. 50:739-749.
33. Lerat, E., V. Daubin, and N. A. Moran. 2003. From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-proteobacteria. PLoS Biol. 1:E19.
34. Makarova, K. S., V. A. Ponomarev, and E. V. Koonin. 2001. Two C or not two C: recurrent disruption of Zn-ribbons, gene duplication, lineage-specific gene loss, and horizontal gene transfer in evolution of bacterial ribosomal proteins. Genome Biol. 2:RESEARCH 0033.
35. Myers, E. W., and W. Miller. 1988. Optimal alignments in linear space. Comput. Appl. Biosci. 4:11-17.
36. Nelson, K. E., R. A. Clayton, S. R. Gill, M. L. Gwinn, R. J. Dodson, D. H. Haft, E. K. Hickey, J. D. Peterson, W. C. Nelson, K. A. Ketchum, L. McDonald, T. R. Utterback, J. A. Malek, K. D. Linher, M. M. Garrett, A. M. Stewart, M. D. Cotton, M. S. Pratt, C. A. Phillips, D. Richardson, J. Heidelberg, G. G. Sutton, R. D. Fleischmann, J. A. Eisen, C. M. Fraser, et al. 1999. Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399:323-329.
37. Notredame, C., D. G. Higgins, and J. Heringa. 2000. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302:205-217.
38. Ochman, H., J. G. Lawrence, and E. A. Groisman. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405:299-304.
39. Ohta, T., and J. H. Gillespie. 1996. Development of neutral and nearly neutral theories. Theor. Popul. Biol. 49:128-142.
40. Ponting, C. P., L. Aravind, J. Schultz, P. Bork, and E. V. Koonin. 1999. Eukaryotic signalling domain homologues in archaea and bacteria. Ancient ancestry and horizontal gene transfer. J. Mol. Biol. 289:729-745.
41. Ragan, M. A. 2001. On surrogate methods for detecting lateral gene transfer. FEMS Microbiol. Lett. 201:187-191.
42. Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502-504.
43. Sicheritz-Ponten, T., and S. G. Andersson. 2001. A phylogenomic approach to microbial evolution. Nucleic Acids Res. 29:545-552.
44. Sonnhammer, E. L., and E. V. Koonin. 2002. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. 18:619-620.
45. Syvanen, M. 2002. Rates of ribosomal RNA evolution are uniquely accelerated in eukaryotes. J. Mol. Evol. 55:85-91.
46. Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin, and D. A. Natale. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.
47. Tatusov, R. L., E. V. Koonin, and D. J. Lipman. 1997. A genomic perspective on protein families. Science 278:631-637.
48. Tatusov, R. L., D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin. 2001. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 29:22-28.
49. Woese, C. R., G. J. Olsen, M. Ibba, and D. Soll. 2000. Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process. Microbiol. Mol. Biol. Rev. 64:202-236.
50. Wolf, Y. I., L. Aravind, N. V. Grishin, and E. V. Koonin. 1999. Evolution of aminoacyl-tRNA synthetases—analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res. 9:689-710.
51. Wolf, Y. I., I. B. Rogozin, N. V. Grishin, and E. V. Koonin. 2002. Genome trees and the tree of life. Trends Genet. 18:472-479.
52. Wolf, Y. I., I. B. Rogozin, N. V. Grishin, R. L. Tatusov, and E. V. Koonin. 2001. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol. Biol. 1:8.
53. Wolf, Y. I., I. B. Rogozin, and E. V. Koonin. 2004. Coelomata and not ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 14:29-36.
54. Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555-556.
55. Zhaxybayeva, O., P. Lapierre, and J. P. Gogarten. 2004. Genome mosaicism and organismal lineages. Trends Genet. 20:254-260.
56. Zuckerkandl, E., and L. Pauling. 1962. Molecular evolution, p. 189-225. In M. Kasha and B. Pullman (ed.), Horizons in biochemistry. Academic Press, New York, N.Y.
57. Zuckerkandl, E., and L. Pauling. 1965. Molecules as documents of evolutionary history. J. Theor. Biol. 8:357-366.