Copyright © 2001 Wolf et al; licensee BioMed Central Ltd. Verbatim copying and redistribution of this article are permitted in any medium for any non-commercial purpose, provided this notice is preserved along with the article's original URL. For commercial use, contact info@biomedcentral.com

Genome trees constructed using five different approaches suggest new major bacterial clades

1National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
2Howard Hughes Medical Institute and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9050, USA

Corresponding author. #Contributed equally.

Yuri I Wolf: wolf/at/ncbi.nlm.nih.gov; Igor B Rogozin: rogozin/at/ncbi.nim.nih.gov; Nick V Grishin: grishin/at/chop.swmed.edu; Roman L Tatusov: tatusov/at/ncbi.nlm.nih.gov; Eugene V Koonin: koonin/at/ncbi.nlm.nih.gov

Received September 20, 2001; Accepted October 23, 2001.

Abstract

Background

The availability of multiple complete genome sequences from diverse taxa prompts the development of new phylogenetic approaches, which attempt to incorporate information derived from comparative analysis of complete gene sets or large subsets thereof. Such attempts are particularly relevant because of the major role of horizontal gene transfer and lineage-specific gene loss, at least in the evolution of prokaryotes.

Results

Five largely independent approaches were employed to construct trees for completely sequenced bacterial and archaeal genomes: i) presence-absence of genomes in clusters of orthologous genes; ii) conservation of local gene order (gene pairs) among prokaryotic genomes; iii) parameters of identity distribution for probable orthologs; iv) analysis of concatenated alignments of ribosomal proteins; v) comparison of trees constructed for multiple protein families.
All constructed trees support the separation of the two primary prokaryotic domains, bacteria and archaea, as well as some terminal bifurcations within the bacterial and archaeal domains. Beyond these obvious groupings, the trees made with different methods appeared to differ substantially in terms of the relative contributions of phylogenetic relationships and similarities in gene repertoires caused by similar life styles and horizontal gene transfer to the tree topology. The trees based on presence-absence of genomes in orthologous clusters and the trees based on conserved gene pairs appear to be strongly affected by gene loss and horizontal gene transfer. The trees based on identity distributions for orthologs and particularly the tree made of concatenated ribosomal protein sequences seemed to carry a stronger phylogenetic signal. The latter tree supported three potential high-level bacterial clades: i) Chlamydia-Spirochetes, ii) Thermotogales-Aquificales (bacterial hyperthermophiles), and iii) Actinomycetes-Deinococcales-Cyanobacteria. The latter group also appeared to join the low-GC Gram-positive bacteria at a deeper tree node. These new groupings of bacteria were supported by the analysis of alternative topologies in the concatenated ribosomal protein tree using the Kishino-Hasegawa test and by a census of the topologies of 132 individual groups of orthologous proteins. Additionally, the results of this analysis put into question the sister-group relationship between the two major archaeal groups, Euryarchaeota and Crenarchaeota, and suggest instead that Euryarchaeota might be a paraphyletic group with respect to Crenarchaeota.

Conclusions

We conclude that, the extensive horizontal gene flow and lineage-specific gene loss notwithstanding, extension of phylogenetic analysis to the genome scale has the potential of uncovering deep evolutionary relationships between prokaryotic lineages.

Background

The determination of multiple, complete genome sequences of bacteria, archaea and eukaryotes has created the opportunity for a new level of phylogenetic analysis that is based not on a phylogenetic tree for selected molecules, for example, rRNAs, as in traditional molecular phylogenetic studies [1,2], but (ideally) on the entire body of information contained in the genomes. The most straightforward version of this type of analysis, which we hereinafter refer to as 'genome-tree' building, involves scaling-up the traditional tree-building approach and analyzing the phylogenetic trees for multiple gene families (in principle, all families represented in many genomes), in an attempt to derive a consensus, 'organismal' phylogeny [3-5]. However, because of the wide spread of horizontal gene transfer and lineage-specific gene loss, at least in the prokaryotic world, comparison of trees for different families and consensus derivation may become highly problematic [6,7]. Probably due to all these problems, a pessimistic conclusion has been reached that prokaryotic phylogeny might not be reconstructable from protein sequences, at least with current phylogenetic methods [4]. With the complete genome sequences at hand, it appears natural to seek alternatives to traditional, alignment-based tree-building in the form of integral characteristics of the evolutionary process.
Probably the most obvious of such characteristics is the presence-absence of representatives of the analyzed species in orthologous groups of genes, and recently, at least three groups have employed this approach to build genome trees, primarily for prokaryotes [8-10]. An alternative way to construct a genome tree involves using the mean or median level of similarity among all detectable pairs of orthologs as the measure of the evolutionary distance between species [11]. Yet another possibility involves building species trees by comparing gene orders. This approach was pioneered in the classical work of Dobzhansky and Sturtevant, who used inversions in Drosophila chromosomes to construct an evolutionary tree [12]. Subsequently, mathematical methods have been developed to calculate rearrangement distances between genomes, and, using these, phylogenetic trees have been built for certain small genomes, such as plant mitochondria and herpesviruses [13,14]. These approaches, however, are applicable only to genomes that show significant conservation of global gene order, which is manifestly not the case among prokaryotes [15-17]. Even relatively close species such as, for example, Escherichia coli and Haemophilus influenzae, two species of the γ-subdivision of Proteobacteria, retain very little conservation of gene order beyond the operon level (typically, two-to-four genes in a row), and essentially none is detectable among evolutionarily distant bacteria and archaea [15,16,18]. Very few operons, primarily those coding for physically interacting subunits of multiprotein complexes such as certain ribosomal proteins or RNA-polymerase subunits, are conserved across a wide range of prokaryotic lineages [15,16]. On the other hand, pairwise comparisons of even distantly related prokaryotic genomes reveal a considerable number of shared (predicted) operons, which creates an opportunity for a meaningful comparative analysis [19-21].

The critical issue with all these approaches to genome tree building is to what extent each of them reflects phylogeny and to what extent they are affected by other evolutionary processes, such as lineage-specific gene loss and horizontal gene transfer. Comparative analyses have strongly suggested that these phenomena make major contributions to genome evolution, at least in prokaryotes [7,22-25]. These phenomena have the potential to severely affect phylogenetic tree topology, particularly when similar sets of genes are lost in different lineages because of similar environmental pressures, or when a preferential trend of horizontal gene flow exists between different lineages. The possibility has even been discussed that the amount of lateral gene exchange is such that it invalidates the very principle of representing the evolution of species as a tree; instead, the only adequate representation of evolutionary history could be a complex network [6,25]. Genome-trees seem to be the last resort for the species tree concept. Unless phylogenetic signal can be revealed by at least some approaches based on genome-wide comparisons, the conclusion seems imminent that this concept should be abandoned and replaced by a more complex representation of evolution.
Here, we compare the topologies produced with five, largely independent approaches to genome-tree building: i) presence-absence of genomes in Clusters of Orthologous Groups of proteins (COGs); ii) conservation of local gene order (pairs of adjacent genes) among prokaryotic genomes; iii) distribution of percent identity between apparent orthologs; iv) sequence conservation in concatenated alignments of ribosomal proteins; v) comparative analysis of multiple trees reconstructed for representative protein families. We find that, while the presence-absence approach is most heavily affected by gene loss and horizontal transfer, the other four methods reveal stronger phylogenetic signals. Although the topologies of the trees constructed with different approaches were only partially compatible, three previously unnoticed high-level clades among bacteria were revealed with notable consistency. We suggest that, in spite of all the complexity brought about by horizontal gene transfer and lineage-specific gene loss, these groups reflect a certain evolutionary reality, i.e. the trajectory of evolution for a relatively stable gene core. It appears that this is the only meaningful way to treat the notion of a species tree: as the history of a relatively large ensemble of genes, not a comprehensive representation of the history of entire genomes.

Results

New criteria for genome-tree construction

To our knowledge, conserved gene pairs and distributions of identity level between orthologs have not been used previously as the basis for phylogenetic tree construction. Therefore we start by describing the relevant results of prokaryotic genome comparison in somewhat greater detail.

Conserved gene pairs in prokaryotic genomes

The results of the present analysis of conserved gene pairs are consistent with the notion of the fluidity of prokaryotic gene order caused by extensive recombination.
Only 17 invariant gene pairs were detected, all of which consist of genes for ribosomal proteins and RNA polymerase subunits. The remaining 4586 gene pairs were missing in at least one genome. The number of gene pairs represented in three, four and a greater number of genomes decayed rapidly, with highly conserved pairs forming the tail of the distribution (Fig. 1).
The number of conserved gene pairs present in individual prokaryotic genomes varied from 208 for M. genitalium to 2314 for P. aeruginosa (Table 1). Analysis of the co-occurrence of gene pairs among the prokaryotic genomes shows high values of the Jaccard coefficient, which reflect partial conservation of gene order (see legend to Table 1), for closely related species, for example, 0.32 for E. coli and H. influenzae and 0.35 for M. thermoautotrophicum and M. jannaschii (Table 1). The value of this coefficient varied from 0.16 to 0.66, with a mean of 0.26, for archaea, and from 0.04 to 0.87, with a mean of 0.16, for bacteria. In contrast, for archaeal-bacterial comparisons, the values varied from 0.04 to 0.18, with an average of 0.08 (Table 1). These observations appear to indicate that the distribution of conserved gene pairs among prokaryotic genomes carries a phylogenetic signal.
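The co-occurrence measure used above can be illustrated with a minimal sketch. It assumes the standard set-based definition of the Jaccard coefficient (shared items divided by items present in either set); the exact definition given in the legend to Table 1 is not reproduced in this text, and the function name and COG identifiers below are hypothetical toy data.

```python
def jaccard(pairs_a, pairs_b):
    """Standard Jaccard coefficient |A & B| / |A | B| of two collections.

    Assumption: this is the usual set-based definition, not necessarily
    the exact formula from the legend to Table 1.
    """
    a, b = set(pairs_a), set(pairs_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty genomes
    return len(a & b) / len(union)


# Hypothetical toy data: a conserved gene pair is stored as an unordered
# pair of COG identifiers, so gene orientation does not matter.
genome_x = {frozenset(p) for p in [("COG0001", "COG0002"),
                                   ("COG0002", "COG0003"),
                                   ("COG0010", "COG0011")]}
genome_y = {frozenset(p) for p in [("COG0002", "COG0001"),
                                   ("COG0010", "COG0011"),
                                   ("COG0020", "COG0021")]}

print(jaccard(genome_x, genome_y))  # 2 shared pairs out of 4 distinct pairs -> 0.5
```

Storing each pair as a `frozenset` is a design choice of this sketch: it makes the pair (A, B) equal to (B, A) without any extra normalization step.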
Distributions of identity percentage between probable orthologs from complete prokaryotic genomes

Figure 2
The use of reciprocal best hits is a conservative way to identify the set of probable orthologs between pairs of genomes because some of the orthologs are missed due to complex relationships between groups of paralogs. Nevertheless, all genome-to-genome comparisons included at least 100 (for the smallest genomes such as the mycoplasmas), and typically, a considerably greater number of protein pairs ([11] and data not shown). This suggests that parameters of the distributions of the similarity level between probable orthologs identified in this fashion could potentially serve as useful measures of the evolutionary distance between genomes.

Genome trees constructed with three different approaches

Genome trees were generated using the approaches described under Material and Methods. All the trees showed a clear separation of the two major prokaryotic domains, Bacteria and Archaea (Fig. 3).
Presence-absence of genomes in COGs

The topology of the parsimony tree built using this criterion appears to reflect primarily the phenotypes of the respective organisms (Fig. 3).
i) bacteria with large genomes, namely E. coli, B. subtilis, Synechocystis sp., Deinococcus radiodurans and Mycobacterium tuberculosis, and free-living bacteria with small genomes, A. aeolicus and T. maritima;
ii) parasites with small genomes (mycoplasmas, spirochetes, chlamydia and rickettsia).
Parasites with moderate-sized genomes (H. influenzae, N. meningitidis, and P. multocida; H. pylori and C. jejuni) formed two distinct groups. Thus, well-established phylogenetic relationships between free-living and parasitic bacteria, such as those within the Proteobacteria (E. coli-H. influenzae-P. multocida-N. meningitidis) and within low-GC Gram-positive bacteria (B. subtilis-mycoplasmas), are not reflected accurately in this tree topology. The two free-living bacteria with small genomes, the hyperthermophiles A. aeolicus and T. maritima, did not join either the free-living or the parasitic bacterial cluster, despite their small number of genes similar to that in bacterial parasites (Fig. 3). In previous studies that employed similar approaches to genome-tree building, phylogenetically reasonable clades were observed after a simple omission of parasitic species [8,9]. Such an operation could be applied to the tree shown in Fig. 3.

Conserved gene pairs

The topology of the tree based on gene pair conservation seems to carry a stronger phylogenetic signal than the gene presence-absence tree because it correctly groups together related free-living and parasitic bacteria despite major differences in gene repertoires (Fig. 4). At least some unusual aspects of this tree's topology could be explained by horizontal transfer of operons between particular bacterial and archaeal lineages. Specifically, it has been noticed previously that T. maritima shares a considerable number of genes and operons with Gram-positive bacteria, to the exclusion of other bacteria [21]; this seems to be compatible with the position of T. maritima with the Gram-positive cluster. Similarly, considerable horizontal gene transfer appears to have occurred between the Sulfolobus and Thermoplasma lineages, which cluster together in the archaeal part of this tree. The presence of extra species in the proteobacterial cluster is more surprising because no obvious trend for operon transfer between these bacteria and bona fide Proteobacteria has been noticed during systematic genome comparisons; however, a considerable number of shared gene pairs was detected during the present analysis (Table 1). Artifacts of tree construction could also contribute to these associations. In contrast, the spirochete-chlamydia clade might reflect a deep phylogenetic relationship (see discussion below).

Parameters of percent identity distributions between orthologs

Different characteristics of the distributions of percent identity between the probable orthologs, such as the mean, the median, the mode and various quantiles, were used to calculate distances between genomes and construct phylogenetic trees. Trees built with different cut-off values for symmetrical best hits, four different formulas for the evolutionary distance calculation (see Materials and Methods) and different parameters of the distributions showed essentially the same topology, with strong bootstrap support for most of the clades (Fig. 5).

Alignment-based approaches to the construction of a species tree

The above three approaches involve construction of genome trees "par excellence", i.e. based on integral characteristics of genomes (or, more precisely, gene sets) that are not directly related to more traditional, alignment-based measures, which are usually employed for calculating evolutionary distances or for parsimony analysis.
These genome trees raise several interesting phylogenetic questions, for example, do spirochetes and chlamydia indeed share a common ancestor, and are Euryarchaeota, in fact, a paraphyletic group with respect to the Crenarchaeota? However, the reliability of the conclusions drawn from the topology of these trees remains uncertain. Therefore we decided to complement these genome-oriented approaches with more traditional ones applied on a large scale.

Concatenated alignments of ribosomal proteins

The alignments of the 32 ribosomal proteins conserved in all bacterial and archaeal species were concatenated head-to-tail and treated as a single alignment containing 4821 columns. The underlying assumption is that the genes coding for ribosomal proteins that function as components of a large macromolecular complex are unlikely to undergo horizontal transfer, which tends to confound comparisons of the tree topologies for other protein families and would invalidate the concatenation approach. The resulting maximum-likelihood tree contains the complete proteobacterial and Gram-positive bacterial clusters as well as the spirochete-chlamydia cluster noticed in the genome-trees. In addition to the spirochetes-chlamydia clade, the following non-trivial affinities were detected with strong bootstrap support: i) a cluster of the two hyperthermophiles, A. aeolicus and T. maritima, ii) a cluster including D. radiodurans, Synechocystis, and M. tuberculosis, which, at a deeper level, joined the Gram-positive bacterial branch (Fig. 6). The reliability of the observed non-trivial groupings was further examined by using a maximum likelihood approach (the Kishino-Hasegawa test). For each clade (usually, species) forming the group to be tested, trees with alternative topologies were manually constructed by joining the clade in question to every other major group in the tree.
For example, to assess the support for the spirochetes-chlamydia grouping, spirochetes were placed, sequentially, with Thermotoga, Aquifex, the Thermotoga-Aquifex branch, ε-proteobacteria, the αβγ-proteobacterial branch, Proteobacteria, the Deinococcus-Synechocystis-Mycobacterium cluster, the low G+C Gram-positive cluster, the branch that unites the latter two clusters, and between bacteria and archaea (to the bacterial root). The same alternatives were tested for chlamydia. Alternative topologies were compared either directly, using the ProtML program, or were subjected to local rearrangement first. In cases when the topology did not revert to the original one, the final, "optimized" topology was used for the comparison. These tests showed high stability of the Thermotoga-Aquifex and Deinococcus-Synechocystis-Mycobacterium groupings (no competing topologies with likelihood within 1 SD unit from the original; Fig. 7).
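The comparison of an alternative topology with the original one can be illustrated with a minimal sketch of a Kishino-Hasegawa-style calculation. It assumes that per-site log-likelihoods for the two topologies are already available as plain lists; how ProtML reports them is not reproduced here, and the function name and numbers are hypothetical.

```python
import math


def kh_difference(site_lnl_a, site_lnl_b):
    """Log-likelihood difference between two topologies and its standard error.

    site_lnl_a, site_lnl_b: per-site log-likelihoods of the same alignment
    under topology A and topology B (hypothetical inputs).
    Returns (delta, se), where delta = lnL(A) - lnL(B) and se is estimated
    from the variance of the per-site differences.
    """
    if len(site_lnl_a) != len(site_lnl_b):
        raise ValueError("both topologies must be scored on the same sites")
    n = len(site_lnl_a)
    if n < 2:
        raise ValueError("at least two sites are needed to estimate a variance")
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    delta = sum(diffs)
    mean = delta / n
    variance = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    return delta, math.sqrt(variance)


# Hypothetical toy values for a four-column alignment.
original = [-1.0, -2.0, -3.0, -4.0]
alternative = [-1.5, -2.0, -3.5, -4.0]
delta, se = kh_difference(original, alternative)
# An alternative lying within 1 SD of the original would satisfy delta <= se.
print(delta, se, delta <= se)
```

In this reading, a competing topology is regarded as not clearly worse when the likelihood difference does not exceed one standard error, which is the criterion the text refers to as "within 1 SD unit".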
A census of protein families

Another approach to the "species tree" problem involves analysis of phylogenetic trees for as many individual protein families as possible, in an attempt to identify a prevailing topology or at least common phylogenetic patterns. A survey of the COG data set identified 132 COGs, each of which included a large number of bacterial and archaeal species, but no or few paralogs and thus appeared to be amenable to a large-scale phylogenetic analysis (Table 11). Maximum-likelihood trees were constructed for each of these COGs, and a breakdown of nearest neighbors was derived for species and groups involved in each of the non-trivial or questionable branchings discussed above (Crenarchaea, Thermotoga, Aquifex, Deinococcus, Mycobacterium, Synechocystis, spirochetes, chlamydia, and ε-proteobacteria). In each case, a wide spread of topologies was observed, but the grouping that is observed in the concatenated ribosomal proteins tree was encountered most often, although, for example, for the spirochete-chlamydia cluster, the lead over other topologies was slim (Fig. 13).
Discussion and Conclusions

The trees constructed with each of the four approaches employed here reflect both the phylogenetic signal and the phenotypic (life style) similarities or differences between organisms, but the relative contributions of these two types of information appear to differ substantially. The gene presence-absence analysis seemed to be dominated by the phenotypic signal, primarily that from gene loss. The tree based on conserved gene pairs appeared to combine phylogenetic information with major effects of horizontal transfer of operons. In contrast, the trees based on the distributions of the identity level of orthologs appear to be more meaningful phylogenetically as indicated by the recovery of established high-level phylogenetic groups of bacteria, such as Proteobacteria and Gram-positive bacteria. The ability to correctly identify these major bacterial subdivisions and the absence of obviously wrong groupings confer credibility to non-trivial clades present in these trees, in particular the spirochete-chlamydia clade. The same logic applies to the tree made of concatenated ribosomal protein sequences, which included two other non-trivial bacterial groupings, Aquifex-Thermotoga and Synechocystis-Mycobacterium-Deinococcus, the latter joining the Gram-positive branch. Furthermore, extensive testing of alternative topologies using the Kishino-Hasegawa test largely supported these new bacterial branches. The nature of this support becomes clearer when one examines the results of the protein family census. Each of the potential new clades was indeed most common among the observed topologies, but in no case was the excess of this topology overwhelming. Taken together, these results seem to shed light on the very notion of a "species tree". It appears that, at best, a species tree can be viewed as a prevailing phylogenetic trend, which, as far as deep branchings are concerned, may not even apply to a majority of the genes in a genome.
The potential new, deep relationships between bacterial lineages revealed during this analysis should be considered preliminary and treated with caution. Nevertheless, an evolutionary affinity between Cyanobacteria (Synechocystis) and Actinomycetes (Mycobacterium) appears plausible, particularly given the presence, in these bacterial groups, of well-developed and partly similar signal transduction systems [27]. The connection between two hyperthermophilic bacteria, Aquifex and Thermotoga, also has obvious biological meaning, although, in this case, particular caution is due, given the possibility of preferential horizontal gene exchange between these organisms that inhabit similar environments. However, the strong support for this grouping obtained in the analysis of concatenated ribosomal proteins argues against horizontal transfer as the primary cause for the observed topology. Although recent studies on the phylogeny of ribosomal proteins suggest some horizontal transfer events, these seem to be largely restricted to bacteria-specific ribosomal proteins. In the universal set of ribosomal proteins, only one, S14, showed clear signs of horizontal transfer [28]. The potential deep phylogenetic connections uncovered during this analysis call for detailed genome comparisons in search of potential shared derived characters, such as unique protein domain architectures, that could support the new clades. The major bacterial lineages are poorly resolved in rRNA-based trees [2,29] and those built using alignments of RNA polymerase subunits [30] and translation elongation factors [29,31]. In the currently accepted taxonomy, which is based primarily (but not exclusively) on 16S RNA phylogenetic analysis, bacterial lineages that are suggested by this analysis to form higher-level clusters, tend to form primary nodes under Bacteria (Chlamydiales, Spirochetales, Cyanobacteria, the Thermus-Deinococcus group, Aquificales, Thermotogales). 
Thus, the genome trees primarily suggest (however tentatively) new unifications based on deep phylogenetic connections, rather than split already established clades. A notable exception is the traditional unification of Actinomycetes, or high G+C Gram-positive bacteria (represented here by Mycobacterium), with low G+C Gram-positive bacteria (the Bacillus-Clostridium group) under Firmicutes (Gram-positive bacteria). Such a connection was not supported by any of the trees analyzed here, and it is also poorly, if at all, supported by the latest consensus trees for 16S RNA, 23S RNA and translation factor EF-Tu [29]. Therefore it seems likely that the Firmicutes clade, at least in its present composition, does not exist. The new clade that might replace it consists of low-GC Gram-positive bacteria and the potential Actinomycetes-Deinococcales-Cyanobacteria group (Fig. 6).

An independent phylogenetic study of concatenated ribosomal proteins has been recently published [32]. The main specific conclusion reported in this study was the apparent association of Synechocystis with Gram-positive bacteria, although instability of the tree topology dependent on the subset of sites used for analysis was noticed. Another recent study addressed the issue of a global tree through phylogenetic analysis of 14 concatenated sets of orthologous proteins, for which no strong evidence of horizontal transfer was available [33]. Notably, some of the unexpected groupings within the bacterial domain reported in this study coincide or overlap with those described here, namely, a spirochete-chlamydial clade and a Deinococcales-Cyanobacteria clade. The grouping of the latter clade with Actinomycetes, the unification of the Deinococcales-Cyanobacteria-Actinomycetes clade with Gram-positive bacteria and the grouping of the two bacterial hyperthermophiles were not reproduced in the work of Brown and co-workers.
The differences between the results of the two studies could owe to the differences between data sets analyzed, the methods used or, most likely, both. We should note that the present study engaged a substantially broader data set and more diverse methods for tree construction. We believe, however, that, in terms of the potential contribution of genome-wide phylogenetic analysis to phylogenetic taxonomy, the areas where different methods and independent analyses by different groups converge might be more important than the areas of discrepancy. It appears that potential new clades revealed in such independent studies are strong candidates for new, high-level taxa. The results of the present study suggest that genome trees based on new, integral criteria do not provide substantial advantages in phylogenetic reconstruction over more traditional, alignment-based methods expanded to the genomic scale. In fact, the latter seem to be more sensitive in detecting potential deep evolutionary relationships and this is expected to further improve with the increasing number of completely sequenced genomes becoming available for analysis. We believe, however, that this conclusion does not necessarily indicate that genome trees, such as those based on representation of genomes in orthologous sets or conservation of gene pairs, are useless. In addition to revealing some new phylogenetic affinities, they are capable of alerting researchers to other evolutionary phenomena, such as loss of similar gene sets in different organisms and preferential horizontal gene exchange between certain lineages. Material and Methods Sequence data The sequences of the proteins encoded in complete genomes were extracted from the Genome division of the Entrez retrieval system [34]. The analyzed genomes included those of 30 bacteria: Aquifex aeolicus (Aquae), Bacillus halodurans (Bacha), Bacillus subtilis (Bacsu), Borrelia burgdorferi (Borbu), Buchnera sp. 
(Bucsp), Campylobacter jejuni (Camje), Caulobacter crescentus (Caucr), Chlamydia trachomatis (Chltr), Chlamydophila pneumoniae (Chlpn), Deinococcus radiodurans (Deira), Escherichia coli (Escco), Haemophilus influenzae (Haein), Helicobacter pylori (Helpy), Lactococcus lactis (Lacla), Mesorhizobium loti (Meslo), Mycoplasma genitalium (Mycge), Mycoplasma pneumoniae (Mycpn), Mycobacterium tuberculosis (Myctu), Neisseria meningitidis (Neime), Pasteurella multocida (Pasmu), Pseudomonas aeruginosa (Pseae), Rickettsia prowazekii (Ricpr), Staphylococcus aureus (Staau), Streptococcus pyogenes (Strpy), Synechocystis PCC6803 (SynPC), Thermotoga maritima (Thema), Treponema pallidum (Trepa), Ureaplasma urealyticum (Ureur), Vibrio cholerae (Vibch), Xylella fastidiosa (Xylfa), and ten archaea: Aeropyrum pernix (Aerpe), Archaeoglobus fulgidus (Arcfu), Halobacterium sp. (Halsp), Methanobacterium thermoautotrophicum (Metth), Methanococcus jannaschii (Metja), Pyrococcus horikoshii (Pyrho), Pyrococcus abyssi (Pyrab), Sulfolobus solfataricus (Sulso), Thermoplasma acidophilum (Theac), Thermoplasma volcanium (Thevo).

Phylogenetic tree construction

Parsimony trees based on the presence-absence of conserved gene pairs in prokaryotic genomes

The database of Clusters of Orthologous Groups of proteins (COGs) was used as the source of information on orthologous genes in prokaryotic genomes [35,36]. Briefly, the COGs were constructed from the results of all-against-all BLAST [37] comparison of proteins encoded in complete genomes by detecting consistent groups of genome-specific best hits (BeTs). The COG construction procedure does not rely on any preconceived phylogenetic tree of the included species except that certain obviously related genomes (for example, two species of mycoplasmas or pyrococci) were grouped prior to the analysis, to eliminate strong dependence between BeTs.
In order to avoid spurious occurrence of the same gene pair, only gene pairs conserved in three or more genomes were considered. A pair of genes from two COGs was considered to be conserved if the respective genes were adjacent in at least one genome and were separated by no more than two genes in at least two additional genomes. This relaxed definition of a conserved gene pair was adopted to take into account the high level of recombination in prokaryotic genomes. From the data on the presence-absence of each conserved gene pair in the analyzed genomes (excluding pairs of closely related species: E. coli-Buchnera sp., H. influenzae-P. multocida, C. trachomatis-C. pneumoniae, P. horikoshii-P. abyssi, M. genitalium-M. pneumoniae-U. urealyticum, H. pylori-C. jejuni, T. acidophilum-T. volcanium), a 0/1 matrix analogous to the one used for the presence-absence of individual genes was constructed, and a tree was built using Dollo parsimony [38]. A parsimony method was chosen for this analysis because the presence-absence of a conserved gene pair in a genome can be naturally treated in terms of character states. The Dollo model is based on the assumption that each derived character state (in this case, the presence of a gene pair) originates only once, and homoplasies exist only in the form of reversals to the ancestral condition (absence of a gene pair) [38]. In other words, parallel or convergent gains of the derived condition are assumed to be highly unlikely. The Dollo parsimony method is not sensitive to gene loss, which is extremely common in the evolution of prokaryotes, but the results can be affected by independent acquisition of the same gene pair by different genomes via horizontal gene transfer. Phylogenetic analysis was performed by using the PAUP 4.0 program [39], with 1000 bootstrap replicates performed to assess the reliability of the tree topology. In addition, the tree topology was analyzed using the neighbor-joining method [40].

Parsimony trees based on the representation of genomes in orthologous gene sets

The information on orthologous genes in prokaryotic genomes and the yeast genome was derived from the COGs as in the previous approach, and the orthology data were similarly represented as a 0/1 matrix of presence-absence of the analyzed genomes in the COGs. A Dollo parsimony tree was constructed and the reliability of its topology was assessed using the bootstrap method as described above.

Distance trees based on distributions of identity percentage between orthologous protein sequences

The sequences of all proteins encoded in the analyzed genomes were compared to each other using the gapped BLASTP program [37]. Reciprocal, genome-specific BeTs were collected at different expectation (E) value cutoffs (0.01, 0.001, 0.0001, 0.00001). This method for identification of probable orthologs is, in principle, similar to the method employed in COG construction, but differs in that there is no requirement for the formation of triangles of consistent BeTs.
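The collection of reciprocal best hits just described can be sketched as follows. This is a minimal illustration, not the procedure actually run for the paper: it assumes that BLASTP results have already been parsed into (query, subject, E-value, score) tuples, and that the "best" hit is the one with the highest score among hits passing the E-value cutoff. All names and toy values are hypothetical.

```python
def best_hits(hits, e_cutoff):
    """Return {query: best subject} among hits with E-value <= e_cutoff.

    hits: iterable of (query, subject, evalue, score) tuples (assumed to be
    parsed from a BLAST report beforehand).
    """
    best = {}
    for query, subject, evalue, score in hits:
        if evalue > e_cutoff:
            continue  # hit is not significant at this cutoff
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, e_cutoff=0.001):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b, e_cutoff)
    reverse = best_hits(b_vs_a, e_cutoff)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical toy data for two small genomes A and B.
a_vs_b = [("a1", "b1", 1e-50, 200), ("a1", "b2", 1e-10, 60),
          ("a2", "b2", 1e-30, 120), ("a3", "b1", 1e-5, 40)]
b_vs_a = [("b1", "a1", 1e-50, 200), ("b2", "a2", 1e-30, 120),
          ("b2", "a1", 1e-10, 60)]

for cutoff in (0.01, 0.001, 0.0001, 0.00001):
    print(cutoff, reciprocal_best_hits(a_vs_b, b_vs_a, cutoff))
```

Protein a3 illustrates why the selection is conservative: its best hit is b1, but b1 points back to a1, so the non-symmetrical pair is dropped.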
The reciprocal-BeT procedure yields a conservative selection of orthologous pairs because cases of lineage-specific duplication that result in non-symmetrical BeTs are excluded, as are orthologous pairs with very low sequence similarity. However, the limitation of the COG system, namely the requirement that each orthologous group be represented in at least three genomes, is avoided. The distributions of identity percentage among the reciprocal best hits were derived for each pair of species. The mean, mode, median and different quantiles of the identity percentage distributions were used for estimating evolutionary distances. Four distance measures were used, namely: i) P-distances calculated as the fraction of different residues, d = 1 - q; ii) Poisson distances, d = -ln u; iii) geometric distances calculated using the formula d = 1/u - 1; and iv) logarithmic distances found as a solution of the equation u = ln(1 + 2d)/(2d), where d is the evolutionary distance, q is the percent identity expressed as a fraction, and u = (q - 0.05)/0.95 [41,42,43]. Trees were constructed from the distance matrices obtained with the above distance estimates using the neighbor-joining method [40] as implemented in the NEIGHBOR program of the PHYLIP package [44]. Bootstrap values were estimated by resampling the set of orthologs identified for each pair of genomes 1000 times and reconstructing trees from the distributions of the distances from these resampled sets.

Maximum Likelihood trees based on concatenated alignments of ribosomal proteins

Sets of orthologous ribosomal proteins were extracted from the COG database, and their amino acid sequences were aligned using the T-Coffee program [45], with subsequent manual validation and removal of poorly aligned regions. The alignments are available upon request. Pairwise evolutionary distances between the sequences in concatenated alignments were calculated using the Dayhoff PAM model as implemented in the PROTDIST program of the PHYLIP package [44].
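As an aside on the four identity-based distance measures defined above (P, Poisson, geometric and logarithmic), the following Python sketch shows one way they could be evaluated for a single summary value of identity, such as the median for a pair of genomes. The function names, the use of bisection for the logarithmic distance, and the assumption that q is supplied as a fraction between 0.05 and 1 are illustrative choices, not details taken from the original analysis.

```python
import math


def normalized_identity(q):
    """u = (q - 0.05) / 0.95, with q the identity as a fraction (0 to 1)."""
    return (q - 0.05) / 0.95


def p_distance(q):
    """Fraction of different residues: d = 1 - q."""
    return 1.0 - q


def poisson_distance(q):
    """Poisson distance: d = -ln u."""
    return -math.log(normalized_identity(q))


def geometric_distance(q):
    """Geometric distance: d = 1/u - 1."""
    return 1.0 / normalized_identity(q) - 1.0


def logarithmic_distance(q, iterations=200):
    """Solve u = ln(1 + 2d) / (2d) for d by bisection (an assumed solver)."""
    u = normalized_identity(q)
    if u <= 0.0:
        raise ValueError("identity must exceed 0.05 for a finite distance")
    if u >= 1.0:
        return 0.0

    def f(d):
        # Positive while d is below the solution, because
        # ln(1 + 2d) / (2d) decreases from 1 towards 0 as d grows.
        return math.log1p(2.0 * d) / (2.0 * d) - u

    lo, hi = 1e-12, 1.0
    while f(hi) > 0.0:  # expand the bracket until it contains the root
        hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Example: a median identity of 52.5% gives u = 0.5.
q_example = 0.525
distances = {
    "p": p_distance(q_example),
    "poisson": poisson_distance(q_example),
    "geometric": geometric_distance(q_example),
    "logarithmic": logarithmic_distance(q_example),
}
```

For the same identity value the four measures diverge increasingly as identity falls, which is why the choice of measure matters most for distantly related genomes.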
A distance tree was constructed from the resulting PROTDIST distance matrix using the least-squares method [46] as implemented in the FITCH program of PHYLIP [44]. The maximum likelihood tree was constructed with the JTT-F model of amino acid substitutions [47], as implemented in the ProtML program of the MOLPHY package [48], by optimizing the least-squares tree with local rearrangements. Alternative topologies were created manually by modifying the original tree and were compared directly using ProtML. Bootstrap analysis was performed using the Resampling of Estimated Log-Likelihoods (RELL) method as implemented in ProtML [48,49].

Comparative analysis of Maximum Likelihood trees for individual protein families

The representative families were selected from the COG database according to the following criteria: i) at least 30 species are represented; ii) no more than two paralogs in any of the species; iii) no more than 1.2 paralogs per genome on average; iv) at least 100 positions in the alignment containing less than 30% of gaps. This selection procedure resulted in a set of 132 families (COGs). Alignments and ML trees were constructed for these families as described above for the concatenated ribosomal proteins.

Quantitative comparison of tree topologies

To compare tree topologies quantitatively, the symmetric distance between trees [50] was computed using the TREEDIST program of the PHYLIP package (version 3.6a). Briefly, each internal branch of a tree divides the set of species into two partitions. The symmetric distance is the number of partitions that are found in one tree but not in the other.
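The symmetric distance described above can be sketched in a few lines of Python. This is an illustration of the definition only, not of TREEDIST itself: trees are represented as nested tuples of leaf-name strings (an assumed encoding), they are treated as unrooted, and each partition is stored as the side that does not contain an arbitrarily chosen reference leaf.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial partitions induced by internal branches.

    Each partition is represented by the side lacking the reference leaf,
    so the same split always maps to the same frozenset.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Skip trivial splits that separate fewer than two leaves.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def symmetric_distance(tree_a, tree_b):
    """Count partitions present in one tree but not in the other."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same set of leaves")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Two five-leaf trees that differ in the placement of B and C.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
distance_12 = symmetric_distance(tree_1, tree_2)
distance_11 = symmetric_distance(tree_1, tree_1)
```

Here the two trees share the partition separating D and E from the rest and each has one partition the other lacks, so the distance is 2, while a tree compared with itself gives 0.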
Acknowledgements

We thank M. Nei for stimulating discussions about the Dollo parsimony analysis, J. Felsenstein for alerting us to the inclusion of the TREEDIST program in PHYLIP 3.6a, and D. Leipe for discussions on taxonomy.

References