J Bacteriol. Jan 2009; 191(1): 65–73.
Published online Oct 31, 2008. doi:  10.1128/JB.01237-08
PMCID: PMC2612427

Trends in Prokaryotic Evolution Revealed by Comparison of Closely Related Bacterial and Archaeal Genomes


In order to explore microevolutionary trends in bacteria and archaea, we constructed a data set of 41 alignable tight genome clusters (ATGCs). We show that the ratio of the medians of nonsynonymous to synonymous substitution rates (dN/dS) that is used as a measure of the purifying selection pressure on protein sequences is a stable characteristic of the ATGCs. In agreement with previous findings, parasitic bacteria, notwithstanding the sometimes dramatic genome shrinkage caused by gene loss, are typically subjected to relatively weak purifying selection, presumably owing to relatively small effective population sizes and frequent bottlenecks. However, no evidence of genome streamlining caused by strong selective pressure was found in any of the ATGCs. On the contrary, a significant positive correlation between the genome size, as well as gene size, and selective pressure was observed, although a variety of free-living prokaryotes with very close selective pressures span nearly the entire range of genome sizes. In addition, we examined the connections between the sequence evolution rate and other genomic features. Although gene order changes much faster than protein sequences during the evolution of prokaryotes, a strong positive correlation was observed between the “rearrangement distance” and the amino acid distance, suggesting that at least some of the events leading to genome rearrangement are subjected to the same type of selective constraints as the evolution of amino acid sequences.

With the rapid growth of the number of completely sequenced bacterial and archaeal genomes, comparative genomics of prokaryotes is coming of age. Broadly, comparative genomics has the potential to address two major classes of phenomena, macroevolutionary and microevolutionary. The study of macroevolutionary problems, such as the relationship between bacterial and archaeal phyla or horizontal gene transfer between distant organisms, benefits most from comparative analysis of maximally diverse genomes. In contrast, microevolutionary processes that arguably are central to our understanding of the mechanisms of evolution, such as the differential effects of selection on different types of genomic sequences, require comparison of multiple, closely related genomes (2, 15, 23).

Until recently, sets of multiple, closely related genomes suitable for microevolutionary studies have been available, at best, for a few model bacteria, such as Escherichia coli, or bacteria of special interest, such as Bacillus anthracis. However, with the progress in sequencing technology and the resulting rapid increase in the number of completed genomes, this situation has changed. Currently, the number of completely sequenced prokaryotic genomes is growing exponentially, with a doubling time of approximately 21 months (17). As of August 2008, 847 bacterial and 97 archaeal genomes were available, and about 1,900 genome projects were in progress (19). From this diverse collection of prokaryotic genomes, we recently created a data set of alignable tight genome clusters (ATGCs) that includes closely related genomes from numerous groups of bacteria and archaea and that was devised as a flexible platform for research in microevolutionary genomics of prokaryotes (27; http://atgc.lbl.gov). Obviously, there can be many different definitions of a “close” relationship between genomes, and more importantly, different degrees of closeness are optimal for different types of analysis. Given that in prokaryotes gene order is known to evolve much faster than sequences of homologous proteins (4, 14, 24, 35), our main approach in the construction of ATGCs involved selecting genomes that maintained a sufficient amount of synteny to cover a significant fraction of genes and to serve as an aid in the identification of orthologs (16). Typically, the ATGCs include either different bacterial (archaeal) species from the same genus or strains of the same species. The inclusion of only closely related genomes in each ATGC ensures reliable identification of orthologous genes and the availability of high-quality alignments for most, if not all, of them.
The ATGCs constructed with this partially formalized approach include bacterial and archaeal genomes with considerable variation of evolutionary distances that have the potential to inform the investigation of key microevolutionary questions. These questions include the effects of purifying and positive selection on different classes of sites (nonsynonymous, synonymous, and noncoding sites), the connections between these effects, global genome characteristics (size and nucleotide composition), genome architecture (the sizes of genes and intergenic regions, and gene density), various aspects of the organism's life style, and the relationship between sequence evolution and recombination.

Here, we compare the evolutionary features of the ATGCs and show that the ratio of the rates of nonsynonymous to synonymous nucleotide substitutions (dN/dS), which is widely used to characterize the nature and strength of the selective pressure affecting protein sequences (13, 18, 29), is a stable characteristic of prokaryotic lineages, at least at small evolutionary scales. We further reveal the strong relationship between dN and the rate of genome rearrangement and describe nontrivial connections between the purifying selection pressure and other characteristics of genomes.



The ATGCs were built from microbial genomes downloaded from the RefSeq database (release 26; 4 November 2007); RefSeq sequences that correspond to plasmids and transposons were excluded from the analysis. The ATGCs are available online and are downloadable from the ATGC website (27; http://atgc.lbl.gov). The definition of an ATGC requires that the genomes in a cluster be closely related with respect to the sequences of orthologous genes and alignable in terms of gene synteny. Accordingly, the clustering procedure consisted of two steps (a more detailed description is given in reference 27). In the first step, genomes were clustered by the sequence similarity of orthologous genes. For a particular pair of genomes, the likely orthologous genes were identified as bidirectional best hits (BBHs) (31) in an all-against-all BLASTP search (1), and for each pair of orthologs, dS and dN were estimated from the alignments of the coding nucleotide sequences using the maximum-likelihood method implemented in PAML with equilibrium codon frequencies, the basic codon substitution model, and a uniform dN/dS ratio for all codons (36). In this step, the median of dS over all orthologous pairs of genes for a particular pair of genomes was used as the intergenomic distance. The preliminary clusters were derived from an ultrametric genome tree that was constructed from the matrix of intergenomic distances using the KITSCH program of the PHYLIP package (9). The clusters were defined by using the depth cutoff dS value of 1.5.
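The BBH criterion used in the first step can be illustrated with a minimal Python sketch. It assumes that the all-against-all BLASTP results have already been reduced to dictionaries mapping (query, subject) gene pairs to a score; the function names, the gene identifiers, and the tie-breaking rule (the first hit encountered wins) are illustrative assumptions and are not taken from the published pipeline.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        # Strict ">" means that, on ties, the first hit encountered is kept.
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Return gene pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores for three genes of genome A and two genes of genome B.
a_vs_b = {("a1", "b1"): 200.0, ("a1", "b2"): 50.0,
          ("a2", "b2"): 120.0, ("a3", "b2"): 150.0}
b_vs_a = {("b1", "a1"): 198.0, ("b2", "a3"): 149.0, ("b2", "a2"): 118.0}
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

In this toy example, a2 has b2 as its best hit, but b2 prefers a3, so only the reciprocal pairs are retained.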

In the second step, the additional criterion of gene order similarity was applied to increase the reliability of identification of orthologs within a cluster. For each pair of genomes, all BBHs detected in the previous step were considered, and each BBH was tested to determine whether it belonged to a synteny region. A BBH was considered to be supported by synteny if there was a high density of adjacent BBHs in its close vicinity in both genomes. Specifically, the standard dot plot for a given genome pair was constructed from the complete set of BBHs (32). For each BBH, the synteny support was calculated as the maximum number of other BBHs in a sliding window of length 7 that included the original BBH. The BBHs with five or more BBHs in the neighborhood were considered to be supported by synteny.
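One possible reading of the synteny support calculation is sketched below in Python. The sketch assumes that each BBH is represented as a point (i, j) on the dot plot, where i and j are the ordinal positions of the two genes in their genomes, and that the "sliding window of length 7" is a 7-by-7 square on the dot plot that is slid over every placement containing the BBH. Both choices, together with the treatment of chromosomes as linear rather than circular, are assumptions made for illustration; the published procedure (reference 27) may differ in detail.

```python
from typing import Dict, List, Tuple

BBH = Tuple[int, int]  # (gene index in genome A, gene index in genome B)


def synteny_support(bbhs: List[BBH], window: int = 7) -> Dict[BBH, int]:
    """Maximum number of other BBHs in any window-by-window square containing each BBH."""
    a_to_b = dict(bbhs)  # each gene of genome A takes part in at most one BBH
    support: Dict[BBH, int] = {}
    for i, j in bbhs:
        best = 0
        # Try every placement of the square that still contains the point (i, j).
        for a_start in range(i - window + 1, i + 1):
            for b_start in range(j - window + 1, j + 1):
                count = 0
                for a in range(a_start, a_start + window):
                    b = a_to_b.get(a)
                    if a != i and b is not None and b_start <= b < b_start + window:
                        count += 1
                best = max(best, count)
        support[(i, j)] = best
    return support


def supported_by_synteny(bbhs: List[BBH], min_neighbors: int = 5) -> List[BBH]:
    """BBHs with at least min_neighbors other BBHs in their best window."""
    support = synteny_support(bbhs)
    return [bbh for bbh in bbhs if support[bbh] >= min_neighbors]


# Hypothetical example: a collinear run of seven genes plus one isolated BBH.
example = [(k, k) for k in range(7)] + [(20, 50)]
kept = supported_by_synteny(example)
```

Under this reading, a window of length 7 can hold at most six other BBHs, which is consistent with a threshold of five.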

To determine whether two genomes were alignable, the rearrangement distance between two genomes was calculated as follows: DY = (Nb − Ns)/Nb, where DY is the synteny distance, Nb is the total number of BBHs and Ns is the number of BBHs supported by synteny.

Finally, to generate alignable clusters, all genomes in a cluster from the previous step were considered nodes in a graph; edges were added if DY was <0.15 (an essentially arbitrary cutoff chosen so that the substantial majority of BBHs are supported by synteny and, accordingly, are included in all analyses), and single-linkage clustering was performed. Connected components of the graph of size 2 or greater represented ATGCs.
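The synteny distance and the single-linkage step can be sketched together in Python. Single-linkage clustering at a fixed cutoff is equivalent to finding connected components of the graph described above, which the sketch does with a small union-find structure. The function names, genome labels, and BBH counts are hypothetical; only the formula DY = (Nb − Ns)/Nb and the 0.15 cutoff come from the text.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def synteny_distance(n_bbh: int, n_supported: int) -> float:
    """DY = (Nb - Ns) / Nb, the fraction of BBHs not supported by synteny."""
    if n_bbh <= 0 or not 0 <= n_supported <= n_bbh:
        raise ValueError("need 0 <= Ns <= Nb and Nb > 0")
    return (n_bbh - n_supported) / n_bbh


def alignable_clusters(genomes: List[str],
                       dy: Dict[Tuple[str, str], float],
                       cutoff: float = 0.15) -> List[List[str]]:
    """Connected components (size >= 2) of the graph with an edge wherever DY < cutoff."""
    parent = {g: g for g in genomes}

    def find(g: str) -> str:
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        d = dy.get((a, b), dy.get((b, a)))
        if d is not None and d < cutoff:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for g in genomes:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values() if len(members) >= 2)


# Hypothetical BBH counts (Nb, Ns) for four genomes from one preliminary cluster.
genomes = ["g1", "g2", "g3", "g4"]
dy = {
    ("g1", "g2"): synteny_distance(1000, 950),
    ("g2", "g3"): synteny_distance(1000, 900),
    ("g1", "g3"): synteny_distance(1000, 700),
    ("g1", "g4"): synteny_distance(1000, 500),
    ("g2", "g4"): synteny_distance(1000, 400),
    ("g3", "g4"): synteny_distance(1000, 300),
}
clusters = alignable_clusters(genomes, dy)
```

Here g1 and g3 end up in the same cluster even though their own DY exceeds the cutoff, because both are linked to g2; this chaining is the characteristic behavior of single linkage.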

Estimation of dN, dS, and dN/dS.

For each pair of genomes within an ATGC, alignments of the coding nucleotide sequences of all synteny-supported BBHs were generated using the MUSCLE program (6), and the synonymous (dS) and nonsynonymous (dN) substitution rates were estimated using the maximum-likelihood method implemented in PAML (36). The medians of dS (DS) and dN (DN) over all BBHs supported by synteny were considered two types of intergenomic distances. The DN/DS ratio was used as an estimate of the selective pressure that affected the compared genomes on the path of evolution after their radiation from the last common ancestor.
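As a small illustration of the definitions above, the following Python sketch computes DN, DS, and DN/DS for one genome pair from per-gene (dN, dS) estimates. The input values are hypothetical, and the per-gene estimates themselves are assumed to come from an external tool such as PAML.

```python
from statistics import median
from typing import List, Tuple


def intergenomic_distances(per_gene: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Return (DN, DS, DN/DS) from per-gene (dN, dS) values of synteny-supported BBHs."""
    dn_median = median(dn for dn, _ in per_gene)
    ds_median = median(ds for _, ds in per_gene)
    return dn_median, ds_median, dn_median / ds_median


# Hypothetical per-gene estimates for three orthologous gene pairs.
per_gene = [(0.01, 0.2), (0.02, 0.4), (0.09, 0.3)]
DN, DS, ratio = intergenomic_distances(per_gene)

# For comparison only: the median of per-gene ratios is a different quantity.
median_of_ratios = median(dn / ds for dn, ds in per_gene)
```

The sketch makes explicit that DN/DS is a ratio of medians, which in general differs from the median of the per-gene dN/dS ratios.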

To carry out the analysis of correlates of selective pressure, reliable estimates of selective pressure are required. Accordingly, only genome pairs with a DS value of <1.2 and a DN value of >0.01 were selected, and only those ATGCs that included at least one pair of genomes satisfying these criteria were used for further analysis. As a result, several clusters of densely sampled genomes, such as streptococci and staphylococci, which are represented in the ATGC web resource, were excluded from the present analysis.
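The selection criteria can be expressed as a short Python filter. The data layout (a dictionary of ATGC names mapping to lists of genome pairs, each with "DN" and "DS" entries) and all values are hypothetical; only the two thresholds are taken from the text.

```python
from typing import Dict, List

Pair = Dict[str, float]


def reliable_pairs(pairs: List[Pair], max_ds: float = 1.2, min_dn: float = 0.01) -> List[Pair]:
    """Keep genome pairs with DS < max_ds and DN > min_dn."""
    return [p for p in pairs if p["DS"] < max_ds and p["DN"] > min_dn]


def atgcs_for_analysis(pairs_by_atgc: Dict[str, List[Pair]]) -> Dict[str, List[Pair]]:
    """Keep only ATGCs that retain at least one reliable genome pair."""
    kept: Dict[str, List[Pair]] = {}
    for name, pairs in pairs_by_atgc.items():
        good = reliable_pairs(pairs)
        if good:
            kept[name] = good
    return kept


# Hypothetical clusters: "B" is too closely related, "C" is saturated at synonymous sites.
pairs_by_atgc = {
    "A": [{"DN": 0.05, "DS": 0.6}, {"DN": 0.005, "DS": 0.1}],
    "B": [{"DN": 0.002, "DS": 0.03}],
    "C": [{"DN": 0.3, "DS": 2.5}],
}
selected = atgcs_for_analysis(pairs_by_atgc)
```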

Genomic variables and PCA.

The values of the following seven genomic variables were calculated for each of the 41 ATGCs by averaging the values for the corresponding constituent genomes: the genome size (log scale), the number of proteins (log scale), the GC content, the median gene length (log scale), the median intergenic-spacer length (log scale), the fraction of pathogenic organisms, and the DN/DS ratio (log scale) (Table 1). All variables were standardized to a mean of 0 and a standard deviation of 1. Principal-component analysis (PCA) as implemented in the R analysis package (33) was performed on the 41-by-7 data table.
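The standardization step, and the idea behind extracting a principal component from standardized variables, can be sketched in plain Python. This is not the R routine used in the study: the sketch substitutes simple power iteration on the correlation matrix to obtain only the first component, and it assumes the sample standard deviation (n − 1 denominator). Variables analyzed on a log scale would be log-transformed before being passed in. All names and data are illustrative.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def standardize(column: List[float]) -> List[float]:
    """Rescale a variable to mean 0 and (sample) standard deviation 1."""
    mu, sigma = mean(column), stdev(column)
    return [(x - mu) / sigma for x in column]


def first_principal_component(columns: List[List[float]],
                              iterations: int = 200) -> Tuple[List[float], float]:
    """Return (loadings of PC1, fraction of total variance explained by PC1)."""
    z = [standardize(c) for c in columns]
    n, k = len(z[0]), len(z)
    # Correlation matrix of the standardized variables.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(k)]
            for i in range(k)]
    # Deliberately asymmetric start vector; power iteration then converges to the
    # leading eigenvector unless the start happens to be orthogonal to it.
    v = [1.0 / (i + 1) for i in range(k)]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        eigenvalue = norm
    # The trace of a correlation matrix equals the number of variables, k.
    return v, eigenvalue / k


# Hypothetical toy table: three variables measured on four clusters.
columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
loadings, explained = first_principal_component(columns)
```

The sign of the loadings is arbitrary, as in any PCA; a full analysis of the 41-by-7 table would use a complete eigendecomposition (for example, the R routine cited above) rather than this single-component sketch.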

TABLE 1.
Characteristics of the 41 ATGCs


Purifying selective pressure (dN/dS) as an intrinsic characteristic of prokaryotic lineages.

The basic measure of the selection pressure acting on protein-coding sequences is the dN/dS ratio. The lower the dN/dS ratio, the stronger the pressure of purifying selection that affects the given protein-coding gene. The dN/dS value also depends on positive selection, which is manifest in increased dN values. However, all of its biological importance notwithstanding, quantitatively positive selection appears to be a relatively minor factor (13, 18); thus, in practice, the dN/dS value calculated for complete protein-coding-gene sequences is primarily a measure of the strength of purifying selection. In order to characterize the dN/dS values on a genome-wide scale, we constructed the distributions of dN, dS, and dN/dS within each of the 41 analyzed ATGCs over the respective sets of orthologous protein-coding genes. In agreement with previous findings (11), the shapes of these distributions were closely similar among the ATGCs and could be best fit by the log-normal distribution (Fig. 1a to c). We then examined the dependence between the ratio of the medians of dN and dS (DN/DS) and the median of dN that served as a proxy for the evolutionary distance between genomes (very similar results were obtained with the median of dS [data not shown]). As exemplified in Fig. 2a to c, the points in the plots strongly clustered along a horizontal straight line, that is, the DN/DS value remained virtually constant within a given ATGC, with deviations seen only at very low DN. It has been reported that, for some groups of genomes, the estimates of the purifying selection pressure drop at small distances; that is, the smaller the distance, the greater the dN/dS ratio (29). It is generally thought that this effect is due to the large fraction of nonfixed, effectively neutral substitutions (polymorphisms) at small evolutionary distances. In the ATGC analysis, we indeed observed such a tendency (Fig. 1c).
Furthermore, in many ATGCs, DN/DS showed extremely high variance at small distances due to the lack of statistical power (data not shown).

FIG. 1.
Distributions of dS, dN, and dN/dS in orthologous gene sets from three genome pairs from different ATGCs. (a) Distribution (probability density) of dN. (b) Distribution (probability density) of dS. (c) Distribution (probability density) of dN/dS. Metma, ...
FIG. 2.
Dependence of DN/DS on the distance between genomes, measured as DN. Each point corresponds to a pair of genomes in the given ATGC. (a) Xanthomonas sp. (b) Shewanella sp. (c) P. marinus.

The principal outcome of this analysis is that at reliably measured evolutionary distances, namely, a DN value greater than 0.01 and a DS value of less than 1.2, DN/DS is a highly stable characteristic of prokaryotic lineages represented by the ATGCs (a dN value of <0.01 does not provide sufficiently large numbers of substitutions for a statistically reliable analysis, and estimates of dS values of >1.2 are unreliable owing to the saturation of synonymous sites with substitutions [13]). Given the effective constancy of DN/DS, the conclusion follows that either DN (for more distant genomes) or DS (for closer genomes) can serve as a robust measure of the evolutionary distance between genomes, whereas DN/DS is an appropriate measure of the purifying selection pressure. The stability of DN/DS within ATGCs suggests that, although for some of the clusters the number of genome pairs that fit the above criteria was small (Table 1), the DN/DS value nevertheless could be a robust indicator of the strength of selection for the ATGC as a whole, even in those cases.

The DN/DS values for individual ATGCs span an order of magnitude, between ~0.02 and ~0.2, with the median at ~0.06 (Fig. 3 and Table 1). The distribution of the DN/DS values is supposed to reflect the variance of the purifying selection pressure that affects the evolution of diverse bacteria and archaea. According to the population-genetic theory, the strength of (purifying) selection depends on the effective population size and characteristic mutation and recombination rates of the respective organisms, and these values themselves depend on the life style (20, 21). Examination of the data in Table 1 reveals no overwhelming pattern and few major trends. It is noticeable that the highest DN/DS values, that is, apparently, the weakest purifying selection pressure, are seen in obligate parasites, including intracellular ones. This observation is compatible with previous findings and is most likely explained by the small effective population size, frequent bottlenecks, and low level of recombination that are characteristic of intracellular parasites and symbionts (22).

FIG. 3.
Distribution (probability density) of DN/DS in the 41 analyzed ATGCs. The distribution curve was obtained by Gaussian-kernel smoothing of the individual data points (28).
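Gaussian-kernel smoothing of the kind mentioned in the caption can be sketched as follows. The bandwidth used for Fig. 3 is not stated in the text, so the value below is an arbitrary assumption, as are the sample DN/DS values (they are not taken from Table 1) and the choice to smooth on a linear rather than a logarithmic scale.

```python
import math
from typing import Callable, List


def gaussian_kde(points: List[float], bandwidth: float) -> Callable[[float], float]:
    """Return a density function: the average of Gaussian kernels centered on the points."""
    if bandwidth <= 0 or not points:
        raise ValueError("need at least one point and a positive bandwidth")
    scale = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        return scale * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


# Hypothetical DN/DS values for a handful of clusters; bandwidth chosen arbitrarily.
sample = [0.02, 0.05, 0.06, 0.07, 0.2]
density = gaussian_kde(sample, bandwidth=0.02)

# Numerical check that the smoothed curve integrates to approximately 1.
step = 0.001
area = sum(density(-0.5 + k * step) for k in range(1201)) * step
```

The smoothed curve is sensitive to the bandwidth: small values reproduce individual data points as separate peaks, whereas large values blur distinct groups together.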

Beyond this observation, hardly any generalizations could be immediately drawn from the results, and some values seemed unexpected. In particular, the lowest DN/DS values, that is, presumably the strongest purifying selection pressure, are seen in bacteria that are extremely different in terms of their genome sizes and life styles, namely, Listeria spp. (soil bacteria with medium-size genomes, some of which are pathogens), Lactococcus lactis (a saprophyte with a relatively small genome that grows on dairy products), and Pseudomonas aeruginosa, a versatile animal pathogen with a large genome (Table 1). It is hard to imagine a common denominator that could be responsible for the uniformly strong selection in these organisms. Conversely, one of the highest DN/DS values (particularly weak purifying selection or, less likely, strong positive selection) was seen in one of the three ATGCs that consist of strains of Prochlorococcus marinus, an abundant planktonic bacterium that is thought to form extremely large populations and that has been considered a model case for genome streamlining (5). In an attempt to reveal potential hidden patterns in the data that could reflect distinct evolutionary trends, we turned to a more systematic, quantitative analysis of the connections between the DN/DS ratio and other variables that characterize bacterial and archaeal genomes.

Correlates of purifying selection pressure in prokaryotic genomes.

Population-genetic theory predicts that organisms that are subjected to strong selection pressure will experience genome streamlining resulting in compact, typically small genomes with few mobile elements and paralogs, short intergenic regions (high gene density), and possibly even relatively short proteins (20, 21). We examined the connections between the DN/DS value, which is thought to reflect the strength of purifying selection pressure, and six other genome-related variables that were measured for each ATGC, namely, the genome size, the number of annotated proteins, the protein-coding gene size, the intergenic-region size, the GC content, and the fraction of pathogens in the ATGC (Table 1; see Materials and Methods for additional details). The correlations between the DN/DS value and the other variables were found to be moderately strong to weak and not necessarily of the expected sign. In particular, there was a moderate but highly statistically significant negative correlation between the DN/DS and the genome size of prokaryotes (Fig. 4a), i.e., larger genomes, on average, appear to be subjected to a stronger selection pressure than small genomes in an apparent contradiction of the theoretical prediction. Despite this correlation, the majority of the ATGCs are characterized by DN/DS values within the relatively narrow window between 0.04 and 0.08, and that group includes organisms spanning a broad range of genome sizes (Table 1 and Fig. 4a). In agreement with these findings, but unexpectedly considering theoretical predictions, there was also a significant negative correlation between the DN/DS value and the median length of protein-coding genes, that is, organisms encoding longer proteins, on average, seemed to be subjected to stronger purifying selection than those encoding shorter proteins (Fig. 4b).
In contrast, there was no significant correlation between the median intergenic-region size or gene density and the DN/DS ratio (Fig. 4c and d). On the whole, these correlations (or lack thereof) between the purifying selection pressure (measured through the DN/DS) and other genomic variables seem to be poorly compatible with the concept of streamlining caused by strong purifying selection.

FIG. 4.
Correlations between the purifying selection pressure (DN/DS) and other genomic variables. (a) DN/DS versus genome size. (b) DN/DS versus intergenic-region size. (c) DN/DS versus gene density. (d) DN/DS versus protein-coding-gene size. The dashed lines ...

Considering that the above correlations are relatively weak and their signs seem to be unexpected, we performed PCA of the DN/DS value and the other six genome-related variables. Linear regression analysis showed that altogether, approximately 64% of the variance of DN/DS could be explained through the rest of the variables, an observation that implies substantial connections between the variables. The loadings (contributions of each variable) of the first two principal components (PCs) are shown in Fig. 5a, and Fig. 5b shows the projection of the 41 ATGCs on the PC1-PC2 plane. The results seem to illustrate nontrivial trends in the evolution of prokaryotes. Based on the loadings of the first two PCs (Fig. 5a), PC1 can be viewed as reflecting global, genome level factors (principal contributions from the DN/DS, the genome [proteome] size, and the GC content), whereas PC2 seems to reflect mostly local, gene-specific factors (major contributions from the gene and intergenic-region lengths, but also significant contribution of the DN/DS). Together, the first two PCs accounted for ~66% of the original data variance.

FIG. 5.
PCA of seven genomic variables. (a) Loadings of the first two PCs. (b) Scatter of the ATGCs in the plane of the first two PCs. The red contour encloses the tight cluster of genomes, mostly those of free-living organisms, that are subjected to relatively ...

Free-living organisms with relatively large genomes, high GC contents, large proteins, long intergenic regions, and low DN/DS ratios (strong purifying selection) form a tight cluster in the PC1-PC2 plane at low PC1 values (Fig. 5b). In contrast, organisms with smaller, AT-rich genomes, many of which are pathogens that have relatively small genes separated by long intergenic regions (PC2), are generally characterized by high DN/DS ratios indicative of weak purifying selection; however, these organisms are widely scattered along the PC2 axis, that is, they differ greatly in terms of gene and intergenic-region sizes (Fig. 5b). The positive correlation between the genome size and the GC content is well known, if not well explained (8, 25). However, the set of characteristics of genomes that have low DN/DS ratios is incompatible with the concept of genome streamlining as a dominant trend of evolution under strong selection pressure (20, 21). There was very little taxonomic coherence in the results of the PCA of the prokaryotic genomes: even different strains of the same bacterial species that formed distinct ATGCs could be broadly spread, as illustrated in Fig. 5b for P. marinus.

Thus, the analysis of the links between the DN/DS ratio and other characteristics of prokaryotic genomes supports the notion that genomes of parasites, although small and often compact due to extensive gene loss, are typically subjected to weak purifying selection (22). In contrast, these findings do not seem to support the straightforward concept of genome streamlining caused by strong purifying selection pressure. One of the possible interpretations is that, although the DN/DS ratio is nearly constant within most ATGCs (see above), fluctuations are common at greater evolutionary scales, obscuring the effects of purifying selection on genomes. Alternatively or additionally, it is conceivable that the evolution of the genomes of bacteria and archaea, especially those that inhabit complex and variable environments, is shaped by the balance between streamlining under the pressure of purifying selection and selection for the maintenance of adequate complexity of the gene repertoire and regulatory networks.

Sequence evolution and rearrangement of prokaryotic genomes.

It was shown in the early days of comparative genomics and subsequently confirmed by numerous observations that gene order in prokaryotes is relatively poorly conserved during evolution, typically changing much faster than protein sequences (4, 14, 24, 35). Comparisons of closely related bacterial and archaeal genomes revealed a characteristic “cross-like” pattern of localization of orthologous genes, indicating that inversions around the origin of replication comprise one of the dominant routes of genome rearrangement (7, 34). We exploited the ATGCs to examine the patterns of genome rearrangement in prokaryotes and the possible effects of selection on rearrangement.

Ideally, analysis of genome rearrangements would involve reconstruction of the history of recombination events that occurred after the radiation of the compared genomes from their last common ancestor. Several algorithms have been developed for this type of analysis (3, 12), but in many cases, the number of rearrangement events even between two strains within a bacterial species is so large (Fig. 6a) that the reconstruction methods fail.

FIG. 6.
Patterns of genome rearrangement in prokaryotes. (a) Nearly complete decay of synteny (DY = 0.69; DN = 0.15; DS ≫ 1); Streptococcus sanguinis SK36 and Streptococcus pneumoniae R6. (b) Virtual absence of rearrangement (DY ~ ...

We used a simple, quantitative measure of genome rearrangement, the synteny distance (DY), which is calculated as the fraction of orthologous genes defined as BBHs between protein sequences (see Materials and Methods) that are not covered by synteny regions: DY = (Nb − Ns)/Nb, where Nb is the total number of BBHs and Ns is the number of BBHs contained in synteny regions. The estimation of this quantity does not require an explicit reconstruction of the recombination history, but it nevertheless reflects the “rearrangement distance” between the compared genomes. It should be noted that DY measures only those genome rearrangements that break synteny, that is, that involve transpositions of individual genes. Considerable genome rearrangement, in particular, inversions of chromosomal segments carrying multiple genes, can occur even at a DY of 0 (e.g., Fig. 6c).

Similarly to the way DN/DS is construed as a measure of purifying selection pressure that affects sequence evolution and, as shown above, is nearly constant within ATGCs, the DY/DS ratio could potentially reflect the purifying selection that affects genome rearrangement. Remarkably, when several ATGCs with a DY value of zero (see below) were left out, a strong positive correlation was observed between the DY/DS ratio and the DN/DS ratio (Fig. 7). This finding strongly suggests that in most prokaryotes the pressure of purifying selection that acts on at least certain types of genome rearrangement and on protein sequences is determined by the same factors.

FIG. 7.
Correlation between the mean purifying selection pressures affecting amino acid sequence evolution (DN/DS) and genome rearrangement (DY/DS). Rs, Spearman ranking correlation coefficient.
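For readers who wish to reproduce this kind of comparison, the Spearman rank correlation coefficient (Rs) can be computed as the Pearson correlation of ranks, with tied values receiving the average of their ranks. The Python sketch below uses hypothetical DN/DS and DY/DS values; it illustrates the statistic only and does not reproduce the data of Fig. 7.

```python
import math
from statistics import mean
from typing import List


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-cluster values; clusters with DY = 0 would be excluded beforehand.
dn_ds = [0.02, 0.05, 0.10, 0.20]
dy_ds = [0.10, 0.30, 0.20, 0.90]
rs = spearman(dn_ds, dy_ds)
```

Because it depends only on ranks, Rs is insensitive to monotonic transformations such as taking logarithms of either ratio.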

Distinct patterns of genome rearrangement in different groups of prokaryotes.

The measure of genome rearrangement, DY, that is introduced here is an overall characteristic of the evolutionary decay of gene synteny and does not reflect the specific events that lead to rearrangement. Beyond the strong positive correlation between the DY/DS and DN/DS ratios, prokaryotic genomes separated by similar evolutionary distances (measured as either DN or DY) show distinctive patterns of rearrangement that are captured by the genomic dot plots shown in Fig. 6. The following four patterns could be readily distinguished: (i) virtually no genome rearrangement and a few small insertions/deletions (this is the pattern seen in several ATGCs that include bacterial parasites with small genomes, such as Chlamydia, Chlamydophila, Rickettsia, and Borrelia; DY, ~0 [Fig. 6b]); (ii) multiple inversions centered at the origin of replication, resulting in a cross-like pattern and limited transposition, a pattern that can result in substantially rearranged genomes but DY values close to zero, as inversion does not disrupt synteny blocks (Fig. 6c); (iii) limited rearrangement with hot spots of gene transposition and low (but nonzero) DY values (Fig. 6d); and (iv) multiple inversions and transpositions with high DY values (Fig. 6e).

The factors that affect genome rearrangement are not well understood but presumably might have to do with the abundance of mobile elements (transposons) and the state of repair/recombination systems in the respective genomes. Of special interest are those ATGCs that, despite relatively large evolutionary distances reflected in high DN values, show virtually no rearrangement. One plausible view is that the lack of genome rearrangement is a selectively neutral phenomenon, simply reflecting the loss of a recombinational system that is required for rearrangement in the respective organisms. Indeed, it has been suggested that the low frequency of recombination in Corynebacterium compared to Mycobacterium was likely due to the absence of RecBCD, a well-characterized recombinational enzyme complex, in the former (26). However, inspection of the clusters of orthologous genes (30) failed to reveal consistent loss of any major repair/recombination genes (although individually these genomes certainly have lost some); most of these genomes also contained various transposons (data not shown). In particular, the RecBCD system is present in Chlamydia, Chlamydophila, and Borrelia, although it is missing in Rickettsia, an observation that rules out the straightforward explanation of the lack of rearrangement through the loss of recombinational capacity. Therefore, it remains unclear to what extent the lack of genome rearrangement in some of the bacterial parasites is due to the deterioration of repair/recombination systems in these genomes and to what extent this phenomenon might be caused by features of the population dynamics of these organisms and/or selective constraints. The latter remain to be investigated but generally might have to do with selection against breaking operons and, accordingly, disrupting gene coregulation.


The ATGCs present a platform for the analysis of various aspects of the microevolution of prokaryotes (27). Here, we have shown that the ratio of the medians of dN and dS over the set of orthologous genes that is thought to reflect the pressure of purifying selection affecting the protein-coding sequences in the respective genomes is a highly stable characteristic of ATGCs. Having established the stability of this measure, we examined the connections between the strength of purifying selection and other genomic characteristics. In agreement with previous reports (15, 22), we found that bacterial parasites, especially intracellular ones, despite the sometimes dramatic genome shrinkage caused by gene loss, are typically subjected to weak purifying selection, presumably owing to relatively small characteristic population sizes and frequent bottlenecks. Otherwise, however, the present results seem to emphasize the complexity of prokaryotic-genome evolution and to defy straightforward interpretations based on population-genetic theory. In particular, we did not detect any evidence of genome streamlining caused by a strong pressure of purifying selection (21). Contrary to the streamlining prediction, the organisms that are subjected to strong selection pressure tend to possess larger genomes and longer genes and intergenic regions than organisms evolving under weak selection. Certainly, this is only a statistical trend, so a variety of free-living prokaryotes with very close purifying selection pressures span nearly the entire range of genome sizes. Conceivably, despite the stability of the DN/DS values at short evolutionary distances (within an ATGC), on a larger evolutionary scale, the effective population sizes of archaea and bacteria (and perhaps, to a lesser extent, the mutation and recombination rates) fluctuate often enough to obscure the expected dependences between selection pressure and genome characteristics.
It also seems possible that the genomes of bacteria and archaea, especially those that inhabit complex and variable environments, are under selective pressure to maintain the minimal metabolic and regulatory complexity that is required to survive in these habitats, so that the evolutionary trajectory depends on the balance between this requirement and the drive for streamlining. Perhaps, nearly “pure” streamlining can be observed only in organisms that live in relatively simple and stable environments and reach extremely high effective population sizes, as suggested, for instance, by the genome analysis of the most common and abundant marine bacterium, Pelagibacter ubique (10).

Notably, although the gene order changes much faster than protein sequences during the evolution of prokaryotes, we observed a strong positive correlation between the “rearrangement distance” and the amino acid distance. Thus, at least some of the events leading to genome rearrangement, such as transposition of individual genes, seem to be subjected to the same type of selective constraints as the evolution of the amino acid sequences of prokaryotic proteins. Remarkably, these findings mimic the observations of the relationship between sequence evolution and genome rearrangement in animals (37).
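
The kind of relationship described above can be checked with a rank correlation between the two distances across genome pairs. The sketch below uses a Spearman coefficient implemented with the Python standard library; the choice of Spearman correlation, the function names, and the distance values are assumptions made for illustration and are not taken from the analysis reported here.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up distances for five genome pairs (illustration only).
amino_acid_distance = [0.01, 0.02, 0.05, 0.08, 0.12]
rearrangement_distance = [0.05, 0.15, 0.30, 0.28, 0.60]

rho = spearman(amino_acid_distance, rearrangement_distance)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based coefficient is used in this sketch because it does not assume a linear relationship between the two distances, only a monotonic one.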

In our opinion, the ATGCs are a promising resource for evolutionary-genomic studies. Of course, the 41 ATGCs currently available constitute an inadequately small data set, considering the revealed complexity of the patterns of prokaryotic evolution. Within several years, the exponentially growing collection of genomes from bacteria and archaea with diverse lifestyles should provide opportunities for a more complete analysis and a more representative and appropriately nuanced characterization of the factors that govern microbial-genome evolution.


The research of P.S.N., Y.I.W., and E.V.K. was supported by the Department of Health and Human Services intramural program (National Library of Medicine, NIH).


Published ahead of print on 31 October 2008.


1. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
2. Andersson, S. G., C. Alsmark, B. Canback, W. Davids, C. Frank, O. Karlberg, L. Klasson, B. Antoine-Legault, A. Mira, and I. Tamas. 2002. Comparative genomics of microbial pathogens and symbionts. Bioinformatics 18(Suppl. 2):S17.
3. Bourque, G., and P. A. Pevzner. 2002. Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res. 12:26-36.
4. Dandekar, T., B. Snel, M. Huynen, and P. Bork. 1998. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23:324-328.
5. Dufresne, A., L. Garczarek, and F. Partensky. 2005. Accelerated evolution associated with genome reduction in a free-living prokaryote. Genome Biol. 6:R14.
6. Edgar, R. C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797.
7. Eisen, J. A., J. F. Heidelberg, O. White, and S. L. Salzberg. 2000. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol. 1:RESEARCH0011.
8. Escartin, F., S. Skouloubris, U. Liebl, and H. Myllykallio. 2008. Flavin-dependent thymidylate synthase X limits chromosomal DNA replication. Proc. Natl. Acad. Sci. USA 105:9948-9952.
9. Felsenstein, J. 1996. Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol. 266:418-427.
10. Giovannoni, S. J., H. J. Tripp, S. Givan, M. Podar, K. L. Vergin, D. Baptista, L. Bibbs, J. Eads, T. H. Richardson, M. Noordewier, M. S. Rappe, J. M. Short, J. C. Carrington, and E. J. Mathur. 2005. Genome streamlining in a cosmopolitan oceanic bacterium. Science 309:1242-1245.
11. Grishin, N. V., Y. I. Wolf, and E. V. Koonin. 2000. From complete genomes to measures of substitution rate variability within and between proteins. Genome Res. 10:991-1000.
12. Hannenhalli, S., C. Chappey, E. V. Koonin, and P. A. Pevzner. 1995. Genome sequence comparison and scenarios for gene rearrangements: a test case. Genomics 30:299-311.
13. Hurst, L. D. 2002. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 18:486.
14. Itoh, T., K. Takemoto, H. Mori, and T. Gojobori. 1999. Evolutionary instability of operon structures disclosed by sequence comparisons of complete microbial genomes. Mol. Biol. Evol. 16:332-346.
15. Jordan, I. K., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Microevolutionary genomics of bacteria. Theor. Popul. Biol. 61:435-447.
16. Koonin, E. V. 2005. Orthologs, paralogs and evolutionary genomics. Annu. Rev. Genet. 39:309-338.
17. Koonin, E. V., and Y. I. Wolf. 23 October 2008. Genomics of Bacteria and Archaea: the emerging generalizations after 13 years. Nucleic Acids Res. [Epub ahead of print.]
18. Li, W. H. 1997. Molecular evolution. Sinauer, Sunderland, MA.
19. Liolios, K., K. Mavromatis, N. Tavernarakis, and N. C. Kyrpides. 2008. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 36:D475-D479.
20. Lynch, M. 2007. The origins of genome architecture. Sinauer Associates, Sunderland, MA.
21. Lynch, M. 2006. Streamlining and simplification of microbial genome architecture. Annu. Rev. Microbiol. 60:327-349.
22. Mamirova, L., K. Popadin, and M. S. Gelfand. 2007. Purifying selection in mitochondria, free-living and obligate intracellular proteobacteria. BMC Evol. Biol. 7:17.
23. Mira, A., L. Klasson, and S. G. Andersson. 2002. Microbial genome evolution: sources of variability. Curr. Opin. Microbiol. 5:506-512.
24. Mushegian, A. R., and E. V. Koonin. 1996. Gene order is not conserved in bacterial evolution. Trends Genet. 12:289-290.
25. Nakabachi, A., A. Yamashita, H. Toh, H. Ishikawa, H. E. Dunbar, N. A. Moran, and M. Hattori. 2006. The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science 314:267.
26. Nakamura, Y., Y. Nishio, K. Ikeo, and T. Gojobori. 2003. The genome stability in Corynebacterium species due to lack of the recombinational repair system. Gene 317:149-155.
27. Novichkov, P. S., I. Ratner, Y. I. Wolf, E. V. Koonin, and I. Dubchak. 9 October 2008. ATGC: a database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes. Nucleic Acids Res. [Epub ahead of print.]
28. Parzen, E. 1962. On estimation of a probability density function and mode. Ann. Math. Stat. 33:1065-1076.
29. Rocha, E. P., J. M. Smith, L. D. Hurst, M. T. Holden, J. E. Cooper, N. H. Smith, and E. J. Feil. 2006. Comparisons of dN/dS are time dependent for closely related bacterial genomes. J. Theor. Biol. 239:226-235.
30. Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin, and D. A. Natale. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinform. 4:41.
31. Tatusov, R. L., E. V. Koonin, and D. J. Lipman. 1997. A genomic perspective on protein families. Science 278:631-637.
32. Tatusov, R. L., A. R. Mushegian, P. Bork, N. P. Brown, W. S. Hayes, M. Borodovsky, K. E. Rudd, and E. V. Koonin. 1996. Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli. Curr. Biol. 6:279-291.
33. Team, R. D. C. 2008. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
34. Tillier, E. R., and R. A. Collins. 2000. Genome rearrangement by replication-directed translocation. Nat. Genet. 26:195-197.
35. Wolf, Y. I., I. B. Rogozin, A. S. Kondrashov, and E. V. Koonin. 2001. Genome alignment, evolution of prokaryotic genome organization and prediction of gene function using genomic context. Genome Res. 11:356-372.
36. Yang, Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24:1586-1591.
37. Zdobnov, E. M., and P. Bork. 2007. Quantification of insect genome divergence. Trends Genet. 23:16-20.

Articles from Journal of Bacteriology are provided here courtesy of American Society for Microbiology (ASM)