Genome Res. Feb 1, 2003; 13(2): 145–158.
PMCID: PMC420360

Evolutionary Implications of Microbial Genome Tetranucleotide Frequency Biases

Abstract


We compared nucleotide usage pattern conservation for related prokaryotes by examining the representation of DNA tetranucleotide combinations in 27 representative microbial genomes. For each of the organisms studied, tetranucleotide usage departures from expectations (TUD) were shared between related organisms using both Markov chain analysis and a zero-order Markov method. Individual strains, multiple chromosomes, plasmids, and bacteriophages share TUDs within a species. TUDs varied between coding and noncoding DNA. Grouping prokaryotes based on TUD profiles resulted in relationships with important differences from those based on 16S rRNA phylogenies, which may reflect unequal rates of evolution of nucleotide usage patterns following divergence of particular organisms from a common ancestor. By both symmetrical tree distance and likelihood analysis, phylogenetic trees based on TUD profiles demonstrate a level of congruence with 16S rRNA trees similar to that of both RpoA and RecA trees. Congruence of these trees indicates that there exists phylogenetic signal in TUD patterns, most prominent in coding region DNA. Because relationships demonstrated in TUD-based analyses utilize whole genomes, they should be considered complementary to phylogenies based on single genetic elements, such as 16S rRNA.

Biases in nucleotide composition and organization in prokaryotic genomes have long been recognized (Muto and Osawa 1987), with the representation of short oligonucleotide combinations as a focus of analysis (Henaut et al. 1996; Gelfand and Koonin 1997; Rocha et al. 1998). Dinucleotide frequencies within organisms represent genomic signatures, which may result from selective pressures arising from dinucleotide stacking, DNA conformational tendencies, DNA replication and repair mechanisms, or selection by restriction endonucleases (Karlin et al. 1998). Codon usage also may influence nucleotide usage because it affects translational efficiency (Grantham et al. 1981; Grosjean and Fiers 1982; Sharp et al. 1993). However, constraints beyond dinucleotide frequencies and codon usage preferences can be identified only through analysis of longer oligonucleotide words (Pride and Blaser 2002). Methods available for determining the significance of oligonucleotide word frequencies include Markov chain analysis (Cardon and Karlin 1994; Schbath et al. 1995), which evaluates word frequencies after removing biases in their constituent oligonucleotides; however, the evolutionary significance of oligonucleotide word frequencies in prokaryotes has not been fully addressed.

Evolutionary inferences based on gene sequences, such as 16S rRNA (Woese and Fox 1977; Woese et al. 1990), are considered reliable indicators of prokaryotic ancestry; however, because evolutionary constraints are multidimensional (Koonin et al. 2000), analysis of a single gene is insufficient to fully understand the divergence between related life forms. The universally conserved 16S rRNA, with conservative rates of nucleotide substitution, is generally accepted as the standard for assessing microbial evolution; however, phylogenies based on other gene loci often are not congruent with it (Doolittle 1999). Such incongruencies often result from horizontal gene transfer, which obscures evidence of recent common ancestry (Holmes et al. 1999). With an increasing number of complete genomic sequences available, it now can be determined whether the relationships revealed from phylogenies based on 16S rRNA are reflected in the nucleotide usage patterns of individual organisms. Analysis of complete genomes can identify the extent to which nucleotide usage has evolved after divergence from recent common ancestors and can provide insight into selective pressures on usage that are neither addressed by 16S rRNA sequences nor fully revealed in codon usage preference analyses.

Because analysis of tetranucleotide frequencies provides insights beyond those inferred from analysis of codon usage biases, we sought to develop an analytical method to examine their conservation across and between prokaryotic genomes. Our goals were to compare alternative models for determining tetranucleotide frequency divergences to understand the extent to which tetranucleotide usage is shared for multiple genomes and their plasmids and bacteriophages, and to determine whether tetranucleotide usage divergences exhibit phylogenetic signal compared with phylogenies based on 16S rRNA.

RESULTS


Representation of Tetranucleotide Combinations in Microbial Genomes

For the studied microbial genomes, we analyzed the tetranucleotide usage deviations from expectations (TUD) to determine whether the patterns of deviation are similar between closely related organisms. In a compromise between maximal information retrieval and minimal oligonucleotide length, tetranucleotides were selected for analysis because they offer sufficient data points and provide data on nucleotide usage biases not inferred from codon usage analysis. We compared a zero-order Markov method, which measures the deviation in usage of each tetranucleotide from that expected under a random mononucleotide distribution (Almagor 1983), and a Markov chain method (Cardon and Karlin 1994; Schbath et al. 1995), which measures the frequency divergence of tetranucleotides by removing the biases in their shorter oligonucleotide components. Although the TUD profile is unique for each microbial genome studied, the profiles of closely related organisms are similar (Fig. 1). As expected, the TUD profiles for the two sequenced Helicobacter pylori strains are virtually superimposable (Fig. 1A). In other species (Neisseria meningitidis, Escherichia coli, Chlamydia pneumoniae, and Mycobacterium tuberculosis) for which two or more genomic sequences were analyzed, the tetranucleotides with the most extreme divergence and the extent of divergence were nearly identical for each member, indicating the existence of species-specific patterns (data not shown). Although H. pylori and Campylobacter jejuni differ in G + C content by 8.6% (Table 1), their TUD profiles are similar (Fig. 1A), including many of the most highly over- and underrepresented tetranucleotides, consistent with their close evolutionary relationship (Parkhill et al. 2000). As G + C content deviates from 50%, nucleotide usage is predicted to become less random (Muto and Osawa 1987; Sueoka 1988); however, even among organisms with G + C content near 50% (e.g., E. coli), patterns of tetranucleotide usage deviate substantially from expectation (Fig. 1A). Of the organisms studied, the number of tetranucleotides with F(W) > |21.5| is highest for Methanococcus jannaschii (34 tetranucleotides), followed by H. pylori (21), N. meningitidis (19), C. jejuni (12), and Deinococcus radiodurans (12). These organisms had the broadest range in tetranucleotide usage deviation using the zero-order Markov method. Similarity of profiles in related species is most clearly demonstrated by E. coli and Salmonella typhi, whereas M. tuberculosis and Mycobacterium leprae differ in profile to a greater degree (Fig. 1A). For both D. radiodurans and Vibrio cholerae, each of their two chromosomes had similar TUD profiles (data not shown).

Figure 1.
Frequency distribution of DNA tetranucleotide usage profiles of selected prokaryotes. The observed/expected tetranucleotide frequency divergence (F(W)) was determined for the 256 tetranucleotide combinations for each genome, using both Markov chain and ...
Table 1.
Bacterial Chromosomal and Plasmid Genomes Examined in This Study.

The zero-order Markov method yields a wider profile base with greater interspecies distinction than does the Markov chain method (Fig. 1). Although E. coli and M. tuberculosis have similar profiles in Markov chain analysis (Fig. 1B), they have unique profiles by zero-order Markov analysis (Fig. 1A). Thus, because the zero-order Markov method removes only the biases resulting from the frequencies of mononucleotides, the TUD calculated this way incorporates the frequency biases of all the component oligonucleotides, yielding distinct species-specific profiles.
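The zero-order calculation can be illustrated with a minimal Python sketch. It assumes a single strand read with overlapping windows and uses the expectation given in Methods (the product of the base frequencies in the word, times the sequence length); the function name `zero_order_tud` is hypothetical and is not taken from the Swaap software.

```python
from collections import Counter
from itertools import product


def zero_order_tud(seq, k=4):
    """Observed/expected ratio F(W) for every k-mer under a zero-order Markov model."""
    seq = seq.upper()
    n = len(seq)
    # Mononucleotide frequencies are the only bias removed by this model.
    base_freq = {base: seq.count(base) / n for base in "ACGT"}
    # Observed counts from overlapping windows on the given strand.
    observed = Counter(seq[i:i + k] for i in range(n - k + 1))
    profile = {}
    for word in map("".join, product("ACGT", repeat=k)):
        # E(W) = A^a * C^c * G^g * T^t * N, built up one base at a time.
        expected = n
        for base in word:
            expected *= base_freq[base]
        profile[word] = observed[word] / expected if expected > 0 else float("nan")
    return profile
```

Because every dinucleotide and trinucleotide bias is left in place, a strongly periodic sequence produces very large F(W) values under this model, which is the behavior the paragraph above describes.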

Interchromosomal Tetranucleotide Comparisons

Pairwise genomic comparisons of TUD profiles within and between species illustrate that related organisms share common patterns (Fig. 2). Previous studies indicate that TUD patterns are highly conserved across prokaryotic genomes, with the exception of horizontally acquired genetic elements (Pride and Blaser 2002). Many of these elements, such as the cag island in H. pylori and the integron island in V. cholerae, have TUD patterns more similar to their host genomes than to other closely related organisms despite their horizontal acquisition (Table 2), and therefore were not excluded from the analysis. The two H. pylori strains have nearly identical profiles of tetranucleotide divergences (R² > 0.99; Fig. 2A, B). These relationships are not based on G + C content, as randomly generated sequences designed with H. pylori G + C content show no correlation to either strain (R² < 0.01) in TUD profiles. As expected from their evolutionary proximity (Parkhill et al. 2000), H. pylori and C. jejuni (Fig. 2C, D) have considerably more similarity in their TUD profiles than do H. pylori and H. influenzae (Fig. 2E, F), which have nearly identical G + C compositions (Table 1). The zero-order Markov method yields higher correlation in TUD profiles between H. pylori and C. jejuni or H. influenzae than does the Markov chain method, indicating that oligonucleotide (<4 nt) components contribute substantially to the similarity between species. Distantly related H. pylori and M. tuberculosis show no correlation (R² < 0.03) in TUD patterns (data not shown; Appendix Table 1). Two Pyrococcus species show strong similarities to one another, whereas Bacillus subtilis and Bacillus halodurans are less similar (data not shown; Appendix Table 1). E. coli strains K12 and O157:H7 have nearly identical TUD (R² > 0.99), despite the presence of 1387 additional open reading frames (ORFs) in O157:H7, a difference believed to result from horizontal gene transfer (Perna et al. 2001). For D. radiodurans, which possesses two chromosomes, the TUD of each is nearly identical; a similar phenomenon was found for the two-chromosome V. cholerae as well (data not shown; Appendix Table 1).
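The comparisons above rest on linear regression between two 256-value profiles, plus a random-sequence control matched for G + C content. The sketch below shows one way to compute both in plain Python. The helper names `r_squared` and `random_sequence` are hypothetical, and the control assumes independent bases with A = T and G = C, which is an assumption about how such sequences might be generated rather than a description of the authors' procedure.

```python
import random


def r_squared(xs, ys):
    """Coefficient of determination for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # no variance: correlation is undefined, reported as 0 here
    return (sxy * sxy) / (sxx * syy)


def random_sequence(length, gc_content, seed=0):
    """Random control sequence with independent bases and a target G+C fraction."""
    rng = random.Random(seed)
    at_weight = (1 - gc_content) / 2
    gc_weight = gc_content / 2
    weights = [at_weight, gc_weight, gc_weight, at_weight]  # order: A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))
```

To compare two genomes, the F(W) values of both profiles would be listed in the same word order (for example, sorted alphabetically by tetranucleotide) before being passed to `r_squared`.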

Figure 2.
Linear regression analysis of DNA tetranucleotide usage profiles among selected genomes. F(W) was determined for each of the 256 tetranucleotide combinations for each genome as described in Methods section, and the profiles compared by linear regression ...
Table 2.
Comparison of Tetranucleotide Usage Deviation in Species-Specific Bacteriophages, Plasmids, and Horizontally Acquired Genetic Elements and Their Host Strains and Controls.

Analysis of Plasmids, Species-Specific Phages, and Horizontally Acquired Genetic Elements

To determine whether organism-specific TUD patterns extend to horizontally acquired genetic elements, D. radiodurans was studied; its megaplasmid (177 kb) has similar patterns to the two chromosomes, but for its large (45 kb) plasmid, relationships are less close (Table 2). TUD profiles of pO157 found in E. coli O157:H7 are most similar to those of its host strain, less similar to E. coli strain K12 and to S. typhi, and dissimilar to the more distant H. influenzae. Similarly, Yersinia pestis plasmid pCD1 has TUD patterns highly similar to its host's chromosome, with less related bacteria progressively less similar. In general, smaller plasmids (<25 kb) share less similarity in TUD patterns to their host's genome than do larger plasmids (data not shown), consistent with their greater host range. Species-specific bacteriophages showed similar TUD patterns with their hosts (Table 2), which may hinder their ability to infect distantly related species. Whereas two Enterobacteriaceae-specific phages studied show parallel similarities to Enterobacteriaceae TUD patterns, larger differences are seen for two Mycobacterium-specific phages. Both the H. pylori cag island (Tomb et al. 1997) and the V. cholerae integron island (Heidelberg et al. 2000) have TUD patterns more similar to their host genomes than to other organisms studied (Table 2).

Intragenomic Comparisons of Tetranucleotide Usage

Although patterns of dinucleotide divergences in coding and noncoding DNA are essentially identical (Burge et al. 1992), our analysis of tetranucleotide usage deviations indicates that there are substantial differences in some prokaryotes (Table 3; Fig. 3). For H. pylori, although coding and noncoding DNA TUD profiles are strongly correlated (Fig. 3A, B), the most overrepresented tetranucleotides in coding and noncoding DNA differ (Table 3). Homopolymers CCCC and GGGG show substantial differences in representation between coding and noncoding DNA. That the most underrepresented tetranucleotides (GTAC, ACGT, and TCGA) in H. pylori are shared by both coding and noncoding DNA indicates that factors beyond codon usage biases, such as restriction-endonuclease cognate sequence avoidance (Pride and Blaser 2002), influence their distribution (Table 3). For C. jejuni, the differences in TUD profiles between coding and noncoding DNA are greater than those for H. pylori (Table 3; Fig. 3C, D). B. subtilis has TUD profile differences between coding and noncoding DNA intermediate to those for H. pylori and C. jejuni (Table 3; Fig. 3E, F). Therefore, analysis of TUD profiles reveals greater differences between coding and noncoding DNA than would be predicted by analysis of dinucleotides.

Figure 3.
Frequency distribution (A, C, E) and linear regression (B, D, F) of DNA tetranucleotide usage deviation profiles of selected prokaryotes. For each genome, the observed/expected tetranucleotide usage deviation (F(W)) was determined for the 256 combinations ...
Table 3.
Extremes of Tetranucleotide Usage Deviation in Coding and Noncoding DNA of Three Prokaryotic Genomes

Clustering of Organisms Based on Tetranucleotide Usage

Because TUD profiles appeared most similar between related organisms (Figs. 1, 2), we next sought to determine whether groupings based on such profiles resemble phylogenetic groupings based on 16S rRNA for 27 representative organisms. In the phylogram based on 16S rRNA, most Gram-negative organisms cluster together, with the archaea distant from the eubacteria, the thermophilic bacteria most proximate to the archaea, and the Chlamydia species and the Gram-positive organisms most proximate to the thermophilic bacteria (Fig. 4A). Because the zero-order Markov method yields distinct species-specific TUD profiles, we grouped organisms based on these profiles. The TUD profile-based phylogeny (Fig. 4B) shows different relationships from those based on 16S rRNA, including that: (1) Campylobacter, Helicobacter, and Rickettsia are more distant from the other Gram-negative organisms; (2) the relative distance between the archaea and the bacteria is decreased; (3) the Pyrococcus species are more distantly related to one another; (4) B. halodurans and B. subtilis are more distantly related to each other; (5) the relative distance between M. tuberculosis and M. leprae is increased; (6) the relative distances between the Mycoplasma species are increased; and (7) the relative distances between the two N. meningitidis strains are increased. Groupings based on penta- and hexanucleotide usage deviations are essentially identical to those based on tetranucleotides (data not shown). Thus, although the phylogenies produced have broad similarities, important differences are uncovered.

Figure 4.
Phylograms of 27 selected organisms for which genomic sequences are available. (A) 16S rRNA sequences were subjected to neighbor-joining analysis using HKY85 distance matrices. (B) The same organisms were grouped by using distance matrices based on the ...

Analysis of Congruence Among 16S and Tetranucleotide Trees

The similarities between phylogenies created based on 16S rRNA and those created based on TUD profiles indicate that the latter contain phylogenetic signal. To determine the extent of the phylogenetic signal in TUD-based trees in comparison to 16S rRNA trees, topological differences between each were analyzed by symmetrical tree distances, which measure the number of clusters present exclusively in either tree (Penny and Hendy 1985). Of 100 trees based on 16S rRNA sequences, an average of nine clusters differ between each tree (green), while an average of 19 clusters differ between each TUD (red) tree (Fig. 5A). Comparisons of 16S rRNA vs. TUD trees show that an average of 27 clusters differ (blue), while neither 16S rRNA nor TUD trees has clusters in common with 100 random trees (Fig. 5A, black). Trees based on 16S rRNA and RpoA differ by an average of 23 clusters (Fig. 5B, blue), which indicates that the conservation of clusters for RpoA is similar to that for TUD. TUD trees based on coding DNA are similar to those for whole genomes, and have more clusters in common with 16S rRNA trees than those based on noncoding DNA (Fig. 5C, D), indicating that in prokaryotes, most of the phylogenetic signal exists in the coding regions. Importantly, 16S rRNA and TUD trees based on the Markov chain method differ by an average of 37 clusters (data not shown), demonstrating that phylogenetic signal is more conserved using the zero-order Markov method.
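The symmetrical tree distance counts the clusters found in one tree but not the other. A minimal sketch follows, under the assumption that trees are rooted and written as nested Python tuples of leaf names; tree-comparison software often works on unrooted bipartitions instead, so counts from this toy version need not match those of any particular program. The function names are hypothetical.

```python
def clusters(tree):
    """Return the set of non-trivial leaf clusters in a rooted nested-tuple tree."""
    found = set()

    def leaves(node):
        # A leaf is anything that is not a tuple (for example, a species name).
        if not isinstance(node, tuple):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root cluster is shared by every tree
    return found


def symmetric_distance(tree_a, tree_b):
    """Number of clusters present in exactly one of the two trees."""
    return len(clusters(tree_a) ^ clusters(tree_b))
```

For example, `((("A", "B"), "C"), ("D", "E"))` and `((("A", "C"), "B"), ("D", "E"))` share the clusters {A, B, C} and {D, E} but differ in {A, B} versus {A, C}, giving a distance of 2.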

Figure 5.
Tree distance analysis of phylogenies of 27 prokaryotes. One hundred phylogenies were created using bootstrapping techniques for these organisms based on 16S rRNA or RpoA sequences, or tetranucleotide usage deviation (TUD). Tree distances were determined ...

Formal analysis of congruence between trees based on 16S rRNA and TUD was performed using likelihood analysis (Feil et al. 2001), a statistical test for comparison of tree topologies. The results are generally similar to those of the symmetrical tree distance analysis, with trees based on RpoA, RecA, GroE, and TUD revealing a high degree of similarity to 16S rRNA in topology (Fig. 6). In all cases, the differences in likelihoods (Δ-ln L) fall well outside those of 200 random trees (the 99th percentile of the random distribution), indicating a high degree of congruence among the trees. Trees for prokaryotic coding DNA TUD demonstrate more congruence with 16S rRNA trees than those of GroE, whole-genome TUD, and noncoding DNA TUD, and demonstrate a level of similarity to 16S rRNA trees parallel to that of RpoA and RecA trees. That coding DNA TUD trees are more congruent with 16S rRNA trees than noncoding DNA and whole-genome TUD trees confirms that the phylogenetic signal exists largely in the coding DNA. TUD phylogenies based on Markov chain analysis (Fig. 6G) and phylogenies based on whole-genome dinucleotide usage patterns (Fig. 6H), while demonstrating topological similarities to 16S rRNA, are far less congruent with 16S rRNA than the other trees analyzed.

Figure 6.
Likelihood analysis of phylogenetic congruence in prokaryotes. The phylogeny based on 16S rRNA is compared with phylogenies based on RpoA, GroE, RecA, whole-genome dinucleotide usage deviation, whole-genome tetranucleotide usage deviation (TUD), coding ...

DISCUSSION


We analyzed prokaryotic genome TUD to determine whether common patterns are shared by related organisms. The Markov chain model, which determines the expected frequency of a word by removing biases in its oligonucleotide components to find statistically meaningful deviations in word frequencies (Rocha et al. 1998), is the most common method for analysis of oligonucleotide word frequencies. However, by removing oligonucleotide component biases, cross-species comparisons become increasingly difficult, as these biases apparently contribute to the development of organism-specific nucleotide usage patterns. An alternative method, using zero-order Markov criteria (Almagor 1983), is based on comparing tetranucleotide frequencies across genomes by correcting for unequal base frequencies. Although there is no statistically meaningful way to compare differences observed using zero-order Markov and Markov chain criteria, the TUD developed by zero-order Markov analysis shows stronger relationships between like genomes (Figs. 1, 2).

Our data demonstrate that TUD patterns are well conserved for both intra- and interspecies comparisons, and that similarity in these patterns is not based on G + C content. That the different chromosomes of D. radiodurans and V. cholerae demonstrate substantial TUD conservation, and that different H. pylori, E. coli, N. meningitidis, and C. pneumoniae strains share essentially identical TUD patterns, indicates their species specificity. That the closely related C. jejuni and H. pylori, differing in G + C content by 8.6%, demonstrate significant correlation in TUD patterns, while the less closely related H. pylori and H. influenzae, which differ in G + C content only by 1%, have lower correlation, suggests that nucleotide usage patterns are relatively conserved despite evolution of G + C composition. The conservation in TUD patterns also extends to horizontally acquired genetic elements, plasmids, and bacteriophages with substantial correlation to their host organisms (Table 2). These findings further substantiate that there are organism-specific TUD patterns transmitted to horizontally acquired genetic elements, likely through the process of amelioration (Lawrence and Ochman 1997; Pride and Blaser 2002).

Phylogenetic reproduction based on prokaryotic nucleotide frequency divergences is not a novel concept, and is generally not believed to be as robust as standard phylogenetic methods based on 16S rRNA (Cardon and Karlin 1994; Leung et al. 1996). Our TUD-based analysis produces phylogenies similar to those based on 16S rRNA sequences, with several important differences. One explanation for these differences is that the 16S rRNA and TUD-based phylogenies result from unequal evolutionary rates after divergence of the studied organisms from common ancestors. For example, in contrast to 16S rRNA analysis, Gram-negative organisms E. coli, S. typhi, Y. pestis, H. influenzae, and N. meningitidis do not share a recent common ancestor with H. pylori and C. jejuni on TUD-based phylogenies. One hypothesis to explain the greater degree of difference between the Enterobacteriaceae and the Campylobacter/Helicobacter group is that the nucleotide usage patterns of H. pylori and C. jejuni are evolving more rapidly than their 16S rRNA sequences. In support of this hypothesis, both H. pylori and C. jejuni demonstrate the greatest range in TUD of the organisms studied (Fig. 1A, B), and have substantial extremes of both tetranucleotide under- and overrepresentation. These extremes could result from lack of functional mismatch repair systems (Bhagwat and McClelland 1992) in both organisms (Tomb et al. 1997; Parkhill et al. 2000), or restriction-modification (R-M) induced pressures. R-M systems are believed to exert considerable selective pressures on nucleotide usage because, if restriction is intact but methylation incomplete, organisms avoiding the cognate sequences have a fitness advantage (Gelfand and Koonin 1997). Both H. pylori (Kong et al. 2000) and C. jejuni contain substantial numbers of R-M systems. The substantial underrepresentation of tetranucleotides ACGT, GTAC, and TCGA (Table 3), each the recognition sequence for known H. pylori R-M systems (Xu et al. 2000; V. Butkus, unpubl.), further suggests a role for these systems in shaping TUD patterns (Pride and Blaser 2002). That these tetranucleotides are underrepresented to similar extents in both coding and noncoding DNA (Table 3) supports this hypothesis, as R-M systems exert genome-wide pressures on nucleotide usage patterns, further demonstrating that the underrepresentation cannot be attributed to codon usage biases. Alternatively, natural competence and its control also could affect nucleotide usage patterns, as naturally competent organisms (e.g., M. jannaschii, H. pylori, N. meningitidis, C. jejuni, and D. radiodurans) containing the largest numbers of R-M systems (Kong et al. 2000; Lin et al. 2001) possess the highest proportion of highly divergent tetranucleotides.

By analysis of congruence between phylogenetic trees (Feil et al. 2001) based on TUD profiles and on 16S rRNA, we demonstrate that there is phylogenetic signal in the whole-genome TUD patterns of prokaryotes, and that the signal is most prominent in coding DNA (Fig. 6). Phylogenetic trees for RpoA, RecA, and coding DNA TUD exhibit essentially identical levels of congruence with 16S rRNA phylogenies, and slightly higher levels of congruence than GroE and whole-genome TUDs. The lack of complete congruence among phylogenies based on housekeeping genes (such as RpoA, RecA, GroE) and 16S rRNA is usually attributed to frequent recombinational events (Holmes et al. 1999; Eisen 2000b), obscuring evidence of phylogenetic signal. Because nucleotide usage patterns in coding DNA are responsible for most of the phylogenetic signal, it is possible that recombination on a whole-genome level is reflected in the frequency of RecA and RpoA recombinational events, and that phylogenetic incongruencies between 16S rRNA and TUD trees may reflect differential levels of horizontal transfer events in certain prokaryotes. Trees for noncoding DNA TUDs and whole-genome TUDs based on Markov chain analysis are significantly correlated with 16S rRNA trees, but show much less congruence than trees based on housekeeping genes or zero-order Markov coding DNA TUDs, which indicates that little phylogenetic signal is conserved in noncoding or Markov chain TUD patterns (Fig. 6). That trees based on TUD also are substantially more congruent with 16S rRNA trees than those based on dinucleotide or codon usage frequencies (Fig. 6, and data not shown; Appendix Fig. 1) suggests that through analysis of longer oligonucleotide words, biases will be uncovered that contribute to phylogenetic signal. That TUD patterns have greater phylogenetic signal than codon frequencies supports the hypothesis that nucleotide organizational biases beyond those of codon usage are the basis for these results. Although previous studies indicate that there is considerable distance between the Mycoplasma species based on dinucleotide usage patterns (Karlin et al. 1997), in the TUD trees the Mycoplasma species cluster together, but with greater divergence than those based on 16S rRNA.

Although phylogenetic analysis of 16S rRNA provides the most widely accepted methodology for grouping organisms (Woese et al. 1990; Olsen et al. 1994; Pace 1997; Doolittle 1999), analysis of TUD patterns in microbial genomes provides a tool for examination of related organisms after their evolutionary divergence. We hypothesize that the differences indicate that organisms evolve nucleotide usage patterns more rapidly than 16S rRNA after diverging from their recent common ancestors, as is likely the explanation for the Mycoplasma clustering and for the Bacillus species. Thus, TUD analysis allows alternative insights into the selective forces governing microbial evolution, especially as a result of elements that might affect genomic structure, such as natural competence, lack of functional mismatch repair systems, and R-M systems. The benefits of the method are that it is easily reproducible, requires no foreknowledge of coding and noncoding sequences, requires no nucleotide or amino-acid alignments, and contains phylogenetic signal rivaling that of housekeeping genes. The drawbacks of the method include that it likely is subject to convergent evolution, in which external forces induce changes in genomic nucleotide usage patterns, giving unrelated organisms the appearance of recent common ancestry. This phenomenon of homoplasy also substantially influences phylogeny based on single genes, and is thus not unique to TUD analysis (Maynard Smith and Smith 1998). Another similar drawback is that the method may be subject to global influences (e.g., restriction endonucleases) that affect genomic structure, increasing the apparent distance between related organisms. These global forces should not be ignored, but may not be uniform for all organisms, probably affecting ancestral reproduction. The method also is influenced by horizontal transfer events. For organisms in which the proportion of horizontal transfer is large, such as Thermotoga maritima (Nelson et al. 1999), the phylogenetic position on TUD trees may be affected. This is offset at least partially by the phenomenon of amelioration (Table 2), which dampens the effect of horizontal transfer events (Pride and Blaser 2002). For phylogenetic studies, use of TUD and other such whole genomic analyses (Sankoff et al. 1992; Fitz-Gibbon and House 1999; Snel et al. 1999; Eisen 2000a) should be considered complementary to analyses based on single gene products, such as 16S rRNA.

METHODS


Microbial Genomes, Phages, and Plasmids

Complete genome sequences of the bacteria, archaea, bacteriophages, and plasmids (all > 25 kb) studied were obtained from GenBank (ftp://ncbi.nlm.nih.gov/genbank/genomes/bacteria/, http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/phg.html, and http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/eub_p.html, respectively) (Tables 1 and 3). Coding regions of prokaryotic genomes were identified based on GenBank annotation using Swaap PH 1.0 (Pride, D.T. 2001. Swaap PH 1.0: A tool for analyzing nucleotide usage patterns in coding and noncoding portions of microbial genomes. Distributed by the author, Department of Microbiology and Immunology, Vanderbilt University, Nashville, Tennessee, available at http://www.bacteriamuseum.org/SWAAP/SwaapPage.htm), and noncoding regions were classified as all other DNA sequences.
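The partition into coding and noncoding DNA can be sketched as follows, assuming annotated coding regions are available as 0-based, half-open (start, end) intervals. The function `split_coding_noncoding` is a hypothetical illustration, ignores strand, and is not a description of how Swaap PH handles annotation.

```python
def split_coding_noncoding(seq, coding_intervals):
    """Split seq into coding and noncoding segments.

    coding_intervals: iterable of (start, end) pairs, 0-based and half-open.
    Overlapping or adjacent intervals are merged first; everything outside
    the merged intervals is treated as noncoding.
    """
    merged = []
    for start, end in sorted(coding_intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    coding, noncoding, cursor = [], [], 0
    for start, end in merged:
        if start > cursor:
            noncoding.append(seq[cursor:start])
        coding.append(seq[start:end])
        cursor = end
    if cursor < len(seq):
        noncoding.append(seq[cursor:])
    return coding, noncoding
```

Segments are returned as separate strings so that tetranucleotides are not counted across artificial junctions between regions that are not adjacent in the genome.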

Analysis of Representation of Nucleotide Combinations

To determine the tetranucleotide usage departures from expectations among prokaryotic genomes, two different Markov methods were used. The zero-order Markov method (Almagor 1983) is designed to determine the expected number of tetranucleotides by removing biases in mononucleotide frequencies. The expected number of tetranucleotides is determined by the equation $E(W) = (A^{a} \cdot C^{c} \cdot G^{g} \cdot T^{t}) \cdot N$, where A, C, G, and T represent the frequencies of nucleotides A, C, G, and T within the window being evaluated, respectively; a, c, g, and t represent the number of nucleotides A, C, G, and T in each tetranucleotide, respectively; and N represents the length of the window being evaluated. The frequency divergence of the word, F(W), is expressed as the ratio of the observed count O(W) to the expected count E(W). Markov chain analysis (Cardon and Karlin 1994; Schbath et al. 1995) determines the expected frequency of oligonucleotide words by removing biases in their oligonucleotide components. Briefly, as described (Rocha et al. 1998), $W = (w_1 w_2 \ldots w_m)$ denotes the word formed by the concatenation of m nucleotides, and N(W) is its observed count in a sequence of length n. The expected count E(W) of W is:

equation M1
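As a minimal sketch of a Markov chain expectation, the Python below assumes the maximal-order estimator that is commonly used in the cited literature, in which the expected count of W is N(w1…wm−1) × N(w2…wm) / N(w2…wm−1). That this is the exact form intended here is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def count_words(seq, k):
    """Counts of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def markov_chain_tud(seq, k=4):
    """F(W) = N(W) / E(W) for each observed k-mer, maximal-order Markov model (assumed)."""
    seq = seq.upper()
    n_full = count_words(seq, k)       # N(w1..wm)
    n_sub = count_words(seq, k - 1)    # counts of the two (m-1)-mers inside W
    n_core = count_words(seq, k - 2)   # count of the shared (m-2)-mer core
    profile = {}
    for word, observed in n_full.items():
        # E(W) = N(w1..w_{m-1}) * N(w2..w_m) / N(w2..w_{m-1})
        expected = n_sub[word[:-1]] * n_sub[word[1:]] / n_core[word[1:-1]]
        profile[word] = observed / expected
    return profile
```

Only words that actually occur are scored, so the core count in the denominator is never zero. Because the trinucleotide and dinucleotide counts absorb most of the composition bias, F(W) under this model stays close to 1 unless a tetranucleotide is unusual beyond what its components predict.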

For each genome analyzed, comparisons of F(W) for each tetranucleotide combination and for the reverse complement of each combination by linear regression analysis yielded R² values = 0.99; therefore, analyses concentrated only on the documented clockwise strand F(W) values. The profile of TUD for all tetranucleotides was determined for each organism studied (Table 1) using Swaap 1.0.0 (Pride, D.T. 2001. Swaap 1.0.0: A tool for analyzing substitutions and similarity in multiple alignments. Distributed by the author, Department of Microbiology and Immunology, Vanderbilt University in Nashville, Tennessee, available at http://www.bacteriamuseum.org/SWAAP/SwaapPage.htm), and their relative intra- and intergenomic abundance compared by linear regression analysis using Microsoft Excel 2000 (Microsoft Corp., Inc.).
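The strand comparison pairs each word with its reverse complement before regression. A minimal sketch of that pairing is shown below; the helper names are hypothetical, and `strand_pairs` assumes the profile contains every word together with its reverse complement.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of an uppercase DNA word."""
    return word.translate(_COMPLEMENT)[::-1]


def strand_pairs(profile):
    """Pair F(W) with F(reverse complement of W) for every word in the profile."""
    return [(profile[w], profile[reverse_complement(w)]) for w in sorted(profile)]


def palindromic_words(k=4):
    """Words that equal their own reverse complement (same on both strands)."""
    words = map("".join, product("ACGT", repeat=k))
    return [w for w in words if w == reverse_complement(w)]
```

Sixteen of the 256 tetranucleotides are their own reverse complement, including GTAC, ACGT, and TCGA, so for those words the two strands necessarily give the same count.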

Cluster Analysis of Prokaryotes

Distances based on tetranucleotide frequency divergences were determined as $D_t = \frac{1}{4^N} \sum_W |F_1(W) - F_2(W)|$, where N equals the length of the nucleotide word, and F1(W) and F2(W) represent F(W) for each of the 256 tetranucleotides for organisms 1 and 2 (analogous to computations derived by Cardon and Karlin [1994]). Bootstrapping was performed by sampling with replacement of each of the 256 tetranucleotide frequencies using Swaap PH 1.0 (Pride, D.T. 2001. Swaap PH 1.0: A tool for analyzing nucleotide usage patterns in coding and noncoding portions of microbial genomes. Distributed by the author, Department of Microbiology and Immunology, Vanderbilt University in Nashville, Tennessee, available at http://www.bacteriamuseum.org/SWAAP/SwaapPage.htm), and phylograms were created based on distance matrices using Phylip 3.5 (Felsenstein 1989) and displayed using Treeview (Page 1996). 16S rRNA sequences were obtained from the Ribosomal Database Project II (Maiden et al. 2001), and phylograms were created using HKY85 distances with Phylip 3.5 (Felsenstein 1989). Sequences of RpoA (RNA polymerase subunit A), RecA (recombination protein A), and GroE (HSP60 family chaperonin) were obtained from the COG database (Tatusov et al. 2001), and phylograms were created using mean distances with Phylip 3.5 (Felsenstein 1989).
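
A minimal Python sketch of the distance and bootstrap steps follows. It assumes that the distance is the mean absolute difference in F(W) over all 4^N words, by analogy with the Cardon and Karlin computation, and that each bootstrap replicate resamples the 256 words with replacement. It is not the Swaap PH implementation, and the function names, seed, and replicate count are illustrative.

```python
import itertools
import random


def all_words(length=4):
    """All 4**length DNA words, for example the 256 tetranucleotides."""
    return ["".join(p) for p in itertools.product("ACGT", repeat=length)]


def tud_distance(f1, f2, length=4):
    """Assumed form: D_t = (1 / 4^N) * sum over W of |F1(W) - F2(W)|."""
    words = all_words(length)
    return sum(abs(f1[w] - f2[w]) for w in words) / 4 ** length


def bootstrap_distances(f1, f2, length=4, replicates=100, seed=0):
    """Resample words with replacement and recompute the distance each time."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    words = all_words(length)
    results = []
    for _ in range(replicates):
        sample = rng.choices(words, k=len(words))
        results.append(sum(abs(f1[w] - f2[w]) for w in sample) / len(sample))
    return results
```

Each replicate yields one distance per pair of organisms; assembling these into one distance matrix per replicate gives the input that a tree-building program would need to place bootstrap values on nodes.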

Analysis of Congruence Among Phylogenetic Trees

Analysis of symmetrical distances among phylogenetic trees was performed using the method of Penny and Hendy (1985). Briefly, 100 phylograms were created for 16S rRNA, RecA, GroE, RpoA, or tetranucleotides by bootstrapping, and 100 phylograms with random topology also were created. Each set of phylograms was compared using Paup 4.0b8 (Swofford 1998). Analysis of congruence among the gene phylograms was performed on consensus trees, and 200 trees were created with random topology. A maximum likelihood method, similar to that used by Feil et al. (2001), was used to determine the extent of congruence among phylograms; differences in log likelihood (Δ-ln L) were computed between phylograms based on 16S rRNA and phylograms based on RecA, RpoA, GroE, tetranucleotides, dinucleotides, and random topology. Differences in Δ-ln L for random phylograms can be considered as the null distribution, which would be obtained when there is no more similarity in topology than that expected by chance. If the Δ-ln L values for comparisons among the phylograms fall within the 99th percentile of the null distribution, then the topologies are significantly different, and thus incongruent (Feil et al. 2001).
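
The symmetric-difference idea behind this tree comparison can be sketched in Python. The sketch counts the nontrivial bipartitions (splits) found in one tree but not in the other. It is an illustration, not the Paup procedure. Trees are represented here as nested tuples of taxon names, which is an assumed input format, and both trees must contain the same taxa.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node (nested tuples of strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the nontrivial bipartitions induced by the internal branches.

    Each split is stored as the side that excludes a fixed reference taxon,
    so the result does not depend on where the tree is rooted.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - leaves if reference in leaves else leaves
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return leaves

    visit(tree)
    return found


def symmetric_distance(tree1, tree2):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree1) ^ splits(tree2))
```

Comparing every bootstrap tree for one data set with every tree for another, and doing the same for trees of random topology, gives the two distributions of distances that the congruence analysis contrasts.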

Appendix Figure 1. Phylograms of 27 selected organisms for which genomic sequences are available. Organisms were grouped by using distance matrices based on the sums of the differences from the other organisms for the frequencies of the 64 codons, and phylogenies were created by neighbor-joining analysis. Bootstrap values based on 100 replicates are represented at each node, and branch length index is indicated in each panel.


ftp://ncbi.nlm.nih.gov/genbank/genomes/bacteria/; GenBank Web site which offers bacterial and archaeal genome sequences.

http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/phg.html; GenBank Web site which offers bacteriophage genome sequences.

http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/eub_p.html; GenBank Web site which offers bacterial and archaeal plasmid sequences.

http://www.bacteriamuseum.org/SWAAP/SwaapPage.htm; Web site which offers Swaap 1.0.0 and Swaap PH 1.0.

Appendix. Linear Regression Analysis^a of DNA Tetranucleotide Usage Profiles Among Selected Prokaryotes^b

^a R² values from linear regression analysis displayed. R² values ≥ 0.50 in bold.
^b Markov chain analysis displayed in top half of matrix. Zero-order Markov analysis displayed in bottom half of matrix.
^c Ae, Aquifex aeolicus; Ap, Aeropyrum pernix; Bh, Bacillus halodurans; Bs, Bacillus subtilis; Cj, Campylobacter jejuni; Cp, Chlamydia pneumoniae; Ct, Chlamydia trachomatis; Dr, Deinococcus radiodurans chromosomes 1 and 2; EcK, Escherichia coli K12; EcO, Escherichia coli O157:H7; Hi, Haemophilus influenzae; HpJ, Helicobacter pylori J99; Hp2, Helicobacter pylori 26695; Ll, Lactococcus lactis; Mth, Methanobacterium thermoautotrophicum; Mj, Methanococcus jannaschii; Ml, Mycobacterium leprae; Mtb, Mycobacterium tuberculosis; Mg, Mycoplasma genitalium; Mp, Mycoplasma pneumoniae; Nm, Neisseria meningitidis serotypes A and B; Pa, Pyrococcus abyssi; Ph, Pyrococcus horikoshii; Rp, Rickettsia prowazekii; St, Salmonella typhi; Ss, Synechocystis species; Tm, Thermotoga maritima; Vc, Vibrio cholerae chromosomes 1 and 2; Yp, Yersinia pestis.


Supported in part by the Medical Scientist Training Program, the National Institutes of Health (RO1DK53707, RO1GM63270, and the Cancer Center Core grant CA68485), the UNCF-Merck Science Initiative, the Foundation for Bacteriology, and the Gates Millennium Scholars Program.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.


E-MAIL Prided01@med.nyu.edu; FAX (212) 252–7164.

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.335003. Article published online before print in January 2003.


1. Almagor H. 1983. A Markov analysis of DNA sequences. J. Theor. Biol. 104: 633-645.
2. Bhagwat A.S. and McClelland, M. 1992. DNA mismatch correction by very short patch repair may have altered the abundance of oligonucleotides in the E. coli genome. Nucleic Acids Res. 20: 1663-1668.
3. Burge C., Campbell, A.M., and Karlin, S. 1992. Over- and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. 89: 1358-1362.
4. Cardon L.R. and Karlin, S. 1994. Computational DNA sequence analysis. Annu. Rev. Microbiol. 48: 619-654.
5. Doolittle W.F. 1999. Phylogenetic classification and the universal tree. Science 284: 2124-2128.
6. Eisen J.A. 2000a. Assessing evolutionary relationships among microbes from whole-genome analysis. Curr. Opin. Microbiol. 3: 475-480.
7. Eisen J.A. 2000b. Horizontal gene transfer among microbial genomes: New insights from complete genome analysis. Curr. Opin. Genet. Dev. 10: 606-611.
8. Feil E.J., Holmes, E.C., Bessen, D.E., Chan, M.-S., Day, N.P.J., Enright, M.C., Goldstein, R., Hood, D.W., Kalia, A., Moore, C.E., et al. 2001. Recombination within natural populations of pathogenic bacteria: Short-term empirical estimates and long-term phylogenetic consequences. Proc. Natl. Acad. Sci. 98: 182-187.
9. Felsenstein J. 1989. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5: 164-166.
10. Fitz-Gibbon S.T. and House, C.H. 1999. Whole-genome based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 27: 4218-4222.
11. Gelfand M.S. and Koonin, E.V. 1997. Avoidance of palindromic words in bacterial and archaeal genome: A close connection with restriction enzymes. Nucleic Acids Res. 25: 2430-2439.
12. Grantham R., Gautier, C., Gouy, M., Jacobzone, M., and Mercier, R. 1981. Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res. 9: R43-R74.
13. Grosjean H. and Freirs, W. 1982. Preferential codon usage in prokaryotic genes: The optimal codon-anti-codon interaction energy and the selective codon usage in efficiently expressed genes. Gene 18: 199-209.
14. Heidelberg J.F., Eisen, J.A., Nelson, W.C., Clayton, R.A., Gwinn, M.L., Dodson, R.J., Haft, D.H., Hickey, E.K., Peterson, J.D., Umayam, L., et al. 2000. DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406: 477-483.
15. Henaut A., Rouxel, T., Gleizes, A., Moszer, I., and Danchin, A. 1996. Uneven distribution of GATC motifs in the Escherichia coli chromosome, its plasmids and its phages. J. Mol. Biol. 257: 574-585.
16. Holmes E.C., Urwin, R., and Maiden, M.C.J. 1999. The influence of recombination on the population structure and evolution of the human pathogen Neisseria meningitidis. Mol. Biol. Evol. 16: 741-749.
17. Karlin S., Mrazek, J., and Campbell, A.M. 1997. Compositional biases of bacterial genomes and evolutionary implications. J. Bacteriol. 179: 3899-3913.
18. Karlin S., Campbell, A.M., and Mrazek, J. 1998. Comparative DNA analysis across diverse genomes. Annu. Rev. Genet. 32: 185-225.
19. Koonin E.V., Aravind, L., and Kondrashov, A.S. 2000. The impact of comparative genomics on our understanding of evolution. Cell 101: 573-576.
20. Kong H., Lin, L.-F., Porter, N., Stickel, S., Byrd, D., Posfai, J., and Roberts, R.J. 2000. Functional analysis of putative restriction-modification system genes in the Helicobacter pylori J99 genome. Nucleic Acids Res. 28: 3216-3223.
21. Lawrence J.G. and Ochman, H. 1997. Amelioration of bacterial genomes: Rates of change and exchange. J. Mol. Evol. 44: 383-397.
22. Leung M.Y., Marsh, G.M., and Speed, T.P. 1996. Over- and underrepresentation of short DNA words in herpesvirus genomes. J. Comp. Biol. 3: 345-360.
23. Lin L.-F., Posfai, J., Roberts, R.J., and Kong, H. 2001. Comparative genomics of the restriction-modification systems in Helicobacter pylori. Proc. Natl. Acad. Sci. 98: 2740-2745.
24. Maiden B.L., Cole, J.R., Lilburn, T.G., Parker, C.T., Jr., Saxman, P.R., Farris, R.J., Garrity, G.M., Olsen, G.J., Schmidt, T.M., and Tiedje, J.M. 2001. The RDP-II (ribosomal database project). Nucleic Acids Res. 29: 173-174.
25. Maynard Smith J. and Smith, N.H. 1998. Detecting recombination from gene trees. Mol. Biol. Evol. 15: 590-599.
26. Muto A. and Osawa, S. 1987. The guanine and cytosine content of genomic DNA and bacterial-evolution. Proc. Natl. Acad. Sci. 84: 166-169.
27. Nelson K.E., Clayton, R.A., Gill, S.R., Gwinn, M.L., Dodson, R.J., Haft, D.H., Hickey, E.K., Peterson, J.D., Nelson, W.C., Ketchum, K.A., et al. 1999. Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399: 323-329.
28. Olsen G.J., Woese, C.R., and Overbeek, R. 1994. The winds of (evolutionary) change: Breathing new life into microbiology. J. Bacteriol. 176: 1-6.
29. Pace N.R. 1997. A molecular view of microbial diversity and the biosphere. Science 276: 734-740.
30. Page R.D.M. 1996. TREEVIEW: An application to display phylogenetic trees on personal computers. Comp. Appl. Biosci. 12: 357-358.
31. Parkhill J., Wren, B.W., Mungall, K., Ketley, J.M., Churcher, C., Basham, D., Chillingworth, T., Davies, R.M., Feltwell, T., Holroyd, S., et al. 2000. The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature 403: 665-668.
32. Penny D. and Hendy, M.D. 1985. The use of tree comparison metrics. Systematic Zoology 34: 75-82.
33. Perna N.T., Plunkett, G., III, Burland, V., Mau, B., Glasner, J.D., Rose, D.J., Mayhew, G.F., Evans, P.S., Gregor, J., Kirkpatrick, H.A., et al. 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409: 529-533.
34. Pride D.T. and Blaser, M.J. 2002. Identification of horizontally acquired genetic elements in Helicobacter pylori and other prokaryotes using oligonucleotide difference analysis. Genome Letters 1: 2-15.
35. Rocha E.P.C., Viari, A., and Danchin, A. 1998. Oligonucleotide bias in Bacillus subtilis: General trends and taxonomic comparisons. Nucleic Acids Res. 26: 2971-2980.
36. Sankoff D., Leduc, G., Antoine, N., Paquin, B., Lang, B.F., and Cedergren, R. 1992. Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. 89: 6575-6579.
37. Schbath S., Prum, B., and de Turckheim, E. 1995. Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences. J. Comp. Biol. 2: 417-437.
38. Sharp P.M., Stenico, M., Peden, J.F., and Lloyd, A.T. 1993. Codon usage: Mutational bias, translational selection, or both? Biochem. Soc. Trans. 21: 835-841.
39. Snel B., Bork, P., and Huynen, M.A. 1999. Genome phylogeny based on gene content. Nature Genetics 21: 108-110.
40. Sueoka N. 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. 85: 2653-2657.
41. Swofford D.L. 1998. Paup 4.0b2. Phylogenetic analysis using parsimony (* and other methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
42. Tatusov R.L., Natale, D.A., Garkavtsev, I.V., Shankavaram, U.T., Rao, B.S., Kiryutin, B., Galperin, M.Y., Fedorova, N.D., and Koonin, E.V. 2001. The Cog Database: New developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 29: 22-28.
43. Tomb J.-F., White, O., Kerlavage, A.R., Clayton, R.A., Sutton, G.G., Fleischmann, R.D., Ketchum, K.A., Klenk, H.P., Gill, S., Dougherty, B.A., et al. 1997. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388: 539-547.
44. Woese C.R. and Fox, G.E. 1977. Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proc. Natl. Acad. Sci. 74: 5088-5090.
45. Woese C.R., Kandler, O., and Wheelis, M.L. 1990. Towards a natural system of organisms: Proposal for the domains archaea, bacteria, and eukarya. Proc. Natl. Acad. Sci. 87: 4576-4579.
46. Xu Q., Morgan, R.D., Roberts, R.J., and Blaser, M.J. 2000. Identification of type II restriction and modification systems in Helicobacter pylori reveals their substantial diversity among strains. Proc. Natl. Acad. Sci. 97: 9671-9676.

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press