Copyright © 1999, The National Academy of Sciences

Evolution

Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA

Departments of †Biological Sciences and §Mathematics, Stanford University, Stanford, CA 94305-2125

‡To whom reprint requests should be addressed. E-mail: fa.amc/at/forsythe.stanford.edu.

Contributed by Allan Campbell

Accepted May 26, 1999.

Abstract

Our basic observation is that each genome has a characteristic “signature” defined as the ratios between the observed dinucleotide frequencies and the frequencies expected if neighbors were chosen at random (dinucleotide relative abundances). The remarkable fact is that the signature is relatively constant throughout the genome; i.e., the patterns and levels of dinucleotide relative abundances of every 50-kb segment of the genome are about the same. Comparison of the signatures of different genomes provides a measure of similarity which has the advantage that it looks at all the DNA of an organism and does not depend on the ability to align homologous sequences of specific genes. Genome signature comparisons show that plasmids, both specialized and broad-range, and their hosts have substantially compatible (similar) genome signatures. Mammalian mitochondrial (Mt) genomes are very similar, and animal and fungal Mt are generally moderately similar, but they diverge significantly from plant and protist Mt sets. Moreover, Mt genome signature differences between species parallel the corresponding nuclear genome signature differences, despite large differences between Mt and host nuclear signatures. In signature terms, we find that the archaea are not a coherent clade. For example, Sulfolobus and Halobacterium are extremely divergent. There is no consistent pattern of signature differences among thermophiles.
More generally, grouping prokaryotes by environmental criteria (e.g., habitat propensities, osmolarity tolerance, chemical conditions) reveals no correlations in genome signature.

Extensive data support the proposal that each living organism possesses a genomic signature consisting of dinucleotide relative abundance values calculated from genomic sequences (1–3). Explicitly, the genomic signature profile consists of the array {ρ*XY = f*XY/(f*X f*Y)}, where f*X denotes the frequency of the mononucleotide X and f*XY the frequency of the dinucleotide XY, both computed from the sequence concatenated with its inverted complement. These dinucleotide relative abundance values {ρ*XY} minus 1 (also termed dinucleotide biases) effectively assess differences between the observed dinucleotide frequencies and those expected from random associations of the component mononucleotide frequencies. From data simulations and statistical theory, the estimates ρ*XY ≤ 0.78 or ρ*XY ≥ 1.23 convey significant underrepresentation or overrepresentation, respectively, for 50-kb random DNA contigs (1–3). We present substantial data showing that the genome signatures of bacterial plasmids are pervasively similar to those of their natural hosts. By contrast, the signatures of animal mitochondrial DNA are not close to those of their hosts but are generally concordant with those of other animal mitochondrial (Mt) DNA.

Justifications for Using Genome Signature. Biochemical experiments in the 1960s and 1970s measuring nearest-neighbor frequencies (4, 5) established that the set of dinucleotide relative abundance values {ρ*XY} is a remarkably stable property of the DNA of an organism. From this perspective, the set of dinucleotide relative abundance values constitutes a genomic signature that is diagnostic and can discriminate sequences from different organisms (3, 6). What causes the uniformity of signature throughout the genome?
It pervades both noncoding and coding DNA (7), and hence cannot be explained by preferential codon usage. A reasonable explanation postulates differences in the replication and repair machinery of different species, which either preferentially generate or preferentially select specific dinucleotides in the DNA. These effects might operate through local DNA structures (base step conformational tendencies), context-dependent mutation rates, methylation, and/or other DNA modifications (1–3, 6).

A measure of genomic signature difference between two sequences f and g (from different organisms or from different regions of the same genome) is the average absolute dinucleotide relative abundance difference, calculated as δ*(f, g) = (1/16) Σ |ρ*XY(f) − ρ*XY(g)|, where the sum extends over all 16 dinucleotides XY.

Genome Signature Comparisons. Fig. 1
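The signature profile {ρ*XY} and the δ* difference defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' software: the function names, the handling of ambiguous bases, and the `scale=1000` factor are assumptions (the δ* values quoted in the text, such as 85 or 250, read as the average difference multiplied by 1000). Counting each strand separately stands in for "the sequence concatenated with its inverted complement" while avoiding the artificial dinucleotide at the junction.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the inverted complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rho_star(seq: str) -> dict:
    """Symmetrized relative abundances rho*_XY = f*_XY / (f*_X * f*_Y)."""
    seq = seq.upper()
    mono, di = Counter(), Counter()
    # Count on both strands; symbols other than A/C/G/T are skipped
    # (an assumption of this sketch, not something the text specifies).
    for strand in (seq, reverse_complement(seq)):
        mono.update(b for b in strand if b in BASES)
        di.update(
            strand[i:i + 2]
            for i in range(len(strand) - 1)
            if strand[i] in BASES and strand[i + 1] in BASES
        )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0 or any(mono[b] == 0 for b in BASES):
        raise ValueError("sequence must contain all four bases")
    f = {b: mono[b] / n_mono for b in BASES}
    return {xy: (di[xy] / n_di) / (f[xy[0]] * f[xy[1]]) for xy in DINUCLEOTIDES}


def delta_star(seq_f: str, seq_g: str, scale: float = 1000.0) -> float:
    """Average absolute rho* difference over the 16 dinucleotides, times scale."""
    rf, rg = rho_star(seq_f), rho_star(seq_g)
    return scale * sum(abs(rf[xy] - rg[xy]) for xy in DINUCLEOTIDES) / 16
```

Because both strands are counted, ρ*XY equals ρ* of the inverted complement of XY (for example, ρ*AA = ρ*TT), which is why the text reports pairs such as TT/AA and CC/GG together. With real data one would call `delta_star(contig_a, contig_b)` on sequences of roughly 50 kb or more.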
The dinucleotide TA is broadly underrepresented or low normal in prokaryotic sequences, at about the level 0.50 ≤ ρ*TA ≤ 0.82 (exceptions include Rickettsia prowazekii, Clostridium acetobutylicum, and the archaea P. aerophilum, P. horikoshii, and Sulfolobus sp.; full names appear in the figure legends).
Among prokaryotes, CG is suppressed (underrepresented) in M. genitalium (but not in M. pneumoniae), in R. prowazekii, in B. burgdorferi, in C. jejuni, in the low-G+C Gram-positive sequences of Streptococcus and Clostridium, and in several thermophiles, including M. jannaschii, M. thermoautotrophicum, Sulfolobus spp., but not in P. aerophilum or P. horikoshii. At the other extreme, CG is overrepresented in B. stearothermophilus, in halobacteria, and also in several β- and α-proteobacterial genomes (e.g., Neisseria spp. and Rhizobium spp.). Among eukaryotes, CG shows potent suppression in vertebrates (even in deuterostomes). Overall, ρ*CG values in vertebrates range from 0.23 to 0.40, whereas they are in the normal range for insects, worms, and most fungi. It has been shown (12) that unmethylated CG dinucleotides of normal frequency in most enteroproteobacteria can induce an immune response in mammalian genomes, where CG is very low. Is this a concomitant of genomic signature biases?

The reverse dinucleotide, GC, is predominantly overrepresented in many β- and γ-proteobacterial sequences and in several low-G+C Gram-positive bacterial genomes (e.g., B. subtilis and C. acetobutylicum; Fig. 1). TT/AA is overrepresented in several proteobacteria, in Mycoplasmas, in Synechocystis sp., in Deinococcus radiodurans, in A. aeolicus among prokaryotes, and in insects and worms among eukaryotes. There are no underrepresentations of TT/AA. Overrepresentations of CC/GG include Synechocystis, B. burgdorferi, A. aeolicus, M. jannaschii, M. thermoautotrophicum, and P. horikoshii. There are no underrepresentations of CC/GG.

δ* Differences Among Prokaryotic Sequences. Fig. 2

(i) Within-species δ* differences (diagonal elements of Fig. 2).

(ii) The rickettsial sequences are grouped with α-proteobacteria, apparently on the basis of rRNA gene comparisons. Is this consistent?
The classical α-proteobacterial types are divided into two major subgroups: α1, including Rhizobium spp., that function importantly in nitrogen fixation, and α2, including Rhodobacter spp. and P. denitrificans, found predominantly in soil and marine habitats and doing anoxygenic photosynthesis. A tentative third group, α3, includes the Rickettsia and Ehrlichia clades (obligate intracellular parasites). Genome signature comparisons indicate drastic discrepancies between the combined groups {α1, α2} and the group α3. Moreover, the α1 and α2 genomes are pervasively of high G+C content (≥60%), whereas α3 genomes are of low G+C content (<32%). The δ* differences between Rickettsia and classical α-proteobacterial sequences are generally >200, very distant (see Fig. 2).

(iii) Sulfolobus spp. δ* differences from other prokaryotes (Fig. 2): δ* (Sulfolobus, Clostridium) ≈ 85, moderately similar; δ* (Sulfolobus, Rickettsia, and Buchnera) ≈ 125–130, distantly similar; δ* (Sulfolobus, other thermophilic archaea) ≈ 71–114, weakly to distantly similar; δ* (Sulfolobus, purple proteobacteria and high G+C Gram-positive) ≈ 190–270, very distant; δ* (Sulfolobus, cyanobacteria) ≈ 145–177, distant.

(iv) Halobacterial genome sequences are outliers. Intriguingly, with respect to genome signature comparisons, the closest to Halobacterium spp. are the Streptomyces sequences, weakly similar. The δ* differences of Halobacterium spp. from other prokaryotes mostly exceed the extreme level of 250.

Genome Compatibility Between Plasmids and Host. Plasmids are genetic mobile elements among bacterial cells, generally laterally transferred by conjugation (on occasion by transduction or transformation). Plasmids carry restriction systems, antibiotic resistance genes, heavy-metal cofactors, Nif (nitrogen fixation) genes, and other contingency functions. Replication of plasmids is largely governed by host machinery.
We assess how the genome signature of a plasmid sequence compares with that of its host sequence for the available prokaryotic genomes with completely sequenced plasmids (Fig. 3).
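The kind of plasmid-versus-host comparison described here (a plasmid compared with a set of 50-kb contigs of the host genome, alongside the average δ* within the host genome) might be organized as follows. This is a hedged sketch rather than the procedure actually used for Fig. 3: the helper names `contigs`, `mean_within`, and `mean_between` are hypothetical, the compact `rho_star` assumes sequences containing only A, C, G, and T with all four bases present, and the factor of 1000 is an assumption about how the quoted δ* values are scaled.

```python
from collections import Counter
from itertools import combinations, product
from statistics import mean

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]
_COMP = str.maketrans("ACGT", "TGCA")


def rho_star(seq):
    """rho* profile; assumes only A/C/G/T and that all four bases occur."""
    seq = seq.upper()
    strands = (seq, seq.translate(_COMP)[::-1])
    mono = Counter(seq + strands[1])
    # Sliding window of width 2 on each strand (overlapping dinucleotides).
    di = Counter(s[i:i + 2] for s in strands for i in range(len(s) - 1))
    n1, n2 = sum(mono.values()), sum(di.values())
    return {xy: (di[xy] / n2) / ((mono[xy[0]] / n1) * (mono[xy[1]] / n1))
            for xy in DINUCS}


def delta_star(f, g, scale=1000.0):
    """Average absolute rho* difference over 16 dinucleotides, times scale."""
    rf, rg = rho_star(f), rho_star(g)
    return scale * sum(abs(rf[xy] - rg[xy]) for xy in DINUCS) / 16


def contigs(seq, size=50_000):
    """Non-overlapping windows of `size` bases; a short tail is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def mean_within(genome, size=50_000):
    """Average delta* over all pairs of contigs from one genome."""
    return mean(delta_star(a, b)
                for a, b in combinations(contigs(genome, size), 2))


def mean_between(plasmid, host, size=50_000):
    """Average delta* between a whole plasmid and each host contig."""
    return mean(delta_star(plasmid, c) for c in contigs(host, size))
```

The plasmid is kept whole here because several plasmids mentioned in the text are shorter than 50 kb; a genome that yields fewer than two contigs makes `mean_within` raise `statistics.StatisticsError`, so very short inputs need a smaller `size`.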
B. burgdorferi. The complete genome extends 0.911 Mb and contains 17 plasmids, of which 11, ranging from 10 kb to 54 kb in size and labeled A to K, have been sequenced. The average δ* difference within the B. burgdorferi (BORBU) genome is 25 with range 3–66. The δ* differences among all plasmids and compared with a complete set of 50-kb contigs of the host genome range from 42 to 84, clearly moderately similar (data not shown). Mutual plasmid sequence δ* differences show moderate to weak similarity.

Specific plasmids.

M. jannaschii, mja. This archaeal genome contains two plasmids of 58 kb (called large) and 16.5 kb (small). Again, we find that the two plasmids are mutually close (δ* = 35) and each moderately similar to the host genome.

Rhizobium spp., rhi. We have available about 150 kb of nonredundant aggregate sequences from R. leguminosarum and about 250 kb of sequence from R. meliloti. There is a mammoth plasmid of about 550 kb. Partitioning these sequences into 50-kb contigs yields the average δ* differences given in Fig. 3.

Broad-host-range plasmids. Two plasmids that stably replicate in many hosts are RSF1010 and RP4 (13, 14); see legend to Fig. 3. The broad-range plasmids are generally moderately or weakly similar to the other plasmids from proteobacterial genomes. The plasmid from L. lactis is distantly similar to the broad-range plasmids. The plasmids from B. burgdorferi, Halobacterium spp., and M. jannaschii are very distant from all proteobacterial plasmids. The B. burgdorferi plasmids are mutually closely or moderately similar, except plasmid G, which is only weakly similar. Plasmid G is close (δ* = 49) to the L. lactis plasmid, perhaps implying that a low G+C Gram-positive bacterium may be the source of plasmid G. The two broad-range plasmids are mutually moderately similar (δ* = 83).
This similarity might mean either that relatively close genome signatures promote plasmid establishment or that the plasmids have acquired their hosts’ signatures during long-term residence. Experiments on conjugation can address issues such as specificity vs. wide host range and relevance of size and signature for plasmid compatibility. We interpret the similarities in signature between plasmids and their bacterial hosts as implying that they share much replication and repair machinery, perhaps because the prokaryotic cell is not compartmentalized to the degree that the eukaryotic cell is.

Genomic Signature δ* Differences Among Mitochondrial Mt Genomes. Fig. 4
Mammalian Mt genome signatures appear almost as random samples of each other. Why are the genome signatures among vertebrate Mt sequences significantly more similar than those among the corresponding host nuclear genomic sequences (15)? δ* differences among organisms putatively reflect variations in replication and/or repair (1–3, 6). The replication machinery for animal Mt DNA apparently varies less than that for host DNA, perhaps because Mt replication is less affected by changes in external environment or developmental programs than is host DNA replication.

Comparisons of Mt from protostomes (e.g., insect, mollusk) versus deuterostomes show δ* differences mostly in the range 90–115, weakly similar. The worm (LUMTE and C. elegans) genomes compared with deuterostome Mt are closely to weakly similar, δ* ≈ 41–101. The fungal Mt sets (excluding S. cerevisiae) compared with deuterostome Mt also yield δ* differences in the range 43–114. The S. cerevisiae mtDNA (≈78 kb) composition is an extreme anomaly, attested to by the inordinate δ* differences from all other Mt or nuclear genomes and contributed to by about 100 G+C-rich clusters, each about 50 to 100 bp long, separated by A+T-rich spacers and numerous transposable elements. The two plant Mt genomes (ARATH, MARPO; see Fig. 4)

Animal Mt sequences show significant underrepresentations of CG dinucleotides, ρ*CG ≈ 0.40–0.60 (10), to about the same extent as occurs in vertebrate nuclear genomic sequences. Virtually all animal Mt maintain normal representations of TA dinucleotides, whereas the corresponding nuclear DNAs predominantly have TA in low relative abundance, suggesting that mtDNA may be thermodynamically less stable than nuclear DNA because the dinucleotide TA has the lowest stacking energy of all base steps. The fungal Mt of S. pombe has ρ*CG ≈ 0.54, typical of animal Mt. However, the Podospora anserina fungal Mt genome has ρ*CG in the normal range. The Mt genome of A.
thaliana has ρ*CG = 0.73, significantly low. The single persistently high ρ* value occurs for ρ*CC/GG ≥ 1.30 in animal and fungal Mt sequences. Mt δ* differences between species parallel the corresponding nuclear δ* differences, despite large differences between Mt and corresponding host nuclear signatures (15).

Discussion

Prokaryotic molecular taxonomy heretofore has been derived predominantly from sequence comparisons among rRNA genes. There are many uncertainties and controversies regarding divisions among prokaryotes (for recent reviews, see refs. 16 and 17). Protein sequence comparisons and associated phylogenetic tree constructions are even more conflicting relative to evolutionary relationships (18, 19). Conventional methods of phylogenetic reconstruction from sequence information employ only similarity or dissimilarity assessments of aligned homologous genes or regions. Some difficulties intrinsic to this approach compared to the use of genome signature include the following:

(i) Alignments of distantly related long sequences (e.g., complete genomes) are generally not feasible for various reasons, including chromosomal rearrangements, whereas signature comparisons do not depend on alignments.

(ii) Different phylogenetic reconstructions (trees) may result for the same set of organisms based on analysis of different protein, gene, or noncoding sequences. Attempts to overcome these conflicts by “averaging” over many proteins are problematic because of biases in species sampling, effects of lateral transfer, complications of gene duplications, and inadequacies and artifacts of phylogenetic methods. As the signature has remarkably low variance throughout the genome, a tree based on δ* differences is independent of which genome segments of 50 kb (or longer) are used in its construction.

(iii) Chimeric origins and lateral transfer between distantly related organisms complicate alignment-based phylogenies. Signature comparisons are unaffected by these factors.
On the other hand, alignment-based comparisons may provide information on individual gene origins that the signature method does not.

(iv) Tree construction derived from aligned sequences cannot be applied to organisms for which similar gene sequences are largely unavailable (e.g., for bacteriophages, diverse eukaryotic viruses).

(v) That “lateral transfer” is pervasive among prokaryotic genomes is now widely appreciated. Vectors for lateral transfer include exogenous transposons, hitchhiking on plasmids and/or phages, movement via episomes, and cell fusions.

What possible reasons and mechanisms can account for the qualitative parallelism between the evolutionary development of host nuclear genomes and the development of Mt organelle genomes despite the pronounced difference between the Mt and host nuclear genome signatures? The Mt and nuclear genomes for animal and fungal organisms use independent DNA polymerase machinery (e.g., γ vs. α, ε, and δ subunits in mammals). Also, the methods of replication and the nature of the replication origins are fundamentally different. Explicitly, the animal and fungal Mt transcription-primed replication machinery is distinctive in that most of the heavy strand is synthesized first and the light strand subsequently, whereas the nuclear genomes are replicated bidirectionally from multiple origins. There appears to be no DNA excision repair mechanism to deal with cyclobutane dimers in the Mt, and no repair of bulky lesions (20). Mt DNAs in animals and fungi show elevated levels of single- and double-strand breaks, mismatches, and generally corrupted base pairings, probably due to a paucity of abasic site correction facilities and mismatch repair capacity in Mt genomes (21). Moreover, repair may be less urgent for Mt activity because each cell has many mitochondria (hundreds or thousands) and a modicum of impaired organelles may not significantly curtail energy production.
We, nevertheless, propose that Mt genomes retain signatures close to those of their repair-competent prokaryotic ancestor.

The contrast between plasmids (which track host genomic signatures) and mitochondria (which do not) is sharp. The similar signatures of plasmids and hosts may have two bases:

(i) As we have postulated for genome fusions (8), perhaps a plasmid whose signature is too different from that of the host will not be accepted by it. This would prescribe a maximum possible signature deviation. The largest δ* difference observed in our survey falls in the weakly similar range (about the distance from human to sea urchin).

(ii) During the plasmid’s residence in its current host, the same pressures that homogenize the signature throughout the chromosome will also drive the plasmid’s signature towards that of the host. Such amelioration has been postulated for the G+C content of laterally transferred DNA (22). We suspect that the signature should ameliorate even more rapidly, for both plasmids and laterally transferred chromosomal segments.

Whereas most successful gene transfer between lineages is very likely intraspecific, an appreciable amount of transfer among distantly related bacteria seems to have accumulated over time (22, 23). Despite such transfer, the signatures currently observed are almost invariant throughout the genome. Some methods based on codon usage biases for identifying laterally transferred genes in bacteria are set forth in ref. 23.

Supplemental Figures
Acknowledgments

We are happy to acknowledge valuable discussions and comments on the manuscript by Drs. B. Edwin Blaisdell, Dale Kaiser, and David Relman. S.K. is supported in part by National Institutes of Health Grants 5R01GM10452-34 and 5R01HG00335-11 and National Science Foundation Grant DMS9704552.

References

1. Karlin S, Burge C. Trends Genet. 1995;11:283–290.
2. Blaisdell B E, Campbell A M, Karlin S. Proc Natl Acad Sci USA. 1996;93:5854–5859.
3. Karlin S, Mrázek J, Campbell A M. J Bacteriol. 1997;179:3899–3913.
4. Josse J, Kaiser A D, Kornberg A. J Biol Chem. 1961;236:864–875.
5. Russell G J, Walker P M, Elton R A, Subak-Sharpe J H. J Mol Biol. 1976;108:1–23.
6. Karlin S. Curr Opin Microbiol. 1998;1:598–610.
7. Karlin S, Mrázek J. J Mol Biol. 1996;262:459–472.
8. Karlin S, Brocchieri L, Mrázek J, Campbell A M, Spormann A M. Proc Natl Acad Sci USA. 1999;96:9190–9195.
9. Karlin S, Campbell A M, Mrázek J. Annu Rev Genet. 1998;32:185–225.
10. Cardon L R, Burge C, Clayton D A, Karlin S. Proc Natl Acad Sci USA. 1994;91:3799–3803.
11. Karlin S, Doerfler W, Cardon L R. J Virol. 1994;68:2889–2897.
12. Krieg A M, Yi A-K, Schorr J, Davis H L. Trends Microbiol. 1998;6:23–27.
13. Olsen R H, Shipley P. J Bacteriol. 1973;113:772–780.
14. Bagdasarian M, Lurz R, Ruckert B, Franklin F C, Bagdasarian M M, Frey J, Timmis K N. Gene. 1981;16:237–247.
15. Karlin S, Mrázek J. Proc Natl Acad Sci USA. 1997;94:10227–10232.
16. Brown J R, Doolittle W F. Microbiol Rev. 1997;61:456–502.
17. Gupta R S. Microbiol Mol Biol Rev. 1998;62:1435–1491.
18. Gupta R S, Golding G B. Trends Biochem Sci. 1996;21:166–171.
19. Budin K, Philippe H. Mol Biol Evol. 1998;15:943–956.
20. Shadel G S, Clayton D A. Annu Rev Biochem. 1997;66:409–435.
21. Yakes F M, Van Houten B. Proc Natl Acad Sci USA. 1997;94:514–519.
22. Lawrence J G, Ochman H. J Mol Evol. 1997;44:383–397.
23. Karlin S, Mrázek J, Campbell A M. Mol Microbiol. 1998;29:1341–1355.