Copyright: © 2006 Allen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Long-Range Periodic Patterns in Microbial Genomes Indicate Significant Multi-Scale Chromosomal Organization

1 Department of Bioengineering, University of California San Diego, La Jolla, California, United States of America
2 Bioinformatics Program, University of California San Diego, La Jolla, California, United States of America
Philip Bourne, Editor
University of California San Diego, United States of America
* To whom correspondence should be addressed. E-mail: palsson/at/ucsd.edu
¤a Current address: Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, United States of America
¤b Current address: The Institute for Systems Biology, Seattle, Washington, United States of America
Received July 25, 2005; Accepted December 7, 2005.

Abstract

Genome organization can be studied through analysis of chromosome position-dependent patterns in sequence-derived parameters. A comprehensive analysis of such patterns in prokaryotic sequences and genome-scale functional data has yet to be performed. We detected spatial patterns in sequence-derived parameters for 163 chromosomes occurring in 135 bacterial and 16 archaeal organisms using wavelet analysis. Pattern strength was found to correlate with organism-specific features such as genome size, overall GC content, and the occurrence of known motility and chromosomal binding proteins. Given additional functional data for Escherichia coli, we found significant correlations among chromosome position-dependent patterns in numerous properties, some of which are consistent with previously experimentally identified chromosome macrodomains.
These results demonstrate that the large-scale organization of most sequenced genomes is significantly nonrandom, and, moreover, that this organization is likely linked to genome size, nucleotide composition, and information transfer processes. Constraints on genome evolution and design are thus not solely dependent upon information content, but also upon an intricate multi-parameter, multi-length-scale organization of the chromosome.

Synopsis

For more than a decade, the genetic material for a growing number of microbial organisms has been determined experimentally using genome sequencing techniques. These sequenced genomes provide researchers with an abundance of information regarding the composition and capabilities of each organism since they serve as “parts lists” that specify the protein machinery that each cell generates. However, genomes are not merely “lists” but also are typically arranged in nonrandom order. It is thought that this order may be related to some extent to the way in which each genome is packed into the tiny confines of a cell (often more than 1,000-fold packing). The authors have used signal processing methods to identify long-range spatial patterns in the arrangement of most sequenced microbial genomes, and they have related the degree of organization in each genome to various characteristics specific to the corresponding organisms. They have also analyzed in detail the degree of overlap among patterns in numerous different kinds of data for a model bacterial organism, Escherichia coli. Their results conclusively demonstrate that there are significant evolutionary constraints that act upon genome organization as well as genome content, and that the interplay between organization and function cannot be ignored in understanding fundamentally how a microbial cell works.

Introduction

Genomes in prokaryotic organisms typically are packed tightly into a nucleoid where they carry out multiple functions simultaneously [1,2].
The condensed DNA within the bacterial nucleoid must not only be efficiently replicated and segregated during cell division [3], but it must also simultaneously participate in the information transfer processes of transcription and translation [4]. Recent studies have significantly advanced our understanding of the ultrastructural and multifunctional organization of prokaryotic chromosomes. DNA in Escherichia coli has been found to be packed into supercoiled domains ranging 2–66 kb and averaging ~10 kb [5]. At a slightly longer length-scale, studies using fluorescence in situ hybridization have revealed that the origin and terminus of replication in E. coli gravitate toward the poles of the cell throughout replication, but both migrate to the mid-cell region just prior to the initiation of chromosome replication [6]. Fluorescence experiments in synchronized cultures of the aquatic bacterium Caulobacter crescentus have revealed the cellular location of 112 individual chromosomal loci throughout replication and cell division [7]. In addition to these imaging techniques, genetic dissection has been used to identify four macrodomains and two less-structured regions in the E. coli chromosome [8]. Two of these macrodomains were consistent with those found near the origin and terminus of replication using fluorescence in situ hybridization [6]. However, many issues remain unresolved regarding the intricacies of this arrangement, and particularly the relationship between chromosomal ultrastructure and the processes of transcriptional regulation and protein synthesis [4,9]. Several studies have revealed that genes in bacterial nucleoids tend to be arranged along the long axis of the cell (in the case of rod-shaped bacteria) so as to preserve the linear order of the genes along the chromosome [6,7,10,11]. 
Given this linear arrangement, prokaryotic genome sequences inherently contain useful information relating to chromosomal ultrastructure since they provide numerous properties as a function of chromosome position [12]. However, the inference of 3-D genome-packing from direct examination of the raw sequence is somewhat challenging at the short length-scales of the nucleotide, gene, or operon (1 bp–10 kb) due to the inherently one-dimensional nature of sequence data and hence the considerable sequence noise over shorter scales. Accordingly, various averaging and filtering methods have been used to identify long-range (i.e., >10-kb) position-dependent patterns in genome-associated properties [12–14]. In order to detect such long-range periodic patterns in inherently noisy chromosome position-dependent data, wavelet analysis has been used in several studies [13,15] (Figure 1
As more prokaryotic genome sequences become available, it should be increasingly possible to relate the quantitative degree of genome organization to global properties of each organism, including the presence of known nucleoid-binding proteins [20], organism taxa, and genome size and composition. Observed correlations may indicate constraints that affect (or are affected by) genome organization. Furthermore, a study of genome position-dependent patterns in heterogeneous data types in a well-studied model organism such as E. coli (e.g., gene expression versus specific codon preferences) may reveal properties that are spatially linked. Therefore, the need exists to define an unbiased, quantitative measure of genome organization from sequence-derived data, compute this quantity for numerous sequenced prokaryotic genomes, relate this quantity to global properties of each organism, and determine the spatial coupling of multiple heterogeneous properties for a well-studied model organism.

In this study, we address these needs by employing wavelet analysis in concert with a bootstrap significance test (Materials and Methods) to compute the pattern strengths of chromosome position-associated datasets derived from 163 sequenced prokaryotic chromosomes. This pattern strength provides a measure of the nonrandom nature of sequence-derived data that is independent of genome length. We then computed the pattern strength of genome position-dependent properties for nearly every sequenced prokaryotic genome, and we related this measure to taxonomic and physiological characteristics of each organism. Finally, we examined disparate genome position-dependent data available for E. coli to determine properties that are spatially correlated over multiple length-scales. Our results demonstrate that the degree of organization in bacterial genomes is highly variable and correlates with specific properties, and our analysis of patterns in multiple E. coli datasets supports the notion that the overall organization of the bacterial chromosome results from the simultaneous optimization of functional and structural constraints.

Results/Discussion

Pattern Strengths of Sequenced Prokaryotic Organisms

Using the pattern detection method described (Materials and Methods), we computed the pattern strengths for the GC/AT content, fractional gene density, and codon adaptation index (CAI) derived from 163 sequenced prokaryotic chromosomes (Figure 2
Correlation of Pattern Strengths to Organism-Specific Properties

Pattern strengths in the sequence-derived parameters for each chromosome were compared with global properties such as genome length, total AT composition, organism taxon, and the presence of specific nucleoid-binding proteins. Pattern strengths in CAI and GC/AT content were found to be weakly but significantly correlated with genome size (r = 0.60, p = 2.4 × 10⁻¹⁷; Figure 3
We then examined correlation of pattern strength with particular organism-specific characteristics relating to taxon, gram stain, cell shape, and the presence of particular classes of proteins in each organism (summarized in Table 3). The Wilcoxon rank-sum test (p < 0.05) was used to assess significance. With respect to organism taxa, patterns in CAI were found to be stronger among the proteobacteria and weaker among the mollicutes and spirochetes. Cell-shape biases in pattern strength included a preference for stronger patterns in rod-shaped bacteria and weaker patterns in spiral-shaped bacteria. No other correlations relating to organism taxa, staining characteristics, or cell shape were observed. However, this analysis is inherently biased by the particular genomes that have been sequenced to date and are thus somewhat skewed toward enteric bacteria and pathogens. As the physiological and morphological diversity of sequenced prokaryotes increases, more definitive conclusions can be drawn regarding possible correlation between genome patterning and such properties as organism lifestyle and cell shape.
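As a rough illustration of the group comparison described above, the sketch below implements a two-sided Wilcoxon rank-sum test with the normal approximation, using only the Python standard library. The function name `rank_sum_test` and the pattern-strength values in the usage note are hypothetical and are not taken from the study; the authors' exact implementation (for example, any tie or continuity correction) is not specified in the text.

```python
import math


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Tied values receive average ranks; no tie or
    continuity correction is applied, so treat this as a sketch
    rather than a replacement for a statistics library.
    """
    # Pool the observations, remembering which group each came from.
    pooled = sorted(
        (value, label)
        for label, group in enumerate((group_a, group_b))
        for value in group
    )
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1..j
        members_of_a = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_a += average_rank * members_of_a
        i = j
    n_a, n_b = len(group_a), len(group_b)
    expected = n_a * (n_a + n_b + 1) / 2.0
    spread = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (rank_sum_a - expected) / spread
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return z, p
```

For example, calling `rank_sum_test([0.30, 0.42, 0.38, 0.51, 0.45], [0.10, 0.22, 0.18, 0.25, 0.12])` on two made-up sets of pattern strengths (say, genomes with and without a given protein class) returns a z statistic and a p-value that can be compared against the 0.05 threshold used in the text.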
Genomes exhibiting the strongest patterns in CAI and GC/AT content had a higher likelihood (Wilcoxon rank-sum p < 0.05) of containing genes for flagella and pili than would be expected if the existence of these structures were uncorrelated with pattern strength. As shown in Table 3, the presence of genes encoding the specific nucleoid-binding proteins H-NS, Fis, CbpB, Hfq, IciA, Lrp, and Muk was also found to be correlated with overall patterning in CAI.

Comparisons of pattern strengths for each sequence-derived parameter revealed no significant correlations, with the exception of GC/AT content versus CAI (r = 0.74, p = 5.6 × 10⁻²⁹). This correlation reflects the fact that CAI and GC/AT content are not actually independent properties, since GC-rich stretches of DNA will favor synonymous codons containing G and C.

Overlap of Patterns in Heterogeneous Datasets in E. coli

Since a 600–650 kb periodic pattern has previously been detected in E. coli gene expression [17,18,22], the above results motivated an assessment of chromosome position-dependent patterns in functional properties specifically in E. coli (in addition to the patterns in GC/AT content, CAI, gene density, and gene orientation discussed above). Correlation of similar patterns in these heterogeneous datasets allows for an evaluation of the structural and functional organization of the E. coli genome. Binary matrices of significant pattern density regions were generated for a p-value cutoff corresponding to a specified false discovery rate (FDR) [23] (FDR < 5% for our analysis). Unity was assigned to regions in the scalogram deemed to have statistically significant patterning and zeros assigned elsewhere (Figure 4
In analyzing the overlap of patterns in functional genome position-dependent data in E. coli, we observed that gene expression [17], gene essentiality [24], and the evolutionary retention index [24] contain significant periodic patterns overlapping at the 650-kb length-scale (Figure 4
Analysis of all 163 chromosomes revealed that long-range patterns in synonymous codon usage (CAI) are not strictly independent from those in GC/AT composition. However, patterns in sequence-derived DNA-bending parameters for E. coli (e.g., intrinsic curvature, propeller twist, stacking energy, etc.) almost completely overlap with patterns in GC/AT content (Figure 5

Genome topology has been shown to be a selection target in the long-term evolution of E. coli [30]. Our results demonstrate that prokaryotic genomes generally possess significant organization that increases with genome size, overall GC composition, and the presence of several known nucleoid-binding proteins. Thus, genome composition and size may impose additional constraints on the evolution of gene order and chromosomal arrangement in prokaryotes. Given that the spatial organization of chromosomal loci within a replicating E. coli cell is linearly ordered along the cellular axis [6,11], the analysis presented here would imply the existence of six subchromosomal functional domains in the E. coli genome [22]. This notion of highly expressed topological domains has been suggested before [31] and is consistent with the macrodomains elucidated by genetic dissection of E. coli [8]. The boundaries of those four domains and two less-structured regions [8] align with the boundaries of the regions of high and low gene expression, gene essentiality, and evolutionary retention in E. coli at the 600–650-kb length-scale (Figure 6

Implications and Conclusions

As demonstrated in the analyses described above, genome sequences and sequence-derived properties are significantly patterned (i.e., non-randomly distributed) with respect to chromosome position in most of the prokaryotic genomes sequenced to date (Figure 2

In E. coli, a more detailed analysis of available data demonstrates that patterns in multiple disparate properties are interlinked (Figures 4

Materials and Methods

Chromosome position-associated datasets.
Datasets were analyzed from most prokaryotic genome sequences published through January 2005 and were downloaded from the CBS Genome Atlas Database [34] (http://www.cbs.dtu.dk/services/GenomeAtlas). Four types of chromosome position-dependent data were analyzed for 151 prokaryotic organisms (corresponding to 163 chromosomes in 16 archaeal and 135 bacterial organisms): 1) GC/AT content averaged in kilobase bins, 2) gene orientation (i.e., strand), 3) fractional gene density (defined as the number of genes—or fractions of genes—per kilobase), and 4) CAI [35] per gene. For the CAI, we used the global codon usage as the reference set to maintain consistency, since the highly expressed genes for some of the organisms may not be predictable a priori. GC and AT content are by definition inversions of one another and are strictly anti-correlated, so any patterns present in either property will be identical. Thus, patterns in these properties are simply referred to as patterns in GC/AT content. The analysis of additional data from E. coli K-12 MG1655 included sequence-derived biophysical parameters averaged across 1-kb segments [12], gene classifications and product locations [28], gene expression [17], gene essentiality [24], and evolutionary retention indices computed based upon homology with 32 representative bacterial sequences [24]. Pattern detection by wavelet analysis and significance testing. Wavelet analysis, reviewed in detail elsewhere [36], is an approach whereby irregular patterns in biological data may be elucidated [14,15,17–19,37]. In short, each genome-scale dataset was ordered according to position along the chromosome. These ordered data, f(x) (where x is defined as the nucleotide position along the chromosome), were then continuously integrated using a family of filter functions to obtain a transform value for numerous filter widths (i.e., scales, designated a) centered at each position x in the dataset:
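In one standard convention, the transform can be written as shown below. The symbols W(a, x) for the transform value, ψ for the filter function (ψ* being its complex conjugate), and x′ for the integration variable are notation assumed here, and the 1/√a normalization is a common choice that may differ from the one the authors used.

```latex
W(a, x) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(x')\, \psi^{*}\!\left(\frac{x' - x}{a}\right) dx'
```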
The filter function used in this study was the Morlet wavelet, defined as:
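A common form of the Morlet wavelet is a complex sinusoid under a Gaussian envelope, as written below. The dimensionless argument t, the center frequency ω₀, and the π^(−1/4) normalization are assumptions here; the text does not state the parameter values the authors used. For ω₀ near 6, the Fourier period of this wavelet is roughly equal to the scale a.

```latex
\psi(t) = \pi^{-1/4}\, e^{i \omega_0 t}\, e^{-t^{2}/2}
```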
This particular wavelet was chosen because the length scale of the transform corresponds approximately to the period of any localized pattern [36]. The resulting transform values may be plotted in the form of a scalogram (Figure 1

Currently, no standard statistical methods of verifying patterns identified using continuous wavelet transforms are in common use. Thus, the significance of each transform value was ascertained by a bootstrap approach in which the order of the data points along the chromosome was randomized 200 times, and the real and imaginary portions of the Morlet wavelet transform were recomputed for each randomized dataset (described previously for the real portion of the Morlet wavelet [17]). As described in Protocol S1, the randomization of each genome position-associated dataset was performed on either a gene-by-gene basis (for annotation-derived data) or on a kilobase-by-kilobase basis (for annotation-independent properties such as GC content). Thus, the null hypothesis against which each wavelet scalogram was tested consisted of the wavelet transform of a “scrambled” dataset, where the unit of chromosome that was scrambled was either the gene or a kilobase segment. A p-value was then computed for each point in the scalogram based upon the number of times the magnitude of the transform value from a randomization exceeded that of the original transform. The p-value cutoff corresponding to a selected FDR [23] (FDR < 5%) was then determined from the distribution of p-values computed for each scalogram from the randomization tests. Given this cutoff, one can generate a binary matrix (the same size as the scalogram) containing unity for each point in the scalogram that is significant at FDR < 0.05, and zeros elsewhere. The ratio of the sum of the non-zero elements in this binary matrix to the total matrix size is taken to be the pattern strength of a given dataset (colored areas in Figure 1

Controls.
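The following is a minimal sketch of the wavelet transform, shuffle-based significance test, and pattern-strength calculation described above, small enough to try on toy positive and negative controls. It is not the authors' code and rests on several assumptions: a Morlet wavelet with center frequency ω₀ = 6, circular (wrap-around) indexing, the modulus of the complex transform instead of separate real and imaginary parts, an add-one correction on the empirical p-values, and the Benjamini-Hochberg step-up rule as the FDR procedure. All function names are hypothetical.

```python
import cmath
import math
import random


def morlet(t, omega0=6.0):
    """Complex Morlet wavelet at dimensionless position t (omega0 is assumed)."""
    return math.pi ** -0.25 * cmath.exp(1j * omega0 * t) * math.exp(-0.5 * t * t)


def cwt_magnitude(data, scales, omega0=6.0):
    """Return |W(a, x)| as one row per scale a and one column per position x."""
    n = len(data)
    mean = sum(data) / n
    centered = [v - mean for v in data]
    rows = []
    for a in scales:
        # The Gaussian envelope is negligible beyond about 4 scale widths.
        half = min((n - 1) // 2, int(4 * a))
        kernel = [morlet(k / a, omega0).conjugate() / math.sqrt(a)
                  for k in range(-half, half + 1)]
        row = []
        for x in range(n):
            # Circular indexing, as an assumption suited to circular chromosomes.
            total = sum(centered[(x + k) % n] * kernel[k + half]
                        for k in range(-half, half + 1))
            row.append(abs(total))
        rows.append(row)
    return rows


def empirical_p_values(data, scales, n_shuffles=200, seed=0):
    """Shuffle the data order and count how often the null magnitude wins."""
    rng = random.Random(seed)
    observed = cwt_magnitude(data, scales)
    exceed = [[0] * len(data) for _ in scales]
    shuffled = list(data)
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null = cwt_magnitude(shuffled, scales)
        for i, row in enumerate(null):
            for j, value in enumerate(row):
                if value >= observed[i][j]:
                    exceed[i][j] += 1
    # Add-one correction (an assumption) keeps every p-value above zero.
    return [[(c + 1) / (n_shuffles + 1) for c in row] for row in exceed]


def bh_cutoff(p_values, fdr=0.05):
    """Largest p-value passing the Benjamini-Hochberg rule, or 0.0 if none."""
    ordered = sorted(p_values)
    m = len(ordered)
    cutoff = 0.0
    for rank, p in enumerate(ordered, start=1):
        if p <= fdr * rank / m:
            cutoff = p
    return cutoff


def pattern_strength(data, scales, n_shuffles=200, fdr=0.05, seed=0):
    """Fraction of scalogram points that are significant at the chosen FDR."""
    p_matrix = empirical_p_values(data, scales, n_shuffles, seed)
    flat = [p for row in p_matrix for p in row]
    cutoff = bh_cutoff(flat, fdr)
    return sum(1 for p in flat if p <= cutoff) / len(flat)
```

As a toy positive control, a pure sine wave with a period of 8 positions, analyzed at scales that include 8, should give a clearly nonzero pattern strength; shuffling the same values before the analysis serves as a toy negative control and should give a much lower one. Because the smallest attainable p-value is 1/(n_shuffles + 1), very small shuffle counts can leave nothing significant after the FDR step.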
Protocol S1 presents a set of positive and negative controls for the wavelet transform and bootstrap procedure described above. The negative controls showed that no significant patterns were detected in trivial or randomly ordered datasets (for which no pattern would be expected a priori), thus effectively ruling out the possibility that the observed periodic patterns are simply artifacts inherent in the wavelet filter used or spurious cyclic patterns caused by outliers in otherwise random data (called the Slutzky-Yule effect when observed in moving averages). Wavelet analysis was performed for a 1-Mb subset of the Pseudomonas putida GC/AT dataset in order to rule out the possibility that the correlation shown in Figure 3

Protocol S1: Additional Material Providing More Detailed Experimental Methods and Positive and Negative Controls (195 KB DOC)
Table S1: Complete List of Computed Pattern Strengths for Each Chromosome in this Study, along with Associated Organismal Data (53 KB XLS)

Acknowledgments

We thank Chris Herring, Markus Herrgård, Jennifer Reed, Steve Fong, and Jason Papin for stimulating discussions and comments on the manuscript, Adam Feist for assistance with data pre-processing, and the National Institutes of Health (GM57089) and National Science Foundation (BES 03–31342) for funding and support.

Author contributions. TEA and BØP conceived and designed the experiments. TEA performed the experiments. TEA, NDP, and ARJ analyzed the data. TEA and ARJ contributed reagents/materials/analysis tools. TEA, NDP, ARJ, and BØP wrote the paper.

Competing interests. BØP serves on the Scientific Advisory Board of Genomatica.

Abbreviations
Footnotes Citation: Allen TE, Price ND, Joyce AR, Palsson BØ (2006) Long-range periodic patterns in microbial genomes indicate significant multi-scale chromosomal organization. PLoS Comput Biol 2(1): e2. References