NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
Nat Biotechnol. Author manuscript; available in PMC Nov 18, 2013.
Published in final edited form as:
PMCID: PMC3832199

Elucidation of the Transcription Unit Architecture of the Escherichia coli K-12 MG1655 Genome


Bacterial genomes are organized in terms of structural and functional components. These components include promoters, transcription start and termination sites, open reading frames, regulatory non-coding regions, untranslated regions and transcription units, that together comprise the functional organization of a genome. Here, we use a systems approach that iteratively integrates multiple high-throughput measurements at a genome-scale to identify the organizational structure of the Escherichia coli K-12 MG1655 genome. Integration of the organizational components provides experimentally annotated transcription unit (TU) architecture, including alternative transcription start sites, promoter structures, boundaries and open reading frames. A total of 4,661 TUs were identified, representing an increase of > 530% over current knowledge. This comprehensive TU architecture allows for the elucidation of condition-specific uses of alternative sigma factors at the genome-scale. Furthermore, the TU architecture provides a foundation on which genome-scale transcriptional and translational regulatory networks are based.

Tremendous progress has been made toward determining whole genome sequences of bacteria as well as in describing their transcriptomes and proteomes during the last decade1-6. However, the in-depth organizational structure of bacterial genomes has not yet been fully elucidated. Knowledge of the organizational structure of bacterial genomes is of fundamental importance since it dictates the flow of genetic to functional information at the genome-scale. Bacterial genomes are highly organized in various structural and functional components. These organizational components include (but are not limited to) promoters, transcription start sites (TSSs), open reading frames (ORFs), regulatory non-coding regions, untranslated regions (UTRs), and transcription units (TUs).

With the publication of the first full genome sequence in the mid 1990s, it became possible, in principle, to identify all the gene products involved in complex biological processes in a single organism7. In practice, almost 15 years later, this has proved hard to accomplish. Sequence information by itself is not suitable for a comprehensive elucidation of these components. Multiple simultaneous genome-scale measurements are therefore needed to determine all these components, their location, and their relationship to the genome sequence. Establishing the organizational structure of a genome is a challenging task. In-depth analyses of transcriptomes and proteomes of multiple prokaryotic organisms indicate that the information content and structure of a genome is much more complex than previously thought5,8-10 and the process of revealing the role of cellular components for transcription and translation on a genome-scale has just begun3,11-15. Here we describe a four-step systems approach that iteratively integrates multiple genome-scale measurements on the basis of genetic information flow to identify the organizational components and map those onto the genome sequence (Fig. 1, Fig. 2a)16.

Figure 1
Flowchart of the systematic iterative integration process
Figure 2
Integration of the organizational components


Determination of RNA polymerase binding regions at the genome-scale

The first step of the flow of genetic information is its transfer into messenger RNA (mRNA) by the transcription process. Although this process is extensively regulated in response to external signals, mRNA is basically synthesized by RNA polymerase (RNAP) that initially binds to the promoter region. We therefore integrated RNAP-binding regions and mRNA transcript abundance to determine segments of contiguous transcription originating from promoter regions. To identify RNAP-binding regions at the genome scale, we applied a ChIP-chip method to E. coli K-12 MG1655 grown in the presence or absence of rifampicin under multiple growth conditions15,17. Using an antibody specific to the RNAP β subunit, we obtained RNAP-associated DNA fragments that were then fluorescently labelled and hybridized to a high-density oligonucleotide tiling microarray representing the entire E. coli genome15. Rifampicin treatment generated a genome-wide static map of RNAP-binding regions compared to a dynamic map of RNAP-binding regions without rifampicin treatment (Fig. 2a)15. From this static map, we identified a total of 1,511 and 1,444 RNAP-binding regions on the forward and reverse strand, respectively (Fig. 2b, Supplementary Table 1), essentially representing the promoter regions at the genome-scale18. Interestingly, the locations of RNAP-binding regions obtained from rifampicin-treated cells are nearly independent of the experimental conditions used15. This observation could be due to the stochastic interaction between repressors and regulatory regions known to cause random bursts in transcription in vivo (Supplementary Fig. 1)19. The dynamic maps, in contrast, indicate differential RNAP binding across the entire genome, representing the genome-wide rearrangement of RNAP in response to environmental conditions15 (Supplementary Fig. 1). Considering the current E. coli genome annotation (4,505 genes in total), we determined an average of one RNAP-binding region per 1.5 genes.

Integration of the RNAP-binding regions and transcriptomic data

In the second step, we obtained comprehensive information about the expression level of mRNA transcripts across the entire E. coli genome using tiling microarrays to profile transcriptomes under multiple growth conditions. These growth conditions included log phase, heat shock, stationary phase, and an alternative nitrogen source (see Online Methods). Negative control probes that represent non-specific background hybridization were randomly selected based on the median signal intensity (depicted as a dotted line in Fig. 2c)20. The microarray signals were subsequently transformed to binary signals, representing presence (probes expressed above background) and absence (background). Transcription data obtained from multiple growth conditions were added cumulatively in a step-by-step approach. These rounds of iteration resulted in coverage of 73.0%, 80.2%, 86.8%, and 87.4% of the currently annotated genome, respectively (Supplementary Table 2).
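The cumulative integration described above can be sketched as a running union of binary presence calls. This is only an illustration: the function name `cumulative_coverage` and the toy data are assumptions, and coverage is computed here over probes, whereas the reported percentages refer to the annotated genome.

```python
def cumulative_coverage(calls_by_condition):
    """Return the fraction of probes called present after each condition is added.

    calls_by_condition: list of equal-length lists of 0/1 presence calls,
    one list per growth condition (hypothetical input layout).
    """
    combined = [0] * len(calls_by_condition[0])
    coverage = []
    for calls in calls_by_condition:
        # A probe counts as present once any condition so far shows expression.
        combined = [a | b for a, b in zip(combined, calls)]
        coverage.append(sum(combined) / len(combined))
    return coverage


# Toy example with four probes and three conditions.
toy = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
print(cumulative_coverage(toy))
```

Because the union can only grow, each added condition either raises coverage or leaves it unchanged, which matches the diminishing gains reported across iterations.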

The last iteration result (i.e., cumulative integration of microarray results from four growth conditions) represents 118,767 probes detected above background level (false discovery rate (FDR) threshold = 0.05) (Table 1). A total of 567 genes (12.6%) fell below FDR threshold consisting of 409 uncharacterized and 158 currently known genes. Within the known genes, several, such as rhaBADM, tynA, and speF, are only functional under specific growth conditions21,22 and are therefore unlikely to be detected under the conditions used (Supplementary Table 2). In addition, we detected transcription of a total of ~140 kb, which had not been annotated as ORFs previously.

Table 1
Systematic and iterative integration of experimentally derived components for meta-structure elucidation

The RNAP-binding regions and transcriptomic data were integrated to obtain a map of contiguous transcript segments (i.e., RNAP-guided transcript segments), which is independent of the current genome annotation. The binary signals (i.e., presence (1) or absence (0) calls) were then partitioned into segments of constant signals separated by RNAP-binding regions determined above (Fig. 2c). Compared to a change point detection algorithm23 and a running-window approach24, the RNAP-guided transcript segmentation method, i.e., integrating the binary transcript signals with the RNAP-binding information, circumvents the assembly of unrelated transcripts and greatly benefits further TU determination (Supplementary Fig. 2).
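A minimal sketch of this partitioning step for a single strand follows. It assumes each RNAP-binding region has already been reduced to one coordinate and that a segment runs from the first to the last presence call between two consecutive binding positions, as described in the Online Methods. The function name and toy coordinates are hypothetical, and the manual curation steps are not modeled.

```python
from bisect import bisect_left


def rnap_guided_segments(positions, calls, rnap_sites):
    """Split binary calls into segments bounded by RNAP-binding positions.

    positions:  sorted probe coordinates on one strand
    calls:      0/1 presence calls, same length as positions
    rnap_sites: sorted RNAP-binding coordinates on the same strand
    Returns (start, end, probe_density) tuples. Probes upstream of the
    first RNAP site are ignored in this simplified version.
    """
    bounds = list(rnap_sites) + [positions[-1] + 1]
    segments = []
    for left, right in zip(bounds[:-1], bounds[1:]):
        lo = bisect_left(positions, left)
        hi = bisect_left(positions, right)
        present = [i for i in range(lo, hi) if calls[i] == 1]
        if not present:
            continue  # no transcription detected downstream of this site
        first, last = present[0], present[-1]
        density = len(present) / (last - first + 1)
        segments.append((positions[first], positions[last], density))
    return segments


probes = list(range(0, 250, 25))
calls = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]
print(rnap_guided_segments(probes, calls, [25, 150]))
```

Using the binding positions as hard boundaries is what keeps adjacent but unrelated transcripts from being merged into one segment.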

A total of 1,364 and 1,321 segments with an average length of 1.3 kb were determined from the cumulative iterations on the forward and reverse strand, respectively (Table 1, Supplementary Table 3). Among those, a total of 98 segments were determined without RNAP-binding. The genomic coverage of the segments was ~81% with an average probe density of ~83% per segment. With each iteration, boundary accuracy and probe density of the segments increased (Supplementary Table 3, Supplementary Fig. 3). A total of 253 segments were determined in regions of the genome lacking prior ORF annotation, including 82 segments in intergenic regions, 147 segments on an opposite annotated strand, and 24 segments in intragenic regions (Supplementary Fig. 4).

Determination of transcription start sites

In the third step, we integrated the RNAP-guided transcript segments with genome-wide TSSs data (Fig. 2d). TSSs were determined by a newly developed, modified 5′-RACE method using a unique RNA adapter and massive-scale sequencing (Supplementary Fig. 5). Three cumulative iterations yielded > 4.4 million sequence reads of an average length of 30 bp corresponding to ~30× genome lengths (~133 Mb raw sequence data) (Table 1). Sequence reads were mapped back onto the reference E. coli genome (NC_000913) to determine the numbers of reads matching each genomic position. Approximately 64% of the sequence reads uniquely mapped to one genomic region, whereas the remaining reads either mapped to repeated sequences or were of poor quality. Mapping the reads to the genome allowed the determination of 3,969 TSSs from the first iteration, and 4,062 and 4,133 TSSs from consecutively cumulative iterations (Supplementary Table 4). Each promoter region (2,955 in total) averages 1.6 TSSs. For confirmation, we compared our data to currently validated TSSs22 and found that 87% (1,089 out of 1,252) of the validated TSSs agreed to TSSs obtained from this study (Supplementary Table 5).
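Counting the reads that match each genomic position can be sketched as below. The input layout (one `(position, strand)` pair per uniquely mapped read 5′ end) and the `min_reads` cutoff are assumptions made for illustration; the text does not state the threshold used to call a TSS.

```python
from collections import Counter


def call_tss(read_starts, min_reads=2):
    """Call candidate TSSs from the 5' ends of uniquely mapped reads.

    read_starts: iterable of (position, strand) tuples, one per read
    min_reads:   hypothetical minimum read count to report a position
    Returns a sorted list of (strand, position, read_count).
    """
    counts = Counter(read_starts)
    return sorted(
        (strand, pos, n)
        for (pos, strand), n in counts.items()
        if n >= min_reads
    )


reads = [(100, "+")] * 3 + [(105, "+")] + [(900, "-")] * 2
print(call_tss(reads))
```

Positions supported by a single read are dropped in this toy run, which is one simple way to limit noise from degraded transcripts.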

The 13% of the validated TSSs (corresponding to 146 TUs) not detected in this study could be due to low mRNA expression levels as well as condition-specific use of TSSs. For example, the validated TSSs for narK, a gene encoding a nitrate/nitrite antiporter expressed under anaerobic growth conditions, were not detected in this study. This could be explained by nearly background mRNA levels for this gene under the applied conditions. Another example is the ilvIH operon, encoding acetolactate synthase involved in amino acid biosynthesis. The ilvIH operon has four experimentally verified TSSs22. Among those, we detected only one TSS, which is highly regulated by the transcription factor Lrp under our growth conditions15,22. On the other hand, we found that ~2% of TSSs (97 out of 4,133) were from weakly transcribed genes and that ~5% of RNAP-guided transcript segments (145 out of 2,685) lacked TSSs. Consequently, integration of the TSSs with the RNAP-guided transcript segments allowed us to determine a total of 4,036 TSS-associated transcriptional segments (Table 1).

Identification of potential protein-coding ORFs

In the fourth step, we addressed how many potential protein-coding ORFs (pORFs) are within each RNAP-guided transcript segment by using a high-throughput proteomics approach for identifying peptides at a genome-scale. This approach was based on liquid chromatography coupled to Fourier transform ion cyclotron resonance mass spectrometry (LC-FTICR-MS) and accurate mass and time tag (AMT tag)24. The proteomics analysis yielded a total of 54,549 peptides based on a stop-to-stop database of the E. coli genome (Supplementary Table 6, see Online Methods).

To predict pORFs from proteomics data without relying on current annotation, we mapped the genomic locations of peptides onto a maximally extendable ORF scaffold (i.e., stop codon to most distant start codon) built from all six possible translational frames (Supplementary Table 7). This analysis yielded 2,542 pORFs (FDR < 2%) (Fig. 2e, Supplementary Table 8, Supplementary Fig. 6, see Online Methods). Among those, 2,525 pORFs (~99%) were mapped to currently annotated ORFs (~59% coverage) (Supplementary Table 8). Interestingly, > 99% of translation stop positions matched currently annotated ORFs exactly, whereas only 64% of translation start positions matched.
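The idea of a maximally extendable ORF scaffold can be illustrated with a small six-frame scan. This sketch assumes ATG, GTG, and TTG as start codons and a linear sequence; the function names are hypothetical and the authors' actual scaffold construction may differ.

```python
START = {"ATG", "GTG", "TTG"}   # assumed bacterial start codons
STOP = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def _scan_strand(seq):
    """Return (start, end) 0-based half-open ORFs in the three forward frames."""
    orfs = []
    for frame in range(3):
        first_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP:
                if first_start is not None:
                    orfs.append((first_start, i + 3))  # include the stop codon
                first_start = None
            elif codon in START and first_start is None:
                first_start = i  # most distant start since the previous stop
    return orfs


def maximal_orfs(seq):
    """Six-frame scaffold: each stop codon paired with its most distant start."""
    n = len(seq)
    forward = [(s, e, "+") for s, e in _scan_strand(seq)]
    reverse = [(n - e, n - s, "-") for s, e in _scan_strand(revcomp(seq))]
    return forward + reverse


print(maximal_orfs("CCATGAAAGTGCCCTAACC"))
```

In the toy sequence the in-frame GTG is ignored because an upstream ATG already extends the frame further, which is exactly why start positions in such a scaffold tend to overshoot the true translation start.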

To examine the accuracy of translation start and stop positions, pORFs were compared with a total of 888 ORFs whose translational boundaries have been validated25. Out of 2,525 pORFs, 803 were mapped to validated ORFs. All the translation stop positions of these 803 pORFs matched the validated ones exactly. However, only 499 pORFs (accuracy = ~62%) showed identical translation start positions (Supplementary Table 9). When we instead considered the translation start codons closest to the observed peptide(s) within pORFs, the gain in accuracy against validated ORFs was negligible (507 pORFs). pORFs with non-matching translation start positions (296 pORFs) exhibited poor peptide coverage (Supplementary Fig. 7). Overall, the proteogenomic mapping approach allows for the genome-scale determination of ORFs26. However, due to a limitation in peptide coverage, additional methods, e.g., proteomics with N-terminal modification27, have to be applied to obtain a more comprehensive and accurate ORF map.

A total of 2,385 pORFs showed direct evidence of transcription after mapping them to the RNAP-guided transcript segments identified above (Supplementary Table 10). Moreover, we identified 17 pORFs in genomic regions lacking prior annotation. Among those, mRNA transcripts of 12 pORFs were confirmed by transcriptomic analysis, suggesting additional ORFs compared to the current genome annotation (Supplementary Table 10). The current genome annotation still contains 2,087 gene loci that are listed as “predicted”, i.e., without any experimental verification. Over 42% (878) of these predicted gene loci were mapped onto pORFs, suggesting they were translated into proteins under growth conditions applied (Supplementary Table 9).

Determination of transcription unit architecture

By using the organizational components, we defined 3,138 modular units of the E. coli genome representing potential transcription units (Fig. 3). Each modular unit contains information on (i) promoter regions, (ii) transcription start sites (TSSs), (iii) transcribed regions, and (iv) ORFs, consisting of pORFs and currently annotated ORFs (Table 1, Supplementary Table 11). The modular unit defined based on this data is different from the classic definition of an operon, since operons do not allow for nested TUs. We consequently determined the transcription unit (TU) architecture of the E. coli genome that results from condition-dependent combination of the modular units. In general, a TU in a bacterial genome is defined as having multiple ORFs that are transcribed from one promoter to synthesize a single mRNA transcript. Conceptually, expression levels of multiple modular units within a single TU remain constant without an expression gap between them, assuming an absence of differential mRNA degradation28.

Figure 3
Modular units

These criteria allowed assembling modular units to determine TU architecture at a genome-scale using the change point detection algorithm29. One TU can be identified from a series of contiguous modular units based upon their transcription termination position. On the other hand, multiple TUs can be obtained from a single modular unit, if it contains multiple TSSs (Fig. 4a). In total, we determined 4,661 TUs, of which 3,946 (~86%) were fully supported by all organizational components (Fig. 4c, Supplementary Table 12). This represents an increase of > 530% compared to the experimentally validated 875 TUs (Supplementary Table 13)22. While 72 TUs (~8%) were not determined in our analysis due to a lack of identified TSSs, a total of 1,786 TUs (~72%) were consistent with computationally predicted TUs (Supplementary Table 14). Each of the 4,661 TUs is comprised of an average of 1.1 modular units with the largest TU (TU-0061) containing nine modular units equivalent to 16 ORFs (Supplementary Table 12). A total of 3,010 TUs (~65%) are monocistronic, while 1,652 TUs contain more than one ORF (polycistronic). 398 TUs (~9%) were comprised of multiple modular units that are nested within each other, defining a convoluted genome structure (Fig. 4a). This nested TU architecture might therefore increase the flexibility of expression states of bacterial genomes without increasing genome size.
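How a single transcribed region with several TSSs gives rise to nested TUs can be sketched as follows. This is a simplified illustration that pairs every TSS with one shared termination position; it does not reproduce the change point detection step, and the function name and coordinates are hypothetical.

```python
def enumerate_tus(tss_positions, termination, strand="+"):
    """Pair each TSS with a shared termination position to list nested TUs.

    Returns (left, right) genomic intervals, ordered from the longest TU
    to the shortest. TSSs lying beyond the termination position are skipped.
    """
    if strand == "+":
        return [(tss, termination) for tss in sorted(tss_positions) if tss < termination]
    if strand == "-":
        return [
            (termination, tss)
            for tss in sorted(tss_positions, reverse=True)
            if tss > termination
        ]
    raise ValueError("strand must be '+' or '-'")


# Toy coordinates: an upstream promoter and an internal promoter sharing one end,
# analogous to a full-length operon transcript and a shorter internal transcript.
print(enumerate_tus([100, 1500], 5000))
```

The second interval is fully contained in the first, which is the nested arrangement that the classic operon definition does not allow for.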

Figure 4
Determination of transcription units (TUs) and use of alternative TSSs


We integrated multiple genome-scale measurements to elucidate the transcription unit architecture of the E. coli K-12 MG1655 genome in an iterative fashion. Since the organizational components were determined from three growth conditions (i.e., three iterations of the process in Fig. 1), the elucidation of the E. coli TU architecture will not be complete. Nevertheless, the identification of 4,661 TUs in this study presents a comprehensive map that provides useful information for the genome-scale elucidation of condition-specific transcription of TUs through different RNAP holoenzymes. The TU architecture also allows for the elucidation of condition-specific transcription initiation through alternative TSSs regulated by transcription factors, as well as for the investigation of the bacterial promoter structure. We give two examples of how the new TU structure will guide the reconstruction of the genome-scale transcriptional regulatory network.

First, TUs can be transcribed in a condition-dependent manner through alternative sigma factor use. For example, the thrLABC operon is regulated by transcriptional attenuation, which is modulated by the availability of charged isoleucyl- and threonyl-tRNA30. However, we found that additional promoters located in front of thrB separately regulate thrBC under stationary growth phase (Fig. 4a). We therefore assume the involvement of different RNAP holoenzymes (i.e., different sigma factors involved) to regulate the additional promoter under these conditions. For validation, we measured σ70 and σS holoenzyme (Eσ70 and EσS) occupancy within the promoter regions using ChIP followed by quantitative PCR under exponential and stationary growth conditions (Fig. 4b). As expected, the background level of EσS occupancy in exponential growth phase reflects the lack of EσS-mediated transcription. In stationary growth phase, however, the levels of EσS occupancy increased significantly. On the other hand, background levels of Eσ70 occupancy were found in both growth conditions.

Second, the reconstruction of the transcription regulatory network has so far been hampered by the lack of comprehensive TU information, which is now available. A previous report showed that one of E. coli's global transcription factors, Lrp, negatively regulates the thrLABC operon, and that the extent of repression significantly increases under nutritional shortage conditions, such as stationary growth conditions15. However, σS positively regulates the previously unknown TU (i.e., thrBC) within that operon. As demonstrated by this second example, the TU architecture presented here provides a detailed delineation of the alternative and condition-dependent combination of the modular units to form TUs that were previously unknown, and, in this case, represents nested use of the operon structure as now defined.

We observed that ~35% of promoter regions contain multiple TSSs, indicating the presence of alternative TSSs for large portions of the E. coli TU architecture (Fig. 4c). To examine whether the conditional use of alternative TSSs can be observed, we compared well-known TUs with the TSSs determined from this study. For example, the stpA gene and the livKHMGF operon encoding an H-NS-like DNA-binding protein and the leucine ABC transporter complex both have multiple experimentally verified TSSs. In the case of the stpA promoter, we detected the dominant TSS (2,796,558), which is highly activated by the transcription factor Lrp15,22. The two other TSSs (2,796,578 and 2,796,600) are therefore likely to be less utilized under the growth conditions used in this study. On the other hand, we observed two confirmed TSSs from the promoter region of livKHMGF operon. While the TSS (3,595,753) is dominantly utilized to transcribe the operon, the transcription factor Lrp apparently represses the other TSS (3,595,778) (Fig. 4d, Supplementary Fig. 8). The transcription factors Fnr and ArcA repress the transcription of the nuo operon, encoding the NADH:ubiquinone oxidoreductase complex, under anaerobic conditions9. Since the transcription factors are inactivated under aerobic conditions, we confirmed the existence of one TSS released from the repression (Fig. 4e). As demonstrated here, the alternative TSSs are utilized to regulate the bacterial transcriptome in response to different environmental stimuli. This phenomenon is likely to be tightly linked with the transcription regulatory network. Furthermore, the comprehensive location maps of transcription factors determined in vivo (e.g., by means of ChIP-chip or ChIP-seq) are needed to fully evaluate this phenomenon. The use of alternative TSSs seems to be a common feature for transcriptional regulation in E. coli and might be widespread in bacteria.

Translational efficiency in eukaryotes and bacteria is often controlled by RNA-binding proteins, noncoding regulatory RNAs (ncRNAs), endoribonucleases, the 30S subunit, and structural rearrangements within 5′ untranslated regions (5′UTR)31. Knowing sequences and lengths of 5′UTRs is therefore important in order to understand transcription regulation, mRNA transcript stability, and ultimately translational efficiency. 5′UTR regions were obtained from the distance (bp) between each TSS and the translation start site of the first gene in the TU (Supplementary Table 12, Supplementary Fig. 9). The median length of 5′UTR was around 36 bp. The majority of TSSs (~93%) fall within 300 bp from the translation start site. In yeast, genes with longer UTRs fall into categories that require regulation, whereas genes with short UTRs seem to fall into categories with a reduced need for posttranscriptional regulation, such as housekeeping genes23. However, when we compared 5′UTR length to functional categories (Supplementary Fig. 9)22 no differences were detected. This result indicates that in bacteria, binding regions of regulatory elements (sigma factors, transcription factors, and regulatory RNAs) within promoter regions should intensively overlap. Furthermore, genome-wide 5′UTR sequence profiles will foster further understanding of translational efficiency in prokaryotes. An additional aspect of the genome-scale TU architecture is that it will accelerate the investigation of the core promoter elements (e.g., -10 (or extended -10), -35, and a spacer region) at the genome-scale.
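The 5′UTR length calculation is a strand-aware distance between a TSS and the translation start of the first gene in the TU. The sketch below assumes both are given as genomic coordinates on the reference sequence; the function name and the toy coordinates are illustrative only.

```python
from statistics import median


def utr5_length(tss, translation_start, strand):
    """Distance in bp from the TSS to the translation start on the given strand."""
    if strand == "+":
        length = translation_start - tss
    elif strand == "-":
        length = tss - translation_start
    else:
        raise ValueError("strand must be '+' or '-'")
    if length < 0:
        raise ValueError("TSS lies downstream of the translation start")
    return length


# Toy (TSS, translation start, strand) triples, not taken from the dataset.
pairs = [(100, 136, "+"), (5000, 4980, "-"), (200, 500, "+")]
lengths = [utr5_length(*p) for p in pairs]
print(lengths, median(lengths))
```

Reporting the median rather than the mean keeps a few very long 5′UTRs from dominating the summary.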

Conceptually, the transcription regulatory network consists of nodes (TUs and regulatory proteins) and links (their interactions)32. The genome-scale elucidation of the TU architecture, represented here, greatly advances the transcription regulatory network reconstruction effort by providing a nearly complete set of nodes, that, in turn, now requires condition-dependent location-analysis of sigma factors, transcription factors, and other participating components for its full reconstruction.
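The node-and-link view can be written down as a small data structure. The class below is a hypothetical sketch, populated only with the two thr interactions discussed above (Lrp repressing thrLABC and σS activating thrBC); it is not a reconstruction of the full network.

```python
from collections import defaultdict


class RegulatoryNetwork:
    """Nodes are TUs and regulators; links are signed regulator-to-TU interactions."""

    def __init__(self):
        self.links = defaultdict(list)  # regulator -> list of (tu, effect)

    def add_link(self, regulator, tu, effect):
        if effect not in {"+", "-"}:
            raise ValueError("effect must be '+' (activation) or '-' (repression)")
        self.links[regulator].append((tu, effect))

    def regulators_of(self, tu):
        """Return sorted (regulator, effect) pairs acting on a TU."""
        return sorted(
            (regulator, effect)
            for regulator, targets in self.links.items()
            for target, effect in targets
            if target == tu
        )


net = RegulatoryNetwork()
net.add_link("Lrp", "thrLABC", "-")
net.add_link("sigmaS", "thrBC", "+")
print(net.regulators_of("thrBC"))
```

Treating thrBC as its own node, separate from thrLABC, is what lets the two opposing regulatory inputs be recorded without conflict.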

Taken together, the extensive experimental results presented demonstrate how the organizational components of the bacterial genome can be experimentally obtained. The determination of the components requires multiple genome-scale measurements and their iterative and systematic integration (Fig. 1). The determination of organizational components for the E. coli K-12 MG1655 genome notably improves our knowledge and understanding of this widely studied genome. The process developed and implemented here can be applied to other prokaryotic organisms. The result is an experimental annotation of a genome and it provides the scaffold on which the transcriptional and translational regulatory network will be built.

Online Methods

Strains and media

E. coli MG1655 cells were harvested at mid-exponential phase (OD600nm ~ 0.6) with the exception of stationary phase experiments (OD600nm ~ 1.5). Glycerol stocks of E. coli strains were inoculated into M9 complete or W2 minimal medium33 (for nitrogen-limiting condition) and cultured at 37 °C with constant agitation overnight. Cultures were diluted 1:100 into fresh minimal medium and then cultured at 37 °C to appropriate cell density. For heat-shock experiments, cells were grown to mid-exponential phase at 37 °C and half of the culture was sampled as a control. The remaining culture was transferred into pre-warmed (50 °C) medium and incubated for 10 min. For nitrogen-limiting conditions, ammonium chloride in the minimal medium was replaced by glutamine (2 g/L). For rifampicin-treated cells, rifampicin dissolved in methanol was added to a final concentration of 150 μg/mL and subsequently stirred for 20 min. Cultures were monitored by observing cell density at 600 nm to verify inhibitory effects of rifampicin.


Cells at appropriate cell density were cross-linked by 1% formaldehyde at room temperature for 25 min. Following the quenching of the unused formaldehyde with a final concentration of 125 mM glycine at room temperature for 5 min, the cross-linked cells were harvested and washed three times with 50 mL of ice-cold TBS (Tris Buffered Saline). The washed cells were re-suspended in 0.5 mL lysis buffer composed of 50 mM Tris-HCl (pH 7.5), 100 mM NaCl, 1 mM EDTA, 1 μg/mL RNaseA, protease inhibitor cocktail (Sigma) and 1 kU Ready-Lyse™ lysozyme (Epicentre). The cells were incubated at room temperature for 30 min and then treated with 0.5 mL of 2×IP buffer with the protease inhibitor cocktail. The lysate was then sonicated four times for 20 sec each in an ice bath to fragment the chromatin complexes using a Misonix sonicator 3000 (output level = 2.5). The range of the DNA size resulting from the sonication procedure was 300 – 1000 bp. 6 μL of mouse antibody (NT63, Neoclone) was used to immunoprecipitate the chromatin complex of RNA polymerase β subunit (RpoB) and DNA. For the control (mock-IP), 2 μg of normal mouse IgG (Upstate) was added into the cell extract. The remaining ChIP-chip procedures were performed as described previously15. The high-density oligonucleotide tiling arrays used to perform ChIP-chip analysis consisted of 371,034 oligonucleotide probes spaced 25 bp apart (25 bp overlap between two probes) across the E. coli genome (NimbleGen). After hybridization and washing steps, the arrays were scanned on an Axon GenePix 4000B scanner and features were extracted as a pair format by using NimbleScan™ 2.4 software (NimbleGen).


To monitor the enrichment of RNAP-binding regions prior to the microarray hybridization, quantitative real-time PCR (qPCR) against previously characterized RNAP-binding regions was performed in triplicate using iCycler™ (Bio-Rad) and SYBR green (Qiagen). The qPCR conditions were as follows: 25 μL SYBR, 1 μL of each primer (10 pM), 1 μL of IP or mock-IP DNA, and 22 μL of ddH2O. The samples were cycled to 94 °C for 15 sec, 52 °C for 30 sec, and 72 °C for 30 sec (total 40 cycles) on a LightCycler (Bio-Rad). The threshold cycle (Ct) values were calculated automatically by the iCycler™ iQ optical system software (Bio-Rad). Normalized Ct (ΔCt) values for each sample were calculated by subtracting the Ct value obtained for the mock-IP DNA from the Ct value for the IP DNA (ΔCt = Ct(IP) − Ct(mock)). To measure relative gene expression levels, synthesized cDNA was used instead of the IP DNA.
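The ΔCt normalization can be sketched in a few lines. Averaging the triplicate Ct values before subtraction and converting ΔCt to a fold enrichment of 2^(−ΔCt) are assumptions here: the conversion is a common convention that presumes roughly 100% amplification efficiency and is not stated in the text.

```python
from statistics import mean


def delta_ct(ct_ip, ct_mock):
    """Normalized Ct: mean IP Ct minus mean mock-IP Ct across replicates."""
    return mean(ct_ip) - mean(ct_mock)


def fold_enrichment(ct_ip, ct_mock):
    """Assumed conversion of delta-Ct to fold enrichment (doubling per cycle)."""
    return 2 ** (-delta_ct(ct_ip, ct_mock))


# Toy triplicate Ct values, not measured data.
ip = [22.0, 22.5, 21.5]
mock = [27.0, 27.0, 27.0]
print(delta_ct(ip, mock), fold_enrichment(ip, mock))
```

A negative ΔCt means the IP sample crossed the threshold earlier than the mock-IP, i.e., the region was enriched by the RNAP antibody.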

Identification of RNAP-binding regions

To identify RNAP-binding regions, we used the peak finding algorithm built into the NimbleScan™ software. Processing of ChIP-chip data was performed in three steps: normalization, IP/mock-IP ratio computation (log base 2), and enriched region identification. For normalization and log ratio computation, signal intensity from all arrays was mapped to a reference distribution created by taking averages of the sorted raw data and scaled to a median of 1.0. The ChIP-chip datasets exhibited strong raw reproducibility (pair-wise Pearson coefficients ≥ 0.96). Each log ratio dataset from triplicate samples was used to identify RNAP-binding region using the software (width of sliding window = 300 bp). The results from this analysis were not the binding positions (i.e., single binding peaks) but binding regions. We then calculated the median position of those regions to avoid detecting skewed position by unwanted noise. Since the median positions do not necessarily match to the probe positions of the microarray, we assigned the nearest probe positions to the median positions. Our approach to identify the RNAP-binding regions was to first determine binding locations from each data set and then combine the binding locations from at least five of the six datasets to define a binding region34. ChIP-chip experiments are usually performed using multiple replicates, and it is common to average these replicates to produce an enrichment signal that is then analyzed for binding event information. We find that different replicates often reflect non-trivial differences in molecular binding activity and that averaging can abolish strong enrichment signals or indicate binding event locations that are not supported by any individual replicate. 
Therefore, after normalizing replicates first individually and then altogether, we computed and applied a baseline correction in the form of an offset for each replicate such that an enrichment signal of one corresponded to the mean value of the non-enriched probes. All raw and processed signals, along with in-house Perl and R scripts used to process raw ChIP-chip datasets, are available from our website (http://systemsbiology.ucsd.edu/publications).
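Two of the steps above, snapping a region's central position to the nearest probe and keeping only locations supported by at least five of six datasets, can be sketched as follows. The sketch treats the "median position" of a region as its midpoint and requires exact agreement of probe positions across datasets; both are simplifying assumptions, and the function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def nearest_probe(probes, pos):
    """Return the probe coordinate closest to pos (probes must be sorted)."""
    i = bisect_left(probes, pos)
    candidates = probes[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda p: abs(p - pos))


def region_center(start, end, probes):
    """Snap the midpoint of a binding region to the nearest probe position."""
    return nearest_probe(probes, (start + end) / 2)


def consensus_sites(sites_per_dataset, min_support=5):
    """Keep binding positions found in at least min_support datasets."""
    votes = Counter()
    for sites in sites_per_dataset:
        votes.update(set(sites))  # each dataset votes once per position
    return sorted(p for p, n in votes.items() if n >= min_support)


probes = list(range(0, 1000, 25))
print(region_center(100, 330, probes))
datasets = [[200, 500], [200, 500], [200], [200, 500], [200, 500], [200]]
print(consensus_sites(datasets))
```

Voting across individual datasets, rather than averaging their signals first, avoids reporting a location that no single replicate supports.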

Transcriptome analysis

Total RNA samples were isolated using the RNeasy Plus Mini kit (Qiagen) in accordance with the manufacturer's instructions. Subsequently, 20 μg of the purified total RNA sample was reverse transcribed with 1,500 U SuperScript II reverse transcriptase (Invitrogen), 30 U SUPERase·In (Ambion), 750 ng random primer, 10 mM dNTP mixture containing 4 mM amino-allyl dUTP, 10 mM DTT and 8 μg/mL actinomycin D. Actinomycin D was used to remove antisense transcript artifacts during the cDNA synthesis31. The amino-allyl labeled cDNAs were purified with QIAquick PCR purification columns (Qiagen). Phosphate wash (5 mM KPO4 and 80% ethanol) and elution buffer (4 mM KPO4) were used to protect amino-allyl residues instead of using PE and PB buffers, respectively. The amino-allyl labeled cDNAs were subsequently incubated with Cy5 Monoreactive dyes (Amersham) to obtain Cy5 labeled cDNAs. The cDNA samples were fragmented by 0.3 U RNase-free DNaseI (Epicentre) per μg cDNA, and were then purified and hybridized onto the high-density oligonucleotide tiling microarrays. After hybridization and washing steps, the arrays were scanned on an Axon GenePix 4000B scanner and features were extracted by using NimbleScan software. The resulting pair files from experimental triplicates were then normalized using the ‘RMA analysis’ function from NimbleScan.

Determination of RNAP-guided transcript segments

Following the normalization, we employed the ‘TranscriptionDetector’ algorithm (TD) to determine probes expressed above background level20. To determine the background level, we selected negative control probes that represent non-specific background hybridization to evaluate the significance of expression of individual probes (p-value calculation). The negative control probes were randomly selected based on the median signal intensity. The purpose of negative control probes is to estimate the background, non-binding probe signal: their nucleotide sequences do not match any region of the genome, so no hybridization should occur with them. Since our array lacked dedicated negative control probes, we reasoned that there must exist probes on our array that effectively act as negative control probes. Our reasoning was based on the fact that not all of the genome is expressed in any one condition, and therefore there are probes for which no complementary transcript exists in the cell. The crucial step is identifying these probes. We did this by assuming that more of the genome is not expressed than is expressed in a particular condition. Under this assumption, the median probe value corresponds to a probe with no enrichment. Our results changed very little when we used even lower values for background signal, but did change noticeably when we used (much) higher values. These checks indicated that we had safely estimated the non-binding probe values. The microarray signals were transformed to binary absence/presence calls as one (probes expressed above background) and zero (background). However, we often observed orphan presence calls in the binary absence/presence calls obtained from the TD algorithm.
Since the orphan presence calls are most likely to be false positives from TD algorithm, we removed the orphan calls manually based on the presence calls from the opposite strand (i.e., if there are dense calls from opposite strand, we removed the orphan calls of the strand). Then, we assigned genomic coordinates of the first and last presence calls between two RNAP-binding regions to the start and end genomic coordinates of RNAP-guided transcript segment. However, in some cases, the RNAP-binding regions did not allow us to select the correct position of the first expressed probe, since we assigned the median probe position to the RNAP-binding region. Therefore, we manually assigned the first probe position to the RNAP-guided transcript segment. A minority (less than 2%) of transcribed regions lacked RNAP-binding regions (a total of 98 RNAP-guided transcript segments). We detected unlikely long RNAP-guided transcript segments and another RNAP-guided transcript segment at the opposite strand. We assumed these cases were due to the low gene expression and the failure to detect RNAP-binding regions. Therefore, we manually divided the RNAP-guided transcript segments into two segments. However, we expect that expression of those regions might increase when different growth conditions are applied (Supplementary Fig. 4). Through implementing a fixed intensity threshold (presence/absence calls) and a genomic coordinate of the RNAP-binding region, a genome-wide summary of piece-wise constant expression segments (i.e., RNAP-guided transcript segments) was obtained along with its genomic coordinates and potential promoter regions.
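The thresholding and orphan-removal logic described above can be illustrated with a minimal Python sketch. It is not the TD algorithm: TD computes p-values against negative controls, whereas this sketch simply thresholds at the median, and the orphan rule (an isolated call facing dense opposite-strand calls) is an automated stand-in for the manual curation described in the text. The function names and the `window` and `min_opposite` parameters are hypothetical.

```python
from statistics import median


def presence_calls(intensities, background=None):
    """Binary calls: 1 if a probe exceeds the background, else 0.

    Assumption: the background defaults to the median probe intensity,
    following the reasoning that most of the genome is not expressed.
    """
    if background is None:
        background = median(intensities)
    return [1 if x > background else 0 for x in intensities]


def remove_orphans(calls, opposite_calls, window=2, min_opposite=0.5):
    """Zero out isolated presence calls that face dense opposite-strand calls.

    `window` and `min_opposite` are illustrative parameters, not values
    taken from the paper.
    """
    cleaned = list(calls)
    n = len(calls)
    for i, call in enumerate(calls):
        if call != 1:
            continue
        left = calls[i - 1] if i > 0 else 0
        right = calls[i + 1] if i < n - 1 else 0
        if left or right:
            continue  # has an expressed neighbour, so not an orphan
        lo, hi = max(0, i - window), min(n, i + window + 1)
        opposite = opposite_calls[lo:hi]
        if sum(opposite) / len(opposite) >= min_opposite:
            cleaned[i] = 0
    return cleaned


def segment_bounds(calls, positions):
    """Coordinates of the first and last presence calls, or None."""
    expressed = [p for c, p in zip(calls, positions) if c == 1]
    return (expressed[0], expressed[-1]) if expressed else None


forward = [1, 1, 9, 1, 1, 1, 8, 9, 8]        # toy probe intensities
reverse_calls = [1, 1, 1, 1, 1, 0, 0, 0, 0]  # toy opposite-strand calls
positions = [100, 200, 300, 400, 500, 600, 700, 800, 900]

calls = presence_calls(forward)
cleaned = remove_orphans(calls, reverse_calls)
bounds = segment_bounds(cleaned, positions)
```

In this toy example the isolated call at position 300 is discarded because the opposite strand is densely expressed there, and the remaining calls define a segment from 700 to 900.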

Genome-scale determination of transcription start sites (TSSs)

Total RNA samples were isolated as described above. To enrich mRNA from the isolated total RNA samples, ribosomal RNA (rRNA) was removed using the MICROBExpress™ kit (Ambion) in accordance with the manufacturer's instructions35. To ligate the 5′-RNA adapter (5′-GUUCAGAGAGUUCUACAGUCCGACGAUC) to the 5′-end of mRNA, the enriched mRNA samples were incubated with 100 μM of the adapter and 4 U of T4 RNA ligase (NEB). cDNAs were then synthesized from the adapter-ligated mRNA samples by reverse transcription as described above, using random primers extended with the 3′-adapter sequence (5′-CAAGCAGAAGACGGCATACGANNNNNNNNN). The cDNA samples were amplified using a mixture of 1 μL of the cDNA, 10 μL of Phusion HF buffer (NEB), 1 μL of dNTPs (10 mM), 1 μL SYBR green (Qiagen), 0.5 μL of HotStart Phusion (NEB), and 5 pmol of primer mix (5′-CAAGCAGAAGACGGCATACGA and 5′-AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA). The PCR mixture was denatured at 98 °C for 30 sec and then cycled at 98 °C for 10 sec, 57 °C for 20 sec and 72 °C for 20 sec. The amplification was monitored on a LightCycler (BioRad) and stopped at the beginning of the saturation point. A fraction of the amplified DNA between 100 bp and 200 bp was then extracted from a 6% TBE gel after electrophoresis. Gel slices were dissolved in two volumes of EB buffer (Qiagen) and 1/10 volume of 3 M sodium acetate (pH 5.2). The amplified DNA was ethanol-precipitated and resuspended in EB buffer. A second PCR amplification was carried out to amplify the DNA libraries to a total final mass of up to 1 μg with as few PCR cycles as possible. The final amplified DNA libraries were purified using a QIAquick PCR purification column and eluted in 35 μL EB buffer. The samples were then quantitated on a NanoDrop 1000 spectrophotometer.

Sequence data processing and mapping

Because sequence reads obtained from an Illumina Genome Analyzer become more error-prone towards the 3′-end, all reads were truncated to 25 bp. These truncated reads were then aligned onto the E. coli MG1655 genome (NC_000913) using BLAT with the following arguments: stepsize = 1, tilesize = 12, minmatch = 1. Only reads that aligned to a single genomic location were retained. Finally, the genomic coordinates of the 5′-ends of these uniquely aligned reads were defined as TSSs, which were then mapped onto the 5′-ends of the RNAP-guided transcript segments with the following criteria: window size = 200 bp, cutoff = 60%.
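As an illustration of the read-processing steps, the following Python sketch truncates reads, keeps only uniquely aligned reads, and tallies candidate TSS positions from their 5′-ends. The alignment tuple layout `(read_id, strand, start, end)` with 1-based inclusive coordinates is a hypothetical simplification and not the BLAT output format; the window and cutoff criteria for mapping onto transcript segments are not reproduced, since the text does not define them precisely.

```python
from collections import Counter

READ_LENGTH = 25  # reads are truncated to 25 bp, as stated in the text


def truncate(read, length=READ_LENGTH):
    """Keep only the 5'-most bases of a read."""
    return read[:length]


def unique_alignments(alignments):
    """Retain reads that align to exactly one genomic location."""
    by_read = {}
    for aln in alignments:
        by_read.setdefault(aln[0], []).append(aln)
    return [hits[0] for hits in by_read.values() if len(hits) == 1]


def five_prime_end(strand, start, end):
    """Genomic coordinate of the read's 5'-end on the given strand."""
    return start if strand == "+" else end


def tss_counts(alignments):
    """Count uniquely aligned reads supporting each (strand, position)."""
    counts = Counter()
    for _, strand, start, end in unique_alignments(alignments):
        counts[(strand, five_prime_end(strand, start, end))] += 1
    return counts


# Toy alignments (hypothetical coordinates); read r4 maps to two places.
toy = [
    ("r1", "+", 100, 124),
    ("r2", "+", 100, 124),
    ("r3", "-", 500, 524),
    ("r4", "+", 10, 34),
    ("r4", "+", 900, 924),
]
counts = tss_counts(toy)
```

Here the ambiguous read r4 is discarded, two reads support a forward-strand 5′-end at position 100, and one read supports a reverse-strand 5′-end at position 524.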

Predicting potential ORFs (pORFs) and mapping them onto RNAP-guided transcript segments

Proteomics data from cells grown to log phase, under heat-shock conditions, and to stationary phase were obtained using LC-FTICR mass spectrometry as described before24,36. These proteomics data were analyzed by SEQUEST to match MS/MS spectra against a stop-to-stop peptide database. To generate this database, the E. coli genome sequence (NC_000913) was computationally segmented into stop-to-stop fragments, each delimited by two adjacent stop codons, in all six translational frames and translated into peptides. The peptides were then chunked into 10-mer oligopeptides, retaining genomic position and frame information. The proteomics analysis yielded a total of 54,549 peptides, covering ~59% of currently annotated ORFs. To predict all potential ORFs (pORFs) in the E. coli genome, we identified all stop codons (TAG, TAA, and TGA) across the entire genome and then assigned the first occurring start codon (ATG, GTG, or TTG) between two adjacent stop codons in the same frame (the maximally extendable ORF)26,37. This process yielded a total of 156,781 maximally extendable ORFs from 439,680 start codons and 359,212 stop codons in all six translational frames (Supplementary Table 7). It should be noted that start codon preference and the length of the maximally extendable ORF were not considered in generating them. In a functional classification of ORFs, the coverage of annotated proteins (~52%) was higher than that of hypothetical proteins (~35%). Finally, the maximally extendable ORFs containing at least one in-frame peptide from proteomics data (from this study and a publicly available source12) were considered preliminary pORFs. A total of 131 peptides (~0.3%) were removed because they did not map to any maximally extendable ORF. Although these 131 peptides were obtained uniquely from the mass spectrometry analysis, the unique peptides may still include false positives.
Therefore, we examined the difference between the filtered observation counts of mapped unique peptides and those of unmapped ones. The filtered observation count of the unmapped peptides was significantly lower (up to ~37 counts) than that of the mapped ones (up to ~63,000 counts), suggesting that the unmapped peptides are most likely measurement errors (i.e., false positives from mass spectrometry analysis). This analysis yielded 3,247 preliminary pORFs. However, we often observed multiple, largely overlapping pORFs from different translational frames. We therefore compared the peptides mapped onto the overlapping pORFs, reasoning that bona fide pORFs contain multiple peptides detected at high frequency. As another criterion, we used mRNA transcript profiles to infer the translation directionality (i.e., translated strand) of the overlapping pORFs (Supplementary Fig. 6). This stringent analysis removed a total of 790 unique peptides. A total of 921 peptides (131 peptides from the maximally extendable ORF mapping + 790 peptides from the above stringent test) were considered false positives, suggesting that the false discovery rate (FDR) was < 2%. This analysis yielded 2,542 pORFs (FDR < 2%). To determine pORFs in the same TU, we mapped each pORF to an RNAP-guided transcript segment using their genomic positions.
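The definition of a maximally extendable ORF can be made concrete with a short Python sketch: in each frame, the first start codon after a stop codon opens an ORF that runs to the next in-frame stop codon. This is a sketch under simplifying assumptions rather than the in-house scripts: the sequence is treated as linear (its 5′ boundary stands in for an upstream stop codon), indices are 0-based, and peptide hits are represented only by the genomic index of their first codon. All function names are hypothetical.

```python
STOPS = {"TAG", "TAA", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement, used to scan the three reverse-strand frames."""
    return seq.translate(COMPLEMENT)[::-1]


def maximal_orfs(seq):
    """Return (frame, start, stop) for the three frames of one strand.

    `start` is the index of the first start codon after the previous
    in-frame stop; `stop` is the index of the terminating stop codon.
    """
    orfs = []
    for frame in range(3):
        first_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if first_start is not None:
                    orfs.append((frame, first_start, i))
                first_start = None
            elif codon in STARTS and first_start is None:
                first_start = i
    return orfs


def supported_orfs(orfs, peptide_starts):
    """Keep ORFs containing at least one in-frame peptide hit."""
    return [
        (frame, start, stop)
        for frame, start, stop in orfs
        if any(start <= p < stop and (p - start) % 3 == 0
               for p in peptide_starts)
    ]


toy = "ATGAAATTGCCCTAG"            # ATG AAA TTG CCC TAG in frame 0
forward = maximal_orfs(toy)
reverse = maximal_orfs(reverse_complement(toy))
kept = supported_orfs(forward, [3])     # peptide starting at codon AAA
dropped = supported_orfs(forward, [7])  # out-of-frame hit
```

Note that the internal TTG in the toy sequence does not open a second ORF, because only the first start codon after a stop is used; start codon preference and ORF length play no role, as in the text.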

Determination of transcription units

To determine the transcription units (TUs), we first assembled the modular units based on the break points obtained from the change point detection algorithm29. A total of 61 modular units (< 2%) obtained from the current annotation lacked any experimentally determined organizational components. This suggests that specific growth conditions are required to determine the organizational components of these modular units. For example, one modular unit contains the rha operon, which encodes enzymes involved in rhamnose metabolism and requires rhamnose as an environmental cue21.
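To illustrate what break points in an expression profile look like, the Python sketch below applies plain recursive binary segmentation to a toy signal. This is deliberately not the circular binary segmentation algorithm cited above: it is a simpler stand-in that splits a segment wherever the split most reduces the within-segment sum of squared errors. The `min_gain` and `min_size` parameters and the function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the segment mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def break_points(values, min_gain=1.0, min_size=2, offset=0):
    """Recursively split `values` at the most informative change point.

    Returns sorted indices at which a new segment begins. A split is
    accepted only if it reduces the total SSE by at least `min_gain`.
    """
    base = sse(values)
    best_gain, best_index = 0.0, None
    for i in range(min_size, len(values) - min_size + 1):
        gain = base - sse(values[:i]) - sse(values[i:])
        if gain > best_gain:
            best_gain, best_index = gain, i
    if best_index is None or best_gain < min_gain:
        return []
    left = break_points(values[:best_index], min_gain, min_size, offset)
    right = break_points(values[best_index:], min_gain, min_size,
                         offset + best_index)
    return left + [offset + best_index] + right


# Toy log-intensity profile with three expression levels.
profile = [1, 1, 1, 1, 5, 5, 5, 5, 2, 2, 2, 2]
points = break_points(profile)
```

For the toy profile the sketch reports new segments beginning at indices 4 and 8, which correspond to the two changes in expression level.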

Data visualization and availability

All data were visualized using either SignalMap (NimbleGen) or IGB. All raw and processed data are available at http://systemsbiology.ucsd.edu/publication along with in-house Perl and Python scripts used for data processing.

Supplementary Material

Supplementary Figs. 1–9

Supplementary Tables 1–14


The authors thank Derek Lovley at University of Massachusetts, Amherst for his insightful discussion and Marc Abrams for editing the manuscript. Proteomics experiments were performed using EMSL, a national scientific user facility sponsored by the Department of Energy's Office of Biological and Environmental Research and located at Pacific Northwest National Laboratory. This work was supported by National Institutes of Health and by the Office of Science (BER), U. S. Department of Energy.


Supplementary Information is linked to the online version of the paper at www.nature.com/naturebiotechnology/.

Author Contributions B.K.C., K.Z., Y.Q., E.M.K. and B.Ø.P. conceived and designed experiments. B.K.C., Y.S.P., Y.G. and E.M.K. performed genome-scale experiments. All data analyses were performed by B.K.C., K.Z., Y.Q., Y.S.P. and C.L.B. The manuscript was written by B.K.C., K.Z. and B.Ø.P.

Author Information Raw data are available from the Gene Expression Omnibus (GEO) under accession number (awaiting accession numbers, will be updated) and our website (http://systemsbiology.ucsd.edu).

Reprints and permissions information is available at www.nature.com/reprints.

The authors declare no competing financial interest.


1. Entrez Genome Project. National Center for Biotechnology Information, NIH; Bethesda: 2008. Available at http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi.
2. MacLean D, Jones JD, Studholme DJ. Application of ‘next-generation’ sequencing technologies to microbial genetics. Nat Rev Microbiol. 2009;7:287–296.
3. Faith JJ, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007;5:e8.
4. Graham R, Graham C, McMullan G. Microbial proteomics: a mass spectrometry primer for biologists. Microb Cell Fact. 2007;6:26.
5. Medini D, et al. Microbiology in the post-genomic era. Nat Rev Microbiol. 2008;6:419–430.
6. Xia Q, et al. Protein abundance ratios for global studies of prokaryotes. Proteomics. 2007;7:2904–2919.
7. Fleischmann RD, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995;269:496–512.
8. Reed JL, Famili I, Thiele I, Palsson BO. Towards multidimensional genome annotation. Nat Rev Genet. 2006;7:130–141.
9. Cho BK, Knight EM, Barrett CL, Palsson BO. Genome-wide analysis of Fis binding in Escherichia coli indicates a causative role for A-/AT-tracts. Genome Res. 2008;18:900–910.
10. Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 2008;36:6688–6719.
11. Grainger DC, et al. Studies of the distribution of Escherichia coli cAMP-receptor protein and RNA polymerase along the E. coli chromosome. Proc Natl Acad Sci USA. 2005;102:17693–17698.
12. Ishihama Y, et al. Protein abundance profiling of the Escherichia coli cytosol. BMC Genomics. 2008;9:102.
13. Typas A, et al. High-throughput, quantitative analyses of genetic interactions in E. coli. Nat Methods. 2008;5:781–787.
14. Feist AM, et al. Reconstruction of biochemical networks in microorganisms. Nat Rev Microbiol. 2009;7:129–143.
15. Cho BK, et al. Genome-scale reconstruction of the Lrp regulatory network in Escherichia coli. Proc Natl Acad Sci USA. 2008;105:19462–19467.
16. Crick F. Central dogma of molecular biology. Nature. 1970;227:561–563.
17. Campbell EA, et al. Structural mechanism for rifampicin inhibition of bacterial RNA polymerase. Cell. 2001;104:901–912.
18. Herring CD, et al. Immobilization of Escherichia coli RNA polymerase and location of binding sites by use of chromatin immunoprecipitation and microarrays. J Bacteriol. 2005;187:6166–6174.
19. Choi PJ, Cai L, Frieda K, Xie XS. A stochastic single-molecule event triggers phenotype switching of a bacterial cell. Science. 2008;322:442–446.
20. Halasz G, et al. Detecting transcriptionally active regions using genomic tiling arrays. Genome Biol. 2006;7:R59.
21. Power J. The L-rhamnose genetic system in Escherichia coli K-12. Genetics. 1967;55:557–568.
22. Keseler IM, et al. EcoCyc: a comprehensive view of Escherichia coli biology. Nucleic Acids Res. 2009;37:D464–D470.
23. David L, et al. A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci USA. 2006;103:5320–5325.
24. Zimmer JS, Monroe ME, Qian WJ, Smith RD. Advances in proteomics data analysis and display using an accurate mass and time tag approach. Mass Spectrom Rev. 2006;25:450–482.
25. Rudd KE. EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res. 2000;28:60–64.
26. Jaffe JD, Berg HC, Church GM. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004;4:59–77.
27. Ansong C, et al. Proteogenomics: needs and roles to be filled by proteomics in genome annotation. Brief Funct Genomic Proteomic. 2008;7:50–62.
28. Sabatti C, Rohlin L, Oh MK, Liao JC. Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res. 2002;30:2886–2893.
29. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23:657–663.
30. Yanofsky C. Attenuation in the control of expression of bacterial operons. Nature. 1981;289:751–758.
31. Kaberdin VR, Blasi U. Translation initiation and the fate of bacterial mRNAs. FEMS Microbiol Rev. 2006;30:967–979.
32. Cho BK, Charusanti P, Herrgard MJ, Palsson BO. Microbial regulatory and metabolic networks. Curr Opin Biotechnol. 2007;18:360–364.
33. Powell BS, et al. Novel proteins of the phosphotransferase system encoded within the rpoN operon of Escherichia coli. Enzyme IIANtr affects growth on organic nitrogen and the conditional lethality of an erats mutant. J Biol Chem. 1995;270:4822–4839.
34. Bieda M, et al. Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome Res. 2006;16:595–605.
35. Reppas NB, Wade JT, Church GM, Struhl K. The transition between transcriptional initiation and elongation in E. coli is highly variable and often rate limiting. Mol Cell. 2006;24:747–757.
36. Lipton MS, et al. Global analysis of the Deinococcus radiodurans proteome by using accurate mass tags. Proc Natl Acad Sci USA. 2002;99:11049–11054.
37. Blattner FR, et al. The complete genome sequence of Escherichia coli K-12. Science. 1997;277:1453–1474.