Copyright © 2009, EMBO and Nature Publishing Group

Prevalence of transcription promoters within archaeal operons and coding sequences

1Institute for Systems Biology, Seattle, WA, USA
2Department of Biomedical Engineering and UC Davis Genome Center, One Shields Avenue, University of California, Davis, CA, USA
3Divisions of Human Biology and Clinical Research, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
4Departments of Microbiology, and Molecular and Cellular Biology, University of Washington, Seattle, WA, USA
aInstitute for Systems Biology, Departments of Microbiology, and Molecular and Cellular Biology, University of Washington, 1441 N 34th Street, Seattle, WA 98103, USA. Tel.: +1 206 732 1266; Fax: +1 206 732 1299; Email: nbaliga/at/systemsbiology.org
*These authors contributed equally to this work
†Present address: Departamento de Bioquímica e Imunologia, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, Brazil. E-mail: tiekoide/at/gmail.com

Received November 20, 2008; Accepted May 13, 2009.

This is an open-access article distributed under the terms of the Creative Commons Attribution Licence, which permits distribution and reproduction in any medium, provided the original author and source are credited. Creation of derivative works is permitted but the resulting work may be distributed only under the same or similar licence to this one. This licence does not permit commercial exploitation without specific permission.

Abstract

Despite the knowledge of complex prokaryotic-transcription mechanisms, generalized rules, such as the simplified organization of genes into operons with well-defined promoters and terminators, have had a significant role in systems analysis of regulatory logic in both bacteria and archaea.
Here, we have investigated the prevalence of alternate regulatory mechanisms through genome-wide characterization of transcript structures of ~64% of all genes, including putative non-coding RNAs in Halobacterium salinarum NRC-1. Our integrative analysis of transcriptome dynamics and protein–DNA interaction data sets showed widespread environment-dependent modulation of operon architectures, transcription initiation and termination inside coding sequences, and extensive overlap in 3′ ends of transcripts for many convergently transcribed genes. A significant fraction of these alternate transcriptional events correlate to binding locations of 11 transcription factors and regulators (TFs) inside operons and annotated genes—events usually considered spurious or non-functional. Using experimental validation, we illustrate the prevalence of overlapping genomic signals in archaeal transcription, casting doubt on the general perception of rigid boundaries between coding sequences and regulatory elements. Keywords: archaea, ChIP–chip, non-coding RNA, tiling array, transcription Introduction Systems-biology approaches have been successfully applied to construct quantitative and predictive models of biological networks (Bonneau et al, 2007; Faith et al, 2007). However, a significant amount of information is missing from these models because of incomplete parts lists (unannotated genes, non-coding RNAs (ncRNAs), poorly understood protein modifications and so on) as well as a lack of molecular detail associated with these processes. Incorporating such detail will make these models mechanistically accurate and useful for synthetic-biology approaches targeting large-scale biological-circuit re-engineering. Among the current systems-scale models most amenable for such large-scale redesign are those that describe gene-regulatory networks (GRNs). 
GRN models are usually built upon transcriptome data, in which typically genes or gene modules (with similar expression patterns and shared regulatory motifs) are associated with their transcriptional regulators through linear or Bayesian models. However, although these models can be predictive (Bonneau et al, 2007), they often rely on approximations of the transcription process and lack finer details of dynamic environment-dependent assembly of transcription complexes at each of the numerous promoters in the genome. High-density tiling arrays can be used to define transcribed regions (David et al, 2006), start sites (McGrath et al, 2007), and protein–DNA interaction sites (Reiss et al, 2008), which can be used to identify some of these missing details associated with transcriptional regulation, and thereby enable us to construct systems-scale predictive models of GRNs that are also mechanistically accurate. We recently constructed a model of an environment and gene-regulatory influence network (EGRIN) for the halophilic archaeon Halobacterium salinarum NRC-1. This model accurately predicts the transcriptional changes in 80% of all genes to new environmental and genetic perturbations (Bonneau et al, 2007). Using an integrated biclustering algorithm to identify regulons and their putative cis-regulatory motifs (Reiss et al, 2006), and a sparse regression procedure to statistically pair these regulons with their putative regulators (Bonneau et al, 2006), we were able to discover the combinatorial and conditional regulation of genes by multiple TFs and EFs (environmental factors) (Bonneau et al, 2007). Although several of the statistically inferred influences in this network were shown to be likely mediated through direct interactions with the promoters of regulated genes, a large number of influences are thought to be indirect. The logical next step is to make this quantitative and predictive network also mechanistically accurate on a systems scale. 
Construction of a mechanistically accurate systems-scale model is a reasonable expectation for Halobacterium salinarum NRC-1, as its transcription is driven by a simplified version of a eukaryotic RNA polymerase (RNAP) II (Hirata et al, 2008) in a genome with prokaryotic organization. The archaeal RNAP requires only two general transcription factors – GTFs (TATA binding protein –TBP and transcription factor B –TFB) for promoter recruitment and basal transcription initiation. Furthermore, only ~130 putative transcriptional regulators (TRs) are present among the ~2400 genes encoded in the genome of H. salinarum NRC-1 (Ng et al, 2000). A relatively small number of genes and few TFs (GTFs and TRs) together make H. salinarum NRC-1 an attractive model system for characterizing gene-regulatory mechanisms at all promoters. Notably, the combinatorial action of multiple TFBs and TBPs (H. salinarum NRC-1 possesses 6 TBPs and 7 TFBs) in defining basal promoter architecture in most archaea (Baliga et al, 2000; Facciotti et al, 2007) provides a unique opportunity to characterize dynamic conditional regulation of a large fraction of genes during cellular responses to complex changes. Here, we report a significant step toward a mechanistically accurate EGRIN model by characterizing the dynamic remodeling of the transcriptome structure of H. salinarum NRC-1 during a complex cellular response, and correlating these changes to genome-wide binding locations of 50% of all predicted GTFs as well as several specific TRs. 
By integrating diverse data types, we identified: (i) transcription start sites (TSSs) and termination sites (TTSs) for ~64% of the genes, including new and revised protein-coding genes; (ii) 61 new ncRNA candidates; (iii) 5′ and 3′ untranslated regions (UTRs) of mRNAs; (iv) functional promoters upstream and internal to coding regions; (v) instances of transcription termination inside coding sequences; (vi) mRNA populations with variable 3′-end locations; (vii) transcripts with extensive overlaps in their 3′ termini; and (viii) operon-encoding transcripts of variable length. Significantly, these findings suggest that the incorporation of mechanistic accuracy into GRN models would require genes, operons, promoters, and terminators to be treated as dynamic entities.

Results

Genome-wide protein–DNA binding data show TF binding inside genes and operons

A detailed map of genomic locations where TFs bind DNA and modulate transcription is essential to model mechanisms of gene regulation on a systems scale. Chromatin immunoprecipitation of transcription complexes coupled to microarray (ChIP–chip; Ren et al, 2000) or sequencing (ChIP–seq; Robertson et al, 2007) is a commonly used approach to construct such maps. In ChIP–chip, the resolution at which the protein–DNA binding sites (TFBSs) can be identified is often limited by the genomic spacing of the probes in the array. We used the MeDiChI algorithm (Reiss et al, 2008) to estimate precise TFBS locations and their corresponding local false discovery rates (LFDRs) from new and previously reported genome-wide ChIP–chip measurements for 11 TFs (with two or more biological replicates for each): all TFBs (TFBa, TFBb, TFBc, TFBd, TFBe, TFBf, and TFBg), one TBP (TBPb) and three TRs (Trh3, Trh4, and VNG1451C) in H. salinarum NRC-1 (see Materials and methods).
On the basis of simulations similar to those of Reiss et al (2008), with a noise model customized to mimic the data used in this study, we estimated that the positional uncertainty in TFBS locations identified by MeDiChI averaged ~50 nucleotides (nt) (1 SE) over all ChIP–chip data sets used in this study. We found that the 3072 significant (LFDR<0.1) individual TFBSs for all data sets often fell within distinct loci where at least three different TFs were observed within a ±50 nt window (P<10^−8). We therefore refined this TFBS list to a conservative set of 318 such distinct ‘multi-TF-binding loci' (hereafter, TFBS loci) throughout the genome (Table I; see Supplementary Table 1 for each locus). As we applied to each individual data set an LFDR cutoff of 0.1, which by itself is rather stringent, the joint LFDR of these 318 TFBS loci is significantly smaller than that. Although each individual TF had a significant bias toward binding in annotated intergenic regions (~60%, on average, versus ~16% expected), this fraction increased to ~70% (276) when considering the 318 TFBS loci (P~10^−31). Monte Carlo simulations of TFBSs placed only in non-coding regions in the genome with a ~50–75 nt positional uncertainty and an LFDR between 0.01 and 0.1 show that 80–85% of detected TFBSs should fall in intergenic regions (for more details, see Materials and methods). Thus, our assessment was that a small but significant fraction of the TFBS loci in our ChIP–chip data sets (as many as ~10% of the multi-TFBS loci) fell within coding regions. From here onward, we present detailed and systematic experimental validation that shows that many of these TF-binding events inside coding sequences have significant consequences on the transcriptional regulation of diverse aspects of cellular physiology.
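The grouping of individual TFBSs into multi-TF-binding loci described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the `Site` record, the function name `multi_tf_loci`, the merging rule, and the example coordinates are all assumptions made for illustration. Only the ±50 nt window and the requirement of at least three distinct TFs come from the text.

```python
from collections import namedtuple

# Hypothetical record type: one called binding site for one TF.
Site = namedtuple("Site", ["tf", "position"])


def multi_tf_loci(sites, window=50, min_tfs=3):
    """Group binding sites into loci bound by >= min_tfs distinct TFs.

    A site is a 'core' if at least `min_tfs` different TFs have a site
    within +/- `window` nt of it. Neighbouring cores are merged, and each
    locus reports the mean position and the TFs of all sites near its cores.
    This merging rule is an illustrative assumption.
    """
    ordered = sorted(sites, key=lambda s: s.position)

    def neighbours(pos):
        return [s for s in ordered if abs(s.position - pos) <= window]

    cores = [s for s in ordered
             if len({n.tf for n in neighbours(s.position)}) >= min_tfs]

    # Merge cores that lie within `window` nt of the previous core.
    groups = []
    for site in cores:
        if groups and site.position - groups[-1][-1].position <= window:
            groups[-1].append(site)
        else:
            groups.append([site])

    loci = []
    for group in groups:
        members = {n for core in group for n in neighbours(core.position)}
        loci.append({
            "center": round(sum(m.position for m in members) / len(members)),
            "tfs": sorted({m.tf for m in members}),
        })
    return loci


# Made-up coordinates: one locus with three TFs, one with two, one singleton.
example = [
    Site("TFBa", 1000), Site("TFBd", 1020), Site("TBPb", 1045),
    Site("TFBa", 5000), Site("TFBd", 5030),
    Site("Trh3", 9000),
]
print(multi_tf_loci(example))
```

In this toy input only the first cluster qualifies, because the other two positions are bound by fewer than three distinct TFs.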
Analysis of transcriptome structure shows new expression features The location of a TFBS in the vicinity of a TSS or a TTS could indicate whether a given binding event is functional, especially for the interactions localized within a gene or operon. We investigated this by systematically mapping transcript boundaries and their dynamic changes at the whole-genome level using genome-wide tiling array data and then integrating this information with the TF-binding information. We define transcriptome structure as the collection of TSSs and TTSs that together characterize transcriptional units (mono- and polycistronic mRNAs, tRNAs, rRNAs, and other ncRNAs). Sequence signatures for these features are yet to be characterized in archaea, and computational predictions based on known signatures in bacteria and eukaryotes remain error prone due to incomplete understanding of transcription processes in all organisms (Jones, 2006). Therefore, we experimentally mapped the transcriptome structure of H. salinarum NRC-1 by hybridizing total RNA (including RNA species <200 nt) to genome-wide high-density tiling arrays (60mer probes with 40 nt overlap between contiguous probes). We first applied a segmentation algorithm based on regression trees (see Materials and methods) to map transcript boundaries in cells cultured under standard laboratory growth conditions (mid-logarithmic phase, 37°C, 225 r.p.m. shaking—hereafter ‘reference RNA') (Figure 1A
H. salinarum NRC-1 presents a number of interesting switches in metabolism during growth (Facciotti et al, submitted) because of complex changes in EFs, including pH, oxygen, nutrition, and so on (Schmid et al, 2007). Although most single perturbations (radiation, oxygen, metals, and so on) affect the expression of only ~10% of all genes (Baliga et al, 2004; Kaur et al, 2006; Whitehead et al, 2006), the changes during growth resulted in differential regulation of a significantly higher proportion of genes (~63%, 1518 genes) (Figure 1B
Integration of TFBSs with transcriptome structure shows conditional modulation of operon organization Changes in transcript levels and structure are ultimately the product of gene regulation mediated by dynamic assemblies of TF–DNA complexes. This mechanistic perspective of global gene-expression dynamics can be obtained by correlating transcript boundaries with genome-wide binding locations of TFs. We estimated that at least 13 of the 318 TFBS loci were internal to predicted operons (Price et al, 2005) (see Materials and methods). This is a conservative estimate (see Table I) as only TFBS with three or more binding loci were considered. Eight of these TFBS loci fall within 50 nt of an intergenic ‘gap' between predicted open-reading frames in the operon. An interesting example is the cluster of genes involved in arginine fermentation (arcRACB), at least two of which (arcC and arcB) constitute a predicted operon (Figure 4A
This prompted us to search for operons interspersed with conditionally active promoters. For instance, we observed a TFBS internal to the operon VNG2211H-endA-trpS1 (Figure 4B Using this approach, we manually identified 78 operons with conditionally altered gene-expression levels of constituent genes. To gain a global perspective on the prevalence of conditional operons, we classified all predicted operons in the H. salinarum NRC-1 genome based on two scores (Figure 4F We were able to compute tiling and correlation scores for the 269 operons in H. salinarum NRC-1 with significant expression in the tiling-array experiments, and classified 115 (~43%) as condition dependent (Figure 4F

Interaction of TFs within coding regions is associated with transcript boundaries

The operons in H. salinarum NRC-1 usually have very short (~50 nt) or no intergenic regions between constituent annotated coding regions (‘gaps'). We found that although only 27% of the gaps between coding regions in all 299 predicted operons are longer than 20 nt, this fraction increases to 37% in the conditional operons (Supplementary Figure 3). Although this might partly explain some of the internal promoter activity, the lack of a significant intergenic region (≤20 nt) between at least 53 conditionally co-transcribed gene pairs within operons suggests the presence of alternate internal promoters within coding sequences, as illustrated with operons VNG2211H-endA-trpS1, sdhCDBA, and nirH-VNG1775C-hemA. Notably, the absence of a significant intergenic gap and the high degree of correlation in transcription profiles for these genes would have precluded discovery of their conditional co-transcription based on generally accepted rules for operon organization (Supplementary Figure 3E).

TF binding in the middle of coding sequences can also result in transcription initiation or termination internal to a single annotated protein-coding gene.
We highlight this with an example that focuses on gas-vesicle biogenesis, a hallmark response of H. salinarum NRC-1 to low toxic conditions under high cell density (Yang and DasSarma, 1990). Several TFs, including TFBd, bind internally to two distinct loci within gvpE1, a transcriptional regulator of gas-vesicle biogenesis (Scheuch et al, 2008). The binding locations of one set of TFs correlate with the termination of a transcript initiated upstream of gvpD. Moreover, relative transcript levels in a strain overexpressing TFBd show upregulation of a transcript downstream of its binding location in the second locus (Figure 5d
We note that our estimate of the prevalence of internal TFBSs as described above is conservative, given the stringent nature of our automated analysis with the inclusion of only significant multi-TFBS loci, as well as the limited range of conditions under which both transcriptome and ChIP–chip data were collected. Indeed, although 69% of the 318 TFBS loci lie in intergenic regions, a fraction of the remaining 31% (~100 sites) that fall within annotated coding regions (in particular, 42 of these, which fall >50 nt from any annotated start or stop site) are likely functional (Table I). After revising the predicted translation start sites based on the transcriptome structure as described previously, we observed 610 TFBSs over all 11 TFs (LFDR<0.1) that fell within coding regions (by >50 nt), of which 47 (7.7%) were near (within 100 nt of) a putative internal transcription break point (P~0.015 relative to randomly placed internal break points), suggesting that they might constitute functional promoters and/or terminators (Supplementary Table 7). However, using the automated procedure, about half of all detected TSSs for annotated genes were not associated with any detectable TFBSs and about half of all detected TFBSs were not associated with any identifiable TSS. Although a significant fraction of transcript boundaries might also result from alternate regulatory mechanisms, such as transcript cleavage, our inability to correlate these features to TF-binding locations might also reflect the dynamics and complexity of combinatorial TF binding and TSS selection.

Discovery of conditional promoter binding of GTFs

All ChIP–chip data described above were collected at mid to late phase of growth in batch cultures (OD600>1.0), and therefore are not specific to the conditions over the entire growth curve, which were investigated in the mRNA-expression experiments.
We investigated the effect of this condition dependence on the ChIP–chip data by comparing genome-wide binding locations of TFBd during three different phases of growth (OD600=0.3, 0.8, and 1.4). Surprisingly, even though TFBd was strongly overexpressed in all three ODs, only the ChIP–chip data obtained at OD600=0.8 (mid phase) showed highly enriched over-representation of TFBSs within intergenic regions (P~10^−19 for OD600=0.8, versus P~10^−3 for OD600=0.3 and P~10^−2 for OD600=1.4), a stringent criterion which we have used throughout this and other studies (e.g. Reiss et al, 2008) to enrich ChIP–chip TFBS selections for likely functional binding. Moreover, the locations of the strongest TFBSs were more in agreement (within 50 nt) between the early- and mid-phase (OD600=0.3 and 0.8, respectively) data (P~10^−3) and between the mid- and late-phase (OD600=0.8 and 1.4, respectively) data (P~10^−3) than between the early- and late-phase (OD600=0.3 and 1.4, respectively) data (P~0.35).

Validation of regulated transcriptional initiation from promoters in coding sequences

The proximity of chromosomal loci with multiple TFBSs and experimentally mapped TSSs within coding regions strongly indicates that the observed conditional modulation of operon organization is achieved through the activation of promoters in coding sequences. However, it is imperative to rule out alternate hypotheses, such as conditional transcript cleavage followed by differential degradation of the mRNA fragments, as such a process could also result in truncated transcript(s) with a 5′ or a 3′ boundary within a coding sequence. Verifying our precision of mapping a transcript boundary and TFBSs with other technologies (EMSA, northern blot, 5′ RACE and so on) cannot definitively refute this alternate hypothesis. Instead, the ultimate proof for functional promoters within a coding sequence is the in vivo demonstration of regulated transcription initiation at these loci.
To provide such evidence we developed a fluorescence-based assay using a fast-degrading variant of GFP (Reuter and Maupin-Furlow, 2004) and showed that the synthesis and degradation of this reporter accurately reflects dynamics observed at the level of transcription (Supplementary Figure 4). Next, we constructed GFP transcriptional fusions to assess the activity of promoters inside coding sequences of two genes encoding a tRNA endonuclease (VNG2210G) and a siroheme biosynthesis enzyme (VNG1775C) (Figure 6A
Discussion Our analysis of genome-wide protein–DNA binding sites suggested that ~10% of the multi-TFBS loci fell within coding regions. To show that these TFBS have significant functional consequences on transcriptional regulation and cellular physiology, we carried out a series of systematic experimental validations. First, we analyzed the transcriptome structure of H. salinarum NRC-1 under dynamic conditions. By correlating locations of the transcriptional units to predicted coding sequences in the genome, we were able to discover and characterize new features within the transcriptome. Next, by integrating TFBS locations with the transcriptome structure, we were able to show that some of these internal binding sites were indeed functional in the conditional modulation of operon organization, including promoters in coding regions. Finally, using transcriptional fusions to GFP reporters we provided in vivo validation of growth-phase-regulated transcription initiation from two of these promoters localized in coding sequences. We will discuss in detail each of these findings and their impact on our understanding and modeling of GRNs. New features in H. salinarum transcriptome: increasing and detailing the parts lists By analyzing the transcriptome of H. salinarum NRC-1 using high resolution and under dynamically changing growth conditions, we were able to assign TSSs to 64% of all annotated genes, TTSs to 46% of the genes and verify the expression of 203 operons. Notably, the average distance of ~24 nt between these TSSs and experimentally mapped locations of TFBSs for one or more of 11 TFs is consistent with earlier knowledge of the relative position of the pre-initiation complex from the TSS. One important outcome of mapping TSSs and TTSs was the discovery of 5′and 3′ UTRs for genes and operons. 
The observed absence of a 5′ UTR in most of the genes with an identified TSS suggests either internal SD translation signals or more likely, an alternate (eukaryotic-like) mechanism of translation initiation. In contrast, most transcripts with an experimentally detected TTS had a 3′ UTR, which was on average longer than those determined by Brenneis et al (2007). This difference can be explained by the smooth decay observed in the signal at the 3′ end of the transcripts (Figures 1A By correlating the transcriptional units contained within pairs of TSS and TTS with chromosomal coordinates of predicted genes (Ng et al, 2000) and experimentally mapped peptides from large-scale proteomics studies (Van et al, 2008), we were able to revise the translation start site for 61 genes and detect 10 new protein-coding genes (Supplementary Table 4). This highlights the importance of constant genome re-annotation on the basis of evidence presented by new high-throughput experimental data. Another important feature was the identification of 61 new putative ncRNAs in H. salinarum genome. Although the physiological roles and mechanism of action of specific ncRNAs remain to be uncovered, the significant correlation (positive or negative) between the profiles of the ncRNAs and the antisense strand (Figure 3 Mapping of all these new features of the H. salinarum NRC-1 transcriptome is expected to pave the way toward a detailed functional and mechanistic analysis of GRNs, thereby improving global models of cellular behavior. Dynamics of operon organization By observing the dynamics of the transcriptome structure, we noted that the organization of some operons seemed to be conditionally modulated. To assess the global prevalence of such operons, we devised a quantitative measure for classifying any operon as conditional, by integrating both the data from the transcriptome structure (‘tiling score') with correlations derived from expression profiles of H. 
salinarum NRC-1 genes in 719 microarray experiments (see Materials and methods). Using these measures, we classified 43% of the measured operons as condition dependent (Figure 4F The conditional modulation of the operon arcBCD (Figure 4A It is arguable whether, in some cases, our data simply refute the initial prediction of operon organization (Price et al, 2005). However, operons with low correlation and high tiling score have high probability of being conditionally co-transcribed, as there is no difference in their absolute transcript levels, suggesting that these genes are transcribed as a polycistron in some, if not in most, conditions. Many operons with low overall correlation and low tiling score still present meaningful co-expression of genes in specific conditions (see Supplementary information 1 at http://baliga.systemsbiology.net/regulatory_logic/). Likewise, many operons have high correlation but low tiling score, suggesting that they have identical relative transcript levels (and are probably co-regulated) even if their absolute transcript levels differ. On the basis of this evidence, we posit that considering operons as dynamic entities is more appropriate than refuting the initial prediction, given that the currently available data sets do not exhaust the universe of possible environmental perturbations. This raises interesting questions on how the annotation databases will evolve to represent the complicated dynamics of biological features. Physiological implications for conditional modulation of operon structures Modulation of transcript levels within certain operons suggests a change in stoichiometry or composition of subunits within protein complexes. Alternatively, it can also be a mechanism for maintaining stoichiometry of a complex with differential turnover or translation rates of specific subunits (Hayter et al, 2005). We illustrate this with two examples:
The association of a TFBS with a TSS internal to an operon strongly suggests that the operon's genes are conditionally transcribed through alternate promoters. Alternative mechanisms that could result in such behavior include post-transcriptional cleavage followed by differential turnover of mRNAs, intra-cistronic transcription termination (Adhya, 2003; Lee et al, 2008), or even the condensation state of the genome by chromatin proteins in regulating gene expression (Reeve and Sandman, 2007). Although specific consequences of this phenomenon on the function of SDH and the ABC transporter will require further investigation, these observations nevertheless challenge our assumption that the operon organization of genes encoding these complexes reflects static subunit composition and fixed stoichiometry in diverse environments. Our findings reinforce the fact that even simple prokaryotes possess complex mechanisms to fine tune the expression of genes in an operon through prevalent use of alternate promoters and/or terminators that are combinatorially induced or repressed in response to dynamically changing environments (Adhya, 2003). The significance of conditional activity of transcriptional promoters Although ~57% of the operons were classified as not condition dependent by our analysis (Figure 4F Notably, there were numerous examples where bona fide associations of TFBSs and TSSs or TTSs were not made by our conservative automated procedure (see Materials and methods). For instance, although some TFBSs presented weak binding-signal intensity, they could be manually associated with TSSs or TTSs, suggesting they too might be functional; and vice versa, weak TSSs or TTSs could be assigned to nearby TFBSs. Furthermore, by charting growth-phase-dependent changes in genome-wide binding patterns of TFBd we also provided further evidence for the conditional nature of TF binding. 
Although this shows some limitations in the coverage afforded by our automated approach and the single data set covering only growth-related transcriptional changes, it also gives a perspective on how to incorporate such valuable information with less stringent modeling approaches that integrate orthogonal evidences to mechanistically define GRNs. In other words, protein–DNA binding is probabilistic and combinatorial, depending on the relative abundance of TFBs, TBPs, and TRs under specific environmental conditions, thus, when we observe a TFBS that is weak or not associated with a transcript boundary or a strong promoter consensus (Muller et al, 2007), it does not necessarily mean that it is not functional. In fact, some weak-affinity TFBSs in yeast have been shown to be functional (Tanay, 2006). Only given sufficient experiments for localizing most TFBSs in a wide array of environments, one could map these types of conditional interactions to construct a comprehensive map of dynamic transcriptional regulatory mechanisms at all promoters. Conclusions Historically, studies of transcriptional regulation have often focused on TFBSs upstream to coding sequences. Furthermore, TFBSs inside coding regions are usually viewed as spurious or non-functional and used as a quality control metric for genome-wide ChIP–chip experiments. Computational (Ward and Bussemaker, 2008) and experimental (Lee et al, 2002) analyses of promoters restricted to intergenic regions add a significant bias in the distribution of documented TFBSs in the genome, reinforcing the overall assumption that transcriptional regulation occurs exclusively in intergenic regions. Interestingly, many studies that focus on a single gene or operon do find internal promoters (Tsui et al, 1994; Guillot and Moran, 2007). We have significantly extended these findings, by showing that the widespread and conditional affinity for several TFs does not discriminate between coding and intergenic regions. 
It also shows how a simple prokaryote can use the same set of genes in different combinations to elicit different responses according to the environmental challenge. Given recent reports of extensive TF binding within coding regions in many organisms (Tanay, 2006; Zhu et al, 2006; Yochum et al, 2007; Shimada et al, 2008), evidence is mounting to suggest that this is a general phenomenon. This is not surprising given that the genome is known to encode multiple levels of information within the same sequence (Itzkovitz and Alon, 2007)—here TF binding and gene coding. Irrespective of the specific underlying mechanism(s), our observations of widespread modulation of operon architecture, as well as transcription initiation and termination inside genes, etc. all constitute evidence that archaea can intersperse regulatory logic within their coding sequence and blur the boundaries between coding and non-coding elements. We have shown that it is possible to use new high-throughput technologies to find these biologically important instances in which transcriptional regulation does occur within coding sequences and, furthermore, that it is possible to globally characterize specific regulatory mechanisms responsible for these phenomena. Combined with new high-throughput sequencing technologies, our results will expand the view of genetic information processing that can be investigated at high resolution (Nagalakshmi et al, 2008; Wilhelm et al, 2008). These data will enable construction of mechanistically accurate models for reliable systems re-engineering of biological circuits. Materials and methods Strains, culturing, and growth conditions H. salinarum NRC-1 growth-curve experiments were conducted in CM media, in a water bath incubator at 37°C with agitation of 125 r.p.m. Reference samples were cultured under standard growth conditions (Baliga and DasSarma, 1999), at mid-log phase (OD600=~0.6), as well as all the strains used for ChIP–chip experiments (Facciotti et al, 2007). 
High-resolution tiling array construction, RNA hybridization, and reference normalization
Whole-genome high-resolution tiling arrays for H. salinarum NRC-1 were designed with e-Array (Agilent Technologies), using strand-specific 60mer probes tiled every 20 nt for the main chromosome (NC_002607) and every 21 nt for the plasmids pNRC200 (NC_002608) and pNRC100 (NC_001869), comprising a total of 244K probes, including manufacturer's controls. The microarrays were printed by Agilent Technologies and hybridized to total RNA, which was isolated using the mirVana miRNA Isolation kit (Ambion) and directly labeled with Alexa547 and Alexa647 dyes (Kreatech) (Baliga et al, 2004). We used direct chemical labeling of RNA (Baliga et al, 2004) to avoid enzymatic labeling artifacts (Perocchi et al, 2007) and enable strand-specific signals for transcribed segments. Hybridization and washing were carried out according to the array manufacturer's instructions. Arrays were scanned in ScanArray (Perkin Elmer) and spot finding was carried out using Feature Extraction (Agilent Technologies). Two biological replicates were sampled and dye-flip experiments were conducted for each sample. Resulting intensities were quantile normalized across all experiments. Log ratios were calculated for each probe (growth-curve sample/reference). The reference-RNA signals were quantile normalized and then jointly normalized by sequence content using a linear model similar to that of Johnson et al (2006). This model attempts to capture the effect on hybridization signal or efficiency from duplicate probes (cross-hybridization), G-C content, and sequence-specific factors. The ‘sequence-based’ correction was subtracted from the probe intensities and resulted in a reduction by ~10% in residual sum-of-squares between the intensities of neighboring probes.
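The quantile normalization and log-ratio steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `quantile_normalize` and `log_ratios` are hypothetical, the base-2 logarithm is an assumption (the text does not state the base), and tied intensities are ranked arbitrarily rather than averaged.

```python
import numpy as np

def quantile_normalize(intensities):
    """Quantile-normalize the columns of a (probes x arrays) matrix.

    Every array (column) is forced onto the same empirical distribution:
    the k-th smallest value of each column is replaced by the mean of the
    k-th smallest values across all columns.
    """
    x = np.asarray(intensities, dtype=float)
    order = np.argsort(x, axis=0)                      # per-array sort order
    sorted_x = np.take_along_axis(x, order, axis=0)    # each column sorted
    reference = sorted_x.mean(axis=1)                  # shared target distribution
    out = np.empty_like(x)
    # Write the reference values back in each array's original probe order.
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out

def log_ratios(sample, reference):
    """Per-probe log2(sample / reference); the base is an assumption."""
    sample = np.asarray(sample, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return np.log2(sample) - np.log2(reference)
```

After normalization every array shares the same set of intensity values, so the remaining differences between arrays reflect probe rank rather than overall signal scale.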
Interactive visualization of the data was carried out in the Gaggle Genome Browser (Bare et al., in preparation), available at http://baliga.systemsbiology.net/regulatory_logic/.
ChIP–chip experiments and analysis
ChIP–chip experiments were carried out for all TFBs (TFBa, TFBb, TFBc, TFBd, TFBe, TFBf, and TFBg), one TBP (TBPb) and three TRs (Trh3, Trh4, and VNG1451C) in H. salinarum NRC-1 using the HaloSpan array (Facciotti et al, 2007), which consists of 500 nt PCR products of successive regions of the H. salinarum NRC-1 genome. The data for all TFBs were retrieved from Facciotti et al (2007). For TFBa, TFBd, and TFBf, data for additional biological replicates were also acquired using 13-nt resolution tiling arrays on the Nimblegen platform (for TFBd, two such distinct biological replicates were acquired). The TR-encoding genes were cloned in the pMTFcmyc vector; chromatin immunoprecipitation and identification were carried out as described by Facciotti et al (2007). Binding locations were defined by applying the MeDiChI algorithm (Reiss et al, 2008) to each data set. This regression-based method deconvolves the ChIP–chip enrichment ratios along the genome by fitting them with a ‘peak profile’ model of binding events, assuming a distribution in enriched DNA fragment lengths. It was shown that MeDiChI can increase the effective resolution of TFBS locations by a factor of five relative to the probe spacing of the tiling array, even for overlapping peaks (Reiss et al, 2008). P-values reported by MeDiChI (based upon peaks detected in bootstrap-resampled data that statistically seem to contain only noise; see Reiss et al, 2008) for each data set were converted to LFDR estimates through a semi-parametric two-component mixture model (Robin et al, 2007).
A comparison of the peak intensities derived from MeDiChI for all three TFs (TFBd, TFBf, and TFBa) for which there were biological replicate measurements using both microarray platforms (500 nt resolution spotted arrays versus 13 nt resolution Nimblegen arrays) (Supplementary Figure 5) provided strong validation (with R² of 0.66, 0.52, and 0.81, respectively) of most (~500) TFBSs included in the analysis, and even for TFBSs with LFDR >0.1. The 318 multi-TFBS loci described in Results and Discussion were computed by locating peaks in a kernel density estimate (bandwidth=50 nt) of all TFBSs with LFDR<0.1 across the genome. Only density peaks generated by >2 individual TFBSs were counted. Monte Carlo simulations were used to estimate the expected fraction of TFBSs that would be detected in intergenic regions, as a function of detection positional uncertainty σ and FDR f, given that the true TFBS locations fall only in intergenic regions across the genome. In these simulations, n TFBSs were simulated by placing n(1–f) TFBSs in intergenic regions and nf in annotated coding regions, and Gaussian-distributed random offsets (±σ in nucleotides) were added to each simulated TFBS location. We used n=20 000 for our simulations. A binding event was considered internal to a transcribed or annotated coding region only if it was internally localized at a distance of >50 nt from the respective region's boundaries.
The microarray data reported in this paper have been deposited in the National Center for Biotechnology Information Gene Expression Omnibus (GEO) database (GEO accession no. GSE13150).
Identification of probes hybridizing with transcribed regions of the genome
Probes in the tiling arrays were assessed as to whether they were complementary to a region that is transcribed in one or more of the observed conditions.
We integrated the following probe measurements: (a) their log intensities in the 54 reference-RNA tiling arrays, (b) their relative changes across the growth curve in the 12 growth-curve tiling arrays, and (c) their Pearson correlations with the changes of their two neighboring probes across the growth curve (McGrath et al, 2007) into an iteratively reweighted logistic regression model that used annotated coding regions as the ‘training’ set. The resulting model was used to estimate a probability that each probe was complementary to a transcribed region.
Integrated, multivariate segmentation defines transcript boundaries
Regression trees (CART; Breiman et al (1984)) were used to partition the tiling array data (log probe intensity values) into regions of constant intensity, separated by abrupt ‘break points’, by fitting a constant value to a large, contiguous region, and recursively dividing the regions to significantly improve the residual sum-of-squares. The number of splits (and hence the complexity of the model) was determined using 100-fold cross-validation, to choose the most parsimonious model within 1σ from the optimal one. The relative likelihood of each break was estimated using 100 bootstraps with symmetric wild resampling (as in Reiss et al (2008)). Each segment was constrained to contain no fewer than five probes, restricting the procedure to detect only larger (>100 nt) segments (putative transcripts). Using the multivariate implementation of this procedure in the mvpart R library, we could apply the procedure simultaneously to all tiling arrays described above, enabling us to constrain the segments in an integrated manner using the reference-RNA and growth-curve tiling arrays, as well as the growth-curve correlations and the probe transcription probabilities. The maximum resolution of the derived transcription break points is no better than the tiling array resolution (here, 20 nt).
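The piecewise-constant segmentation idea described above can be sketched for a single intensity track as follows. This is a simplified illustration, not the mvpart procedure used in the paper: it handles one array rather than a multivariate response, and it replaces cross-validated pruning and bootstrap break likelihoods with a fixed gain threshold. The names `best_split`, `segment`, and `min_gain` are hypothetical; only the five-probe minimum segment size is taken from the text.

```python
import numpy as np

def best_split(y, min_size):
    """Find the single split of y that most reduces the residual sum-of-squares.

    Returns (index, gain), where index is the first position of the right-hand
    segment, or (None, 0.0) if no split leaves min_size probes on each side
    or no split reduces the residual sum-of-squares.
    """
    n = len(y)
    if n < 2 * min_size:
        return None, 0.0
    total_sse = np.sum((y - y.mean()) ** 2)
    csum = np.cumsum(y)          # running sums give each side's SSE in O(1)
    csum2 = np.cumsum(y ** 2)
    best_idx, best_gain = None, 0.0
    for k in range(min_size, n - min_size + 1):
        left_sse = csum2[k - 1] - csum[k - 1] ** 2 / k
        right_sum = csum[-1] - csum[k - 1]
        right_sse = (csum2[-1] - csum2[k - 1]) - right_sum ** 2 / (n - k)
        gain = total_sse - left_sse - right_sse
        if gain > best_gain:
            best_idx, best_gain = k, gain
    return best_idx, best_gain

def segment(y, min_size=5, min_gain=1.0, offset=0):
    """Recursively partition y into constant-intensity segments.

    Returns the sorted probe indices at which a new segment starts.
    min_gain is an assumed stopping threshold standing in for pruning.
    """
    y = np.asarray(y, dtype=float)
    k, gain = best_split(y, min_size)
    if k is None or gain < min_gain:
        return []
    return (segment(y[:k], min_size, min_gain, offset)
            + [offset + k]
            + segment(y[k:], min_size, min_gain, offset + k))
```

Break indices are in probe units; with probes tiled every 20 nt, each index maps to a genomic coordinate only to within the tiling resolution.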
The resulting breaks were subsequently classified into transcription ‘starts’ and ‘stops’ based upon whether the signal increased or decreased across the break in both the RNA references and the probe transcription probabilities.
Detection of putative non-coding RNAs
The procedure described above for segmenting the data and identifying transcriptional start/stop sites was constrained to omit smaller (<100 nt) transcripts. To identify putative non-coding RNAs (ncRNAs), including smaller ncRNA candidates, we individually partitioned each growth-curve sample data set (log ratios of the growth-curve samples relative to the reference RNA) using recursive partitioning trees as previously described. For each sample, P-values for each segment were computed relative to the log ratios of reference-RNA samples localized in the segment's coordinates. A segment was classified as a putative non-coding RNA if it presented a high probability of being expressed (P-value<0.05), its neighboring segments were not differentially expressed (P-value >0.05, in order to filter out possible UTR regions of genes), and no annotated gene or repeat overlapped the segment's coordinates. Further filtering was carried out through the model used to estimate a probability that each probe was complementary to a transcribed region (see section ‘Identification of probes hybridizing with transcribed regions of the genome’), and segments with P>0.05 were discarded. Many ncRNAs were complementary to repeat regions of the genome, so all duplicate ncRNA candidates were removed if they contained a 12-nt contiguous sequence similarity with any other ncRNA candidate.
Peptide atlas update
The H. salinarum NRC-1 Peptide Atlas was updated with the addition of three new experiments corresponding to cultures grown under standard conditions.
Sample preparation and mass spectrometry analyses were carried out as described by Van et al (2008), resulting in an additional 30 mass spectrometry runs and 33 986 tandem mass spectra. A new search database was constructed, including sequences from newly identified transcribed regions in the tiling array experiments. The updated version of the H. salinarum NRC-1 Peptide Atlas is available at https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/buildDetails?atlas_build_id=130, including a total of 527 mass spectrometry runs and 121 618 tandem mass spectra.
Identification of conditional operons
For each of the predicted operons obtained from Price et al (2005), three different statistics were computed in a pairwise manner over all genes in that operon: (1) the log10(P-value) for the two-sample Student's t-test of the mean levels of the probes complementary to each pair of genes; (2) the Spearman rank correlation of the gene-expression profiles across 719 microarray experiments covering several diverse environmental perturbations (oxygen (Schmid et al, 2007), transition metals (Mn, Fe, Co, Ni, Cu, and Zn; Kaur et al, 2006), UV (Baliga et al, 2004) and gamma (Whitehead et al, 2006) radiation, interaction with Dunaliella salina, growth curve in CM media and in defined media, light–dark cycle, and oxidative stress by H2O2 and paraquat (unpublished; see http://gaggle.systemsbiology.net/projects/halo/2007-04)); and (3) the Spearman rank correlation of the genes' tiling array probes over the growth curve. Thus, conditional operons (as opposed to classical operons) were identified on the basis of (1) the similarity in expression levels of the probes for each gene in the operon in the tiling array data, and (2) co-expression of the operon's genes across 719 microarrays.
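As an illustration of the pairwise correlation statistics above, the following sketch computes the Spearman rank correlation for every pair of genes in one operon and reports the smallest value, so that a single poorly correlated gene pair lowers the operon's score. The function names and the dictionary-of-profiles input are hypothetical, and the t-test statistic is not covered here.

```python
from itertools import combinations

import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def min_pairwise_spearman(profiles):
    """Smallest Spearman correlation over all gene pairs in one operon.

    profiles maps a gene name to its expression profile; all profiles must
    cover the same experiments in the same order, and the operon must
    contain at least two genes.
    """
    genes = sorted(profiles)
    return min(spearman(profiles[a], profiles[b])
               for a, b in combinations(genes, 2))
```

A value near 1 suggests the genes are co-expressed across conditions, as expected for a classical operon, whereas a low value flags an operon whose genes are at least sometimes expressed independently.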
To obtain a probability that each operon is conditional, we computed the minimum values of each of these statistics for each operon (resulting in three ‘scores’ per operon), and applied a quadratic discriminant classifier to these scores, using a set of 73 manually identified conditional operons as the training set. A probability cutoff of P=0.64 was chosen to minimize the false classification rate for the manually classified training set. This protocol resulted in a total of 123 classified conditional operons. Finally, the cumulative hypergeometric distribution was used to assess the P-value for the over-representation of TFBSs internal to the 123 conditional operons (from the ChIP–chip data for all GTRs and TFs described above) versus the number of TFBSs internal to the 176 non-conditional (classical) operons.
Construction of promoter–GFP fusions and evaluation of promoter activity
A 150–500 bp region surrounding the TSS localized in coding sequences of VNG2210G and VNG1775C was PCR amplified (primers VNG2210G-F: 5′-CGAAAACCGGATTCAAGTTC-3′ and VNG2210G-R: 5′-ATCGTGTCGTCTGTGTCGTC-3′, resulting in a 208-bp PCR product corresponding to region 1 639 638–1 639 431 in the H. salinarum main chromosome; and VNG1775C-F: 5′-CTTCGGTCGACAGGGTTATC-3′ and VNG1775C-R: 5′-TGTCGACCAATCTACGTCGC-3′, resulting in a 162-bp PCR product corresponding to region 1 314 877–1 314 716 in the H. salinarum main chromosome) and fused to a GFP coding sequence on a MevR selectable expression plasmid. H. salinarum NRC-1 cells were transformed and selected on CM agar containing 20 μg/ml mevinolin. Cells were sampled at mid-log, late-log, and stationary phases. On sampling, the cells were simultaneously washed, diluted to a nominal density of OD600 0.2, and fixed in a basal salt solution with 0.25% (w/v) formaldehyde. Cells were incubated in the fixative for 10 min at 4°C, followed by a second wash in basal salt solution.
The fixative concentration was previously determined to adequately arrest cell function while also preserving GFP fluorescence, and the combined wash steps served to remove as much of the peptone-based growth medium as possible to reduce fluorescence background. The dynamic range of the flow cytometer (InFlux, Cytopia/BD) was calibrated using fixed, non-fluorescent H. salinarum NRC-1 cells and 1-μm Y/G fluorescent beads (Polysciences, Inc). Before running on the flow cytometer, 1-μm beads were spiked into each sample to a final density of 1 × 10⁷ beads/ml (approximately ten-fold less than the nominal cell density) to serve as an internal calibration standard. The mean fluorescence from 100 000 cells in each sample was normalized relative to the bead fluorescence level.
Supplementary Table 1: (A) Significant transcription factor binding sites (TFBS, LFDR < 0.1) and (B) multi-transcription factor binding loci.
Supplementary Table 2: Transcription start sites, termination sites and untranslated 5′ and 3′ regions for annotated genes and operons that were expressed in H. salinarum NRC-1 during growth and/or reference conditions.
Supplementary Table 3: Overlapping transcripts in H. salinarum NRC-1. Overlaps greater than 20 nt were considered as significant, given the error of transcript boundary detection using tiling arrays.
Supplementary Table 4: Revision of gene start codons based on detected TSS, Peptide Atlas information and H. salinarum R-1 annotation.
Supplementary Table 5: Newly transcribed elements in H. salinarum NRC-1 genome. Genome location, estimated length of the transcript and transcript levels (log ratio) during growth.
Supplementary Table 6: Conditional operons in H. salinarum NRC-1. Minimum correlation, tiling score values and combined score are reported. Manually verified conditional operons are also indicated.
Supplementary Table 7: Transcription factor binding sites (TFBS) internal to genes that are associated with a transcript boundary.
Supplementary Materials File 1: This file contains Supplementary figures S1-5 and Supplementary table legends SI-VII.
Acknowledgments
We thank Kenia Whitehead and Sacha Coesel for helpful discussions, Dan Tenenbaum for the construction of the webpage, and Kenichi Masumura for help in the growth-curve experiments. This work was supported by grants from NIH (P50GM076547 and 1R01GM077398-01A2), DoE (MAGGIE: DE-FG02-07ER64327 and DE-FG02-07ER64327), NSF (EF-0313754, EIA-0220153, MCB-0425825, DBI-0640950) and NASA (NNG05GN58G) to NSB.
Footnotes
The authors declare that they have no conflict of interest.
References