Copyright © 2009 Meyer et al; licensee BioMed Central Ltd.

Identification of candidate structured RNAs in the marine organism 'Candidatus Pelagibacter ubique'

1 Department of Molecular, Cellular and Developmental Biology, Yale University, Box 208103, New Haven, CT 06520, USA
2 Department of Molecular Biophysics and Biochemistry, Yale University, Box 208103, New Haven, CT 06520, USA
3 Howard Hughes Medical Institute, Yale University, Box 208103, New Haven, CT 06520, USA
4 Department of Microbiology, Oregon State University, Corvallis, OR 97333, USA

Corresponding author. Michelle M Meyer: michelle.meyer/at/yale.edu; Tyler D Ames: tyler.ames/at/yale.edu; Daniel P Smith: dansmith/at/orst.edu; Zasha Weinberg: zasha.weinberg/at/yale.edu; Michael S Schwalbach: schwalbm/at/onid.orst.edu; Stephen J Giovannoni: steve.giovannoni/at/oregonstate.edu; Ronald R Breaker: ronald.breaker/at/yale.edu

Received January 6, 2009; Accepted June 16, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Metagenomic sequence data are proving to be a vast resource for the discovery of biological components. Yet analysis of these data to identify functional RNAs lags behind efforts to characterize protein diversity. The genome of 'Candidatus Pelagibacter ubique' HTCC 1062 is the closest match for approximately 20% of marine metagenomic sequence reads. It is also small, contains little non-coding DNA, and has strikingly low GC content.

Results

To aid the discovery of RNA motifs within the marine metagenome we exploited the genomic properties of 'Cand. P. ubique' by targeting our search to long intergenic regions (IGRs) with relatively high GC content. Analysis of known RNAs (rRNA, tRNA, riboswitches etc.)
shows that structured RNAs are significantly enriched in such IGRs. To identify additional candidate structured RNAs, we examined other IGRs with similar characteristics from 'Cand. P. ubique' using comparative genomics approaches in conjunction with marine metagenomic data. Employing this strategy, we discovered four candidate structured RNAs including a new riboswitch class as well as three additional likely cis-regulatory elements that precede genes encoding ribosomal proteins S2 and S12, and the cytoplasmic protein component of the signal recognition particle. We also describe four additional potential RNA motifs with few or no examples occurring outside the metagenomic data. Conclusion This work begins the process of identifying functional RNA motifs present in the metagenomic data and illustrates how existing completed genomes may be used to aid in this task. Background The discovery of many RNA sequences that do not encode proteins (non-coding RNAs or ncRNA) and have biological functions beyond those of tRNA and rRNA, has significantly expanded the known role of RNA in diverse cellular processes. Consequently, there is a growing effort to systematically identify ncRNAs utilizing both experimental and computational techniques. Experimental approaches are typically used to identify non-coding portions of an organism's genome that are actively being transcribed. These approaches are not dependent on the identification of conserved RNA sequences or secondary structures, and therefore are well-suited for the discovery of unstructured or poorly-conserved ncRNAs. However, experimental limitations can cause some RNAs to be missed, and the false-positive rate may be high due to "transcriptional noise" [1,2]. Alternatively, computational methods seek to identify evidence of conserved RNA sequences and secondary structures through comparative genomics [3,4]. 
However, such methods usually cannot be used to identify RNA motifs that may not have conserved secondary structure, are small with few base-pairing elements, or are not well-represented in genomic sequence databases. Marine metagenomic sequence data are a proven resource for the discovery of novel protein diversity and have provided additional examples for thousands of previously identified open reading frames (ORFs) with no known homologs [5]. While there have been surveys conducted with the marine metagenome to discover additional examples of known ncRNAs [6,7], there have been no studies explicitly examining these data for novel RNA motifs, in part due to unique computational challenges inherent to metagenomic datasets. Specifically, the exceedingly large amount of sequence data available (~7 billion base pairs), relatively poor annotation of protein coding regions due to a high frequency of fragmentary genes that result from short sequence reads, and comparatively high sequencing error rates make metagenomic data analysis difficult [8-10]. To circumvent many of the challenges associated with analyzing metagenomic sequence data, we have used the genome of 'Cand. P. ubique' HTCC 1062 as a starting point to discover new RNA motifs within the marine metagenome. Bacteria of the SAR11 clade, of which 'Cand. P. ubique' is a representative, are found throughout the world's oceans and are the dominant aerobic heterotrophs in marine surface waters [11]. Given its numeric advantage, genes from members of the SAR11 clade are well-represented in marine metagenomic libraries with nearly 20% of sequence reads from the Global Oceanographic Survey (GOS) matching most closely to genes present in the 'Cand. P. ubique' genome [12,13]. Only ~30% of the GOS reads could be aligned well to the 584 available reference genomes. 
The other predominant genera represented in the GOS data are Prochlorococcus, Synechococcus, Burkholderia, and Shewanella, none of which are closely related to 'Cand. P. ubique'. While alignments to every reference genome were identified, they typically showed identity to regions corresponding to large, highly conserved genes [13]. At 1.3 million base pairs, the genome of 'Cand. P. ubique' is the smallest known for a free-living organism, but it appears to encode nearly all the basic functions of Alphaproteobacteria cells [14]. The genome contains very little non-coding DNA, with a median intergenic region (IGR) length of 3 nucleotides. In addition, the organism has remarkably low GC content (29%). While evaluating nucleotide composition is usually not a viable method for identifying ncRNAs [15], in genomes with a strong AT bias, or in organisms from hyperthermophilic environments, the higher GC content necessary to maintain a stable RNA structure may be used to identify candidate ncRNAs [16-19]. 'Cand. P. ubique' offers an ideal opportunity to utilize nucleotide composition as its genome has very few long IGRs, which are generally low in GC (23% on average). In the current study we combine nucleotide composition with comparative genomics approaches to identify novel structured RNA motifs in 'Cand. P. ubique' and the marine metagenomic data. First, we demonstrate that longer, higher GC 'Cand. P. ubique' IGRs are much more likely to contain structured RNAs (rRNAs, tRNAs, etc.). Subsequently, we utilized the IGRs in 'Cand. P. ubique' with similar properties that lack assigned ncRNAs as the starting point for a comparative sequence analysis strategy that takes advantage of marine metagenomic sequences. We discovered four likely structured ncRNAs including a new riboswitch class, and three other candidate cis-regulatory motifs. In addition, we describe several other conserved IGRs that encode potential structured RNA elements.
Results

Analysis strategy

To identify potential ncRNAs in the genome of 'Cand. P. ubique', all IGRs were extracted from the 'Cand. P. ubique' genome and ranked by GC content. When 'Cand. P. ubique' IGRs are plotted by their length and percent GC, those containing annotated RNAs (rRNAs, tRNAs, riboswitches, etc.) cluster toward the top right of the graph (Figure 1).
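The extraction and ranking step described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the function names, the 0-based end-exclusive coordinate convention, and the treatment of the genome as a linear rather than circular sequence are all assumptions made for the example.

```python
def gc_percent(seq: str) -> float:
    """Percent of G and C bases in a nucleotide sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def intergenic_regions(genome: str, genes: list[tuple[int, int]]) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for every gap between annotated genes.

    Assumption: `genes` holds 0-based, end-exclusive (start, end) coordinates
    on a linear sequence. Overlapping genes simply produce no gap.
    """
    igrs = []
    prev_end = 0
    for start, end in sorted(genes):
        if start > prev_end:
            igrs.append((prev_end, start, genome[prev_end:start]))
        prev_end = max(prev_end, end)
    if prev_end < len(genome):
        igrs.append((prev_end, len(genome), genome[prev_end:]))
    return igrs


def rank_by_gc(igrs: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Sort IGRs from highest to lowest GC content."""
    return sorted(igrs, key=lambda igr: gc_percent(igr[2]), reverse=True)
```

Plotting `end - start` against `gc_percent(sequence)` for each returned region would give a length-versus-GC scatter of the kind summarized in Figure 1.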
To identify additional structured RNAs that may not be annotated, we performed BLAST searches of the remaining IGRs against the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) database [20]. Table 1 lists GC-enriched 'Cand. P. ubique' IGRs longer than 100 bp and the number of BLAST hits identified with an E-value less than 10⁻⁵ as a measure of conservation. The average number of BLAST hits for IGRs containing tRNAs is 2158, with a standard deviation of 1282. However, the average number of BLAST hits for the 'Cand. P. ubique' IGRs containing SAM-II riboswitches, which are significantly smaller than a tRNA and most commonly present in Alpha-, Beta- and Gammaproteobacteria, is approximately 500. Based on this analysis and the need for a relatively large number of BLAST hits for subsequent comparative sequence analysis algorithms, IGRs with greater than 200 BLAST hits were further screened for unannotated ncRNAs and misannotated protein coding sequence. This screening process revealed several misannotated protein coding sequences in addition to several known structured RNAs not previously annotated (Additional file 2 – Table 1). The RNA motifs identified are typically very highly ranked on our list, and include tmRNA, the RNA component of the signal recognition particle (SRP), the RNase P RNA (class A), and a number of riboswitches (Table 1).
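The screening thresholds can be expressed as a simple filter. The sketch below is illustrative only: the `IGR` record and its field names are hypothetical, the BLAST hit counts in the usage example are made up, and the 26% GC cutoff is the value quoted later in this section. Whether each boundary is inclusive or exclusive is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class IGR:
    """Hypothetical record for one intergenic region."""
    name: str
    length: int          # nucleotides
    gc_percent: float    # percent G+C over the whole IGR
    blast_hits: int      # hits with E-value below 1e-5


def passes_thresholds(igr: IGR,
                      min_length: int = 100,
                      min_gc: float = 26.0,
                      min_hits: int = 200) -> bool:
    """True if the IGR is long, GC-enriched, and well conserved."""
    return (igr.length > min_length
            and igr.gc_percent >= min_gc
            and igr.blast_hits > min_hits)


def fraction_with_known_rna(igrs: list[IGR], known: set[str]) -> float:
    """Fraction of the given IGRs whose name appears in `known`."""
    if not igrs:
        return 0.0
    return sum(igr.name in known for igr in igrs) / len(igrs)
```

Applying `passes_thresholds` and then `fraction_with_known_rna` to the surviving regions is one way to reproduce the kind of enrichment percentages reported for each successive threshold.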
Identification of SRP RNA (4.5S RNA) [21] and RNase P RNA [22,23] was very straightforward. Both are completely contained within their respective IGRs and conform to well-established consensus sequences [24]. We also easily identified a variety of RNA cis-regulatory elements known as riboswitches [25] including two representatives of the glycine riboswitch class [26] previously described in 'Cand. P. ubique' [27], two class II SAM riboswitches (SAM-II) [28] and a TPP riboswitch [29,30]. In contrast, identification of the tmRNA [31] representative was somewhat more challenging. The tmRNA eluded identification during initial screens for several reasons. First, in the genome of 'Cand. P. ubique' the flanking gene (thyX, SAR11_0010) is likely misannotated, resulting in a partial overlap of the annotated coding region with the tmRNA. While coding sequences in 'Cand. P. ubique' often overlap by several nucleotides, an in-frame methionine at position 30 of the existing annotation for the thymidylate synthase sequence is most likely the correct start site based on BLAST analysis of ThyX protein sequences. Second, the genomic sequence of the tmRNA is split and permuted relative to the mature form of the RNA in 'Cand. P. ubique'. While this feature is shared by most other Alphaproteobacteria and by some Cyanobacteria [32], it makes identification of the RNA more difficult because the region between the two sections varies in length between 75 and 125 bp [33], and the permuted model is not currently represented in the Rfam database [24]. By applying length, %GC and conservation thresholds we have significantly enriched our list of IGRs for known structured RNAs. Only 4% of all IGRs in 'Cand. P. ubique' contain known structured RNAs. Approximately 17% of IGRs greater than 100 bp contain structured RNAs, and eliminating IGRs with <26% GC increases this percentage to ~40%.
Applying the BLAST hit threshold further increases the percentage of considered IGRs containing known structured RNAs to ~75%. However, our parameter choices do exclude 2 of the 34 IGRs (6%) containing previously known RNAs. The first is a tRNA that is found within an IGR of 98 bp. We explored lowering the 100 bp threshold. However, we identified few additional candidates, and these candidates typically were very close to previously established thresholds for other parameters, further decreasing their attractiveness for comprehensive study. The second example of a known RNA we excluded using our parameters is the IGR containing a SAM-II riboswitch preceding metX (SAR11_0217), which failed to rank highly based on GC-enrichment. The IGR containing this riboswitch is 191 nucleotides long and 22.5% GC (ranked 121st in the genome based on Additional file 1). However, the SAM-II aptamer alone is 70 nucleotides long and 30% GC. An early investigation of the 'Cand. P. ubique' genome did explore ranking the IGRs by the highest percent GC within a "sliding window" of 50 nucleotides [19]. However, this did not change the rankings of 'Cand. P. ubique' IGRs significantly (R² = 0.84, Additional file 3). Thus, this additional level of complexity was not implemented for the final analysis. For those IGRs that are longer than 100 bp, greater than 26% GC, and well-conserved in the marine metagenome (Table 1) but do not contain known structured RNAs, similar sequences identified by the BLAST analysis were used as input for comparative sequence analysis algorithms employed for ncRNA discovery. For each IGR several hypothetical alignments and secondary structures were generated using a covariance model search [34]. These alignments and predicted secondary structures were then used as the starting point for homology searches of the NCBI and metagenomic sequence databases to identify additional examples [35,36].
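The "sliding window" ranking mentioned above scores each IGR by the most GC-rich 50-nucleotide stretch it contains rather than by its overall composition. A minimal sketch follows. The function name is hypothetical, and scoring an IGR shorter than the window by its whole-sequence GC content is an assumption, since the text does not say how such cases were handled.

```python
def max_window_gc(seq: str, window: int = 50) -> float:
    """Highest percent GC found in any contiguous window of `window` bases.

    Sequences no longer than the window are scored over their full length
    (an assumption made for this sketch).
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    is_gc = [1 if base in "GC" else 0 for base in seq]
    if len(seq) <= window:
        return 100.0 * sum(is_gc) / len(seq)

    # Running count of G/C bases in the current window.
    count = sum(is_gc[:window])
    best = count
    for i in range(window, len(seq)):
        count += is_gc[i] - is_gc[i - window]
        best = max(best, count)
    return 100.0 * best / window
```

An IGR such as the one preceding metX, where a compact GC-richer aptamer sits inside a longer AT-rich region, is the kind of case a windowed score is meant to highlight.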
To confirm and refine secondary-structure models and sequence alignments, all examples for a particular IGR were subsequently combined and the process repeated beginning with the covariance model search to generate an RNA secondary structure that is well-supported by a large number of representatives (100–300 unique sequences). Using this strategy, we discovered candidate structured RNA elements located 5' relative to genes encoding ribosomal proteins S2 (rpsB) and S12 (rpsL), and the signal recognition particle protein (ffh). We also found a structured RNA element associated with genes for the methionine biosynthesis proteins O-acetylhomoserine (thiol)-lyase (metY), homoserine S-methyltransferase (mmum) and betaine-homocysteine methyltransferase (bhmt) (Figure 2).

rpsB motif

We identified a likely RNA motif preceding the gene rpsB, which encodes ribosomal protein S2. The motif is present in both marine metagenomic sequences and most Alphaproteobacteria with the exception of most members of the Rickettsiaceae family (Additional file 4). In addition, we identified representatives in most Gammaproteobacteria, a few Epsilon-, Delta-, and Betaproteobacteria, Cyanobacteria, and some Firmicutes. In nearly all examples where the downstream genes can be determined, the motif precedes rpsB. However, a few precede fts, which encodes elongation factor Ts (Ef-Ts) and is often found in the same operon as rpsB [37]. The structure of the rpsB motif is shown in Figure 2A.

Cis-regulatory elements in the 5' untranslated regions (UTRs) of ribosomal protein encoding mRNAs have long been known [38]. Ribosomal proteins L1 [39], L4 [40,41], L10/L12 [42], L20 [43], S4 [44,45], S7 [46], S8 [47,48], S15 [49], and S1 [50] are known to bind mRNA sequences to control gene expression. All such sequences characterized to date are autoregulatory, where the mRNA is bound by a ribosomal protein encoded within the transcript [38].
Typically such sequences inhibit translation, although some regulate transcription [41,51]. The role of the S2 ribosomal protein in translation is not well understood. S2 binds the 30S subunit late in ribosome biogenesis and acts as a bridge between the 16S RNA and ribosomal protein S1, which is the only ribosomal protein contacting the 30S subunit through protein-protein interactions [52]. The function of S1 is similarly unclear; however, it has been implicated in translating highly structured mRNAs [53], as well as in the formation of the translation initiation complex at internal ribosome binding sites [54]. Analysis of the crystal structure of the 30S subunit from the T. thermophilus ribosome shows that S2 contacts distal regions of the 16S RNA (H26 in the body and H35–37 in the body) [55]. These regions bear no obvious resemblance to the motif we have identified. However, structural mimicry cannot be excluded. In several instances the 5' UTR of an mRNA and the ribosomal RNA bound by the same protein share similar tertiary structures despite having little or no primary or secondary structure similarity [56-59]. The region upstream of the ribosomal protein S2 was identified as a potential 5' UTR in a transcriptome analysis of Escherichia coli [60,61]. In addition, recent in vivo work in E. coli shows that the region 162 nucleotides upstream of rpsB controls an rpsB-lacZ fusion construct in response to exogenous S2 added in trans [62]. This work identified the conserved RNA structure upstream of rpsB in other Gammaproteobacteria. However, we identified a more broadly conserved motif in Alpha-, Beta-, and Deltaproteobacteria as well as Cyanobacteria and Firmicutes. In addition, the pseudoknot interaction had not previously been identified.

rpsL motif

A second putative motif in the 5' UTR of a ribosomal mRNA was identified for rpsL (encoding ribosomal protein S12), the first gene in a series of 22 genes encoding ribosomal proteins in 'Cand. P.
ubique' that are homologous to those in the E. coli str, spc, and S10 ribosomal operons. We identified over 900 representatives (659 unique sequences) of the motif in the marine metagenome in addition to the instance in 'Cand. P. ubique' (Additional file 5). The motif is consistently identified 3' of rpoC, which encodes RNA polymerase, and 5' of rpsL. The genes further downstream of rpsL are typically those identified in the 'Cand. P. ubique' operon. However, due to the length of the metagenomic sequences analyzed it is impossible to determine whether the entire series of ORFs is conserved. The motif occasionally precedes rpsG or fusA genes that directly follow rpsL in the 'Cand. P. ubique' genome. Despite extensive searching, we only identified the motif in 'Cand. P. ubique' and marine metagenomic sequence samples. The motif consists of a bulged P1 stem connecting to a three-stem junction (Figure 2B).

Several proteins encoded by this series of ribosomal protein genes in 'Cand. P. ubique' have been shown to regulate ribosomal protein expression in E. coli [40,41,46-48,62]. The str ribosomal operon (encoding ribosomal proteins S12, S7, and elongation factors G and Tu) is regulated by the binding of S7 to the transcript region between the genes for S12 and S7 [46]. Similarly, the spc operon (encoding ribosomal proteins L14, L24, L5, S14, S8, L6, L18, S5, L30, L15, and secY) is regulated by S8 binding to an mRNA structure between L24 and L5 [47,48]. The eleven-gene S10 operon (encoding ribosomal proteins S10, L3, L4, L23, L2, S19, L22, S3, L16, L29, S17) is regulated by ribosomal protein L4 binding to a 5' UTR preceding the S10 gene [40,41]. The secondary structure of the motif described here does not bear any resemblance to the regulatory motifs associated with S7, S8 and L4. Additionally, the rpsL motif is not located at the same genomic position as any of the E. coli regulatory motifs. While this series of ribosomal proteins in 'Cand. P.
ubique' essentially consists of the three separate E. coli operons, separate regulation in this organism is unlikely as the coding regions typically overlap by a few base pairs and the largest IGR is nine nucleotides. This motif is not identified outside of 'Cand. P. ubique' and the metagenomic data. However, given its genomic context and conserved secondary structure, the rpsL motif is likely a structured RNA involved with regulation of ribosomal protein expression. Considering the large number of potential candidates, we cannot predict with confidence which protein may be its binding partner.

ffh motif

We identified an RNA motif in the IGR preceding the gene ffh, which encodes the cytoplasmic protein component of the bacterial signal recognition particle (SRP). The motif is well-conserved in metagenomic sequence samples with over 600 representatives (345 unique sequences) (Additional file 6). In addition, this motif is widespread among Alphaproteobacteria occurring in all fully-sequenced representatives of the Rhodobacterales, Sphingomonadales and Rhizobiales classes. However, the ffh motif does not occur in any sequenced representatives of the Rhodospirillales or Caulobacterales classes and it is also not found in representatives of Rickettsiales other than 'Cand. P. ubique'. In nearly all examples where the downstream genes can be identified, the motif precedes ffh. This transcript has been detected by several metatranscriptomics analyses of microbial small RNAs [63,64]. The RNA motif consists of a single bulged hairpin (Figure 2C).

The signal recognition particle (SRP) is an essential RNA-protein complex conserved in all three domains of life that targets secreted proteins to the plasma membrane in eubacteria and archaea or to the endoplasmic reticulum in eukaryotes through interactions with peptide signal sequences [21].
The eubacterial SRP complex consists of the 4.5S RNA, a cytoplasmic protein (Ffh), and a receptor protein (FstY) that targets the complex to the membrane. Ffh binds directly to a conserved portion of the 4.5S RNA known as helix 8 [65], and FstY in turn binds Ffh [66,67]. The eukaryotic and archaeal SRPs typically consist of larger RNAs and a greater number of proteins. However, the interactions between the RNA component and the cytoplasmic protein are conserved [68]. How the levels of the Ffh protein and the 4.5S RNA are regulated is not fully understood. In E. coli the 4.5S RNA is present in excess compared to Ffh [69], and it has been shown using both depletion studies [70] and examination of a temperature sensitive ffh mutant in E. coli [71] that Ffh is significantly stabilized by its interactions with the 4.5S RNA and is rapidly degraded when not bound to the RNA. However, no regulation at the transcriptional or translational level has been described. The RNA motif identified does not appear to resemble the portion of the 4.5S RNA bound by Ffh. However, it is possible that the motif plays a role in the regulation of the ffh gene, especially given the widespread distribution of this motif and the precedent for cis-regulatory mRNA elements associated with the genes of RNA binding proteins [72]. Methionine biosynthesis associated motif We identified a conserved RNA motif preceding the methionine biosynthesis genes mmum, metY, and bhmt. This conserved sequence was previously identified as a potential regulatory region in 'Cand. P. ubique' as the three genes appear to be co-regulated from proteomic studies [73]. We found 690 representatives (505 unique sequences) in metagenomic sequences, most of which precede metY (Additional file 7). However, there are metagenomic examples that precede bhmt, metH, and mmum. In addition, there is a single example in the genome of Psychroflexus torquis ATCC 700755 (RefSeq accession NZ_AAPR0000000) also preceding metY. 
The motif consists of a simple pseudoknotted structure that is typically within ten nucleotides of a start codon (Figure 2D).

The RNA motif also displays some similarities to the previously described class II SAM riboswitches (SAM-II) that bind SAM and control sulfur metabolism genes in Alphaproteobacteria [28]. In particular the two RNA motifs share a similar overall pseudoknotted structure and many of the bases shown to contact the ligand in a crystal structure of the class II SAM riboswitch [75] have equivalent nucleotides in the new-found motif. Despite these similarities, the motif lacks the final 3' base-pairing element present in most SAM-II riboswitch representatives. Moreover, both paired regions in the new motif differ in length from those in the SAM-II consensus, and the loop regions outside those that bind the ligand in the SAM-II riboswitch are not well conserved. Such differences in the riboswitch aptamers for SAM-I and SAM-IV riboswitches cause representatives to be sorted into distinct collections when examined using bioinformatics search algorithms that identify common sequence and structural elements [76]. Likewise, the differences between SAM-II and the new-found motif also cause them to be sorted independently, suggesting that this is a new class of SAM-binding riboswitches that we have termed SAM-V.

Other potential RNA motifs

In addition to the motifs that we identified that have strong support as structured RNAs based on their alignments and distribution, we also identified several potential RNA motifs that are less well-supported. These candidate RNA motifs have fewer positions with covariation or compatible mutations and are not identified outside the genome of 'Cand. P. ubique' and metagenomic sequences. However, they do exhibit evidence of possible RNA structure formation and our models are supported by sequence alignments from the marine metagenome.
The first of these motifs consists of a single bulged hairpin (Figure 3A). The second motif consists of two base-paired stems in series where the loop of the second is especially well-conserved (Figure 3B). The third motif is shown in Figure 3C. The fourth motif is somewhat more complex than others in this category (Figure 3D).

Microarray studies show that transcripts for all of these genes, although not necessarily any untranslated regions, are present in 'Cand. P. ubique' in both exponential-growth and stationary-phase cells. Interestingly, comparison of microarray and quantitative proteomic data (unpublished data) for pntA shows a ~300% increase in protein as cells enter stationary phase, starkly contrasting the corresponding 9% decrease in transcript levels. This disparity between transcript and protein expression provides further evidence for post-transcriptional regulation of the gene. Unfortunately, proteomic data are not available for RhtB and BabIM (not included in the AMT-tag library), and SAR11_0636 was never observed in the proteomic dataset, so direct comparisons are not possible for these genes.

Discussion

In this study we identified structured RNAs that are conserved in both the genome of 'Cand. P. ubique' and the marine metagenomic datasets. A few of these RNAs were assigned to previously-known classes, while this is the first description of others. Our work differs from other surveys of ncRNAs in the metagenome [6,7] in that we did not seek to identify additional examples of known motifs, but rather we sought to discover motifs not previously described. We identified three likely cis-regulatory protein binding motifs and a new riboswitch class, and our approach is validated by the confirmed biological function for two of the four motifs (rpsB motif and SAM-V riboswitch). In addition to these four RNA cis-regulatory elements, we also describe a series of motifs for which there is less evidence of RNA structure.
While these RNA motifs are less well-supported by compatible and covarying mutations than the others we present, the structures are credible given the number of representatives identified, the degree of sequence conservation, and the thermodynamics of RNA folding. There are many additional IGRs in 'Cand. P. ubique' that contain a high percentage GC and seem highly conserved (Table 1), yet have no discernable RNA structure. For some of these IGRs, the large number of BLAST hits is the result of many different short aligned sections of high identity within the IGR (e.g. the IGR between SAR11_0641 and SAR11_0642). By contrast, in the IGRs where we identified convincing structured RNAs there is typically a longer region of alignment with mutations distributed throughout. For several other IGRs there are a large number of BLAST hits that align but form no detectable RNA structure (e.g. the IGR between SAR11_0037 and SAR11_0038). These regions may contain RNAs that are not extensively structured (e.g. antisense RNAs that base pair to target RNAs) [78], or perhaps they are conserved protein binding sites that act at the level of DNA. The parameters we used to identify IGRs for inspection were based on the properties of previously annotated RNAs and were designed to capture most structured RNAs. However, one IGR containing a known structured RNA does not meet our parameters for inspection. The IGR containing a SAM-II riboswitch preceding metX (SAR11_0217) failed to rank highly based on GC-enrichment. The IGR containing this riboswitch is 191 nucleotides long and 22.5% GC (ranked 121st in the genome based on Additional file 1), significantly below where we arbitrarily stopped examining IGRs due to the decreasing number of convincing BLAST matches (Table 1). However, the SAM-II aptamer alone is 70 nucleotides long and 30% GC. An early investigation of the 'Cand. P. 
ubique' genome did explore ranking the IGRs by the highest percent GC within a "sliding window" of 50 nucleotides [19]. However, this did not change the rankings of 'Cand. P. ubique' IGRs significantly (R² = 0.84, Additional file 11). Thus, this additional level of complexity was not implemented for the final analysis. In contrast to other computational genomics studies [3], we identified relatively few candidate RNAs. This is likely because there is relatively little to find in 'Cand. P. ubique' compared with organisms that have larger genomes. The genome of 'Cand. P. ubique' is hypothesized to be streamlined to minimize nutrient use [14,79]. Even the strong AT bias may reflect adaptation to nitrogen limitation in a nutrient-poor environment because GC pairs require an additional nitrogen compared to AT base pairs. A survey examining lengths of the RNase P RNA, SRP RNA, TPP and glycine riboswitches in 'Cand. P. ubique' compared with those in other Alphaproteobacteria showed that RNAs in 'Cand. P. ubique' have a tendency toward fewer nucleotides (Additional file 12). On average they are greater than one standard deviation lower than the mean for a given RNA (average Z-value of -1.12). While this result is not statistically significant, the motifs identified here further reflect this tendency. The S2 motif identified in 'Cand. P. ubique' is among the shortest with an exceedingly short P2 stem (3 bp) and no P3 stem. The presence of RNA-based regulatory motifs in 'Cand. P. ubique' indicates that such mechanisms can be an effective use of scarce resources, and the smaller RNAs likely reflect pressure to decrease the number of nucleotides at both the DNA and RNA level. Interestingly, ribosomal RNAs and tRNAs both showed less variation in length among Alphaproteobacteria than other structured RNAs, as well as less or no evidence of reduction in 'Cand. P. ubique', suggesting that it is difficult to alter RNAs with functions critical for survival.
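The average Z-value quoted above compares the length of each 'Cand. P. ubique' RNA with the distribution of lengths for the same RNA in other Alphaproteobacteria. The sketch below shows one way such a value could be computed. It is an assumption-laden illustration: the function names are hypothetical, the lengths in the usage note are invented, and the choice of the sample standard deviation (rather than the population form) is not stated in the text.

```python
import statistics


def z_score(value: float, reference: list[float]) -> float:
    """Standard score of `value` against a reference distribution.

    Uses the sample standard deviation; `reference` needs at least two values.
    """
    mean = statistics.mean(reference)
    sd = statistics.stdev(reference)
    return (value - mean) / sd


def average_z(values_and_refs: list[tuple[float, list[float]]]) -> float:
    """Mean of the Z-scores across several RNA families."""
    return statistics.mean(z_score(v, ref) for v, ref in values_and_refs)
```

For example, with invented reference lengths `[100, 110, 120, 130, 140]`, an RNA of length 100 scores about -1.26, meaning it is more than one standard deviation shorter than the reference mean.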
Conclusion

This study increased the number of candidate structured RNAs in both 'Cand. P. ubique' and the marine metagenome. Several of the RNAs discovered have wide phylogenetic distribution, while others can only be found through examination of metagenomic data. The combination of computational approaches used in this work is relatively simple and in principle might be applied to any organisms with similar properties. This work also underscores how single completed genomes that are carefully annotated are important components in the effort toward annotating and understanding the vast amount of metagenomic data available.

Methods

Identification of candidate RNA motifs

Non-protein coding segments of the 'Cand. P. ubique' genome (RefSeq accession number NC_007205.1) were computationally identified based on the RefSeq version 25 gene annotations and their sequences extracted [80]. The size and percent GC values for these regions were established. Individual sequences annotated as harboring a structured ncRNA according to the Rfam database (version 8.1) were identified [24]. Two additional sequences containing tRNAs were identified from the RefSeq annotation of the 'Cand. P. ubique' genome, and the riboswitches were located based on alignments maintained through periodic homology searches [81]. All known structured RNAs in 'Cand. P. ubique' are present in IGRs longer than 100 bp (Fig. 1).

For the remaining IGRs, all BLAST matches from the conservation analysis were collected and the sequences extended to match the length of the IGR, or to the end of the sequence read (average trimmed sequence read is 822 bp in length [13]). This collection of sequences was then used as input for CMFinder version 0.2 [34], which created multiple sequence alignments with putative conserved secondary structures.
These alignments were manually examined for features indicative of a structured RNA, such as the extent of covariation within predicted stems and conservation in areas outside base-paired regions. For most IGRs, several alternative structures were initially chosen for further analysis due to the high level of conservation in the sequences. The alignments and hypothetical secondary structures were used to search for additional homologs in the RefSeq25 database [80] along with metagenome sequences from acid mine drainage [82], soil and whale fall [83], human gut [84,85], mouse gut [86], gutless sea worms [87], sludge [88], Global Ocean Survey scaffolds [12,13], other marine sequences [89] and termite hindgut [90]. Homology searches were performed using RAVENNA version 0.2f, essentially as described previously [35,36,91,92]. For each IGR, homologs resulting from these searches were used in conjunction with the original sequences as the starting input for a second CMFinder search, and the homology search process was iterated to derive a single structure, or, in cases of predicted pseudoknot interactions, two compatible structures, supported by the alignment.
Analysis of motifs
The alignments of IGRs where convincing RNA structure could be identified were manually edited using RALEE [93]. We used RNAshapes [94], CMFinder [34] and RAVENNA [36] during these analyses. Additional homology searches were conducted using the RAVENNA '-local' and '-global' command line options with the microbial subset of RefSeq version 25, and the metagenomic sequence databases described above. As the full RefSeq database contains 3,717,469,431 nucleotides and the combined metagenomic databases total 5,529,658,033 nucleotides, several subset databases (Proteobacteria, Alphaproteobacteria, Bacteroidetes, Additional File 2 and Global Ocean Survey Scaffolds) were used to reduce the number of false positive hits.
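The covariation criterion mentioned above, where base-paired columns of an alignment change together while preserving pairing, can be made concrete with a small example. This is a simplified sketch and not the procedure used by CMFinder or in the manual inspection: it assumes a consensus structure written with '<' and '>' (or parentheses) for paired columns, treats Watson-Crick and G-U pairs as canonical, and uses a hypothetical toy alignment.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def paired_columns(structure):
    """Return sorted (i, j) column pairs from a bracket-notation structure."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch in "<(":
            stack.append(i)
        elif ch in ">)":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def covarying_pairs(alignment, structure):
    """Column pairs showing at least two canonical pairs that differ at both positions."""
    rows = [seq.upper().replace("T", "U") for seq in alignment]
    result = []
    for i, j in paired_columns(structure):
        observed = {row[i] + row[j] for row in rows} & CANONICAL
        # Covariation: both partners changed, yet pairing was maintained.
        if any(a[0] != b[0] and a[1] != b[1] for a in observed for b in observed):
            result.append((i, j))
    return result


# Hypothetical toy alignment: a 3 bp stem closing a 4 nt loop.
structure = "<<<....>>>"
alignment = [
    "GCAUUCGUGC",
    "GGAUUCGUCC",
    "GCGUUCGCGC",
]

covarying = covarying_pairs(alignment, structure)
print("paired columns:", paired_columns(structure))
print("covarying pairs:", covarying)
```

Here the outermost pair is invariant (always G-C), so it offers no covariation support, whereas the two inner pairs each switch between different canonical pairs.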
Local searches tended to have greater success identifying homologs of motifs with variable length or optional stems. For the genome context annotations, protein-coding genes were assembled from the annotations in RefSeq and from "predicted proteins" [5] in Global Ocean Survey sequences or annotated genes in IMG/M [95]. However, sequences from three metagenome projects [85,89,90] were extracted from GenBank and genes were predicted using the MetaGene program (dated Oct. 12, 2006) with default parameters [96]. Conserved protein domains were detected using the Conserved Domain Database version 2.08 [97]. The extent of covariation and conservation of sequences reflected in consensus diagrams (e.g. Figure 2) ...
List of abbreviations
IGR: intergenic region; ncRNA: noncoding RNA; GOS: Global Oceanographic Survey; CAMERA: Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; SRP: signal recognition particle; UTR: untranslated region; SAM: S-adenosylmethionine; bp: base pair.
Authors' contributions
MMM conceived and designed the study, executed bioinformatics searches, analyzed the data, and drafted the manuscript. TDA participated in the design of the study and provided bioinformatics infrastructure. DPS conceived the study, performed proteomics searches, and revised the manuscript. ZW provided bioinformatics infrastructure and reviewed motif analysis. MSS conceived the study and revised the manuscript. SJG conceived the study and revised the manuscript. RRB participated in the design of the study, reviewed motif analysis, and revised the manuscript. All authors read and approved the final manuscript.
Additional file 1: All 'Cand. P. ubique' IGRs greater than 100 bp. A list of all intergenic regions in 'Cand. P. ubique' longer than 100 bp with the length, GC content and annotated RNAs indicated. (341K, doc)
Additional file 2: Misannotated protein coding regions identified.
A list of likely misannotated protein coding regions identified in the course of this study. (38K, doc)
Additional file 3: IGR ranking by %GC and sliding window %GC. Comparison of ranking IGRs by %GC and an alternative ranking methodology based on a sliding window of 50 nucleotides. (34K, doc)
Additional file 4: rpsB alignment. Text file containing the Stockholm alignment of the rpsB motif; may be viewed in any text editor, including XEmacs with the RALEE extension or MS WordPad. (373K, txt)
Additional file 5: rpsL alignment. Text file containing the Stockholm alignment of the rpsL motif; may be viewed in any text editor, including XEmacs with the RALEE extension or MS WordPad. (118K, txt)
Additional file 6: ffh alignment. Text file containing the Stockholm alignment of the ffh motif; may be viewed in any text editor, including XEmacs with the RALEE extension or MS WordPad. (109K, txt)
Additional file 7: SAMV alignment. Text file containing the Stockholm alignment of the SAM-V motif; may be viewed in any text editor, including XEmacs with the RALEE extension or MS WordPad. (74K, txt)
Additional file 8: rhtb alignment. Text file containing the Stockholm alignment of the rhtb motif; may be viewed in any text editor, including XEmacs with the RALEE extension or MS WordPad. (130K, txt)
Additional file 9: pntA alignment. Text file containing the Stockholm alignment of the pntA motif; may be viewed in any text editor, including XEmacs with the RALEE extension or MS WordPad. (39K, txt)
Additional file 10: bablM alignment. Text file containing the Stockholm alignment of the bablM motif; may be viewed in any text editor, including XEmacs with the RALEE extension or MS WordPad. (41K, txt)
Additional file 11: SAR11_0636 alignment.
Text file containing the Stockholm alignment of the SAR11_0636 motif; may be viewed in any text editor, including XEmacs with the RALEE extension or MS WordPad. (65K, txt)
Additional file 12: RNA motifs from Alphaproteobacteria ordered by length. Glycine riboswitch, TPP riboswitch, SRP, and RNase P RNAs from Alphaproteobacteria ordered by length. (265K, doc)
Acknowledgements
We thank Dr. Ming Chen Hammond for helpful discussions, N. Carriero and R. Bjornson for assisting our use of the Yale Life Sciences High Performance Computing Center (NIH grant RR19895-02), and the Pacific Northwest National Laboratory for the quantitative proteomic analysis. The work reported here was supported in part by NIH award U54AI57158 (Northeast Biodefense Center – Lipkin). M.M.M. is supported by an NIH NRSA (F32GM079974) and the Breaker Lab also receives support from the Howard Hughes Medical Institute. Portions of this work were also supported by a Marine Microbiology Initiative investigator award from the Gordon and Betty Moore Foundation.
References