RNA. Nov 2005; 11(11): 1616–1623.
PMCID: PMC1370847

Homology modeling revealed more than 20,000 rRNA internal transcribed spacer 2 (ITS2) secondary structures

Abstract

Structural genomics meets phylogenetics and vice versa: Knowing rRNA secondary structures is a prerequisite for constructing rRNA alignments for inferring phylogenies, and inferring phylogenies is a precondition for understanding the evolution of such rRNA secondary structures. Here, both scientific worlds come together. The rRNA internal transcribed spacer 2 (ITS2) region is a widely used phylogenetic marker. Because of its high variability at the sequence level, correct alignments have to take structural information into account. In this study, we examine the extent of the conservation in structure. We present (1) the homology modeled secondary structure of more than 20,000 ITS2 covering about 14,000 species; (2) a computational approach for homology modeling of rRNA structures, which can additionally be applied to other RNA families; and (3) a database providing about 25,000 ITS2 sequences with their associated secondary structures, a refined ITS2 specific general time reversible (GTR) substitution model, and a scoring matrix, available at http://its2.bioapps.biozentrum.uni-wuerzburg.de.

Keywords: homology modeling, internal transcribed spacer 2 (ITS2), phylogeny, RNA, secondary structure

INTRODUCTION

The internal transcribed spacer 2 (ITS2) region is a standard marker in phylogenetics. However, because of its high sequence variability with an assumed conservation in structure, it is a double-edged tool for eukaryote evolutionary comparisons (Alvarez and Wendel 2003; Coleman 2003). Because of the high mutation rate acting on its primary sequence, it is currently used for species level classifications. However, its conservation in structure (four helices, with helix III as the longest) might boost its value as a marker for megasystematics (Coleman 2003; Schultz et al. 2005). For both applications, it is absolutely necessary to consider both sequence and structure when calculating alignments and resulting phylogenies. To date, there are about 90,000 ITS2 sequences in GenBank (NCBI). However, the common core secondary structure has been predicted with standard structure prediction algorithms for only about 5000 of them (out of the 70,000 ITS2 sequences available at that time; Schultz et al. 2005).

On the one hand, the remaining 85,000 ITS2 structures could, in principle, be rather different and have species or organism specific solutions for ITS2 processing. On the other hand, all available data on rRNA structure assembled to date support that rRNA processing is a very conserved process. To show and verify this for the ITS2 region requires detecting or rejecting conserved common RNA secondary structures in thousands of sequences over a large divergent phylogenetic range. We show that, in fact, 20,000 ITS2 rRNA sequences have a conserved RNA processing structure. Therefore, the data set derived in this study should constitute an optimal starting point for any ITS2 based phylogenetic analysis and clearly for ITS2 structure prediction.

RESULTS AND DISCUSSION

Homology modeling

At the time of the study, 91,873 ITS2 sequences were annotated in GenBank (NCBI), but the common core secondary structure could hitherto be predicted for only about 5000 ITS2 sequences with standard RNA folding programs (Schultz et al. 2005). Even discarding fragments and spurious sequences leaves 47,683 ITS2 sequences with unknown secondary structures. Here, we aim to predict these structures by exploiting possible homology to sequences with already assigned, correct structures, following the algorithm outlined in Figure 1. Because of the high variability at the sequence level, one is not able to construct the correct multiple alignments that constitute the training data for a stochastic context free grammar (SCFG) (Eddy and Durbin 1994). In particular, the SCFG approach would yield a consensus structure, whereas we want to derive a concrete secondary structure. To this end, we clustered all 47,683 sequences with unknown structures together with the 5003 sequences of known structures and checked each cluster for the presence of sequences with assigned structures. The best alignment hit (minimal p value) for the template sequence with its assigned structure was used to transfer the structure to the homologous sequences. To ensure high prediction performance, the structure was transferred only if >75% of all base pairs for each helix could be adopted (Fig. 2A). To improve the quality of the model, the predicted structure was post-processed. First, free base-pairings were closed following the Nussinov algorithm (Nussinov and Jacobson 1980; Fig. 2B). Second, the structure was scanned for structural elements that are not physically possible, such as closed helix loops, which were fixed (Fig. 2C). Third, sequences that were too long were cut, leading to a reannotation of the GenBank entry (Fig. 2D). This process resulted in 16,214 predicted novel structures.
An additional 21,807 sequences did show a significant sequence identity to at least one template sequence, but structure adoption did not satisfy our criteria. In a subset, the first or the last helix was totally missing, whereas for the other helices the structure could be predicted for >75% of the bases. As this could be indicative of false annotations in GenBank, the corresponding database entry was retrieved and the full sequence was aligned. For a total of 794 sequences, this approach not only allowed the prediction of the correct structure, but it also improved the original annotation (Fig. 2E). To exploit these novel structures, the whole algorithm was repeated, adding the novel structures to the templates, resulting in 5533 further predicted structures. These consist of 4745 structures that could be predicted directly from a template structure plus 788 sequences for which the annotation of the ITS2 region had to be modified. To ensure the quality and to avoid error propagation in this repeated prediction, we decided to add only structures found before the post-processing to the template database. Additionally, none of the structures derived after reannotation was used.
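
The structure transfer step with its >75% per-helix criterion can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dot-bracket representation, and the definition of a helix as all pairs enclosed by one outermost pair are assumptions made for illustration, and sequences are assumed to be uppercase RNA.

```python
# Watson-Crick and G-U pairs, the pair types the paper allows for transfer.
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(dot_bracket):
    """Return the base pairs (i, j) of a dot-bracket string, 0-based, i < j."""
    stack, pairs = [], []
    for pos, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def group_by_helix(pairs):
    """Group sorted pairs by their outermost enclosing pair (assumed helix definition)."""
    helices, outer_end = [], -1
    for i, j in pairs:
        if i > outer_end:  # this pair opens a new top-level stem
            helices.append([])
            outer_end = j
        helices[-1].append((i, j))
    return helices


def transfer_structure(aligned_template, aligned_target, template_structure,
                       min_fraction=0.75):
    """Copy template base pairs onto the target through a pairwise alignment.

    Returns the target's dot-bracket string, or None if for any helix the
    fraction of adoptable pairs is not above min_fraction.
    """
    # Map each ungapped template position to an ungapped target position.
    mapping, t_pos, q_pos = {}, 0, 0
    for t_char, q_char in zip(aligned_template, aligned_target):
        if t_char != "-":
            mapping[t_pos] = q_pos if q_char != "-" else None
            t_pos += 1
        if q_char != "-":
            q_pos += 1

    target = aligned_target.replace("-", "")
    structure = ["."] * len(target)
    for helix in group_by_helix(parse_pairs(template_structure)):
        adopted = []
        for i, j in helix:
            a, b = mapping.get(i), mapping.get(j)
            if a is not None and b is not None and (target[a], target[b]) in CAN_PAIR:
                adopted.append((a, b))
        if len(adopted) / len(helix) <= min_fraction:
            return None  # helix could not be adopted well enough
        for a, b in adopted:
            structure[a], structure[b] = "(", ")"
    return "".join(structure)
```

For example, `transfer_structure("GGGAAACCC", "GGGA-ACCC", "(((...)))")` transfers all three pairs across the gap, whereas a target in which one of the three pairs cannot form is rejected because only two thirds of the helix would be adopted.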

FIGURE 1.
A flowchart illustrating our strategy in homology modeling. At each step a set of sequences was rejected and the homology prediction of the remaining ones was refined. Further explanations are given in the text (see also Fig. 2).
FIGURE 2.
A set of figures describing various refinements and modeling steps in the algorithm. (A) Homology modeling. (I) One of the 5000 correctly folded ITS2 secondary structures described in Schultz et al. (2005). (II) Two homologous hits according to I showing ...

In summary, examining homology-based helix and structure modeling of 50,000 sequences revealed more than 20,000 homologous ITS2 secondary structures that, by definition, could not be modeled by standard RNA folding programs like RNAfold or mfold (Zuker and Stiegler 1981; Hofacker et al. 1994), but would instead have yielded wrongly folded predictions that would not match the common ITS2 pattern. Recently, new algorithms such as Alifold (Hofacker et al. 2002) have been published to predict a consensus secondary structure for a set of pre-aligned RNA sequences. However, this method would fail on our task, because it cannot exploit template structures to check for correct structure transfer.

A further 30,000 ITS2 sequences were rejected as they do not match any template in our database.

A database providing a total of about 25,000 ITS2 sequences and their predicted secondary structures (5000 templates and 20,000 new models) is available at http://its2.bioapps.biozentrum.uni-wuerzburg.de.

These available structures have been modeled as accurately as possible but represent only template-based predictions and thus may not be as accurate as experimentally verified RNA structures.

However, to test the accuracy of the prediction method we took one of the 5092 known template sequences (Schultz et al. 2005) and repredicted its structure by the proposed homology modeling algorithm. Then the predicted structure was compared to the “true” structure of the template sequence by counting position-wise structural matches. This procedure was repeated for all 5092 available template sequences. According to our filters (p value and base pair adoption), a reprediction was possible for 4716 sequences (93%). For those we obtained on average 97% structural identity between the “true” and the modeled structure. This represents a position-wise prediction accuracy of 97%.
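
Counting position-wise structural matches can be sketched as follows. The sketch assumes that both structures are given as dot-bracket strings of equal length and that a match means identical symbols at the same position; the paper does not state the exact representation, so this is an illustrative reading, and the function name is hypothetical.

```python
def structural_identity(true_structure, predicted_structure):
    """Fraction of positions at which two dot-bracket strings agree."""
    if len(true_structure) != len(predicted_structure):
        raise ValueError("structures must have the same length")
    if not true_structure:
        raise ValueError("structures must not be empty")
    matches = sum(a == b for a, b in zip(true_structure, predicted_structure))
    return matches / len(true_structure)
```

Averaging this value over all repredicted templates would give a figure comparable to the position-wise accuracy reported above.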

Characterization of predicted ITS2 structures

Having derived a data set of about 25,000 homologous ITS2 sequences together with their secondary structures, we are now able to evaluate the quality of the models and to describe ITS2 specific features. Besides the universally conserved U-U mismatch in helix II and the UGGU motif near the 5′ site apex of helix III (Table 1), these are (1) the percentage of adopted base pairs per helix (homology modeling), (2) the free energy of the predicted structures, and (3) the helix length distribution.

TABLE 1.
Taxa distribution of total number of the ITSs and consensus folds (updated according to Schultz et al. 2005)

Box plots in Figure 3A elucidate the homology modeling prediction performance per helix. Focusing on the medians, helices II and IV show near optimal prediction performance (medians at 100%, i.e., >50% of structures were completely assigned), whereas helices I and III are more difficult to model and perform slightly worse (median at 94% and 95%, respectively). However, the variances of the prediction favor helices II and III. Notches show a small 95% confidence interval for the median estimates.

FIGURE 3.
ITS2 statistics. (A) Box plots of the percentage of adopted base pairs per helix. In a box plot, the box shows the interquartile range (IQR) of the data, whereas the line through the box shows the median. The IQR is defined as the difference between the ...

The box plots in Figure 3B show that the medians for the identified helix lengths fit the well-known four-fingered hand structure of the ITS2, with helix III as the longest (cf. Coleman 2003).

There are structures in which the folding predicted by RNAfold was wrong with respect to the core structure of the ITS2. The correlation of predicted secondary structure energies as derived by RNAfold to folding energies derived by homology-based structure prediction is shown in a scatter plot (Fig. 3C). It demonstrates the degree of deviance between the optimal template-free energy folding according to RNAfold and the energy of the homology modeled structures. Obviously, the energy of homology-based structures has to be higher than that of the optimal structure, and structures found in the first iteration have in general lower energies than those found in the second iteration (Fig. 3C). The cloud of secondary structures in the lower right corner (Fig. 3C), which shows a large difference (~80 kcal/mol) between the optimal free energy from the usual RNA folding tools and the energy according to homology modeling, was checked by inspecting a sample (n = 20) of such structures. Even those were confirmed to fit the expected biological constraints, that is, four helices with helix III as the longest. This indicates that the homology-based predicted structures represent the biologically active form, as the ITS2 secondary structure is critical for rRNA processing, involved in several RNA–RNA and RNA–protein interactions, and hence extremely well conserved. Focusing on the structure energy from standard base-pairing alone is not sufficient. Careful template-based modeling captures many structural constraints, including complex interactions required for processing. This results in a sixfold increase in the number of structures that can now be assigned to their template.

For the wide application of ITS2 as a phylogenetic marker, a broad taxonomic distribution is crucial. Therefore, we tested the prevalence of the novel predicted structures for different taxonomic groups and compared it with the distribution of the energy-based predicted structures used as templates (Table 1). We observe that the sixfold increase in successfully predicted secondary structures is consistently reflected in the homogeneous increase of structures per taxon (Fig. 4). This suggests a sixfold higher prediction power of the homology modeling-based approach, independent of the taxonomic grouping.

FIGURE 4.
A scatter plot of energy-based predicted structures per taxonomic group versus homology modeling-based structures per taxonomic group. Regression analysis reveals that a quadratic curve fits the data, whereas omitting the largest groups—even a ...

Now that we have a large number of homologous, correctly pairwise aligned sequences at hand, we are able to derive an ITS2 specific general time reversible (GTR) substitution model (Lanave et al. 1984; Müller and Vingron 2000) and scoring matrices. The substitution model is one of the most important ingredients for phylogenetic studies. Typically, such a matrix is derived from small data sets and therefore cannot really reflect the typical ITS2 specific substitution behavior.

However, a refined substitution model can be derived from the large data set that is now available. Besides all sequences with their predicted secondary structures, a substitution model and a scoring matrix are therefore also available at http://its2.bioapps.biozentrum.uni-wuerzburg.de.
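
How a scoring matrix can be derived from pairwise alignments may be illustrated with a simple log-odds sketch. This is explicitly not the estimator used for the ITS2 model (the paper cites Lanave et al. 1984 and Müller and Vingron 2000 for that); it is a generic symmetric log-odds calculation, and the function name, the pseudocount, and the base-2 logarithm are assumptions chosen for illustration.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs, alphabet="ACGU", pseudocount=1.0):
    """Symmetric log-odds scores from gapped pairwise alignments.

    aligned_pairs: iterable of (seq_a, seq_b) strings of equal length;
    columns containing a gap or an unknown symbol are skipped.
    """
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        for a, b in zip(seq_a, seq_b):
            if a in alphabet and b in alphabet:
                # Count both directions so the matrix is symmetric,
                # as required by a time reversible model.
                counts[(a, b)] += 1
                counts[(b, a)] += 1

    cells = [(a, b) for a in alphabet for b in alphabet]
    total = sum(counts[cell] + pseudocount for cell in cells)
    joint = {cell: (counts[cell] + pseudocount) / total for cell in cells}
    # Background frequencies are the marginals of the joint distribution.
    freq = {a: sum(joint[(a, b)] for b in alphabet) for a in alphabet}
    return {(a, b): math.log2(joint[(a, b)] / (freq[a] * freq[b]))
            for a, b in cells}
```

Positive scores mark nucleotide pairs that are aligned more often than expected from the background frequencies, negative scores mark pairs that are aligned less often.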

CONCLUSIONS

Structural genomics often concentrates only on protein structures. Here, RNA structure modeling starting from a carefully designed template database (Schultz et al. 2005) is performed. Five thousand known sequences and secondary structures, together with structure selection filters, allow us to reveal more than 20,000 ITS2 secondary structures in about 14,000 organisms. The structures obtained in this way provide a basis for large-scale phylogenetic studies.

Obviously, for the homology modeling-based approach, an ab initio correctly folded secondary structure template is essential. This means that both methods are important in their own right but have to be combined to achieve the tremendous increase in structure prediction. For instance, by combining both we can now predict about 3534 ITS2 structures in the Asteraceae, of which 340 were recently described by Goertzen et al. (2003). The consensus structure obtained by their comparative approach (Fig. 11 of Goertzen et al. 2003) matches our predictions for the Asteraceae ITS2 well.

With biodiversity in mind, we expect, and indeed know, that there are additional variants (about 30,000 today) that deviate from our structural template. As already discussed in Schultz et al. (2005), there are known ITS2 structures with, for example, only three helices, branching helices, or more than four helices. Pseudogenes (Alvarez and Wendel 2003), with their rapid rate of sequence decay, may also deviate from our structure template. Similarly, ITS2 sequences of >150 nt that are so far only partially available in the database will also be rejected by the algorithm. Given the templates used, such structures cannot and should not be found by our homology modeling approach. Indeed, these structures were correctly rejected during the template-based modeling. Mapping the structure variations to the tree of life would substantially improve our knowledge about structure evolution and RNA processing. The method presented here can accommodate further templates as they become available.

From a practical point of view, we present in this article a data set that can and should be used as a starting point for any ITS2-based phylogenetic analysis and ITS2 structure prediction. It yields related ITS2 sequences and secondary structures and therefore constitutes a guideline for the alignment construction, which is essential for a good phylogenetic study.

From a theoretical point of view, we propose a powerful method that is independent of the RNA molecule considered and can therefore be easily transferred to any RNA family other than rRNA. It would thus be highly desirable to achieve a tree-of-life-wide database of other important RNA structures.

MATERIALS AND METHODS

Data ascertainment

All entries with the annotation “internal transcribed spacer 2” or “ITS2” were retrieved from GenBank and ITS2 subsequences were extracted as annotated. We discarded fragments and spurious sequences by assuming typical sequence lengths known from the already described structures (Schultz et al. 2005).

Sequence alignments

As these ITS2 sequences are globally homologous, global all-against-all pairwise alignments were calculated according to the Needleman and Wunsch algorithm (Needleman and Wunsch 1970) using a GeneMatcher2 system (Paracel, Inc.) with default settings. Alignments with a very conservative p value of <10^-16 were extracted.
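
For readers unfamiliar with global alignment, the Needleman and Wunsch recurrence can be sketched in a few lines. The scoring parameters below (match, mismatch, and a linear gap penalty) are illustrative assumptions and not the GeneMatcher2 defaults, which the text does not list; the sketch also returns only the optimal score, not the alignment or a p value.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of
    # seq_a with seq_b[:j]; only one row is kept at a time.
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        curr = [i * gap]
        for j, b in enumerate(seq_b, start=1):
            diagonal = prev[j - 1] + (match if a == b else mismatch)
            gap_in_b = prev[j] + gap       # a is aligned to a gap
            gap_in_a = curr[j - 1] + gap   # b is aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]
```

With these parameters, aligning `ACGU` against `ACU` yields three matches and one gap.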

Clustering

Aligned sequences that showed global sequence similarity and were linked to each other directly or indirectly (through a cascade of significant alignments) were clustered.
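
Clustering sequences that find each other directly or through a cascade corresponds to computing connected components of the graph of significant alignments. The following union-find sketch is one common way to do this; it is an assumption about the implementation, which the text does not describe, and the function and parameter names are hypothetical.

```python
def cluster_sequences(sequence_ids, significant_pairs):
    """Group sequences into clusters connected by significant alignments.

    sequence_ids: iterable of hashable identifiers.
    significant_pairs: iterable of (id_a, id_b) pairs whose alignment
    passed the p value threshold.
    """
    sequence_ids = list(sequence_ids)
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in significant_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for seq_id in sequence_ids:
        clusters.setdefault(find(seq_id), set()).add(seq_id)
    return list(clusters.values())
```

Sequences without any significant hit end up in singleton clusters, which contain no template and therefore yield no prediction.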

Folding

(See Fig. 1 for further details and steps.) Within a cluster we searched for one of the 5003 sequences with known structures to iteratively adopt that structure throughout the cascade of homologous sequences. If a cluster contained more than one sequence of known structure, the prediction for a specific sequence within the cluster was based on the most similar structural template (according to the p value). Generally, structures were adopted according to possible Watson–Crick base pairs (Watson and Crick 1953) and G-U pairings. More deviant noncanonical base pairs (Leontis et al. 2002) were not considered in the energy function or in the structure transfer. If the majority of base pairs (i.e., >75% per helix) could be adopted, structures were post-processed according to the algorithm of Nussinov and Jacobson (1980).
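
The base pair maximization idea behind the algorithm of Nussinov and Jacobson (1980) can be sketched as below. The sketch computes only the maximum number of nested Watson–Crick and G-U pairs for a sequence; how the post-processing restricts it to the unpaired regions of a transferred structure is not specified in the text, and the minimum hairpin loop length of 3 is an assumption.

```python
# Watson-Crick and G-U pairs, as allowed in the structure transfer.
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style recursion)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k, leaving at least min_loop
            # unpaired positions between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in CAN_PAIR:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

A traceback over the same table would recover the pairs themselves; it is omitted here to keep the sketch short.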

Iteration

Structures predicted in the first iteration (16,214) were used as templates for a second round of predictions (5533).

Visualization

Energies and structure images are derived by use of the Vienna Package (Hofacker et al. 1994).

Acknowledgments

We thank Daniel Gerlach, Stefanie Maisel, and Stefan Pinkert (all of the University of Würzburg, Germany) for providing parts of the data and the DFG for support (Bo-1099/5-3).

REFERENCES

  • Alvarez, I. and Wendel, J.F. 2003. Ribosomal ITS sequences and plant phylogenetic inference. Mol. Phylogenet. Evol. 29: 417–434.
  • Coleman, A.W. 2003. ITS2 is a double-edged tool for eukaryote evolutionary comparisons. Trends Genet. 19: 370–375.
  • Eddy, S.R. and Durbin, R. 1994. RNA sequence analysis using covariance models. Nucleic Acids Res. 22: 2079–2088.
  • Goertzen, L.R., Cannone, J.J., Gutell, R.R., and Jansen, R.K. 2003. ITS secondary structure derived from comparative analysis: Implications for sequence alignment and phylogeny of the Asteraceae. Mol. Phylogenet. Evol. 29: 216–234.
  • Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, L.S., Tacker, M., and Schuster, P. 1994. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125: 167–188.
  • Hofacker, I.L., Fekete, M., and Stadler, P.F. 2002. Secondary structure prediction for aligned RNA sequences. J. Mol. Biol. 319: 1059–1066.
  • Lanave, C., Preparata, G., Saccone, C., and Serio, G. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20: 86–93.
  • Leontis, N.B., Stombaugh, J., and Westhof, E. 2002. The non-Watson–Crick base pairs and their associated isostericity matrices. Nucleic Acids Res. 30: 3497–3531.
  • Müller, T. and Vingron, M. 2000. Modeling amino acid replacement. J. Comput. Biol. 7: 761–776.
  • Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453.
  • Nussinov, R. and Jacobson, A.B. 1980. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc. Natl. Acad. Sci. 77: 6309–6313.
  • Schultz, J., Maisel, S., Gerlach, D., Müller, T., and Wolf, M. 2005. A common core of secondary structure of the internal transcribed spacer 2 (ITS2) throughout the Eukaryota. RNA 11: 361–364.
  • Watson, J.D. and Crick, F.H. 1953. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171: 737–738.
  • Zuker, M. and Stiegler, P. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9: 133–148.