- RNA
- v.11(8); Aug 2005
- PMC1370799

# RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble

^{1}Bioinformatics Center, Wadsworth Center, New York State Department of Health, Albany, New York 12208, USA

^{2}Center for Computational Molecular Biology and Division of Applied Mathematics, Brown University, Providence, Rhode Island 02912, USA

**Reprint request to:** Ye Ding, Bioinformatics Center, Wadsworth Center, New York State Department of Health, 150 New Scotland Avenue, Albany, NY 12208, USA; e-mail: yding@wadsworth.org; fax: (518) 402-4623; or Charles E. Lawrence, Center for Computational Molecular Biology and Division of Applied Mathematics, Brown University, 182 George Street, Providence, RI 02912, USA; e-mail: lawrence@dam.brown.edu; fax: (401) 863-1355.

## Abstract

Prediction of RNA secondary structure by free energy minimization has been the standard for over two decades. Here we describe a novel method that forsakes this paradigm for predictions based on Boltzmann-weighted structure ensemble. We introduce the notion of a centroid structure as a representative for a set of structures and describe a procedure for its identification. In comparison with the minimum free energy (MFE) structure using diverse types of structural RNAs, the centroid of the ensemble makes 30.0% fewer prediction errors as measured by the positive predictive value (PPV) with marginally improved sensitivity. The Boltzmann ensemble can be separated into a small number (3.2 on average) of clusters. Among the centroids of these clusters, the “best cluster centroid” as determined by comparison to the known structure simultaneously improves PPV by 46.5% and sensitivity by 21.7%. For 58% of the studied sequences for which the MFE structure is outside the cluster containing the best centroid, the improvements by the best centroid are 62.5% for PPV and 31.4% for sensitivity. These results suggest that the energy well containing the MFE structure under the current incomplete energy model is often different from the one for the unavailable complete model that presumably contains the unique native structure. Centroids are available on the Sfold server at http://sfold.wadsworth.org.

**Keywords:** secondary structure prediction, centroid, Boltzmann ensemble

## INTRODUCTION

RNA molecules are key elements in some of the cell’s most fundamental processes, including catalysis, RNA splicing, and regulation of transcription and translation. To a large degree, the function of a structural RNA molecule is determined by its structure. Computational methods for modeling RNA secondary structure have proven to be valuable in many cases in which crystal structures are not available.

Free energy minimization is a long-established paradigm in computational structural biology that is based on the assumption that, at equilibrium, the solution to the underlying molecular folding problem is unique, and that the molecule folds into the lowest energy state. Applications of this paradigm include RNA folding (Zuker 1989), protein folding (Anfinsen 1973; Abagyan 1993), and transmembrane helix packing (Pappu et al. 1999). The prediction of RNA secondary structure has been widely applied, with good success. Efficient algorithms for computing the minimum free energy (MFE) structure and a set of suboptimal structures (Zuker and Stiegler 1981; Mathews et al. 1999, 2004) are based on free energy parameters that are estimated or extrapolated from chemical melting experiments (Xia et al. 1998; Mathews et al. 1999, 2004). An alternative approach computes all suboptimal foldings within an energy increment above the MFE (Wuchty et al. 1999). The exponential growth in the number of these foldings motivated recent development of the RNAshapes method for the efficient representation of the near optimal set (Giegerich et al. 2004). In a drastic departure from the MFE perspective, efforts have been made to characterize the ensemble of structures (McCaskill 1990; Bonhoeffer et al. 1993). Recently, we have presented an algorithm that draws samples from the ensemble of secondary structures in proportion to their Boltzmann weights (Ding and Lawrence 2003). In other words, our algorithm guarantees the generation of a statistically representative sample of the Boltzmann-weighted ensemble of structures, and thus enables the calculation of sampling statistics for structural features (Ding and Lawrence 2001, 2003). In this report, we examine the utility of such samples in RNA secondary structure prediction.

In applications of the sampling algorithm, we found that there often exist distinct clusters in the Boltzmann ensemble (Ding and Lawrence 2003), with each cluster containing similar structures.

To aid in the characterization of the sampled structural space, we introduce the centroid structure as an efficient means to characterize the central tendency for a set of structures and present a procedure for its identification. We examine the predictive value of the centroid of the entire sampled ensemble, the centroid of the largest cluster, and the cluster centroid that is closest to the structure determined by comparative sequence analysis.

## RESULTS

### Clustering results

From online RNA databases, 81 RNA sequences from nine structural RNA classes were selected (see Materials and Methods section for sequence selection process). For each sequence, 1000 structures sampled by our algorithm (Ding and Lawrence 2003) are clustered, and then the cluster to which the MFE structure belongs is determined (see Materials and Methods section for clustering procedure). Although a structural RNA may have a unique structure in solution, we have found that there exists a small number (3.2 on average) of distinct clusters of similar structures in the Boltzmann ensemble. Although the number of conformational states grows exponentially with sequence length, we found no evidence that the number of clusters increases with the length of the sequence (correlation coefficient = −0.1180, *P*-value = 0.294). Since the algorithm samples structures in accordance with their Boltzmann-weighted probabilities, the probability of a cluster is estimated by the frequency of structures in that cluster, i.e., the number of structures in the cluster divided by the sample size. In the case of multiple occurrences of the same structure, each occurrence is counted in the calculation. The MFE structure is present in the largest cluster for 55 of the 81 RNAs (68%). For 36 of these 55 sequences, the largest cluster dominates the structure space, with a probability of 0.7 or higher. Thus, the MFE structure is in a dominant cluster for only 44% (36/81) of the RNAs. The clustering results in Table 1 for 12 sequences exemplify possible scenarios for the cluster of the MFE structure. The cluster of the MFE structure can be the largest cluster with either a dominant probability or a moderate probability. The MFE structure can also be in a cluster that is secondary in size, or in some cases a cluster of only small or negligible probability. The 23S rRNA sequence for *Chlamydomonas reinhardtii* (accession number X15727) presents an extreme case for which the MFE structure is not similar to any structure in the sampled ensemble. These findings suggest that the MFE structure does not always represent well the Boltzmann-weighted ensemble, thus motivating our search for more reliable representatives.

### Centroid structures as representatives

As an alternative to the MFE structure, we propose the centroid structure. The centroid for a given set of structures is the structure in the entire structure ensemble that has the minimum total base-pair distance to the structures in the set. Thus, the centroid structure can be considered as the single structure that best represents the central tendency of the set. A centroid is referred to as the *ensemble centroid* when the set is the entire collection of structures sampled from the ensemble. A centroid of a cluster of similar structures is referred to as a *cluster centroid*. The mathematical definition of centroid structure and the derivation for its identification are presented in the Materials and Methods section. Ever since the emergence of mfold (multiple folds) it has been a common practice to report a number of suboptimal folds for predictive purposes (Zuker 1989, 2003). Both the optimal fold and the best from a long list of suboptimal folds are of interest for performance evaluation (Mathews et al. 1999). Here we employ a similar evaluation strategy and report on the predictive performance of the ensemble centroid and that of the best of a short list of cluster centroids, i.e., the best centroid. In other words, when a reference structure is available as the standard, the best cluster centroid is defined as the cluster centroid that has the shortest base-pair distance to this known structure. Of course, just as with the best suboptimal, the identity of this best centroid cannot be determined when a reference structure is not available. In keeping with accepted practice in this field, we employed structures determined from comparative sequence analysis as the standard for comparison and for the identification of the best centroid.

### Performance measures

We consider three measures for making performance comparisons between the MFE structure and centroids: base-pair distance, sensitivity, and PPV. More specifically, we compute the base-pair distance between the MFE structure and the structure determined by comparative sequence analysis and between the ensemble centroid or a cluster centroid and the structure determined by comparative sequence analysis. The sensitivity for a predicted structure is the percentage of base pairs in the structure determined by comparative sequence analysis that are also present in the predicted structure. The PPV is the percentage of base pairs in the predicted structure that are in the structure determined by comparative sequence analysis. These two complementary measures have become the standards for measuring predictive accuracy (Mathews et al. 1999; Dowell and Eddy 2004; Mathews 2004). The sensitivity focuses on predicting base pairs in the structure determined by comparative sequence analysis without regard to false positive base pair predictions, while the PPV focuses on accuracy of the predicted base pairs without regard to false negative base pairs. A perfect prediction is achieved if both the sensitivity and the PPV are 100%, in which case the two structures being compared are identical and have a distance of zero base pairs.
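The three measures can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes each structure is given as a set of 1-based `(i, j)` base-pair tuples with `i < j`, and the function names and the toy structures are hypothetical.

```python
def sensitivity(predicted, reference):
    """Percentage of reference base pairs that also appear in the prediction."""
    if not reference:
        return 0.0
    return 100.0 * len(predicted & reference) / len(reference)


def ppv(predicted, reference):
    """Percentage of predicted base pairs that appear in the reference."""
    if not predicted:
        return 0.0
    return 100.0 * len(predicted & reference) / len(predicted)


def base_pair_distance(a, b):
    """Number of base pairs present in exactly one of the two structures."""
    return len(a ^ b)


# Toy data (assumed for illustration): a reference with four pairs and a
# prediction that recovers three of them and adds two false positives.
reference = {(1, 20), (2, 19), (3, 18), (4, 17)}
predicted = {(1, 20), (2, 19), (3, 18), (5, 16), (6, 15)}

print(sensitivity(predicted, reference))         # 75.0
print(ppv(predicted, reference))                 # 60.0
print(base_pair_distance(predicted, reference))  # 3
```

In this toy case the prediction misses one reference pair (lowering sensitivity) and adds two pairs absent from the reference (lowering PPV), and the base-pair distance counts all three discrepancies.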

### Centroids are closer to the structure determined by comparative sequence analysis than is the MFE structure

The ensemble centroid, the centroid of the largest cluster, and the best centroid are closer in base-pair distance to the structure determined by comparative sequence analysis than is the MFE structure for 66 (81.5%), 60 (74.1%), and 74 (91.4%) sequences, respectively. Furthermore, these centroids are either closer to the structure determined by comparative sequence analysis than is the MFE structure or are as close to the structure determined by comparative sequence analysis as is the MFE structure for 73 (90.1%), 71 (87.7%), and 80 (98.8%) of the 81 sequences, respectively.

For each sequence, the percentage of distance improvement by a centroid over the MFE structure is calculated by [1 − *D*(C, P)/*D*(M, P)] × 100%, where *D*(C, P) is the base-pair distance between the centroid and the structure determined by comparative sequence analysis and *D*(M, P) is the base-pair distance between the MFE structure and the structure determined by comparative sequence analysis. For each RNA type, the averaged percentage of improvement is calculated; these values are presented in Table 2. For the best cluster centroid, the average improvement is > 19% for every RNA type. For the ensemble centroid and the centroid of the largest cluster, substantial improvements are obtained, except for the SRP RNAs.
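As a small worked illustration of the improvement formula, the sketch below evaluates it for made-up distances. The function name and the numbers are assumptions chosen for the example and are not values from the study.

```python
def distance_improvement(d_centroid, d_mfe):
    """Percentage improvement [1 - D(C, P) / D(M, P)] * 100 of a centroid over the MFE structure."""
    if d_mfe == 0:
        # The MFE structure already matches the reference; the ratio is undefined.
        raise ValueError("D(M, P) is zero, so the improvement is undefined")
    return (1.0 - d_centroid / d_mfe) * 100.0


# Hypothetical distances to the reference structure P.
print(round(distance_improvement(12, 20), 1))  # 40.0: the centroid is closer
print(round(distance_improvement(22, 20), 1))  # -10.0: the centroid is farther
print(round(distance_improvement(0, 20), 1))   # 100.0: the centroid equals P
```

A negative value means the centroid is farther from the reference than the MFE structure, and 100% is reached only when the centroid coincides with the reference.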

### Centroids yield comparable or improved sensitivities

For each RNA type, the averaged sensitivity by the MFE structure and the average percentage of improvement in sensitivity by each of the three centroids are presented in Table 3. For the MFE structure, the ensemble centroid, and the centroid of the largest cluster, the results are comparable with marginal overall improvements by the centroids. Furthermore, the ensemble centroid and the largest cluster centroid show equal or improved sensitivity for > 60% of the sequences. For the best cluster centroid, there is an average improvement of 21.74% for all RNA types, and negative improvement is only observed for group II introns.

### Centroid predictions yield fewer errors

For each RNA type, the averaged PPV by the MFE structure and the average percentage of improvement in PPV by each of the three centroids are presented in Table 4. For both the ensemble centroid and the best centroid, there is an improvement over the MFE structure, with an overall average of 30.0% and an overall average of 46.5%, respectively. For the best centroid, in particular, the PPV is either the same or improved for 79 of the 81 sequences (97.5%). For the centroid of the largest cluster, there is an improvement for seven of the nine RNA types, with an overall average improvement of 17.6%.

### MFE predictions break down when the MFE structure is in the wrong cluster

The degree of improvement by the best centroid largely depends on the location of the MFE structure. For 34 sequences (42.0%) for which the MFE structure is in the cluster of the best centroid, the base-pair distance and the PPV are substantially improved, and the sensitivity is improved appreciably; for the other 47 sequences (58.0%) for which the MFE structure is outside the cluster of the best centroid, the improvements by the best centroid are 36.5% for base-pair distance, 62.5% for PPV, and 31.4% for sensitivity (Table 5). The latter case is illustrated by the energy landscape of the sampled ensemble and representative structures for *Agrobacterium tumefaciens* 5S rRNA (Fig. 1).

*Agrobacterium tumefaciens* 5S rRNA (GenBank accession number X02627) of 120 nt.

**...**

### Large standard deviations due to a wide range of improvements

The unusually large standard deviations in Tables 2–5 are due to a wide range of improvements, as illustrated by Figure 2 for the best centroid. For base-pair distance, only one sequence has a negative improvement of −1.1%, and the improvement is as high as 100% (Fig. 2A). In terms of sensitivity, there are small to moderate negative improvements for 20 of the 81 sequences (24.7%), with an average of −9.6%, and the positive improvement is as high as 245.5% (Fig. 2B). For PPV, the improvement is as high as 313.6%, with only two sequences having negative improvements of −8.3% and −25.3% (Fig. 2C).

### Computational costs and availability

The main memory requirement for the clustering procedure is the storage of the distance matrix. The computation of the centroid is a linear operation. The CPU times and memory requirements for our version of partition function calculation for sampling 1000 structures and for clustering and centroid calculation are given in Table 6 for several sequences of various lengths. Clustering features including centroids are available through the module Srna of the Sfold software for folding and design of nucleic acids. Sfold is available through Web servers at http://sfold.wadsworth.org and http://www.bioinfo.rpi.edu/applications/sfold. Sample output for a folded sequence is available at http://sfold.wadsworth.org/demo.

## DISCUSSION

Our main finding is that ensemble centroids yield more specific predictions with average improvements of 30.0% in PPV and 3.5% in sensitivity. More strikingly, the best of a small number of cluster centroids improves the PPV by 46.5% while simultaneously increasing the sensitivity by > 20%. Perhaps our most provocative finding is that the MFE structure falls outside the cluster containing the best centroid for over half of the studied sequences. In such cases, > 31% more base pairs are correctly identified with > 62% fewer predictive errors (Table 5).

The consistent finding of improved PPV suggests that the MFE structure may tend to overpredict. It has been argued that the structure determined by comparative sequence analysis is a minimal model for RNA secondary structure, because only base pairs for which comparative evidence exists are included in the structure model (Larsen and Zwieb 1991). This raises the possibility that overprediction by MFE structure is in part due to underrepresentation of base pairs in the structure determined by comparative sequence analysis. However, recent comparisons of structure determined by comparative sequence analysis with crystal structures indicate that covariation analysis for 16S and 23S rRNAs identifies nearly all base pairings (Cannone et al. 2002). For 16S and 23S rRNAs, the improvements by the centroids are substantial (Tables 2–4), and thus cannot be attributed to potential underrepresentation of base pairs in the structure determined by comparative sequence analysis. For other types of RNAs, comprehensive data for comparing crystal structures with structure determined by comparative sequence analysis are needed for making a more general assessment.

To a large degree, the ensemble centroid is reflective of the high-frequency base pairs in the structure sample. Because the base-pair frequencies are sampling estimates of the base-pair probabilities computed by partition functions (McCaskill 1990), the finding of improved PPV by the ensemble centroid is consistent with the recent report that base pairs in MFE structure that have high probabilities have a significantly higher PPV than that of base pairs with lower probabilities (Mathews 2004).

All of the ensemble centroids in our analysis are based on samples of structures. However, we could also use the base-pair probabilities calculated from partition functions (McCaskill 1990) for this purpose. Because sample base-pair frequencies used for centroid calculation approach the base-pair probabilities as the sample size increases, our sample-based-centroid will approach the partition-function-based centroid. However, because base-pair probabilities give only the marginal probabilities of individual base pairs, the identification of clusters of similar structures based on base-pair probabilities alone is at best difficult. In contrast, because sampled structures are realizations from the joint high-dimensional distribution of all base pairs (Ding and Lawrence 2003), clustering is greatly facilitated. Accordingly, a statistical sample enables the decomposition of the two-dimensional histogram of base pairs into subhistograms of distinct structural clusters (Ding and Lawrence 2003).

Although the best centroids are the best predictors, these centroids cannot be defined when a reference structure is unavailable. However, it is an appealing feature that the best centroid predictions are based on only three to four clusters, on average. The small number of cluster centroid predictions can facilitate further structural determination by allowing the incorporation of other types of information, e.g., partial structure information from enzymatic or chemical probing. In order that our comparison be as direct and clear-cut as possible, all predicted structures in this analysis are based on the same set of energy rules (Xia et al. 1998; Mathews et al. 1999). We have not compared these approaches using recently revised energy rules (Mathews et al. 2004). Comparisons incorporating constraints (e.g., for forcing modified bases in tRNAs to be unpaired or for the incorporation of other partial structure information) and coaxial stacking also await further study. However, we currently see no reason why the advantages of these sample-based predictions should not extend to other cases. We also expect that the use of experimental constraints may improve the predictions, as demonstrated for predictions based on free energy minimization (Mathews et al. 1999, 2004).

We have examined the constrained MFE structure, using base pairs in the ensemble centroid as the constraints. In comparison with the ensemble centroid, we found that on average the sensitivity is improved by 2.22% for the constrained MFE structure as a result of more predicted base pairs; however, this small improvement has a cost of 7.05% average decrease in PPV. It remains an open question whether the combination of centroid prediction with other approaches can further improve structure prediction. The base pair frequencies for the entire structure sample as well as for individual clusters might be used as weights by the maximal weighted matching or the iterated loop matching method (Tabaska et al. 1998; Ruan et al. 2004) for calculating representative structures with pseudoknots.

As alternatives to examining clusters and centroids for sampled structures, one might consider clustering the 1000 structures with the lowest energies computed by RNAsubopt of the Vienna RNA package (Hofacker 2003) or the abstract shape representation of foldings within an energy increment from the MFE (Giegerich et al. 2004). These two methods focus on the lowest end of the free energy density of states, whereas structure sampling allows characterization of Boltzmann-weighted density of states (Ding and Lawrence 2003). Thus, the energy landscape is examined from two different perspectives. For short sequences such as tRNAs, there is generally a good correspondence between our centroids and the abstract shapes or centroids for 1000 best structures (data not shown), as the low energy structures are well represented in a sample. However, the degree of correspondence and the overlap in the energy coverage diminishes as sequence length increases, because apparently the Boltzmann-weighted density of states becomes increasingly dictated by structures at an energy distance from the MFE, and these structures far outnumber those with energies near the MFE. For example, for the rabbit β-globin mRNA of 589 nt (GenBank accession no. V00879), the 1000 best structures represent a small free energy range of 1.3 kcal/mol for default parameter settings of RNAsubopt, while a statistical sample presents representative structures from a much wider energy interval of 39.80 kcal/mol. In addition, a statistical sample can reveal “entropic clusters” (Ding and Lawrence 2003). For an entropic cluster, each member has a probability too small to command individual attention, but collectively the cluster has an appreciable probability because of the large number of cluster members. In computer RNA folding applications, it is a common practice among users to examine structures from mfold. 
Because the structure sample from mfold is heuristic rather than statistically or low-energy representative, the method presented here and the RNAshapes approach present two improved and complementary alternatives. The two methods are also complementary because RNAshapes provides an alternative method for the identification of structure clusters with cluster members having a common shape. It will be interesting to apply the RNAshapes algorithm to a sampled ensemble. The larger number of shape representatives than the number of our clusters suggests that our clustering procedure reports major clusters whereas the abstract shape approach may reveal more subtle structural dissimilarities.

As pointed out by Abagyan (1993), two major components are needed to solve macromolecular folding problems. First, all essential terms of the free energy of a trial conformation must be calculated with sufficient accuracy, and, second, a procedure is needed to find the minimum of this energy function. For RNAs, the global minimum can be found for an incomplete energy model, i.e., only the secondary structure model. Thus, it may not be a surprise that, for a majority of analyzed sequences (47 of 81 sequences), a centroid of an alternative cluster (probably representing an alternative energy well) is about 37% closer to the structure determined by comparative sequence analysis than the MFE structure computed with the incomplete energy function (Table 5). As argued by Abagyan, since only a small number of such alternative structural classes are needed to find this best alternative, an approach that employs a post-analysis filter function may be a productive path for selecting among clusters for improved structure prediction. Since for protein structural models neither of Abagyan’s two components is attainable, our findings argue that a more comprehensive examination of the energy landscape of the approximate models for structures of proteins and other macromolecules may also be worthy of further investigation. Even in the case of the complete model and energy function, the Boltzmann ensemble view is also important, e.g., for the investigation of metastable states, particularly RNA conformational switches (Barrick et al. 2004; Voss et al. 2004).

## MATERIALS AND METHODS

### RNA sequences

From publicly available databases (Larsen and Zwieb 1991; Sprinzl et al. 1998; Brown 1999; Cannone et al. 2002; Alm Rosenblad et al. 2003; Zwieb et al. 2003), we took samples of sequences for diverse types of structural RNAs with secondary structures determined by comparative sequence analysis. For tRNAs, RNase P RNAs, tmRNAs, signal recognition particle (SRP) RNAs, small subunit (16S or 16S-like) rRNAs, large subunit (23S or 23S-like) rRNAs, and 5S rRNAs, 10 sequences were randomly selected for each RNA type. In addition, nine group I introns without undetermined nucleotides and two group II introns that are available in the databases were also included in our analysis. The list of the 81 sequences is available from the authors upon request and will also be posted on the Sfold Web server.

### Clustering procedure

For comparing two secondary structures, we use the base-pair distance. The discriminatory power of this distance is adequate in our context, because we are interested in comparing multiple structures sampled for a given RNA sequence. In other circumstances, e.g., when structures of homologous sequences are compared, alternative metrics (Moulton et al. 2000) may be more appropriate to account for insertions and deletions. For hierarchical clustering of structures, we employ *Diana* (Kaufman and Rousseeuw 1990), a top-down divisive method that has performed well in other contexts (Datta and Datta 2003). For determining the number of clusters, we use the CH index (Calinski and Harabasz 1974) that was assessed as the best in a comprehensive study (Milligan and Cooper 1985). For a given divisive level on the clustering tree from *Diana*, the CH index is calculated as *CH*(*k*) = [*B*(*k*)/(*k* − 1)]/[*W*(*k*)/(*n*_{total} − *k*)], where *k* is the number of clusters, *n*_{total} is the total number of objects to be clustered, *B*(*k*) is the between-cluster sums of squares, and *W*(*k*) is the within-cluster sums of squares. The number of clusters at which the CH index is maximized is the optimal number of clusters. This number is then used to determine the structural clusters by identifying the corresponding divisive level for the hierarchical clustering tree produced by *Diana*. As described below, a secondary structure is an object in a high-dimensional Euclidean space that can be expressed by an upper triangular matrix. The cluster means required for the calculation of the sum of squares can be computed in two ways: averaging the corresponding Euclidean coordinates for all structures in a cluster or using the centroid of a cluster as its mean and computing the base-pair distance between the centroid and any structure in the cluster (see below for centroid calculation). 
The former is computationally intensive for long sequences whereas the latter is a linear operation. We did not observe appreciable differences in the clustering results by the two methods. Therefore, we decided to use cluster centroids in the implementation of the CH index.
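The centroid-based variant of the CH index can be sketched as follows. This is a hedged illustration and not the Sfold code: structures are assumed to be sets of `(i, j)` pairs, the within-cluster sum of squares is the total base-pair distance from each structure to its cluster centroid, and the between-cluster term is assumed here to be the size-weighted base-pair distance from each cluster centroid to the centroid of the whole sample, a detail the text does not spell out. The candidate partition is written by hand; in practice it would come from cutting the *Diana* tree at a given level.

```python
from collections import Counter


def centroid(structures):
    """Consensus of base pairs occurring in more than 50% of the structures."""
    counts = Counter(bp for s in structures for bp in s)
    m = len(structures)
    return {bp for bp, c in counts.items() if 2 * c > m}


def distance(a, b):
    """Base-pair distance: number of pairs found in exactly one structure."""
    return len(a ^ b)


def ch_index(clusters):
    """CH(k) = [B(k)/(k - 1)] / [W(k)/(n_total - k)] with centroids as cluster means."""
    k = len(clusters)
    everything = [s for cluster in clusters for s in cluster]
    n_total = len(everything)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two clusters and more structures than clusters")
    overall = centroid(everything)
    within = 0
    between = 0
    for cluster in clusters:
        c = centroid(cluster)
        within += sum(distance(s, c) for s in cluster)
        # Assumption: between-cluster term measured centroid to overall centroid.
        between += len(cluster) * distance(c, overall)
    if within == 0:
        return float("inf")
    # Same ratio as the definition, rearranged to a single division.
    return between * (n_total - k) / (within * (k - 1))


# Hand-made partition of five toy structures into two clusters.
cluster_a = [{(1, 10), (2, 9)}, {(1, 10), (2, 9)}, {(1, 10), (2, 9), (3, 8)}]
cluster_b = [{(11, 20), (12, 19)}, {(11, 20)}]
print(ch_index([cluster_a, cluster_b]))  # 9.0
```

To choose the number of clusters, one would evaluate `ch_index` for the partition at each divisive level and keep the level with the largest value.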

For every RNA sequence, we first cluster 1000 statistically sampled structures. We compute the MFE structure with mfold 3.1 (Zuker 2003) for the same set of Turner thermodynamic parameters (Xia et al. 1998; Mathews et al. 1999) that are currently implemented by our sampling algorithm (Ding and Lawrence 2003). To decide which cluster the MFE structure belongs to, we first identify the cluster whose centroid has the shortest base-pair distance to the MFE structure. If this distance is less than or equal to the longest base-pair distance between a structure in the cluster and the cluster centroid, the MFE structure belongs to this cluster; otherwise, the MFE structure does not belong to any cluster in the sample, i.e., it is in a new cluster by itself. The structure sample size of 1000 has been shown to be sufficiently large to guarantee statistical reproducibility in typical sampling statistics including base-pair frequencies, even when two independent samples do not share a single structure (Ding and Lawrence 2003). As reported below, base pair frequencies are all that are needed for centroid identification. In addition, regardless of sequence length, a cluster with an appreciable probability is expected to be represented in a sample of 1000 structures. Larger samples would reveal additional clusters that are insignificant at a significance level of 0.001.
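The rule for deciding which cluster the MFE structure belongs to can be sketched as below. It is a minimal illustration under the same assumptions as before (structures as sets of `(i, j)` pairs, centroid taken as the > 50% consensus); the function name `assign_mfe_cluster` and the toy data are hypothetical.

```python
from collections import Counter


def centroid(structures):
    """Consensus of base pairs occurring in more than 50% of the structures."""
    counts = Counter(bp for s in structures for bp in s)
    m = len(structures)
    return {bp for bp, c in counts.items() if 2 * c > m}


def distance(a, b):
    """Base-pair distance between two structures."""
    return len(a ^ b)


def assign_mfe_cluster(mfe, clusters):
    """Return the index of the cluster containing the MFE structure, or None.

    The nearest cluster is the one whose centroid is closest to the MFE
    structure. The MFE structure belongs to it only if that distance does
    not exceed the largest member-to-centroid distance within the cluster.
    """
    centroids = [centroid(cluster) for cluster in clusters]
    dists = [distance(mfe, c) for c in centroids]
    nearest = min(range(len(clusters)), key=lambda idx: dists[idx])
    radius = max(distance(s, centroids[nearest]) for s in clusters[nearest])
    return nearest if dists[nearest] <= radius else None


clusters = [
    [{(1, 10), (2, 9)}, {(1, 10), (2, 9)}, {(1, 10), (2, 9), (3, 8)}],
    [{(11, 20), (12, 19)}, {(11, 20)}],
]
print(assign_mfe_cluster({(1, 10), (2, 9), (3, 8)}, clusters))  # 0
print(assign_mfe_cluster({(30, 40), (31, 39)}, clusters))       # None
```

The second call returns `None` because the structure is farther from the nearest centroid than any member of that cluster, which corresponds to the case where the MFE structure forms a new cluster by itself.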

### Matrix representation of RNA secondary structure

For an RNA sequence of $n$ nucleotides, a secondary structure $I$ can be expressed by an upper triangular matrix of indicators $\{I_{ij}\}$, $1 \le i < j \le n$, where $I_{ij}$ indicates base-pairing status between base $i$ and base $j$. Specifically, $I_{ij} = 1$ if the $i$th base is paired with the $j$th base or $I_{ij} = 0$ otherwise. The requirement of at least three unpaired intervening bases between any base pair implies $I_{ij} = 0$ for $j = i + 1$, $i + 2$, and $i + 3$, $1 \le i$, $i + 3 \le n$. The indicators are not independent of one another, because they are subject to constraints. The assumption of no pseudoknots implies $I_{ij} I_{i'j'} = 0$ for $i' < i < j' < j$. Also, when base triples are prohibited, $\sum_{1 \le i \le n} I_{ij} \le 1$, and $\sum_{1 \le j \le n} I_{ij} \le 1$.

### Base-pair distance

While the base-pair indicators are binary and under constraints, they are also coordinates in a Euclidean space of dimension $(n - 1)n/2$. For two structures $I_1 = \{I_{ij}^{1}\}$ and $I_2 = \{I_{ij}^{2}\}$, consider the following metric $D_1$ and the squared Euclidean distance $D_2$: $D_1(I_1, I_2) = \sum_{1 \le i < j \le n} |I_{ij}^{1} - I_{ij}^{2}|$ and $D_2(I_1, I_2) = \sum_{1 \le i < j \le n} (I_{ij}^{1} - I_{ij}^{2})^2$. In general, both metrics have sufficient discriminatory power for the purpose of clustering. In our context, both metrics are equal to the number of different base pairs in $I_1$ and $I_2$. In other words, both of the metrics are in fact the well-known base-pair distance $D(I_1, I_2)$. This interpretation does not apply to the square root of $D_2$, i.e., the Euclidean distance.
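The indicator-matrix view and the equality of the two distances can be checked with a short sketch. It assumes structures are supplied as sets of 1-based `(i, j)` pairs; the helper names (`indicator_matrix`, `is_valid`, `d1`, `d2`) and the 12-nt toy structures are hypothetical and serve only to illustrate the definitions.

```python
def indicator_matrix(pairs, n):
    """Upper triangular 0/1 matrix with entry [i-1][j-1] = 1 when i pairs with j."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in pairs:
        matrix[i - 1][j - 1] = 1
    return matrix


def is_valid(pairs, n, min_loop=3):
    """Check the loop-size, base-triple, and pseudoknot constraints."""
    used = set()
    for i, j in pairs:
        # At least three unpaired bases must separate i and j.
        if not (1 <= i < j <= n) or j - i <= min_loop:
            return False
        # Each base may take part in at most one pair (no base triples).
        if i in used or j in used:
            return False
        used.update((i, j))
    # No two pairs may cross (no pseudoknots).
    return not any(i < k < j < l for (i, j) in pairs for (k, l) in pairs)


def d1(a, b, n):
    """Sum of absolute differences of the indicators."""
    ma, mb = indicator_matrix(a, n), indicator_matrix(b, n)
    return sum(abs(ma[i][j] - mb[i][j]) for i in range(n) for j in range(i + 1, n))


def d2(a, b, n):
    """Sum of squared differences of the indicators (squared Euclidean distance)."""
    ma, mb = indicator_matrix(a, n), indicator_matrix(b, n)
    return sum((ma[i][j] - mb[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))


n = 12
s1 = {(1, 12), (2, 11), (3, 10)}
s2 = {(1, 12), (2, 11), (4, 9)}
print(is_valid(s1, n), is_valid(s2, n))       # True True
print(d1(s1, s2, n), d2(s1, s2, n), len(s1 ^ s2))  # 2 2 2
print(is_valid({(1, 8), (5, 12)}, n))         # False: the two pairs cross
```

Because every indicator difference is −1, 0, or 1, its absolute value and its square coincide, which is why both sums reduce to the count of base pairs present in only one of the two structures.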

### Definition and derivation of centroid

For a set of *m* secondary structures *I*_{1}, *I*_{2}, . . . , *I*_{m}, with *I*_{k} = {*I*_{ij}^{k}}, 1 ≤ *k* ≤ *m*, the centroid for the set is defined as the structure in the *entire ensemble* of secondary structures that has the shortest total base-pair distance to the structures in the set. To compute the centroid structure, we need to find the secondary structure *I* = {*I*_{ij}} that minimizes the following sum of distances under the constraints discussed above:

$$\sum_{1\le k\le m} D(I, I_k) = \sum_{1\le k\le m}\sum_{i}\sum_{j}\left(I_{ij} - I_{ij}^{k}\right)^{2} = \sum_{i}\sum_{j}\left(m\,I_{ij}^{2} - 2\,C_{ij}\,I_{ij}\right) + C_{s} \tag{1}$$

where *C*_{s} = ∑_{1≤k≤m}∑_{i}∑_{j}(*I*_{ij}^{k})^{2} is a constant for the given structure set, and *C*_{ij} = ∑_{1≤k≤m} *I*_{ij}^{k} is the total number of occurrences of base pair *i* · *j* in the structure set. Because *I*_{ij}^{2} = *I*_{ij}, the double sum in (1) reduces to ∑_{i}∑_{j}(*m* − 2*C*_{ij})*I*_{ij}, so the nonlinear programming problem is in fact a linear programming problem with nonlinear constraints. A base pair with a frequency under 50% cannot be in the centroid, because (*m* − 2*C*_{ij}) > 0 and thus *I*_{ij} must be 0 for the centroid. A base pair with a frequency of 50% does not influence the double sum in (1), because (*m* − 2*C*_{ij}) = 0. For a base pair with a frequency > 50%, because (*m* − 2*C*_{ij}) < 0, the inclusion of this base pair (i.e., *I*_{ij} = 1) decreases the double sum in (1). Furthermore, any two base pairs with frequencies > 50% must occur together in at least one structure of the set, so they cannot form a pseudoknot or a base triple, because no structure in the set contains either. Thus, the consensus structure formed by all base pairs with a frequency > 50% is a centroid. We note that for base pairs with a frequency of 50%, inclusion of any compatible combination into the > 50% consensus does define another centroid. However, the > 50% consensus structure is always the unique centroid with the smallest number of base pairs and is the one we use for analysis.
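The argument above can be checked numerically with a short Python sketch. The function names and the four-structure set are hypothetical, chosen only so that base pairs with frequencies above, at, and below 50% all occur; this is not the authors' implementation.

```python
from collections import Counter


def pair_counts(structures):
    """C_ij: number of structures in the set that contain base pair (i, j)."""
    return Counter(pair for s in structures for pair in set(s))


def centroid(structures):
    """Consensus of all base pairs with frequency strictly above 50%."""
    m = len(structures)
    return {pair for pair, c in pair_counts(structures).items() if 2 * c > m}


def total_distance(candidate, structures):
    """Sum of base-pair distances from the candidate to every structure in the set."""
    return sum(len(set(candidate) ^ set(s)) for s in structures)


def objective(candidate, structures):
    """Linear form of the sum of distances: sum of (m - 2*C_ij) over chosen pairs, plus C_s."""
    m = len(structures)
    counts = pair_counts(structures)
    c_s = sum(counts.values())  # C_s: total number of base pairs in the set
    return sum(m - 2 * counts[pair] for pair in candidate) + c_s


# Hypothetical set of m = 4 structures.
sample = [
    {(1, 10), (2, 9), (3, 8)},
    {(1, 10), (2, 9)},
    {(1, 10), (2, 9), (4, 8)},
    {(1, 10), (3, 8)},
]

best = centroid(sample)
print(sorted(best))                                # [(1, 10), (2, 9)]
print(total_distance(best, sample))                # 4
print(total_distance(best | {(3, 8)}, sample))     # 4, a 50% pair leaves the sum unchanged
print(total_distance(best | {(4, 8)}, sample))     # 6, a 25% pair increases the sum
```

Here (3, 8) occurs in exactly half of the structures, so adding it gives a second centroid with the same total distance, whereas adding the minority pair (4, 8) increases the total distance.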

The centroid is referred to as the ensemble centroid when the structure set is the statistical sample generated by our sampling algorithm, typically with *m* = 1000. In this case, *C*_{ij} is the observed count for base pair *i* · *j* in the sample. For a cluster of similar structures identified from the statistical sample, the centroid is referred to as a cluster centroid. In this case, *C*_{ij} is the observed count for base pair *i* · *j* in the cluster.
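The distinction between the ensemble centroid and cluster centroids comes down to which structures contribute to the counts *C*_{ij}. The sketch below assumes a toy sample of five structures and made-up cluster labels; in practice the sample would come from the sampling algorithm and the labels from a clustering procedure, neither of which is reproduced here.

```python
from collections import Counter, defaultdict


def centroid(structures):
    """Base pairs whose observed count exceeds half the number of structures."""
    m = len(structures)
    counts = Counter(pair for s in structures for pair in s)
    return {pair for pair, c in counts.items() if 2 * c > m}


# Hypothetical sample (m = 5) and hypothetical cluster labels.
sample = [
    frozenset({(1, 12), (2, 11), (3, 10)}),
    frozenset({(1, 12), (2, 11)}),
    frozenset({(1, 12), (2, 11), (3, 10)}),
    frozenset({(14, 24), (15, 23)}),
    frozenset({(14, 24), (15, 23), (16, 22)}),
]
labels = [0, 0, 0, 1, 1]

# Group the sampled structures by cluster label.
clusters = defaultdict(list)
for structure, label in zip(sample, labels):
    clusters[label].append(structure)

# Ensemble centroid: counts are taken over the whole sample.
ensemble_centroid = centroid(sample)

# Cluster centroids: counts are taken within each cluster only.
cluster_centroids = {label: centroid(members) for label, members in clusters.items()}

print(sorted(ensemble_centroid))     # [(1, 12), (2, 11)]
print(sorted(cluster_centroids[0]))  # [(1, 12), (2, 11), (3, 10)]
print(sorted(cluster_centroids[1]))  # [(14, 24), (15, 23)]
```

In this toy case the base pairs of the smaller cluster fall below 50% in the full sample, so they appear only in that cluster's own centroid.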

## Acknowledgments

The Computational Molecular Biology and Statistics Core at the Wadsworth Center is acknowledged for providing computing resources for this work. This work was supported in part by National Science Foundation grant DMS-0200970 and National Institutes of Health grant GM068726 to Y.D. and by National Institutes of Health grant HG01257 to C.E.L. We are grateful for the suggestions and observations of the anonymous referees, which led to drastic improvements in the run times of clustering and centroid identification as well as in the presentation of the article.

## Notes

Article and publication are at http://www.rnajournal.org/cgi/doi/10.1261/rna.2500605.

## REFERENCES

- Abagyan, R.A. 1993. Towards protein folding by global energy optimization. FEBS Lett. 325: 17–22. [PubMed]
- Alm Rosenblad, M., Gorodkin, J., Knudsen, B., Zwieb, C., and Samuelsson, T. 2003. SRPDB (Signal Recognition Particle Database). Nucleic Acids Res. 31: 363–364. [PMC free article] [PubMed]
- Anfinsen, C.B. 1973. Principles that govern the folding of protein chains. Science 181: 223–230. [PubMed]
- Barrick, J.E., Corbino, K.A., Winkler, W.C., Nahvi, A., Mandal, M., Collins, J., Lee, M., Roth, A., Sudarsan, N., Jona, I., et al. 2004. New RNA motifs suggest an expanded scope for riboswitches in bacterial genetic control. Proc. Natl. Acad. Sci. 101: 6421–6426. [PMC free article] [PubMed]
- Bonhoeffer, S., McCaskill, J.S., Stadler, P.F., and Schuster, P. 1993. RNA multi-structure landscapes. A study based on temperature dependent partition functions. Eur. Biophys. J. 22: 13–24. [PubMed]
- Brown, J.W. 1999. The ribonuclease P database. Nucleic Acids Res. 27: 314. [PMC free article] [PubMed]
- Calinski, R.B. and Harabasz, J. 1974. A dendrite method for cluster analysis. Comm. Stat. 3: 1–27.
- Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D’Souza, L.M., Du, Y., Feng, B., Lin, N., Madabusi, L.V., Muller, K.M., et al. 2002. The comparative RNA Web (CRW) site: An online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinform. 3: 2. [PMC free article] [PubMed]
- Datta, S. and Datta, S. 2003. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19: 459–466. [PubMed]
- Ding, Y. and Lawrence, C.E. 2001. Statistical prediction of single-stranded regions in RNA secondary structure and application to predicting effective antisense target sites and beyond. Nucleic Acids Res. 29: 1034–1046. [PMC free article] [PubMed]
- ———. 2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res. 31: 7280–7301. [PMC free article] [PubMed]
- Dowell, R.D. and Eddy, S.R. 2004. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinform. 5: 71. [PMC free article] [PubMed]
- Giegerich, R., Voss, B., and Rehmsmeier, M. 2004. Abstract shapes of RNA. Nucleic Acids Res. 32: 4843–4851. [PMC free article] [PubMed]
- Hofacker, I.L. 2003. Vienna RNA secondary structure server. Nucleic Acids Res. 31: 3429–3431. [PMC free article] [PubMed]
- Kaufman, L. and Rousseeuw, P.J. 1990.
*Finding groups in data: An introduction to cluster analysis.*John Wiley & Sons, New York. - Kruskal, J.B. and Wish, M. 1977.
*Multidimensional scaling.*Sage Publications, Beverly Hills, CA. - Larsen, N. and Zwieb, C. 1991. SRP-RNA sequence alignment and secondary structure. Nucleic Acids Res. 19: 209–215. [PMC free article] [PubMed]
- Mathews, D.H. 2004. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA 10: 1178–1190. [PMC free article] [PubMed]
- Mathews, D.H., Sabina, J., Zuker, M., and Turner, D.H. 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288: 911–940. [PubMed]
- Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M., and Turner, D.H. 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl. Acad. Sci. 101: 7287–7292. [PMC free article] [PubMed]
- McCaskill, J.S. 1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29: 1105–1119. [PubMed]
- Milligan, G.W. and Cooper, M.C. 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50: 159–179.
- Moulton, V., Zuker, M., Steel, M., Pointon, R., and Penny, D. 2000. Metrics on RNA secondary structures. J. Comput. Biol. 7: 277–292. [PubMed]
- Pappu, R.V., Marshall, G.R., and Ponder, J.W. 1999. A potential smoothing algorithm accurately predicts transmembrane helix packing. Nat. Struct. Biol. 6: 50–55. [PubMed]
- Ruan, J., Stormo, G.D., and Zhang, W. 2004. An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics 20: 58–66. [PubMed]
- Sprinzl, M., Horn, C., Brown, M., Ioudovitch, A., and Steinberg, S. 1998. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 26: 148–153. [PMC free article] [PubMed]
- Tabaska, J.E., Cary, R.B., Gabow, H.N., and Stormo, G.D. 1998. An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics 14: 691–699. [PubMed]
- Voss, B., Meyer, C., and Giegerich, R. 2004. Evaluating the predictability of conformational switching in RNA. Bioinformatics 20: 1573–1582. [PubMed]
- Wuchty, S., Fontana, W., Hofacker, I.L., and Schuster, P. 1999. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49: 145–165. [PubMed]
- Xia, T., SantaLucia Jr., J., Burkard, M.E., Kierzek, R., Schroeder, S.J., Jiao, X., Cox, C., and Turner, D.H. 1998. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson–Crick base pairs. Biochemistry 37: 14719–14735. [PubMed]
- Zuker, M. 1989. On finding all suboptimal foldings of an RNA molecule. Science 244: 48–52. [PubMed]
- ———. 2003. Mfold Web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31: 3406–3415. [PMC free article] [PubMed]
- Zuker, M. and Stiegler, P. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9: 133–148. [PMC free article] [PubMed]
- Zwieb, C., Gorodkin, J., Knudsen, B., Burks, J., and Wower, J. 2003. tmRDB (tmRNA database). Nucleic Acids Res. 31: 446–447. [PMC free article] [PubMed]
