# Centroid estimation in discrete high-dimensional spaces with applications in biology

Author contributions: L.E.C. and C.E.L. designed research; L.E.C. and C.E.L. performed research; L.E.C. analyzed data; and L.E.C. and C.E.L. wrote the paper.

## Abstract

Maximum likelihood estimators and other direct optimization-based estimators dominated statistical estimation and prediction for decades. Yet, the principled foundations supporting their dominance do not apply to the discrete high-dimensional inference problems of the 21st century. As is well known, statistical decision theory shows that maximum likelihood and related estimators use data only to identify the single most probable solution. Accordingly, unless this one solution so dominates the immense ensemble of all solutions that its probability is near one, there is no principled reason to expect such an estimator to be representative of the posterior-weighted ensemble of solutions, and thus represent inferences drawn from the data. We employ statistical decision theory to find more representative estimators, centroid estimators, in a general high-dimensional discrete setting by using a family of loss functions with penalties that increase with the number of differences in components. We show that centroid estimates are obtained by maximizing the marginal probabilities of the solution components for unconstrained ensembles and for an important class of problems, including sequence alignment and the prediction of RNA secondary structure, whose ensembles contain exclusivity constraints. Three genomics examples are described that show that these estimators substantially improve predictions of ground-truth reference sets.

**Keywords:** prediction, statistical inference, computational biology, discrete decoding

In the past decade, high-throughput data-acquisition technologies have rendered datasets with sizes unimaginable to our predecessors, including the sequence of the human genome (1) and the products of numerous high-throughput technologies of the post-genome era (2), data warehouses of commercial and internet transactions (3), and surveys of the objects of the universe (4). Although the emergence of such large datasets seems to imply more precise parameter estimation, paradoxically just the opposite is becoming increasingly common. This paradox emerged because these technologies simultaneously opened opportunities to draw inferences on previously unanswerable high-dimensional questions.

Estimation and prediction have long been dominated by procedures that identify the most probable point, including maximum likelihood estimation (5), maximum *a posteriori* (MAP) estimates such as Viterbi decoding of hidden Markov models, and minimum “free-energy” structure predictions (6, 7). These types of estimators are referred to as ML estimators (maximum likelihood-family estimators) in the remainder of this article. In addition, many algorithms that optimize scoring functions to produce estimates or predictions correspond to equivalent maximum likelihood estimation procedures (8, 9), and thus also yield ML estimators.

Historically, there have been good reasons for this dominance. ML estimates are intuitively appealing because they identify the point in the space of the unknowns for which the data have highest probability. In the prediction of molecular structures, if the energy of one structure is sufficiently lower than all of the others, its probability will be near one and thus it will dominate the ensemble. More importantly, the long dominance of ML estimators rests on a principled foundation showing that they possess a number of very desirable properties, at least in the historic setting in which they were developed and have been very successfully applied: low-dimensional continuous spaces. Specifically, ML estimators have three key advantageous properties, as reported by Wald and Cramér (10, 11): consistency, asymptotic normality, and asymptotic efficiency. However, these properties only hold asymptotically as the data increase relative to the number of unknowns, and only properly for continuous variables. Such conditions are not attained when interest is focused on the inference of high-dimensional (high-D) discrete unknowns. Thus, the principled foundation supporting ML estimators is absent in this new setting.

Furthermore, evidence has emerged indicating that, in practice, estimators that gather information from the full ensemble of solutions predict ground-truth reference sets better than ML estimators. Specifically, Miyazawa (9) described reliable alignments that outperformed maximum similarity alignment procedures in the prediction of protein structure. More recently, Ding *et al.* (12) derived centroid estimators for predicting RNA secondary structure, and showed that they outperform well established ML estimators. Thus, there is now evidence in principle and in practice suggesting the need for alternative estimation procedures.

Bayesian inference provides a very useful alternative that enjoys a number of advantages in continuous settings. However, mean values of these estimators are not applicable here because, in general, they will not provide discrete solutions. Also, when interest is not on the overall solution, but on individual components, maximization of the marginal probabilities has been proposed (13). In sequence alignment, Miyazawa developed reliable alignments that maximize marginal probabilities and showed that these estimates meet the problem's main constraints (9). For the special case of predicting RNA secondary structure, one of us and colleagues developed centroid predictions and showed that these meet this problem's constraints (12).

Here, we use statistical decision theory to broadly generalize and extend the results of Ding *et al.* (12), to formally develop an alternative class of centroid estimators, and to prove some related theorems.

## Background

A common high-dimensional inference problem concerns the estimation of *n* correlated binary variables, θ, living on a subset of {0,1}^{n}. For example, in network identification one seeks to predict if pairs of nodes are either connected or not. In these applications binary variables, θ, are not observed directly, rather they have to be inferred from available data, *S*. For example, to predict gene networks, binary variables predict interactions between a pair of genes based on intensity data from microarray assays of expression levels of thousands of individual genes (14).

In maximum likelihood estimation, one finds the values of the unknowns that maximize the probability of the observed data, argmax_{θ}*P*(*S*|θ). In Bayesian statistics, the other major statistical inference paradigm, Bayes's theorem gives the probability of allowable realizations of the set of unknown random variables θ after seeing the data *S*, which is therefore called the posterior probability:

$$P(\theta \mid S, \Lambda) = \frac{P(S \mid \theta, \Lambda)\, P(\theta \mid \Lambda)}{\sum_{\theta'} P(S \mid \theta', \Lambda)\, P(\theta' \mid \Lambda)} \tag{1}$$

where *P*(θ|Λ) is the prior probability of θ conditional on a set of parameters Λ and *P*(*S*|θ,Λ) is the likelihood of the data conditional on θ and Λ. The denominator in Eq. **1**, which is analogous to the partition function in statistical physics, is often called the marginal likelihood of the data. The common Bayesian ML estimator, the posterior probability maximizer or maximum *a posteriori* estimator, is

$$\widehat{\theta}_{MAP} = \arg\max_{\theta \in \Theta} P(\theta \mid S, \Lambda),$$

where, from now on, we denote by Θ the set of all possible realizations of θ.

Consider now the simple case of three correlated binary variables. Suppose that the eight combinations of these variables were equally likely *a priori* and that, after observing the data, the probabilities of the eight alternative combinations of these variables are those shown in Fig. 1, and *p*_{2} < *p*_{1} < 2*p*_{2}.

Fig. 1. *L*(1, 1, 1) has probability *p*_{1} and is the ML estimate; gray points have probability *p*_{2} as labeled, with *p*_{2} < *p*_{1} < 2*p*_{2}, including *C*(0, 0, 0), and white, unshaded points have **...**

As Fig. 1 shows, although the point *L*(1, 1, 1) is the most likely, it does not seem to represent the data well, because the only other points with positive probabilities are the four points that are about as different from this point as possible and together they are between two and four times more probable than point *L*. Point *M* is the mean with coordinates (*p*_{1} + *p*_{2}, *p*_{1} + *p*_{2}, *p*_{1} + *p*_{2}) and because, as discussed more formally below, *p*_{1} + *p*_{2} < 1/2, it is always closer to *C*(0, 0, 0) than to *L*(1, 1, 1).

For the general case of *n* binary variables, where *p*_{2} < *p*_{1} < (*n* − 1)*p*_{2}, we could still have a cluster of *n* + 1 points that differ from some point *C* by no more than one component and such that each has probability *p*_{2}. The point at the opposite end of the hypercube *L* has probability *p*_{1}, all its neighbors have zero probability, and it differs from all of the rest with positive probability in at least *n* − 1 of its *n* components. Since *p*_{1} + (*n* + 1)*p*_{2} = 1, which implies that 1/(2*n*) < *p*_{2} < 1/(*n* + 2), if we choose *p*_{2} = 1/(*n* + 3), then we would have *p*_{1} = 2/(*n* + 3). The ratio of the posterior mass for the cluster around point *C* to that around *L* is (*n* + 1)*p*_{2}/*p*_{1} = (*n* + 1)/2. Thus, as the dimensionality of the problem increases, the proportion of posterior mass around point *C* becomes arbitrarily large when compared with *L*, and so does the distance between *C* and *L*. Nevertheless, since *p*_{2} < *p*_{1}, point *L* is the ML estimate.
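The arithmetic of this construction can be checked numerically. The following is a minimal sketch, assuming *n* = 6 and the choice *p*_{2} = 1/(*n* + 3) made above; the function names `toy_posterior` and `expected_hamming` are illustrative and not part of the original work. It enumerates the *n* + 2 points with positive probability, finds the most probable point, and compares it with the point that minimizes the posterior-expected number of differing components.

```python
from itertools import product

def toy_posterior(n):
    """Posterior from the text: L = (1,...,1) has p1 = 2/(n+3); C = (0,...,0)
    and its n one-bit neighbours each have p2 = 1/(n+3)."""
    p1, p2 = 2 / (n + 3), 1 / (n + 3)
    post = {tuple([1] * n): p1, tuple([0] * n): p2}
    for i in range(n):
        point = [0] * n
        point[i] = 1
        post[tuple(point)] = p2
    return post

def hamming(a, b):
    """Number of components in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def expected_hamming(candidate, post):
    """Posterior-expected number of components that differ from candidate."""
    return sum(p * hamming(candidate, theta) for theta, p in post.items())

n = 6  # assumed dimension, chosen only for illustration
post = toy_posterior(n)

# Most probable single point (the ML estimate L in the text).
map_estimate = max(post, key=post.get)

# Point of the hypercube with the smallest expected number of differences.
central_point = min(product((0, 1), repeat=n),
                    key=lambda c: expected_hamming(c, post))

# Posterior mass of the cluster around C relative to the mass of L.
mass_ratio = sum(p for t, p in post.items() if t != map_estimate) / post[map_estimate]

print(map_estimate, central_point, mass_ratio)
```

Under these assumptions the most probable point is *L*, while the point with the smallest expected number of differences is *C*, and the mass ratio equals (*n* + 1)/2.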

Although by design this is an extreme case, there are many other examples that present similar scenarios that raise concerns about the utility of ML estimators in high-dimensional discrete spaces. To examine these issues in less extreme and more realistic cases, we consider four questions: (*i*) Is there a principled basis that prevents ML estimators from being isolated from the region of space strongly recommended by the data? (*ii*) What alternative estimators better represent the data? (*iii*) Do these alternative estimators offer improved representation of the data in practice? (*iv*) Do these alternative estimators predict known reference cases better?

## Centroid Estimation

### Is There a Principled Basis Preventing ML Estimators from Being Isolated from the Region of Space Strongly Recommended by the Data?

In most high-dimensional circumstances, ML estimators are likely to have a very small probability because the single term in the numerator of Eq. **1** will soon become swamped by the large number of terms in the denominator. We are not the first to recognize this fact, but many who recognize it have hoped that the ML estimator would be surrounded by a large number of similar solutions that together would contain a high proportion of the posterior probability mass (15). To find an estimator that will capture this hope, we employ statistical decision theory.

Statistical decision theory provides a principled means to optimally identify estimators with desirable properties through the use of loss functions. These functions assign losses to differences between the unknown value one seeks to infer and its estimate. Specifically, in the theory's jargon, we consider the loss *L*($\widehat{\theta}$, θ) associated with making the prediction $\widehat{\theta}$ when the actual value of the unknown solution is θ, under loss function *L*(·) (11). To address the uncertainty inherent in estimation and prediction we seek an estimator that minimizes the expected loss, the *risk*. For example, to find estimators with minimum variance, the expected squared error loss is minimized. Specifically, the *posterior* risk is defined as the expected loss of choosing some estimator $\widehat{\theta}$ given the data *S* (16, 17),

$$E_{\theta \mid S}\left[L(\widehat{\theta}, \theta)\right] = \sum_{\theta} L(\widehat{\theta}, \theta)\, P(\theta \mid S),$$

where the sum runs over all possible realizations of θ and, for brevity, the dependence on Λ is suppressed.

It is well known that ML estimators in this discrete setting are guaranteed to minimize posterior risk only under a zero–one loss function (13). Thus, any other point in the space that is different from the estimator incurs a unit penalty in the risk, regardless of how different it is.

Also, because these estimators ignore all other configurations but the most probable one, there is no principled reason for them not to differ greatly from all other solutions in the space, regardless of how strongly other points are supported by the data. Thus, because ML estimators in this setting only represent themselves and frequently have small posterior probabilities, these estimators are unlikely to be good representatives of the information contained in the observed data.

### What Alternative Estimators Better Represent the Data?

Although the hope of finding a high proportion of the posterior-weighted mass concentrated around an ML estimator has not been realized, the concept of finding an estimator that does concentrate mass bears further consideration. Thus, to answer this question, we employ loss functions that incur higher losses as the difference in the number of components increases. The motivation is to better capture the character of the distribution of the posterior probability mass in the ensemble by seeking an estimator that represents collections of similar solutions, like point *C* in Fig. 1.

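First, consider the *Hamming* loss,

$$H(z, y) = \sum_{i=1}^{n} I(z_i \neq y_i),$$

where *I* is the indicator function, that is, *I*(*a*) = 1 if *a* is true and *I*(*a*) = 0 otherwise, and *n* is the dimension of both *z* and *y*. The Hamming loss function simply measures how many components differ between two members of a discrete solution space. Its posterior risk for some estimator $\widehat{\theta}$ is

$$E_{\theta \mid S}\left[H(\widehat{\theta}, \theta)\right] = \sum_{i=1}^{n} P(\theta_i \neq \widehat{\theta}_i \mid S) = n - \sum_{i=1}^{n} P(\theta_i = \widehat{\theta}_i \mid S),$$

and so, it is immediate that, to minimize the risk, we can simply choose

$$\widehat{\theta}_{C} = \arg\max_{\widehat{\theta} \in \Theta} \sum_{i=1}^{n} P(\theta_i = \widehat{\theta}_i \mid S),$$

with the maximization taken over the solution space Θ, that is, the posterior *marginal* sum maximizer. We call $\widehat{\theta}$_{C} the *centroid estimator*, for which we have just presented the proof of the following equivalent definition: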

### Theorem 1.

$\widehat{\theta}$_{C} *is the posterior Hamming loss risk minimizer*.

In the special case when Θ = *A*_{1} × *A*_{2} × ⋯ × *A*_{n}, where *A*_{i} is the set of possible values for the *i*th entry in θ, we can take the *marginal* posterior maximizers for each position. That is, we choose an estimate by choosing the value of each component of the solution that is most probable, the *consensus estimator*:

$$\widehat{\theta}^{*}_{C} = \left(\arg\max_{a \in A_i} P(\theta_i = a \mid S)\right)_{i=1}^{n}.$$
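As an illustration of the component-wise rule, the sketch below computes the marginal posterior of each position and takes the most probable value at each one. The posterior over three-letter strings is hypothetical and chosen only so that no ties occur; the helper names (`marginals`, `consensus`, `hamming_risk`) are assumptions made for this example. A brute-force search over the full product space is included to check that, in this unconstrained case, the same point minimizes the posterior Hamming risk.

```python
from collections import defaultdict
from itertools import product

def marginals(post, n):
    """marg[i][a] = posterior probability that component i takes value a."""
    marg = [defaultdict(float) for _ in range(n)]
    for theta, p in post.items():
        for i, a in enumerate(theta):
            marg[i][a] += p
    return marg

def consensus(post, n):
    """Most probable value of each component, taken independently."""
    return tuple(max(m, key=m.get) for m in marginals(post, n))

def hamming_risk(candidate, post):
    """Posterior-expected number of components that differ from candidate."""
    return sum(p * sum(c != t for c, t in zip(candidate, theta))
               for theta, p in post.items())

# Hypothetical posterior over A_1 x A_2 x A_3 with A_i = {"A", "C", "G"}.
post = {
    ("A", "C", "G"): 0.30,
    ("C", "C", "A"): 0.28,
    ("C", "G", "A"): 0.27,
    ("C", "C", "G"): 0.15,
}
n = 3
alphabet = ("A", "C", "G")

map_estimate = max(post, key=post.get)
consensus_estimate = consensus(post, n)
brute_force = min(product(alphabet, repeat=n),
                  key=lambda c: hamming_risk(c, post))

print(map_estimate, consensus_estimate, brute_force)
```

In this example the consensus estimate differs from the single most probable string and agrees with the brute-force Hamming risk minimizer.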

### Constrained Centroid Estimation

Centroid estimators also have their drawbacks. For instance, it might not be straightforward to derive a centroid estimator if the solution space is shaped by complex constraints. A naïve approach would be to employ the consensus estimator, but, since the inference is driven by the marginals, it is possible to find an estimate that is not feasible, that is, one that does not belong to the solution space of the original problem. A simple example occurs when the space comprises three points (1, 0, 0), (0, 1, 0), and (0, 0, 1) with probabilities *p*_{1}, *p*_{2}, and *p*_{3}, each less than 0.5, respectively; the consensus estimator would be (0, 0, 0), which does not belong to the space. In general, taking the maximizers of the marginals does not yield a feasible solution in such a constrained problem, but under appropriate conditions it can.
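The three-point example can be reproduced directly. This is a minimal sketch that assumes the specific probabilities 0.40, 0.35, and 0.25, which are not given in the text, and uses illustrative helper names. It shows that thresholding the marginals gives a point outside the space, whereas minimizing the posterior Hamming risk over the space itself returns a feasible point.

```python
# Solution space from the text with assumed probabilities, all below 0.5.
space = {(1, 0, 0): 0.40, (0, 1, 0): 0.35, (0, 0, 1): 0.25}

def marginal_of_one(post, i):
    """Posterior probability that component i equals 1."""
    return sum(p for theta, p in post.items() if theta[i] == 1)

def hamming_risk(candidate, post):
    """Posterior-expected number of components that differ from candidate."""
    return sum(p * sum(c != t for c, t in zip(candidate, theta))
               for theta, p in post.items())

# Component-wise rule: set a component to 1 only if its marginal exceeds 1/2.
consensus = tuple(int(marginal_of_one(space, i) > 0.5) for i in range(3))

# Hamming risk minimizer restricted to the feasible points.
constrained_centroid = min(space, key=lambda c: hamming_risk(c, space))

print(consensus, consensus in space, constrained_centroid)
```

With these assumed probabilities the consensus point is (0, 0, 0), which is infeasible, and the constrained minimizer is (1, 0, 0).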

For many important applications, like RNA secondary structure or sequence alignment, the discrete unknowns are binary and the constraints have the characteristic form Σ_{i∈J} θ_{i} ≤ 1, where *J* ⊂ {1, …, *n*}. Since, at most, one position in *J* can be matched, we say that *J* is restricted by an *exclusivity* constraint. This implies that the events θ_{i} = 1, *i* ∈ *J*, are mutually exclusive, so their marginal probabilities sum to at most one and at most one of them can exceed one half. Therefore, we can reduce the problem to either selecting the alternative that has a probability greater than a half or, if none exists, assigning zero to all alternatives of this constraint. This would always yield a feasible centroid estimator. Formally, a more general result is available:

### Theorem 2.

*If* Θ ⊂ {0,1}^{n} *is such that* θ ∈ Θ *satisfies a set of conditions* {*C*_{k}}_{k = 1}^{K} *of the form C*_{k}: Σ_{i∈J_k} θ_{i} ≤ 1, *where* {*J*_{k}}_{k = 1}^{K} *is a collection of index sets* (*J*_{k} ⊂ {1,…,*n*}, 1 ≤ *k* ≤ *K*), *then* $\widehat{\theta}^{*}_{C}$ *also satisfies each condition C*_{k}, 1 ≤ *k* ≤ *K*, *that is*, $\widehat{\theta}^{*}_{C}$ = $\widehat{\theta}$_{C}. [*See supporting information (SI) Appendix for the proof*.]

Theorem 2 shows that, for problems in this class, consensus estimates will satisfy the original problem's constraint set even if the constraints overlap, and thus are centroid estimators. For example, in the sequence alignment problem there are two essential sets of constraints. First, if we view each solution as an array of binary variables θ_{ij}, 1 ≤ *i* ≤ *n* and 1 ≤ *j* ≤ *m* for sequences of lengths *n* and *m*, then each character in the first sequence should match with at most one character in the second sequence, and vice versa: Σ_{j = 1}^{m} θ_{ij} ≤ 1 for 1 ≤ *i* ≤ *n* and Σ_{i = 1}^{n} θ_{ij} ≤ 1 for 1 ≤ *j* ≤ *m*. The second set consists of collinearity constraints that prohibit the crossing of aligned character pairs: θ_{ij} + θ_{kl} ≤ 1 for 1 ≤ *i* < *k* ≤ *n* and 1 ≤ *l* < *j* ≤ *m*. Because these are all exclusivity constraints, Theorem 2 applies.
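The alignment constraints can be illustrated on a very small case. The sketch below assumes two sequences of length 2, uses 0-based indices, and represents an alignment as a set of matched pairs (*i*, *j*); the posterior weights and the helper names `is_feasible` and `pair_marginals` are hypothetical. It thresholds the pair marginals at one half and checks that the result satisfies both the one-match and the collinearity constraints, as Theorem 2 leads one to expect.

```python
def is_feasible(pairs):
    """Check the exclusivity constraints of a pairwise alignment."""
    rows = [i for i, _ in pairs]
    cols = [j for _, j in pairs]
    # Each character matches at most one character of the other sequence.
    if len(set(rows)) != len(rows) or len(set(cols)) != len(cols):
        return False
    # Collinearity: matched pairs (i, j) and (k, l) with i < k and l < j cross.
    return not any(i < k and l < j for (i, j) in pairs for (k, l) in pairs)

def pair_marginals(post):
    """Posterior probability that each pair (i, j) is aligned."""
    marg = {}
    for alignment, p in post.items():
        for pair in alignment:
            marg[pair] = marg.get(pair, 0.0) + p
    return marg

# Hypothetical posterior over feasible alignments of two length-2 sequences.
post = {
    frozenset({(0, 0), (1, 1)}): 0.35,
    frozenset({(0, 0)}): 0.25,
    frozenset({(0, 1)}): 0.20,
    frozenset({(1, 1)}): 0.10,
    frozenset(): 0.10,
}

marg = pair_marginals(post)
map_alignment = max(post, key=post.get)

# Consensus rule for binary unknowns: keep a pair if its marginal exceeds 1/2.
consensus = frozenset(pair for pair, p in marg.items() if p > 0.5)

print(sorted(map_alignment), sorted(consensus), is_feasible(consensus))
```

Here the most probable alignment contains two matched pairs, while the consensus keeps only the pair whose marginal exceeds one half; both are feasible.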

Consider next *p*th power loss functions. These loss functions cover the broad class of loss functions that minimize *p*th order centered moments, including the important special case of the expected second centered moment. For categorical variables we can adopt a suitable binary representation to obtain the following result:

### Theorem 3.

$\widehat{\theta}$_{C} *is the posterior pth power loss risk minimizer*. (*See* *SI Appendix* *for the proof*.)

The estimator $\widehat{\theta}$_{C} minimizes the expected second moment centered around itself and so it is analogous to a multidimensional mean. As a matter of fact, under the same representation as before, it is the closest point to the mean.

### Theorem 4.

$\widehat{\theta}$_{C} *minimizes the squared distance to the posterior mean*. (*See* *SI Appendix* *for the proof*.)

Because this estimator is nearest to the center of mass of the posterior space, we call $\widehat{\theta}$_{C} the *centroid estimator.* Moreover, because $\widehat{\theta}$_{C} depends on the distance to other points and their probabilities, its behavior is quite different from $\widehat{\theta}$_{MAP}: it seeks to find a point that minimizes the posterior-weighted distance to all points in the ensemble instead of choosing the single highest peak in the space. Thus, it pools the data's evidence from all points in the solution space.
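The relation between the Hamming risk and the distance to the posterior mean can be checked on a small case. This is a sketch under stated assumptions: the four-point binary posterior is hypothetical, and the solution space is taken to be its support. For each feasible point it computes the posterior Hamming risk and the squared Euclidean distance to the posterior mean, and compares the two minimizers.

```python
# Hypothetical posterior; the solution space is assumed to be its support.
post = {
    (1, 1, 0, 0): 0.32,
    (0, 1, 1, 0): 0.30,
    (0, 1, 0, 1): 0.20,
    (0, 0, 1, 1): 0.18,
}
n = 4

# Posterior mean: for binary variables, component i is P(theta_i = 1 | S).
mean = [sum(p * theta[i] for theta, p in post.items()) for i in range(n)]

def hamming_risk(candidate):
    """Posterior-expected number of components that differ from candidate."""
    return sum(p * sum(c != t for c, t in zip(candidate, theta))
               for theta, p in post.items())

def sq_dist_to_mean(candidate):
    """Squared Euclidean distance from candidate to the posterior mean."""
    return sum((c - m) ** 2 for c, m in zip(candidate, mean))

by_risk = min(post, key=hamming_risk)
by_distance = min(post, key=sq_dist_to_mean)
map_estimate = max(post, key=post.get)

print(by_risk, by_distance, map_estimate)
```

For binary components the two criteria differ only by a term that does not depend on the candidate, so both select the same point, which in this example is not the most probable one.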

### Do These Alternative Estimators Offer Improved Representation of the Data in Practice?

Ding *et al.* (18) also applied centroid estimators to the characterization of messenger RNAs. In this study they showed that the variance about the ML estimate, the minimum free-energy (MFE) structure, was on average 66% greater than the variance around the centroid estimate, indicating that posterior space is often asymmetric and that the most likely structure was often far from the center of mass of the posterior space. Although not as extreme as the illustrative example in Fig. 1, Fig. 2 shows an example in which the most likely structure, the MFE, lies on the periphery of the posterior space. Ding *et al.* (18) also showed that the posterior space often contained multiple clusters, and that the most likely structure was not in the largest cluster for 55 of the 100 mRNAs in their study.

### Do These Alternative Estimators Predict Known Reference Cases Better?

We know of three applications of centroid estimators that have compared their predictions of known references to those of ML estimators. Ding *et al.* (12) made such a comparison by using an energy model that was identical to the model used by the most popular RNA structural prediction web server mfold (6) that predicts the most likely structure. Thus, their comparisons contrast directly only the two estimators. They found that, on average, the predicted base pairs of centroids have 30% fewer prediction errors (positive predictive value improvement) than those in the most likely structure, while also correctly predicting 3.5% more base pairs (sensitivity improvement). By using a different set of free-energy parameters (19), Mathews also showed that consensus estimators, thus centroid estimators, of RNA secondary structure improve positive predictive values by, on average, eight percentage points compared with the MFE structure (20).

In an article on the reliable alignment of protein sequences, Miyazawa (9) used a probabilistic model to identify the marginal probabilities of pairs of aligned protein residues and used a consensus, and thus centroid, estimator to estimate an alignment. He compared these alignments with optimal alignments, which correspond to the most probable alignments, in their ability to predict the x-ray crystal structure of one protein from that of the other for each of 109 pairs of proteins. He found that the most probable alignments predicted the reference gold-standard crystal structures better than centroid alignments by at least 0.25 Å root mean square deviation (rmsd) in only 4 of the 109 protein pairs. For these four, the most probable alignment was, on average, 0.41 Å rmsd closer to the reference structure than the centroid alignment. However, he found 29 pairs for which the centroid alignment predicted the reference structure better than the most probable alignment by at least 0.25 Å with an average improvement of 0.81 Å rmsd, thus demonstrating the centroid estimator's ability to improve the prediction of protein structure.

The computational identification of the locations of the regulatory sites of genes is another important area of study in genomics. Algorithms for this purpose are commonly known as motif-finding algorithms. Recently, Newberg *et al.* (21) developed a Gibbs sampling algorithm that seeks to identify regulatory sites by using the sequences from multiple related species. In this study they showed that centroid solutions consistently outperformed ML estimators. For example, in their simulation study of 1,000-bp-long sequences from five yeast species, they found that the centroid estimator made from 11% to 35% fewer prediction errors than the ML estimators with equal or better sensitivity. The larger differences occur when sites are more difficult to identify.

## Conclusions

ML estimators have dominated prediction and estimation for years. Our results indicate that this paradigm has serious theoretical and practical limitations, and that there are better alternatives. Specifically, by using statistical decision theory with loss functions that incur increased penalties with increasing difference in their components, we develop centroid estimators. These estimators center themselves in the posterior-weighted ensemble by balancing the forces of the members based on their posterior probabilities and their component-wise distances from the centroid estimator. Given the findings in three computational biology applications that centroid estimators substantially improve prediction of ground-truth reference sets without modification of the underlying probabilistic model and, perhaps more importantly, given their principled foundation, centroid estimators offer a promising avenue for improved estimation and prediction in discrete high-D inference problems that are becoming increasingly common in the twenty-first century.

Additional reports also show the utility of exploring the full ensemble of solutions. For example, Bradley *et al.* (7) in studies of protein structure prediction show that it is useful to sample the ensemble of solutions to identify probable energy wells. In CONTRAfold (22), conditional log-linear models are used to specify a probability distribution for RNA secondary structures conditional on RNA sequence. RNA structure estimation is then defined by the maximization of the expected accuracy, where accuracy weights correctly paired positions by a sensitivity/specificity trade-off parameter γ; when γ = 1 the estimator is the centroid estimator. Hartemink *et al.* (14) found posterior model averaging useful after the application of simulated annealing to visit high-scoring regions in inference of genetic regulatory networks, although our experience differs somewhat from theirs. In our experience any use of preliminary optimization such as simulated annealing is detrimental (21). Also, Zhang and Liu (23) found that the incorporation of a side-chain entropy term in a simple free-energy function significantly improved the discrimination of native protein structures from decoys.

Centroid estimation has many other potential applications outside computational molecular biology. A common area of application of high-D discrete inference is variable selection in which discrete choices are made for inclusion of variables in a model. For example, Casella and Moreno (24) treat variable selection in normal regression models through the use of intrinsic priors and select the model with highest posterior probability. Smith and Fahrmeir (25) consider model selection in functional magnetic resonance imaging analysis with indicator variables for inclusion of regressors defined on a lattice, and use marginal probabilities for variable selection. Tadesse *et al.* (26) formulate a clustering problem by using a multivariate normal mixture model. Observations are allocated to classes according to the mode of a marginal posterior distribution, and variables are selected if their marginal posterior probability is above a threshold *a*; if *a* = 1/2 they have a centroid estimator.

Several caveats are appropriate. Since at this stage only a few applications have been examined, the assessment of how well these estimators will predict reference results in other settings awaits further study. The estimators developed here are appropriate in the important set of problems involving categorical variables. However, when discrete spaces involve ordinal or interval variables, estimators based on other loss functions that still achieve the goal of centering estimates in the posterior space may be more appropriate.

Ding *et al.* (18) showed evidence of multimodal posterior ensembles. When the clusters within these spaces are well separated, no single solution, including the centroid, is likely to represent the posterior space well. In such cases, multiple centroids, one for each cluster, may be required (12, 18). An alternate explanation for the results of Ding *et al.* (12) and Mathews (19) showing that there are better predictors of RNA structure than the minimum free-energy structure is that the energy model of these two works is incomplete. Thus, if evolution selects sequences that will adopt the minimum-energy states as the native states, the minimum of the incomplete (secondary structure) energy models of these two studies may not correspond with this overall energy minimum, and thus yield an incorrect prediction.

Although these estimators represent the posterior space in a defined manner, left unaddressed is the question of how representative a proposed estimator is of the specific ensemble from which it is drawn, thus leaving for future development the need to report how far the “true” state of nature may be from the proposed estimate. Whereas our findings on feasibility cover several important cases, for other cases further steps may be required to ensure feasibility. Dynamic programming offers a promising avenue to obtain these estimates while satisfying the constraints of the underlying problem (15, 21). Moreover, whereas for most constraints maintaining feasibility is important, it may not always be desirable to maintain all constraints. For example, it may be desirable to relax constraints imposed to achieve computability such as the no-pseudo-knot constraints of the RNA secondary structure algorithms. Also, centroid estimators focus on making reliable predictions. In the extreme case, posterior space is so widely dispersed that the centroid estimator is empty, for example, with no margins >0.5, reflecting the fact that the data provide no support for any prediction. An estimation procedure that forces a result in this circumstance is available (15).

As we have shown in several important cases, centroid estimates can be derived from the marginal probabilities of solution components. However, obtaining marginal distributions is often a hard problem. In such cases, approximation methods such as variational Bayes (27, 28) and Markov chain Monte Carlo (MCMC) (16) can be applied. However, many problems present a sufficiently complicated joint probability structure to render ML estimation intractable by direct optimization. In such instances, it is common to apply sampling methods, such as MCMC. Given such a sample, estimating the marginal distributions can usually be completed with linear complexity in the number of solution components, whereas obtaining the global maximum for ML estimation usually requires a more computationally intensive sampling approach, such as simulated annealing (29).
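Estimating marginals from a posterior sample amounts to counting. The following is a minimal sketch assuming a small, hypothetical list of binary draws standing in for MCMC output; the function name `centroid_from_samples` is illustrative. A single pass accumulates per-component frequencies, and thresholding them at one half gives the estimate for binary ensembles that are unconstrained or restricted only by exclusivity constraints.

```python
from collections import Counter

def centroid_from_samples(samples):
    """Estimate component marginals by counting, then threshold at 1/2.

    One pass over the sample: the work per draw is linear in the number
    of solution components.
    """
    n = len(samples[0])
    counts = [0] * n
    for s in samples:
        for i, bit in enumerate(s):
            counts[i] += bit
    freqs = [c / len(samples) for c in counts]
    return tuple(int(f > 0.5) for f in freqs), freqs

# Hypothetical draws from a posterior over {0,1}^4.
samples = (
    [(1, 1, 0, 0)] * 3
    + [(1, 0, 1, 0)] * 2
    + [(0, 0, 1, 0)] * 2
    + [(1, 0, 1, 1)]
)

estimate, freqs = centroid_from_samples(samples)

# Most frequently sampled configuration, a sample-based stand-in for the mode.
most_frequent = Counter(samples).most_common(1)[0][0]

print(freqs, estimate, most_frequent)
```

In this sample the thresholded estimate differs from the most frequently drawn configuration.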

Centroid and ML estimators diverge more as the complexity of the probability space grows and becomes more multimodal, structured, and correlated. When the consensus estimator is the centroid estimator, the converse is easily seen, because, for probability spaces where each dimension is independent of the others, to maximize the joint distribution is equivalent to maximizing the marginals and so the two estimators always coincide.
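The independent case can be verified by enumeration. This sketch assumes four independent binary components with hypothetical marginals `q`; it builds the joint distribution as a product of the marginals and checks that the joint maximizer coincides with the component-wise maximizer.

```python
from itertools import product
from math import prod

# Hypothetical marginals P(theta_i = 1) for four independent components.
q = [0.9, 0.2, 0.6, 0.45]

def joint(theta):
    """Joint probability under independence: a product of marginals."""
    return prod(qi if t == 1 else 1 - qi for qi, t in zip(q, theta))

space = list(product((0, 1), repeat=len(q)))

# Maximizer of the joint distribution.
map_estimate = max(space, key=joint)

# Component-wise maximizer of the marginals.
consensus = tuple(int(qi > 0.5) for qi in q)

print(map_estimate, consensus)
```

Because each factor of the product can be maximized separately, the two estimates agree whenever no marginal equals exactly one half.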

Rapid improvements in data-acquisition technologies promise to continue to dramatically increase the pool of data in many fields. Although these data will be of great benefit, they also have opened a new universe of high-D inference and prediction problems that will likely provide major data analytic challenges in the coming decades. Among these is the development of point estimators in discrete spaces that are the focus of the centroid estimators developed here. But the more general point estimation challenge is to find one or a small number of feasible solutions among the many in the ensemble that are by some appropriate measure representative of the full ensemble and suitable for the data structural features of the solution space. These new high-D data and unknowns will also almost certainly force a reexamination of extant approaches to interval estimation, hypothesis tests, and predictive inference. In important ways these new challenges hark back to the early days of statistical physics in the age of Newtonian mechanics. For here again we are confronted with large ensembles and entropic effects arising from their sheer size. But here it is often insufficient to deliver only distributions and averages for low-dimensional features, but rather specific high-D results are often demanded. Thus, centroid point estimates are only a small step into the challenges being driven by the rapid advances of data-acquisition technology.

## ACKNOWLEDGMENTS.

We thank Profs. Don McClure, Dan Weinreich, Richard Stratt, Ben Raphael, and Stuart Geman from Brown University and Drs. Lee Newberg, Ye Ding, and Clarence Chan of the Wadsworth Center (Albany, NY) for useful discussions and suggestions. This work was supported by Department of Energy Grant DE-FG02-04ER63942 and National Institutes of Health Grant R01-HG01257 (to C.E.L.) and by the Center for Computational Molecular Biology at Brown University.

## Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0712329105/DC1.

## References

**National Academy of Sciences**


Centroid estimation in discrete high-dimensional spaces with applications in biology. Proceedings of the National Academy of Sciences of the United States of America. 2008 Mar 4; 105(9):3209.
