PLoS One. 2012; 7(2): e30126.
Published online Feb 3, 2012. doi:  10.1371/journal.pone.0030126
PMCID: PMC3272020

Dirichlet Multinomial Mixtures: Generative Models for Microbial Metagenomics

Jack Anthony Gilbert, Editor

Abstract

We introduce Dirichlet multinomial mixtures (DMM) for the probabilistic modelling of microbial metagenomics data. These data can be represented as a frequency matrix giving the number of times each taxon is observed in each sample. The samples vary in size, and the matrix is sparse, as communities are diverse and skewed towards rare taxa. Most methods previously used to classify or cluster samples have ignored these features. We describe each community by a vector of taxon probabilities. These vectors are generated from one of a finite number of Dirichlet mixture components, each with different hyperparameters. Observed samples are generated through multinomial sampling. The mixture components cluster communities into distinct ‘metacommunities’ and hence determine envirotypes or enterotypes: groups of communities with similar composition. The model can also deduce the impact of a treatment and be used for classification. We wrote software for fitting DMM models using the ‘evidence framework’ (http://code.google.com/p/microbedmm/), including a Laplace approximation of the model evidence. We applied the DMM model to human gut microbial genus frequencies from obese and lean twins. From the model evidence, four clusters fit these data best. Two clusters were dominated by Bacteroides and were homogeneous; two had a more variable community composition. We could not find a significant impact of body mass on community structure. However, obese twins were more likely to derive from the high-variance clusters. We propose that obesity is not associated with a distinct microbiota but instead increases the chance that an individual derives from a disturbed enterotype. This is an example of the ‘Anna Karenina principle’ (AKP) applied to microbial communities: disturbed states have many more possible configurations than undisturbed ones. We verify this by showing that in a study of inflammatory bowel disease (IBD) phenotypes, ileal Crohn's disease (ICD) is associated with a more variable community.

Introduction

Next generation sequencing, applied to microbial metagenomics, has transformed the study of microbial diversity. Microbial metagenomics, or sequencing of DNA extracted from microbial communities, provides a means to determine what organisms are present without the need for isolation and culturing, which can access less than 1% of the species in a typical environment [1]. Prior to next generation sequencing, individual DNA fragments from a sample were cloned and then Sanger sequenced [2] – a procedure that is slow and expensive on a per-read basis. Direct next generation sequencing, for example 454 pyrosequencing [3] or Illumina [4], is cheaper and faster, which has allowed much larger studies of microbial diversity, with more reads in total and with more communities sampled. However, the development of statistics to extract ecologically meaningful information from these data sets has not kept pace with the experimental methodology. In particular, tools that can account for the discrete nature, sparsity, and variable size of these data sets are lacking. We propose the Dirichlet multinomial mixture as a generative modelling framework that addresses this need.

Broadly, microbial metagenomics data can be of two types: amplicons or shotgun metagenomics. Amplicons are generated by PCR amplification of a specific marker gene region – typically a variable region from the 16S rRNA gene – prior to sequencing, so that the data consist of reads from homologous genes in different organisms. In shotgun metagenomics, DNA is fragmented in some way and those fragments are sequenced, generating reads from throughout the genomes of the different community members. For both amplicons and shotgun reads it is possible to classify sequence reads against known taxa, and determine a list of the organisms that are present and the read frequency associated with them [5]. For the majority of environments, many organisms will not have been taxonomically classified and sequenced before, in which case the list of taxa may have to be generated at a low-resolution phylogenetic level, e.g. phylum, to achieve a reasonable proportion of classified reads. Alternatively, an unsupervised strategy can be used to identify proxies for traditional taxonomic units by clustering sequences, so-called Operational Taxonomic Units (OTUs) [6]. This is commonly performed in the case of homologous marker genes from amplicons but can also be applied to shotgun metagenomics data [7]. Whether supervised or unsupervised approaches are used, the end result is the same: a community is represented by a list of types, either taxa or OTUs, and their frequency. For shotgun metagenomics data much more analysis is possible, utilising information about the function of the genes that are sequenced, but here we will focus on the analysis of community structure generated by microbial metagenomics. Typically, this will be generated as amplicons, most often 454 pyrosequenced, but we emphasise that the approach can be applied to any list of taxa or OTUs with discrete abundances.

Early studies of microbial communities focussed on cataloguing diversity in individual samples, asking: how many different taxa or OTUs were present [8], [9]? A striking result was that the observed diversity was very high, and that most species were observed at low abundance; this phenomenon has been termed the ‘rare biosphere’ [8]. These early studies ignored the impact of sequencing and PCR errors, which can inflate OTU diversities [10], but even after the application of algorithms capable of removing those errors [11], observed diversities remain high in most environments and abundance distributions are still skewed towards rare taxa in almost all [10], [12]. The consequence is that even with very large read numbers we will have sampled only a fraction of the true diversity [13].

The natural extension to examining the diversity in an individual sample is to look at patterns across samples from similar environments. Barcoding allows multiple samples to be sequenced in a single run, but difficulties in quantifying DNA concentration mean that the number of reads from each sample will usually vary substantially [14]. Sub-sampling can be used to reduce all samples to the same size, but that inevitably throws away large amounts of meaningful data. The majority of studies have used exploratory statistics to search for natural patterns in the data – unsupervised learning again. A common strategy is to use multivariate ordination techniques, where samples are positioned in a space of reduced dimensionality so as to preserve the distances between them in the original higher-dimensional space; often two- or three-dimensional ordinations are used, and then it is possible to look for patterns by eye. A classic example of an ordination method is principal components analysis (PCA), which generates new dimensions that are linear combinations of the original, chosen so as to preserve the Euclidean distance between samples [15]. Euclidean distances are not very appropriate for microbial community analysis; much better is to use measures that incorporate the phylogenetic divergence between types, e.g. Unifrac [16]. Ordination can be performed with arbitrary distance metrics using multidimensional scaling methods; these can be either metric, in that they preserve distances, or non-metric, in that they preserve the ranking of the distances. An example of metric multidimensional scaling is principal coordinates analysis, which has proven a useful and popular tool when coupled with Unifrac for exploratory data analysis [17].

Clustering is another means of exploratory data analysis, which searches for natural groups or partitions in the samples. Hierarchical clustering, where a tree of relationships is generated without explicitly grouping samples unless an arbitrary cut-off is chosen, is quite commonly used in microbial community analyses; partitional clustering, where the samples are divided into groups, has traditionally been less popular. This may be because of the need to decide a priori how many clusters are present. Generally, variants of the K-means algorithm have been used, together with heuristics to decide how good a clustering is. To date there has been no model-based clustering of microbial community data. The question of the natural number of types of communities has received particular attention recently in the context of the human gut, for which it has been suggested that three microbial community types, known as envirotypes (or, in the context of the gut, enterotypes), are to be found [18]. Classification, or supervised learning, is closely related to clustering, except that here the problem is not to find natural groups in the data but to predict the group of a new sample, given a labelling of samples in a training data set. Two studies applying classification methods to microbial communities have appeared recently [19], [20]. Most of the algorithms used were, as for the unsupervised approaches, developed for continuous data, with the notable exception of the multinomial naive Bayes (MNB) model in Knights et al. (2011) [20].

There are, however, problems inherent in using standard multivariate techniques for the analysis of microbial metagenomics data. The data, even if normalised into relative abundances, are fundamentally discrete and can only be approximately modelled by continuous variables. In addition, the high diversity (relative to sampling effort) results in very sparse data sets; most taxa appear in only a few samples at low abundance. Finally, the samples vary in read number: a small sample will inherently be noisier than a larger one. All these issues can be addressed using an explicit sampling scheme. Instead of viewing the sample as representing the community, we view it as having been generated by sampling from the community. The most natural assumption to make is sampling with replacement, so that the likelihood of an observed sample is a multinomial distribution with a parameter vector in which a given entry represents the probability that a read is from a given taxon. In the limit of very large community sizes these probabilities become the relative frequencies of the taxa. This provides a discrete model that accounts for different sample sizes and can model sparse data.

We will show how this multinomial sampling can be used as a starting point for a generative modelling framework, one that explicitly describes a model for generating the observed data [21]. This provides model-based alternatives for both clustering and classification of microbial communities. The natural prior for the parameters of the multinomial distribution is the Dirichlet, a probability distribution over probability vectors. In the context of microbial communities we can view it as describing a metacommunity from which communities can be sampled. Its parameters then describe both the mean expected community and the variance in the communities. As we will show, one of the major advantages of the Dirichlet prior is that the unobserved community parameter vectors can be integrated out, or marginalised, to give an analytic solution for the evidence: the probability that the data was generated by the model. By extending the Dirichlet prior to a mixture of Dirichlets [22]–[24], so that the data set is generated not by a single metacommunity but by a mixture of multiple metacommunities, we obtain both a more flexible model for our data and a means to cluster communities. To perform the clustering, we simply impute for each sample the component that is most likely to have generated it. This separates samples into groups according to the metacommunity each has the highest probability of deriving from. The advantage of this approach over simple K-means type strategies is twofold: (1) the clusters can be of different sizes, depending on the variability of the metacommunity, and, more importantly, (2) because we now have an explicit probabilistic model that is appropriate to the data, we can use the evidence, together with methods to penalise model complexity, to provide a rigorous means of determining the optimal cluster number.

Multinomial sampling has been used previously in the study of microbial communities [20], and it has been coupled with a Dirichlet prior [25], but the extension of that prior to a mixture of Dirichlet components in this context is completely novel, as is the explicit association of each Dirichlet component with a different metacommunity. The major challenge for our framework is how to fit the Dirichlet mixture given the very large dimensionality of microbial metagenomics data sets. This makes Gibbs sampling to obtain posterior distributions for the Dirichlet parameters challenging, at least for OTU-based data sets. Instead, we utilise the analytic form of the evidence and fit the Dirichlet parameters by maximising it, given a gamma hyperprior distribution for those parameters; this is an example of the ‘evidence framework’ [26]. In practice, this is achieved by coupling an expectation-maximisation (EM) algorithm for the Dirichlet mixture parameters with multi-dimensional optimisation of each component's parameters. To answer the crucial question of model fit, we use a Laplace approximation to integrate out the hyperparameters and estimate the evidence of the complete model. In contrast, the extension to a classifier is relatively simple: we fit the model to the different classes, estimate priors as the frequencies of the classes in the training data, and then use Bayes' theorem to calculate the probability that a sample to be classified was generated from each of the classes. We now explain the model framework in more detail and illustrate its utility by application to two example data sets of human gut microbiota [27], [28].
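The classification step described above reduces to Bayes' theorem once per-class evidences are in hand. The following is a minimal sketch, not the paper's implementation: `classify` is a hypothetical helper that takes, for one new sample, the log evidence under a DMM fitted to each class and the log class priors (class frequencies in the training data), and returns normalised posterior class probabilities.

```python
import numpy as np

def classify(log_evidence_per_class, log_class_priors):
    """Bayes' theorem on a log scale: posterior probability that a new
    sample was generated by each class-specific DMM model.

    log_evidence_per_class : log P(x | class c) for each class c, each
        obtained from a DMM fitted to that class's training data.
    log_class_priors : log of the class frequencies in the training set.
    """
    log_post = log_evidence_per_class + log_class_priors
    log_post -= np.max(log_post)        # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical numbers: the second class explains the sample far better.
posterior = classify(np.array([-120.0, -110.0]), np.log([0.5, 0.5]))
```

Working in log space is essential here: the evidences of high-dimensional count data underflow double precision long before the posterior ratios become uninformative.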

Materials and Methods

Multinomial sampling

Our starting point is a matrix of occupancies $\mathbf{X}$ with elements $x_{ij}$ that give the observed abundance of taxon $j$ in community sample $i$, where $j$ runs from 1 to the total number of taxa $S$, and $i$ from 1 to the total number of communities $N$. We will denote the rows of this matrix, which give the occupancies in each individual community sample, by the $S$-dimensional vectors $\mathbf{x}_i$. We assume that each community sample is generated from a multinomial distribution with parameter vector $\mathbf{p}_i$. The elements of $\mathbf{p}_i$, $p_{ij}$, are the probabilities that an individual read taken from community $i$ belongs to species $j$. The multinomial distribution corresponds to sampling with replacement from the community. This gives a likelihood for observing each community sample:

$$P(\mathbf{x}_i \mid \mathbf{p}_i) = \frac{J_i!}{\prod_{j=1}^{S} x_{ij}!} \prod_{j=1}^{S} p_{ij}^{x_{ij}} \qquad (1)$$

where $J_i = \sum_{j=1}^{S} x_{ij}$ is the total number of reads from community $i$. The total likelihood is the product of the community sample likelihoods:

$$P(\mathbf{X} \mid \mathbf{p}_1, \ldots, \mathbf{p}_N) = \prod_{i=1}^{N} P(\mathbf{x}_i \mid \mathbf{p}_i)$$
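As an illustrative sketch (not the paper's own implementation), the log of the multinomial likelihood in Equation 1 can be evaluated stably with log-Gamma functions, using the identity $\log J! = \mathrm{gammaln}(J+1)$; `multinomial_log_likelihood` is a hypothetical helper name.

```python
import numpy as np
from scipy.special import gammaln

def multinomial_log_likelihood(x, p):
    """Log of Equation 1: probability of count vector x given community
    taxon probabilities p, under sampling with replacement.  gammaln
    avoids overflow in the factorials for realistic read counts."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    J = x.sum()                                   # total reads in the sample
    return gammaln(J + 1) - gammaln(x + 1).sum() + (x * np.log(p)).sum()
```

For example, with counts (2, 1, 0) and probabilities (0.5, 0.3, 0.2) the likelihood is $3 \times 0.5^2 \times 0.3 = 0.225$.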

Dirichlet mixture priors

In a Bayesian approach we now need to define a prior distribution for the multinomial parameter probability vectors $\mathbf{p}_i$. We will refer to these as ‘communities’, since they reflect the underlying structure of the community $i$ that is sampled. A prior based on the Dirichlet distribution is natural, as it is conjugate to the multinomial and (as we will discuss) has a number of convenient properties. The Dirichlet is a probability distribution over distributions:

$$\mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha}) = \frac{\Gamma\!\left(\sum_{j=1}^{S} \alpha_j\right)}{\prod_{j=1}^{S} \Gamma(\alpha_j)} \left[\prod_{j=1}^{S} p_j^{\alpha_j - 1}\right] \delta\!\left(1 - \sum_{j=1}^{S} p_j\right) \qquad (2)$$

This distribution has $S$ parameters, which we can represent as a vector $\boldsymbol{\alpha}$ that is a measure, i.e. all elements are strictly positive, $\alpha_j > 0$. We can express $\boldsymbol{\alpha} = \theta \mathbf{m}$, where $\theta = \sum_{j=1}^{S} \alpha_j$ and $\mathbf{m}$ is a normalised measure with $\sum_{j=1}^{S} m_j = 1$. The elements $m_j$ then give the mean $p_j$ values, and $\theta$ acts like a precision, determining how closely the values lie to that mean: a large $\theta$ gives little variance about the mean values, while a small $\theta$ leads to widely distributed samples. Conceptually we view these parameters as describing a ‘metacommunity’, from which different communities can be sampled. The Dirac delta function ensures normalisation, i.e. $\sum_{j=1}^{S} p_j = 1$.
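The two-stage generative picture described above – draw a community from the metacommunity, then draw reads from the community – can be sketched directly with numpy's samplers. The metacommunity composition `m` and precision `theta` below are illustrative values, not fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 5-taxon metacommunity: alpha = theta * m, where m is the
# mean community composition and theta the precision.
m = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
theta = 50.0                 # large theta: communities cluster tightly around m
alpha = theta * m

# Stage 1: sample a community p_i from the Dirichlet metacommunity.
p_i = rng.dirichlet(alpha)
# Stage 2: sample reads from p_i with replacement (multinomial, Equation 1).
x_i = rng.multinomial(1000, p_i)   # 1000 reads from community i
```

Decreasing `theta` while holding `m` fixed leaves the expected composition unchanged but spreads the sampled communities much more widely around it, which is exactly the high-variance behaviour the paper later associates with disturbed enterotypes.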

To provide a more flexible modelling framework and to allow clustering, we extend this single Dirichlet prior to a mixture of $K$ Dirichlets, indexed $k = 1, \ldots, K$, each with parameters $\boldsymbol{\alpha}_k$ and weight $\pi_k$ [22], [23]. Each community vector $\mathbf{p}_i$ is assumed to derive from a single metacommunity. For each sample $i$, we represent this using a $K$-dimensional indicator vector $\mathbf{z}_i$ that consists of zeros except for the entry corresponding to the metacommunity that sample $i$ derives from, which is equal to one. The prior probabilities for the vectors $\mathbf{z}_i$ are then just the mixture weights, so:

$$P(\mathbf{z}_i \mid \boldsymbol{\pi}) = \prod_{k=1}^{K} \pi_k^{z_{ik}} \qquad (3)$$

and the complete mixture prior is:

$$P(\mathbf{p}_i \mid \boldsymbol{\Theta}) = \sum_{k=1}^{K} \pi_k \, \mathrm{Dir}(\mathbf{p}_i \mid \boldsymbol{\alpha}_k) \qquad (4)$$

where the Dirichlet distribution is given by Equation 2, and the mixture prior hyperparameters are $\boldsymbol{\Theta} = \{\pi_1, \ldots, \pi_K, \boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_K\}$.

The numerical behaviour of the model can be improved by placing independent and identically distributed Gamma hyperpriors on the Dirichlet parameters $\alpha_{kj}$, i.e. $\alpha_{kj} \sim \mathrm{Gamma}(\eta, \nu)$. Thus,

$$P(\alpha_{kj}) = \frac{\nu^{\eta}}{\Gamma(\eta)} \, \alpha_{kj}^{\eta - 1} \, e^{-\nu \alpha_{kj}} \qquad (5)$$

As we will later use the reparameterisation $\alpha_{kj} = e^{\lambda_{kj}}$, the change of variables formula for probability density functions can be used to convert the prior for $\alpha_{kj}$ into one for $\lambda_{kj}$, which yields the result that:

$$P(\lambda_{kj}) = \frac{\nu^{\eta}}{\Gamma(\eta)} \, e^{\eta \lambda_{kj}} \exp\!\left(-\nu e^{\lambda_{kj}}\right) \qquad (6)$$
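The change of variables from Equation 5 to Equation 6 just multiplies the Gamma density by the Jacobian $d\alpha/d\lambda = e^{\lambda}$. This is easy to check numerically; the hyperparameter values below are illustrative choices, not the ones used in the paper.

```python
import numpy as np
from scipy.stats import gamma
from scipy.special import gammaln

eta, nu = 1.0, 1.0           # illustrative hyperparameter choices

def log_p_alpha(a):
    """Equation 5: Gamma(eta, nu) log-density for a Dirichlet parameter."""
    return eta * np.log(nu) - gammaln(eta) + (eta - 1) * np.log(a) - nu * a

def log_p_lambda(lam):
    """Equation 6: log-density of lambda = log(alpha) after the change of
    variables, i.e. Equation 5 plus the log Jacobian, log e^lambda = lambda."""
    return eta * np.log(nu) - gammaln(eta) + eta * lam - nu * np.exp(lam)

# The two densities agree once the Jacobian factor is included:
lam = 0.3
jacobian_check = log_p_lambda(lam) - (log_p_alpha(np.exp(lam)) + lam)
```

The same check against scipy's built-in Gamma density (with `scale = 1/nu`) confirms that Equation 5 is parameterised with a rate, not a scale.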

Posterior distribution of the multinomial parameters

The posterior distribution of the community parameters is obtained by multiplying the Dirichlet mixture prior by the multinomial likelihood (Equation 1) and appropriately normalising, to give for community $i$:

$$P(\mathbf{p}_i \mid \mathbf{x}_i, \boldsymbol{\Theta}) = \frac{P(\mathbf{x}_i \mid \mathbf{p}_i) \sum_{k=1}^{K} \pi_k \, \mathrm{Dir}(\mathbf{p}_i \mid \boldsymbol{\alpha}_k)}{P(\mathbf{x}_i \mid \boldsymbol{\Theta})} \qquad (7)$$

The Dirichlet is a conjugate prior for the multinomial: for a single Dirichlet the posterior is itself a Dirichlet, with parameters obtained by summing the observed counts and the Dirichlet parameters, $\alpha_j + x_{ij}$. For the Dirichlet mixture this conjugacy is maintained, and Equation 7 can also be written as a Dirichlet mixture:

$$P(\mathbf{p}_i \mid \mathbf{x}_i, \boldsymbol{\Theta}) = \sum_{k=1}^{K} P(z_{ik} = 1 \mid \mathbf{x}_i, \boldsymbol{\Theta}) \, \mathrm{Dir}(\mathbf{p}_i \mid \boldsymbol{\alpha}_k + \mathbf{x}_i) \qquad (8)$$

We will discuss the calculation of the posterior probabilities $P(z_{ik} = 1 \mid \mathbf{x}_i, \boldsymbol{\Theta})$, for a sample deriving from a metacommunity, below.
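Within one mixture component, the conjugate update in Equation 8 is a one-liner: add the observed counts to the component's Dirichlet parameters. The numbers below are hypothetical, chosen only to show the effect.

```python
import numpy as np

# Hypothetical component parameters and observed counts for one sample.
alpha_k = np.array([2.0, 1.0, 1.0])
x_i = np.array([30, 5, 0])

# Conjugacy (Equation 8): the posterior over p_i is Dirichlet(alpha_k + x_i).
posterior_alpha = alpha_k + x_i
posterior_mean = posterior_alpha / posterior_alpha.sum()
```

Note that the third taxon, unobserved in this sample, still receives nonzero posterior probability (here $1/39$) from the prior pseudo-counts – this is how the model copes with the sparsity of metagenomic count matrices.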

Marginalising the multinomial parameters

The denominator of Equation 7 is equivalent to $P(\mathbf{x}_i \mid \boldsymbol{\Theta})$, the evidence for community sample $i$. This is obtained by integrating the numerator, i.e. the mixture prior $P(\mathbf{p}_i \mid \boldsymbol{\Theta})$ multiplied by the likelihood $P(\mathbf{x}_i \mid \mathbf{p}_i)$, over all possible community priors. It is the complete probability of observing this data, marginalising out the unseen vector of probabilities $\mathbf{p}_i$. One of the useful properties of the Dirichlet prior is that this evidence has a closed form. So, focussing on just a single mixture component $k$:

$$P(\mathbf{x}_i \mid \boldsymbol{\alpha}_k) = \int P(\mathbf{x}_i \mid \mathbf{p}_i) \, \mathrm{Dir}(\mathbf{p}_i \mid \boldsymbol{\alpha}_k) \, d\mathbf{p}_i$$

$$= \frac{J_i!}{\prod_{j=1}^{S} x_{ij}!} \, \frac{B(\boldsymbol{\alpha}_k + \mathbf{x}_i)}{B(\boldsymbol{\alpha}_k)}$$

where the function $B(\cdot)$ is the multinomial Beta function and can be expressed in terms of Gamma functions as:

$$B(\boldsymbol{\alpha}) = \frac{\prod_{j=1}^{S} \Gamma(\alpha_j)}{\Gamma\!\left(\sum_{j=1}^{S} \alpha_j\right)}$$

So far we have considered the posterior and evidence for just a single community sample $i$. The evidence over all samples is just the product of the evidences for each sample:

$$P(\mathbf{X} \mid \boldsymbol{\Theta}) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, P(\mathbf{x}_i \mid \boldsymbol{\alpha}_k) \qquad (9)$$
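The closed-form per-component evidence is the computational workhorse of the whole framework. A minimal sketch, with the hypothetical helper name `log_evidence_component`, evaluates it on a log scale via `gammaln`:

```python
import numpy as np
from scipy.special import gammaln

def log_evidence_component(x, alpha):
    """Closed-form log P(x | alpha) for one Dirichlet component: the
    multinomial coefficient times B(alpha + x) / B(alpha), where B is the
    multinomial Beta function, all computed on a log scale."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    log_B = lambda a: gammaln(a).sum() - gammaln(a.sum())
    log_coef = gammaln(x.sum() + 1) - gammaln(x + 1).sum()
    return log_coef + log_B(alpha + x) - log_B(alpha)
```

As a sanity check, with a uniform Dirichlet (all $\alpha_j = 1$) over two taxa and counts $(1, 1)$, this returns $\log(1/3)$, matching direct integration of the binomial likelihood over a uniform prior; the evidences of all possible two-read outcomes sum to one.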

EM algorithm for fitting the mixture of Dirichlets prior

Our strategy for fitting the mixture of Dirichlets is to maximise the evidence given the gamma hyperpriors. The strictly Bayesian approach would be to sample from the unobserved hyperparameters $\boldsymbol{\Theta}$ and latent variables $\mathbf{z}_i$, given the hyperpriors, using Markov chain Monte Carlo (MCMC), and then marginalise. This would be computationally challenging for the high-dimensional $\boldsymbol{\alpha}_k$ vectors that are encountered in microbiomics data. Maximising the evidence allows us to obtain a single parameter vector that corresponds to the most likely set of parameters given the gamma hyperpriors. The technique is well established and is known as the ‘evidence framework’ [21], [26]. The posterior distribution of the hyperparameters is given by the product of the evidence (Equation 9) and the hyperprior for the $\alpha_{kj}$ given by Equation 5. Strictly, to distinguish this from the posterior of the multinomial parameters, we should refer to it as the marginal posterior distribution, but our meaning should be clear from the context. We are also implicitly assuming uniform hyperpriors for the other components of $\boldsymbol{\Theta}$, the mixing coefficients $\pi_k$. Maximising the posterior of the hyperparameters is equivalent to maximising the log posterior of the hyperparameters, $\mathcal{L}(\boldsymbol{\Theta})$. Thus:

$$\mathcal{L}(\boldsymbol{\Theta}) = \log P(\mathbf{X} \mid \boldsymbol{\Theta}) + \log P(\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_K)$$

where

$$\log P(\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_K) = \sum_{k=1}^{K} \sum_{j=1}^{S} \log P(\alpha_{kj}) \qquad (10)$$

We now use a binary latent variable matrix $\mathbf{Z}$ with elements $z_{ik}$ that are 1 if the $i$th community sample belongs to the $k$th metacommunity and 0 otherwise. The rows of this matrix are the $\mathbf{z}_i$ vectors introduced above. This allows us to maximise the log posterior distribution using the popular expectation-maximisation (EM) algorithm [21]. Augmenting the data with these latent variables, the evidence and log posterior distribution, respectively, become:

$$P(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\Theta}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left[\pi_k \, P(\mathbf{x}_i \mid \boldsymbol{\alpha}_k)\right]^{z_{ik}}$$

$$\mathcal{L}(\boldsymbol{\Theta}; \mathbf{Z}) = \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \left[\log \pi_k + \log P(\mathbf{x}_i \mid \boldsymbol{\alpha}_k)\right] + \sum_{k=1}^{K} \sum_{j=1}^{S} \log P(\alpha_{kj})$$

Using Jensen's inequality we obtain a lower bound for the expected log posterior distribution:

$$\mathbb{E}_{\mathbf{Z}}[\mathcal{L}] \geq \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{ik}] \left[\log \pi_k + \log P(\mathbf{x}_i \mid \boldsymbol{\alpha}_k)\right] + \sum_{k=1}^{K} \sum_{j=1}^{S} \log P(\alpha_{kj}) \qquad (11)$$

We can calculate the expectations $\mathbb{E}[z_{ik}]$ as follows:

$$\mathbb{E}[z_{ik}] = P(z_{ik} = 1 \mid \mathbf{x}_i, \boldsymbol{\Theta}) = \frac{\pi_k \, P(\mathbf{x}_i \mid \boldsymbol{\alpha}_k)}{\sum_{k'=1}^{K} \pi_{k'} \, P(\mathbf{x}_i \mid \boldsymbol{\alpha}_{k'})} \qquad (12)$$

where we have used Bayes' theorem and $P(z_{ik} = 1 \mid \boldsymbol{\pi}) = \pi_k$.
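In practice Equation 12 must be evaluated on a log scale, since the per-sample evidences are vanishingly small numbers. A minimal sketch, with the hypothetical helper name `responsibilities`:

```python
import numpy as np

def responsibilities(log_ev, log_pi):
    """Equation 12 evaluated stably in log space: log_ev[i, k] holds
    log P(x_i | alpha_k) and log_pi[k] holds log pi_k.  Shifting each row
    by its maximum before exponentiating avoids underflow."""
    log_num = log_ev + log_pi                        # numerator, on a log scale
    log_num = log_num - log_num.max(axis=1, keepdims=True)
    num = np.exp(log_num)
    return num / num.sum(axis=1, keepdims=True)

# Hypothetical log evidences for 2 samples under 2 equally weighted components.
Ez = responsibilities(np.array([[-50.0, -60.0],
                                [-70.0, -55.0]]), np.log([0.5, 0.5]))
```

A 10-nat gap in log evidence is already decisive: each sample here is assigned to its better-fitting component with probability above 0.9999.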

Following Sjölander et al. (1996) [22], we now reparameterise and optimise the expected log posterior distribution with respect to these new parameters: to keep the $\alpha_{kj}$'s positive, we set $\alpha_{kj} = e^{\lambda_{kj}}$, and to keep the $\pi_k$'s normalised, we set $\pi_k = e^{\gamma_k} / \sum_{k'=1}^{K} e^{\gamma_{k'}}$. Optimising $\mathbb{E}_{\mathbf{Z}}[\mathcal{L}]$ with respect to $\gamma_k$ is equivalent to solving the following equation:

$$\frac{\partial \, \mathbb{E}_{\mathbf{Z}}[\mathcal{L}]}{\partial \gamma_k} = \sum_{i=1}^{N} \mathbb{E}[z_{ik}] - \pi_k \sum_{i=1}^{N} \sum_{k'=1}^{K} \mathbb{E}[z_{ik'}] = 0$$

Rearranging this equation we obtain:

$$\pi_k \, N = \sum_{i=1}^{N} \mathbb{E}[z_{ik}]$$

and thus:

$$\pi_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}[z_{ik}] \qquad (13)$$

Our EM algorithm to find $\hat{\boldsymbol{\Theta}}$ thus alternates between updating the responsibilities $\mathbb{E}[z_{ik}]$, the mixing coefficients $\pi_k$ and the Dirichlet parameters $\boldsymbol{\alpha}_k$, $k = 1, \ldots, K$:

  • Calculate $\mathbb{E}[z_{ik}]$ using Equation 12.
  • Update $\boldsymbol{\alpha}_k$ by finding parameters that minimise the negative of Equation 11. In practice we used the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm as implemented in the GNU Scientific Library [29].
  • Calculate $\pi_k$ using Equation 13.
  • Repeat until convergence of $\mathbb{E}_{\mathbf{Z}}[\mathcal{L}]$, which can be calculated from Equation 11.

We will refer to the hyperparameter values obtained by this method as the maximum posterior estimates (MPE).
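The EM loop above can be condensed into a short sketch. This is not the paper's C/GSL implementation: scipy's BFGS stands in for the GSL optimiser, the initialisation strategy (seeding each component from a randomly chosen sample) is an assumption of this sketch, and no convergence monitoring of Equation 11 is shown.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def log_ev(x, alpha):
    """Closed-form log P(x | alpha) for a single Dirichlet component."""
    log_B = lambda a: gammaln(a).sum() - gammaln(a.sum())
    return (gammaln(x.sum() + 1) - gammaln(x + 1).sum()
            + log_B(alpha + x) - log_B(alpha))

def fit_dmm(X, K, eta=0.1, nu=0.1, n_iter=10, seed=0):
    """Sketch of EM for a K-component DMM with Gamma(eta, nu) hyperpriors."""
    rng = np.random.default_rng(seed)
    N, S = X.shape
    # Seed each component's log-parameters from a random sample (symmetry breaking).
    lam = np.log(X[rng.choice(N, size=K, replace=False)] + 1.0)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step (Equation 12): responsibilities, log-sum-exp stabilised.
        LE = np.array([[log_ev(x, np.exp(lam[k])) for k in range(K)] for x in X])
        W = LE + np.log(pi)
        W -= W.max(axis=1, keepdims=True)
        Ez = np.exp(W)
        Ez /= Ez.sum(axis=1, keepdims=True)
        # M step: maximise Equation 11 over each component's lambda = log(alpha)...
        for k in range(K):
            def neg_expected_log_post(lk, k=k):
                a = np.exp(lk)
                lp = sum(Ez[i, k] * log_ev(X[i], a) for i in range(N))
                lp += (eta * lk - nu * a).sum()    # log hyperprior (Equation 6)
                return -lp
            lam[k] = minimize(neg_expected_log_post, lam[k], method="BFGS").x
        # ...then update the mixing weights (Equation 13).
        pi = Ez.mean(axis=0)
    return np.exp(lam), pi, Ez
```

Clustering then follows by assigning each sample to the component with the largest responsibility, `Ez.argmax(axis=1)`.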

Model comparison through Laplace approximation

We need to determine the number of components $K$ in the Dirichlet mixture. We cannot simply choose the one with the largest log posterior, $\mathcal{L}(\hat{\boldsymbol{\Theta}})$, as this takes no account of model complexity: as the number of components is increased, $\mathcal{L}(\hat{\boldsymbol{\Theta}})$ must increase. We could use a heuristic like the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) to penalise the model parameters, but these can give misleading results [21]. Better is to take a fully Bayesian approach to model comparison, where probabilities are used to represent uncertainty in the choice of model. Applying Bayes' theorem, the posterior probability of the $K$-component model $\mathcal{M}_K$ given the data matrix $\mathbf{X}$ is:

$$P(\mathcal{M}_K \mid \mathbf{X}) = \frac{P(\mathbf{X} \mid \mathcal{M}_K) \, P(\mathcal{M}_K)}{P(\mathbf{X})}$$

where $P(\mathcal{M}_K)$ is the prior probability for the $K$-component model, which allows us to express a preference for different models, and $P(\mathbf{X} \mid \mathcal{M}_K)$ is the model evidence, which expresses the preference of the data for different models. In our case, the model evidence is given by:

$$P(\mathbf{X} \mid \mathcal{M}_K) = \int P(\mathbf{X} \mid \boldsymbol{\Theta}) \, P(\boldsymbol{\Theta}) \, d\boldsymbol{\Theta}$$
This integral cannot be calculated analytically, but it can be estimated using the Laplace approximation:

p(X \mid M_K) \approx p(X \mid \hat{\theta}, M_K)\, p(\hat{\theta} \mid M_K)\, (2\pi)^{P/2}\, |H|^{-1/2}    (14)

where P is the number of parameters in M_K, \hat{\theta} are the parameters maximising the posterior distribution, and H is the Hessian matrix of second derivatives of the negative log posterior evaluated at \hat{\theta}:

H = -\nabla \nabla \ln \left[ p(X \mid \theta, M_K)\, p(\theta \mid M_K) \right] \big|_{\theta = \hat{\theta}}    (15)

Thus,

\ln p(X \mid M_K) \approx \ln p(X \mid \hat{\theta}, M_K) + \ln p(\hat{\theta} \mid M_K) + \frac{P}{2} \ln 2\pi - \frac{1}{2} \ln |H|

The nonzero elements of the Hessian matrix follow from differentiating the negative log posterior of Equation 11 twice with respect to the mixture weights and the Dirichlet hyperparameters. In the Results we report the negative of Equation 14, so that a better fit corresponds to a smaller value. The Hessian also allows us to calculate uncertainties in the parameter estimates: the diagonal elements of the inverse Hessian give the variances of the corresponding parameters.
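The Laplace computation in Equation 14 is straightforward once the Hessian is available numerically. A minimal Python sketch, with helper names of our own choosing and the Hessian supplied as a plain nested list:

```python
from math import log, pi

def det(H):
    """Determinant via Gaussian elimination (no pivoting; adequate for
    the well-conditioned positive-definite Hessians arising here)."""
    H = [row[:] for row in H]
    n = len(H)
    d = 1.0
    for i in range(n):
        d *= H[i][i]                      # pivot of the triangularised matrix
        for r in range(i + 1, n):
            f = H[r][i] / H[i][i]
            for c in range(i, n):
                H[r][c] -= f * H[i][c]
    return d

def neg_log_evidence(neg_log_posterior_at_mode, n_params, hessian):
    """Negative of the log of Equation 14: smaller values mean a better
    fit, as reported in the Results."""
    return (neg_log_posterior_at_mode
            - 0.5 * n_params * log(2.0 * pi)
            + 0.5 * log(det(hessian)))

# Toy two-parameter model: negative log posterior 100.0 at the mode and
# a positive-definite Hessian.
print(round(neg_log_evidence(100.0, 2, [[2.0, 1.0], [1.0, 2.0]]), 3))  # → 98.711
```

The inverse of the same Hessian would supply the parameter variances mentioned above.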

Data Sets

Twins

To illustrate the application of these ideas to a real data set we reanalysed a study of the gut microbiomes of twins and their mothers [27]. This comprised faecal samples from 154 different individuals characterised by family and body mass index – ‘Lean’, ‘Obese’ and ‘Overweight’. Each individual was sampled at two time points approximately two months apart. The V2 hypervariable region of the 16S rRNA gene was amplified by PCR and then sequenced using 454. We reanalysed this data set, filtering the reads, denoising and removing chimeras using the AmpliconNoise pipeline [10], [11]. Denoised reads were then classified to the genus level using the RDP stand-alone classifier [5]. This gave a total of 570,851 reads split over 278 samples; of the 308 possible samples, some retained no reads after filtering. Individual sample sizes varied from just 53 to 10,585 reads, with a median of 1,599. A total of 129 different genera were observed, with the number of genera per sample varying from 12 to 50 with a median of 28. One extra category, ‘Unknown’, was used for reads that could not be classified with greater than 50% bootstrap certainty. We will refer to this as the ‘Twins’ data set.

IBD

We also include a brief analysis of microbiome data from a study of inflammatory bowel diseases (IBDs) [28]. This comprised faecal samples from 78 individuals, for which the V5-6 region of the 16S rRNA gene was pyrosequenced using 454. Of these, 35 samples were from healthy individuals, 12 from individuals with colonic Crohn's disease (CCD), 15 from individuals exhibiting ileal Crohn's disease (ICD), and 16 from individuals with ulcerative colitis (UC). We processed the data as above. This gave a total of 134,276 reads, with individual samples varying in size from 394 to 3,258 reads with a median of 1,710. In total 93 genera were observed in these samples, with the number of genera per sample varying from 8 to 33 with a median of 22.

Results

Clustering Twins data at the metacommunity level

The mixture of Dirichlets prior can be used to cluster samples at the metacommunity level. Assuming each sample represents a unique community, we can try to infer which metacommunity that community is most likely to have originated from. This is the component for which the posterior probability of membership is highest, i.e. the component index k that maximises the posterior membership probability for a particular sample i. These posterior probabilities are the equilibrium values of the responsibilities calculated by the EM fitting algorithm.
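The posterior membership probabilities described here can be sketched as follows. This is illustrative Python with names of our own choosing, using each component's Dirichlet-multinomial marginal likelihood (up to a multinomial coefficient shared by all components):

```python
from math import lgamma, log, exp

def log_dm(x, alpha):
    """Dirichlet-multinomial log likelihood of count vector x, up to
    the multinomial coefficient (constant across components)."""
    A, n = sum(alpha), sum(x)
    ll = lgamma(A) - lgamma(n + A)
    for xj, aj in zip(x, alpha):
        ll += lgamma(xj + aj) - lgamma(aj)
    return ll

def responsibilities(x, weights, alphas):
    """Posterior probability that sample x was generated by each
    mixture component, computed in log space for stability."""
    logs = [log(w) + log_dm(x, a) for w, a in zip(weights, alphas)]
    m = max(logs)
    un = [exp(l - m) for l in logs]
    s = sum(un)
    return [u / s for u in un]

# Two toy metacommunities: one dominated by the first genus, one by the
# second.  A sample rich in the first genus is assigned to component 0.
r = responsibilities([30, 3, 2], [0.5, 0.5],
                     [[20.0, 2.0, 2.0], [2.0, 20.0, 2.0]])
print(r[0] > 0.99)  # → True
```

Assigning each sample to its highest-responsibility component gives the hard clustering used in the figures.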

To use the mixture of Dirichlets prior for clustering at the metacommunity level we first need to determine the number of clusters or mixture components K. To do this we fitted Dirichlet mixtures by minimising the negative log posterior as described above. To measure model fit while accounting for complexity we then used the Laplace approximation to the model evidence. We did this for increasing values of K, starting with a single component (K = 1). The results are shown in Figure 1, where we see a minimum at K = 4, suggesting, firstly, that a mixture of Dirichlets is more appropriate than a single Dirichlet prior for this data set and, secondly, that the mixture has four components.

Figure 1
Model fit for mixture of Dirichlets prior to Twins dataset.

The four components differ in their weights and also in how variable their communities are: we have two less abundant, highly variable clusters (1 and 4) and two more abundant, homogeneous clusters (2 and 3). To illustrate this optimal clustering graphically, in Figure 2 we used non-metric multidimensional scaling (NMDS), via the isoMDS function of R [30], to generate two-dimensional positions reflecting the Bray-Curtis distances between each community sample and the mean vectors of the four Dirichlet components. From this the higher variability in the first and fourth clusters is readily apparent. Another striking observation is that communities are not necessarily associated with the closest cluster mean. Partially this may reflect imperfect mapping to the two-dimensional space, but it also likely reflects properly accounting for sampling through the multinomial-Dirichlet structure.
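The Bray-Curtis distances underlying the NMDS plot are simple to compute from abundance vectors; a minimal sketch (our own helper, not the isoMDS function itself):

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors:
    0 for identical communities, 1 for communities sharing no taxa."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den

print(bray_curtis([10, 0, 5], [10, 0, 5]))  # identical communities → 0.0
print(bray_curtis([10, 0, 0], [0, 5, 0]))   # disjoint communities → 1.0
```

A matrix of such pairwise distances is what an NMDS routine embeds into two dimensions.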

Figure 2
NMDS plot of Twins dataset with hierarchical cluster labellings.

To explore the component composition we use the Dirichlet parameter vector obtained by fitting a single-component mixture to the data set as a reference. The precision of this reference takes a value intermediate to those of the four components. We can get a sense of how different the components are by calculating the sum of the posterior mean absolute differences between each component and the reference. This quantity varies between 0% for metacommunities identical to the reference and 200% for those completely dissimilar to it. Calculating this gives 34%, 26%, 51% and 47% for the four components, a total of 158%, indicating substantial differences in community structure between each component and the reference. How the different genera contribute to these differences is shown in Table 1. Comparing the means of the posterior distributions for the four components, we find that 30 out of 131 genera account for over 90% of this difference. Bacteroides alone accounts for 29% of it. This genus is substantially over-represented in the third cluster, comprising nearly 39% of the community; close to the reference, at 23%, in the second cluster; and observed at much lower proportions in the first and fourth clusters, at around 7% and 8% respectively. The next most significantly different category is actually ‘Unknown’, with nearly 15% more sequences failing to be classified with sufficient confidence in the fourth component, and 8% fewer in the third component, than the reference. Faecalibacterium is substantially under-represented in the fourth component, whereas Prevotella is mostly found in the first. The other genera exhibit various patterns, but frequently we see over-representation in one or both of the first and fourth clusters and little representation in the second and third, e.g. Collinsella, Eubacterium, Streptococcus, et cetera.
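The 0-200% dissimilarity measure used here is just the summed absolute difference between two mean community vectors, expressed as a percentage; a minimal sketch with a hypothetical helper name:

```python
def percent_difference(component_mean, reference_mean):
    """Sum of absolute differences between two community mean vectors,
    as a percentage: 0% for identical metacommunities, 200% when the
    two place all their mass on disjoint sets of genera."""
    return 100.0 * sum(abs(a - b)
                       for a, b in zip(component_mean, reference_mean))

print(percent_difference([0.5, 0.5], [0.5, 0.5]))  # → 0.0
print(percent_difference([1.0, 0.0], [0.0, 1.0]))  # → 200.0
```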

Table 1
Genera frequencies in the Twins Clusters.

These patterns are also illustrated graphically in the ‘heat map’ of relative frequencies shown in Figure 3. The relative frequencies of the 30 genera accounting for the most difference between clusters are shown for all the samples. The samples are grouped into the cluster that they had the highest probability of being generated from, as defined above. The cluster means are plotted to the right of the samples mapped to that cluster. Roughly, the two low-variance clusters are dominated by Bacteroides and Faecalibacterium, albeit to a greater extent in the third cluster. The high-variance first and fourth clusters contain a greater variety of genera, with substantially more Prevotella and Faecalibacterium in the first than in the fourth, where no genus really dominates.

Figure 3
Heat map of the Twins data and hierarchical clustering.

Generative classifier for Twins data

The Dirichlet-multinomial framework can also be used for classification. This is a supervised learning approach, as opposed to the unsupervised approach of the previous section. Here we consider the case of binary classes, but any number of classes is a simple extension. Given a training data set of N samples X, we denote class membership with the N-dimensional vector c, with elements c_i that are either 0 or 1. The classification problem is to deduce the class c* of a new sample x*. To do this we associate a separate Dirichlet multinomial mixture model with each class, denoting the hyperparameters of these mixtures by \Theta_0 and \Theta_1, respectively. Then we can marginalise over the multinomial parameters of the sample to be classified so that:

p(c^* = 1 \mid x^*) = \frac{p(x^* \mid \Theta_1)\, p(c = 1)}{p(x^* \mid \Theta_0)\, p(c = 0) + p(x^* \mid \Theta_1)\, p(c = 1)}    (16)

is the probability of the sample belonging to the second class, and p(c^* = 0 \mid x^*) = 1 - p(c^* = 1 \mid x^*). The prior class probabilities are estimated as the observed class frequencies, so that p(c = 1) = N_1/N and p(c = 0) = N_0/N, where N_0 and N_1 are the numbers of training samples in the two classes. The class mixtures themselves are determined just as before, but with the data restricted to the members of each class. We can also determine whether the fit is significant by comparing the sum of the model fits of the two classes with the model fit ignoring the class variables. This is our generative classification scheme.
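Equation 16 can be sketched for the simplest case of a single Dirichlet component per class (the full scheme uses a mixture per class); the names below are our own:

```python
from math import lgamma, log, exp

def log_dm(x, alpha):
    """Dirichlet-multinomial log likelihood of counts x, up to the
    multinomial coefficient (which cancels between classes)."""
    A, n = sum(alpha), sum(x)
    ll = lgamma(A) - lgamma(n + A)
    for xj, aj in zip(x, alpha):
        ll += lgamma(xj + aj) - lgamma(aj)
    return ll

def p_class1(x, alpha0, alpha1, prior1):
    """Posterior probability, as in Equation 16, that sample x belongs
    to class 1, with the class prior taken from training frequencies."""
    l0 = log(1.0 - prior1) + log_dm(x, alpha0)
    l1 = log(prior1) + log_dm(x, alpha1)
    m = max(l0, l1)
    return exp(l1 - m) / (exp(l0 - m) + exp(l1 - m))

# Class 0 communities are rich in genus A, class 1 in genus B; a
# B-dominated sample is confidently assigned to class 1.
print(p_class1([1, 15], [20.0, 2.0], [2.0, 20.0], 0.5) > 0.99)  # → True
```

Replacing each class likelihood by a weighted sum over that class's fitted components recovers the mixture version used in the paper.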

We apply this to the Twins data, denoting individuals with ‘Lean’ BMI by c = 0 and ‘Obese’ by c = 1. We ignore the ‘Overweight’ category to avoid ambiguity. In Figure 4 we replot the NMDS plot of Figure 2 with these class labels. There is no dramatic separation of points according to class. We found that for the Lean (c = 0) class a single-component Dirichlet mixture was optimal, but that for the Obese (c = 1) class three components minimised the Laplace approximation to the model evidence. The means of the three Obese components were quite different, but the posterior mean for the entire prior, sampling from all three components according to their weights (black circle in Figure 4), is close to the single component from the Lean class (black asterisk in Figure 4). In fact, accounting for uncertainty both in the Dirichlet priors and in the sampling from them, only one low-frequency genus, Megasphaera, was significantly differentially represented between classes, having a 97% probability of being more abundant in Obese people. In addition, fitting the two classes separately did not give a significantly better fit than fitting the whole data set (35640 vs. 35385). It is also apparent from comparing Figure 2 and Figure 4 that each of the class components maps onto one of the components from the clustering of the whole data set. This was confirmed by comparing the Bray-Curtis distances between the two sets of mean vectors: the component from the Lean class maps onto the second of the four from the whole data set, and the three components from the Obese class map onto the third, first and fourth, respectively. In summary, it appears that the difference between the Lean and Obese classes lies not at the level of mean community composition, but rather that the Obese individuals contain a greater variety of community structures, including three out of the four components found in the complete data set.

Figure 4
NMDS plot of Twins dataset with class labels.

In a recent evaluation of classification algorithms applied to microbial community data, the random forests algorithm was found to perform best [20], substantially outperforming elastic nets, support vector machines, and multinomial naive Bayes (MNB). Random forests is an example of ensemble learning, in which many classifiers are generated and their predictions aggregated. In particular, it extends the machine learning technique known as bootstrap aggregating, or bagging: the bagging approach constructs decision trees from bootstrap samples of the data and makes class predictions by majority vote. Random forests adds an extra layer of randomness by changing how the decision trees are constructed: instead of splitting each node using the best split amongst all the variables, the best split amongst a random subset of the predictors is used. The algorithm also provides a measure of variable importance, calculated as the increase in prediction error when the data for that variable are permuted. Random forests therefore seemed an appropriate benchmark against which to compare our generative classifier. Following Knights et al. (2011) [20], we implemented the random forests algorithm using the randomForest package in R, though we tuned the parameters of the algorithm (the number of variables in the random subset at each node and the number of trees in the forest) according to the heuristics suggested by Liaw and Wiener (2002) [31].

To compare the two classification methods we performed leave-one-out validation: we removed each sample in turn from the data set, trained the classifier, and classified the held-out data point, assigning it as Obese if the predicted probability was greater than or equal to 0.5. We obtained a slightly lower error rate (fraction of samples misclassified) for the random forests algorithm (18.5%) than for the Dirichlet multinomial generative classifier (22.4%). Examining the ‘confusion matrix’ for each classifier (Table 2), that is, the number of individuals from each true class assigned to the two classes, reveals that the generative classifier has a better distribution of errors across classes. We then generated receiver operating characteristic (ROC) curves for each classifier, shown in Figure 5. These are generated by ordering samples by decreasing likelihood of being Obese: for the generative classifier this is simply the predicted probability of being Obese; for random forests it is the weighted vote. We then lower a threshold from 1.0 to 0.0, with intervals defined by the sample probabilities. All samples with probability greater than or equal to a given threshold are classified as Obese, all other samples as Lean. Based on these classifications, the false positive rate (Lean classified as Obese) and the true positive rate (Obese classified as Obese) are calculated and plotted against each other. This is repeated for all thresholds, summarising the performance of a classifier over all decision thresholds. Both classifiers do substantially better than random, but at lower thresholds random forests outperforms the generative classifier with fewer false positives. A summary statistic is the area under the ROC curve: 85% for random forests and 79% for the Dirichlet-multinomial classifier.
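The area under the ROC curve reported here is equivalent to the probability that a randomly chosen Obese sample is scored above a randomly chosen Lean one, which gives a compact rank-based sketch (helper names are our own):

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by the rank (Mann-Whitney) formulation:
    the fraction of positive/negative pairs ranked correctly, with ties
    counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation gives AUC 1.0; a half-right ordering gives 0.5.
print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0
print(roc_auc([0.9, 0.2, 0.8, 0.3], [1, 1, 0, 0]))  # → 0.5
```

The same scores, swept over thresholds as described above, trace out the full ROC curve.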

Figure 5
Receiver operating characteristic (ROC) curves for the Twins Dirichlet multinomial and random forests classifiers.
Table 2
Confusion matrices for classification of Twins data.

Analysis of IBD phenotypes

We conclude with a brief analysis of the inflammatory bowel disease (IBD) phenotypes. In Figure 6 we show an NMDS plot, generated as described above, with samples coloured according to phenotype. It is apparent that the Healthy (H) individuals, and those exhibiting colonic Crohn's disease (CCD) and ulcerative colitis (UC), have similar, fairly homogeneous community structures, whereas the individuals with ileal Crohn's disease (ICD) show much larger variation in community structure. We can use the DMM model to quantify this: we fitted single-component models to all the samples together, and then to each phenotype separately. The fitted Dirichlet precision values were 15.7 for the whole data set and (H) 22.2, (CCD) 39.4, (ICD) 5.1, (UC) 38.5 for the phenotypes. Remembering that the precision is related to the inverse of the variance, this confirms that the ICD phenotype is associated with an increase in metacommunity variability. We also show the metacommunity means in Figure 6 as crosses: H, CCD and UC have a similar location, whereas the ICD mean is displaced. Exactly how the different genera contribute to the differences in the ICD samples is shown in Table 3 and graphically in Figure 7. The proportions of the Unknown, Bacteroides and Faecalibacterium categories are reduced, whereas numerous other genera, for example Escherichia/Shigella, Sutterella and Prevotella, are increased.
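The inverse relationship between the fitted precision and metacommunity variability can be made concrete: for a Dirichlet with parameters alpha_j, mean m_j = alpha_j / theta and precision theta = sum_j alpha_j, each proportion has variance m_j (1 - m_j) / (theta + 1). An illustrative sketch with our own function name:

```python
def dirichlet_variances(alpha):
    """Variance of each taxon proportion under Dirichlet(alpha):
    m_j * (1 - m_j) / (theta + 1), where theta = sum(alpha) is the
    precision.  Larger theta means tighter, more homogeneous communities."""
    theta = sum(alpha)
    return [(a / theta) * (1.0 - a / theta) / (theta + 1.0) for a in alpha]

# Same mean community, but precision 5.1 (ICD-like) versus 38.5 (UC-like):
low = dirichlet_variances([0.8 * 5.1, 0.2 * 5.1])
high = dirichlet_variances([0.8 * 38.5, 0.2 * 38.5])
print(low[0] > high[0])  # the low-precision component is far more variable
```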

Figure 6
NMDS plot of IBD dataset with class labels.
Figure 7
Heat map of the IBD data divided by phenotype together with phenotype means.
Table 3
Genera frequencies in the IBD phenotypes.

Discussion

We have demonstrated that the Dirichlet multinomial mixture is a powerful framework for the generative modelling of microbial community data. It operates at several levels: it allows read numbers, and hence sampling noise, to be naturally accounted for, and the Dirichlet parameters are easily interpretable in terms of the mean and variance of the communities generated from each component. Used for ‘unsupervised learning’, or clustering, it provides a means to determine clusters of communities or envirotypes, a highly topical problem in the analysis of microbial community data. Since it is a probabilistic model, we can harness rigorous statistical theory to determine how well the data are explained by a given cluster number.

We illustrated this approach with the Twins data set. Using our models, the most probable estimate for the number of envirotypes present in this sample (or ‘enterotypes’, as they are known in the context of gut microbiota samples) is four. Our measure of model fit, the negative logarithm of the approximate model evidence, was 41 less than for the next best cluster number, three. Thus, in the context of our model, the probability that there are four rather than three or five clusters is practically 100%. However, a direct implication of the Bayesian approach is that any point estimate of the number of envirotypes represents a summary (in our case, the mode) of the posterior distribution over the number of clusters. For other data sets the predicted cluster number may be more uncertain. This uncertainty is naturally accommodated by our approach.
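The near-certainty implied by a difference of 41 in negative log evidence follows from normalising the evidences under a uniform prior over K; an illustrative sketch, with made-up evidence values chosen only to differ by 41:

```python
from math import exp

def model_posteriors(neg_log_ev):
    """Posterior probabilities over candidate cluster numbers K from
    their negative log evidences, assuming a uniform prior over K."""
    best = min(neg_log_ev.values())
    w = {k: exp(best - v) for k, v in neg_log_ev.items()}
    s = sum(w.values())
    return {k: v / s for k, v in w.items()}

# K = 4 beats K = 3 by 41 nats of log evidence, as for the Twins data;
# the 1000.0 baseline is arbitrary and cancels in the normalisation.
post = model_posteriors({3: 1041.0, 4: 1000.0})
print(post[4] > 0.999)  # → True
```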

Our analysis, and its statistical implications, may be contrasted with a previous analysis of the same Twins dataset, which used partitioning around medoids (PAM) clustering coupled with the heuristic Calinski-Harabasz (CH) index [18]. The CH approach does not acknowledge that there is inherent uncertainty in the number of clusters, and may thus be misread as offering an unambiguous and definitive assessment of the cluster number. Furthermore, the PAM clustering algorithm does not allow clusters to be of variable spread. This may be why that analysis found three rather than four clusters; the extra flexibility of the DMM model can better represent the true patterns in the data. This, to us, supports the promise of a probabilistic model with the flexibility to model clusters of different size, and of a Bayesian approach to determining the cluster number.

Used for ‘supervised learning’, the Dirichlet multinomial mixture provides an effective classifier. Absolute classification power, as summarised by the area under the ROC curve, is less than for the best performing of previously tested algorithms, random forests. However, using the standard classification threshold of 0.5, it had a better distribution of errors across classes, outperforming random forests on the smaller ‘Lean’ class. In general, we would expect discriminative classifiers, which only model the conditional probability of the class label given the data, to outperform generative models, which fit the actual class distributions. On the other hand, the generative approach allows much easier interpretation of the fitted models, which is often more important than accuracy per se. The fitted Dirichlet parameters describe both the composition of the communities and, critically, the variance in composition associated with the classes. The probabilistic framework that we present also allows the hypothesis that two classes differ in community composition, or equivalently that a discrete experimental treatment significantly impacts community structure, to be rigorously tested.

Generative models provide a framework for both clustering and classification, but their full power derives from the ability to combine the two. We illustrate this for the Twins data. In Table 4 we give the proportion of samples from each BMI category, i.e. Lean, Obese and Overweight, that fell into each of our four enterotypes. For this data set we did not see a significant difference in mean community composition between Lean and Obese individuals. However, it is clear that the two classes do differ significantly in their probabilities of deriving from the clusters. Lean individuals are much less likely than Obese individuals to derive from the first and fourth clusters; they are much more likely to derive from the second and somewhat less likely to derive from the third. This suggests a novel explanation for the differences in taxa frequencies previously reported between Lean and Obese individuals from this data: BMI itself is not correlated with changes in community structure; rather, it influences the likelihood of deriving from each of the four enterotypes.

Table 4
Comparison of BMI and cluster or ‘Enterotype’.

This raises the intriguing possibility that the first and fourth enterotypes may be associated with a disturbed, possibly unhealthy gut microbiota – ‘dysbiosis’. This implies that obesity does not guarantee a disturbed microflora but increases its likelihood. Finally, we return to the observation that the first and fourth enterotypes have a higher variance in community structure than the second and third. We suggest that this is an example of the ‘Anna Karenina principle’ as applied to microbial communities. This principle, popularised by Jared Diamond [32], derives from the first line of Tolstoy's novel: “Happy families are all alike; every unhappy family is unhappy in its own way” [33]. We propose that the same may apply to microbial communities in human health: there are many more configurations associated with dysbiosis than are possible for a healthy community, which is relatively predictable and homogeneous as it requires certain key components. This is not to suggest that the first and fourth enterotypes are associated with higher genus-level diversity within individual samples; the median diversities are not significantly different between the enterotypes. It is the diversity of community compositions that increases. Our observations are therefore also consistent with the conclusion of the original study that the major impact of obesity was a reduction in OTU diversity [27].

This interpretation of the Twins data is obviously speculative and will require further studies with more metadata on host health to corroborate. The analysis of the IBD phenotype data represents a first step in this direction. There we did find a much more variable microbiota associated with one of the disease phenotypes, ileal Crohn's disease, but not with colonic Crohn's disease or ulcerative colitis. This is, therefore, partial support for the AKP. However, it is possible that the latter two diseases are not strongly associated with gut dysbiosis; certainly, at the genus level we were unable to discriminate their community compositions from those of healthy individuals. The number of samples in each of the disease phenotypes was also quite small. We hope that future large-scale sequencing projects will allow us to investigate this question further. The ‘Human Microbiome Project’ is restricted to healthy individuals, but that will allow us to verify the existence of the two enterotypes that we propose are associated with a healthy microbiota [34].

The software for fitting the Dirichlet multinomial mixture is available for download from the Google Code project MicrobeDMM (http://code.google.com/p/microbedmm/).

Acknowledgments

We wish to thank Jose Carlos Clemente, Alan Walker and three anonymous reviewers for comments on an earlier draft of this manuscript, Peter Turnbaugh for providing the Twins data, and Ben Willing, Johan Dicksved and Anders Andersson for providing the IBD data set.

Footnotes

Competing Interests: KH is directly funded through a Unilever research grant to develop bioinformatics tools. All tools developed under this grant are being released open source. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.

Funding: CQ is funded by an EPSRC Career Acceleration Fellowship EP/H003851/1. KH by a Unilever directly funded research grant to the University of Glasgow. IH was supported by NIH/NIGMS grant R01-GM076705. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Streit W, Schmitz R. Metagenomics - the key to the uncultured microbes. Curr Opin Microbiol. 2004;7:492–498. [PubMed]
2. Dorigo U, Volatier L, Humbert JF. Molecular approaches to the assessment of biodiversity in aquatic microbial communities. Water Res. 2005;39:2207–2218. [PubMed]
3. Margulies M, Egholm M, Altman W, Attiya S, Bader J, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380. [PMC free article] [PubMed]
4. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci USA. 2011 e-pub ahead of print doi: 10.1073/pnas.1000080107. [PMC free article] [PubMed]
5. Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73:5261–5267. [PMC free article] [PubMed]
6. Schloss P, Handelsman J. Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol. 2005;71:1501–1506. [PMC free article] [PubMed]
7. Schloss PD, Handelsman J. A statistical toolbox for metagenomics: assessing functional diversity in microbial communities. BMC Bioinf. 2008;9 [PMC free article] [PubMed]
8. Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, et al. Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci USA. 2006;103:12115–12120. [PMC free article] [PubMed]
9. Huber JA, Mark Welch D, Morrison HG, Huse SM, Neal PR, et al. Microbial population structures in the deep marine biosphere. Science. 2007;318:97–100. [PubMed]
10. Quince C, Lanzen A, Curtis TP, Davenport RJ, Hall N, et al. Accurate determination of microbial diversity from 454 pyrosequencing data. Nat Methods. 2009;6:639–641. [PubMed]
11. Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ. Removing noise from pyrosequenced amplicons. BMC Bioinf. 2011;12 [PMC free article] [PubMed]
12. Turnbaugh PJ, Quince C, Faith JJ, McHardy AC, Yatsunenko T, et al. Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins. Proc Natl Acad Sci USA. 2010;107:7503–7508. [PMC free article] [PubMed]
13. Quince C, Curtis TP, Sloan WT. The rational exploration of microbial diversity. ISME J. 2008;2:997–1006. [PubMed]
14. Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R. Error-correcting barcoded primersfor pyrosequencing hundreds of samples in multiplex. Nat Methods. 2008;5:235–237. [PMC free article] [PubMed]
15. Ramette A. Multivariate analyses in microbial ecology. FEMS Microbiol Ecol. 2007;62:142–160.
16. Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005;71:8228–8235.
17. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7:335–336.
18. Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, et al. Enterotypes of the human gut microbiome. Nature. 2011;473:174–180.
19. Sun Y, Cai Y, Mai V, Farmerie W, Yu F, et al. Advanced computational algorithms for microbial community analysis using massive 16S rRNA sequence data. Nucleic Acids Res. 2010;38
20. Knights D, Costello E, Knight R. Supervised classification of human microbiota. FEMS Microbiol Rev. 2011;35:343–359.
21. Bishop CM. Pattern Recognition and Machine Learning. New York: Springer; 2006.
22. Sjolander K, Karplus K, Brown M, Hughey R, Krogh A, et al. Dirichlet mixtures: A method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci. 1996;12:327–345.
23. Ye X, Yu YK, Altschul SF. Compositional adjustment of Dirichlet mixture priors. J Comput Biol. 2010;17:1607–1620.
24. Bouguila N. Count data modeling and classification using finite mixtures of distributions. IEEE Trans Neural Netw. 2011;22:186–198.
25. Knights D, Kuczynski J, Charlson ES, Zaneveld J, Mozer MC, et al. Bayesian community-wide culture-independent microbial source tracking. Nat Methods. 2011;8:761–763.
26. MacKay DJC. Bayesian interpolation. Neural Comput. 1992;4:415–447.
27. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457:480–484.
28. Willing BP, Dicksved J, Halfvarson J, Andersson AF, Lucio M, et al. A pyrosequencing study in twins shows that gastrointestinal microbial profiles vary with inflammatory bowel disease phenotypes. Gastroenterology. 2010;139:1844–1854.
29. Galassi M. GNU Scientific Library Reference Manual. 2009. URL http://www.gnu.org/software/gsl/. ISBN 0-954612-07-8.
30. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. 2010. URL http://www.R-project.org. ISBN 3-900051-07-0.
31. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.
32. Diamond J. Guns, Germs, and Steel. New York, New York: W. W. Norton; 1997.
33. Tolstoy L. Anna Karenina. Moscow, Russia: The Russian Messenger; 1877.
34. Peterson J, Garges S, Giovanni M, McInnes P, Wang L, et al. The NIH Human Microbiome Project. Genome Res. 2009;19:2317–2323.

Articles from PLoS ONE are provided here courtesy of Public Library of Science