PLoS One. 2012; 7(2): e30906.
Published online Feb 17, 2012. doi:  10.1371/journal.pone.0030906
PMCID: PMC3281906

Efficient Exact Maximum a Posteriori Computation for Bayesian SNP Genotyping in Polyploids

Fabio Rapallo, Editor

Abstract

The problem of genotyping polyploids is extremely important for the creation of genetic maps and assembly of complex plant genomes. Despite its significance, polyploid genotyping still remains largely unsolved and suffers from a lack of statistical formality. In this paper a graphical Bayesian model for SNP genotyping data is introduced. This model can infer genotypes even when the ploidy of the population is unknown. We also introduce an algorithm for finding the exact maximum a posteriori genotype configuration with this model. This algorithm is implemented in a freely available web-based software package SuperMASSA. We demonstrate the utility, efficiency, and flexibility of the model and algorithm by applying them to two different platforms, each of which is applied to a polyploid data set: Illumina GoldenGate data from potato and Sequenom MassARRAY data from sugarcane. Our method achieves state-of-the-art performance on both data sets and can be trivially adapted to use models that utilize prior information about any platform or species.

Introduction

Most agriculturally important plant species, such as potato, sugarcane, coffee, cotton and alfalfa, are polyploids. In fact, about half of the natural flowering plant species are polyploids [1]. Despite their importance, our understanding of these species does not fully benefit from marker technology. Molecular markers are widely used for diploid species and can be very useful for building linkage maps [2], finding genomic regions associated with variation in quantitative traits (or QTL) [3], studying the genetic architecture of quantitative traits [4], and assembling genome sequences [5].

Accurate genotyping of polyploids (even for largely uncharacterized species or in cases when the ploidy is unknown) is a missing keystone in genetics that must be solved in order to utilize the approaches that have marked a revolution in biology over the past hundred years. Accurate genotypes are necessary to understand the genetic mechanisms and specific loci that determine phenotypes via QTL mapping and association studies. These genotypes are also necessary for the creation of linkage maps, which are exceedingly useful in developing a greater understanding of genome evolution. These linkage maps will be essential for the assembly of complex polyploid genomes.

The current approach used for several genetic studies on polyploids, especially for linkage mapping, is based on marker loci with only a single copy (simplex) in one of the parents and a nulliplex in the other, in F1 populations obtained from the cross of non-inbred parents. Markers such as AFLP and SSR (i.e. microsatellites) are then scored as presence or absence of bands [6]–[8] and behave like dominant markers. For sugarcane, most available linkage maps are based on markers segregating in 1:1 (single dose in one parent) or 3:1 patterns (single dose in both parents) [9]. Even if complex statistical methods are applied to obtain integrated maps that combine information from markers with both patterns simultaneously [10], [11], the available maps are based on a small sample of the genome, since markers with higher doses are normally not included; therefore, they are neither well saturated nor informative for genome assembly [12]. For QTL studies in sugarcane, the situation is similar. Statistical models developed for backcrosses are used for simplex × nulliplex configurations with available software that was developed for diploids [13]. Since the ploidy level could be related to gene expression [14], these approaches need to be modified to incorporate allele dosage using more efficient marker systems.

Nowadays, new technologies such as Illumina GoldenGate™ [15] and Sequenom iPLEX MassARRAY® [16] allow researchers to generate high-throughput genotyping data from SNPs. These data usually contain two signals for each SNP locus, each one corresponding to an intensity recorded for one of the two possible alleles. The expected value of each signal intensity is proportional to the corresponding allele dosage [16], [17], and therefore SNPs are the marker of choice for genetic studies in polyploids. They are more informative than presence/absence markers, and should allow a better coverage of the genome and the development of more realistic models for linkage studies, QTL and association mapping, among other applications.

In order to explore the full potential of such technologies, a first required step is the development of statistical methods for SNP genotype calling, i.e. inferring the (discrete) genotype of each individual for each locus, identifying the number of copies of each allele. For diploids, including humans, a number of methods are already available [18]. This is not the case for polyploids. Methods for polyploid genotyping need to be able to deal not only with multiple copies of the alleles, but also with some complex problems such as aneuploidy and unknown ploidy, which can be present for some species.

Voorrips et al. [19] presented an approach based on mixture models for genotype calling in autotetraploids, similar to the approach used by [20] for diploids. Based on the (transformed) allele signal ratio (ratio of one signal peak to the total), they fitted a mixture of five normal distributions, each one corresponding to one genotype class (from zero to four copies of the allele). They compared several models and were also able to test for Hardy-Weinberg equilibrium in a potato panel with 224 tetraploid potato varieties. Their model could be expanded to include more classes in the mixture in order to be useful for other autopolyploids; however, in certain situations the ploidy (and hence the number of classes) is unknown and needs to be estimated. Also, crosses with distinct ploidies and parents may result in similar segregation patterns, making the selection of the best model a complicated task. This is the case for sugarcane, which is a very complex polyploid and aneuploid species. Genotype calling in sugarcane is extremely difficult, especially if commercial varieties are used, since they are interspecific hybrids between domesticated and wild relatives [21].

Here we present a graphical Bayesian model for SNP genotype calling. Our graphical Bayesian method can infer genotypes even when the ploidy of the population is unknown. At the core of Bayesian thinking is the notion of modeling processes forwards rather than trying to model their inverse. Generally, a great deal of prior knowledge is available regarding the way any process behaves running forwards; when the process is modeled generatively (i.e. running forwards), this prior knowledge can be exploited to improve the fidelity with which it describes the process. In graphical models, prior knowledge regarding independence and conditional independence of variables can be visualized in the structure of the graph. The highly connected subunits of the graph can be considered with modularity; that is, a subunit can be easily interchanged with another. This modularity is what allows our model and inference procedure to work with populations in Hardy-Weinberg equilibrium, the progeny of an F1 cross, or any population with a known theoretical distribution of genotypes. There are many other ways that our model, and similarly motivated models, can be easily changed and improved because of their modularity and generality.

We also introduce an algorithm for finding the exact maximum a posteriori (MAP) genotype configuration with this model. This algorithm is implemented in a freely available software package named SuperMASSA. We demonstrate the utility, efficiency, and flexibility of the model and algorithm by applying them to data from two polyploids processed with two different platforms: potato [19] using the Illumina GoldenGate™ assay [15] and sugarcane using Sequenom iPLEX MassARRAY® [16].

Materials and Methods

Data

Potato

An autotetraploid potato collection was used, comprising 384 SNPs scored in a panel of 224 individuals using the Illumina GoldenGate™ assay, as described in [22] and [19]. This data set is distributed along with the free R package fitTetra [23], under the GNU General Public License. To exemplify the results obtained using the mixture model, [19] chose three loci: PotSNP016, PotSNP034 (Figure 1) and PotSNP192. However, for locus PotSNP192, they noted that the Illumina GoldenGate assay produced significantly different signal strengths for the alleles, resulting in skewed clusters. Thus, the intensity ratio between those alleles cannot be easily used to infer genotypes. Since our model assumes the signal strength of each allele is proportional to the dosage (and that the proportionality constant for both alleles is similar), we used only PotSNP016 and PotSNP034 to exemplify our method. For this data set, we use the same model of the genotype distribution as [19] (i.e. Hardy-Weinberg). These two SNPs were also scored in 64 diploid potato varieties that were used for a visual check of the goodness of fit; we also analyze these diploid individuals using PotSNP016 and PotSNP034. Moreover, since we know the ploidy for both the diploid and tetraploid potatoes, we can check whether the ploidy estimated by our model matches the actual one.

Figure 1
Raw data.

Sugarcane

A sugarcane mapping population derived from a cross between two commercial varieties (IACSP 95-3018 × IACSP 93-3046) was used. It comprised 180 individuals scored for 241 SNPs using the Sequenom iPLEX MassARRAY® technology [16]. This assay is based on allele-specific primer extension with a mass-modified terminator [24]. The DNA products of this reaction are analyzed by a MALDI-TOF mass spectrometer and each polymorphic region of interest is detected by the mass of the allele-specific primer [25]. Both parents were also scored 12 times for each SNP. If the ionization efficiency is similar for both alleles, the intensities produced by mass spectrometry are proportional to abundance (with a very similar proportionality constant if run in the same sample prep); therefore, if the amplification of both alleles is similar, the skew is minimal. We observe much less skew in the sugarcane data set compared to the potato data set.

Modern sugarcane varieties have highly polyploid and aneuploid genomes, with ploidy levels ranging from 5 to 16 [26], [27]. Therefore, unless there is strong cytological information for a marker, it is important to also estimate the ploidy. Since we want to test our model and do not have a reference point for sugarcane (such as the known diploid or tetraploid potato varieties), and also because sugarcane meiosis frequently results in deviations from the expected Mendelian segregation ratios [26]–[28], we used a blind method to curate the data and evaluate SuperMASSA.

First, all sugarcane loci were curated by eye using several criteria. For each locus, an expert looked at raw scatter plots as shown in Figure 1 and assessed the following: i) the overall quality; ii) the number of clusters; and iii) the expected ploidy level based on parental data. This resulted in 27 SNPs that were easily classified by eye. SuperMASSA was used to predict the ploidy and number of clusters for each of these 27 loci and three of them (the three judged to be of the highest quality) are used to show the results of our model.

It is important to note that in this blind validation experiment, SuperMASSA was not used to curate the data and the model behind SuperMASSA was not changed after observing and curating the data.

Probabilistic Graphical Model

We use a Bayesian approach to model the probability of the observed data given the ploidy and all genotypes. By modeling the generative process (i.e. the process by which the data is produced assuming we know the ploidy and genotypes of all individuals), we can build the model from realistic assumptions for the data. Using the model, we then perform inference (described in the Probabilistic Inference section) to effectively enumerate all possible ploidies and genotypes for individuals in the population, and choose the configuration that maximizes the posterior probability of the model. This configuration is known as the maximum a posteriori (MAP) configuration and is guaranteed to have the highest possible posterior probability.

In Figure 2 we present two probabilistic directed graphical models of the SNP genotyping process for a single locus: a Hardy-Weinberg model and an F1 model. These models represent dependencies using directed edges. Both models share similar motivation and notation; the few differences arise from different models of the distribution of genotypes in the population. We first present the shared model components and then present the details specific to each model.

Figure 2
A Graphical View of SNP Genotyping.

Hardy-Weinberg and F1 Model Similarities

For both models, the “genotype configuration” [formula e022] is the collection of genotype assignments for all individuals in the data set. Because the ploidy, denoted [formula e023], determines the possible set of genotype outcomes, the genotype configuration depends on the ploidy [formula e024]. Denote the set of possible genotype outcomes for a given ploidy as [formula e025]. For example, for a diploid locus [formula e026] and the set of possible genotypes is [formula e027]. Both models use a uniform prior on the ploidy [formula e028]; it should be noted that for the data we analyzed, the influence of any weak priors is negligible because of a pronounced drop in suboptimal posteriors relative to the MAP configuration.
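As an illustration of how the ploidy fixes the set of genotype outcomes, the following minimal sketch enumerates the outcomes for a biallelic SNP. It assumes that a genotype is represented as an ordered pair (dosage of the first allele, dosage of the second allele) whose entries sum to the ploidy; the function name `possible_genotypes` is hypothetical and is not taken from SuperMASSA.

```python
def possible_genotypes(ploidy):
    """Enumerate biallelic genotypes as (dosage of allele 1, dosage of allele 2).

    Assumption: the two dosages are non-negative integers summing to the ploidy,
    so a ploidy of P yields P + 1 genotype classes.
    """
    return [(k, ploidy - k) for k in range(ploidy + 1)]


print(possible_genotypes(2))       # three classes for a diploid locus
print(len(possible_genotypes(4)))  # five classes for a tetraploid locus
```

Under this representation a tetraploid locus has five classes, which agrees with the five-component mixture used in [19].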

The observed data [formula e029] is composed of a collection of data points [formula e030], each of which comprises an [formula e031] intensity pair and an individual [formula e032] that gave the sample producing the [formula e033] pair. We assume that each data point depends only on the individual that produced it; therefore, the likelihood of any genotype configuration [formula e034] can be written as a product over individuals:

equation image

For some [formula e036], we model the likelihood proportional to [formula e037] using a normal distribution with unknown standard deviation [formula e038]:

equation image

where the operator [formula e040] is used to perform [formula e041] normalization on [formula e042] and [formula e043]. This likelihood effectively uses the expected angle of each genotype and penalizes individuals that deviate from the expected angle of their assigned genotype. For this reason, “skewed” data, where the intensities measured for allele 1 and allele 2 use very different constants of proportionality with their respective dosages, cannot be modeled without including a latent variable for the skew. The standard deviation is given a uniform prior, and inference over it is performed in a manner similar to inference over the ploidy.
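The idea of comparing a normalized intensity pair with the normalized dosages of a candidate genotype can be sketched as follows. This is only an illustration under stated assumptions: the normalization is taken to be division by the sum of the two entries, the deviation is taken to be the Euclidean distance between the two normalized pairs, and the density is written up to an additive constant as a one-dimensional normal in that distance. None of these choices is confirmed by the text above, and the function names are hypothetical.

```python
import math


def sum_normalize(pair):
    """Scale a non-negative pair so that its entries sum to one (assumed normalization)."""
    total = pair[0] + pair[1]
    return (pair[0] / total, pair[1] / total)


def log_likelihood(xy, genotype, sigma):
    """Log of a normal density (up to an additive constant) for one data point.

    xy       -- observed (allele 1, allele 2) intensity pair
    genotype -- candidate (dosage 1, dosage 2) pair
    sigma    -- standard deviation, treated as an unknown that is searched over
    """
    nx, ny = sum_normalize(xy)
    gx, gy = sum_normalize(genotype)
    dist_sq = (nx - gx) ** 2 + (ny - gy) ** 2
    return -dist_sq / (2.0 * sigma ** 2) - math.log(sigma)


# A point with equal intensities fits the balanced genotype better than a homozygote.
balanced = log_likelihood((2.0, 2.0), (1, 1), sigma=0.1)
homozygous = log_likelihood((2.0, 2.0), (0, 2), sigma=0.1)
print(balanced > homozygous)
```

Because only the direction of the intensity pair enters the comparison, a locus whose two alleles have different proportionality constants would shift every cluster away from its expected angle, which is the skew problem described above.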

For any genotype configuration [formula e044], both models also compute [formula e045], the distribution of possible genotypes. [formula e046], where [formula e047] equals [formula e048], the number of individuals assigned to genotype [formula e049]. The probability of any distribution [formula e050] is modeled using the theoretical distribution of genotypes [formula e051]. Given the theoretical genotype frequencies for the population [formula e052] where [formula e053], the probability of observing any genotype distribution [formula e054] is multinomial:

equation image
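A multinomial probability of this kind can be evaluated in log space to avoid overflow for realistic population sizes. The sketch below assumes `counts` holds the number of individuals assigned to each genotype class and `freqs` holds the matching theoretical frequencies; both names, and the function name `log_multinomial`, are hypothetical.

```python
import math


def log_multinomial(counts, freqs):
    """Log probability of genotype counts under a multinomial with given frequencies."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)  # log(n!)
    for t, p in zip(counts, freqs):
        log_p -= math.lgamma(t + 1)  # subtract log(t!)
        if t > 0:
            if p == 0.0:
                return float("-inf")  # a class with zero frequency cannot be observed
            log_p += t * math.log(p)
    return log_p


# Four individuals split 1/2/1 across three classes with frequencies 1/4, 1/2, 1/4.
print(math.exp(log_multinomial([1, 2, 1], [0.25, 0.5, 0.25])))
```

Working with `math.lgamma` rather than explicit factorials keeps the computation stable for populations with hundreds of individuals.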

Both the Hardy-Weinberg and F1 models allow for individuals with replicate data points. If all individuals have the same number of replicate data points, then the MAP configuration is guaranteed to be found (as shown in Supplement S1).

Hardy-Weinberg Model

Figure 2A depicts the dependencies of the Hardy-Weinberg model. In the Hardy-Weinberg model, the theoretical distribution of genotypes is modeled using a binomial distribution. Given [formula e057], the allele frequency of the first allele (in the ordered pair), the probability of any genotype [formula e058] is [formula e059]. The parameter [formula e060] is modeled using a uniform prior. To perform grid search, we discretize [formula e061] into the range [formula e062] with a resolution of [formula e063].
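The binomial genotype frequencies of the Hardy-Weinberg model can be sketched as follows. The sketch assumes that genotype classes are indexed by the dosage of the first allele, from zero up to the ploidy, and that `p` denotes the frequency of that allele; the function name is hypothetical, and the grid range and resolution used by the authors are not reproduced here.

```python
from math import comb


def hardy_weinberg_frequencies(ploidy, p):
    """Binomial genotype frequencies, indexed by the dosage k of the first allele."""
    return [comb(ploidy, k) * p ** k * (1 - p) ** (ploidy - k)
            for k in range(ploidy + 1)]


print(hardy_weinberg_frequencies(2, 0.5))       # diploid, equal allele frequencies
print(sum(hardy_weinberg_frequencies(4, 0.3)))  # frequencies sum to one (up to rounding)
```

In a grid search, such a frequency vector would be recomputed for each candidate allele frequency and passed to the multinomial term of the model.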

F1 Model

Figure 2B depicts the dependencies of the F1 model. In the F1 model, the theoretical distribution of genotypes is modeled using hypergeometric distributions for the gametes (it is important to note that any model could be trivially applied instead). Denote [formula e067] to be the dosage for the first allele in the ordered pair and [formula e068] to be the dosage of the second allele in the pair. Given parents [formula e069] and [formula e070], both of which have values in [formula e071], the probability of observing gamete [formula e072] from [formula e073] (without loss of generality) is

equation image

Therefore, the probability of observing offspring [formula e075] is

equation image
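The construction of offspring frequencies from hypergeometric gametes can be sketched as below. The sketch assumes an even ploidy, gametes that carry exactly half of the parental chromosomes sampled without replacement, and no double reduction; genotypes are indexed by the dosage of the first allele. The function names are hypothetical and the code is not taken from SuperMASSA.

```python
from math import comb


def gamete_distribution(parent_dosage, ploidy):
    """Hypergeometric probability that a gamete carries k copies of the first allele."""
    half = ploidy // 2
    total = comb(ploidy, half)
    return [comb(parent_dosage, k) * comb(ploidy - parent_dosage, half - k) / total
            for k in range(half + 1)]


def offspring_distribution(dosage_1, dosage_2, ploidy):
    """Offspring dosage frequencies obtained by combining one gamete from each parent."""
    g1 = gamete_distribution(dosage_1, ploidy)
    g2 = gamete_distribution(dosage_2, ploidy)
    out = [0.0] * (ploidy + 1)
    for a, pa in enumerate(g1):
        for b, pb in enumerate(g2):
            out[a + b] += pa * pb
    return out


print(offspring_distribution(1, 0, 4))  # tetraploid simplex x nulliplex
print(offspring_distribution(1, 1, 4))  # tetraploid simplex x simplex
```

For a tetraploid simplex × nulliplex cross the offspring split evenly between dosages zero and one, and for simplex × simplex three quarters of the offspring carry at least one copy, matching the 1:1 and 3:1 presence/absence patterns discussed in the Introduction.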

In the F1 model, the parent genotypes [formula e078] and [formula e079] depend on the ploidy since the outcomes of both must be in [formula e080]. We model the prior probability as uniform for the number of unique outcomes: [formula e081]

In Figure 2B dashed nodes and arrows represent variables and dependencies that exist only when data from the parents is included. The probability of these parameters can be modeled as conditionally independent, just like [formula e082]:

equation image

When parental data is used, the parents are distinct and so the number of unique parental combinations becomes [formula e084]; therefore, when parental data is available, the prior probability on parental configurations becomes uniform over these [formula e085] distinguishable outcomes.

Generalized Population Model

The inference procedure described does not make any special use of the type of parameters that determine [formula e086]; therefore, given the parameters [formula e087] that determine [formula e088] (and do not depend on [formula e089], [formula e090], or [formula e091]), our inference method will find the MAP genotype configuration. This illustrates that both the Hardy-Weinberg and F1 models are specific instances of a general model (where [formula e093] and [formula e094], respectively). [formula e095] is searched in a similar manner, but since we use a uniform prior, we search all parameter configurations for a given [formula e096] and omit [formula e097] from [formula e098] for simplicity (this strategy also allows us to cache the table of likelihoods for a given [formula e099]). When parental data is included in the F1 model, it can be modeled by setting the prior probability (that is, the probability including available parent data but excluding data from progeny) to

equation image

We define the “generalized population model” as the model defined using [formula e102]. For each [formula e103] we will compute the MAP genotype configuration [formula e104]; using the prior probability of [formula e105], we can enumerate the possible outcomes of [formula e106] and compute both the genotype configuration and parameters [formula e107] that jointly maximize the posterior probability for these parameters. Using this approach we can also approximate [formula e108], the posterior belief that the MAP parameter and genotype configuration is correct.
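One way to picture the outer enumeration is the sketch below, which is an assumption-laden illustration rather than the authors' procedure. It supposes that, for every enumerated parameter setting, the log joint probability of the best genotype configuration is already available, and it approximates the posterior belief in the winner by normalizing those best scores over the enumerated settings. The dictionary keys and the function name are hypothetical.

```python
import math


def map_and_confidence(log_joint_by_params):
    """Pick the best parameter setting and approximate its posterior belief.

    log_joint_by_params maps each enumerated parameter setting to the log joint
    probability of its best genotype configuration (an assumed input).
    """
    best = max(log_joint_by_params, key=log_joint_by_params.get)
    top = log_joint_by_params[best]
    # Log-sum-exp with the maximum factored out for numerical stability.
    log_total = top + math.log(sum(math.exp(v - top)
                                   for v in log_joint_by_params.values()))
    return best, math.exp(top - log_total)


scores = {"ploidy=2": math.log(1.0), "ploidy=4": math.log(3.0)}
print(map_and_confidence(scores))
```

Subtracting the maximum before exponentiating matters in practice, since the log joint probabilities for hundreds of individuals are large negative numbers.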

Identifiability

Before inference is performed, it is necessary to demonstrate that the parameters [formula e109] can be inferred with a sufficient amount of data (i.e. they are “identifiable”). By the law of large numbers, the densities of the genotypes and allele intensities converge to the density expected from the parameters [formula e110] as [formula e111]; therefore, with enough individuals, the exact distribution of genotypes and allele intensities is known. In order to prove that the parameters are identifiable, we must demonstrate that [formula e112] can be computed from this density [formula e113] (i.e. that [formula e114] is one-to-one). It is sufficient here to prove that no two non-identical pairs of parameters [formula e115] can yield the same density.

By assumption, our model considers data which is a weighted sum of Gaussians (one for each genotype), each with a mean [formula e116] at the expected slope for the two allele intensities. Algebraically, for two densities to be equal, the two equivalent sums of shifted Gaussians, each of the form [formula e117], must use identical sets [formula e118] (when [formula e119]). Furthermore, the corresponding weights [formula e120] must be equal for Gaussians shifted by the same [formula e121]. Together, these statements require that identical densities must be created by sets of parameters with identical angles [formula e122] for all possible genotypes ([formula e123]). This requires that the dosage-to-ploidy ratio be equal for each possible genotype.

If this set of [formula e124] contains more than one possible genotype, then the difference between the two dosages increases for the larger ploidy (because the ploidy, the denominator in both slopes, has increased, but the slopes remain constant). Because these dosages are necessarily integers, the difference must increase by at least one, indicating a new genotype class with expected slope between the other two. Therefore, to have the same set of [formula e125], the larger ploidy has a possible genotype class that is not possible with the smaller ploidy. Thus, the larger ploidy must assign a weight [formula e126] to that new genotype class.

However, both models considered (Hardy-Weinberg and F1) create unimodal (or flat) weight distributions. For this reason, they cannot create sequential weights that are nonzero, zero and then nonzero again. Furthermore, given the ploidy, the weights (or expected frequencies) are sufficient to estimate [formula e128]. Therefore, if more than one possible genotype exists, the parameters are identifiable (the lowest ploidy that could produce the desired angles is the only one possible). When only one possible genotype exists, the ploidy cannot be estimated (it could be any multiple of a ploidy that produces the correct angle). In this case, we use an Occam's razor approach by placing a decreasing prior on the ploidy [formula e129].
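The role of the intermediate genotype class in the argument above can be illustrated with exact fractions. The sketch assumes that the expected angle of a genotype is determined by its dosage-to-ploidy ratio; the function name `dosage_ratios` is hypothetical.

```python
from fractions import Fraction


def dosage_ratios(ploidy):
    """Set of dosage-to-ploidy ratios attainable at a given ploidy."""
    return {Fraction(k, ploidy) for k in range(ploidy + 1)}


diploid = dosage_ratios(2)
tetraploid = dosage_ratios(4)

print(diploid <= tetraploid)         # every diploid ratio is also a tetraploid ratio
print(sorted(tetraploid - diploid))  # ratios that appear only at the larger ploidy
```

Every ratio available to a diploid is also available to a tetraploid, but the tetraploid gains classes that fall between the diploid ones. A unimodal weight distribution cannot assign zero weight to such an in-between class while giving nonzero weight to its neighbors, which is why the smallest ploidy consistent with the observed angles is the only candidate.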

Probabilistic Inference

In order to perform inference on the generalized population model described in the Probabilistic Graphical Model section, we introduce three approaches: a greedy approach (maximum likelihood), an exact approach (MAP) via dynamic programming, and a substantially more efficient exact approach (also MAP). For all inference methods, assume [formula e130] is known. The best greedy genotype configuration and [formula e131] can be chosen by enumerating all outcomes of [formula e132] and selecting the one with highest posterior.

Graphically, it is easy to see why MAP inference is difficult. Consider [formula e133], a single bin in the distribution [formula e134]; it has incoming edges from all individuals' genotypes [formula e135]. Thus, in the moral graph (in which all nodes with a common successor are joined by an undirected edge), an edge joins each pair of nodes [formula e136], resulting in a clique of size [formula e137]. The treewidth [29] of a graph containing an [formula e138]-clique is at least [formula e139], so standard inference methods (e.g. naive enumeration or junction tree inference [30], [31]) will require a number of steps at least exponential in [formula e140]; for problems of the size we consider ([formula e141]), a runtime exponential in [formula e142] is infeasible.
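The moralization argument can be illustrated with a small sketch. The node names (`G1`..`G4` for genotype nodes, `T` for a single distribution bin) and the helper `moral_graph_edges` are hypothetical and chosen only for this illustration; the sketch simply "marries" all parents of a common child and shows that every pair of genotype nodes becomes connected.

```python
from itertools import combinations

def moral_graph_edges(parents_of):
    """Return the undirected edges of the moral graph.

    parents_of maps each child node to the list of its parents in the
    directed model. Each parent is linked to its child, and all parents
    sharing a child are linked to each other.
    """
    edges = set()
    for child, parents in parents_of.items():
        for p in parents:
            edges.add(frozenset((p, child)))      # original edge, direction dropped
        for a, b in combinations(parents, 2):
            edges.add(frozenset((a, b)))          # "marry" co-parents
    return edges

# Hypothetical example: four genotype nodes feeding one distribution bin.
genotypes = ["G1", "G2", "G3", "G4"]
edges = moral_graph_edges({"T": genotypes})
all_pairs_joined = all(frozenset(pair) in edges for pair in combinations(genotypes, 2))
print(len(edges), all_pairs_joined)
```

Because every pair of genotype nodes is joined, the genotype nodes alone already form a clique whose size grows with the number of individuals, which is what drives the exponential cost of generic exact inference.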

Greedy Inference

Rather than jointly consider all genotype assignments, the greedy approach approximates [formula e143] by using maximum likelihood estimation. The likelihood considers only [formula e144]. Because of conditional independence of data given the genotype configuration, the maximum likelihood genotype configuration is defined as:

equation image

The greedy estimate computes the most likely genotype of each [formula e146] independently, effectively ignoring their combinatorial joint dependencies.
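A minimal sketch of the greedy, per-individual step is given below. It makes several assumptions that are not taken from the text: intensities are normalized onto the line x + y = 1, the theoretical point for dosage d at a given ploidy is (d/ploidy, 1 - d/ploidy), and the likelihood is a Gaussian-style function of the distance with a spread parameter `sigma`. The function names are hypothetical.

```python
def theoretical_points(ploidy):
    """Assumed theoretical genotype points for dosages 0..ploidy on x + y = 1."""
    return [(d / ploidy, 1 - d / ploidy) for d in range(ploidy + 1)]

def log_likelihood(point, genotype_point, sigma):
    """Assumed Gaussian-style log-likelihood (up to a constant) on squared distance."""
    dist2 = (point[0] - genotype_point[0]) ** 2 + (point[1] - genotype_point[1]) ** 2
    return -dist2 / (2 * sigma ** 2)

def greedy_genotypes(data, ploidy, sigma):
    """Pick, for each individual independently, the dosage with the highest likelihood.

    data is a list of (x, y) intensity pairs with positive total intensity.
    """
    targets = theoretical_points(ploidy)
    normalized = [(x / (x + y), y / (x + y)) for x, y in data]
    return [
        max(range(ploidy + 1), key=lambda g: log_likelihood(p, targets[g], sigma))
        for p in normalized
    ]

# Hypothetical intensities for four individuals, tetraploid model.
data = [(0.1, 0.9), (1.0, 1.0), (3.0, 1.0), (2.0, 0.0)]
print(greedy_genotypes(data, ploidy=4, sigma=0.16))
```

Each individual is scored in isolation, so nothing in this sketch accounts for how plausible the resulting genotype counts are under the population model; that is the dependency the exact methods address.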

For each [formula e147], the maximum likelihood genotype configuration [formula e148] can be evaluated by computing the joint probability with the data. Denote the distribution resulting from a given genotype configuration [formula e149] as [formula e150]. Then the joint probability given [formula e151] can be written as follows:

equation image
(1)
equation image
(2)
equation image
(3)

Using equation 3, the configuration with the highest joint posterior

equation image

can be found by enumerating outcomes of [formula e156].

Exact Inference

The combinatorial dependencies between genotypes in different individuals must be recognized in order to compute the MAP genotype configuration. It is tempting to approximate these dependencies with a mixture model. A mixture model approach treats all [formula e157] as independent draws from the distribution [formula e158]; however, a mixture model rewards configurations assigning all individuals the most probable genotype in [formula e159]. In reality, such a configuration is extremely improbable because there is only one series of genotype assignments that results in this outcome. On the other hand, if [formula e160] is chosen so that not all individuals are assigned the most probable genotype in [formula e161], the multinomial probability may be larger because there are many genotype configurations that could lead to [formula e162] (compared to the single configuration that yields the most probable genotypes). Modeling this dependency between all individuals, although computationally challenging, is extremely important.
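The counting argument can be made concrete with a small numeric sketch. The weights `[0.25, 0.5, 0.25]` and the population size of four are hypothetical values chosen for illustration; the functions compute the standard multinomial coefficient and multinomial probability.

```python
from math import factorial, prod

def configurations_per_distribution(counts):
    """Number of distinct genotype configurations that yield these per-genotype counts."""
    return factorial(sum(counts)) // prod(factorial(c) for c in counts)

def multinomial_probability(counts, weights):
    """Multinomial probability of observing the counts given genotype weights."""
    return configurations_per_distribution(counts) * prod(
        w ** c for w, c in zip(weights, counts)
    )

weights = [0.25, 0.5, 0.25]          # hypothetical expected genotype frequencies
all_modal = [0, 4, 0]                # every individual gets the most probable genotype
spread = [1, 2, 1]                   # counts proportional to the weights

print(configurations_per_distribution(all_modal), multinomial_probability(all_modal, weights))
print(configurations_per_distribution(spread), multinomial_probability(spread, weights))
```

In this toy case the all-modal outcome arises from a single configuration, whereas the spread outcome arises from twelve, so its multinomial probability is larger even though each individual assignment is no more probable.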

In the simplest approach, all possible genotype configurations can be enumerated naively in exponential time, resulting in the tree shown in Figure 3A. Although enumerating the entire tree is infeasible, it may be possible to ignore subtrees that cannot lead to an optimum, substantially reducing the search space.

Figure 3
Illustration of Exact Inference.

Consider individuals in an arbitrary order with some genotypes assigned: let [formula e168] denote [formula e169] for [formula e170] and [formula e171] denote the unassigned genotypes [formula e172]. We refer to the assigned genotypes [formula e173] as a “prefix” genotype configuration and the unknown [formula e174] as a “suffix”. Given a prefix genotype configuration, it is possible to bound the joint probability of all configurations with this prefix by bounding the likelihood for the remaining configurations:

equation image
equation image
equation image
equation image
equation image
equation image
equation image
(4)

Given a genotype and parameter configuration [formula e182], any configuration including the prefix satisfying the following inequality is suboptimal:

equation image
equation image

The prefixes correspond to paths from the top of the tree in Figure 3A; prefixes that are shown to be suboptimal can be “bound,” meaning that they are not branched and searched further down. The second product may be cached for all [formula e185] for a speedup of [formula e186]. It is worth noting that this second product must be included, because the likelihood constant on [formula e187] is unknown and so we cannot guarantee that [formula e188]. With all of the branch and bound approaches, the initial values [formula e189] can be computed using the greedy maximum likelihood approach and then improved as more probable configurations are found.

A more sophisticated dynamic programming approach (shown in Figure 3B) merges nodes of equal depth that produce identical distribution prefixes

equation image

and the number of individuals with each genotype in the genotype prefix. Because [formula e191], if two prefixes [formula e192] produce the same distribution prefixes [formula e193], the suffixes satisfying [formula e194] are the same as the suffixes satisfying [formula e195]. For this reason, other than the prefix likelihoods [formula e196] and [formula e197], all other values in equation 4 will be the same; therefore, all prefixes producing the same prefix distribution can be grouped together, using the greatest prefix likelihood and corresponding prefix path. These grouped nodes can be added in batches for each depth to produce a “layer”; by induction, the best path to each node in a layer includes the best path to the nodes in the layer above. The same bound from the naive tree is used, but subproblems that are identical are grouped and solved together to avoid redundant computation and storage.

Efficient Exact Inference

There are a number of reasons that the naive and dynamic programming branch and bound methods are inefficient. First, the number of nodes visited in these trees may be as much as [formula e198] and [formula e199], both of which are exponential in [formula e200]. This number of nodes determines the time and (if implemented in a manner that emphasizes runtime efficiency) the space required. Secondly, the suffix path is unconstrained; given [formula e201], there is no restriction on [formula e202], and so the bound must use the maximum likelihood for the remaining [formula e203] likelihood. Most importantly, the bound in equation 4 is very conservative; in order to bound a subtree with prefix [formula e204], the overall likelihood of all subsequent trees must be less than the product of the overall likelihood and multinomial multiplier [formula e205] for a full configuration [formula e206]. Because even the largest multinomial probability [formula e207] is usually very small, the bound is extremely conservative. It is not feasible to use either the naive or dynamic programming branch and bound methods on the presented data.

For these reasons, we introduce a novel geometric branch and bound method; this method has several advantages. First, when the number of individuals is substantially larger than the ploidy ([formula e208]), the worst-case tree produced by our method is several orders of magnitude smaller ([formula e209] rather than [formula e210]). Secondly, our geometric method allows us to substantially constrain valid suffix configurations. Lastly, our method makes use of the multinomial probability in the bound; this multinomial probability is very influential in selecting the optimum (especially when the optimal [formula e211] is not very close to zero). Our geometric method has these advantages because it exploits a geometric property that MAP configurations must exhibit. By searching only configurations with this property, our method dramatically reduces the possible search space.
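To give a sense of scale for the difference between searching genotype configurations and searching genotype distributions, the sketch below counts both for a small case. It uses standard combinatorial counts (each of `n` individuals choosing one of `ploidy + 1` dosages, versus the number of ways to split `n` individuals among `ploidy + 1` dosage classes). These counts are offered as an illustration only and are not necessarily the exact worst-case tree sizes stated above; the names `n` and `ploidy` are assumptions.

```python
from math import comb

def num_configurations(n, ploidy):
    """Ways to assign one of (ploidy + 1) dosages to each of n labelled individuals."""
    return (ploidy + 1) ** n

def num_distributions(n, ploidy):
    """Ways to split n individuals into (ploidy + 1) dosage counts (stars and bars)."""
    return comb(n + ploidy, ploidy)

for n, ploidy in [(10, 2), (100, 4)]:
    print(n, ploidy, num_configurations(n, ploidy), num_distributions(n, ploidy))
```

The first count grows exponentially in the number of individuals, while the second grows only polynomially in the number of individuals for a fixed ploidy, which is consistent with the motivation for searching over distributions.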

To present our branch and bound method, we first rephrase the problem in a geometric context and then derive a geometric property of optimal configurations (Figure 4). In the likelihood [formula e212], both the data [formula e213] and the theoretical genotypes [formula e214] are normalized so that [formula e215] and [formula e216]. This likelihood is therefore equivalent to [formula e217]. This normalization effectively places the points along the line [formula e218]. For all [formula e219] and [formula e220], define the operator [formula e221] to order them using their normalized values along the line [formula e222] (the direction of the ordering is arbitrary). Similarly, for all [formula e223] and [formula e224] define the distance [formula e225] to operate on normalized values of the points on this line. It should be noted that other methods of normalization (e.g. normalizing on a unit circle) will also enable ordering the points in this way and are compatible with this method.

Figure 4
Illustration of a Suboptimal Genotype Configuration.

Fix the genotype distribution [formula e232]. In the joint probability [formula e233] in equation 4, [formula e234] is a constant multiplied by all genotype configurations for which [formula e235]. Thus the optimal genotype configuration producing this [formula e236] is the one that maximizes the likelihood [formula e237]. Consider two genotype configurations [formula e238] and [formula e239] that result in identical genotype distributions [formula e240]. If these configurations are identical except two individuals' genotype assignments, then one configuration must swap the genotype assignments of these individuals (or else the distribution [formula e241] would change). Let these individuals' indices be denoted [formula e242] and [formula e243] and the possible genotypes be denoted [formula e244] and [formula e245]. If

equation image

then [formula e247] and [formula e248]. We prove (see Supplement S1) that genotype configurations that do not form contiguous genotype blocks along the line [formula e249] always contain two genotypes that can be swapped to decrease the distance and increase the likelihood; therefore, the optimal genotype configuration consistent with [formula e250] (which cannot be improved without changing [formula e251]) must contain only contiguous blocks of genotype assignments along the line [formula e252].

This approach lets us find the optimal genotype configuration for a given [formula e253] in [formula e254] steps by sorting (the sorted order of individuals can be cached and won't vary with the parameters [formula e255] or [formula e256]). We prove that, for this reason, the optimal genotype configuration can be found by searching possible genotype distributions [formula e257] and for each [formula e258] choosing the optimal genotype configuration.
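The contiguous-block property can be checked on a toy example. The sketch below assumes each individual is reduced to a single normalized coordinate, that the theoretical genotype coordinates in `targets` increase with dosage, and that maximizing the likelihood is equivalent to minimizing the total squared distance; these are assumptions made for illustration, and the function names are hypothetical. It compares the sorted, contiguous assignment against brute-force enumeration of every assignment with the same counts.

```python
from itertools import permutations

def contiguous_assignment(positions, counts):
    """Assign genotypes as contiguous blocks along the sorted order of positions."""
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    labels = [g for g, c in enumerate(counts) for _ in range(c)]
    assignment = [None] * len(positions)
    for i, g in zip(order, labels):
        assignment[i] = g
    return assignment

def total_sq_distance(positions, assignment, targets):
    """Total squared distance between individuals and their assigned genotype points."""
    return sum((positions[i] - targets[g]) ** 2 for i, g in enumerate(assignment))

def brute_force_best(positions, counts, targets):
    """Exhaustively search all assignments with the given counts (tiny inputs only)."""
    labels = [g for g, c in enumerate(counts) for _ in range(c)]
    return min(
        set(permutations(labels)),
        key=lambda a: total_sq_distance(positions, a, targets),
    )

positions = [0.9, 0.1, 0.55, 0.45, 0.05]   # hypothetical normalized coordinates
counts = [2, 2, 1]                          # fixed genotype distribution
targets = [0.0, 0.5, 1.0]                   # assumed theoretical coordinates per dosage

fast = contiguous_assignment(positions, counts)
slow = brute_force_best(positions, counts, targets)
print(fast, total_sq_distance(positions, fast, targets))
print(list(slow), total_sq_distance(positions, slow, targets))
```

Once the individuals are sorted, the best configuration for any fixed set of counts is read off directly, so the search only needs to range over the counts themselves.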

Given a prefix distribution [formula e259], the best genotype configuration prefix [formula e260] can likewise be trivially found using the sorted order of individuals. Our approach extends a previous method that performs search on the cardinality of sets rather than on the sets themselves [32], generalizing it to the multinomial distribution rather than a single count. Furthermore, the joint probability of the best genotype configuration consistent with the prefix distribution is bounded above by the product of the multinomial bound, the prefix likelihood, and the best remaining suffix likelihood (a more thorough proof is shown in Supplement S1):

equation image
equation image
equation image
(5)

Using this formula, branch and bound can be performed on the tree composed of the search space for the distribution [formula e264]; unlike the naive tree and the dynamic programming graph, the tree of all possible distributions has a significantly smaller depth of [formula e265], rather than [formula e266]. Furthermore, performing branch and bound on this tree is significantly more efficient and can utilize information from a prefix [formula e267] (e.g. using the multinomial and restricting the suffix genotype configurations) to establish a much tighter bound. This method lets us efficiently find the exact MAP [formula e268] for any [formula e269] and the overall MAP [formula e270].

Approximating the Posterior Probability of the MAP Configuration

Given an initial guess at the MAP configuration [formula e271] (from the greedy search), it is possible to simultaneously compute the MAP configuration [formula e272] and also approximate the posterior probability of [formula e273]. This posterior probability is of great practical utility because it indicates the reliability of the results by quantifying how much better the MAP configuration is compared to all other configurations. In order to approximate the posterior of the MAP, we make two assumptions: first, most of the joint distribution's mass is in the neighborhood of the MAP, and second, the posterior distribution of configurations in these neighborhoods behaves similarly for different values of [formula e274]. Using these two assumptions, we can approximate the marginal probability as proportional to the joint probability of the MAP:

equation image
equation image

where the constant of proportionality is similar for [formula e277] and [formula e278].

Therefore, the posterior of a configuration can be approximated:

equation image
(6)
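One reading of this approximation is sketched below: for each parameter setting, take the joint probability of the best configuration found for that setting, then normalize across settings. This is an interpretation under the two stated assumptions rather than a transcription of equation 6; the dictionary keys and the function name `approximate_posteriors` are hypothetical. The computation is done in log space to avoid underflow.

```python
import math

def approximate_posteriors(log_joint_by_param):
    """Normalize per-parameter best joint probabilities into approximate posteriors.

    log_joint_by_param maps a parameter setting to the log joint probability
    of the best genotype configuration found for that setting.
    """
    m = max(log_joint_by_param.values())                     # shift for numerical stability
    total = sum(math.exp(v - m) for v in log_joint_by_param.values())
    return {k: math.exp(v - m) / total for k, v in log_joint_by_param.items()}

# Hypothetical log joint probabilities for two parameter settings.
log_joints = {"setting_a": math.log(0.03), "setting_b": math.log(0.01)}
print(approximate_posteriors(log_joints))
```

Settings whose best joint probability is negligible relative to the largest one contribute almost nothing to the normalizing sum, which is why subtrees with very low joint probability can be skipped at a small, controllable cost in accuracy.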

Denote the greedy genotype configuration for [formula e280] as [formula e281] and the greedy configuration with highest posterior [formula e282]. During the branch and bound, it is possible to bound only configurations with joint probabilities so low that omitting them cannot significantly influence the denominator, and hence the overall value of equation 6. To the best of our knowledge, this is the first application of branch and bound to numerical marginalization; in our approach the maximum absolute posterior error (provided as a parameter) determines how conservative the approach must be to bound subtrees when estimating the posterior of the MAP configuration.

Rather than bound any distribution prefix for which all joint probabilities are provably inferior to [formula e283], we bound only distribution prefixes that are substantially inferior. For some [formula e284], we bound configurations when [formula e285] (where the maximum is conservatively estimated using the upper bound from equation 5). Larger values of [formula e286] permit more aggressive bounding and smaller values bound more conservatively. We demonstrate (see Supplement S1) that the greatest absolute posterior error [formula e287] is bounded by the product of [formula e288] and the total number of parameter configurations queried (not including the MAP): [formula e289]. Given [formula e290], the minimum allowed [formula e291] can be found as [formula e292].

Approximating Posterior Probabilities for Each Genotype Assignment

It is important to distinguish the configuration posterior (which we approximate above) from posterior estimates that each individual is assigned the correct genotype. SuperMASSA, our implementation of the proposed efficient geometric inference method, also approximates the posteriors for each individual by using the relative likelihood between the MAP genotype and the other possible genotypes for that individual. The user is allowed to set a threshold for this value, and only the individuals with a likelihood ratio exceeding this threshold will be reported (in both figures and output genotype assignments). This approach formalizes heuristics that filter out data points with a total intensity [formula e293] less than some threshold.

Furthermore, it is possible to extend our approach to compute exact posteriors for each genotype assignment. The space searched by branch and bound would be much more complex; however, the MAP genotype configuration computed above would provide the most efficient possible bound. When the MAP has a substantial portion of the probability mass, nearly every subtree will be bounded, resulting in a very efficient runtime.

Results

Runtime Improvement with Geometric Branch and Bound

The improved runtime of our geometric branch and bound method relative to the dynamic programming method is a nontrivial change; it makes exact MAP computation feasible where it was not before. In Figure 5, we demonstrate the relationship between the ploidy [formula e294], the parameter [formula e295] and the runtime of these methods using the SugSNP225 locus. Not only is the geometric method substantially more efficient for more difficult problems (over 100 times faster in some instances), the gap between the two methods grows nonlinearly (as shown by the increasing gap on the log-scale runtimes). Furthermore, the amount of memory used by the dynamic programming method is prohibitively large; in both cases, the dynamic programming runtime series is terminated early for using more than 3 GB of RAM. Most importantly, the dynamic programming time and memory requirements prohibit analysis using the optimal parameters. The optimal ploidy for this locus is 10 and the optimal [formula e296] value is 0.16; it is infeasible to run the dynamic programming method for any ploidy greater than four (when [formula e297] is at its optimal value 0.16) and for any [formula e298] greater than 0.03 when the ploidy is at its optimal value of 10. For this reason, the dynamic programming method could not practically be applied to this data set.

Figure 5
Runtime of Exact MAP Computation with Dynamic Programming and Geometric Branch and Bound.

Inference Results from Potato and Sugarcane Data

For all loci investigated, Table 1 shows the ploidy and number of clusters predicted by both the expert and SuperMASSA. The application of our method provided very good results for the SNPs evaluated, both for potato (diploid and tetraploid) and sugarcane. For potato, SuperMASSA was able to find the correct ploidy level and number of clusters in all cases. For sugarcane, the ploidy level was the same for 21 SNPs. For the remaining loci, SuperMASSA predicted similar ploidies for four (differences from 10 to 8 in SugSNP004, 12 to 14 in SugSNP013, and 8 to 6 in SugSNP186 and SugSNP204) and incorrect ploidies for two (10 to 14 in SugSNP060 and 6 to 14 in SugSNP114). It is important to note that the curated result is not sacrosanct; the exact answer is not known, since the ploidy level is unknown for sugarcane. The number of clusters for sugarcane was the same for 24 SNPs, with only small differences in the remaining loci. Interestingly, this happened only for loci with different results for ploidy level as well.

Table 1
SuperMASSA Results on Potato and Sugarcane Loci.

Further investigation into the loci where the expert and SuperMASSA disagree revealed that the distributions resulting from the ploidies set by the expert were quite divergent from the theoretical distributions expected for any possible sets of parents. The expert did not analyze these distributions when curating the data, because it was prohibitively time-consuming: the number of possible parents for the considered ploidy range (two to 16) totals 444; enumerating all sets of parents for the 241 considered sugarcane loci would have resulted in 107,004 figures requiring manual analysis.

SuperMASSA Output from Selected Potato and Sugarcane Loci

SuperMASSA was run on two potato loci (from both the diploid and tetraploid individuals) and on sugarcane loci using the same parameters. The ploidy range searched was 2 to 16 (only even ploidies were searched) and the [formula e310] range searched was [formula e311]. For the sugarcane data, peak heights were used as the measure of intensity (SuperMASSA has the option of using the peak areas for MassARRAY data). Figures reported were generated automatically without manual editing using the –save_figures option. Parental data (consisting of 12 replicates of each parent) was used for sugarcane loci (results were very similar without using this data).

Figure 6 shows the output from SuperMASSA on potato loci from the diploids and tetraploids. For the diploid potato used as reference by [19], it is easy to see that the results strongly agree with what is expected. First, the observed and estimated proportions of individuals in each class of the distribution are very close to each other. Second, there are 3 clusters corresponding to alleles with 0, 1 or 2 copies. It is also possible to see that there is no skew in the clusters around the expected angles for each cluster ([formula e312], [formula e313] and [formula e314]). It is important to note that the method was able to deal with clusters containing few individuals. More importantly, the ploidy level was correctly estimated as two. In individuals of the tetraploid potato variety, the results also indicated that the proposed method works well. The estimated ploidy was four, there are five clusters, and the expected and observed proportions under HWE are quite similar. Little skew from the expected angles was observed.

Figure 6
SuperMASSA Output on Potato Loci.

Figure 7 shows the output from SuperMASSA on three sugarcane loci. For each of these loci, there is strong agreement between the expected and observed number of individuals in each cluster for an [formula pone.0030906.e316]. There is no evidence of skew in the annotated scatter plots, and individuals were correctly allocated to clusters close to the expected angles for the given ploidy and the estimated parental dosages. Furthermore, the expected angles of these estimated parent dosages closely matched the angles seen in the scatter plot of parent genotype data. The ploidy level was correctly estimated, based on what is expected from eye-curation: 12 for SugSNP122, 10 for SugSNP201 and 10 for SugSNP225. The allele dosages in the parents were estimated as simplex × nulliplex, simplex × simplex and triplex × nulliplex, respectively.
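To make the link between parental dosages and expected cluster sizes concrete, the following sketch computes the expected progeny dosage distribution for a cross between two parents of the same even ploidy. It assumes polysomic inheritance with random chromosome segregation and no double reduction, so that each gamete is a hypergeometric draw of half the parent's chromosomes; this is a simplified textbook model, and the function names are hypothetical rather than taken from SuperMASSA.

```python
from math import comb

def gamete_distribution(ploidy, dosage):
    """P(gamete carries k allele copies), k = 0..ploidy/2.

    Assumption: ploidy/2 of the parent's `ploidy` chromosomes are drawn
    without replacement (hypergeometric), with no double reduction.
    """
    half = ploidy // 2
    total = comb(ploidy, half)
    return [comb(dosage, k) * comb(ploidy - dosage, half - k) / total
            for k in range(half + 1)]

def progeny_distribution(ploidy, dosage_a, dosage_b):
    """Expected proportion of progeny in each dosage class 0..ploidy."""
    gametes_a = gamete_distribution(ploidy, dosage_a)
    gametes_b = gamete_distribution(ploidy, dosage_b)
    progeny = [0.0] * (ploidy + 1)
    # Progeny dosage is the sum of the two gamete dosages (a convolution).
    for i, prob_a in enumerate(gametes_a):
        for j, prob_b in enumerate(gametes_b):
            progeny[i + j] += prob_a * prob_b
    return progeny

# Simplex (1 copy) crossed with nulliplex (0 copies) at ploidy 12.
simplex_by_nulliplex = progeny_distribution(12, 1, 0)
# Simplex crossed with simplex, and triplex with nulliplex, at ploidy 10.
simplex_by_simplex = progeny_distribution(10, 1, 1)
triplex_by_nulliplex = progeny_distribution(10, 3, 0)
```

Under these assumptions a simplex parent crossed with a nulliplex parent yields only two progeny classes in equal proportion, whereas a triplex parent crossed with a nulliplex parent yields four classes, which is why the number and relative size of clusters are informative about parental dosage.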

Figure 7
SuperMASSA Output on Sugarcane Loci.

Discussion

The results presented were possible only because our novel approach to inference substantially reduced the search space and permitted much greater utilization of available information (e.g. prior knowledge about rare genotype frequencies) in the branch and bound. We present a geometric interpretation of how our procedure reparameterizes and decreases the size of the search space; however, the key mathematical concept that allowed us to discover the geometric property of optima was an exploitation of symmetry. In general, it is possible to condition on outcomes of nodes in a graphical model that perform associative operations (in this instance counting), even though these nodes depend jointly on the state of all predecessor nodes. This is possible by effectively collapsing predecessor configurations that lead to the same outcome. In state-of-the-art software packages for graphical models [33], this type of symmetry may not be exploited to its full potential, and so for our problem, the best available exact approach would have had a worst-case runtime exponential in the number of individuals. In the future, these special types of dependencies could be identified automatically; it is possible that this type of symmetry is hidden in myriad other problems and could be exploited.
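The effect of collapsing configurations that lead to the same count can be illustrated by a small counting exercise. This sketch is not the branch and bound procedure itself; it only compares the number of individual-level genotype assignments with the number of distinct genotype count vectors (histograms) they collapse into. The function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import comb

def count_assignments(n_individuals, ploidy):
    """Each individual independently takes one of ploidy + 1 dosages."""
    return (ploidy + 1) ** n_individuals

def count_histograms(n_individuals, ploidy):
    """Ways to split n individuals among ploidy + 1 dosage classes
    (stars and bars), ignoring which individual is in which class."""
    return comb(n_individuals + ploidy, ploidy)

def collapse(n_individuals, ploidy):
    """Brute force check: group every assignment by its count vector."""
    groups = Counter()
    for assignment in product(range(ploidy + 1), repeat=n_individuals):
        histogram = tuple(assignment.count(d) for d in range(ploidy + 1))
        groups[histogram] += 1
    return groups

# Small case that can be enumerated exhaustively.
groups = collapse(4, 2)
print(len(groups), "count vectors cover", sum(groups.values()), "assignments")

# Larger case: the assignment count grows exponentially in the number of
# individuals, the histogram count only polynomially for a fixed ploidy.
print(count_histograms(200, 4), "versus", count_assignments(200, 4))
```

For a fixed ploidy the number of count vectors grows polynomially in the number of individuals, which gives some intuition for why conditioning on counts rather than on individual configurations avoids the exponential worst case.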

One straightforward generalization that could be made to our model would use a latent variable to represent the skew of each locus. A prior probability on the skew with a unique mode at zero (no skew) would choose a skewed solution only if it was superior to all solutions with a skew of zero. Performing inference using a discretization of this latent variable would simply multiply the runtime of our method by a constant. This improvement, though simple, would be quite useful for fluorescence-based genotyping assays, which are sometimes prone to distortion in the relative intensities of each allele.

It is important to note that the method that we present is not exclusively for polyploids; instead, it is a generalized method that is applicable to any ploidy. This is especially important since our method generalizes independent mixture models so that the genotypes of individuals are considered and assigned in concert rather than one at a time. Because of their simple and modular nature, both our model and the inference procedures could be trivially inserted into existing methods. Perhaps even more importantly, the mathematical inference problem we solve is nearly identical to important inference problems proposed for the analysis of copy number variation; the platforms on which we tested our method are of great importance for identifying copy number variants. Our method (or components of the model or inference algorithm) could be applied to the relative ratio intensities (due to copy number rather than ploidy) described in [16].

Our approach undoubtedly simplifies the model of meioses in polyploids. However, even when the assumptions of our meiotic model are violated, the anomalous or seemingly contradictory results (e.g. parents with a ploidy different from some or all progeny in an [formula pone.0030906.e320]) are extremely informative. By using a simple available model of meioses in polyploids, our approach will facilitate the discovery of loci with these anomalous behaviors; identifying and studying examples that violate a simple meiotic model is crucial for furthering our understanding of, and developing more accurate models of, meiosis in polyploids. A greater understanding of these processes will not only benefit the study of polyploids; it will also add insight into the processes involved in cell biology.

Availability

Our software SuperMASSA is implemented in Python and freely available as an online application at http://statgen.esalq.usp.br/SuperMASSA. The data from the sugarcane loci analyzed are also available at this URL. The potato data analyzed is available in [23].

Supporting Information

Supplement S1

Proof that both the dynamic programming branch and bound and geometric branch and bound find the MAP genotype configuration.

(PDF)

Acknowledgments

We would sincerely like to thank Anete P. Souza and Thiago G. Marconi for making their sugarcane data available to us.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This research was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (grants 2008/52197-4 and 2008/54402-4). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Hieter P, Griffiths T. Polyploidy–More Is More or Less. Science. 1999;285:210–211.
2. Lander ES, Green P. Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA. 1987;84:2363–2367.
3. Lander ES, Botstein D. Mapping Mendelian Factors Underlying Quantitative Traits Using RFLP Linkage Maps. Genetics. 1989;121:185–199.
4. Zeng ZB, Kao C, Basten CJ. Estimating the genetic architecture of quantitative traits. Genetical Research. 1999;74:279–289.
5. Lewin HA, Larkin DM, Pontius J. Every genome sequence needs a good map. Genome Research. 2009;19:1925–1928.
6. Wu KK, Burnquist W, Sorrells ME, Tew TL, Moore PH, et al. The detection and estimation of linkage in polyploids using single-dose restriction fragments. TAG Theoretical and Applied Genetics. 1992;83:294–300.
7. Ripol MI, Churchill GA, Silva JAGD, Sorrells M. Statistical aspects of genetic mapping in autopolyploids. Gene. 1999;235:31–41.
8. Baker P, Jackson P, Aitken K. Bayesian estimation of marker dosage in sugarcane and other autopolyploids. Theor Appl Genet. 2010;120:1653–1672.
9. Alwala S, Kimbeng CA. 2010. 272 Molecular Genetic Linkage Mapping in Saccharum: Strategies, Resources and Achievements, CRC Press, Science Publishers, chapter 5. 1 edition.
10. Garcia AAF, Kido EA, Meza AN, Souza HMB, Pinto LR, et al. Development of an integrated genetic map of a sugarcane (Saccharum spp.) commercial cross, based on a maximum-likelihood approach for estimation of linkage and linkage phases. Theor Appl Genet. 2006;112:298–314.
11. Oliveira KM, Pinto LR, Marconi TG, Margarido GRA, Pastina MM, et al. Functional integrated genetic linkage map based on EST-markers for a sugarcane (Saccharum spp.) commercial cross. Molecular Breeding. 2007;20:189–208.
12. Wang J, Roe B, Macmil S, Yu Q, Murray JE, et al. Microcollinearity between autopolyploid sugarcane and diploid sorghum genomes. 2010. doi: 10.1186/1471-2164-11-261.
13. Pastina MM, Pinto LR, Oliveira KM, Souza AP, Garcia AAF. 2010. 272 Molecular Mapping of Complex Traits, CRC Press, chapter 7. 1 edition.
14. Galitski T, Saldanha AJ, Styles CA, Lander ES, Fink GR. Ploidy Regulation of Gene Expression. Science. 1999;285:251–254.
15. Fan JB, Oliphant A, Shen R, Kermani BG, Garcia F, et al. Highly parallel SNP genotyping. Cold Spring Harbor Symposia on Quantitative Biology. 2003;68:69–78.
16. Oeth P, de Mistro G, Marnellos G, Shi T, van den Boom D. 2009. pp. 307–343. Single Nucleotide Polymorphisms, Humana Press, chapter "Qualitative and Quantitative Genotyping Using Single Base Primer Extension Coupled with Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (MassARRAY)".
17. Akhunov E, Nicolet C, Dvorak J. Single nucleotide polymorphism genotyping in polyploid wheat with the Illumina GoldenGate assay. Theor Appl Genet. 2009;119:507–517.
18. Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics. 2011;12:443–451.
19. Voorrips RE, Gort G, Vosman B. Genotype calling in tetraploid species from bi-allelic marker data using mixture models. BMC Bioinformatics. 2011;12:172.
20. Fujisawa H, Eguchi S, Ushijima M, Miyata S, Miki Y, et al. Genotyping of single nucleotide polymorphism using model-based clustering. Bioinformatics (Oxford, England). 2004;20:718–726.
21. Grivet L, D'Hont A, Roques D, Feldmann P, Lanaud C, et al. RFLP Mapping in Cultivated Sugarcane (Saccharum spp.): Genome Organization in a Highly Polyploid and Aneuploid Interspecific Hybrid. Genetics. 1995;142:987–1000.
22. Anithakumari AM, Tang J, van Eck HJ, Visser RG, Leunissen JA, et al. A pipeline for high throughput detection and mapping of SNPs from EST databases. Molecular Breeding. 2010;26:65–75.
23. Voorrips R, Gort G. fitTetra: fitTetra is an R package for assigning tetraploid genotype scores. 2011. R package version 1.0.
24. Sequenom. 2007. Typer 4.0 manual.
25. Storm N, Darnhofer-Patel B, van den Boom D, CP R. 2003. pp. 241–262. Single Nucleotide Polymorphisms - Methods and Protocols, Humana Press, chapter "MALDI-TOF Mass Spectrometry-Based SNP Genotyping".
26. Grivet L, Arruda P. Sugarcane genomics: depicting the complex genome of an important tropical crop. Current Opinion in Plant Biology. 2001;5:122–127.
27. Jannoo N, Grivet L, David J, D'Hont A, Glaszmann JC. Differential chromosome pairing affinities at meiosis in polyploid sugarcane revealed by molecular markers. Heredity. 2004;93:460–467.
28. Singh RJ. Plant Cytogenetics. CRC Press, 2nd edition; 2002. 521
29. Arnborg S, Corneil DG, Proskurowski A. Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic and Discrete Methods. 1987;8:277–284.
30. Robertson N, Seymour PD. Graph minors. III. Planar tree-width. Journal of Combinatorial Theory, Series B. 1984;36:49–64.
31. Andersen SK, Olesen KG, Jensen FV. HUGIN, a shell for building Bayesian belief universes for expert systems. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc; 1990. pp. 332–337.
32. Serang O, MacCoss MJ, Noble WS. Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data. Journal of Proteome Research. 2010;9:5346–5357.
33. Bilmes J, Zweig G. The Graphical Models Toolkit: An open source software system for speech and time-series processing. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. 2002.

Articles from PLoS ONE are provided here courtesy of Public Library of Science