Copyright © 2009 The Author(s)

Array-based genotyping in S.cerevisiae using semi-supervised clustering

1 EMBL - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK and 2 European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
*To whom correspondence should be addressed.
Associate Editor: Martin Bishop
Received December 28, 2008; Revised February 1, 2009; Accepted February 17, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Motivation: Microarrays provide an accurate and cost-effective method for genotyping large numbers of individuals at high resolution. The resulting data permit the identification of loci at which genetic variation is associated with quantitative traits, or fine mapping of meiotic recombination, which is a key determinant of genetic diversity among individuals. Several issues inherent to short oligonucleotide arrays, such as cross-hybridization or variability in probe response to target, have the potential to produce genotyping errors. There is a need for improved statistical methods for array-based genotyping.

Results: We developed ssGenotyping (ssG), a multivariate, semi-supervised approach for using microarrays to genotype haploid individuals at thousands of polymorphic sites. Using a meiotic recombination dataset, we show that ssG is more accurate than existing supervised classification methods, and that it produces denser marker coverage. The ssG algorithm is able to fit probe-specific affinity differences and to detect and filter spurious signal, permitting high-confidence genotyping at nucleotide resolution.
We also demonstrate that oligonucleotide probe response depends significantly on genomic background, even when the probe's specific target sequence is unchanged. As a result, supervised classifiers trained on reference strains may not generalize well to diverged strains; ssG's semi-supervised approach, on the other hand, adapts automatically.

Availability: The ssGenotyping software is implemented in R. It is currently available for download (www.ebi.ac.uk/~bourgon/yeast_genotyping/ssG) and is being submitted to Bioconductor.

Contact: bourgon/at/ebi.ac.uk

Supplementary information: Supplementary data and a version including color figures are available at Bioinformatics online.

1 INTRODUCTION

During meiosis, homologous copies of the chromosomes align, and the repair of programmed double-stranded breaks in the DNA leads to recombination: the reciprocal exchange of DNA between homologs (crossovers), or the non-reciprocal modification of one homolog, using the other as a template (non-crossover gene conversion). As a consequence, the genome of each meiotic product, or ‘segregant’, is a mosaic of the two parental genotypes (Fig. 1).
Oligonucleotide microarrays provide an accurate and cost-effective means of identifying and genotyping polymorphic loci. Oligonucleotide microarray probes hybridize more efficiently to targets whose sequence is exactly complementary than to targets which only partially or imperfectly match the probes. Winzeler et al. (1998) used this fact to identify several thousand polymorphic positions in the same two yeast strains we consider here. Since then, numerous authors have made use of these so-called ‘single feature polymorphisms’ (SFPs), both in yeast (Brem et al., 2002; Deutschbauer and Davis, 2005; Gresham et al., 2006; Steinmetz et al., 2002; Winzeler et al., 2003) and in other organisms (Albert et al., 2005; Borevitz et al., 2003; Rostoks et al., 2005; Turner et al., 2005). With the exception of Brem et al. (2002), these authors have taken a supervised approach to the problem, training a genotyping classifier on samples of known genotype and then applying the classifier to new samples.

Winzeler et al. (1998) hybridized parental genomic DNA from each of the two strains to standard yeast expression arrays. Then, after preprocessing, analysis of variance (ANOVA) was used to identify probes whose observed log-scale fluorescence intensities appeared to be better fit by a model with two means than by a model with one. Such probes were deemed to be SFPs. To genotype segregants from a cross, a posterior probability was computed using the estimated Gaussian densities from the parental-array ANOVA, plus a uniform prior on the two genotypes:

$$P(Y = g \mid X = x) = \frac{\phi(x; \hat{\mu}_g, \hat{\sigma}^2)}{\phi(x; \hat{\mu}_1, \hat{\sigma}^2) + \phi(x; \hat{\mu}_2, \hat{\sigma}^2)}, \quad g \in \{1, 2\}, \qquad (1)$$

where $x$ is the segregant's log-scale intensity for the probe, $\phi(\cdot; \mu, \sigma^2)$ is the Gaussian density with mean $\mu$ and variance $\sigma^2$, and $\hat{\mu}_1$, $\hat{\mu}_2$ and $\hat{\sigma}^2$ are the ANOVA estimates obtained from the parental arrays.
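As a rough illustration of this supervised posterior calculation, the Python sketch below evaluates the probability of genotype 1 for a single probe. It assumes a common variance for the two genotype classes, as in a standard ANOVA; the function names and the numerical values are hypothetical and are not taken from Winzeler et al. (1998).

```python
import math

def log_normal_pdf(x, mean, sd):
    """Log density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def genotype_posterior(x, mean1, mean2, sd):
    """Posterior probability of genotype 1 under a uniform prior.

    With equal priors the prior terms cancel, so the posterior is a
    ratio of the two class densities. The ratio is formed on the log
    scale to avoid underflow for intensities far from both means.
    """
    l1 = log_normal_pdf(x, mean1, sd)
    l2 = log_normal_pdf(x, mean2, sd)
    top = max(l1, l2)
    w1 = math.exp(l1 - top)
    w2 = math.exp(l2 - top)
    return w1 / (w1 + w2)

# Hypothetical parental estimates for one probe (log-scale intensities).
p1 = genotype_posterior(x=9.5, mean1=10.0, mean2=8.0, sd=0.5)
print(f"P(genotype 1 | x) = {p1:.3f}")
```

An intensity exactly halfway between the two class means yields a posterior of 0.5, which is the kind of ambiguous call a quality filter would want to exclude.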
Variants on this procedure soon emerged. The 1- versus 2-mean ANOVA is equivalent to a two-sample t-test for difference in means, and Borevitz et al. (2003) proposed an alternative t-test for identification of SFPs, using the ad hoc moderated t-statistic of SAM (Tusher et al., 2001). Brem et al. (2002), whose data included hybridizations from numerous segregants of unknown genotype as well as from parental samples of known genotype, further augmented this approach: using parental data, candidate SFPs were identified on the basis of a high moderated t-statistic. Then, known parental genotype labels were temporarily set aside, and the combined parental and segregant data were subjected to k-means clustering (k = 2). Candidate SFPs were only retained if the parental samples were correctly separated by the resulting clusters. Further, Brem et al. estimated the Gaussian densities required in (1) from all data in the clusters, rather than only from parental observations of known genotype.

The more recent, multivariate approach of Gresham et al. (2006), designed for high-density tiling microarrays, is quite different: the authors considered the set of probes which interrogate a given position, and they modeled the decrease in fluorescence intensity caused by a SNP as a function of (i) the SNP's position within each probe, (ii) known response of the probes to reference sequence and (iii) various aspects of the probes' base composition. Their algorithm, SNPscanner, was trained on a set of ‘high-quality’ known SNPs to produce two predictions for probe set behavior: one which corresponds to reference sequence, and the other, to sequence with a variant base at the given position. Observed behavior on new arrays was compared with the two predictions, and genotype was assigned on the basis of which model fit best.
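The clustering check attributed above to Brem et al. (2002) can be sketched for a single probe as follows. This is a minimal illustration, not their implementation: the seeding of the two clusters at the extreme values, the function names and the example intensities are all assumptions made here.

```python
def two_means_1d(values, n_iter=100):
    """Lloyd's algorithm with k = 2 on scalar data, seeded at the extremes."""
    centers = [min(values), max(values)]
    labels = None
    for _ in range(n_iter):
        new_labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                      for v in values]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels

def parents_separated(parent_a, parent_b, segregants):
    """True if clustering the pooled data puts each parent in its own cluster."""
    labels = two_means_1d(parent_a + parent_b + segregants)
    labels_a = set(labels[:len(parent_a)])
    labels_b = set(labels[len(parent_a):len(parent_a) + len(parent_b)])
    return len(labels_a) == 1 and len(labels_b) == 1 and labels_a != labels_b

# Hypothetical log-scale intensities for one candidate SFP and one non-SFP.
keep = parents_separated([10.1, 9.9, 10.0], [8.0, 8.2, 7.9],
                         [9.8, 8.1, 10.2, 7.8])
drop = parents_separated([9.0, 9.4], [9.1, 9.3], [9.2, 9.0, 9.5])
```

The parental labels are ignored during clustering and used only afterwards, to decide whether the two clusters correspond to the two genotypes.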
In the remainder of this article, we introduce ssGenotyping (ssG) as an alternative to SNPscanner, and show that it provides both more specific and more sensitive genotyping in the context of a meiotic recombination dataset. In addition, we use the comparison between the methods to illustrate two points which are important for successful array-based genotyping in any context: (i) the extent to which probe behavior (cross-hybridization behavior, in particular) is sensitive to genomic background, and (ii) the ability of predictive models to describe probe behavior in a complex setting.

2 APPROACH AND METHODS

2.1 Motivation

We developed the ssG algorithm to genotype over 50 000 polymorphic markers in 220 segregants (51 wild-type tetrads and 5 msh4 deletion mutant tetrads) resulting from the sporulation of a diploid cross of two substantially diverged strains of S.cerevisiae (see Supplementary Methods). One strain, S96, is isogenic with the common laboratory strain S288c, for which the whole-genome sequence is known; the other, YJM789, is a clinical isolate that has recently been sequenced (Gu et al., 2005; Wei et al., 2007). The segregation patterns of the markers provided detailed information about local recombination rates, patterns of crossover interference, and the size and spatial distribution of gene conversion events (Mancera et al., 2008).

Genomic DNA from the segregants as well as from 25 parental samples was hybridized to Affymetrix tiling microarrays which provide dense coverage of the reference S288c genome, typically at 4 bp resolution. The arrays also include probes which interrogate YJM789 sequence, at positions where this sequence differs from the S288c reference (see Supplementary Methods). Comparison of the aligned sequences from the two strains revealed ≈61 000 putative polymorphisms (single nucleotide polymorphisms (SNPs), insertions or deletions), of which ≈52 000 were interrogated by distinct sets of one or more uniquely mapping probes.
Given the tiling design, the vast majority of these polymorphisms were interrogated by sets of overlapping probes (Supplementary Figure S1). We therefore selected a multivariate approach which is able to accommodate correlation arising from the overlap among the probes in a probe set (Fig. 2).
2.2 ssG: raw genotype calls

Let $Y_i \in \{1, 2\}$ denote the genotype of sample $i$ at the polymorphism in question, and let $X_i$ denote the vector of observed intensities for the probes in the corresponding probe set, with $j$ indexing the probes. If $A$, $B$ and $S$ denote the indices corresponding to the two parental types and the segregants, respectively, then $Y_i$ is known whenever $i \in A \cup B$, but unknown for $i \in S$. We postulated that $X \mid Y \sim d(\mu_Y, \Sigma_Y)$, and that $(X_i, Y_i)$ was independent of $(X_{i'}, Y_{i'})$ for $i \neq i'$. Importantly, we did not require $\mu_{1j}$ to differ from $\mu_{2j}$ for every $j$, reflecting the fact that the marginal behavior of some probes in a probe set may not distinguish between the two genotypes. Further, $\Sigma_1$ and $\Sigma_2$ were not assumed to be equal nor diagonal.

In the Gaussian mixture case, the M step of the EM algorithm, which maximizes an estimate of the conditional expectation of the log likelihood, only requires estimates of $P(Y_i = g \mid X_i)$ for all $i \in S$. To initialize these conditional probabilities (hereafter denoted $p_{ig}$), we applied a simple clustering algorithm (k-means, with the two clusters seeded with parental observations) to the combined parental and segregant data, and then set each $p^{(0)}_{ig}$ to either 0 or 1, depending on the outcome of this clustering. (Alternately, one could begin with the E step and initialize the $\hat{\mu}_g$ and $\hat{\Sigma}_g$ using the parental data; this strategy produced identical results.) Defining $p_{\cdot g} \equiv \sum_i p_{ig}$, where the sum runs over all samples and $p_{ig}$ is fixed at 0 or 1 for the parental samples, it is straightforward to show that the M step's objective function is maximized in $\mu_g$ by

$$\hat{\mu}_g = \frac{1}{p_{\cdot g}} \sum_i p_{ig} X_i$$

and in $\Sigma_g$ by

$$\hat{\Sigma}_g = \frac{1}{p_{\cdot g}} \sum_i p_{ig} (X_i - \hat{\mu}_g)(X_i - \hat{\mu}_g)^T.$$

In the E step, the $p_{ig}$ are updated, for $i \in S$, by

$$p_{ig} = \frac{\hat{\pi}_g \, \phi(X_i; \hat{\mu}_g, \hat{\Sigma}_g)}{\sum_{g'} \hat{\pi}_{g'} \, \phi(X_i; \hat{\mu}_{g'}, \hat{\Sigma}_{g'})},$$

where $\phi(\cdot; \mu, \Sigma)$ denotes the density of a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. We also define $\hat{\pi}_g$ to be $p_{\cdot g} / \sum_{g'} p_{\cdot g'}$.

Final assignment of genotype for the segregants was then obtained by comparing $p_{i1}$ and $p_{i2}$. This is analogous to (1), although the two distributions are now multivariate, and the parameter estimates are derived from a combination of the parental and offspring data rather than from parental data alone. The contrast between the two fit types (semi-supervised versus supervised parental-only) can be substantial, as illustrated in Figure 3.
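The semi-supervised EM iteration described in this section can be sketched in a few lines of Python. To stay dependency-free, the sketch is restricted to a single probe (the univariate case), so variances replace the covariance matrices of the multivariate method; it initializes from the parental data, and it estimates mixing proportions from the current class weights, which is an assumption made here. It also assumes both parental classes are non-empty. All names and data values are hypothetical, and this is not the ssGenotyping R implementation.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def m_step(xs, p, min_var=1e-8):
    """Weighted mean, variance and mixing proportion for each genotype."""
    params = []
    for g in (0, 1):
        w = [p_i[g] for p_i in p]
        total = sum(w)  # total weight assigned to genotype g
        mean = sum(w_i * x for w_i, x in zip(w, xs)) / total
        var = sum(w_i * (x - mean) ** 2 for w_i, x in zip(w, xs)) / total
        params.append((mean, max(var, min_var), total / len(xs)))
    return params

def e_step(xs, params):
    """Posterior genotype probabilities for unlabeled samples."""
    post = []
    for x in xs:
        d = [pi_g * normal_pdf(x, mean, var) for mean, var, pi_g in params]
        s = d[0] + d[1]
        # If both densities underflow, fall back to an uninformative call.
        post.append((d[0] / s, d[1] / s) if s > 0.0 else (0.5, 0.5))
    return post

def semi_supervised_em(parent1, parent2, segregants, n_iter=50):
    """Fit a two-class Gaussian mixture with parental labels held fixed."""
    labeled = [(1.0, 0.0)] * len(parent1) + [(0.0, 1.0)] * len(parent2)
    xs = parent1 + parent2 + segregants
    # Initialize the parameters from the labeled parental data only.
    params = m_step(parent1 + parent2, labeled)
    for _ in range(n_iter):
        post = e_step(segregants, params)      # E step: segregants only
        params = m_step(xs, labeled + post)    # M step: all samples
    return params, e_step(segregants, params)

# Hypothetical data: segregant intensities are shifted relative to parents.
parent1 = [10.0, 10.2, 9.8]
parent2 = [8.0, 8.1, 7.9]
segregants = [9.5, 9.7, 9.6, 9.4, 7.5, 7.6, 7.4, 7.7]
params, post = semi_supervised_em(parent1, parent2, segregants)
calls = [1 if p1 > p2 else 2 for p1, p2 in post]
```

Because the M step pools parental and segregant observations, the fitted class means move toward where the segregants actually fall, which is the adaptation that a parental-only fit cannot provide.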
2.3 ssG: filtering

After fitting distributions to all probe sets, we applied quality filtering at the (i) array, (ii) polymorphism and (iii) individual call levels. Genotype calls for four arrays implied a huge increase (more than an order of magnitude above what was typically observed) in genotype switching along the associated segregants' chromosomes, so these four arrays were set aside.

The distributional estimates $\hat{\mu}_g$ and $\hat{\Sigma}_g$ returned by the EM algorithm admit natural polymorphism- and call-level filtering as well (Figure 3B). We used $\hat{\mu}_g$ and $\hat{\Sigma}_g$ to compute expected misclassification rates, and set aside probe sets for which this rate was too large (> 1%). Finally, individual calls which were intermediate with respect to the two fitted class distributions, producing $p_{ig}$ which were too far from both 0 and 1, were removed. Individual calls with unambiguous $p_{ig}$ but which were nonetheless outliers with respect to their assigned class were also removed. (See Supplementary Methods for additional details.)

Supplementary Figure S2 depicts a further problem found in a small fraction (0.7%) of probe sets: behavior which is inconsistent with the biological and statistical models for meiotic recombination. Such behavior may be due to cross-hybridization from sequence at an unlinked locus, or to unanticipated translocations in our S96 parental strain, relative to the S288c reference sequence. To address this issue, we computed auxiliary fits for each probe set (by using only parental or only offspring data, or by fitting more than two clusters) and compared these fitted distributions with the main semi-supervised results. Strong disagreement between the fit types, or a significant improvement in fit quality when three or four clusters were used, permitted identification and removal of these aberrant probe sets.

2.4 Comparison to SNPscanner

We compared our ssG results to those of the supervised classifier, SNPscanner (Gresham et al., 2006). The purpose of this comparison was twofold.
First, we were interested in exploring the extent to which SNPscanner's statistical model, trained on parental data, could predict the behavior of probes in a different genomic context. Second, we were interested in knowing which algorithm provided better genotyping data.

The SNPscanner algorithm was designed to work with arrays which only interrogate the reference genome (S288c). In addition, the SNPscanner algorithm uses loess-type (Cleveland, 1979) normalization instead of the experiment-wide VSN we used in Mancera et al. (2008), and was originally trained on a different array design. To facilitate comparison, we carried out a secondary ssG analysis using only S288c-specific probes, and following the SNPscanner normalization strategy. We retrained SNPscanner on our parental hybridization data, using only SNPs meeting its ‘high-quality’ criterion. We then used the so-trained model to genotype our segregant arrays, and passed results through the same SNPscanner quality filters used in Gresham et al. (2006) (see Supplementary Methods). Filtered ssG and SNPscanner genotyping results were compared on the basis of call rate, concordance and accuracy.

3 RESULTS

Application of ssG to the data described in Section 2.1 permitted assessment of (i) the relationship between supervised and semi-supervised approaches to the genotyping task, (ii) the importance of quality filtering for genotyping accuracy and (iii) the differences between ssG and SNPscanner's model-based, supervised approach.

3.1 Parental versus offspring hybridizations

Probe set behavior in parental hybridizations, the only source of training data available to a supervised classifier, was often not representative of behavior in offspring hybridizations. Figures 2 [...]

3.2 Filtering

One objective of Mancera et al. (2008) was the characterization of short non-crossover gene conversion events. The number of putative small events seen in unfiltered ssG (Fig. 4A) [...] The ssG filters depend on $\hat{\mu}_g$ and $\hat{\Sigma}_g$, not on event size; therefore, they are not biased against small events.
As validation, sequencing-based genotype calls were obtained for 283 markers involved in or immediately adjacent to putative small events observed in the unfiltered ssG data (see Supplementary Methods). Figure 5
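The polymorphism- and call-level filters of Section 2.3 can be illustrated with the following Python sketch for a single probe. The 1% limit on the expected misclassification rate comes from the text; everything else is an assumption made here, including the Monte Carlo estimate of that rate (the authors' own computation is described in their Supplementary Methods), the call-level tolerance `tol`, the seed, and all function names.

```python
import math
import random

def log_normal_pdf(x, mean, sd):
    """Log density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def classify(x, comps):
    """Index of the class with the larger weighted density.

    comps is a pair of (mean, sd, weight) tuples with positive weights.
    """
    scores = [math.log(w) + log_normal_pdf(x, m, s) for m, s, w in comps]
    return 0 if scores[0] >= scores[1] else 1

def expected_misclassification(comps, n_draws=20000, seed=0):
    """Monte Carlo estimate of the error rate implied by the fitted classes."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_draws):
        g = 0 if rng.random() < comps[0][2] else 1  # weights sum to 1
        x = rng.gauss(comps[g][0], comps[g][1])
        if classify(x, comps) != g:
            errors += 1
    return errors / n_draws

def filter_call(p1, tol=0.01):
    """Genotype 1 or 2 for an unambiguous posterior, None otherwise."""
    if p1 >= 1.0 - tol:
        return 1
    if p1 <= tol:
        return 2
    return None

# Hypothetical fitted classes: one well-separated probe, one overlapping.
good = [(10.0, 0.3, 0.5), (8.0, 0.3, 0.5)]
bad = [(10.0, 0.5, 0.5), (9.5, 0.5, 0.5)]
keep_good = expected_misclassification(good) <= 0.01
keep_bad = expected_misclassification(bad) <= 0.01
```

Neither filter looks at the length of a run of identical calls along the chromosome, so a filter of this kind is indifferent to event size.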
3.3 Comparison to SNPscanner

When both ssG and SNPscanner employed their native filters, ssG made 45% more calls than SNPscanner, producing significantly denser effective marker coverage. A visual comparison of Figure 4B [...]

Are the short events identified by SNPscanner in Figure 4C [...]
When applied to the replicated tetrad discussed in the previous section, ssG produced just one discrepant call across the four spores. SNPscanner, on the other hand, produced discrepancies for 13% of the markers at which it made a call in both replicates. SNPscanner's filters were also very sensitive to laboratory effect. When applied to the replicate hybridized in the same laboratory as its training data, SNPscanner filtered S96 and YJM789 calls in roughly equal proportion; when applied to the other replicate, however, it filtered out almost all S96 calls, most likely due to a shift in distributional locations caused by the different conditions. Gresham et al. (2006) suggest that, when using SNPscanner to genotype new samples, the training data provided by the authors are sufficient, i.e. that it is not necessary to retrain the model on locally produced training data. The observed sensitivity to laboratory effect for our replicated tetrad, however, suggests that this is not always the case.

3.4 Application: gene conversion

After filtering, it is straightforward to infer the crossover and gene conversion history for each tetrad, on each chromosome. Figure 7 [...]
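Two elementary summaries underlie this kind of inference: the number of genotype switches along each spore's chromosome, and the segregation ratio of the four spores at each marker (2:2 for Mendelian segregation, 3:1 or 1:3 at a gene conversion). The Python sketch below computes both for a toy tetrad. It is only an illustration with hypothetical names and data, and it is not the inference procedure of Mancera et al. (2008).

```python
def count_switches(calls):
    """Number of genotype changes between consecutive called markers."""
    called = [c for c in calls if c is not None]
    return sum(a != b for a, b in zip(called, called[1:]))

def segregation_counts(marker_calls):
    """(n1, n2) genotype counts across spores, or None if a call is missing."""
    if any(c is None for c in marker_calls):
        return None
    n1 = sum(c == 1 for c in marker_calls)
    return (n1, len(marker_calls) - n1)

# Hypothetical tetrad: genotype calls (1 or 2) at six ordered markers.
tetrad = {
    "a": [1, 1, 1, 2, 2, 2],  # crossover between markers 2 and 3
    "b": [1, 1, 1, 1, 1, 1],
    "c": [2, 2, 2, 1, 1, 1],  # reciprocal product of the same crossover
    "d": [2, 2, 1, 2, 2, 2],  # single-marker conversion at marker 2
}
switches = {spore: count_switches(calls) for spore, calls in tetrad.items()}
patterns = [segregation_counts(m) for m in zip(*tetrad.values())]
```

In this toy example the crossover leaves every marker at 2:2, while the short conversion tract in spore d appears as a 3:1 marker flanked by 2:2 markers.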
4 DISCUSSION AND CONCLUSION

Classification and clustering algorithms have traditionally been called supervised and unsupervised approaches, respectively. Supervised classification learns model parameters from labeled training data in one step, then attempts to assign labels to new data in a separate step. Unsupervised clustering, on the other hand, is not given labeled training data; instead, it attempts to divide unlabeled data into sensible groups in a single step.

In this article, we present the multivariate ssG algorithm. While many previously proposed array-based genotyping methods are supervised classifiers, ssG takes a semi-supervised approach: it clusters data by genotype in a single step, but in a way that takes advantage of the limited amount of labeled parental data. It is clear from Figures 2 [...]

We also contrast ssG with SNPscanner, a recently proposed supervised classifier which is also based on multivariate Gaussian mixtures. SNPscanner employs a parametric model to predict the impact of polymorphisms on probe behavior, while ssG uses no such model, relying instead on empirical distributions derived from the clusters it identifies. By using a probe behavior model, SNPscanner attempts to shift statistical testing from the asymmetric case ($H_0: \theta = \theta_0$ vs. $H_A: \theta \neq \theta_0$) to the simpler symmetric case ($H_0: \theta = \theta_0$ vs. $H_A: \theta = \theta_1$). Such a shift is only possible if one can correctly specify $\theta_1$. Figure 6 [...]

Our results have focused on genotyping in the context of meiotic recombination, but the ssG algorithm is immediately applicable to other contexts. It can be applied to individual probes as well as to probe sets, and can be used with other array designs, e.g. non-tiling arrays, or arrays which interrogate a single genome. Because sequence data were available for both strains considered in this study, it was natural to define probe sets on the basis of known polymorphisms.
In general, sequence for the second strain is not required: probe sets may be defined simply on the basis of shared regions of interrogation. In such a case, most probe sets will interrogate non-polymorphic sequence and thus be uninformative, but standard model selection procedures (e.g. Bayesian information criterion (BIC)) appear to be sufficient for identification of sets which exhibit two-class behavior. As shown above, however, two-class behavior is necessary but not sufficient for effective genotyping: some probe sets corresponding to known polymorphisms do not clearly distinguish between the alleles; in other cases, varying genomic background and cross-hybridization may create two-class behavior even when there is no polymorphism at the interrogated locus.

Our results have important implications for the detection of polymorphisms in novel, unsequenced strains. Detection is typically accomplished by testing the null hypothesis that the novel strain's data have arisen from the same distribution seen in the reference strain. The discrepancies between parental and segregant behavior seen in Figure 3 [...]
ACKNOWLEDGEMENTS

We thank contributors to the Bioconductor (www.bioconductor.org; Gentleman et al., 2004) and R (www.R-project.org) projects for their software. We also thank Zhenyu Xu and Paul McGettigan for providing insight in the early stages of the project.

Funding: Deutsche Forschungsgemeinschaft; National Institutes of Health (to L.M.S.).

Conflict of Interest: none declared.