Genome Res. Jun 2008; 18(6): 918–929.
PMCID: PMC2413159

Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays

Abstract

Whole-genome oligonucleotide resequencing arrays have allowed the comprehensive discovery of single nucleotide polymorphisms (SNPs) in eukaryotic genomes of moderate to large size. With this technology, the detection rate for isolated SNPs is typically high. However, it is greatly reduced when other polymorphisms are located near a SNP as multiple mismatches inhibit hybridization to arrayed oligonucleotides. Contiguous tracts of suppressed hybridization therefore typify polymorphic regions (PRs) such as clusters of SNPs or deletions. We developed a machine learning method, designated margin-based prediction of polymorphic regions (mPPR), to predict PRs from resequencing array data. Conceptually similar to hidden Markov models, the method is trained with discriminative learning techniques related to support vector machines, and accurately identifies even very short polymorphic tracts (<10 bp). We applied this method to resequencing array data previously generated for the euchromatic genomes of 20 strains (accessions) of the best-characterized plant, Arabidopsis thaliana. Nonredundantly, 27% of the genome was included within the boundaries of PRs predicted at high specificity (≈97%). The resulting data set provides a fine-scale view of polymorphic sequences in A. thaliana; patterns of polymorphism not apparent in SNP data were readily detected, especially for noncoding regions. Our predictions provide a valuable resource for evolutionary genetic and functional studies in A. thaliana, and our method is applicable to similar data sets in other species. More broadly, our computational approach can be applied to other segmentation tasks related to the analysis of genomic variation.

Describing the complement of sequence variation within a species is a first step in linking genetic variation to phenotypes (The International HapMap Consortium 2005), and the development of methods for whole-genome polymorphism discovery has been a top priority in the life sciences (Shendure et al. 2004). Toward this goal, the generation of high-density, oligonucleotide microarrays suitable for whole-genome variation detection was a major technological breakthrough (e.g., Chee et al. 1996; Patil et al. 2001; Hinds et al. 2005). Such microarrays, hereafter referred to as resequencing arrays, employ a 1-bp tiling path to query bases relative to a known reference sequence. Each base is interrogated with eight features that consist of forward and reverse strand 25-mer oligonucleotide quartets. Within a quartet, oligonucleotides are identical to the reference sequence except at the central position, where each sequence possibility is represented. When hybridized to labeled genomic DNA, the highest signal intensity is expected for the perfect match oligonucleotide, thereby predicting the base in the corresponding target DNA sample. Large-scale polymorphism discovery using resequencing arrays was first performed in humans, identifying a large fraction of common single nucleotide polymorphisms (SNPs) in the global population (Patil et al. 2001; Hinds et al. 2005).
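The base-calling principle described above can be illustrated with a minimal sketch. It assumes that each quartet is given as a mapping from the base at the central position to a measured intensity, and that forward and reverse quartets are combined by summing log2 intensities; the function name `call_base`, the combination rule, and the intensity values are hypothetical and are not taken from any published base-calling algorithm.

```python
import math

BASES = "ACGT"

def call_base(forward, reverse):
    """Call a base from forward- and reverse-strand quartet intensities.

    forward, reverse: dicts keyed by the base each oligonucleotide would
    call on the reference strand, with raw intensities as values.
    Summing log2 intensities over strands is an assumption of this sketch.
    """
    scores = {b: math.log2(forward[b]) + math.log2(reverse[b]) for b in BASES}
    # The perfect-match oligonucleotide is expected to give the highest signal.
    return max(BASES, key=scores.get)

# Hypothetical intensities for one tiled position.
fwd = {"A": 120.0, "C": 95.0, "G": 2100.0, "T": 110.0}
rev = {"A": 140.0, "C": 80.0, "G": 1800.0, "T": 100.0}
print(call_base(fwd, rev))  # prints "G"
```

When neighboring polymorphisms suppress all eight features, the maximum is no longer informative, which is the situation the remainder of the paper addresses.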

Although conceptually simple, detection of polymorphisms from resequencing array data is nonetheless a computational challenge (Cutler et al. 2001; Patil et al. 2001; Clark et al. 2007). For SNPs, relative differences in feature intensities at a polymorphic position indicate the base call, and hybridization is reduced for flanking features as a consequence of off-center mismatches (cf. Fig. 1A,B). The resulting hybridization pattern provides a “SNP signature” that has been exploited by several algorithms to predict SNPs from resequencing array data (Patil et al. 2001; Hinds et al. 2005; Clark et al. 2007). However, where multiple SNPs or insertion/deletion (indel) polymorphisms are closely adjacent (occur within the same 25-mer), all oligonucleotides harbor off-center mismatches, and SNP prediction is generally not possible. For these regions, hybridization is suppressed for contiguous features in a tiling path. This pattern is therefore a signature of high underlying polymorphism, in the form either of closely linked SNPs or small indels, or potentially of larger deletions (cf. Fig. 1B,C). This phenomenon has limited the utility of resequencing array data for describing patterns of genome-wide sequence variation. Regions where no SNPs are predicted (1) may be monomorphic to the reference sequence or, alternatively, (2) may be so dissimilar that no underlying polymorphisms are detected.

Figure 1.
Effect of polymorphisms on hybridization patterns, labels for the mPPR algorithm, and polymorphic predictions. (A) Log2 intensities for oligonucleotides in a 56-bp tiling path (chromosome 4, positions 8,375,747–8,375,802) for the reference Col-0 ...

Despite the obvious value in predicting regions of high sequence diversity from resequencing array data, advanced computational approaches to this problem have not been reported. In one study, Hinds et al. (2006) used a simple thresholding algorithm coupled with visual inspection to identify more than a hundred deletions of length 70 bp to 7 kb (median, 750 bp) from resequencing array data for the mouse. More recently, Clark et al. (2007) applied a simple heuristic algorithm to predict tracts of highly divergent or missing sequences from similar data for Arabidopsis thaliana. Although this heuristic algorithm generated several hundred predictions per accession, it only identified extended polymorphic tracts (~300 bp to many kilobases) consisting largely of deletions. Currently, no methods have been reported to predict short indels (tens of base pairs) or clustered SNPs from resequencing array data. This limited investment in methods reflects, in part, the complex nature of the primary data (Clark et al. 2007). In contrast to most microarrays, resequencing arrays harbor all possible oligonucleotides for tiled regions, including those that are repetitive or that have inherently poor hybridization properties. Moreover, replication to reduce experimental noise has typically not been performed for resequencing array studies owing to the high cost of whole-genome analyses (Hinds et al. 2005; Clark et al. 2007; Frazer et al. 2007).

In this work, we describe a machine learning method suitable for predicting regions of high polymorphism density from resequencing array data. Our technique is related to hidden Markov models (HMMs) (e.g., Durbin et al. 1998), which are ubiquitous in computational biology and which have been applied to various segmentation and label sequence learning problems, such as gene finding (e.g., Burge and Karlin 1997). In our case, the prediction task is to label each tiled position in the genome either (1) as conserved or (2) as being at or immediately adjacent to a polymorphism (cf. Fig. 1). The relation between adjoining sites is exploited using a state model representation of these labels (see Methods). For HMMs the label sequence should satisfy the Markov property, and at each time point, observations are assumed to be independent (Durbin et al. 1998). For resequencing array data, the latter assumption is invalid as neighboring 25-mer oligonucleotides overlap and hybridization measurements are highly dependent. Recently, a number of discriminative learning algorithms such as conditional random fields (CRFs) (Lafferty et al. 2001), hidden Markov support vector machines (HMSVMs) (Altun et al. 2003; Tsochantaridis et al. 2005), and the related max-margin Markov networks (Taskar et al. 2003) have been proposed to solve various label sequence learning problems. These methods can handle dependencies between features and have been shown to be very powerful, e.g., for gene finding tasks (Bernal et al. 2007; Rätsch et al. 2007; Schulze et al. 2007). Our method, which we call margin-based prediction of polymorphic regions (mPPR), employs HMSVMs modeling the array measurement sequences to learn to identify polymorphic regions (PRs). Here, we define PRs as contiguous regions of nucleotides, each of which is at most 6 bp from a polymorphism or is between two polymorphisms separated by at most 18 bp (for a discussion of these distances, cf. Supplemental Fig. S1).
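The PR definition given above can be written out as a short labeling routine. This is a sketch under stated assumptions: polymorphisms are represented as single 0-based positions (so an indel is reduced to one coordinate), "separated by at most 18 bp" is read as a coordinate difference of at most 18, and the function names are illustrative only.

```python
def label_polymorphic_regions(length, polymorphic_sites, flank=6, max_gap=18):
    """Return a 0/1 label per position for a sequence of the given length."""
    labels = [0] * length
    sites = sorted(set(polymorphic_sites))
    # Rule 1: positions at most `flank` bp from a polymorphism.
    for s in sites:
        for p in range(max(0, s - flank), min(length, s + flank + 1)):
            labels[p] = 1
    # Rule 2: positions between two polymorphisms separated by at most `max_gap` bp.
    for a, b in zip(sites, sites[1:]):
        if b - a <= max_gap:
            for p in range(a, b + 1):
                labels[p] = 1
    return labels

def regions(labels):
    """Collapse 0/1 labels into half-open (start, end) intervals."""
    out, start = [], None
    for i, v in enumerate(labels + [0]):  # trailing 0 closes an open region
        if v and start is None:
            start = i
        elif not v and start is not None:
            out.append((start, i))
            start = None
    return out

labels = label_polymorphic_regions(100, [20, 35, 70])
print(regions(labels))  # prints [(14, 42), (64, 77)]
```

In the example, the polymorphisms at positions 20 and 35 are 15 bp apart and are therefore merged into one PR, whereas the isolated polymorphism at position 70 yields a 13-bp PR.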

We applied mPPR to an A. thaliana resequencing array data set for 20 accessions, hereafter called AtAD20, that contains data generated for more than 99.99% of bases in the 119-Mb reference genome (The Arabidopsis Genome Initiative 2000) for each accession (Clark et al. 2007). These data were previously used to identify ≈648,000 SNPs at a specificity of ~98% (the MBML2 SNP data set). With mPPR, on average ≈288,000 PRs were predicted per accession at a specificity of ~97%. A large proportion (~66%) of a set of known SNPs were included within PR predictions, of which 42% were absent from the MBML2 data set. The resulting PR data set defines a large fraction of the highly polymorphic or deleted regions segregating in the global A. thaliana population, and provides a high-resolution description of the genome-wide distribution of such regions in a moderately-sized eukaryotic genome.

Results

We adapted HMSVMs to predict PRs from array resequencing data. In brief, HMSVMs try to estimate a function π = fθ(x) of the input sequence x = x1...xt, in our case representing the features derived from the array measurements. This function predicts a label sequence π = π1...πt, of the same length t, indicating whether or not a position was within a PR. To estimate the free parameters θ of function fθ, n training examples, i.e., input sequences x(i) with corresponding labels π(i), i = 1,...,n, were used. In its most basic form, the method optimizes the parameters θ such that there is a large margin between the correct and any incorrect labeling (for details, see Methods), as similarly done in support vector machine classification (e.g., Vapnik 1995; Müller et al. 2001; Schölkopf and Smola 2002).
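To make the prediction step concrete, the following sketch decodes the highest-scoring label sequence for a generic two-state model (0 = conserved, 1 = polymorphic) by dynamic programming. It is not the mPPR state model, which is specified in the Methods; the per-position scores and transition scores are hypothetical stand-ins for what a trained function fθ would derive from array features.

```python
def viterbi(emission, transition):
    """Return the highest-scoring label sequence.

    emission[t][s]: score of label s at position t (hypothetical values).
    transition[r][s]: score for moving from label r to label s.
    """
    n_states = len(transition)
    best = list(emission[0])
    back = []
    for t in range(1, len(emission)):
        prev, new, ptr = best, [], []
        for s in range(n_states):
            # Best predecessor label for label s at position t.
            r = max(range(n_states), key=lambda q: prev[q] + transition[q][s])
            new.append(prev[r] + transition[r][s] + emission[t][s])
            ptr.append(r)
        best = new
        back.append(ptr)
    # Trace back from the best final label.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

emission = [[1, 0], [1, 0], [0, 2], [1, 0], [0, 2], [2, 0]]
transition = [[0, -1], [-1, 0]]  # switching labels is penalized
print(viterbi(emission, transition))  # prints [0, 0, 1, 1, 1, 0]
```

Note that position 3 is labeled polymorphic even though its own score weakly favors the conserved label: the transition penalty makes one contiguous tract score higher than two short ones, which is the kind of dependence between adjoining sites that a label sequence model can exploit.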

Our algorithm required a set of accession-matched, known sequences for the generation of label sequences used for training and evaluation. For 19 of the 20 AtAD20 accessions, 1213 fragments of ≈550 bp in length located throughout the genome had been sampled by PCR and dideoxy sequencing (Nordborg et al. 2005). This data set, hereafter called 2010, covers ~0.5% of the genome per accession and harbors ≈2700 SNPs and ≈400 indel polymorphisms per target accession (Nordborg et al. 2005). Col-0, the reference accession, was included in the AtAD20 accession set (Clark et al. 2007), and we used Col-0 array data to assess hybridization performance of arrayed oligonucleotides. As a consequence, predictions could not be generated for Col-0 itself (e.g., to detect errors in the reference sequence) (The Arabidopsis Genome Initiative 2000). Our method also used information about the repetitiveness of each arrayed 25-mer oligonucleotide determined from the Col-0 reference sequence (Clark et al. 2007). In particular, we separately modeled repetitive sequences from nonrepetitive sequences in an effort to avoid fragmentation of predictions in regions of low to moderate repeat content (see Methods).

Performance evaluation on 2010

We trained our method on 60% of the 2010 data, and used 20% for hyper-parameter tuning and 20% for evaluation; we employed a fivefold cross-validation strategy to obtain out-of-sample predictions for all 2010 fragments. We considered a prediction a true positive (TP) if a portion λ (or more) of its length was covered by known PR(s); otherwise it was counted as a false positive (FP). Conversely, a known PR was counted as a true discovery (TD) if all underlying polymorphisms were included within a prediction or if at least λ of its length was contained in one or more PR prediction(s); otherwise it was a false negative (FN). We used these counts to assess specificity and sensitivity (for details, see Supplemental Fig. S2 and Methods), and we excluded PRs from evaluation that were more than 75% duplicated elsewhere in the reference genome (these repetitive PRs constituted 3.4% of examples in 2010).
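The overlap-based counting rules can be sketched as follows. The sketch assumes half-open (start, end) intervals on a single fragment and returns only the four counts; how specificity and sensitivity are derived from them is defined in the Methods and is not reproduced here. The function names and example coordinates are hypothetical.

```python
def covered_fraction(interval, others):
    """Fraction of a half-open interval covered by any interval in `others`."""
    start, end = interval
    covered = set()
    for s, e in others:
        covered.update(range(max(start, s), min(end, e)))
    return len(covered) / (end - start)

def evaluate(predictions, known, polymorphisms, lam=0.75):
    """Count TP/FP over predictions and TD/FN over known PRs.

    polymorphisms[i] lists the polymorphic positions underlying known[i].
    """
    tp = sum(covered_fraction(p, known) >= lam for p in predictions)
    fp = len(predictions) - tp
    td = 0
    for region, sites in zip(known, polymorphisms):
        # TD criterion 1: every underlying polymorphism lies inside a prediction.
        all_inside = all(any(s <= x < e for s, e in predictions) for x in sites)
        # TD criterion 2: at least `lam` of the known PR is covered by predictions.
        if all_inside or covered_fraction(region, predictions) >= lam:
            td += 1
    return {"TP": tp, "FP": fp, "TD": td, "FN": len(known) - td}

known = [(10, 30), (50, 60)]
polymorphisms = [[16, 24], [56]]
predictions = [(12, 28), (80, 90)]
print(evaluate(predictions, known, polymorphisms))
# prints {'TP': 1, 'FP': 1, 'TD': 1, 'FN': 1}
```

Because small shifts in prediction boundaries change the covered fraction most for short intervals, the counts are sensitive to the choice of λ, which motivates reporting results for more than one cut-off.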

Tuning an internal parameter of our algorithm on the five cross-validation sets allowed us to adjust the trade-off between specificity and sensitivity (for details, see Fig. 2A and Methods). For 2010, we generated predictions at a specificity of ≥90% for λ = 75% (for the effect of varying λ on specificity and sensitivity, see Supplemental Fig. S3). Across all sequence types and accessions, our method identified 56% of PRs in 2010, and performance estimates varied only moderately between accessions (Supplemental Table S1). In A. thaliana, coding sequences have higher GC content and sequence complexity than noncoding sequences (The Arabidopsis Genome Initiative 2000). These factors are favorable for hybridization-based methods (Lee et al. 2004; Clark et al. 2007) and likely contributed to the higher sensitivity in coding regions (e.g., about a 1.3-fold difference compared with noncoding sequences at a similar specificity; cf. Fig. 2A; Table 1). As minor differences in prediction boundaries affect performance estimates—especially for small predictions (cf. Supplemental Fig. S2)—we also assessed the performance of the predictions with a relaxed overlap criterion. For λ = 50%, sensitivity was slightly higher, and specificity was at least 95% for all sequence types and ≈97% on average (cf. Table 1).

Table 1.
Specificity and sensitivity for PR predictions assessed with 2010 for different overlap cut-offs, λ (see main text)
Figure 2.
Relationship between specificity and sensitivity for PR predictions with overlap criterion λ = 75%. (A) Specificity–sensitivity curves averaged over cross-validation test subsets for different sequence types (for color code, see inset). ...

The labels we used for training are abstractions for underlying polymorphism; however, all polymorphism types were labeled (e.g., both SNPs and indels) and were thus targets for prediction. We therefore assessed the polymorphism content of predictions on the 2010 test data. Sixty-two percent of predictions identified single SNPs, 3.4% harbored single indels, and the remaining predictions identified complex mixes of polymorphism types, with clusters of SNPs most common (Supplemental Table S2). For indel polymorphisms, 53.3% of deleted bases and 38.9% of insertion sites in 2010 were included within predicted PRs. Across all prediction types, ~90% of bases within predictions were at or within 6 bp of a known polymorphism (see Fig. 2B).

While PR predictions typically reflected the underlying patterns of polymorphisms with high accuracy, prediction boundaries sometimes differed substantially from labels, and for some regions, even highly clustered polymorphisms were not identified (Fig. 1C,D). In large part, such FNs occurred for regions with poor hybridization properties in the reference accession (e.g., Fig. 1C,D, cf. predictions to reference feature intensities for regions 2 and 5; see also Supplemental Fig. S4). Additionally, although explicitly modeled by our method, repeats were overrepresented among FN predictions. For example, in 2010, 5.5% of all positions were repetitive (see Methods), while the fraction of repetitive positions in FN PRs was twice as high (10.9%). In contrast, only 2.1% of sites in correctly predicted PRs were repetitive. Therefore, repeats are a source of error for our predictions; however, mPPR was cautious in making predictions that included repetitive sites.

Prediction content and comparison to MBML2

We designed mPPR to produce predictions that complement existing SNP data sets ascertained from resequencing array data (Fig. 3). Although our method only identifies the approximate location of polymorphisms, 74.8% of clustered SNPs (≤18 bp away from the nearest polymorphism) in 2010 were included within boundaries of PR predictions (Table 2). This contrasts markedly with MBML2, for which a mere 12.4% of the clustered SNPs were identified. Although mPPR performed well for clustered SNPs, the method nevertheless also identified 55.4% of isolated SNPs (those >18 bp from the nearest polymorphism). Compared with MBML2, 42% of 2010 SNPs were located exclusively within mPPR prediction boundaries, whereas only 8% were found exclusively in MBML2. The most striking differences between the data sets were for clustered SNPs in untranslated and intergenic regions, where our method identified the approximate location of sevenfold to 10-fold as many SNPs as MBML2.

Table 2.
Sensitivity by polymorphism and sequence type
Figure 3.
Dependency of SNP sensitivity on distance between polymorphisms by detection method. SNPs were partitioned according to the distance to the nearest polymorphism. The frequency of SNPs in each distance bin (X-axis) is shown as bars. Sensitivity rates per ...

Whole-genome predictions and evaluation

HMSVMs trained on 2010 data were used for genome-wide prediction on AtAD20 accessions using the same settings as for evaluations on 2010 data (Table 1). Nonredundantly, 27% of the A. thaliana genome was included within the boundaries of the resulting predictions, and 92% of the predictions harbored <75% repetitive sites, the criterion we used for evaluation with 2010. Per accession, between 240,538 and 361,184 PRs were predicted, comprising between 5.3% and 8.5% of the genome (Supplemental Table S1). The accession with the most predictions, Cvi-0, was known from earlier work to be highly dissimilar to Col-0 (Schmid et al. 2003; Nordborg et al. 2005). By sequence type, intergenic positions were most strongly overrepresented within prediction boundaries (Supplemental Fig. S5).

Given the size and genome-wide sampling for the 2010 data (Nordborg et al. 2005), our performance evaluations likely generalize well for much of the genome. Nevertheless, the 2010 data are biased in several ways that potentially affect performance estimates. First, 2010 is overrepresented for coding sequences, and we adjusted performance estimates for genome predictions to account for the difference in sequence composition between 2010 and the whole genome (Table 1). However, noncoding sequences in 2010 are also biased, and are generally located in close proximity to coding sequences. A consequence is that polymorphism levels for the 2010 sequences are likely reduced compared to the genome average. Another concern is that, irrespective of sequence type, the PCR-based 2010 data are underrepresented for highly divergent or deleted sequences that could not be amplified by PCR.

We therefore used several resources partially or entirely independent of 2010 to evaluate genome-wide predictions. First, we assessed prediction quality using clone-based genomic sequence data available for three of the studied accessions. This included 37 kb of BAC sequences available for accession Cvi-0 and 14 kb for C24. Here, specificity was 96% and 100% (for λ = 50%) at a sensitivity of 67% and 45% for Cvi-0 and C24, respectively (Supplemental Table S4). Moreover, we assessed our predictions using the much larger twofold draft shotgun sequence data available for Ler-1 (see Supplemental Methods). Although we excluded repetitive regions from this evaluation, performance estimates with this genome-wide resource are expected to be largely unbiased by sequence composition. After removing contigs that were likely the result of assembly errors (see Supplemental Methods), the prediction quality assessed with 37.9 Mb of aligned sequence data differed only marginally from that assessed with the 2010 test data (e.g., specificity was 96%) (Supplemental Table S4). Thus, performance estimates with the genomic clone data were in general agreement with the PCR-based test data even though the composition of the predictions differed somewhat from those in the 2010 test set (e.g., more PRs harbored clusters of SNPs or indels; Supplemental Table S2).

Second, we assessed the performance of predictions for long deletions, a polymorphism type absent from 2010 and that we excluded from the clone-based data owing to alignment uncertainties in the draft genomic data (see Supplemental Methods). More than 100 known deletions of greater than 300 bp had been previously characterized in AtAD20 accessions (Clark et al. 2007) or were characterized in the current study (see Supplemental Methods). These deletions were almost entirely included within PR predictions (Fig. 4B; Supplemental Table S5; Supplemental Fig. S6).

Figure 4.
PRs reveal haplotype sharing at chromosomal and local scales. (A) Genes (top) and PRs (gray blocks beneath) for five accessions for 0.8 Mb surrounding the FRI locus. In Est-1 a region of ~0.6 Mb (dashed black box) including FRI (vertical line) ...

Finally, we note that extended tracts of repetitive sequences (>500 bp) are entirely absent from our evaluations (see Supplemental Methods). Nonetheless, such sequences are common in A. thaliana and are dispersed throughout the genome. To evaluate these as potential sources for false predictions, we took advantage of large regions known to be substantially identical to the Col-0 reference. Previously, Toomajian et al. (2006) used 2010 data to infer regions of extended haplotype sharing (i.e., sequence identity) with the Col-0 genome for the AtAD20 accessions. In such accessions and regions, our method predicted few PRs, e.g., as can be seen for a 600-kb region in Est-1 for which all 2010 segments are identical to Col-0 (Fig. 4A; Toomajian et al. 2006; Clark et al. 2007). This suggests a low incidence of false predictions in regions that are monomorphic to the reference genome sequence but that have repetitive sequence compositions broadly representative of the A. thaliana euchromatic genome.

Polymorphism patterns ascertained with PR and MBML2 data

An immediate use of PR predictions is the characterization of genome-wide patterns of genetic variation. While PR predictions delineate clusters of SNPs and indels with high accuracy, the nature of polymorphism underlying a given prediction is unknown. To examine genome-wide polymorphism levels, we therefore simply counted whether a base was included in a PR prediction in one or more of the AtAD20 accessions. To provide insights into ascertainment biases introduced by different methods, we also calculated the analogous polymorphism estimate with MBML2 SNP data.
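The nonredundant polymorphism estimate described above amounts to taking the union of PR intervals over accessions. A minimal sketch follows, assuming half-open (start, end) intervals on one chromosome; the function name and the toy coordinates are hypothetical.

```python
def nonredundant_fraction(predictions_by_accession, genome_length):
    """Fraction of positions covered by a PR in at least one accession."""
    intervals = sorted(
        iv for ivs in predictions_by_accession.values() for iv in ivs
    )
    covered, cur_start, cur_end = 0, None, None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or abutting interval: extend the merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length

prs = {
    "Cvi-0": [(0, 10), (40, 60)],
    "Ler-1": [(5, 20), (55, 70)],
}
print(nonredundant_fraction(prs, 100))  # prints 0.5
```

Counting each base once regardless of how many accessions share a PR at that position is what distinguishes this estimate from the per-accession genome fractions.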

Despite the inherent differences in prediction methods, patterns of polymorphism assessed using the PR and SNP data sets were nonetheless broadly correlated at chromosomal scales (Fig. 5; Supplemental Fig. S7). Polymorphism patterns apparent in the PR data also resembled those for pairwise nucleotide diversity, as previously calculated with MBML2 (Clark et al. 2007), as well as for several data sets generated by dideoxy sequencing (Nordborg et al. 2005; Schmid et al. 2005; Clark et al. 2007). Moreover, the patterns were also similar to those observed in single feature polymorphism data collected with the A. thaliana ATH1 microarray (Borevitz et al. 2007). In particular, polymorphism tended to be higher for centromeric and pericentromeric sequences, with additional regions of extended high polymorphism also apparent on chromosomal arms (e.g., distal to the centromeres on chromosomes 1 and 5) (Fig. 5).

Figure 5.
Genome-wide patterns of polymorphism in PRs and MBML2 SNPs. A sliding window of 100 kb was used, with values for every 10,000th position plotted. The Y-axis displays the fraction of bp in each window included within PRs nonredundantly over all accessions ...

We also examined polymorphism levels by sequence type by determining, for each position, the fraction of bases included in predictions across all accessions. Here, polymorphism apparent in PR and SNP data varied in a manner consistent with ascertainment biases (Table 2; Clark et al. 2007). Within genes, predicted polymorphism levels were on average higher for intronic sequences than for coding sequences when assessed with PR, but not with MBML2 data (Fig. 6A). For the PR data, the observed pattern is consistent with the general expectation of reduced evolutionary constraint for nontranslated sequences, as well as with estimates of nucleotide diversity from 2010 (Nordborg et al. 2005). In addition, inclusion of indels as prediction targets for mPPR, coupled with the bias for indel polymorphisms in noncoding regions (The Arabidopsis Genome Initiative 2000), is a likely factor contributing to fine-scale differences in polymorphism estimated from the different data sets.

Figure 6.
Patterns of polymorphism apparent in PR and SNP data in noncoding regions. (A) Polymorphism near splice donor (left) and splice acceptor (right) sites as averaged over 116,971 splice sites and assessed with both the PR prediction and MBML2 (SNP) data ...

We also used PR data to infer the distribution of polymorphisms in intergenic sequences for which SNP sensitivity for MBML2 is very low (Table 2; Clark et al. 2007) and for which diversity estimates from 2010 are largely limited to sequences near genes (Nordborg et al. 2005). Average levels of polymorphism varied as a function of distance from coding sequences and were asymmetric relative to gene orientation (Fig. 6B). Extending upstream to 5′ UTRs, polymorphism reached a plateau at ~450 bp, while the analogous plateau was reached within ~50 bp downstream from 3′ UTRs. Upstream of transcription start sites, polymorphism tended to be inversely associated with the density of predicted cis-regulatory elements (O’Connor et al. 2005). The reduced polymorphism 5′ to genes may, therefore, reflect constraint on cis-regulatory sequences, as suggested by permutation tests that revealed a highly significant under-representation for PR overlaps to predicted cis-regulatory sites (Fig. 6C; Supplemental Fig. S8; O’Connor et al. 2005).
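The idea behind a permutation test for under-representation can be sketched as below. The permutation scheme used in the study is described in supplemental material that is not reproduced here, so this sketch substitutes a simple scheme as an assumption: cis-element positions are redrawn uniformly at random within the region, and the one-sided P-value is the fraction of permutations with an overlap count no larger than the observed one. All names and values are hypothetical.

```python
import random

def underrepresentation_p(pr_mask, sites, n_perm=999, seed=0):
    """One-sided permutation P-value for few site/PR overlaps.

    pr_mask: list of 0/1 flags, 1 where a position lies in a PR.
    sites: positions of predicted cis-regulatory sites.
    """
    rng = random.Random(seed)
    observed = sum(pr_mask[s] for s in sites)
    n = len(pr_mask)
    at_most = 0
    for _ in range(n_perm):
        # Assumed null model: sites placed uniformly at random, no repeats.
        shuffled = rng.sample(range(n), len(sites))
        if sum(pr_mask[s] for s in shuffled) <= observed:
            at_most += 1
    # Add-one correction keeps the estimate above zero.
    return observed, (at_most + 1) / (n_perm + 1)

pr_mask = [1] * 500 + [0] * 500          # PRs cover the first half
sites = list(range(600, 1000, 20))       # 20 sites, all outside PRs
observed, p = underrepresentation_p(pr_mask, sites)
print(observed, p)
```

A realistic null model would also need to preserve properties such as distance to the transcription start site, since polymorphism itself varies with that distance.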

Highly polymorphic genes and gene families in A. thaliana

At the local scale, we used PR predictions to characterize, at high resolution, genes that are highly polymorphic in the A. thaliana population. On an accession basis, an average of 117 of 26,541 coding genes had more than 75% of their coding sequence within predictions. Across all accessions, we also assessed patterns of polymorphism among classes of genes by determining the fraction of coding bases per gene included in PR predictions (denoted “PR content”). Globally, intraspecific patterns of genic polymorphism predicted interspecific conservation, with lower PR content for A. thaliana genes with orthologs in black cottonwood (Populus trichocarpa), the most closely related plant with a sequenced genome (Supplemental Fig. S9; Tuskan et al. 2006). Among large gene families within A. thaliana (n > 125) (Clark et al. 2007), variation in PR content was readily apparent (Fig. 7A,B; Supplemental Fig. S10). Transcription factors, for which MBML2 SNP data suggested strong purifying selection, harbored few members with high PR content (Supplemental Fig. S10). In contrast, higher PR content was observed for F-box genes (Supplemental Fig. S10), for which many inactivating mutations have been identified (Clark et al. 2007), and for which patterns of sequence variation indicate high death rates in the A. thaliana genome (Thomas 2006). Among large gene families, nucleotide-binding leucine-rich repeat (NB-LRR) genes that mediate disease resistance harbored extreme levels of polymorphism (Fig. 7B), a finding that was even apparent in low resolution predictions of PRs from AtAD20 data (Clark et al. 2007).
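PR content as defined above is a simple overlap ratio, sketched here for a single gene. The sketch assumes half-open (start, end) intervals for coding exons and for PR predictions that have already been combined across accessions; the function name and coordinates are hypothetical.

```python
def pr_content(coding_intervals, pr_intervals):
    """Fraction of a gene's coding bases that fall inside any PR."""
    coding = set()
    for s, e in coding_intervals:
        coding.update(range(s, e))
    in_pr = set()
    for s, e in pr_intervals:
        in_pr.update(range(s, e))
    return len(coding & in_pr) / len(coding)

# Two coding exons of 100 bp each and one PR spanning the intron.
exons = [(100, 200), (300, 400)]
prs = [(150, 320)]
print(pr_content(exons, prs))  # prints 0.35
```

Only coding bases enter the denominator, so the intronic portion of the PR in the example does not contribute to the value.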

Figure 7.
Percentage of coding and miRNA genes included in PRs over all accessions by gene category. (A, B) Distribution of coding genes as a function of percentage inclusion in PRs for all genes and NB-LRR genes, respectively (see Supplemental Methods). (C) Polymorphism ...

As our PR predictions have high specificity and sensitivity in noncoding regions, we also used PR content to assess sequence variation within and among micro-RNA (miRNA) genes, where comparatively little is known about within-species polymorphism. Among A. thaliana miRNAs with homologs in other species (Jones-Rhoades et al. 2006), very little variation was observed for the 21-nucleotide miRNA sequences required for miRNA-mediated gene suppression (Fig. 7C). Marginally higher variation was observed for the complementary miRNA* sequence, with PR content substantially higher for precursor end and loop regions of miRNA precursor sequences. For a set of 68 validated or predicted miRNAs lacking homologs in other species (Rajagopalan et al. 2006; Fahlgren et al. 2007), PR content was generally much higher, and the pattern of reduced PR content for the miRNA sequence relative to the rest of the precursor was less clear (Fig. 7C). Whether this pattern reflects poor annotation for the nonconserved miRNAs, or potentially the evolution of new genes that are not fixed in the population, remains to be determined.

Data release

The PR prediction data set is available for download from The Arabidopsis Information Resource (TAIR; http://www.arabidopsis.org), as are lists for the percentage of each coding gene included in predictions by accession.

Discussion

Because array-based resequencing relies on hybridization, highly polymorphic regions present a substantial challenge. Generally, few SNPs are predicted for these regions, and precise methods for prediction of indels have not been developed. Nevertheless, clustered polymorphisms and indels, which can comprise more than 15% of polymorphisms in eukaryotic genomes (e.g., Dawson et al. 2001; Wicks et al. 2001; Mills et al. 2006), are a central component of sequence variation and contribute to phenotypic variation. Here, we present a method, mPPR, for accurate prediction of PRs from resequencing array data of the reference plant A. thaliana, where SNP polymorphism is higher than for human (Wright and Gaut 2005 and references therein), and for which indel polymorphisms are common (The Arabidopsis Genome Initiative 2000; Nordborg et al. 2005). While replicated hybridization measurements are typically not available for primary whole-genome hybridization data, each base in a tiling path is interrogated on the arrays, an ultimate determinant for the theoretical accuracy of predictions. By using a machine learning method to overcome experimental noise and to relate complex, dependent hybridization measurements from overlapping oligonucleotides to underlying polymorphisms, we detected even small clusters of SNPs or indels (within less than 10 bp) with high accuracy. Large deletions, which were absent from the 2010 training data, posed a challenge for our learning method. Nonetheless, deletions maximally suppress intensity measurements throughout a tiling path, and suppressed hybridization is the pattern identified by mPPR. In our predictions, long deletions were readily recognizable as (potentially interrupted) long PRs (for an example, see Fig. 4B).

Fine-scale patterns of polymorphisms

Our PR predictions, which are genome-wide and largely unbiased by sequence type, revealed patterns of polymorphism not apparent in earlier analyses. For example, in intergenic sequences, we found that average polymorphism is lowest immediately 5′ to transcribed sequences, rising to maximal levels within ~450 bp of the transcription start site. This observation is unlikely to result from an artifact in the PR data; a similar pattern is apparent in an interspecific comparison of promoter regions between A. thaliana and a close relative, Boechera stricta (Windsor et al. 2006). Constrained sequence evolution for regions immediately 5′ to genes may reflect the action of purifying selection on cis-regulatory sequences, as suggested by a significant underrepresentation of overlaps between PRs and transcriptional cis-elements predicted in a previous study (O’Connor et al. 2005). This finding suggests that in A. thaliana, the information required for gene expression is densest in close proximity to transcript start sites even though full recapitulation of complex expression patterns often requires substantially larger promoter fragments (e.g., Lee et al. 2006). An implication of this observation is that deep sampling of variation within A. thaliana populations will be important for both detecting cis-regulatory sequences and characterizing their evolution.

The specificity of our predictions also allowed us to characterize polymorphisms in transcribed A. thaliana sequences at a resolution of tens of base pairs. Hundreds of transcribed regions, representing genes from many families, were largely covered by PRs in one or more accessions. In some cases, this may reflect the absence of selection at annotated genes that are in fact pseudogenes. In other cases, highly dissimilar sequences may reflect the action of balancing selection, where linked mutations accumulate near a selectively maintained polymorphism. Allele frequency patterns in SNP data support balancing selection as a central force leading to high polymorphism levels for NB-LRR genes (Bakker et al. 2006; Clark et al. 2007), the predominant class of disease resistance (R) genes in plants (Jones and Dangl 2006). In our study, family-wide polymorphism for NB-LRR genes was extreme, as also noted from earlier work with the AtAD20 data (Clark et al. 2007), as well as from studies of a select set of NB-LRR genes in A. thaliana (Grant et al. 1998; Bakker et al. 2006; Shen et al. 2006). Nevertheless, polymorphism levels for individual NB-LRR genes varied greatly; some genes were almost entirely included in PRs, while others were predicted to be largely monomorphic across the AtAD20 accession set. This might reflect the action of different selective pressures on specific family members, and NB-LRR genes harboring little or no variation may have been targets of recent positive selection (sweeps) in A. thaliana populations. Although the primary function for NB-LRR genes is in race-specific resistance to pathogens, not all R genes are NB-LRR members (e.g., Song et al. 1995). The extent to which other highly polymorphic genes identified in this study mediate interactions with the biotic (or potentially abiotic) environment requires empirical study.

Utility of predictions for functional studies

Our predictions are immediately useful for functional studies in A. thaliana. Many genes entirely covered by PRs are likely to be partially or completely deleted. These constitute a potential source of loss-of-function alleles for genes for which knockout alleles have not been found in sequence indexed A. thaliana mutant collections (Alonso and Ecker 2006). Moreover, the AtAD20 set was selected not only to maximally capture diversity within the species but also to include many parents of recombinant inbred line (RIL) populations constructed for quantitative trait locus (QTL) mapping (http://www.inra.fr/internet/Produits/vast/RILs.htm). Deletions or highly polymorphic sequences have been shown to underlie diverse phenotypes that segregate in A. thaliana populations (e.g., Johanson et al. 2000), and our predictions should be valuable for identifying causal alleles found in QTL studies, or that are linked to SNPs employed in whole-genome association mapping scans (Kim et al. 2007). At a more basic level, our predictions will facilitate the design of perfect match primers for genotyping and for collecting diversity data with PCR-based methods. Further, the predictions are useful for identifying mismatched probes present on microarrays employed for interrogating RNA expression in different accessions.

Application of our methods to other data and broader relevance

Although mPPR was tailored for predicting PRs with A. thaliana resequencing array data, it should be readily applicable to other resequencing array data sets with some modifications. In previous experiments with human and mouse (e.g., Hinds et al. 2005; Frazer et al. 2007), and for ongoing work with rice (McNally et al. 2006), DNA hybridized to arrays was generated by pooling long-range PCR amplicons of selected regions. For A. thaliana, the entire genomic DNA was subjected to isothermal amplification (Clark et al. 2007). Nevertheless, the framework of our learning algorithm can be adapted to accommodate additional intensity variation resulting from concentration differences between individual long-range PCR products. In humans, heterozygosity presents an additional challenge, as does the lack of sample-matched training data. In contrast, for other species, hybridization was performed using inbred (homozygous) strains (Frazer et al. 2004, 2007; McNally et al. 2006). Sample-matched data sets that could potentially be used for training have been reported for mouse (e.g., Mural et al. 2002) and are being generated for rice (K. Childs and R. Buell, pers. comm.), which would facilitate application of the mPPR approach.

In this work we adapted a new machine learning method for inferring structural information from sequences to the problem of identifying polymorphisms. The underlying inference method has recently been developed in the field of machine learning (Altun et al. 2003; Tsochantaridis et al. 2005; Rätsch and Sonnenburg 2007) and can be seen as an extension of support vector machines, which are frequently used in computational biology. It has been successfully applied in natural language processing (e.g., Altun et al. 2003; Sha and Saul 2007), computational gene finding (Rätsch et al. 2007), and spliced sequence alignment (Schulze et al. 2007). This diversity illustrates the flexibility and power of the approach. Traditionally, generative models such as HMMs have been used for similar applications. They attempt to estimate probability densities over the observation sequence and its segmentation. However, it has been argued that such approaches do not lead to the best discrimination performance, as high-dimensional density estimation is known to be a harder task than discrimination (Vapnik 1995). One reason generative methods are often outperformed by discriminative methods is that they typically need to assume independence between observations in a sequence (for a comparison of label sequence learning methods and discussion, see also Nguyen and Guo 2007). Since our method does not assume independence, it is well suited for many tasks in genome research for which measurements are dependent.

Methods

Preparation of hybridization, repeat, and sequence data

Hybridization data from Clark et al. (2007) were quantile-normalized (Bolstad et al. 2003) to correct for between-array variation in hybridization intensities and to facilitate the use of predictors trained with data from all accessions. Consequently, predictors were available to make predictions on any accession. The 2010 data set that we used to generate the label set for both PRs and conserved regions (see below) is available for download as previously described (Nordborg et al. 2005; Clark et al. 2007). Array measurements for repetitive oligonucleotides are much less reliable than for unique oligonucleotides; therefore, we annotated repetitive 25-mer oligonucleotides on the resequencing arrays as described by Clark et al. (2007). We combined information for all types of 25-mer repeats defined by Clark et al. (2007) to create a 0/1-sequence that indicated whether a site was repetitive according to any of the categories. This repeat-mask (called RM) was an input for our algorithm.
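The two preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the list-based data layout, and the position-based handling of tied intensities are assumptions made for illustration only.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equally long intensity vectors (one per array).

    Assumption: ties are broken by probe position, a simplification.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda idx: a[idx])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


def combine_repeat_masks(category_masks):
    """Build a 0/1 repeat mask: 1 if a site is repetitive in any category."""
    return [int(any(flags)) for flags in zip(*category_masks)]
```

After normalization every array shares the same empirical intensity distribution, which is what allows a predictor trained on pooled accessions to be applied to any single accession.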

Overview of the mPPR algorithm

We started by introducing a graphical model of states and allowed transitions (Supplemental Fig. S11). Instead of predicting the label (polymorphic or conserved) directly, our algorithm was designed to learn to assign a state to each sequence position given the hybridization measurements. To do this, each known sequence in the 2010 data set was first translated into a state sequence, i.e., the “truth” that we tried to approximate. We then applied HMSVMs (Altun et al. 2003) for label sequence learning. We augmented these with explicit feature scoring functions and adapted these to our task by defining an appropriate loss function. From the predicted state sequence, we afterward inferred the label sequence (see color coding in Supplemental Fig. S11).

State model

The simplest possible model, with one state, C, for conserved nucleotides and one state, P, for PRs, was extended in two ways. First, we noted that hybridization signal gradually decreases over a few nucleotides toward a polymorphism. We therefore included a series of three states (T1, T2, T3) modeling decreasing intensities upstream of PRs and similarly three states (T4, T5, T6) for increasing intensities downstream (for details, see Supplemental Fig. S11). The second extension relates to repetitive sequences, which we modeled separately from unique sequences via duplicated states that effectively allowed feature scoring functions for repetitive regions to be learned differently. The model contains a state CR for conserved, repetitive sequences (positions p where RM(p) = 1) and a state CU for conserved, unique sequences (where RM(p) = 0), likewise a state PU for polymorphic, unique sequences with PR as the repetitive counterpart. Transition states Ti were not duplicated. We denote the set of states by S. Allowed transitions between the states are drawn as arcs in Supplemental Figure S11. Real-valued scores ϕ(i,j) were associated with transitions from state i ∈ S to state j ∈ S, which were determined during training of the method except for the transitions ϕ(i,CR) or ϕ(i,CU) that were made deterministically depending on whether RM(p) = 1 or RM(p) = 0, respectively (and similarly for ϕ(i,PR) and ϕ(i,PU)).

Generation of labels

To train our method, we first generated the target state sequence that is to be reproduced given only the input sequence. Initially, all polymorphic sites (deleted nucleotides, SNPs, and nucleotides directly upstream of an insertion site) known from the 2010 set were assigned PU or PR states depending on the repeat annotation. In the next step, we assigned PU or PR states to sites between two polymorphic labels at a distance of ≤18 bp (for the choice of this distance, cf. Supplemental Fig. S1). Every segment of P states was then extended 6 bp in each direction, and the transition states T1,...,T3 and T4,...,T6 were inserted upstream and downstream of every segment of P states, respectively. Finally, CU or CR states were assigned to the remaining positions. This procedure generated a state sequence for every fragment in the 2010 data set.
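The labeling procedure can be sketched in Python as follows. This is an illustrative reading of the steps above, not the authors' code: the function names are hypothetical, "distance" is taken to be the coordinate difference between two polymorphic sites, and the transition states are assumed to occupy the three positions immediately flanking each extended P segment. Closely spaced segments whose flanks would overlap are not treated specially.

```python
def find_segments(flags):
    """Return (start, end) pairs, end exclusive, for runs of True values."""
    segments, start = [], None
    for p, flag in enumerate(flags):
        if flag and start is None:
            start = p
        elif not flag and start is not None:
            segments.append((start, p))
            start = None
    if start is not None:
        segments.append((start, len(flags)))
    return segments


def make_state_sequence(length, polymorphic_sites, repeat_mask,
                        max_gap=18, extend=6):
    """Translate known polymorphic sites into a target state sequence."""
    is_p = [False] * length
    sites = sorted(set(polymorphic_sites))
    for s in sites:
        is_p[s] = True
    # Join neighbouring polymorphic sites that are at most max_gap apart.
    for a, b in zip(sites, sites[1:]):
        if b - a <= max_gap:
            for p in range(a + 1, b):
                is_p[p] = True
    # Extend every P segment by `extend` positions in each direction.
    for start, end in find_segments(is_p):
        for p in range(max(0, start - extend), min(length, end + extend)):
            is_p[p] = True
    # Polymorphic and conserved states, split by the repeat mask.
    states = [("PR" if repeat_mask[p] else "PU") if is_p[p]
              else ("CR" if repeat_mask[p] else "CU")
              for p in range(length)]
    # Assumed placement: T1..T3 directly upstream, T4..T6 directly downstream.
    for start, end in find_segments(is_p):
        for k in range(3):
            up, down = start - 3 + k, end + k
            if up >= 0 and not is_p[up]:
                states[up] = f"T{k + 1}"
            if down < length and not is_p[down]:
                states[down] = f"T{k + 4}"
    return states
```

For example, two SNPs at positions 15 and 20 of a unique 40-bp fragment yield one P segment covering positions 9 to 26, flanked by T1 to T3 at positions 6 to 8 and T4 to T6 at positions 27 to 29.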

Generation of input features

As input to our learning algorithm, seven features were derived from hybridization data. Some of these also used information from the reference genome sequence. Three groups of features were used. First were features directly derived from array intensities (cf. Supplemental Table S6, features 1–4). Some of these were based on a ratio between hybridization intensities of the target and the reference accession. Second, one feature was computed from quality scores (feature 5). Third, two features were included that capture the (dis)agreement between raw base calls from the arrays and the reference sequence (features 6, 7). Quality scores and raw base calls were as defined previously (Clark et al. 2007). The result was a feature vector of length m = 7 associated with every position in the genome. Additionally, the repeat annotation RM was included; however, this was used to switch deterministically between CU and CR states, as well as between PU and PR states, and not for learning per se.

Parametrization

Formally, our goal was to learn a function

$$f : X \rightarrow S^{*}$$

that predicts the state sequence (path) π ∈ S* given the sequence of observations x ∈ X (input features), both of equal length t, where S* denotes the Kleene closure. This was done indirectly via a θ-parametrized discriminant function

$$F_\theta : X \times S^{*} \rightarrow \mathbb{R}$$

that assigned a real-valued score to a pair of observation and state sequence (Altun et al. 2003). Once Fθ is known, f can be obtained as

$$f(x) = \operatorname*{argmax}_{\pi \in S^{*}} F_\theta(x, \pi).$$

In our case Fθ satisfied the Markov property, which is sufficient to show that this decoding can be computed efficiently by dynamic programming (Durbin et al. 1998; Giegerich et al. 2004).
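A minimal Python sketch of such a dynamic-programming decoder is given below. It assumes that the score of a path decomposes into per-position state scores plus scores for transitions between neighbouring states; the data layout (a list of dictionaries for state scores, a dictionary keyed by state pairs for transitions) and the function name are illustrative choices, not part of the published software. State pairs missing from the transition dictionary are treated as forbidden, and at least one allowed path is assumed to exist.

```python
def viterbi(node_scores, trans):
    """Return the highest-scoring state path and its score.

    node_scores: list over positions of {state: score}
    trans: {(prev_state, next_state): score}; missing pairs are forbidden
    """
    states = list(node_scores[0])
    # The pseudo-transition into the first position contributes a score of 0.
    best = dict(node_scores[0])
    backpointers = []
    for scores in node_scores[1:]:
        new_best, pointers = {}, {}
        for k in states:
            candidates = [(best[i] + trans[(i, k)], i)
                          for i in best if (i, k) in trans]
            if not candidates:
                continue  # state k is unreachable at this position
            total, prev = max(candidates)
            new_best[k] = total + scores[k]
            pointers[k] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

The running time is linear in the sequence length and quadratic in the number of states, which is what makes decoding whole chromosomes feasible.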

The input to the discriminant function $F_\theta$ consisted of observations $x$, an $m \times t$ matrix of $m$ different features, and a sequence of states $\pi = \pi_1,\dots,\pi_t$. For every pair of features $j = 1,\dots,m$ and states $k \in S$, we employed a feature scoring function $g_{j,k}: \mathbb{R} \rightarrow \mathbb{R}$. $F_\theta$ was then obtained as a linear combination of the feature scoring contributions and the transition scores $\phi$:

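$$F_\theta(x, \pi) = \sum_{p=1}^{t} \left( \sum_{k \in S} [[\pi_p = k]] \sum_{j=1}^{m} g_{j,k}(x_{j,p}) + \phi(\pi_{p-1}, \pi_p) \right),$$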

where $[[\cdot]]$ denotes the indicator function. For convenience of notation, we assumed a pseudo-transition $\phi(\pi_0, \pi_1) = 0$. We modeled the feature scoring functions $g_{j,k}$ as piecewise linear functions as follows (Rätsch et al. 2007): Let $Q$ be the number of supporting points $q_l$ (satisfying $q_l < q_{l+1}$) and $v_l$ their values; then the piecewise linear function is defined by

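$$g_{j,k}(x) = \begin{cases} v_1 & x \le q_1 \\ \dfrac{v_l\,(q_{l+1} - x) + v_{l+1}\,(x - q_l)}{q_{l+1} - q_l} & q_l \le x \le q_{l+1} \\ v_Q & x \ge q_Q \end{cases}$$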

We chose $Q = 10$ supporting points on the abscissa such that each interval $[q_l, q_{l+1}]$ contained approximately equally many feature values (determined on the training set). In the following, $\theta_{j,k,l}$ will denote the value $v_l$ of $g_{j,k}$. Together with the transition scores $\phi$, the values at the supporting points $\theta_{j,k,l}$ constituted the parametrization of the model (in the following collectively denoted by $\theta$).
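As an illustration, a piecewise linear scoring function with quantile-based supporting points might be sketched in Python as below. The sketch assumes linear interpolation between neighbouring supporting points and constant values outside the outermost points; the function names are hypothetical, and the supporting points must be strictly increasing (which can fail if a feature has many identical values).

```python
from bisect import bisect_right


def supporting_points(values, Q=10):
    """Pick Q supporting points at evenly spaced empirical quantiles."""
    s = sorted(values)
    return [s[round(l * (len(s) - 1) / (Q - 1))] for l in range(Q)]


def piecewise_linear(x, q, v):
    """Evaluate the piecewise linear function with points q and values v."""
    if x <= q[0]:
        return v[0]
    if x >= q[-1]:
        return v[-1]
    l = bisect_right(q, x) - 1          # index with q[l] <= x < q[l + 1]
    w = (x - q[l]) / (q[l + 1] - q[l])  # relative position inside the interval
    return v[l] + w * (v[l + 1] - v[l])
```

Because the function value is a linear combination of the values at the two neighbouring supporting points, the overall discriminant function stays linear in the learned parameters, which is what permits training by linear programming.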

Learning algorithm

Let $n$ be the number of training examples $(x^{(i)}, \pi^{(i)})$, $i = 1,\dots,n$. Following the discriminative learning paradigm, we wanted to enforce a large margin of separation between the correct path $\pi^{(i)}$ and any other wrong path $\pi \ne \pi^{(i)}$, i.e.,

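$$F_\theta(x^{(i)}, \pi^{(i)}) - F_\theta(x^{(i)}, \pi) \gg 0 \quad \forall\, \pi \ne \pi^{(i)},\ i = 1,\dots,n.$$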

To achieve this, the following linear programming problem (LP) is solved:

equation image

s.t.

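$$F_\theta(x^{(i)}, \pi^{(i)}) - F_\theta(x^{(i)}, \pi) \ge 1 - \xi^{(i)}, \quad \xi^{(i)} \ge 0 \quad \forall\, \pi \ne \pi^{(i)},\ i = 1,\dots,n \qquad (1)$$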

where Ω is a linear regularization term of the form

equation image

Note that Fθ is linear in all parameters and hence the constraints in Equation 1 are linear. Regularization is a technique commonly used in empirical inference to avoid overfitting. Our regularizer implements the idea that absolute parameter values should be small, and it penalizes the variation of the feature scoring functions (with respect to the choice of supporting points). Regularization strength can be adjusted using the hyper-parameter C.

We introduced so-called slack variables ξ(i) to implement a soft-margin (Cortes and Vapnik 1995) allowing some prediction errors on the training set. As there are exponentially many wrong paths π, we also have an exponential number of margin constraints in Equation 1. This prohibits solving the optimization problem directly. Instead, starting from an empty set of margin constraints and a random parametrization θ(1), for every training example we computed the wrong path that maximally violates the margin constraints. We used a generalized Viterbi algorithm that decodes the two best paths, thereby allowing us to identify the wrong path (since there is only one correct path). Adopting a column generation technique, adding constraints and solving the intermediate LP were iterated till convergence to the (provably) optimal solution (Hettich and Kortanek 1993; Rätsch et al. 2002; Altun et al. 2003): At iteration t, new margin constraints of the form

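$$F_\theta(x^{(i)}, \pi^{(i)}) - F_\theta(x^{(i)}, \hat{\pi}^{(i)}) \ge 1 - \xi^{(i)}, \quad \hat{\pi}^{(i)} = \operatorname*{argmax}_{\pi \ne \pi^{(i)}} F_{\theta^{(t)}}(x^{(i)}, \pi)$$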

were added to the problem, which was solved again to obtain the next intermediate solution θ(t+1). The intermediate LPs were solved using the CPLEX optimization software (http://www.ilog.com/products/cplex/), which facilitated training with n = 12,000 examples.

Loss function

We augmented the basic algorithm described above with a loss function Δ that adjusts the loss a path incurs depending on its similarity to the true path. That is, a path that closely resembles the truth incurs a small loss compared to one that is completely different from the true path. The loss function is used to rescale the margin (Altun et al. 2003; Taskar et al. 2003), replacing the margin constraints in Equation 1 with

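$$F_\theta(x^{(i)}, \pi^{(i)}) - F_\theta(x^{(i)}, \pi) \ge \Delta(\pi^{(i)}, \pi) - \xi^{(i)} \quad \forall\, \pi \ne \pi^{(i)},\ i = 1,\dots,n.$$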

During optimization, the loss was taken into account when decoding to find the maximal margin violator:

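$$\hat{\pi}^{(i)} = \operatorname*{argmax}_{\pi \in S^{*}} \left( F_\theta(x^{(i)}, \pi) + \Delta(\pi^{(i)}, \pi) \right)$$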

The loss was required to be non-negative and decomposable for efficient decoding via dynamic programming. We chose a position-wise loss $\ell(\pi^{(i)}_p, \pi_p)$, which is summed over the whole sequence (of length t): $\Delta(\pi^{(i)}, \pi) = \sum_{p=1}^{t} \ell(\pi^{(i)}_p, \pi_p)$ (similar to a weighted Hamming loss; for details, cf. Supplemental Table S7).
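A weighted Hamming-style loss of this kind, and its use during decoding, can be sketched in Python as follows. The weight table stands in for Supplemental Table S7, whose values are not reproduced here; the function names, the dictionary layout, and the default weight of 1 for unlisted state pairs are assumptions for illustration.

```python
def sequence_loss(true_path, pred_path, weights, default=1.0):
    """Sum position-wise losses; identical states incur no loss.

    weights: {(true_state, predicted_state): loss}, a hypothetical table
    """
    if len(true_path) != len(pred_path):
        raise ValueError("paths must have equal length")
    return sum(0.0 if t == p else weights.get((t, p), default)
               for t, p in zip(true_path, pred_path))


def loss_augmented_scores(node_scores, true_path, weights, default=1.0):
    """Add the position-wise loss to every per-position state score.

    Decoding on the returned scores maximizes model score plus loss,
    which identifies the maximal margin violator.
    """
    return [{k: s + (0.0 if k == t else weights.get((t, k), default))
             for k, s in scores.items()}
            for scores, t in zip(node_scores, true_path)]
```

Because the loss decomposes over positions, it can be folded into the per-position scores, so the same dynamic-programming decoder serves both prediction and training.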

Cross-validation, evaluation, and whole-genome predictions

For fivefold cross-validation, fragments in the 2010 set were randomly split into five subsets, where we ensured that across all accessions overlapping sequences were assigned to the same subset. The first predictor was trained on the first three subsets, its optimal regularization parameter C was selected on the fourth subset, and its performance was evaluated on the fifth subset. For the other four predictors, the assignment of training, validation, and test set was permuted in order to obtain unbiased (test) predictions for all 2010 data.
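The rotation of subsets across the five predictors can be written down compactly. The Python sketch below assumes a cyclic rotation, which is one way to realize the permutation described above; the exact assignment used in the study is not specified, and the function name is hypothetical.

```python
def fold_roles(n_folds=5):
    """Assign training, validation, and test subsets for each predictor."""
    plans = []
    for r in range(n_folds):
        # Rotate the subset order so that each subset is the test set once.
        order = [(r + k) % n_folds for k in range(n_folds)]
        plans.append({"train": order[:-2],
                      "validation": order[-2],
                      "test": order[-1]})
    return plans
```

With this scheme every subset serves exactly once as a test set, so each 2010 fragment receives a prediction from a predictor that never saw it during training or model selection.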

All evaluations were based on data from 18 accessions (no predictions were made for the reference, and for Van-0 no reliably labeled set exists; Clark et al. 2007). Furthermore, known PRs as well as predicted PRs were excluded from specificity–sensitivity estimation if they contained ≥75% repetitive sites.

Replacing transition scores ϕ(i,i), i ∈ {CU,CR} after training by ϕ(i,i) + δ resulted in predictions either with increased specificity (δ > 0) or with increased sensitivity (δ < 0). Fifty-one values for δ were uniformly chosen from the interval [−3, 2] to generate specificity-sensitivity curves for all five test subsets. For Figure 2 and Supplemental Figure S3, specificity-sensitivity curves were averaged over the subsets.
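The tuning step can be sketched in Python as below, assuming that the replacement adds δ to the self-transition scores of the conserved states, so that positive values make the model stay in conserved states more readily. The function names and the dictionary layout for transition scores are illustrative.

```python
def delta_grid(lo=-3.0, hi=2.0, n=51):
    """Return n evenly spaced offsets covering [lo, hi] (step 0.1 here)."""
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]


def shift_self_transitions(trans, delta, states=("CU", "CR")):
    """Return a copy of the transition scores with delta added to the
    self-transitions of the conserved states; the input is not modified."""
    shifted = dict(trans)
    for s in states:
        shifted[(s, s)] = trans[(s, s)] + delta
    return shifted
```

Decoding the test subsets once per grid value and recording specificity and sensitivity each time traces out one curve per predictor.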

The sequence type of each nucleotide was determined based on the TAIR6 A. thaliana genome annotation available at http://www.arabidopsis.org. In cases where annotations overlapped, the sequence type was assigned following the hierarchy: coding > UTR/intron > intergenic. PRs were assigned a sequence type based on the majority of nucleotides contained.
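The assignment rules can be sketched in Python as follows. Two details are assumptions not stated above: unannotated bases are treated as intergenic, and ties in the majority vote are resolved in favor of the higher-ranked sequence type. The names are hypothetical.

```python
from collections import Counter

# Hierarchy used when annotations overlap: coding > UTR/intron > intergenic.
PRIORITY = ("coding", "UTR/intron", "intergenic")


def base_type(annotations):
    """Return the highest-ranked sequence type among a base's annotations."""
    for seq_type in PRIORITY:
        if seq_type in annotations:
            return seq_type
    return "intergenic"  # assumption: unannotated bases count as intergenic


def region_type(per_base_types):
    """Assign a PR the sequence type held by the majority of its bases."""
    counts = Counter(per_base_types)
    # max() returns the first maximal element, so ties follow PRIORITY order.
    return max(PRIORITY, key=lambda seq_type: counts[seq_type])
```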

Specificity and sensitivity for whole-genome predictions are expected to be slightly different from the values estimated on the 2010 set as coding sequences are relatively overrepresented for 2010 compared with the entire genome (Nordborg et al. 2005; Clark et al. 2007). To account for the compositional bias of the 2010 data, we applied the following correction: Let $N_{cds}^{2010}$ be the number of coding bases in the 2010 data; $N_{cds}^{genome}$, the number of coding bases in the genome. Then, for the whole genome the number of TPs in coding regions is estimated as $TP_{cds}^{genome} = TP_{cds}^{2010} \cdot N_{cds}^{genome} / N_{cds}^{2010}$. Applying the same corrections for FPs, TNs, and FNs, as well as for intergenic (ige) and UTR/intron bases (utr), specificity was recalculated as

equation image

and sensitivity as

equation image

To obtain PR predictions with high specificity, transition scores were independently tuned by choosing the smallest δ for which each of the predictors achieved specificity ≥90% on its test set. Whole-genome predictions were made independently with every predictor, and a single prediction was assigned to every position according to the following scheme: The genome was partitioned into chunks of ~1 kb (breakpoints between chunks were only set where all five predictors agreed on CU or CR). If a chunk contained a 2010 sequence fragment, the respective test predictions were used. Otherwise one of the five predictors was chosen randomly for the given chunk.

Methods for the analyses of genome-wide predictions, experimental characterization of predictions, evaluation of genome-wide patterns of polymorphism, overlap of predictions relative to cis-regulatory sites, and annotation of predictions relative to genes are provided in the Supplemental material. GenBank accession numbers for RPM1 are ET181618-ET181629. The software used to produce the results in the article and an open source toolbox with an improved and easier-to-use implementation of the algorithm are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/mppr.

Acknowledgments

We thank Gabriele Schweikert, Bernhard Schölkopf, and Stephan Ossowski for helpful discussions and Stephan Ossowski for assistance in visualizing predictions. Supported by Innovation Funds and core funding of the Max Planck Society. D.W. is a director of the Max Planck Institute. G.R. is a group leader at the Friedrich Miescher Laboratory of the Max Planck Society.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.070169.107.

References

  • Alonso J., Ecker J., Ecker J. Moving forward in reverse: Genetic technologies to enable genome-wide phenomic screens in Arabidopsis. Nat. Rev. Genet. 2006;7:524–536. [PubMed]
  • Altun Y., Tsochantaridis I., Hofmann T., Tsochantaridis I., Hofmann T., Hofmann T. Proceedings of the 20th International Conference on Machine Learning. AAAI Press; Menlo Park, CA: 2003. Hidden Markov support vector machines; pp. 3–10.
  • The Arabidopsis Genome Initiative Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815. [PubMed]
  • Bakker E., Toomajian C., Kreitman M., Bergelson J., Toomajian C., Kreitman M., Bergelson J., Kreitman M., Bergelson J., Bergelson J. A genome-wide survey of R gene polymorphisms in Arabidopsis. Plant Cell. 2006;18:1803–1818. [PMC free article] [PubMed]
  • Bernal A., Crammer K., Hatzigeorgiou A., Pereira F., Crammer K., Hatzigeorgiou A., Pereira F., Hatzigeorgiou A., Pereira F., Pereira F. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput. Biol. 2007;3:e54. doi: 10.1371/journal.pcbi.0030054. [PMC free article] [PubMed] [Cross Ref]
  • Bolstad B., Irizarry R., Astrand M., Speed T., Irizarry R., Astrand M., Speed T., Astrand M., Speed T., Speed T. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. [PubMed]
  • Borevitz J., Hazen S., Michael T., Morris G., Baxter I., Hu T., Chen H., Werner J., Nordborg M., Salt D., Hazen S., Michael T., Morris G., Baxter I., Hu T., Chen H., Werner J., Nordborg M., Salt D., Michael T., Morris G., Baxter I., Hu T., Chen H., Werner J., Nordborg M., Salt D., Morris G., Baxter I., Hu T., Chen H., Werner J., Nordborg M., Salt D., Baxter I., Hu T., Chen H., Werner J., Nordborg M., Salt D., Hu T., Chen H., Werner J., Nordborg M., Salt D., Chen H., Werner J., Nordborg M., Salt D., Werner J., Nordborg M., Salt D., Nordborg M., Salt D., Salt D., et al. Genome-wide patterns of single-feature polymorphism in Arabidopsis thaliana. Proc. Natl. Acad. Sci. 2007;104:12057–12062. [PMC free article] [PubMed]
  • Burge C., Karlin S., Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997;268:78–94. [PubMed]
  • Chee M., Yang R., Hubbell E., Berno A., Huang X., Stern D., Winkler J., Lockhart D., Morris M., Fodor S., Yang R., Hubbell E., Berno A., Huang X., Stern D., Winkler J., Lockhart D., Morris M., Fodor S., Hubbell E., Berno A., Huang X., Stern D., Winkler J., Lockhart D., Morris M., Fodor S., Berno A., Huang X., Stern D., Winkler J., Lockhart D., Morris M., Fodor S., Huang X., Stern D., Winkler J., Lockhart D., Morris M., Fodor S., Stern D., Winkler J., Lockhart D., Morris M., Fodor S., Winkler J., Lockhart D., Morris M., Fodor S., Lockhart D., Morris M., Fodor S., Morris M., Fodor S., Fodor S., et al. Accessing genetic information with high-density DNA arrays. Science. 1996;274:610–614. [PubMed]
  • Clark R., Schweikert G., Toomajian C., Ossowski S., Zeller G., Shinn P., Warthmann N., Hu T., Fu G., Hinds D., Schweikert G., Toomajian C., Ossowski S., Zeller G., Shinn P., Warthmann N., Hu T., Fu G., Hinds D., Toomajian C., Ossowski S., Zeller G., Shinn P., Warthmann N., Hu T., Fu G., Hinds D., Ossowski S., Zeller G., Shinn P., Warthmann N., Hu T., Fu G., Hinds D., Zeller G., Shinn P., Warthmann N., Hu T., Fu G., Hinds D., Shinn P., Warthmann N., Hu T., Fu G., Hinds D., Warthmann N., Hu T., Fu G., Hinds D., Hu T., Fu G., Hinds D., Fu G., Hinds D., Hinds D., et al. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science. 2007;317:338–342. [PubMed]
  • Cortes C., Vapnik V., Vapnik V. Support vector networks. Mach. Learn. 1995;20:273–297.
  • Cutler D., Zwick M., Carrasquillo M., Yohn C., Tobin K., Kashuk C., Mathews D., Shah N., Eichler E.J.W., Warrington J.A., Zwick M., Carrasquillo M., Yohn C., Tobin K., Kashuk C., Mathews D., Shah N., Eichler E.J.W., Warrington J.A., Carrasquillo M., Yohn C., Tobin K., Kashuk C., Mathews D., Shah N., Eichler E.J.W., Warrington J.A., Yohn C., Tobin K., Kashuk C., Mathews D., Shah N., Eichler E.J.W., Warrington J.A., Tobin K., Kashuk C., Mathews D., Shah N., Eichler E.J.W., Warrington J.A., Kashuk C., Mathews D., Shah N., Eichler E.J.W., Warrington J.A., Mathews D., Shah N., Eichler E.J.W., Warrington J.A., Shah N., Eichler E.J.W., Warrington J.A., Eichler E.J.W., Warrington J.A., Warrington J.A., et al. High-throughput variation detection and genotyping using microarrays. Genome Res. 2001;11:1913–1925. [PMC free article] [PubMed]
  • Dawson E., Chen Y., Hunt S., Smink L., Hunt A., Rice K., Livingston S., Bumpstead S., Bruskiewich R., Sham P., Chen Y., Hunt S., Smink L., Hunt A., Rice K., Livingston S., Bumpstead S., Bruskiewich R., Sham P., Hunt S., Smink L., Hunt A., Rice K., Livingston S., Bumpstead S., Bruskiewich R., Sham P., Smink L., Hunt A., Rice K., Livingston S., Bumpstead S., Bruskiewich R., Sham P., Hunt A., Rice K., Livingston S., Bumpstead S., Bruskiewich R., Sham P., Rice K., Livingston S., Bumpstead S., Bruskiewich R., Sham P., Livingston S., Bumpstead S., Bruskiewich R., Sham P., Bumpstead S., Bruskiewich R., Sham P., Bruskiewich R., Sham P., Sham P., et al. A SNP resource for human chromosome 22: Extracting dense clusters of SNPs from the genomic sequence. Genome Res. 2001;11:170–178. [PMC free article] [PubMed]
  • Durbin R., Eddy S., Krogh A., Mitchison G., Eddy S., Krogh A., Mitchison G., Krogh A., Mitchison G., Mitchison G. Biological sequence analysis. Probabilistic models of protein and nucleic acids. Cambridge University Press; Cambridge: 1998.
  • Fahlgren N., Howell M., Kasschau K., Chapman E., Sullivan C., Cumbie J., Givan S., Law T., Grant S., Dangl J., Howell M., Kasschau K., Chapman E., Sullivan C., Cumbie J., Givan S., Law T., Grant S., Dangl J., Kasschau K., Chapman E., Sullivan C., Cumbie J., Givan S., Law T., Grant S., Dangl J., Chapman E., Sullivan C., Cumbie J., Givan S., Law T., Grant S., Dangl J., Sullivan C., Cumbie J., Givan S., Law T., Grant S., Dangl J., Cumbie J., Givan S., Law T., Grant S., Dangl J., Givan S., Law T., Grant S., Dangl J., Law T., Grant S., Dangl J., Grant S., Dangl J., Dangl J., et al. Highthroughput sequencing of Arabidopsis microRNAs: Evidence for frequent birth and death of miRNA genes. PLoS ONE. 2007;1:e14. doi: 10.1371/journal.pone.0000219. [PMC free article] [PubMed] [Cross Ref]
  • Frazer K., Wade C., Hinds D., Patil N., Cox D., Daly M., Wade C., Hinds D., Patil N., Cox D., Daly M., Hinds D., Patil N., Cox D., Daly M., Patil N., Cox D., Daly M., Cox D., Daly M., Daly M. Segmental phylogenetic relationships of inbred mouse strains revealed by finescale analysis of sequence variation across 4.6 Mb of mouse genome. Genome Res. 2004;14:1493–1500. [PMC free article] [PubMed]
  • Frazer K., Eleazar E., Kang H., Bogue M., Hinds D., Beilharz E., Gupta R., Montgomery J., Morenzoni M., Nilsen G., Eleazar E., Kang H., Bogue M., Hinds D., Beilharz E., Gupta R., Montgomery J., Morenzoni M., Nilsen G., Kang H., Bogue M., Hinds D., Beilharz E., Gupta R., Montgomery J., Morenzoni M., Nilsen G., Bogue M., Hinds D., Beilharz E., Gupta R., Montgomery J., Morenzoni M., Nilsen G., Hinds D., Beilharz E., Gupta R., Montgomery J., Morenzoni M., Nilsen G., Beilharz E., Gupta R., Montgomery J., Morenzoni M., Nilsen G., Gupta R., Montgomery J., Morenzoni M., Nilsen G., Montgomery J., Morenzoni M., Nilsen G., Morenzoni M., Nilsen G., Nilsen G., et al. A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature. 2007;448:1050–1053. [PubMed]
  • Giegerich R., Meyer C., Steffen P., Meyer C., Steffen P., Steffen P. A discipline of dynamic programming over sequence data. Sci. Comput. Program. 2004;51:215–263.
  • Grant M., Godiard L., Straube E., Ashfield T., Lewald J., Sattler A., Innes R., Dangl J., Godiard L., Straube E., Ashfield T., Lewald J., Sattler A., Innes R., Dangl J., Straube E., Ashfield T., Lewald J., Sattler A., Innes R., Dangl J., Ashfield T., Lewald J., Sattler A., Innes R., Dangl J., Lewald J., Sattler A., Innes R., Dangl J., Sattler A., Innes R., Dangl J., Innes R., Dangl J., Dangl J. Structure of the Arabidopsis RPM1 gene enabling dual specificity disease resistance. Science. 1995;269:843–846. [PubMed]
  • Grant M., McDowell J., Sharpe A., de Torres Zabala M., Lydiate D., Dangl J., McDowell J., Sharpe A., de Torres Zabala M., Lydiate D., Dangl J., Sharpe A., de Torres Zabala M., Lydiate D., Dangl J., de Torres Zabala M., Lydiate D., Dangl J., Lydiate D., Dangl J., Dangl J. Independent deletions of a pathogen-resistance gene in Brassica and Arabidopsis. Proc. Natl. Acad. Sci. 1998;95:15843–15848. [PMC free article] [PubMed]
  • Hettich R., Kortanek K., Kortanek K. Semi-infinite programming: Theory, methods and applications. SIAM Rev. 1993;3:380–429.
  • Hinds D., Stuve L., Nilsen G., Halperin E., Eskin E., Ballinger D., Frazer K., Cox D., Stuve L., Nilsen G., Halperin E., Eskin E., Ballinger D., Frazer K., Cox D., Nilsen G., Halperin E., Eskin E., Ballinger D., Frazer K., Cox D., Halperin E., Eskin E., Ballinger D., Frazer K., Cox D., Eskin E., Ballinger D., Frazer K., Cox D., Ballinger D., Frazer K., Cox D., Frazer K., Cox D., Cox D. Whole-genome patterns of common DNA variation in three human populations. Science. 2005;307:1072–1079. [PubMed]
  • Hinds D., Kloek A., Jen M., Chen X., Frazer K., Kloek A., Jen M., Chen X., Frazer K., Jen M., Chen X., Frazer K., Chen X., Frazer K., Frazer K. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat. Genet. 2006;38:82–85. [PubMed]
  • The International HapMap Consortium A haplotype map of the human genome. Nature. 2005;437:1299–1320. [PMC free article] [PubMed]
  • Johanson U., West J., Lister C., Michaels S., Amasino R., Dean C., West J., Lister C., Michaels S., Amasino R., Dean C., Lister C., Michaels S., Amasino R., Dean C., Michaels S., Amasino R., Dean C., Amasino R., Dean C., Dean C. Molecular analysis of FRIGIDA, a major determinant of natural variation in Arabidopsis flowering time. Science. 2000;290:344–347. [PubMed]
  • Jones J., Dangl J., Dangl J. The plant immune system. Nature. 2006;444:323–329. [PubMed]
  • Jones-Rhoades M., Bartel D., Bartel B., Bartel D., Bartel B., Bartel B. MicroRNAs and their regulatory roles in plants. Annu. Rev. Plant Biol. 2006;57:19–53. [PubMed]
  • Kim S., Plagnol V., Hu T., Toomajian C., Clark R., Ossowski S., Ecker J., Weigel D., Nordborg M., Plagnol V., Hu T., Toomajian C., Clark R., Ossowski S., Ecker J., Weigel D., Nordborg M., Hu T., Toomajian C., Clark R., Ossowski S., Ecker J., Weigel D., Nordborg M., Toomajian C., Clark R., Ossowski S., Ecker J., Weigel D., Nordborg M., Clark R., Ossowski S., Ecker J., Weigel D., Nordborg M., Ossowski S., Ecker J., Weigel D., Nordborg M., Ecker J., Weigel D., Nordborg M., Weigel D., Nordborg M., Nordborg M. Recombination and linkage disequilibrium in Arabidopsis thaliana. Nat. Genet. 2007;39:1151–1155. [PubMed]
  • Lafferty J., McCallum A., Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning. Morgan Kaufmann Publishers Inc.; San Francisco, CA: 2001.
  • Lee I., Dombkowski A., Athey B. Guidelines for incorporating non-perfectly matched oligonucleotides into target-specific hybridization probes for a DNA microarray. Nucleic Acids Res. 2004;32:681–690. [PMC free article] [PubMed]
  • Lee J.-Y., Colinas J., Wang J., Mace D., Ohler U., Benfey P. Transcriptional and posttranscriptional regulation of transcription factor expression in Arabidopsis roots. Proc. Natl. Acad. Sci. 2006;103:6055–6060. [PMC free article] [PubMed]
  • McNally K., Bruskiewich R., Mackill D., Buell C., Leach J., Leung H. Sequencing multiple and diverse rice varieties. Connecting whole genome variation with phenotypes. Plant Physiol. 2006;141:26–31. [PMC free article] [PubMed]
  • Mills R., Luttig C., Larkins C., Beauchamp A., Tsui C., Pittard W., Devine S. An initial map of insertion and deletion (indel) variation in the human genome. Genome Res. 2006;16:1182–1190. [PMC free article] [PubMed]
  • Müller K.-R., Mika S., Rätsch G., Tsuda K., Schölkopf B. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 2001;12:181–201. [PubMed]
  • Mural R., Adams M., Myers E., Smith H., Miklos G., Wides R., Halpern A., Li P., Sutton G., Nadeau J.A., et al. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science. 2002;296:1661–1671. [PubMed]
  • Nguyen N., Guo Y. Comparisons of sequence labeling algorithms and extensions. In: Proceedings of the 24th International Conference on Machine Learning. ACM Press; Corvallis, OR: 2007. pp. 681–688.
  • Nordborg M., Hu T., Ishino Y., Jhaveri J., Toomajian C., Zheng H., Bakker E., Calabrese P., Gladstone J., Goyal R., et al. The pattern of polymorphism in Arabidopsis thaliana. PLoS Biol. 2005;3:e196. doi: 10.1371/journal.pbio.0030196. [PMC free article] [PubMed] [Cross Ref]
  • O’Connor T., Dyreson C., Wyrick J. Athena: A resource for rapid visualization and systematic analysis of Arabidopsis promoter sequences. Bioinformatics. 2005;21:4411–4413. [PubMed]
  • Patil N., Berno A., Hinds D., Barrett W., Doshi J., Hacker C., Kautzer C., Lee D., Marjoribanks C., McDonough D., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001;294:1719–1723. [PubMed]
  • Rajagopalan R., Vaucheret H., Trejo J., Bartel D. A diverse and evolutionarily fluid set of microRNAs in Arabidopsis thaliana. Genes & Dev. 2006;20:3407–3425. [PMC free article] [PubMed]
  • Rätsch G., Sonnenburg S. Large scale hidden semi-Markov SVMs. In: Schölkopf B., et al., editors. Advances in neural information processing systems 19. MIT Press; Cambridge, MA: 2007. pp. 1161–1168.
  • Rätsch G., Demiriz A., Bennett K. Sparse regression ensembles in infinite and finite hypothesis spaces. Mach. Learn. 2002;48:193–221.
  • Rätsch G., Sonnenburg S., Srinivasan J., Witte H., Müller K.-R., Sommer R., Schölkopf B. Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Comput. Biol. 2007;3:e20. doi: 10.1371/journal.pcbi.0030020. [PMC free article] [PubMed] [Cross Ref]
  • Schmid K., Sorensen T., Stracke R., Torjek O., Altmann T., Mitchell-Olds T., Weisshaar B. Large-scale identification and analysis of genome-wide single-nucleotide polymorphisms for mapping in Arabidopsis thaliana. Genome Res. 2003;13:1250–1257. [PMC free article] [PubMed]
  • Schmid K., Ramos-Onsins S., Ringys-Beckstein H., Weisshaar B., Mitchell-Olds T. A multilocus sequence survey in Arabidopsis thaliana reveals a genome-wide departure from a neutral model of DNA sequence polymorphism. Genetics. 2005;169:1601–1615. [PMC free article] [PubMed]
  • Schölkopf B., Smola A. Learning with kernels. MIT Press; Cambridge, MA: 2002.
  • Schulze U., Hepp B., Ong C., Rätsch G. PALMA: mRNA to genome alignments using large margin algorithms. Bioinformatics. 2007;23:1892–1900. [PubMed]
  • Sha F., Saul L. Large margin hidden Markov models for automatic speech recognition. In: Schölkopf B., et al., editors. Advances in neural information processing systems 19. MIT Press; Cambridge, MA: 2007. pp. 1249–1256.
  • Shen J., Araki H., Chen L., Chen J., Tian D. Unique evolutionary mechanism in r-genes under the presence/absence polymorphism in Arabidopsis thaliana. Genetics. 2006;172:1243–1250. [PMC free article] [PubMed]
  • Shendure J., Mitra R., Varma C., Church G. Advanced sequencing technologies: Methods and goals. Nat. Rev. Genet. 2004;5:335–344. [PubMed]
  • Song W.-Y., Wang G.-L., Chen L.-L., Kim H.-S., Pi L.-Y., Holsten T., Gardner J., Wang B., Zhai W.-X., Zhu L.-H., et al. A receptor kinase-like protein encoded by the rice disease resistance gene, Xa21. Science. 1995;270:1804–1806. [PubMed]
  • Taskar B., Guestrin C., Koller D. Max-margin Markov networks. In: Thrun S., et al., editors. Advances in neural information processing systems 16. MIT Press; Cambridge, MA: 2004. pp. 25–32.
  • Thomas J. Adaptive evolution in two large families of ubiquitin ligase adapters in nematodes and plants. Genome Res. 2006;16:1017–1030. [PMC free article] [PubMed]
  • Toomajian C., Hu T., Aranzana M., Lister C., Tang C., Zheng H., Zhao K., Calabrese P., Dean C., Nordborg M., et al. A nonparametric test reveals selection for rapid flowering in the Arabidopsis genome. PLoS Biol. 2006;4:e137. doi: 10.1371/journal.pbio.0040137. [PMC free article] [PubMed] [Cross Ref]
  • Tsochantaridis I., Joachims T., Hofmann T., Altun Y. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res. 2005;6:1453–1484.
  • Tuskan G., DiFazio S., Jansson S., Bohlmann J., Grigoriev I., Hellsten U., Putnam N., Ralph S., Rombauts S., Salamov A., et al. The genome of black cottonwood, Populus trichocarpa. Science. 2006;313:1596–1604. [PubMed]
  • Vapnik V. The nature of statistical learning theory. Springer Verlag; New York: 1995.
  • Wicks S., Yeh R., Gish W., Waterston R., Plasterk R. Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nat. Genet. 2001;28:160–164. [PubMed]
  • Windsor A., Schranz M., Formanova N., Gebauer-Jung S., Bishop J., Schnabelrauch D., Kroymann J., Mitchell-Olds T. Partial shotgun sequencing of the Boechera stricta genome reveals extensive microsynteny and promoter conservation with Arabidopsis. Plant Physiol. 2006;140:1169–1182. [PMC free article] [PubMed]
  • Wright S., Gaut B. Molecular population genetics and the search for adaptive evolution in plants. Mol. Biol. Evol. 2005;22:506–519. [PubMed]

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press