NIH Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
Nat Methods. Author manuscript; available in PMC Oct 1, 2009.
Published in final edited form as:
PMCID: PMC2574580
NIHMSID: NIHMS66806

Caenorhabditis elegans mutant allele identification by whole-genome sequencing

Abstract

Identification of the molecular lesion in Caenorhabditis elegans mutants isolated through forward genetic screens usually involves time-consuming genetic mapping. We used Illumina deep sequencing technology to sequence a complete, mutant C. elegans genome and thus pinpointed a single-nucleotide mutation in the genome that affects a neuronal cell fate decision. This constitutes a proof-of-principle for using whole-genome sequencing to analyze C. elegans mutants.

C. elegans is used extensively to identify genes involved in various aspects of animal development, behavior and physiology1 (http://www.wormbook.org/). The traditional forward genetic approach involves random mutagenesis and subsequent isolation of mutants defective in a given process1. The ensuing characterization of the molecular lesion in a mutant strain is a painstaking process that involves mapping with genetic and/or single-nucleotide polymorphism (SNP) markers. The relative gene density in C. elegans and limited recombinant frequencies can make traditional mapping a very time-consuming process. This issue becomes even more apparent when the scoring of the mutant phenotype is cumbersome and recombinants are therefore tedious to identify. Another problem with traditional mapping approaches is that genetic-background effects on a given phenotype prohibit the use of many genetic markers and mapping strains.

We considered whole-genome sequencing as an approach to identify the molecular lesion in a specific ethyl methanesulfonate (EMS)-induced mutant C. elegans strain. We had previously described a genetic locus, lsy-12, in which a neuronal fate decision is aberrantly executed2. Instead of generating two left/right asymmetric, distinct chemosensory neurons, ASEL and ASER, lsy-12 mutants generate two ASER neurons2. To determine the molecular identity of lsy-12, we undertook a single mapping cross of the recessive lsy-12 allele ot177 (lsy-12(ot177)) with a Hawaiian mapping strain3, analyzed 200 F2 progeny for SNPs by PCR and thereby mapped lsy-12(ot177) to a 4-Mb interval on chromosome V. This interval represents 4% of the genome and contains 1,142 predicted genes (5% of total genes).

We prepared genomic DNA from lsy-12(ot177) worms and sequenced the DNA using paired-end Illumina (formerly Solexa) sequencing technology4. We generated 4.35 Gb of paired 35-mer sequence reads in a 1-week sequencing run. Then we mapped the sequence data to the wild-type N2 reference genome using ELAND (efficient large-scale alignment of nucleotide databases) and the Maq alignment tools (Supplementary Methods online); 3.1 Gb of reads were mapped exactly onto the genome with an average coverage of ~28×. To label differences between our sequence data and the N2 reference genome as ‘variants’, we filtered for those reads that mapped uniquely with high-quality scores on both strands and were read at least ten times, thus eliminating the vast majority of ambiguous calls (Supplementary Methods). The filtering left 80 variants between lsy-12(ot177) genomic DNA and the published N2 wild-type reference genome in the 4-Mb interval, into which lsy-12(ot177) mapped. We ranked these 80 variants according to standard quality scores (Supplementary Table 1 online). Fifty-four of the 80 variants were single-nucleotide variants and 26 were small, mainly 1-nt insertions-deletions (indels; Fig. 1 and Supplementary Table 1). None of the indels mapped to exons or splice sites of predicted genes, and 21 of the 54 single-nucleotide variants affected exons of protein-coding genes. Five of the 21 exonic variants were silent variants. The remaining 16 variants (15 missense and 1 nonsense) were the best candidates for the lsy-12 mutation as more than 90% of EMS-induced mutant alleles are generally point mutations that introduce changes in amino acids or splice junctions (Supplementary Table 2 online).

Figure 1
Variants found in whole-genome sequence analysis. ot177 is shorthand for lsy-12(ot177). The asterisk denotes 6 variants clustered within 100 bp of a single intron (Supplementary Table 1a), which upon amplification and Sanger-sequencing we found to map ...

To determine whether the detected variations were (i) sequencing or mapping errors in the Illumina Genome Analyzer pipeline, (ii) sequence variations in the original transgenic strain mutagenized, a strain containing the transgene otIs114 (ref. 5) or (iii) true mutagenesis-induced mutations, we PCR-amplified ~300-bp fragments that contained each of the 80 variants from both the starting strain used for mutagenesis (strain containing the transgene otIs114), and from the mutant strain (lsy-12(ot177)), which also contains the unlinked otIs114 transgene. Each of these strains had been outcrossed against N2 wild-type multiple times. Resequencing of the amplicons derived from the lsy-12(ot177) mutant strain by the traditional Sanger method confirmed the presence of all of the 26 indels. Notably, all of the indels were already present in the transgenic starting strain containing the transgene otIs114. We then Sanger-sequenced the genomic regions containing these indels in our available N2 wild-type isolate and found each of the indels to be present in N2 as well. We therefore unintentionally uncovered a large amount of sequence variation between the N2 reference genome and the N2 wild-type strain distributed by the Caenorhabditis Genetics Center at the University of Minnesota, which could be due to genetic drift or to errors in sequencing the reference genome. These findings are consistent with a previous study6.

We confirmed 17 of the 33 non-exonic variants by Sanger-sequencing of lsy-12(ot177) (Fig. 1). The remaining 16 unconfirmed ‘variants’ (10 were apparent sequencing errors and 6 were clustered within a single intron that we could not reliably identify in the genome) were supported by fewer variant reads than wild-type reads, thereby suggesting another rule by which to filter variants (Supplementary Table 1a). Of the 17 confirmed variants, 8 were present in both the starting strain containing the transgene otIs114 and in our N2 wild-type strain, again underscoring the sequence differences between the reference genome and our wild-type strain (Fig. 1 and Table 1).

Table 1
Frequency of sequence variation types

We confirmed 15 out of the 21 exonic variants, and specifically 11 of the 16 exonic variants that altered an amino acid, by Sanger resequencing (Supplementary Table 1a). The remaining erroneous variants again had lower quality scores and in each case, we observed more wild-type than variant reads (Fig. 1 and Supplementary Table 1a). Of the 11 confirmed exonic variants that alter an amino acid, 7 were already present in the starting strain containing the transgene otIs114 and most of these again were present in our N2 wild-type isolate (Fig. 1). This left 4 amino acid-changing variants between the lsy-12(ot177) mutant and the N2 wild-type genome in the mapped 4-Mb interval (Fig. 2). In sum, we discounted the majority of initial variations in the mapped 4-Mb interval (80 variants) as they did not affect protein-coding regions (64 of original 80 variants, 80%) and/or were sequencing errors and/or are variations between strain backgrounds, leaving only a total of four exonic variants that we predicted to alter a protein product (Fig. 1 and Table 1).

Figure 2
Physical location of the lsy-12 locus and its lesions. lsy-12(ot177) was mapped between two SNP markers, pkP5112 and pkP5064 (http://www.wormbase.org/), which define a 4-Mb interval. Validated exonic variants in this interval are marked with arrowheads. ...

One of these four exonic variants is a nonsense mutation in the predicted R07B5.9 gene, the sole nonsense mutation in the entire dataset. We sequenced this predicted gene in the other five available strains that harbor mutant alleles of lsy-12, as determined by complementation testing and mapping (Supplementary Methods and Supplementary Table 3 online). We found that each one of them harbors a mutation in R07B5.9 (Fig. 2). None of the lsy-12 strains displayed variations in the other three candidate genes revealed by genome sequencing. Moreover, the lsy-12 mutant phenotype was rescued by injecting a ~39 kb genomic interval that contains R07B5.9 but no other candidate gene suggested by the whole-genome analysis (Fig. 2 and Supplementary Table 3). Lastly, we performed RNA interference (RNAi) of all four genes with exonic variants and found that only RNAi of R07B5.9 phenocopies the lsy-12 mutant phenotype (Supplementary Table 3). We concluded that lsy-12 is R07B5.9. Whole-genome sequencing has therefore revealed the identity of a previously unknown mutant gene.

The ability to perform additional experiments to distinguish the true phenotype-causing mutation from the set of sequence variants identified by whole-genome sequencing will dictate the amount of mapping one needs to perform before using whole-genome sequencing. From our identification of four protein-changing variants in a 4-Mb interval, we extrapolate that an entire chromosome, such as chromosome V (20.9 Mb), may only contain ~20 protein-changing candidate variants. The availability of multiple alleles is the easiest way to sift through these candidates, as it is fast and simple to manually Sanger-sequence many candidate genes in the allelic strains. RNAi and transformation rescue represent other powerful tools to test whether sequence variants are responsible for the mutant phenotype. We conclude that minimal mapping to as little as a chromosome and perhaps even less is required before using a whole-genome sequencing strategy.
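The extrapolation in this paragraph is a simple proportional scaling. A minimal sketch, assuming that protein-changing variants are spread uniformly along the chromosome (a simplification made here for illustration; the function name is hypothetical):

```python
# Linear extrapolation of candidate counts from the mapped interval to a
# whole chromosome. Uniform variant density is an illustrative assumption.

def extrapolate_candidates(found: int, interval_mb: float, target_mb: float) -> float:
    """Scale a candidate count by the ratio of target length to sampled length."""
    return found * target_mb / interval_mb

# Numbers from the text: 4 protein-changing variants in 4 Mb; chromosome V is 20.9 Mb.
chr_v_estimate = extrapolate_candidates(found=4, interval_mb=4.0, target_mb=20.9)
print(f"Expected protein-changing candidates on chromosome V: about {chr_v_estimate:.0f}")
```

The scaled value comes out near 21, in line with the ~20 candidates quoted in the text.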

To facilitate the design of future studies, we statistically analyzed the sequence data and found that the number of reads at a particular location (that is, the coverage), did not follow traditional formulae7 but could be approximated as a gamma-distributed random variable (see Supplementary Methods and Supplementary Table 4 online). Under the assumption of our observed 0.6% error rate, this observation predicts that 8× coverage would yield ~150 variants in a 4-Mb interval, supported by at least four variant reads (see Supplementary Methods for details on these predictions and their detection power). Given that, in our case, only 4 in 80 variants in a 4-Mb interval were non-silent variants within a coding region (Fig. 1), this would translate into only ~8 variants requiring validation. We therefore recommend aiming for eightfold coverage, which should be reached with ~0.8 Gb of aligned sequence, produced by two lanes in a Genome Analyzer flow cell.
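The two design figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text, except for the genome size, which is assumed to be ~100 Mb (inferred from the statement that the 4-Mb interval is 4% of the genome). Function names are hypothetical.

```python
# Design arithmetic for a lower-coverage experiment.
# GENOME_MB is an assumption inferred from "4 Mb is 4% of the genome".
GENOME_MB = 100.0

def expected_to_validate(total_variants: float, nonsilent: int, observed: int) -> float:
    """Scale the predicted variant count by the observed fraction of non-silent coding variants."""
    return total_variants * nonsilent / observed

def mean_coverage(aligned_gb: float, genome_mb: float = GENOME_MB) -> float:
    """Average fold coverage given aligned sequence (Gb) and genome size (Mb)."""
    return aligned_gb * 1000.0 / genome_mb

# ~150 predicted variants at 8x; 4 of the 80 observed variants were non-silent coding variants.
to_validate = expected_to_validate(total_variants=150, nonsilent=4, observed=80)
coverage = mean_coverage(aligned_gb=0.8)
print(f"variants to validate: {to_validate:.1f}, coverage: {coverage:.0f}x")
```

Under these assumptions the scaling gives 7.5 variants, which the text rounds to ~8, and 0.8 Gb of aligned sequence corresponds to eightfold coverage.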

The application of whole-genome sequencing offers many unique advantages. The sequence run only takes a few days and costs are in the few-thousand dollar range, which compare favorably to the personnel and reagent costs of a traditional multi-year gene cloning project. The implications of the substantial time savings with this approach go beyond mere cost considerations. The approach should motivate large-scale genetic screens, followed by the rapid sequencing of many mutants retrieved from such screens. This will not only lead to a more comprehensive genetic understanding of a given biological process but will offer the practical advantage of being able to sift through a collection of mutants and to focus on those genes whose molecular identities bear the most interest to the investigator. Moreover, mutants for which the phenotype is tedious to score (for example, behavioral mutants), mutants for which the phenotype is subject to modification by the genetic background of mapping strains and mutants that require specific genetic backgrounds (that is, modifier mutants), can now be more easily identified by whole-genome sequencing.

Supplementary Material


Supplementary Table 1a: List of single nucleotide changes

Supplementary Table 1b: List of insertion/deletions (indels)

Supplementary Table 2: A non-comprehensive sampling of EMS-induced mutants retrieved from genetic screens.

Supplementary Table 3: lsy-12 phenotypic data

Supplementary Table 4: Coverage analysis

Supplementary Methods

Strains and alleles used in this study

We deep-sequenced strain OH8001, which contains the ot177 mutant allele on chromosome V and the otIs114 (lim-6::gfp) transgene on chromosome I 1,2. The strain had been outcrossed 5× with N2 wild type. Even though the recessive ot177 allele was initially thought to define its own complementation group (lsy-19) 1, subsequent complementation tests, shown in Supplementary Table 3, revealed it to be an allele of the previously uncloned lsy-12 locus, which is defined by 5 separate, recessive alleles, ot89, ot154, ot170, ot185 and ot171 1. Previous complementation tests 1 were done with unmarked chromosomes and the use of heterozygous lsy-12 animals (due to male mating inefficiency of homozygous lsy-12 mutants) may have selected against transheterozygous lsy-12 cross progeny. In the present complementation tests, we used homozygous dpy-11 ot89 unc-76 animals and mated them with ot177/otIs151 transheterozygous males (otIs151 is an rfp-marked array that is linked to ot177). Rfp-negative, non-unc/dpy progeny therefore identify ot89/ot177 cross progeny, which we scored for their lsy phenotype with an otIs114 array.

Sanger-based sequencing was done on amplicons prepared from OH8001 (ot177; otIs114), from OH812 (otIs114) and from N2 wild type, which we obtained in 1999 from the CGC and had kept in the freezer through 2008. Primers were designed to yield amplicons of around 300 bp with the sequence variation in the middle. Amplicons were sequenced on both strands.

Genome Sequencing

The sequence analysis was done with an Illumina Genome Analyzer I (GA I). Seven channels of a paired-end flow cell were used. Paired-end library preparation, sequencing and base calling were done according to the manufacturer's recommendations through Illumina's FastTrack Sequencing Services Laboratory.

Genome sequence alignment

Overview: The sequence data were mapped to the wild-type N2 reference genome sequence 3, initially using ELAND, an alignment tool integrated in the Illumina Genome Analyzer (GA) analysis pipeline. Independent alignments with the 4-Mb target region were performed using the MAQ alignment tool 4. 64.4% of the ~124.9 million reads were mapped exactly to the reference genome (3.1 Gb mapped sequence). 64.6% of read pairs were mapped consistently (orientation, distance) with up to two mismatches on each mate, therefore providing 28.2× coverage on average. The genetically mapped 4-Mb interval was covered similarly to the rest of the genome (27.8×), with 96% of the bases being read at least ten times and 71% of the bases being read at least 20 times (Supplementary Table 1 and 4). 375,446 mapped reads reported a difference from the reference sequence in one site (typically) or more, suggesting an average sequencing error of less than 0.6%. As we report in the paper, much of this apparent error rate was not based on sequence errors but was due to sequence variability in different wild-type genomes.
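As a rough consistency check on the coverage figure in this overview, the following sketch recomputes average coverage from the read count, the read length and the fraction of consistently mapped read pairs. The genome size (~100 Mb) is an assumption inferred from the main text, where a 4-Mb interval is described as 4% of the genome; the constant and function names are hypothetical.

```python
# Back-of-the-envelope coverage estimate from the read statistics quoted above.
# GENOME_BP is an assumption (~100 Mb); all other values come from the text.
TOTAL_READS = 124.9e6             # ~124.9 million reads
READ_LENGTH = 35                  # 35-mer reads
CONSISTENT_PAIR_FRACTION = 0.646  # 64.6% of read pairs mapped consistently
GENOME_BP = 100e6                 # assumed genome size

def mean_coverage(reads: float, read_length: int, mapped_fraction: float, genome_bp: float) -> float:
    """Average number of times each base is read by the mapped reads."""
    return reads * read_length * mapped_fraction / genome_bp

raw_gb = TOTAL_READS * READ_LENGTH / 1e9
coverage = mean_coverage(TOTAL_READS, READ_LENGTH, CONSISTENT_PAIR_FRACTION, GENOME_BP)
print(f"raw sequence: {raw_gb:.2f} Gb, mean coverage: {coverage:.1f}x")
```

With these inputs the estimate lands close to the reported 28.2× average, which suggests that the coverage figure was derived from the consistently mapped pairs.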

Additional points:

• The Genome Analyzer output included sequenced reads of 35 bases, each base associated with a read quality (Solexa Q score). These reported qualities were converted to fastq scores for MAQ processing 4.

• The Genome Analyzer pipeline reports an alignment for each read, as positioned by the ELAND alignment algorithm. All reads that were not mapped within the 4-Mb interval of interest on LGV (ChrV:6,351,225 to ChrV:10,366,043) were filtered out. Remaining reads were re-processed using MAQ.

• Given a read, MAQ's algorithm scans the reference sequence and records the index/position of the best two alignments for that read. Only those alignments with fewer than two mismatches (between read and reference) were considered.

• Each alignment was given a score by MAQ (given in Supplementary Table 1). An alignment score is the sum of qualities of mismatched bases (between read and reference) for that alignment. Thus, the best two alignments are those with the minimum scores (i.e. minimum mismatch).

• Based on how many permissible alignments are observed with respect to the reference sequence, each base was associated with a unique mapping score. A 100% unique mapping means reads including this base were uniquely aligned here.

• The reported read quality combines the confidence in the alignment of the read with the base qualities of the read, and is used to call single-nucleotide changes.

• Our mapping data and sequence error rate are comparable to a recent Illumina-based re-sequencing of a wild-type genome 5.
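The scoring rule described in the points above (an alignment score is the sum of the qualities of mismatched bases, and the best alignments are those with the lowest scores) can be illustrated with a short sketch. This is not MAQ's implementation; the function names, the toy reads and the candidate windows are hypothetical and serve only to show the rule.

```python
from typing import Dict, List, Tuple

def alignment_score(read: str, ref_window: str, qualities: List[int]) -> int:
    """Sum of base qualities at positions where the read differs from the reference."""
    if not (len(read) == len(ref_window) == len(qualities)):
        raise ValueError("read, reference window and qualities must have equal length")
    return sum(q for r, g, q in zip(read, ref_window, qualities) if r != g)

def best_two(read: str, qualities: List[int], candidates: Dict[int, str]) -> List[Tuple[int, int]]:
    """Return (position, score) for the two lowest-scoring candidate windows."""
    scored = sorted(
        (alignment_score(read, window, qualities), pos)
        for pos, window in candidates.items()
    )
    return [(pos, score) for score, pos in scored[:2]]

# Toy example: one read, three hypothetical reference windows keyed by position.
read = "ACGT"
quals = [30, 30, 30, 20]
windows = {100: "ACGA", 250: "TCGA", 400: "ACGT"}
print(best_two(read, quals, windows))
```

In the toy example the exact match at position 400 scores 0, and the window at position 100 scores 20 because its single mismatch falls on the lowest-quality base.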

Filtering of candidate lesions

Detected variants were first filtered for near-unique mapping, discarding variant sites with >25% of spanning probes that also map elsewhere in the genome.

Surviving variants were filtered to have a read with high mapping quality (q<0.005), the strong allele being more likely than the weak allele (q<0.005), and overall consensus quality (q<0.1).

Four candidate variants were further discarded as they were discovered only on one strand, and in low coverage (≤5). All remaining candidate sites had coverage of at least 10 reads, at least 3 on each strand.
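The count-based parts of this filtering can be sketched as a simple predicate. The thresholds (at most 25% non-unique spanning reads, at least 10 reads in total, at least 3 on each strand) are taken from the text; the quality-score filters are omitted because their exact computation is not described here. The `Candidate` record and its field names are hypothetical, and treating the coverage properties of the surviving sites as explicit cut-offs is an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    position: int              # coordinate of the variant site
    nonunique_fraction: float  # fraction of spanning reads that also map elsewhere
    forward_reads: int         # reads covering the site on the forward strand
    reverse_reads: int         # reads covering the site on the reverse strand

def passes_filters(c: Candidate,
                   max_nonunique: float = 0.25,
                   min_coverage: int = 10,
                   min_per_strand: int = 3) -> bool:
    """Apply the near-unique-mapping, coverage and strand-support criteria."""
    coverage = c.forward_reads + c.reverse_reads
    return (c.nonunique_fraction <= max_nonunique
            and coverage >= min_coverage
            and min(c.forward_reads, c.reverse_reads) >= min_per_strand)

# Hypothetical candidates for illustration only.
candidates = [
    Candidate(6_400_000, 0.00, 12, 9),   # well supported on both strands
    Candidate(7_100_000, 0.40, 15, 14),  # too many reads map elsewhere
    Candidate(8_250_000, 0.05, 5, 0),    # one strand only, low coverage
]
kept = [c.position for c in candidates if passes_filters(c)]
print(kept)
```

Only the first hypothetical candidate survives; the other two fail the uniqueness and strand-support criteria respectively.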

Coverage analysis

Coverage, the number of reads that span a particular site, is a random variable. Traditional genomic sequencing modeled this variable as Poisson distributed, parameterized by the average coverage, λ. We observe that a gamma distribution with parameters α = 6.3 and β = λ/α (mean λ) approximates observed coverage well (Supplementary Table 4). For an experiment design requiring a threshold t of supporting reads to follow up a region, the cumulative gamma distribution at t − 1/2 approximates the probability of missing that variant due to an insufficient number of reads, and therefore implies the power to detect a lesion at t copies or more.

For example, in an experimental design of 8× coverage, at a threshold of 4 reads, this gives gamma (6.3,1.27) evaluated at 3.5 (=t-1/2), which equals 0.046, implying a 95% power to detect a variant. We continue to analyze this proposed design below.
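The worked example above can be reproduced numerically. The sketch below evaluates the gamma cumulative distribution with shape α = 6.3 and scale β = λ/α at t − 1/2, using a power-series expansion of the regularized lower incomplete gamma function so that only the Python standard library is needed. The function names are hypothetical; a statistics library would normally be used instead.

```python
import math

ALPHA = 6.3  # gamma shape parameter reported in the text

def gamma_cdf(x: float, shape: float, scale: float, terms: int = 200) -> float:
    """Gamma CDF via the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    for n in range(1, terms):
        term *= z / (shape + n)  # next term of sum z^n / (a (a+1) ... (a+n))
        total += term
    return total * math.exp(shape * math.log(z) - z - math.lgamma(shape))

def detection_power(mean_coverage: float, threshold: int, shape: float = ALPHA) -> float:
    """Probability that a site is covered by at least `threshold` reads."""
    miss = gamma_cdf(threshold - 0.5, shape, mean_coverage / shape)
    return 1.0 - miss

miss_probability = gamma_cdf(4 - 0.5, ALPHA, 8.0 / ALPHA)
power = detection_power(mean_coverage=8.0, threshold=4)
print(f"P(miss) = {miss_probability:.3f}, power = {power:.3f}")
```

For 8× coverage and a threshold of 4 reads, this gives a miss probability of about 0.046 and hence roughly 95% power, matching the values quoted above.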

Setting the threshold for following up a variant affects not only false negatives, but also false positives, requiring consideration of the tradeoffs between sensitivity, specificity, resources for initial sequencing, and efforts for following up putative variants. The effect of coverage on base call quality is substantial, but complicated to quantify analytically. In order to estimate the number of variants to be followed up, we make the worst case assumption that base call quality filters will be ineffective at this low coverage, and follow-up decisions will be made only based on number of reads observing a variant. We considered the 224 single nucleotide candidate variants observed at four or more reads in our 28× data before any quality filtering. We simulated resampling of 8×, by a random variable per candidate Xi ~ Binomial(ni, p) with ni being the number of reads covering candidate i and p = 8×/28× = 0.28. We computed the expectation of the overall number of candidates for which Xi ≥ 4. The expected number of single nucleotide variants thus computed is ~111. Extrapolating from the ratio of indels to single nucleotide variants in the current data, we expect 150−200 variants to be detected in such an experiment (including non-coding variants which may be of no importance; see main text).
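The resampling calculation described in this paragraph amounts to summing, over all candidates, the probability that a binomial draw retains at least four reads. A minimal sketch follows; the per-candidate read counts are placeholders, since the actual counts for the 224 candidates are in Supplementary Table 1 and are not reproduced here, and the function names are hypothetical.

```python
import math
from typing import Iterable

def prob_at_least(n: int, p: float, t: int) -> float:
    """P(X >= t) for X ~ Binomial(n, p)."""
    if n < t:
        return 0.0
    below = sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(t))
    return 1.0 - below

def expected_retained(read_counts: Iterable[int], p: float, t: int = 4) -> float:
    """Expected number of candidates still seen in at least t reads after downsampling."""
    return sum(prob_at_least(n, p, t) for n in read_counts)

# Placeholder read counts (NOT the real data); p = 0.28 as stated in the text.
example_counts = [4, 10, 28, 35]
expected = expected_retained(example_counts, p=0.28, t=4)
print(f"expected candidates retained at 8x: {expected:.2f} of {len(example_counts)}")
```

Applied to the real per-candidate read counts, the same sum would give the expected number of single nucleotide variants that remain above the four-read threshold at 8× coverage.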

References:

1. S. Sarin, M. O'Meara, E. B. Flowers et al., Genetics 176 (4), 2109 (2007).

2. S. Chang, R. J. Johnston, Jr., and O. Hobert, Genes Dev 17 (17), 2123 (2003).

3. The C. elegans Sequencing Consortium, Science 282 (5396), 2012 (1998).

4. http://maq.sourceforge.net/.

5. L. W. Hillier, G. T. Marth, A. R. Quinlan et al., Nat Methods 5 (2), 183 (2008).

ACKNOWLEDGMENTS

We thank C.T. Lawley (Illumina, Inc.) for generously producing the paired-end read data described in the manuscript, Q. Chen for performing microinjection, Hobert lab members for comments on the manuscript and the Caenorhabditis Genetics Center for providing the N2 strain. This work was funded by the Howard Hughes Medical Institute, the Muscular Dystrophy Association and the US National Institutes of Health (R01NS039996-05; R01NS050266-03 to O.H., F31 predoctoral grant NS054540-01 to S.S. and U54 CA121852-03 to I.P.).

Footnotes

Note: Supplementary information is available on the Nature Methods website.

References

1. Brenner S. Genetics. 1974;77:71–94. [PMC free article] [PubMed]
2. Sarin S, et al. Genetics. 2007;176:2109–2130. [PMC free article] [PubMed]
3. Davis MW, et al. BMC Genomics. 2005;6:118. [PMC free article] [PubMed]
4. Bentley DR. Curr. Opin. Genet. Dev. 2006;16:545–552. [PubMed]
5. Chang S, Johnston RJ, Jr., Hobert O. Genes Dev. 2003;17:2123–2137. [PMC free article] [PubMed]
6. Hillier LW, et al. Nat. Methods. 2008;5:183–188. [PubMed]
7. Lander ES, Waterman MS. Genomics. 1988;2:231–239. [PubMed]