Genome Biol. 2011 Sep 14;12(9):R84. doi: 10.1186/gb-2011-12-9-r84.
The functional spectrum of low-frequency coding variation.
Marth GT, Yu F, Indap AR, Garimella K, Gravel S, Leong WF, Tyler-Smith C, Bainbridge M, Blackwell T, Zheng-Bradley X, Chen Y, Challis D, Clarke L, Ball EV, Cibulskis K, Cooper DN, Fulton B, Hartl C, Koboldt D, Muzny D, Smith R, Sougnez C, Stewart C, Ward A, Yu J, Xue Y, Altshuler D, Bustamante CD, Clark AG, Daly M, DePristo M, Flicek P, Gabriel S, Mardis E, Palotie A, Gibbs R; 1000 Genomes Project.
Durbin RM, Burton J, Carter DM, Churcher C, Coffey A, Cox A, Palotie A, Quail M, Skelly T, Stalker J, Swerdlow HP, Turner D, Ayub Q, Balasubramaniam S, Barrett JC, Chen Y, Conrad DF, Danecek P, Hu M, Huang N, Hurles ME, Jostins L, Keane TM, Le SQ, Lindsay S, Long Q, MacArthur DG, Parts L, Tyler-Smith C, Walter K, Xue Y, Zhang Y, Coffey A, Scott C, Gabriel SB, Lander ES, Lander ES, Altshuler D, Ambrogio L, Bloom T, Cibulskis K, Fennell TJ, Gabriel SB, Jaffe DB, Shefler E, Sougnez CL, Daly MJ, DePristo MA, Ball AD, Banks E, Garimella KV, Grossman SR, Handsaker RE, Hanna M, Hartl C, Kernytsky AM, Korn JM, Li H, Maguire JR, McCarroll SA, McKenna A, Nemesh JC, Philippakis AA, Poplin RE, Rivas MA, Sabeti PC, Schaffner SF, Shlyakhter IA, DePristo MA, Wilkinson J, Altshuler D, Altshuler D, McCarroll SA, Li Y, Anderson P, Blackwell T, Chen W, Ding J, Kang HM, Sidore C, Snyder M, Zhan X, Zöllner S, Abecasis GR, Bentley DR, Gormley N, Humphray S, Kingsbury Z, Kokko-Gonzales P, Stone J, Cheetham R, Cox T, Eberle M, James T, Kahn S, Murray L, Chakravarti A, Clark AG, Degenhardt J, Collins FS, De la Vega FM, Hyland FC, Sakarya O, Sun YA, Donnelly P, McVean GA, Auton A, Iqbal Z, Lunter G, Marchini JL, Myers S, Egholm M, Flicek P, Clarke L, Cunningham F, Herrero J, Keenen S, Kulesha E, Leinonen R, McLaren WM, Radhakrishnan R, Smith RE, Zalunin V, Zheng-Bradley X, Gibbs RA, Deiros D, Metzker M, Muzny D, Reid J, Wheeler D, Bainbridge M, Challis D, Sabo A, Yu F, Yu J, Coafra C, Dinh H, Kovar C, Lee S, Nazareth L, Knoppers BM, Lehrach H, Sudbrak R, Borodina TA, Davydov AN, Marquardt P, Mertes F, Nietfeld W, Soldatov AV, Timmermann B, Tolzmann M, Albrecht MW, Amstislavskiy VS, Herwig R, Parkhomchuk DV, Mardis ER, Wilson RK, Dooling D, Fulton L, Fulton R, Weinstock G, Chen K, Chinwalla A, Ding L, Koboldt DC, McLellan MD, Wallis JW, Wendl MC, Zhang Q, Marchini JL, Moutsianas L, Myers S, Tumian A, McVean GA, Nickerson DA, Aksay G, Kidd JM, Schafer AJ, Duncanson A, Sherry ST, Agarwala R, Khouri HM, Morgulis AO, Paschall JE, Phan LD, Rotmistrovsky KE, Sanders RD, Shumway MF, Xiao C, Wang J, Jian M, Li G, Li R, Liang H, Tian G, Wang B, Wang J, Wang W, Yang H, Zhang X, Zheng H, Wang J, Fang X, Guo X, Li Y, Luo R, Tai S, Wu H, Zheng H, Zheng X, Zhou Y, Li T, Su Y, Wang J, Li R, McKernan KJ, Costa GL, Ichikawa JK, Lee CC, Fu Y, Manning JM, McLaughlin SF, Peckham HE, Tsung EF, Dahl A, Rosenstiel P, Schreiber S, Affourtit J, Ashworth D, Attiya S, Bachorski M, Buglione E, Burke A, Caprio A, Celone C, Clark S, Conners D, Desany B, Gu L, Guccione L, Kao K, Kebbel A, Knowlton J, Labrecque M, McDade L, Mealmaker C, Minderman M, Nawrocki A, Niazi F, Pareja K, Ramenani R, Riches D, Song W, Turcotte C, Wang S, Knight J, Winer R, Palotie A, De Witte A, Giles S, Marth GT, Garrison EP, Indap A, Kural D, Lee WP, Leong WF, Stewart C, Ward AN, Wu J, Huang W, Quinlan AR, Stromberg MP, Lee C, Mills RE, Shi X, Altshuler D, Browning BL, Grossman SR, Sabeti PC, Shlyakhter IA, Price A, Cooper DN, Ball EV, Mort M, Phillips AD, Stenson PD, Sebat J, Makarov V, Yoon SC, Ye K, Bustamante CD, Snyder M, Grubert F, Lam HY, Urban AE, Kaganovich M, Kidd JM, Gravel S, Stütz AM, Korbel JO, Ye K, Batzer MA, Konkel MK, Walker JA, Craig DW, Beckstrom-Sternberg SM, Christoforides A, Kurdoglu AA, Pearson JV, Sinari SA, Tembe WD, Haussler D, Hinrichs AS, Katzman SJ, Kern A, Kuhn RM, Przeworski M, Hernandez RD, Howie B, Kelley JL, Melton S, Cookson WO, Moffatt MF, Lathrop M, Liang L, Scheet P, Awadalla P, Casals F, Idaghdour Y, Keebler J, Stone EA, Zilversmit M, Xing J, Jorde L, Eichler EE, Alkan C, Hajirasouliha I, Hormozdiari F, Albers CA, Dermitzakis ET, Montgomery SB, Jin H, Gerstein MB, Abyzov A, Habegger L, Haraksingh R, Jee J, Leng J, Mu XJ, Bjornson R, Du J, Gerstein MB, Balasubramanian S, Khurana E, Zhang Z, Urban AE, Gharani N, Toji LH, Kaye JS, Kent A, McGuire AL, Ossorio PN, Rotimi CN, Brooks LD, Felsenfeld AL, McEwen JE, Clemm NC, Guyer MS, Peterson JL, Abdallah A, Juenger CR, Green ED, Cartwright RA.
Source
Department of Biology, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467, USA. gabor.marth@bc.edu
Abstract
BACKGROUND:
Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency.
RESULTS:
The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently afoot, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. According to the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants.
CONCLUSIONS:
This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.
PMID: 21917140 [PubMed - indexed for MEDLINE]
PMCID: PMC3308047
Figure 1
Variant calling procedure in the Exon Pilot Project. (a) The SNP calling procedure. Read alignment and SNP calling were carried out by Boston College (BC) and the Broad Institute (BI) independently using complementary pipelines. The call sets were intersected for the final release. (b) The INDEL calling procedure. INDELs were called on the Illumina and Roche 454 platforms. The sequence was processed on three independent pipelines, Illumina at the Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC), Illumina at BI, and Roche 454 at BCM-HGSC. The union of the three call sets formed the final call set. The Venn diagram provided is not to scale. AB: allele balance; MSA: multiple sequence alignment; QDP: discovery confidence of the variant divided by the depth of coverage; SW: software.
Genome Biol. 2011;12(9):R84-R84.
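The two merging strategies in the Figure 1 caption (intersection of the BC and BI SNP call sets, union of the three INDEL call sets) can be illustrated with a minimal sketch. It assumes each call set has been reduced to (chromosome, position, reference allele, alternative allele) records; the `Variant` type, the function names, and the toy calls are hypothetical and are not taken from the Exon Pilot pipelines.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant record used only for this sketch."""
    chrom: str
    pos: int
    ref: str
    alt: str


def intersect_call_sets(*call_sets: Iterable[Variant]) -> Set[Variant]:
    """Keep only variants reported by every pipeline (stringent)."""
    sets = [set(cs) for cs in call_sets]
    if not sets:
        return set()
    return set.intersection(*sets)


def union_call_sets(*call_sets: Iterable[Variant]) -> Set[Variant]:
    """Keep variants reported by at least one pipeline (permissive)."""
    return set().union(*call_sets)


# Toy SNP calls from two independent pipelines (made-up coordinates).
bc_snps = {Variant("1", 1000, "A", "G"), Variant("1", 2000, "C", "T")}
bi_snps = {Variant("1", 1000, "A", "G"), Variant("2", 500, "G", "A")}
final_snps = intersect_call_sets(bc_snps, bi_snps)

# Toy INDEL calls from three independent pipelines (made-up coordinates).
indels_a = {Variant("1", 3000, "AT", "A")}
indels_b = {Variant("1", 3000, "AT", "A"), Variant("3", 70, "G", "GC")}
indels_c = {Variant("4", 42, "CAG", "C")}
final_indels = union_call_sets(indels_a, indels_b, indels_c)
```

The intersection trades sensitivity for specificity, while the union does the opposite.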
Figure 3
Sensitivity measurement of Exon Pilot SNP calls. Sensitivity was estimated by comparison to variants in HapMap, version 3.2, in regions overlapping the Exon Pilot exon targets. Circles connected with solid lines show the number of SNPs in such regions in HapMap, the Exon Pilot, and the Low Coverage Pilot project, as a function of alternative allele count. Dashed lines indicate the calculated sensitivity against the HapMap 3.2 variants. Sensitivity is shown for three sets of calls: the intersection between filtered call sets from BC and BI (most stringent); the union between the BC and BI filtered call sets; and the union between the BC and BI raw, unfiltered call sets (most permissive).
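The sensitivity estimate described in this caption can be sketched as the fraction of reference-panel sites (here standing in for HapMap sites within the exon targets) that also appear in a call set, grouped by alternative allele count. This is a minimal illustration under assumed inputs; the function name, the site keys, and the toy counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set


def sensitivity_by_allele_count(
    truth: Dict[Hashable, int], calls: Set[Hashable]
) -> Dict[int, float]:
    """Fraction of truth sites recovered by `calls`, per alternative allele count.

    truth maps a site key to its alternative allele count in the reference panel.
    """
    total = defaultdict(int)
    found = defaultdict(int)
    for site, allele_count in truth.items():
        total[allele_count] += 1
        if site in calls:
            found[allele_count] += 1
    return {ac: found[ac] / total[ac] for ac in sorted(total)}


# Toy reference panel: site -> alternative allele count (made-up values).
truth_sites = {("1", 100): 1, ("1", 200): 1, ("1", 300): 2, ("1", 400): 5}

# Toy call sets: a stringent one and a more permissive one.
stringent_calls = {("1", 100), ("1", 300), ("1", 400)}
permissive_calls = stringent_calls | {("1", 200)}

stringent_sensitivity = sensitivity_by_allele_count(truth_sites, stringent_calls)
permissive_sensitivity = sensitivity_by_allele_count(truth_sites, permissive_calls)
```

In this toy input the stringent set misses one of the two singleton sites, so its sensitivity at allele count 1 is lower than that of the permissive set.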
Figure 5
The distribution of functionally characterized Exon Pilot SNPs according to minor allele frequency within all samples. (a) Annotation according to amino acid change. The distribution of the Exon Pilot coding SNPs classified according to amino acid change introduced by the alternative allele (silent, missense, and nonsense) is shown, as a function of AF. Both missense and nonsense variants are enriched in the rare allele frequency bin compared to silent variants, with highly significant P << 10^-16. The differences remain significant after correcting for the differential error rates in different bins (P << 10^-16 for missense, and P << 10^-5 for nonsense). (b) Computational prediction of functional impact. The distribution of SNPs classified according to functional impact (benign, possibly damaging, and damaging) based on computational predictions by the SIFT and PolyPhen-2 programs, as a function of allele frequency. In case of disagreement, the more severe classification was used. Silent SNPs are also shown, as neutral internal control for each bin. The damaging variants are highly enriched in the rare bin compared to the silent variants with highly significant P << 10^-16. This remains significant after correcting for the differential error rates in different bins (P << 10^-16). (a-b) Allele frequency was binned as follows: low frequency, <0.01; intermediate frequency, 0.01 to 0.1; and common, >0.1. The fraction of SNPs also called in the 1000 Genomes Low Coverage Pilot is indicated by blue shading, in each category. (c) Functional impact among variants shared with HGMD. Functional predictions using SIFT and PolyPhen-2 for the variants shared between the Exon Pilot and HGMD-DM, as a function of the disease allele frequency bin (<0.01, 0.01 to 0.1, and >0.1). Color represents predicted damage (green, benign; orange, possibly damaging; red, damaging); open sections represent variants shared between the Exon Pilot and Low Coverage Pilot, while solid sections represent variants observed only in the Exon Pilot.
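The frequency binning and the "more severe classification" rule from this caption can be sketched as follows. The sketch assumes that both predictors' outputs have already been mapped onto the three labels used in the caption, and that a frequency of exactly 0.1 falls in the intermediate bin (the caption does not say which side the boundary belongs to). Function names and the toy SNPs are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Ordering of the three impact labels used in the caption.
SEVERITY = {"benign": 0, "possibly damaging": 1, "damaging": 2}


def frequency_bin(allele_frequency: float) -> str:
    """Assign a frequency to one of the three bins from the caption."""
    if allele_frequency < 0.01:
        return "low"
    if allele_frequency <= 0.1:  # assumption: 0.1 itself counts as intermediate
        return "intermediate"
    return "common"


def consensus_impact(prediction_a: str, prediction_b: str) -> str:
    """When two predictions disagree, keep the more severe one."""
    return max(prediction_a, prediction_b, key=SEVERITY.__getitem__)


def impact_fractions_by_bin(
    snps: Iterable[Tuple[float, str, str]]
) -> Dict[str, Dict[str, float]]:
    """Per frequency bin, the fraction of SNPs in each consensus impact class."""
    counts: Dict[str, Counter] = {}
    for allele_frequency, pred_a, pred_b in snps:
        bin_name = frequency_bin(allele_frequency)
        counts.setdefault(bin_name, Counter())[consensus_impact(pred_a, pred_b)] += 1
    return {
        bin_name: {label: n / sum(c.values()) for label, n in c.items()}
        for bin_name, c in counts.items()
    }


# Toy input: (allele frequency, prediction A, prediction B); values are made up.
toy_snps = [
    (0.002, "benign", "damaging"),
    (0.005, "benign", "benign"),
    (0.05, "possibly damaging", "benign"),
    (0.3, "benign", "benign"),
]
fractions = impact_fractions_by_bin(toy_snps)
```

Comparing these per-bin fractions against the silent SNPs in the same bin is what the caption describes as the internal control; the significance test itself is not reproduced here.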
Figure 2
Coverage distribution. (a) Coverage across exon targets. Per-sample read depth of the 8,000 targets in all CEU and TSI samples. Targets were ordered by median per-sample read coverage (black). For each target, the upper and lower decile coverage value is also shown. Upper panel: samples sequenced with Illumina. Lower panel: samples sequenced with 454. (b) Cumulative distribution of base coverage at every target position in every sample. Depth of coverage is shown for all Exon Pilot capture targets, ordered according to decreasing coverage. Blue, samples sequenced by Illumina only; red, 454 only; green, all samples regardless of sequencing platform.
Figure 4
Allele frequency properties of the Exon Pilot SNP variants. (a) The allele frequency spectra (AFS) for each of the seven population panels sequenced in this study, projected to 100 chromosomes, using chimpanzee as a polarizing out-group. The expected AFS for a constant population undergoing neutral evolution, θ/x, corresponds to a straight line of slope -1 on this graph (shown here for the average value of the Watterson's θ nucleotide diversity parameter over the seven populations). Individuals with low coverage or high HapMap discordance (section 9, 'Allele sharing among populations', in Additional file 1) have not been used in this analysis. (b) Comparison of the site frequency spectra obtained from silent and missense sites in the Exon Pilot, as well as intergenic regions from the HapMap resequencing of ENCODE regions, within CEU population samples. The frequency spectra are normalized to 1, and S indicates the total number of segregating sites in each AFS. Individuals with low coverage or high HapMap discordance (section 9 in Additional file 1) have not been used in this analysis. (c) Allele frequency spectrum considering all 697 Exon Pilot samples. The inset shows the AFS at low alternative allele counts, and the fraction of known variant sites (defined as the fraction of SNPs from our study that were also present in dbSNP version 129).
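The neutral expectation mentioned in panel (a), θ/x, can be made concrete with a short sketch. It uses the standard Watterson estimator, θ = S / a_n with a_n = 1 + 1/2 + ... + 1/(n-1), for a sample of n chromosomes and S segregating sites. The numbers below are illustrative only, and θ is treated as a per-region quantity; none of the values come from the paper.

```python
import math
from typing import List


def harmonic_sum(n_chromosomes: int) -> float:
    """a_n = sum of 1/i for i = 1 .. n-1."""
    return sum(1.0 / i for i in range(1, n_chromosomes))


def watterson_theta(segregating_sites: int, n_chromosomes: int) -> float:
    """Watterson's estimator of theta from the number of segregating sites."""
    return segregating_sites / harmonic_sum(n_chromosomes)


def neutral_afs(theta: float, n_chromosomes: int) -> List[float]:
    """Expected number of sites with derived allele count x = 1 .. n-1 (theta / x)."""
    return [theta / x for x in range(1, n_chromosomes)]


# Illustrative values: 100 chromosomes (the projection size in the caption)
# and a made-up count of 500 segregating sites.
n = 100
s = 500
theta = watterson_theta(s, n)
expected = neutral_afs(theta, n)

# On log-log axes, theta / x is a straight line with slope -1.
slope = (math.log(expected[9]) - math.log(expected[0])) / (math.log(10) - math.log(1))
```

By construction the expected counts sum back to the number of segregating sites, which is a quick sanity check on the estimator.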
Figure 6
Allele sharing among populations in the Exon Pilot versus ENCODE intergenic SNPs. The probability that two minor alleles, sampled at random without replacement among all minor alleles, come from the same population, different populations on the same continent, or different continents, displayed according to minor allele frequency bin (<0.01, 0.01 to 0.1, and 0.1 to 0.5). For comparison, we also show the expected level of sharing in a panmictic population, which is independent of AF. The ENCODE and the Exon Pilot data have different sample sizes for each population panel, which could impact sharing probabilities. We therefore calculated the expected sharing based on subsets of equal size, corresponding to 90% of the smallest sample size for each population (section 9, 'Allele sharing among populations', in Additional file 1). To reduce possible biases due to reduced sensitivity in rare variants, only high-coverage sites were used, and individuals with overall low coverage or poor agreement with ENCODE genotypes were discarded. Error bars indicate the 95% confidence interval based on bootstrapping at individual variant sites.
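The sharing probabilities in this caption can be sketched for a single variant site. Given the minor allele count in each population, the chance that two minor alleles drawn without replacement come from the same population, from different populations on the same continent, or from different continents follows from counting ordered pairs. This is a per-site illustration only; how the study pools sites within a frequency bin, subsamples to equal sizes, and bootstraps is not reproduced. The function name and the toy counts are hypothetical.

```python
from typing import Dict


def sharing_probabilities(
    minor_counts: Dict[str, int], continent_of: Dict[str, str]
) -> Dict[str, float]:
    """Probabilities for two minor alleles drawn without replacement at one site."""
    total = sum(minor_counts.values())
    if total < 2:
        raise ValueError("need at least two minor alleles to draw a pair")
    ordered_pairs = total * (total - 1)

    # Both alleles from the same population.
    same_population = sum(c * (c - 1) for c in minor_counts.values())

    # Alleles from two different populations on the same continent.
    same_continent = 0
    for pop_a, count_a in minor_counts.items():
        for pop_b, count_b in minor_counts.items():
            if pop_a != pop_b and continent_of[pop_a] == continent_of[pop_b]:
                same_continent += count_a * count_b

    different_continents = ordered_pairs - same_population - same_continent
    return {
        "same_population": same_population / ordered_pairs,
        "same_continent": same_continent / ordered_pairs,
        "different_continents": different_continents / ordered_pairs,
    }


# Toy site: minor allele counts per population panel (made-up values).
toy_counts = {"CEU": 2, "TSI": 1, "YRI": 1}
toy_continents = {"CEU": "Europe", "TSI": "Europe", "YRI": "Africa"}
probabilities = sharing_probabilities(toy_counts, toy_continents)
```

For the toy site, 2 of the 12 ordered pairs fall within one population, 4 span two populations on the same continent, and 6 span continents.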