Am J Hum Genet. Apr 8, 2011; 88(4): 458–468.
PMCID: PMC3071924

A Genome-wide Comparison of the Functional Properties of Rare and Common Genetic Variants in Humans

Abstract

One of the longest running debates in evolutionary biology concerns the kind of genetic variation that is primarily responsible for phenotypic variation in species. Here, we address this question for humans specifically from the perspective of population allele frequency of variants across the complete genome, including both coding and noncoding regions. We establish simple criteria to assess the likelihood that variants are functional based on their genomic locations and then use whole-genome sequence data from 29 subjects of European origin to assess the relationship between the functional properties of variants and their population allele frequencies. We find that for all criteria used to assess the likelihood that a variant is functional, the rarer variants are significantly more likely to be functional than the more common variants. Strikingly, these patterns disappear when we focus on only those variants in which the major alleles are derived. These analyses indicate that the majority of the genetic variation in terms of phenotypic consequence may result from a mutation-selection balance, as opposed to balancing selection, and have direct relevance to the study of human disease.

Introduction

Although hundreds of associations between genetic variants and human traits have been identified through genome-wide association studies (GWAS), these disease-associated common variants collectively make only a small contribution to the known genetic risk of most diseases.1 This observation has led many to question the generality of the common disease-common variant hypothesis and has contributed to growing interest in evaluating the roles of rare genetic variants in common diseases.2,3 Here, we attempt to address this question by assessing the likelihood that variants are functional along the allele frequency spectrum. In order to appropriately address the question, it is necessary to have an unbiased and complete collection of variants in human genomes, as well as an extensive evaluation of diverse functional categories. We use the genomic location of variants as a surrogate for functionality. Although the connection between genomic location of variants and phenotypic effects is generally poorly known, this approach has the advantage of allowing comprehensive classification of variants for all known functional regions of the genome.

Although our knowledge of which parts of the human genome are functional, and in what ways, is far from complete, important functional regions of the genome have been clearly identified. In addition to the very well validated protein-coding regions, significant progress has been made in identifying genome regions that are important in controlling gene expression.4,5 Here, we couple the information on the genomic distribution of functional sequence with whole-genome sequence data to systematically assess the relationship between population allele frequency and the tendency of polymorphic sites to be found in genomic regions annotated as functional, including both regulatory and coding. We find a clear pattern in which the less common variants are much more likely to have functional consequence than the more common variants. Our finding extends the observations from previous analyses6–19 to more comprehensive functional categories across the whole genome. Surprisingly, when we focus on variants in which the derived alleles have become common in humans, the relationship between allele frequency and functional categorization is either reduced or absent. This observation supports the supposition that purifying selection (as opposed to any form of positive or balancing selection) is the main force shaping the distribution of functional variation in human populations.

Material and Methods

Study Subjects

All samples collected at Duke were collected under local institutional review board (IRB) approval with approved informed consent forms. In addition, these samples had a corresponding approved consent form allowing for the use of samples as controls. All samples received from outside institutions were received in a de-identified state. All de-identified samples were received under a Duke IRB exemption and were therefore classified as nonhuman subjects.

Whole-Genome Sequencing

The genomic DNA of each individual was sequenced with the Illumina Genome Analyzer II. Sequence reads were then aligned to the reference genome (NCBI build 36, release 50) with the Burrows-Wheeler Aligner.20 Once all the reads were aligned to the reference, SAMtools21 was used to report the genotype at each genomic position and identify single nucleotide variants (SNVs) in each individual. SNVs on the X and Y chromosomes were excluded. We checked the quality of the SNVs and kept only high-quality ones satisfying three criteria: consensus quality no less than 20, SNP quality no less than 20, and no fewer than three reads supporting the variant allele.

Genotype

We collected the genotype information of all 29 individuals at genomic positions where at least one SNV existed. We checked the read depth of each position in individuals where no SNV was identified. If the read depth was less than eight, we considered the genotype missing at that position for the particular individual; otherwise, the reference genotype was assumed. If three or more individuals had missing genotypes at the same genomic position, we eliminated all SNVs at this position from our analysis. We also removed SNVs at positions where more than two alleles were observed.
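The genotype rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the function names `assign_genotype` and `keep_site` are hypothetical, genotypes are represented as tuples of two alleles, and `None` stands for a missing genotype.

```python
MIN_DEPTH = 8            # below this depth, a non-variant call is treated as missing
MAX_MISSING_ALLOWED = 2  # three or more missing genotypes removes the position


def assign_genotype(called, depth, ref):
    """Return the genotype of one individual at one position.

    called: tuple of two alleles if an SNV was called in this individual, else None
    depth:  read depth at the position in this individual
    ref:    reference allele at the position
    """
    if called is not None:
        return called
    if depth < MIN_DEPTH:
        return None          # too little coverage to assume the reference genotype
    return (ref, ref)        # adequate coverage and no SNV: assume reference


def keep_site(genotypes):
    """Apply the per-position filters: missingness and number of alleles."""
    missing = sum(g is None for g in genotypes)
    if missing > MAX_MISSING_ALLOWED:
        return False
    alleles = {a for g in genotypes if g is not None for a in g}
    return len(alleles) <= 2  # drop positions with more than two alleles
```

For example, `assign_genotype(None, 7, "A")` yields a missing genotype, whereas the same call with a depth of 8 yields `("A", "A")`.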

Vulnerable Genomic Regions

Some genomic regions are more vulnerable to misalignment and incorrect SNV calling, most likely due to high sequence similarity shared by multiple loci, such as repeat regions,22 copy-number variations (CNVs),23,24 segmental duplications,25 regions close to assembly gaps, and regions aligned to deletions annotated in the National Center for Biotechnology Information (NCBI) human reference genome assembly compared to the Celera assembly.26 Repeat regions annotated by RepeatMasker version 3.2.7 and Tandem Repeats Finder,22 as well as the positions of assembly gaps in the reference genome, were downloaded from the University of California Santa Cruz (UCSC) Genome Browser. We considered 1 kb around assembly gaps as vulnerable regions. We used CNVs identified from the 29 genomes using our in-house software Estimation by Read Depth with SNVs24 and also CNVs of the Utah residents with Northern and Western European ancestry from the CEPH collection (CEU) population from the Copy Number Variation Project.23 CNVs detected in CEU individuals with Whole Genome TilePath (WGTP) arrays corresponding to assembly NCBI36 were directly downloaded from the project website. CNVs of CEU individuals detected with Affymetrix GeneChip Human Mapping 500K early access arrays (500K EA) were downloaded from the supplementary information in Redon et al.23 These CNVs corresponded to build 35 of the NCBI reference assembly. They were converted to assembly NCBI36 by using LiftOver. The coordinates of segmental duplications were downloaded from the Segmental Duplication Database. Previously, Khaja et al.26 identified many fragments present in the Celera assembly but missing in the NCBI assembly. Such fragments can be realigned to regions of NCBI36 with BLAT.26 We considered the aligned regions at least 36 bp long and with greater than 94% identity vulnerable to alignment error. In order to keep our SNVs as immune to artifacts as possible, we eliminated all SNVs inside the above regions.
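Excluding SNVs that fall inside a set of annotated regions is an interval lookup problem. The sketch below shows one way to do it for a single chromosome; it is not the software used in the study. It assumes 0-based half-open intervals (the BED convention) and reads "1 kb around assembly gaps" as 1 kb of padding on each side, both of which are assumptions. All function names are hypothetical.

```python
from bisect import bisect_right


def pad(intervals, flank):
    """Extend each (start, end) interval by `flank` bases on both sides."""
    return [(max(0, start - flank), end + flank) for start, end in intervals]


def merge_intervals(intervals):
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def build_index(intervals):
    """Return parallel lists of starts and ends for binary search."""
    merged = merge_intervals(intervals)
    return [s for s, _ in merged], [e for _, e in merged]


def is_vulnerable(pos, starts, ends):
    """True if `pos` lies inside any merged interval."""
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]


# Hypothetical coordinates: one assembly gap padded by 1 kb, plus two other regions.
gaps = pad([(1000, 2000)], 1000)
starts, ends = build_index(gaps + [(2500, 4000), (9000, 9500)])
snv_positions = [500, 3999, 4000, 9100]
kept = [p for p in snv_positions if not is_vulnerable(p, starts, ends)]
```

Merging first keeps each lookup logarithmic in the number of regions, which matters when repeat annotations number in the millions.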

Minor Allele Frequency

Allele frequency was calculated as the occurrence of a specific allele relative to the total number of chromosome copies, which was twice the number of individuals without missing genotype. For each SNV, if the reference allele frequency was lower, the corresponding minor allele frequency (MAF) was the frequency of the reference allele; otherwise, MAF was the frequency of the variant allele. SNVs with MAF equal to zero were removed because in that case the genotype (which differs from the reference genotype) was identical across all individuals.
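As a minimal sketch of this calculation, assuming biallelic sites and genotypes stored as tuples of two alleles with `None` for missing (the representation and the name `minor_allele_frequency` are illustrative choices, not taken from the study):

```python
def minor_allele_frequency(genotypes, ref):
    """Return the MAF at one biallelic position, or None if nothing was genotyped.

    The denominator is twice the number of individuals with a non-missing genotype.
    """
    called = [g for g in genotypes if g is not None]
    n_chromosomes = 2 * len(called)
    if n_chromosomes == 0:
        return None
    ref_count = sum(allele == ref for g in called for allele in g)
    ref_freq = ref_count / n_chromosomes
    # At a biallelic site the variant allele frequency is 1 - ref_freq,
    # so the minor allele frequency is the smaller of the two.
    return min(ref_freq, 1.0 - ref_freq)


# One heterozygote among three genotyped individuals: 1 of 6 chromosomes.
maf = minor_allele_frequency([("A", "G"), ("A", "A"), None, ("A", "A")], ref="A")
```

A position at which every genotyped individual is homozygous for the variant allele returns 0 and would be removed under the rule above.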

Conservation Score

PhastCons and PhyloP scores for primates, placental mammals, and vertebrates were downloaded from the UCSC genome browser. Only SNVs at genomic positions that had conservation scores were used for studying the relationship between MAF and evolutionary conservation.

Functional Regions

The coordinates of gene structure units were based upon annotation of UCSC genes. If an SNV falls into a protein-coding sequence (CDS) and results in amino acid change of the corresponding protein sequence, the SNV is considered nonsynonymous. Open chromatin regions used in the analysis correspond to the peaks identified by formaldehyde-assisted isolation of regulatory elements coupled with high-throughput sequencing (FAIRE-seq) from the Open Chromatin track of the Encyclopedia of DNA Elements (ENCODE) project (restricted until July 14, 2010), which included data from eight different cell lines. Genomic regions marked by mono- and trimethylation of histone H3 lysine 4 (H3K4me1 and H3K4me3) were the peaks identified by chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) from ENCODE Histone Modifications by Broad Institute ChIP-seq track (restricted until June 30, 2010), which included data from seven and eight cell lines, respectively. In vivo transcription factor binding sites (TFBSs) were ChIP-seq peaks from ENCODE Transcription Factor Binding Sites by ChIP-seq from Yale/UC-Davis/Harvard track (restricted until July 28, 2010), including data from a total of 34 transcription factors (TFs) in varying numbers of cell lines. Conserved TFBSs were from Human/Mouse/Rat (HMR) Conserved Transcription Factor Binding Sites track and were identified by searching within human-mouse-rat alignments using the position weight matrices (PWMs) from the TRANSFAC database. Visel et al.27 identified 2614 extremely conserved noncoding elements between human and rodent using the Gumby software. We obtained the coordinates of these elements from the Gumby website.28

False Discovery Rate Estimation for SNV Calling and Variant Enrichment in Functional Regions after False Discovery Rate Correction

We obtained the sequence reads of one individual, NA12878, from the 1K Genome Project19 and made variant calls using our pipeline (see Whole-Genome Sequencing). The SNVs in NA12878 were then filtered in the same way as the SNVs from the previous 29 genomes to keep only the most confident variant calls. We separated these SNVs into 30 bins based on their MAF values calculated using all 30 genomes. In each of the 30 MAF bins, the false discovery rate (FDR) of our identified SNVs was estimated as the fraction of SNVs in NA12878 in which genotype calls from our pipeline were not identical with the genotype calls from the 1K Genome Project. We note that this procedure is expected to be conservative in that the variant calls from the 1K Genome Project, although comparing multiple calling methods, may miss some variants that are actually present. The FDR estimates corresponding to the first 29 bins were taken as close estimates of the FDR of the SNVs in the 29 MAF bins from the previous 29 genomes (FDRi, i = 1,…,29). To estimate the probability of false positive SNVs falling inside each category of functional regions (H), we calculated the fraction of SNVs identified by us but not by the 1K Genome Project in NA12878 that are within the functional regions. In each of the 29 MAF bins, we had observed the fraction of SNVs falling inside each category of functional regions (Fi) and the number of SNVs (Ni). Therefore, the actual fraction of SNVs falling inside each category of functional regions after correcting for FDR is

$$\frac{N_i F_i - N_i\,\mathrm{FDR}_i\,H}{N_i - N_i\,\mathrm{FDR}_i} = \frac{F_i - \mathrm{FDR}_i\,H}{1 - \mathrm{FDR}_i}.$$
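The correction removes the expected false positives from both the numerator (SNVs inside the functional region) and the denominator (all SNVs in the bin). A short Python sketch, with a hypothetical function name and made-up numbers used purely for illustration:

```python
def fdr_corrected_fraction(f_i, fdr_i, h):
    """Fraction of true SNVs in bin i that fall inside a functional region.

    f_i:   observed fraction of SNVs in the bin inside the functional region
    fdr_i: estimated false discovery rate of SNV calls in the bin
    h:     fraction of false positive SNVs that fall inside the functional region
    """
    if not 0.0 <= fdr_i < 1.0:
        raise ValueError("fdr_i must be in [0, 1)")
    return (f_i - fdr_i * h) / (1.0 - fdr_i)


# Illustrative values only: of 1000 called SNVs, 20 are in the region (f_i = 0.02),
# 100 calls are false (fdr_i = 0.1), and 1 false call is in the region (h = 0.01).
# That leaves 19 true SNVs in the region out of 900 true SNVs.
corrected = fdr_corrected_fraction(0.02, 0.1, 0.01)
```

When the FDR is zero the function returns the observed fraction unchanged, as expected from the formula.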

Randomization

Using a specified MAF bin cutoff, we obtained two populations of SNVs: rare SNVs (MAF bin ≤ 2) and common SNVs (MAF bin > 2). We randomly drew 25,000 SNVs without replacement from each population and kept this pair of samples only when their SNP quality distributions were similar (Kolmogorov-Smirnov test p value > 0.05). A total of 1000 such pairs of samples were obtained. The SNP quality score of each SNV was calculated as the mean of the quality scores across individuals. If the SNV was homozygous in one individual, the corresponding SNP quality score was counted twice.
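The sampling scheme can be sketched as a rejection loop: draw a pair of samples, compare their quality distributions, and keep the pair only if the test does not reject similarity. The Python below is a hedged illustration and not the code used in the study. In particular, the p value comes from a standard asymptotic approximation to the Kolmogorov distribution, which is an assumption on our part because the exact implementation is not stated. All names are hypothetical.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        gap = abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        d = max(d, gap)
    return d


def ks_pvalue(d, n, m, terms=100):
    """Approximate two-sided p value from the asymptotic Kolmogorov series."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:  # the series converges poorly here and the true value is close to 1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * total))


def draw_matched_pairs(rare_quality, common_quality, size, n_pairs,
                       alpha=0.05, seed=0, max_tries=100000):
    """Draw index pairs without replacement, keeping those with similar quality."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(max_tries):
        if len(pairs) == n_pairs:
            break
        rare_idx = rng.sample(range(len(rare_quality)), size)
        common_idx = rng.sample(range(len(common_quality)), size)
        d = ks_statistic([rare_quality[i] for i in rare_idx],
                         [common_quality[i] for i in common_idx])
        if ks_pvalue(d, size, size) > alpha:
            pairs.append((rare_idx, common_idx))
    return pairs
```

With the settings described above, the call would use `size=25000` and `n_pairs=1000`; the returned index lists can then be used to look up the sampled SNVs for the enrichment comparisons.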

Ancestral Alleles

Pairwise alignment between human (hg18) and chimpanzee (panTro2) and between human and rhesus (rheMac2) in axt format were obtained from UCSC genome browser. For a specified position in human genome, the allele at the corresponding aligned position in the chimpanzee genome or in the rhesus genome was considered the ancestral allele. We ignored any SNV in which the ancestral allele from the chimpanzee genome was different from the ancestral allele from the rhesus genome. SNVs with MAF equal to 0.5, SNVs without alignment information, and SNVs in which the minor allele is neither the reference allele nor the ancestral allele were excluded from the analysis.

SNVs from Whole-Exome Sequencing

We made use of SNVs from 168 unrelated individuals of European ancestry that have been whole-exome sequenced. At least 90% of the capture regions in each sample were sequenced with ≥ 5× coverage. The SNVs were required to pass some quality filters: a consensus quality no less than 20, an SNP quality no less than 20, no fewer than three reads supporting the variant allele, and a maximum read depth of 500. If 17 or more samples had missing genotypes at a particular position, the SNVs at this position were removed from further analysis. We also eliminated SNVs inside the genomic regions vulnerable to misalignment, SNVs with zero MAF, and SNVs in the positions where more than two alleles were observed. Because the CNVs from these 168 individuals were not available, we only utilized CNVs detected in HapMap samples.23 The reference genomes used for whole-exome sequencing data and whole-genome sequencing data were slightly different. The four alternate haplotypes (c22_H2, c5_H2, c6_COX, and c6_QBL) were excluded, whereas mitochondrial DNA, as well as the genome of human herpesvirus 4 type 1, were included in the human reference genome used for whole-exome data.

Simulation

We assumed the per-generation mutation rate was 1.8 × 10⁻⁸ and used a demography model of European population, in which the ancestral stationary population (Na = 8100) undergoes a bottleneck and exponential growth.29 In order to generate a large number of neutral SNVs, negatively selected SNVs, and positively selected SNVs, we simulated 250,000, 1,000,000, and 500,000 loci (each 4 kb long) under neutral, negative, and positive selection, respectively, using SFS_CODE.30 We then collected segregating SNVs from 29 individuals in the final population for analysis. Under negative selection, the population-scaled selection coefficient (γ < 0) was drawn from a gamma distribution, −γ ~ Γ(α,β), with shape parameter α = 1.02 and rate parameter β = 0.00125 that were estimated from nonsynonymous SNVs in a European population.15 Under positive selection, γ > 0 and γ ~ Γ(α,β), where the values of α and β were kept the same as in the negative-selection model. An additive model of genic selection was used in SFS_CODE when the fitness of each individual was calculated. From the simulations, we obtained three pools of SNVs: 2,999,475 neutral SNVs, 335,008 negatively selected SNVs, and 675,316 positively selected SNVs. Let us consider drawing a total of N SNVs where the probabilities of SNVs coming from the neutral, negative-selection, and positive-selection pools are w0, w1, and w2, respectively. On the basis of the allele frequency distribution of SNVs in each pool, we estimated the probability of SNVs in which the minor allele is rare and ancestral (p0, p1, and p2), as well as the probability of SNVs in which the minor allele is common and ancestral (q0, q1, and q2), in neutral, negative-selection, and positive-selection pools, respectively. Here, we consider SNVs in which the minor alleles occur in one or two chromosomes as rare SNVs and the remaining SNVs as common.

Therefore, when the minor allele is ancestral the probability that rare SNVs are under selection is

$$P = \frac{N\sum_{i=1}^{2} w_i p_i}{N\sum_{i=0}^{2} w_i p_i} = \frac{\sum_{i=1}^{2} w_i p_i}{\sum_{i=0}^{2} w_i p_i},$$

and the probability that common SNVs are under selection is

$$Q = \frac{N\sum_{i=1}^{2} w_i q_i}{N\sum_{i=0}^{2} w_i q_i} = \frac{\sum_{i=1}^{2} w_i q_i}{\sum_{i=0}^{2} w_i q_i}.$$

The odds ratio of rare SNVs under selection compared with common SNVs when the minor allele is ancestral is

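$$\mathrm{OR} = \frac{P/(1-P)}{Q/(1-Q)} = \frac{(w_1 p_1 + w_2 p_2)\,q_0}{(w_1 q_1 + w_2 q_2)\,p_0} = \frac{\left(\frac{w_1}{w_2} p_1 + p_2\right) q_0}{\left(\frac{w_1}{w_2} q_1 + q_2\right) p_0}.$$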

We calculated the odds ratio when the minor allele is derived in the same way.

Because we were interested in comparing the pattern of rare SNVs being more enriched for selection than common SNVs between the situation in which the minor allele is ancestral and the situation in which the minor allele is derived, we calculated a statistic, $\Delta\mathrm{OR} = (\mathrm{OR}_{\mathrm{ancestral}} - 1)/(\mathrm{OR}_{\mathrm{derived}} - 1)$, where $\mathrm{OR}_{\mathrm{ancestral}}$ and $\mathrm{OR}_{\mathrm{derived}}$ correspond to the odds ratio when the minor allele is ancestral and derived, respectively. We focused on the comparison between the simulation results and the results from nonsynonymous SNVs in our real data because most nonsynonymous SNVs are under selection, and the selection parameters we used in simulation fit nonsynonymous SNVs the best.
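The quantities defined in this section are simple to compute once the pool weights and per-pool probabilities are known. The following Python sketch evaluates P (or Q), the odds ratio, and ΔOR, and checks the direct definition of the odds ratio against its closed form. The function names and the numerical values are illustrative assumptions and are not results from the simulations.

```python
def prob_selected(w, x):
    """P or Q: probability that an SNV in a frequency class is under selection.

    w = (w0, w1, w2): mixing weights of the neutral, negative, and positive pools
    x = (x0, x1, x2): per-pool probabilities for the class (p for rare, q for common)
    """
    selected = w[1] * x[1] + w[2] * x[2]
    return selected / (w[0] * x[0] + selected)


def odds_ratio(w, p, q):
    """OR of being under selection for rare versus common SNVs, from P and Q."""
    big_p = prob_selected(w, p)
    big_q = prob_selected(w, q)
    return (big_p / (1.0 - big_p)) / (big_q / (1.0 - big_q))


def odds_ratio_closed_form(w, p, q):
    """The same OR written without w0, which cancels out."""
    return ((w[1] * p[1] + w[2] * p[2]) * q[0]) / ((w[1] * q[1] + w[2] * q[2]) * p[0])


def delta_or(or_ancestral, or_derived):
    """Delta OR = (OR_ancestral - 1) / (OR_derived - 1)."""
    return (or_ancestral - 1.0) / (or_derived - 1.0)


# Made-up inputs, chosen only to exercise the formulas.
w = (0.80, 0.15, 0.05)
p = (0.10, 0.30, 0.05)
q = (0.20, 0.10, 0.30)
or_direct = odds_ratio(w, p, q)
or_closed = odds_ratio_closed_form(w, p, q)
```

Because w0 cancels, the odds ratio depends on the pool weights only through the ratio w1/w2, which is why the simulations can be summarized by the relative proportions of negative and positive selection.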

Results

We made use of SNVs from 29 unrelated individuals of European ancestry that were whole-genome sequenced at an average coverage of 28×. In order to focus on SNVs that are as immune to artifacts as possible, we applied a series of stringent filters, eliminating SNVs inside repeat regions,22 CNVs,23,24 segmental duplications,25 1 kb regions around assembly gaps, and regions aligned to deletions annotated in human reference genome NCBI assembly compared to Celera assembly26 (see Material and Methods). These steps eliminated 5,491,245 called SNVs, leaving 3,522,186 high-quality SNVs available for analysis. The high-quality variants were then divided into 29 bins according to their MAF values defined using our own data (Table 1). The number of SNVs decreases dramatically as MAF increases, and about 27.7% of the SNVs were observed (as heterozygotes) in only one individual (Figure 1).

Figure 1. The Distribution of SNVs in MAF Bins
Table 1. MAF Range of SNVs in Each of the 29 Bins

We then utilized three categories of functional properties to compare variants at different MAF levels on the basis of evolutionary conservation, gene structure, and regulatory potential, described in turn.

Evolutionary Conservation

We considered two conservation scores, PhastCons31 and PhyloP,32 based on alignments of 44 vertebrate species. A larger PhastCons score corresponds to greater selective constraint. Positive and negative PhyloP scores measure conservation and acceleration, respectively. We compared the distribution of conservation scores at genomic sites carrying variants from each frequency bin. We found a strong negative relationship between MAF and conservation score: the genomic positions corresponding to low-frequency SNVs have larger scores and are therefore more conserved than positions corresponding to high-frequency SNVs (Figure 2 and Figures S1 and S2, available online), consistent with previous studies using much smaller sets of variants.12–14,16–18 This relationship is strong until the MAF becomes quite high (i.e., around MAF 0.207 [bin 11] and 0.276 [bin 15] for PhastCons and PhyloP, respectively; see Figure S3). We then chose different MAF cutoffs up to bin 11 (corresponding to MAF < 0.207), and compared the variants below and above the cutoff. We always found the conservation scores of the rarer SNVs to be significantly larger than the more common SNVs (one-sided Wilcoxon rank sum test p value < 10⁻²⁰ for all evaluated MAF bin cutoffs, Table 2).

Figure 2. The Cumulative Distribution Plot of Conservation Scores Corresponding to SNVs in Each MAF Bin
Table 2. Comparison of Conservation Scores in Genomic Positions Corresponding to Rare and Common SNVs

Gene Structure

We next defined functional regions on the basis of gene structure using gene annotations from the UCSC genome browser33 (see Material and Methods) and considered the following defined regions: all genes, protein-coding genes, exons, introns, CDSs, and 5′ and 3′ untranslated regions (UTRs). We found that as MAF decreases, the enrichment of SNVs inside these functional regions is consistently elevated (Figure 3A), with the single exception of 5′ UTRs, for which the pattern with MAF is not significant. Interestingly, this relationship becomes stronger the closer the analysis moves toward variants that affect protein sequence, and exons and CDSs show sharper differences (Table 3, odds ratios = 1.353 and 1.584 and one-sided Fisher's exact test p value < 10⁻²⁰ when comparing rare SNVs [MAF bin ≤ 2, which corresponds to MAF < 0.052] versus common SNVs [MAF bin > 2]). Consistent with this, the separation between low-frequency and high-frequency SNVs is the highest when analyzing nonsynonymous variants (odds ratio = 2.041, and one-sided Fisher's exact test p value < 10⁻²⁰ when comparing rare [MAF bin ≤ 2] and common SNVs [MAF bin > 2]), which agrees well with previous findings.6–11,15,19 For all the MAF bin cutoffs we examined, rare SNVs were always significantly more enriched in these functional regions than common SNVs (one-sided Fisher's exact p values < 10⁻²⁰), and the maximum odds ratios were achieved when SNVs from the first two MAF bins (corresponding to MAF < 0.052) were compared with the remaining higher-frequency SNVs (Figure S4).
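The enrichment comparison reduces to a 2 × 2 table: rare or common SNVs, inside or outside a functional region. The Python sketch below computes the sample odds ratio and a one-sided Fisher's exact p value from the hypergeometric tail. It is an illustration with hypothetical function names and toy counts, not the analysis code; the tail is summed term by term, which is slow for genome-scale counts but makes the definition explicit.

```python
import math


def odds_ratio_2x2(a, b, c, d):
    """Sample odds ratio for the table [[a, b], [c, d]].

    a: rare SNVs inside the region      b: rare SNVs outside
    c: common SNVs inside the region    d: common SNVs outside
    """
    return (a * d) / (b * c)


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_greater(a, b, c, d):
    """One-sided p value: probability of a count >= a in the top-left cell,
    given fixed row and column totals (hypergeometric distribution)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    log_denominator = _log_comb(n, col1)
    p = sum(
        math.exp(_log_comb(row1, x) + _log_comb(n - row1, col1 - x) - log_denominator)
        for x in range(a, min(row1, col1) + 1)
    )
    return min(p, 1.0)


# Toy counts for illustration only.
toy_or = odds_ratio_2x2(3, 1, 1, 3)
toy_p = fisher_exact_greater(3, 1, 1, 3)
```

Working in log space with `math.lgamma` avoids overflow in the binomial coefficients when the counts are large.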

Figure 3. The Enrichment of SNVs from Each MAF Bin in Functional Regions
Table 3. Comparison of Rare SNVs and Common SNVs for their Enrichment in Each Functional Region

Regions with Regulatory Potential

Although variants in coding regions have been well studied, the relationship between allele frequency and regions with apparent regulatory functions (not simply based on conservation) has not been investigated before. To consider the role of genomic regions important in gene regulation, we made use of data from the ENCODE project5,34 to define open chromatin regions, regions marked by H3K4me3 and H3K4me1 (the chromatin signatures of promoters and enhancers35,36), and in vivo TFBSs (see Material and Methods). Currently, the in vivo binding sites of only 34 TFs mapped by the ENCODE project are available, and these TFBSs are usually much longer (648.9 bp on average) than actual TFBSs should be (usually 4–8 bp). To compensate for such limitations, we examined the computationally predicted TFBSs that are conserved across human, mouse, and rat by using the PWMs of 258 TFs from the TRANSFAC database (see Material and Methods). Similar to our observations in the gene structure analysis, we found that low-frequency SNVs are more likely to fall into these regulatory regions than high-frequency SNVs (Figure 3B and Table 3). The pattern becomes stronger when we analyzed SNVs inside the conserved TFBSs (odds ratio = 1.335, and one-sided Fisher's exact p value < 10⁻²⁰ when comparing rare SNVs [MAF bin ≤ 2] and common SNVs [MAF bin > 2]). We observed the strongest separation between low-frequency SNVs and high-frequency SNVs in noncoding elements extremely conserved between human and rodent,12 which cover about 0.05% of the human genome and are known to be enriched for developmental enhancers.27,37 The degree of separation is comparable to nonsynonymous SNVs (odds ratio = 1.996, and one-sided Fisher's exact p value < 10⁻²⁰ when comparing rare SNVs [MAF bin ≤ 2] and common SNVs [MAF bin > 2]). This suggests that among those regulatory categories considered, the regions potentially playing important roles during development may be the most constrained. 
Through all frequency bin cutoffs considered here, the rarer SNVs are significantly more enriched inside regulatory regions than common SNVs. As before, the difference is maximized in most cases when the first two bins are compared with high-frequency SNVs (Figure S4).

Importantly, we observed that higher-frequency variants are progressively less likely to be found in most studied functional genomic regions than the variants in the next low-frequency bin until around MAF bin 5 (0.086 ≤ MAF < 0.103), at which point the frequency bins become more similar to one another (Figure 3). For variants beyond MAF bin 5, the tendency to be located in functional genomic regions stabilizes and does not change much with further increases in MAF. This argues that above a frequency threshold of around 8% to 10%, variants are similar and much less likely to be functional than the rare variants for all functional categories.

To verify that the patterns we observed were not simply due to increased FDR at low MAF levels (Figure S5), we performed two control analyses. In the first analysis, we randomly drew 25,000 rare SNVs (MAF bin ≤ 2) and 25,000 common SNVs (MAF bin > 2). We required the SNP quality score distribution of the randomly drawn rare and common SNVs to be similar and repeated the sampling procedure 1000 times (see Material and Methods). Consistent with what we observed above, rare SNVs, with the exception of those in 5′ UTRs (empirical p = 0.076), are more enriched in conserved positions and in all functional regions than common SNVs in randomly drawn samples (empirical p ≤ 0.03) (Tables 2 and 3). In the second analysis, we recalculated the fraction of SNVs falling inside functional regions in each MAF bin after taking FDR into account (see Material and Methods), and we observed the same pattern (Figure S6).

Variants in which the Minor Allele Is Ancestral

Although the majority of SNVs follow the expected pattern in which the minor allele is the derived allele, 19.6% of all targeted SNVs have the reverse pattern in which the minor allele is ancestral (n = 548,347). Because these variants should provide us enhanced resolution of what kinds of selection are at work in the human genome, we decided to carry out analyses analogous to those above but focused only on the variants in which the minor allele is ancestral (and the derived allele is common). For the variants that have been positively selected in the population, we expected them to preferentially concentrate in the variants in which the derived allele is more common. We note that the human leukocyte antigen (HLA) region, in which positive or balancing selection is thought to be of particular importance, is largely ignored in our analysis due to poor sequencing quality. Moreover, for all variants that are either neutral or deleterious, we expect those in which the derived variant is more common to be enriched for neutral or near neutral variants. When we analyzed this special class of variants, we expected that if there were detectable positive selection then we would see a pattern similar to that described above in which the enrichment in functional regions increases as MAF decreases (the frequency of derived allele increases). However, we found the general pattern had dramatically diminished. For most studied functional regions, the pattern had completely evaporated as the odds ratios of being functional between rare and common SNVs become quite close to one (Figure 4A and Table 4). Our results suggest little difference in the tendency of being functional between rare and common SNVs when the minor alleles are ancestral. 
On the other hand, focusing on only variants in which the minor allele is derived (the typical pattern), we observed a stronger negative relationship between MAF and the tendency of being functional than was described above when both types of variants were considered together (Figure 4B and Table 4). On the basis of results from Woolf's test for homogeneity of odds ratios, the comparison of rare SNVs falling inside functional regions compared with common SNVs shows a significant difference between SNVs in which the minor allele is derived and SNVs in which the minor allele is ancestral (p values ≤ 5.36 × 10⁻³ for all functional categories when rare variants were defined as MAF bin ≤ 2 [MAF < 0.052], Table 4).
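Woolf's test asks whether the odds ratios from several 2 × 2 tables (here, one table for derived minor alleles and one for ancestral minor alleles) could share a common value. A minimal Python sketch of the textbook form of the test follows. The function names and counts are illustrative assumptions, and the p value helper covers only the two-table case, where the statistic has one degree of freedom.

```python
import math


def woolf_statistic(tables):
    """Woolf's chi-square statistic for homogeneity of odds ratios.

    tables: iterable of (a, b, c, d) counts, all positive, one tuple per stratum.
    Returns (statistic, degrees_of_freedom).
    """
    log_ors, weights = [], []
    for a, b, c, d in tables:
        log_ors.append(math.log((a * d) / (b * c)))
        # Inverse of the approximate variance of the log odds ratio.
        weights.append(1.0 / (1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d))
    pooled = sum(w * l for w, l in zip(weights, log_ors)) / sum(weights)
    statistic = sum(w * (l - pooled) ** 2 for w, l in zip(weights, log_ors))
    return statistic, len(log_ors) - 1


def chi2_sf_one_df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Toy counts: odds ratio 4 in the first stratum and 1 in the second.
stat, df = woolf_statistic([(20, 10, 10, 20), (10, 10, 10, 10)])
p_value = chi2_sf_one_df(stat)
```

A small p value indicates that the rare-versus-common enrichment differs between the two classes of SNVs, which is the comparison reported in Table 4.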

Figure 4. The Enrichment of SNVs in Functional Regions when Minor Alleles Are Ancestral or Derived
Table 4. Comparison of SNVs in which the Minor Allele Is Derived with SNVs in which the Minor Allele Is Ancestral for the Enrichment of Rare SNVs Relative to Common SNVs in Each Functional Region

We therefore observed a marked difference in the properties of the variants for which the minor allele is either ancestral or derived. To investigate what could be responsible for this difference, we used a forward simulation framework to create sets of variants under positive directional selection, purifying selection, or selective neutrality. We then simply measured the proportion of variants under selection as a function of allele frequency because our categorization into functional or nonfunctional parts of the genome can be considered an imprecise estimation of which variants are under selection. As expected, we found that for variants under only purifying selection and selective neutrality, there is a greatly elevated proportion of selected variants among those with lower frequency. Critically, this difference is dramatically reduced when we focused on those variants for which the minor allele is ancestral and, therefore, results in lower ΔOR values (Figure 5, see Material and Methods). When we introduced positive selection, however, we found that the difference between the two classes of variants is greatly reduced (corresponding to higher ΔOR values). Our empirical data are therefore most consistent with a model in which the majority of variants in the human genome that are nonneutral are deleterious.38–41

Figure 5. Simulation Results Corresponding to Varying Proportions of Purifying and Positive Selection

Finally, the sample size of the whole-genome sequence data used does not allow us to assess well the behavior of variants that are rarer than 2% in the population and, in particular, whether the patterns observed here continue into rarer frequencies. To assess this, we considered whole-exome sequence data from 168 individuals with average coverage on the targeted regions of 73×. The larger sample size allows us to consider bins with frequencies that range all the way down to a frequency of 0.3%. This data set contains 124,582 high-quality SNVs, 58,486 of them in the lowest-frequency bin. We found that through all the lower-frequency bins, the proportion of nonsynonymous SNVs to synonymous SNVs is higher in the rarer frequency bin compared to the next most common one (Figure S7), consistent with a recent study of 200 human exomes.10

Discussion

Previous smaller-scale studies have clearly shown that functional variants segregate at lower frequencies in the human population than nonfunctional or neutral variants.6–19 A consistent pattern was observed in our systematic genome-wide study. In addition, this study extends our understanding of the relationship between population allele frequency and the functional properties of variants in three respects. First, it is of particular interest that regulatory regions show preferential exclusion of common variants relative to rare ones, just as protein-coding sequence does. This observation suggests that regulatory variants are generally subject to the same kinds of selection as protein-coding variants and that this selection is generally purifying as opposed to balancing. Second, we found that for most of the functional regions considered, variants in each frequency class are less likely to be functional than those in the next rarer class, up to a threshold frequency of somewhere between 8% and 10%, above which the frequency classes tend to become indistinguishable. This analysis indicates that variants achieving a frequency of more than 8% to 10% in the human population normally do so precisely because they are less likely to have any important function. Below this threshold, as far as can be discerned, the rarer an SNV is, the more likely it is to be functional. Presumably, this pattern continues all the way to de novo and private alleles. Finally and critically, when we carried out analyses specifically on variants in which the major allele is the derived form and the minor allele is ancestral, we found almost no discernible difference between rare and common variants in whether they fall in functionally important parts of the human genome, for any of the functional regions considered. This analysis shows not only that the general pattern observed for rare and common variants evaporates but also that there is no evidence that the more common form of such variants falls preferentially in functional regions. Therefore, our observation suggests that even those variants that should be most enriched for positive selection are generally becoming common through genetic drift as opposed to positive selection. In line with this observation, a recent study using resequencing data from 175 human genomes did not identify strong evidence of classic sweeps in which alleles are positively selected and spread rapidly in the population.42
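One way to make the comparison between adjacent frequency classes concrete is an odds ratio on a 2×2 table of counts (inside versus outside a functional region, rarer versus more common class). The sketch below is an illustration under stated assumptions, not the authors' procedure: the `BinCounts` record and the function name are hypothetical, and the logit (Woolf-type) confidence interval is chosen here only as a simple, standard option.

```python
import math
from typing import NamedTuple, Tuple


class BinCounts(NamedTuple):
    functional: int      # SNVs in this frequency class inside the functional region
    nonfunctional: int   # SNVs in this frequency class outside it


def odds_ratio_with_ci(
    rarer: BinCounts, commoner: BinCounts, z: float = 1.96
) -> Tuple[float, float, float]:
    """Odds ratio (rarer vs. commoner class) with a logit-based interval.

    Returns (odds_ratio, lower, upper). An odds ratio above 1 means that
    variants in the rarer class fall in the functional region more often.
    """
    a, b = rarer.functional, rarer.nonfunctional
    c, d = commoner.functional, commoner.nonfunctional
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    log_or = math.log((a * d) / (b * c))
    # Standard error of the log odds ratio for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (
        math.exp(log_or),
        math.exp(log_or - z * se),
        math.exp(log_or + z * se),
    )
```

In this framing, an interval that excludes 1 for a pair of adjacent classes corresponds to the classes being distinguishable, and intervals that include 1 correspond to the indistinguishable classes described above the 8% to 10% threshold.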

These observations have direct relevance to the common disease-common variant hypothesis. Our analyses have focused on functional genomic regions as a surrogate for the likelihood that variants have some phenotypic effect. Because variants of phenotypic consequence are likely to be preferentially located in these functional genomic regions, our analyses suggest that most of the genetic variation of phenotypic consequence is likely to be rare in the human population. Although there are a handful of examples of variants that are positively selected in certain contexts but are also risk factors for disease, for example variants in β-globin and variants in the HLA region, our results indicate that these examples are the exception rather than the rule. Most of the functional variation carried by humans is instead likely to be rare genetic variation that is at least moderately deleterious and held at low frequency by selection.43,44 These analyses therefore provide a possible explanation for the relatively limited role of common genetic variation in most human diseases identified by genome-wide association studies.1,3

In summary, our analyses indicate that the primary selective pressure influencing functional variation in the human genome is purifying selection and that the bulk of the functional variation present in the human population is present because of mutation-selection balance.45
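For readers unfamiliar with the term, the textbook deterministic approximation for mutation-selection balance is given below. The symbols are introduced here for illustration and do not appear in the article: $\mu$ is the per-generation mutation rate toward the deleterious allele, $s$ the selection coefficient against the deleterious homozygote, and $h$ the dominance coefficient, so that heterozygote fitness is $1 - hs$.

```latex
% Equilibrium frequency \hat{q} of a deleterious allele, assuming an
% infinite, randomly mating population and \mu \ll hs (illustrative only).
\hat{q} \approx \frac{\mu}{hs} \quad (h > 0),
\qquad
\hat{q} \approx \sqrt{\frac{\mu}{s}} \quad (h = 0,\ \text{fully recessive}).
```

Under this approximation, stronger selection against an allele implies a lower equilibrium frequency, which matches the qualitative pattern of functional variants being held at low frequency.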

Acknowledgments

We thank K. Pelak, E. Ruzzo, A. Need, and T. Urban for helpful discussion and comments on the manuscript. We acknowledge J. Goedert, J. Milner, D. Valle, J. Hoover-Fong, N. Sobriera, J.P. McEvoy, A. Need, J. Silver, M. Silver, R. Radtke, N. Walley, A. Husain, D. Attix, L. Cirulli, J. McEvoy, K. Linney, W. Lowe, C. Depondt, G. Cavalleri, S. Sisodiya, and N. Delanty for supplying research material and C. Gumbs, H. Onabanjo, K. Cronin, and L. Little for DNA and RNA extraction. Funding for analysis of genomes was provided by Bill and Melinda Gates Foundation (grant 157412) with additional funding from National Institute of Allergy and Infectious Diseases Center for HIV/AIDS Vaccine Immunology grant AI067854, National Institute of Neurological Disorders and Stroke grant RC2NS070344, and National Institute of Mental Health grant RC2MH089915.

Supplemental Data

Document S1. Seven Figures

Web Resources

The URLs for data presented herein are as follows:

References

1. Manolio T.A., Collins F.S., Cox N.J., Goldstein D.B., Hindorff L.A., Hunter D.J., McCarthy M.I., Ramos E.M., Cardon L.R., Chakravarti A. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753.
2. Cirulli E.T., Goldstein D.B. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat. Rev. Genet. 2010;11:415–425.
3. McClellan J., King M.C. Genetic heterogeneity in human disease. Cell. 2010;141:210–217.
4. Alexander R.P., Fang G., Rozowsky J., Snyder M., Gerstein M.B. Annotating non-coding regions of the genome. Nat. Rev. Genet. 2010;11:559–571.
5. Birney E., Stamatoyannopoulos J.A., Dutta A., Guigó R., Gingeras T.R., Margulies E.H., Weng Z., Snyder M., Dermitzakis E.T., Thurman R.E., ENCODE Project Consortium. NISC Comparative Sequencing Program. Baylor College of Medicine Human Genome Sequencing Center. Washington University Genome Sequencing Center. Broad Institute. Children's Hospital Oakland Research Institute. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816.
6. Halushka M.K., Fan J.-B., Bentley K., Hsie L., Shen N., Weder A., Cooper R., Lipshutz R., Chakravarti A. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. 1999;22:239–247.
7. Kryukov G.V., Pennacchio L.A., Sunyaev S.R. Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. Am. J. Hum. Genet. 2007;80:727–739.
8. Ng S.B., Turner E.H., Robertson P.D., Flygare S.D., Bigham A.W., Lee C., Shaffer T., Wong M., Bhattacharjee A., Eichler E.E. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272–276.
9. Gorlov I.P., Gorlova O.Y., Sunyaev S.R., Spitz M.R., Amos C.I. Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am. J. Hum. Genet. 2008;82:100–112.
10. Li Y., Vinckenbosch N., Tian G., Huerta-Sanchez E., Jiang T., Jiang H., Albrechtsen A., Andersen G., Cao H., Korneliussen T. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat. Genet. 2010;42:969–972.
11. Torgerson D.G., Boyko A.R., Hernandez R.D., Indap A., Hu X., White T.J., Sninsky J.J., Cargill M., Adams M.D., Bustamante C.D., Clark A.G. Evolutionary processes acting on candidate cis-regulatory regions in humans inferred from patterns of polymorphism and divergence. PLoS Genet. 2009;5:e1000592.
12. Katzman S., Kern A.D., Bejerano G., Fewell G., Fulton L., Wilson R.K., Salama S.R., Haussler D. Human genome ultraconserved elements are ultraselected. Science. 2007;317:915.
13. Asthana S., Roytberg M., Stamatoyannopoulos J., Sunyaev S. Analysis of sequence conservation at nucleotide resolution. PLoS Comput. Biol. 2007;3:e254.
14. Drake J.A., Bird C., Nemesh J., Thomas D.J., Newton-Cheh C., Reymond A., Excoffier L., Attar H., Antonarakis S.E., Dermitzakis E.T., Hirschhorn J.N. Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat. Genet. 2006;38:223–227.
15. Boyko A.R., Williamson S.H., Indap A.R., Degenhardt J.D., Hernandez R.D., Lohmueller K.E., Adams M.D., Schmidt S., Sninsky J.J., Sunyaev S.R. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet. 2008;4:e1000083.
16. Asthana S., Noble W.S., Kryukov G., Grant C.E., Sunyaev S., Stamatoyannopoulos J.A. Widely distributed noncoding purifying selection in the human genome. Proc. Natl. Acad. Sci. USA. 2007;104:12410–12415.
17. Goode D.L., Cooper G.M., Schmutz J., Dickson M., Gonzales E., Tsai M., Karra K., Davydov E., Batzoglou S., Myers R.M., Sidow A. Evolutionary constraint facilitates interpretation of genetic variation in resequenced human genomes. Genome Res. 2010;20:301–310.
18. Cooper G.M., Goode D.L., Ng S.B., Sidow A., Bamshad M.J., Shendure J., Nickerson D.A. Single-nucleotide evolutionary constraint scores highlight disease-causing mutations. Nat. Methods. 2010;7:250–251.
19. Durbin R.M., Abecasis G.R., Altshuler D.L., Auton A., Brooks L.D., Durbin R.M., Gibbs R.A., Hurles M.E., McVean G.A., 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073.
20. Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760.
21. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079.
22. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580.
23. Redon R., Ishikawa S., Fitch K.R., Feuk L., Perry G.H., Andrews T.D., Fiegler H., Shapero M.H., Carson A.R., Chen W. Global variation in copy number in the human genome. Nature. 2006;444:444–454.
24. Zhu M., Goldstein D. Estimation by Read Depth with SNVs (ERDS). 2010. http://www.duke.edu/~mz34/erds.htm.
25. Bailey J.A., Yavor A.M., Massa H.F., Trask B.J., Eichler E.E. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 2001;11:1005–1017.
26. Khaja R., Zhang J., MacDonald J.R., He Y., Joseph-George A.M., Wei J., Rafiq M.A., Qian C., Shago M., Pantano L. Genome assembly comparison identifies structural variants in the human genome. Nat. Genet. 2006;38:1413–1418.
27. Visel A., Prabhakar S., Akiyama J.A., Shoukry M., Lewis K.D., Holt A., Plajzer-Frick I., Afzal V., Rubin E.M., Pennacchio L.A. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet. 2008;40:158–160.
28. Gumby website. http://pga.jgi-psf.org/gumby/.
29. Kryukov G.V., Shpunt A., Stamatoyannopoulos J.A., Sunyaev S.R. Power of deep, all-exon resequencing for discovery of human trait genes. Proc. Natl. Acad. Sci. USA. 2009;106:3871–3876.
30. Hernandez R.D. A flexible forward simulator for populations subject to selection and demography. Bioinformatics. 2008;24:2786–2787.
31. Siepel A., Bejerano G., Pedersen J.S., Hinrichs A.S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050.
32. Pollard K.S., Hubisz M.J., Rosenbloom K.R., Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20:110–121.
33. Kent W.J., Sugnet C.W., Furey T.S., Roskin K.M., Pringle T.H., Zahler A.M., Haussler D. The human genome browser at UCSC. Genome Res. 2002;12:996–1006.
34. ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306:636–640.
35. Heintzman N.D., Stuart R.K., Hon G., Fu Y., Ching C.W., Hawkins R.D., Barrera L.O., Van Calcar S., Qu C., Ching K.A. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 2007;39:311–318.
36. Heintzman N.D., Hon G.C., Hawkins R.D., Kheradpour P., Stark A., Harp L.F., Ye Z., Lee L.K., Stuart R.K., Ching C.W. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature. 2009;459:108–112.
37. Bejerano G., Pheasant M., Makunin I., Stephen S., Kent W.J., Mattick J.S., Haussler D. Ultraconserved elements in the human genome. Science. 2004;304:1321–1325.
38. Lohmueller K.E., Indap A.R., Schmidt S., Boyko A.R., Hernandez R.D., Hubisz M.J., Sninsky J.J., White T.J., Sunyaev S.R., Nielsen R. Proportionally more deleterious genetic variation in European than in African populations. Nature. 2008;451:994–997.
39. Eyre-Walker A., Woolfit M., Phelps T. The distribution of fitness effects of new deleterious amino acid mutations in humans. Genetics. 2006;173:891–900.
40. Subramanian S. High proportions of deleterious polymorphisms in constrained human genes. Mol. Biol. Evol. 2011;28:49–52.
41. Williamson S.H., Hernandez R., Fledel-Alon A., Zhu L., Nielsen R., Bustamante C.D. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. USA. 2005;102:7882–7887.
42. Hernandez R.D., Kelley J.L., Elyashiv E., Melton S.C., Auton A., McVean G., Sella G., Przeworski M., 1000 Genomes Project. Classic selective sweeps were rare in recent human evolution. Science. 2011;331:920–924.
43. Pritchard J.K. Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 2001;69:124–137.
44. Eyre-Walker A. Evolution in health and medicine Sackler colloquium: Genetic architecture of a complex trait and its implications for fitness and genome-wide association studies. Proc. Natl. Acad. Sci. USA. 2010;107(Suppl 1):1752–1756.
45. Nei M., Suzuki Y., Nozawa M. The neutral theory of molecular evolution in the genomic era. Annu. Rev. Genomics Hum. Genet. 2010;11:265–289.
46. Woolf B. On estimating the relation between blood group and disease. Ann. Hum. Genet. 1955;19:251–253.

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics