Am J Hum Genet. Jun 2004; 74(6): 1111–1120.
Published online Apr 26, 2004. doi:  10.1086/421051
PMCID: PMC1182075

Genetic Signatures of Strong Recent Positive Selection at the Lactase Gene

Abstract

In most human populations, the ability to digest lactose contained in milk usually disappears in childhood, but in European-derived populations, lactase activity frequently persists into adulthood (Scrimshaw and Murray 1988). It has been suggested (Cavalli-Sforza 1973; Hollox et al. 2001; Enattah et al. 2002; Poulter et al. 2003) that a selective advantage based on additional nutrition from dairy explains these genetically determined population differences (Simoons 1970; Kretchmer 1971; Scrimshaw and Murray 1988; Enattah et al. 2002), but formal population-genetics–based evidence of selection has not yet been provided. To assess the population-genetics evidence for selection, we typed 101 single-nucleotide polymorphisms covering 3.2 Mb around the lactase gene. In northern European–derived populations, two alleles that are tightly associated with lactase persistence (Enattah et al. 2002) uniquely mark a common (~77%) haplotype that extends largely undisrupted for >1 Mb. We provide two new lines of genetic evidence that this long, common haplotype arose rapidly due to recent selection: (1) by use of the traditional FST measure and a novel test based on pexcess, we demonstrate large frequency differences among populations for the persistence-associated markers and for flanking markers throughout the haplotype, and (2) we show that the haplotype is unusually long, given its high frequency—a hallmark of recent selection. We estimate that strong selection occurred within the past 5,000–10,000 years, consistent with an advantage to lactase persistence in the setting of dairy farming; the signals of selection we observe are among the strongest yet seen for any gene in the genome.

Introduction

Genes that have experienced recent positive selection offer a window into the evolutionary forces that shaped recent human history. For example, signatures of recent selection for resistance to malaria have been demonstrated around the HbS allele in the β-globin gene HBB (MIM 141900) (Pagnier et al. 1984), the A and Med alleles in G6PD (MIM 305900) (Tishkoff et al. 2001), the *O allele of the Duffy gene FY (MIM 110700) (Hamblin et al. 2002), and a promoter variant in the CD40 ligand gene TNFSF5 (MIM 300386) (Sabeti et al. 2002). Other genes for which genetic data support a recent selective event include CKR5 (MIM 601373) (Stephens et al. 1998), HFE (MIM 235200) (Toomajian et al. 2003), ADH1B (MIM 103720) (Osier et al. 2002), and possibly CFTR (MIM 602421) (Wiuf 2001 and references therein); the particular evolutionary advantage in these cases is less clear. Many of the selected alleles also contribute to or cause disease, indicating that identification of genes under selection may have significant consequences for medical genetics. Furthermore, once such genes have been definitively identified, characterizing the signatures of selection at these genes will guide the development of tools to search for other genes under selection.

One of the genes most frequently proposed to have experienced recent positive selection is LCT (MIM 603202), which encodes the enzyme lactase-phlorizin hydrolase. The epidemiologic data in favor of selection are quite strong: the ability to use this enzyme to digest lactose during adulthood varies dramatically across worldwide populations, with particularly high rates among northern Europeans (Bayless and Rosensweig 1966; Simoons 1969; Scrimshaw and Murray 1988). Furthermore, persistence of lactase activity into adulthood is genetically determined (Simoons 1970; Kretchmer 1971; Scrimshaw and Murray 1988; Enattah et al. 2002), and the geographic distribution of lactase persistence matches the distribution of dairy farming (Simoons 1969; Kretchmer 1971; Scrimshaw and Murray 1988). Because of these features, Cavalli-Sforza (1973) and others (Simoons 1970; Flatz 1987; Hollox et al. 2001; Poulter et al. 2003) proposed that the high rate of lactase persistence in European populations is explained by positive selection resulting from increased nutrition from dairy, the only dietary source of lactose. Despite these compelling epidemiologic data, neither formal population-genetics–based evidence of selection nor an estimate of the timing and magnitude of positive selection has been provided by analyzing genetic data at the LCT locus. In addition, many non-European populations show high rates of lactase persistence, raising questions about whether a single allele arose once and is shared by all lactase-persistent individuals or whether different alleles have arisen in human history.

Recently, new tools to study selection at LCT have become available. In particular, Enattah et al. (2002) demonstrated that two polymorphisms upstream of LCT are tightly associated with lactase persistence. In that study, the persistence-associated alleles were found primarily on a single 250-kb microsatellite haplotype in the Finnish population. By use of 18 SNPs spanning 1 Mb, Swallow and colleagues also recently reported a long haplotype around these alleles (Poulter et al. 2003). However, the mere presence of a long haplotype, although consistent with selection, does not by itself constitute a signature of a selective event (Sabeti et al. 2002).

A variety of genetic signatures of positive selection have been described (reviewed in Bamshad and Wooding 2003). These include an excess of rare variants (indicating a selective sweep followed by the accumulation of new, rare mutations), large allele-frequency differences among populations (indicating differential effects of selection that cause alleles to rise dramatically in frequency in some but not all of the populations), or a common haplotype that remains intact over unusually long distances (indicating an allele that rose rapidly to high frequency before recombination could disrupt the haplotype on which the allele lies). The last two signatures are particularly appealing because they can be detected by genotyping common polymorphisms in one or more populations and may have better power for identifying recent positive selection (Sabeti et al. 2002). Large differences in allele frequencies between populations have traditionally been detected by use of the population-genetics measure FST (e.g., Akey et al. 2002), whereas demonstration that a common haplotype is unexpectedly long requires application of the recently described long-range haplotype test (Sabeti et al. 2002).

In this study, we analyze genotypes for >100 SNPs in multiple populations, and we demonstrate two striking signatures of selection at the LCT gene. First, SNPs near LCT show large differences in allele frequencies among populations, demonstrated not only with the traditional FST measure but also with a more informative metric, pexcess. In addition, we show that the long (1 Mb) haplotype carrying the persistence-associated alleles is much longer and more common than would be expected in the absence of selection. We are also able to estimate from these genetic data the time period during which selection occurred, and we show that the selective pressure at LCT was comparable to the strongest selection yet documented in the genome.

Subjects and Methods

DNA Samples

DNA samples for European American, African American, and East Asian populations were obtained from the Coriell Institute (Coriell Institute for Medical Research Web site); a complete list of these samples and geographic origins is given in table A1 (online only). The Scandinavian population, which has been described elsewhere (Altshuler et al. 2000), is a subset of 379 normal glucose-tolerant trios from Finland and Sweden, and the samples we typed represent 360 independent chromosomes. The remaining populations listed in table 1 have also been described elsewhere (Rosenberg et al. 2002). This project was approved by the appropriate local institutional review boards, and subjects gave informed consent.

Table 1
Frequencies in Different Populations of Two Alleles Associated with Lactase Persistence

Selection and Genotyping of SNPs

SNPs were selected from dbSNP (dbSNP Home Page), preferentially choosing the SNP Consortium (TSC) and BAC overlap SNPs (submitter handles: TSC, SC_JCM, and KWOK) and genotyping SNPs at a greater density closer to the LCT gene. In addition, we intentionally genotyped the two SNPs reported to be associated with LCT persistence (Enattah et al. 2002). A complete list is given in table A2 (online only). SNPs were genotyped by use of the mass-spectrometry–based MassArray platform provided by Sequenom, implemented as described elsewhere (Gabriel et al. 2002). Primers were designed by use of Spectrodesigner software (Sequenom), and sequences are available on request.

Statistical Analysis

FST was calculated as described by Akey et al. (2002), with Nei’s correction for sample size (Nei and Chesser 1983). To generate a genomewide distribution for FST and pexcess, allele frequencies at markers throughout the genome were downloaded from the SNP Consortium (TSC) Web site, by use of data from the Whitehead Institute Center for Genome Research (WICGR), Celera, Motorola, and Orchid. We excluded data from pooled samples, since the FST distribution was different for pooled data (Akey et al. 2002 and data not shown). In total, data from 28,440 markers were used to generate a genomewide FST distribution. To compare the FST at markers around LCT with the genomewide distribution, we applied the Wilcoxon rank-sum test (Rosner 1982), limiting our analysis to markers separated by at least 20 kb to minimize correlation between markers. To eliminate artifactual effects at the lower end of the FST distribution (which can be due, in part, to the correction for sample size), we treated all FST values below the population mean as ties. Applying this test to the markers around LCT yields a P value of .002. However, because we cannot fully correct for the correlation between markers, this P value may overestimate the significance of the excess markers with high FST values.
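As a rough illustration of the kind of calculation involved, the following Python sketch computes the basic heterozygosity-based estimator, FST = (HT − HS)/HT, for a single biallelic marker. It is a simplification under stated assumptions: populations are weighted equally, and the sample-size correction of Nei and Chesser (1983) used in the study is omitted, so values will differ somewhat from the published ones. The function name `fst_biallelic` is hypothetical.

```python
def fst_biallelic(freqs):
    """Uncorrected FST for one biallelic marker.

    freqs: allele frequencies of the same allele in each population.
    Assumption: equal population weights, no sample-size correction.
    """
    if not freqs:
        raise ValueError("need at least one population frequency")
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity of the pooled (total) population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # marker is monomorphic everywhere
    # Mean expected heterozygosity within populations.
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


# Illustrative input: rounded frequencies of a persistence-associated allele
# in European Americans, African Americans, and East Asians.
example_fst = fst_biallelic([0.77, 0.13, 0.0])
```

With these rounded frequencies, the uncorrected estimator gives a value of about 0.54, which is of the same order as the corrected value of 0.53 reported in the Results.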

To understand the rationale for using the pexcess statistic, consider the scenario where positive selection rapidly introduces a single haplotype at frequency h into a population. Under the model of strong selection, a particular long-range haplotype will rapidly rise from a single copy (frequency near 0) to a frequency of h in the selected population. Consider now a marker within the long-range haplotype with an allele of frequency p prior to the selective event. If there has been little opportunity for recombination, nearly all copies of the selected haplotype will carry the same allele at this marker. For the allele that lies on the selected haplotype, the allele frequency will increase to p1=p(1-h)+h after selection; for an allele that does not lie on the selected haplotype, the allele frequency will decrease to p1=p(1-h). Solving for h, h=(p1-p)/(1-p) if p1>p and h=(p-p1)/p if p > p1. This is algebraically identical to pexcess (Hastbacka et al. 1994); here, p1 is the allele frequency in the population under consideration, and p is the ancestral allele frequency, which we estimate by the average allele frequency in the populations that have not experienced selection (in this case, the East Asian and African American populations). To maximize the chance that the variant predates the selective event (essential for using pexcess to estimate h), we only calculate pexcess for polymorphisms in which the allele frequencies in all populations are between 10% and 90%. Similar results were obtained whether or not we corrected the allele frequencies in African Americans for the estimated 21% European admixture (Parra et al. 1998). Of the markers from the SNP Consortium (TSC) Web site, 13,696 have allele frequencies between 10% and 90% for all three populations, and these were used for calculating the genomewide characteristics of pexcess. For comparison, we identified 952 regions with at least 5 markers spanning 50 kb–100 kb. 
We found that none of these 952 regions contains runs of ≥5 consecutive markers that span at least 50 kb and have pexcess values above the 90th percentile; the LCT region has 16 consecutive markers spanning 800 kb with pexcess values above the 95th percentile.
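The pexcess derivation above can be written as a short Python sketch. The two branches follow the expressions given in the text, h = (p1 − p)/(1 − p) if p1 > p and h = (p − p1)/p otherwise, and the ancestral frequency is taken as the plain average across the unselected populations. The function names and the example frequencies are illustrative assumptions, not values from the study.

```python
def ancestral_frequency(unselected_freqs):
    """Estimate p as the average frequency in populations without selection."""
    return sum(unselected_freqs) / len(unselected_freqs)


def p_excess(p1, p):
    """Estimate the haplotype frequency increase h from one flanking marker.

    p1: allele frequency in the population under consideration.
    p:  ancestral allele frequency (must lie strictly between 0 and 1).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("ancestral frequency must be between 0 and 1")
    if p1 >= p:
        # Allele lies on the selected haplotype: p1 = p(1 - h) + h.
        return (p1 - p) / (1.0 - p)
    # Allele does not lie on the selected haplotype: p1 = p(1 - h).
    return (p - p1) / p


# Hypothetical check: a haplotype swept to h = 0.77.
h = 0.77
on_haplotype = p_excess(0.2 * (1 - h) + h, 0.2)   # allele carried by the haplotype
off_haplotype = p_excess(0.8 * (1 - h), 0.8)      # alternative allele
```

Both calls recover h = 0.77 (up to floating-point error), which is why markers across a swept haplotype are expected to show similar pexcess values regardless of their ancestral frequencies.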

The long-range haplotype test, the calculation of relative extended haplotype homozygosity (REHH), and the assessment of the significance of REHH by use of simulations were performed as described elsewhere (Sabeti et al. 2002). In brief, a core region was defined as a block of linkage disequilibrium with little evidence of recombination (Gabriel et al. 2002). The genotype data were converted to inferred, fully phased haplotype data, and, within the core region, each common haplotype (>5% frequency) was analyzed separately. At each marker, a chromosome was considered intact if, from the core through that marker, the chromosome was identical to all other intact chromosomes carrying the same core haplotype. For LCT, the core region was chosen to contain the persistence-associated markers. For the simulations, cores and genotypes extending outward from the cores were generated as described elsewhere (Sabeti et al. 2002). The empirical P value for the 5′ markers was .012. For the 3′ markers, 10,000 simulations generated ~25,000 core haplotypes, of which ~2,500 had a frequency similar to that of the LCT core; none of these had an REHH near that seen for LCT (empirical P < .0004). To better estimate the P value for the 3′ markers, the REHH distribution from the simulated data was log-transformed to achieve normality, and the mean, median, and SD were used to estimate P values for the actual REHH value observed in LCT. The estimation of dates was performed according to methods described elsewhere (Reich and Goldstein 1998; Stephens et al. 1998).
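To make the idea of extended haplotype homozygosity concrete, the Python sketch below uses one common formulation: EHH at a given distance is the fraction of pairs of chromosomes carrying a core haplotype that are identical from the core out to that marker, and REHH is the ratio of this value to the pooled value for the other core haplotypes. This is an assumption about the general form of the statistic, not a reproduction of the exact procedure of Sabeti et al. (2002); all function names and the toy haplotypes are hypothetical.

```python
from collections import Counter
from math import comb, inf


def identical_pairs(haplotypes, upto):
    """Count chromosome pairs identical over the first `upto` flanking markers."""
    groups = Counter(tuple(h[:upto]) for h in haplotypes)
    return sum(comb(count, 2) for count in groups.values())


def ehh(haplotypes, upto):
    """EHH for chromosomes that all carry the same core haplotype.

    Each haplotype lists alleles ordered outward from the core.
    """
    total_pairs = comb(len(haplotypes), 2)
    return identical_pairs(haplotypes, upto) / total_pairs if total_pairs else 0.0


def rehh(core, other_cores, upto):
    """Ratio of EHH on `core` to EHH pooled over all other core haplotypes."""
    same = sum(identical_pairs(h, upto) for h in other_cores)
    total = sum(comb(len(h), 2) for h in other_cores)
    background = same / total if total else 0.0
    # Assumption: report infinity when the background shows no homozygosity.
    return ehh(core, upto) / background if background else inf


# Toy data: four chromosomes on the tested core, five on two other cores.
tested = ["AAAA", "AAAA", "AAAA", "AAAT"]
others = [["CCCC", "CCGG", "CGGG"], ["TTTT", "TTTT"]]
ratio = rehh(tested, others, upto=4)
```

In this toy example, three of six pairs on the tested core remain identical (EHH = 0.5), one of four pairs on the other cores does (0.25), and the ratio is 2.0.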

For these analyses, fully phased haplotype data were required. We used two phasing programs: PHASE, a Bayesian method for phasing diploid genotype data (Stephens and Donnelly 2003; PHASE Web site), and also a similar program (wphase) that we developed for this purpose. Similar results were obtained from the two phasing algorithms. The mathematical models underlying the two programs are similar, but PHASE performs a Markov Chain–Monte Carlo procedure, whereas wphase carries out a hill climb, (approximately) maximizing the likelihood. We estimated REHH and dates at distances on either side of the core region, where approximately one recombination per chromosome had occurred on the persistence-associated haplotype (that is, ~1/e chromosomes carrying the persistence-associated haplotype remained unrecombined).
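The 1/e criterion above has a simple interpretation under a basic exponential-decay model, sketched here in Python. If a haplotype of age t generations survives unrecombined over a genetic distance r (in Morgans) with probability exp(−rt), then t = −ln(fraction intact)/r, and at the point where ~1/e of chromosomes remain intact, t is simply 1/r. This is a simplified model chosen for illustration; the cited dating methods (Reich and Goldstein 1998; Stephens et al. 1998) include refinements not shown here, and the function names and example values are hypothetical.

```python
import math


def age_in_generations(fraction_intact, distance_morgans):
    """Age of a haplotype from the fraction of chromosomes still unrecombined.

    Assumes survival probability exp(-r * t) over genetic distance r.
    """
    if not 0.0 < fraction_intact <= 1.0:
        raise ValueError("fraction_intact must be in (0, 1]")
    if distance_morgans <= 0.0:
        raise ValueError("distance_morgans must be positive")
    return -math.log(fraction_intact) / distance_morgans


def age_in_years(fraction_intact, distance_morgans, years_per_generation=25.0):
    """Convert the age to years, using 25 years/generation by default."""
    return age_in_generations(fraction_intact, distance_morgans) * years_per_generation


# Illustrative values only: 1/e of chromosomes intact at 1 cM (0.01 Morgans).
generations = age_in_generations(math.exp(-1), 0.01)
years = age_in_years(math.exp(-1), 0.01)
```

Under these illustrative inputs the model gives about 100 generations, or roughly 2,500 years at 25 years/generation.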

We estimated the coefficient of selection, s, by applying a formula (Hartl and Clark 1997) that relates the frequency in generation t+1 ($p_{t+1}$) to the frequency in generation t ($p_t$):

$$p_{t+1} = \frac{p_t^2 w_{11} + p_t q_t w_{12}}{p_t^2 w_{11} + 2 p_t q_t w_{12} + q_t^2 w_{22}}$$

In this formula, $q_t = 1 - p_t$, $w_{11}$ is the relative fitness of individuals homozygous for the selected allele, $w_{12}$ is the relative fitness of heterozygous individuals, and $w_{22}$ is the relative fitness of individuals homozygous for the unselected allele. We assumed a dominant model for lactase persistence; that is, $w_{11} = w_{12} = 1$ and $w_{22} = 1 - s$. We also assumed the initial frequency $p_0$ to be between 1/1,000 and 1/10,000 (corresponding to a new mutation in a population with an effective size between 500 and 5,000; larger population sizes yield even higher coefficients of selection). Starting from these initial frequencies, we calculated values of $w_{22}$ that would yield a frequency of p = 0.77 after 2,188–20,650 years of selective pressure for the United States population and 1,625–3,188 years for the Scandinavian population, assuming 25 years/generation.
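The calculation described above can be sketched in Python by iterating the standard single-locus selection recursion under the dominant model (relative fitnesses 1, 1, and 1 − s) and searching for the value of s that brings the allele from its initial frequency to 0.77 in a given number of generations. The bisection search and the function names are assumptions made for illustration; the study does not state how the values of s were found numerically.

```python
def next_frequency(p, w11, w12, w22):
    """One generation of selection at a biallelic locus."""
    q = 1.0 - p
    mean_fitness = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
    return (p * p * w11 + p * q * w12) / mean_fitness


def frequency_after(p0, s, generations):
    """Allele frequency after selection under the dominant model."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, 1.0, 1.0, 1.0 - s)
    return p


def solve_selection_coefficient(p0, target, generations, tol=1e-6):
    """Bisection for s, relying on frequency_after increasing with s."""
    low, high = 0.0, 0.999
    while high - low > tol:
        mid = (low + high) / 2.0
        if frequency_after(p0, mid, generations) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# 2,188 and 20,650 years at 25 years/generation, rounded to whole generations.
s_recent = solve_selection_coefficient(1 / 1000, 0.77, round(2188 / 25))
s_old = solve_selection_coefficient(1 / 1000, 0.77, round(20650 / 25))
```

As expected, the shorter time span requires the larger coefficient of selection; repeating the search with an initial frequency of 1/10,000 gives the other end of each range.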

Results

To examine the evidence for selection, we began by genotyping the two SNPs that were recently reported to be very tightly associated with lactase persistence (Enattah et al. 2002): rs4988235 (−13910C→T) and rs182549 (−22018G→A). We determined the frequencies of the persistence-associated alleles (T and A, respectively) in three populations for which many thousands of markers have been genotyped (European Americans, African Americans, and East Asians), thereby permitting comparison of our results to a genomewide background distribution (Akey et al. 2002). The persistence-associated alleles occur with a frequency of 77% in European Americans, 13% and 14% in African Americans, and 0% in East Asians (table 1), broadly consistent with the rates of lactase persistence in these populations (Scrimshaw and Murray 1988). Large differences in allele frequencies across populations, such as we observe at these markers, are suggestive of selective pressure that differed among the populations (Lewontin and Krakauer 1973; Bowcock et al. 1991; Akey et al. 2002). The unusually large magnitude of the population frequency differences for these two markers is reflected in their values of FST, a traditional measure of population differentiation—the FST values (0.53 for both markers) exceed 99.9% of the FST values from a genomewide set of >28,000 SNPs (see the “Subjects and Methods” section). We also genotyped these two associated SNPs in a more diverse set of samples (Altshuler et al. 2000; Rosenberg et al. 2002); the frequencies of the persistence-associated alleles were much lower in southern European than in northern European or Basque populations, and the persistence-associated alleles were rare or absent in almost all non-European–derived populations tested, except Algerians and Pakistanis (table 1). 
The wide range of allele frequencies among European populations is consistent with selective pressure that postdates the colonization of Europe, resulting in different prevalences of lactase-persistence alleles in northern and southern European populations.

To extend these results, we genotyped an additional 99 markers in 3.2 Mb flanking the LCT locus, again looking for high degrees of population differentiation. In response to strong positive selection, a selected allele rises rapidly in frequency. The frequency of the haplotype on which the allele occurs will increase correspondingly, because there is insufficient time for recombination to disrupt the haplotype while it becomes more common. Thus, allele frequencies at flanking markers on the haplotype will be altered. To measure this effect, we used two metrics of allele-frequency differences: the traditional FST and a newer metric, pexcess. FST has limited utility when the flanking allele on the selected haplotype was already fairly common prior to selection, because, in this case, the FST value will be quite low; thus, only a fraction of flanking markers are expected to show elevated FST values within a region of selection. Consistent with this expectation, there was an excess of high FST values among the 99 markers, but FST values varied widely from marker to marker (fig. 1a; see the “Subjects and Methods” section for additional details). The excess elevation of FST is predominantly derived from markers located in the vicinity of the LCT gene (fig. 1a), with allele frequencies that are generally different in Europeans than in the other two populations (table A2 [online only]). This elevated FST in markers flanking LCT confirms the signal of selection seen with the −13910C→T and −22018G→A variants. However, as expected, only some of the markers near LCT have elevated FST values. Accordingly, we sought an alternative measure of population differentiation that would reveal a more consistent signal in the vicinity of a selected allele.

Figure 1
Elevation in (a) FST and (b) pexcess at multiple SNPs in a 3.2-Mb region around the LCT gene. Position in kb relative to the start of transcription of LCT is on the X-axis. The 90th, 99th, and 99.9th percentiles for FST and pexcess are indicated by dashed ...

We chose to study the pexcess statistic, which has previously been used to localize disease-causing alleles in founder populations and is a measure of differences in haplotype frequencies across long distances (Hastbacka et al. 1994). pexcess is also equivalent to the measure of linkage disequilibrium, δ (Devlin and Risch 1995). If a single haplotype differs in frequency across a long region, pexcess will be elevated and relatively constant across multiple markers within that region, with values approximately equal to the increase in frequency of the haplotype (see the “Subjects and Methods” section for details). We observed a consistent, marked elevation of pexcess in the LCT region: 17 consecutive markers in a region spanning 500 kb around LCT have nearly identical, very high values of pexcess that approximate the frequency of the persistence-associated haplotype (0.77) (fig. 1b). Furthermore, the elevation in pexcess extends for at least 1,500 kb (fig. 1b; table A2 [online only]). To provide a framework for comparison, we calculated pexcess values for marker pairs and the correlation between pairs as a function of distance for >13,000 SNPs throughout the genome; we found that the correlation is normally minimal at distances of as little as 100 kb (r^2 = 0.002). Indeed, in this genomewide data set, none of 952 comparison regions had a consistent elevation in pexcess values approaching that seen around LCT (see the “Subjects and Methods” section for details). These results further mark the LCT region as very unusual when compared with the remainder of the genome, and they strongly suggest that genetic hitchhiking due to selection has occurred: that is, a selected allele rose in frequency over such a short time period that the frequencies of linked alleles on the surrounding >1 Mb haplotype were dragged up as well (Braverman et al. 1995).

In addition to the tests above, which are measures of differentiation between populations, we also employed the recently described long-range haplotype test of Sabeti et al. (2002), which detects selection by measuring the characteristics of haplotypes within a single population. A recent haplotype should be surrounded by long stretches of homozygosity, since recombination will have had few opportunities to juxtapose adjacent segments from other chromosomes with the selected haplotype. The evidence for selection is a haplotype that arose recently—as evidenced by long flanking stretches of homozygosity—but is so common that the haplotype could not have risen quickly to such high frequency without the aid of selection. We observed precisely this pattern at the haplotype containing the lactase-persistence–associated alleles −13910T and −22018A. The haplotype containing these alleles was very common (77% in European Americans) but also largely identical over nearly 1 cM (>800 kb), indicating a recent origin (red bars in fig. 2). This long stretch of homozygosity was not simply due to a low local recombination rate—the other haplotypes in this region show shorter extents of homozygosity, indicating abundant historical recombination (blue bars near the bottom of fig. 2), and the recombination rate in this region is typical of that in the genome as a whole (Kong et al. 2002).

Figure 2
Long-range extended homozygosity for the core haplotype containing the persistence-associated alleles at LCT at various distances from LCT. The extent to which the common core haplotypes remain intact is shown for each chromosome in cM. The core region ...

To formally assess the significance of these results, we focused on the REHH statistic (Sabeti et al. 2002); REHH values much greater than 1 indicate increased homozygosity of a haplotype compared with other haplotypes in the region. For the lactase-persistence–associated haplotype, REHH was 13.2 in the region 3′ to LCT, indicating much less breakdown of homozygosity at the persistence-associated haplotype than at haplotypes not carrying the persistence-associated alleles. We compared the LCT data to data from coalescent population-genetics simulations analogous to those in Sabeti et al. (2002), and the empirical P value for excess homozygosity 3′ to LCT was .0004 (fig. 3 and the “Subjects and Methods” section); other estimates of significance suggest a P value closer to 10^−7 (see the “Subjects and Methods” section). As confirmation, we compared the LCT haplotype to actual genotype data from 12 control regions spanning 500 kb each. The distribution of REHH was similar for the control regions and the simulations, and the LCT haplotype had a higher REHH than any of the matched control haplotypes. It is notable that the signal for selection is much stronger for LCT than for the well-established case of G6PD: although higher haplotype frequencies are in general associated with lower REHH values (Sabeti et al. 2002) (fig. 3), we observe a larger REHH statistic for the 77% LCT haplotype (REHH = 13.2) than for the 18% G6PD haplotype (REHH = 7) (see Sabeti et al. 2002). Although we cannot rule out the possibility that the extended homozygosity of the high-frequency LCT haplotype is due to dominant suppression of recombination over Mb distances because of an allele on this haplotype, positive selection seems to be a more biologically plausible phenomenon, especially since the haplotype has such a strikingly wide spread of frequencies across European populations. 
Furthermore, the parental core haplotype on which the persistence-associated alleles arose is present in Asian and African American populations, and it does not have an elevated REHH value (data not shown).

Figure 3
REHH, a measure of extended haplotype homozygosity, plotted for the persistence-associated haplotype at LCT, in comparison with REHH from haplotypes in 10,000 sets of simulated data (Sabeti et al. 2002). Data are shown using markers (a) 5′ and ...

We next estimated the age of the lactase-persistence–associated haplotype, on the basis of the decay of haplotypes in either direction from the LCT core region (Reich and Goldstein 1998; Stephens et al. 1998). On the basis of our analysis of European-derived U.S. pedigrees, the best estimates of the time at which the persistence-associated haplotype began to rise rapidly in frequency are between 2,188 and 20,650 years ago, consistent with the estimated origin of dairy farming in northern Europe ~9,000 years ago (Simoons 1970; Kretchmer 1971; Scrimshaw and Murray 1988). Even more recent estimates (1,625–3,188 years ago) were obtained by analyzing a Scandinavian population of parent-offspring trios, suggesting stronger and more recent selection in this population. On the basis of these ranges of ages, we estimate the coefficient of selection associated with carrying at least one copy of the lactase-persistence allele to be between 0.014 and 0.15 for the CEPH population and between 0.09 and 0.19 for the Scandinavian population (see the “Subjects and Methods” section for details). By comparison, the selective advantage in a region endemic for malaria has been estimated at 0.02–0.05 for G6PD deficiency (Tishkoff et al. 2001) and 0.05–0.18 for the sickle-cell trait (Li 1975). Thus, the added nutrition from dairy appears to have provided a selective advantage in northern Europe comparable to that provided by resistance to malaria in malaria-endemic regions.

Discussion

We have now demonstrated, on the basis of three different analytic methods (elevated FST at markers associated with lactase persistence, runs of elevated pexcess at flanking markers, and extended haplotype homozygosity), that strong positive selection occurred in a large region that includes the LCT gene. This selection occurred after the separation of European-derived populations from Asian- and African-derived populations, and it likely occurred after the colonization of Europe. The high frequency and young age of this haplotype, the high estimated coefficient of selection, and the very high REHH value all suggest that LCT represents one of the strongest signals of recent positive selection yet documented in the genome. Our results strongly support the hypothesis that the additional nutrition provided by dairy was very important for survival in the recent history of Europe and perhaps in other regions of the world as well.

Our results show that chromosomes carrying the allele associated with lactase persistence (−13910T) share a very long haplotype around this allele. We and others have noted that the presence of this long haplotype raises the possibility that a variant located somewhere in this large region, other than −13910C→T, could be the cause of lactase persistence (Grand et al. 2003; Poulter et al. 2003). Indeed, Swallow and colleagues have identified an individual who is homozygous for the nonpersistence-associated allele at −13910C→T but retains lactase activity (Poulter et al. 2003). Recently, Olds and Sibley (2003) demonstrated differential in vitro transcriptional activity between short segments of DNA carrying the C and T alleles, but the predictive value of such in vitro data for the in vivo phenotype remains uncertain. A comprehensive assessment of variation throughout this long haplotype may be required to determine if −13910C→T is truly the causal polymorphism. Of course, it is also possible that the strong signature of selection is not due to variation at LCT but rather to a coincidental selective event acting on a nearby unrelated gene. However, the striking geographic correlation of lactase persistence with dairy farming (Simoons 1969; Kretchmer 1971; Scrimshaw and Murray 1988) and the recently described evidence of selection on cattle-milk protein genes in regions of Europe with a high prevalence of lactase persistence (Beja-Pereira et al. 2003) lend strong support to the dairy hypothesis.

The −13910T allele was rare or absent in the sub-Saharan African populations we tested, indicating that the presence of the T allele in African Americans that we and Enattah et al. (2002) observed is probably explained by admixture of European-derived chromosomes into the African American population (Parra et al. 1998). Thus, our data do not provide evidence that the −13910T allele predates the differentiation of European and African populations. The absence of the T allele in African populations also suggests that either −13910C/T is not the causal allele or that lactase persistence arose multiple times, because lactase persistence is prevalent in a number of African populations (Scrimshaw and Murray 1988). Consistent with these suggestions, the study by Mulcare and colleagues (in this issue of the Journal) showed that the −13910T allele was absent from several African populations known to have high rates of lactase persistence (Mulcare et al. 2004 [in this issue]). We did not specifically survey these populations, but such surveys will help determine whether lactase persistence arose multiple times in human history or whether a single very old polymorphism rose independently to high frequencies in multiple populations, as has been suggested (Enattah et al. 2002). Finally, the T allele was present at high frequencies in Pakistan and at somewhat lower frequencies in Middle Eastern populations (table 1) and was found on the same local haplotype in these populations as in Europeans (data not shown). These data suggest that individuals carrying the lactase-persistence allele might have migrated between populations (perhaps along with dairy farming), and their descendants may be responsible for the increased allele frequencies in diverse populations in Europe and neighboring regions.

More generally, we have implemented two methods of detecting signatures of positive selection: runs of consecutive markers with elevated pexcess and the long-range haplotype test. It is important to note that these two tests identified LCT as strikingly unusual because LCT was at the far extreme of the genomewide distribution. With the availability of data for loci throughout the genome, empirical comparisons of individual loci to the genomewide distribution will distinguish other genes that are in the extreme tail of the distribution and, thus, are likely to have experienced selection. Ideally, the metrics will be compared not only to an empirical distribution but also to a simulated distribution derived from an appropriate model of recent human evolution that is consistent with empirical data. As models that incorporate more-complete descriptions of human history are developed, such simulations will become more useful.

Both of these methods should be readily applicable to genomewide SNP genotype data being generated by the haplotype map of the human genome (HapMap Project Web site). In particular, runs of markers with consistently elevated pexcess should be detectable once an adequate number of SNPs have been genotyped in multiple populations; our experience with LCT suggests that these runs of elevated pexcess may be more informative than signals from individual markers with high FST values, particularly where selection has dramatically increased the frequency of a single haplotype. The long-range haplotype test should also be useful, even in studies of a single population. Thus, it should be possible in the near future to identify many other loci that have undergone recent positive selection, leading to new insights into recent human evolution and also human disease.
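
The idea of scanning for runs of markers with elevated pexcess can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the threshold of 0.6 is arbitrary, markers without a reported pexcess are simply skipped, and the function name `runs_above` is hypothetical. The input values are the pexcess entries listed in table A2, in chromosomal order.

```python
def runs_above(values, threshold):
    """Return (start, end) index pairs (end inclusive) of maximal runs with value > threshold."""
    runs = []
    start = None
    for i, value in enumerate(values):
        if value > threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            runs.append((start, i - 1))  # the run ended at the previous marker
            start = None
    if start is not None:
        runs.append((start, len(values) - 1))  # run extends to the last marker
    return runs


# pexcess values from table A2, in chromosomal order, for the 40 SNPs that report one
PEXCESS = [
    .42, .25, .44, .20, .28, .54, .53, .63, .62, .79,
    .76, .75, .70, .76, .76, .74, .73, .63, .74, .75,
    .76, .72, .74, .73, .78, .78, .45, .37, .47, .41,
    .30, .20, .61, .55, .05, .25, .54, .51, .40, .34,
]

THRESHOLD = 0.6  # arbitrary cutoff chosen for this example only
runs = runs_above(PEXCESS, THRESHOLD)
longest = max(runs, key=lambda r: r[1] - r[0])
print(runs)                         # [(7, 25), (32, 32)]
print(longest[1] - longest[0] + 1)  # 19 consecutive markers
```

Under these assumptions the longest run covers the 19 consecutive pexcess-reporting markers from rs2290518 to rs309137, which span the LCT region.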

Acknowledgments

D.E.R. and J.N.H. are recipients of Burroughs Wellcome Career Awards in Biomedical Sciences. We thank Richard Grand, Robert Montgomery, Eric Lander, David Altshuler, Helen Lyon, and members of the Hirschhorn Lab for useful comments and discussion.

Table A1

DNA Samples from Coriell Used in This Study

Sample ID  Population         Mother ID  Father ID
NA06988    European American  NA07057    NA06990
NA06983    European American  NA07057    NA06990
NA07057    European American  NA07007    NA07340
NA07007    European American  0          0
NA07340    European American  0          0
NA06990    European American  NA07016    NA07050
NA07016    European American  0          0
NA07050    European American  0          0
NA07011    European American  NA07038    NA06987
NA07009    European American  NA07038    NA06987
NA07038    European American  NA07049    NA07002
NA07049    European American  0          0
NA07002    European American  0          0
NA06987    European American  NA07017    NA07341
NA07017    European American  0          0
NA07341    European American  0          0
NA12138    European American  NA10846    NA10847
NA12139    European American  NA10846    NA10847
NA10846    European American  NA12144    NA12145
NA12144    European American  0          0
NA12145    European American  0          0
NA10847    European American  NA12146    NA12239
NA12146    European American  0          0
NA12239    European American  0          0
NA07053    European American  NA07029    NA07019
NA07040    European American  NA07029    NA07019
NA07029    European American  NA06994    NA07000
NA06994    European American  0          0
NA07000    European American  0          0
NA07019    European American  NA07022    NA07056
NA07022    European American  0          0
NA07056    European American  0          0
NA07006    European American  NA07048    NA06991
NA07020    European American  NA07048    NA06991
NA07048    European American  NA07034    NA07055
NA07034    European American  0          0
NA07055    European American  0          0
NA06991    European American  NA06993    NA06985
NA06993    European American  0          0
NA06985    European American  0          0
NA12040    European American  NA10857    NA10852
NA10857    European American  NA12043    NA12044
NA12043    European American  0          0
NA12044    European American  0          0
NA10852    European American  NA12045    NA12046
NA12045    European American  0          0
NA12046    European American  0          0
NA11870    European American  NA10858    NA10859
NA11871    European American  NA10858    NA10859
NA10858    European American  NA11879    NA11880
NA11879    European American  0          0
NA11880    European American  0          0
NA10859    European American  NA11881    NA11882
NA11881    European American  0          0
NA11882    European American  0          0
NA11984    European American  NA10860    NA10861
NA11985    European American  NA10860    NA10861
NA10860    European American  NA11992    NA11993
NA11992    European American  0          0
NA11993    European American  0          0
NA10861    European American  NA11994    NA11995
NA11994    European American  0          0
NA11995    European American  0          0
NA12148    European American  NA10830    NA10831
NA12149    European American  NA10830    NA10831
NA10830    European American  NA12154    NA12236
NA12154    European American  0          0
NA12236    European American  0          0
NA10831    European American  NA12155    NA12156
NA12155    European American  0          0
NA12156    European American  0          0
NA12243    European American  NA10835    NA10834
NA12244    European American  NA10835    NA10834
NA10835    European American  NA12248    NA12249
NA12248    European American  0          0
NA12249    European American  0          0
NA10834    European American  NA12250    NA12251
NA12250    European American  0          0
NA12251    European American  0          0
NA12007    European American  NA10838    NA10839
NA10838    European American  NA12003    NA12004
NA12003    European American  0          0
NA12004    European American  0          0
NA10839    European American  NA12005    NA12006
NA12005    European American  0          0
NA12006    European American  0          0
NA11909    European American  NA10842    NA10843
NA10842    European American  NA11917    NA11918
NA11917    European American  0          0
NA11918    European American  0          0
NA10843    European American  NA11919    NA11920
NA11919    European American  0          0
NA11920    European American  0          0
NA17031    African American
NA17032    African American
NA17033    African American
NA17034    African American
NA17035    African American
NA17036    African American
NA17037    African American
NA17038    African American
NA17039    African American
NA17040    African American
NA17101    African American
NA17102    African American
NA17103    African American
NA17106    African American
NA17107    African American
NA17108    African American
NA17109    African American
NA17111    African American
NA17112    African American
NA17114    African American
NA17115    African American
NA17117    African American
NA17119    African American
NA17122    African American
NA17124    African American
NA17125    African American
NA17132    African American
NA17134    African American
NA17136    African American
NA17137    African American
NA17139    African American
NA17140    African American
NA17144    African American
NA17147    African American
NA17148    African American
NA17149    African American
NA17152    African American
NA17155    African American
NA17156    African American
NA17157    African American
NA17158    African American
NA17159    African American
NA17160    African American
NA17169    African American
NA17172    African American
NA17196    African American
NA17197    African American
NA17198    African American
NA17199    African American
NA17200    African American
NA11321    Chinese
NA11322    Chinese
NA11323    Chinese
NA16654    Chinese
NA16688    Chinese
NA16689    Chinese
NA17014    Chinese
NA17015    Chinese
NA17016    Chinese
NA17017    Chinese
NA17018    Chinese
NA17019    Chinese
NA17020    Chinese
NA11589    Japanese
NA11590    Japanese
NA17051    Japanese
NA17052    Japanese
NA17053    Japanese
NA17054    Japanese
NA17055    Japanese
NA17056    Japanese
NA17057    Japanese
NA17058    Japanese
NA17059    Japanese
NA17060    Japanese
NA17081    Southeast Asian
NA17082    Southeast Asian
NA17083    Southeast Asian
NA17084    Southeast Asian
NA17085    Southeast Asian
NA17086    Southeast Asian
NA17087    Southeast Asian
NA17088    Southeast Asian
NA17089    Southeast Asian
NA17090    Southeast Asian

Table A2

FST and pexcess for 101 SNPs around LCT

SNP ID     Coordinate(a)  Allele(b)  Frequency (%) in                                    Value for
                                     European Americans  African Americans  East Asians  FST  pexcess
rs1531957  134781635      T          21.3                8.2                26.5         .03
rs1996589  134887524      T          68.8                33.0               60.0         .09  .42
rs1257168  134986220      A          40.4                7.1                18.2         .10
rs1257220  135037675      A          17.7                16.0               31.4         .02  .25
rs842360   135370213      C          34.4                46.8               76.5         .12  .44
rs1942043  135577820      C          3.1                 5.1                6.1          0
rs749017   135595987      G          30.0                44.8               30.6         .01  .20
rs766271   135689459      C          55.4                30.6               46.3         .03  .28
rs2322254  135773177      C          19.8                35.0               51.4         .07  .54
rs1551497  135809970      C          15.0                47.8               16.1         .11  .53
rs1031575  135880258      G          3.1                 1.0                2.9          0
rs2290518  135901142      G          85.4                42.0               80.0         .17  .63
rs2305594  135912936      C          9.4                 6.1                15.7         .01
rs4954222  135934583      G          9.4                 6.1                15.7         .01
rs2305247  135950620      T          2.1                 20.0               1.4          .10
rs2305248  135950640      A          85.4                40.4               81.8         .19  .62
rs935612   135963831      A          88.5                42.0               92.6         .27
rs4954228  135998826      A          89.6                43.0               91.2         .26
rs4954231  136038842      T          8.1                 32.0               7.6          .09
rs737388   136095539      C          2.1                 21.0               1.4          .10
rs1469950  136150582      G          4.9                 1.2                1.7          0
rs2118395  136223648      T          7.4                 4.0                7.1          0
rs4954259  136260322      A          4.1                 0                  3.1          0
rs1370533  136272613      C          94.8                43.9               91.4         .30
rs984763   136367366      A          2.5                 4.0                7.1          .00
rs2034277  136399322      C          0                   15.6               0            .10
rs958400   136403174      A          0                   35.0               0            .26
rs2289963  136428206      A          6.0                 10.2               8.6          .00
rs4954278  136430619      T          9.2                 20.8               9.1          .02
rs1438303  136452185      T          9.4                 18.4               51.4         .16
rs313522   136453194      T          83.0                26.0               11.8         .39  .79
rs313520   136462199      A          0                   11.0               0            .07
rs629377   136474052      T          0                   13.0               0            .08
rs2117511  136484989      A          90.6                43.9               64.7         .16
rs2304367  136489492      C          0                   13.0               0            .08
rs1347767  136507985      G          0                   13.0               0            .08
rs1438307  136521494      T          83.0                25.0               33.3         .26  .76
rs3213889  136533903      G          82.6                26.5               35.3         .24  .75
rs2304601  136550362      A          0                   1.0                0            0
rs2304602  136560269      G          0                   0                  4.3          .02
rs1030766  136575510      A          8.3                 45.0               27.1         .11
rs1030764  136575857      T          86.5                47.0               62.9         .11  .70
rs1011361  136575967      A          83.3                28.0               35.7         .23  .76
rs2015532  136577853      G          8.3                 20.0               18.8         .01
rs2322659  136577987      C          86.4                46.0               40.0         .17  .76
rs872151   136579133      T          8.3                 11.0               14.3         0
rs892715   136598905      C          81.5                23.9               34.8         .24  .74
rs2322812  136600368      G          5.8                 12.0               14.3         .01
rs2874874  136600522      C          6.8                 10.4               14.7         0
rs2164210  136602615      C          81.3                24.5               37.1         .23  .73
rs1470457  136604176      G          15.6                45.7               38.2         .07  .63
rs730005   136605022      C          7.6                 24.5               23.5         .03
rs2322813  136605137      G          6.8                 14.6               18.3         .01
rs745500   136605520      A          81.9                25.0               37.9         .23  .74
rs2236783  136616486      A          81.9                25.0               32.9         .25  .75
rs2082730  136629069      G          0                   10.0               0            .06
rs4988235  136630974      T          77.2                14.0               0            .53
rs2304369  136631648      A          3.4                 18.0               1.4          .07
rs309180   136636583      A          82.6                23.5               32.9         .26  .76
rs309181   136637141      G          81.8                26.5               42.6         .21  .72
rs182549   136639082      T          77.1                13.3               0            .53
rs309176   136644544      C          81.4                25.0               32.9         .24  .74
rs309125   136665883      C          81.5                28.0               32.9         .23  .73
rs309167   136691592      T          9.5                 20.4               24.3         .02
rs2322725  136699520      C          9.1                 19.0               17.1         .01
rs192822   136704602      T          85.7                40.0               32.9         .21  .78
rs309163   136713685      A          0                   8.2                0            .05
rs309120   136731115      G          8.0                 39.0               48.6         .13
rs3112496  136733392      T          8.3                 39.8               48.6         .13
rs309142   136737652      C          8.1                 39.1               48.4         .13
rs522086   136757469      T          0                   5.1                0            .03
rs309118   136768552      C          8.3                 26.2               46.7         .12
rs309137   136788279      T          83.3                21.0               28.6         .31  .78
rs1469816  136814744      A          1.2                 32.3               11.4         .12
rs2090660  136841047      T          8.8                 12.2               17.6         0
rs2090663  136852863      G          0                   8.2                1.4          .03
rs1112156  136899042      A          0                   5.1                0            .03
rs953388   136929457      T          12.5                5.1                32.9         .09
rs2176716  136946021      T          18.8                22.0               45.7         .06  .45
rs1519523  136956777      T          52.1                22.0               25.7         .07  .37
rs1519529  136996585      G          19.5                8.0                0            .07
rs4440020  137012655      A          91.7                48.0               50.0         .17
rs4075810  137025473      T          3.5                 4.8                46.4         .26
rs4347891  137058006      G          3.1                 32.7               19.7         .09
rs4245843  137062112      A          3.1                 44.0               27.3         .14
rs4954411  137098753      T          58.3                25.5               18.6         .13  .47
rs4501004  137129075      T          27.1                50.0               41.4         .03  .41
rs2138140  137133257      A          4.2                 35.0               5.7          .15
rs1399604  137152993      G          27.9                24.0               55.7         .08  .30
rs867563   137164828      G          25.0                22.4               40.0         .02  .20
rs578935   137233319      C          10.4                8.0                31.4         .07
rs1346822  137236689      A          9.8                 15.0               24.2         .02
rs694510   137303189      T          21.8                44.4               68.3         .14  .61
rs876338   137311475      T          75.0                49.0               40.0         .08  .55
rs1427588  137514654      C          43.8                36.0               55.7         .02  .05
rs1346731  137634915      A          40.3                19.8               20.6         .04  .25
rs2370192  137649312      A          4.3                 1.0                0            .01
rs518614   137739179      C          61.5                18.8               14.3         .20  .54
rs574135   137762448      G          62.5                27.8               19.1         .14  .51
rs1432232  137821992      C          64.6                28.0               54.3         .09  .40
rs882374   137935623      A          25.0                36.0               40.0         .01  .34
(a) Coordinate on chromosome 2, according to the hg15 freeze of the human genome (UCSC Genome Bioinformatics Web site).
(b) Allele shown is the minor allele in the African American population.
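
The pexcess column above can be approximately reproduced from the three frequency columns. The sketch below assumes that pexcess takes the form (pE − pO)/(1 − pO), where pE is the European American frequency and pO is the unweighted mean of the African American and East Asian frequencies, with the allele oriented so that it is the one more common in European Americans. That definition is an assumption of this example, checked only against rows of this table; the formal definition given earlier in the article should be taken as authoritative. The function name `pexcess` is hypothetical.

```python
def pexcess(p_target, p_others):
    """Excess allele frequency in the target population relative to comparison populations.

    Frequencies are proportions in [0, 1]. ASSUMPTION: the reference frequency is the
    unweighted mean of the comparison populations, and the allele is re-oriented so
    that it is the one more common in the target population.
    """
    p_ref = sum(p_others) / len(p_others)
    if p_target < p_ref:
        # Switch to the other allele of the biallelic SNP
        p_target, p_ref = 1.0 - p_target, 1.0 - p_ref
    if p_ref == 1.0:
        return 0.0  # allele fixed everywhere: no excess can be defined
    return (p_target - p_ref) / (1.0 - p_ref)


# Frequencies from table A2 (European American, [African American, East Asian])
print(round(pexcess(0.830, [0.260, 0.118]), 2))  # rs313522: 0.79, as listed
print(round(pexcess(0.688, [0.330, 0.600]), 2))  # rs1996589: 0.42, as listed
print(round(pexcess(0.344, [0.468, 0.765]), 2))  # rs842360 (allele switched): 0.44, as listed
```

The table leaves pexcess blank for many SNPs; the rule used to decide which SNPs receive a value is not stated in this appendix and is not modeled here.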

Electronic-Database Information

The URLs for data presented herein are as follows:

Coriell Institute for Medical Research, http://locus.umdnj.edu/ccr/
HapMap Project, http://www.hapmap.org
Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/ (for HBB, G6PD, FY, TNFSF5, CKR5, HFE, ADH1B, CFTR, and LCT)
SNP Consortium (TSC) Web Site, http://snp.cshl.org/allele_frequency_project/
UCSC Genome Bioinformatics, http://genome.ucsc.edu

References

Akey J, Zhang G, Zhang K, Jin L, Shriver M (2002) Interrogating a high-density SNP map for signatures of natural selection. Genome Res 12:1805–1814 (doi:10.1101/gr.631202)
Altshuler D, Hirschhorn JN, Klannemark M, Lindgren CM, Vohl MC, Nemesh J, Lane CR, Schaffner SF, Bolk S, Brewer C, Tuomi T, Gaudet D, Hudson TJ, Daly M, Groop L, Lander ES (2000) The common PPARγ Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes. Nat Genet 26:76–80 (doi:10.1038/79216)
Bamshad M, Wooding SP (2003) Signatures of natural selection in the human genome. Nat Rev Genet 4:99–111 (doi:10.1038/nrg999)
Bayless TM, Rosensweig NS (1966) A racial difference in incidence of lactase deficiency: a survey of milk intolerance and lactase deficiency in healthy adult males. JAMA 197:968–972
Beja-Pereira A, Luikart G, England PR, Bradley DG, Jann OC, Bertorelle G, Chamberlain AT, Nunes TP, Metodiev S, Ferrand N, Erhardt G (2003) Gene-culture coevolution between cattle milk protein genes and human lactase genes. Nat Genet 35:311–313 (doi:10.1038/ng1263)
Bowcock AM, Kidd JR, Mountain JL, Hebert JM, Carotenuto L, Kidd KK, Cavalli-Sforza LL (1991) Drift, admixture, and selection in human evolution: a study with DNA polymorphisms. Proc Natl Acad Sci USA 88:839–843
Braverman JM, Hudson RR, Kaplan NL, Langley CH, Stephan W (1995) The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 140:783–796
Cavalli-Sforza L (1973) Analytic review: some current problems of population genetics. Am J Hum Genet 25:82–104
Devlin B, Risch N (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29:311–322 (doi:10.1006/geno.1995.9003)
Enattah NS, Sahi T, Savilahti E, Terwilliger JD, Peltonen L, Jarvela I (2002) Identification of a variant associated with adult-type hypolactasia. Nat Genet 30:233–237 (doi:10.1038/ng826)
Flatz G (1987) Genetics of lactose digestion in humans. In: Harris H, Hirschhorn K (eds) Advances in human genetics. Vol 16. Plenum Press, New York, pp 1–77
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D (2002) The structure of haplotype blocks in the human genome. Science 296:2225–2229 (doi:10.1126/science.1069424)
Grand RJ, Montgomery RK, Chitkara DK, Hirschhorn JN (2003) Changing genes; losing lactase. Gut 52:617–619 (doi:10.1136/gut.52.5.617)
Hamblin MT, Thompson EE, Di Rienzo A (2002) Complex signatures of natural selection at the Duffy blood group locus. Am J Hum Genet 70:369–383
Hartl D, Clark A (1997) Principles of population genetics. Sinauer Associates, Sunderland, MA
Hastbacka J, de la Chapelle A, Mahtani MM, Clines G, Reeve-Daly MP, Daly M, Hamilton BA, Kusumi K, Trivedi B, Weaver A, Coloma A, Lovett M, Buckler A, Kaitila I, Lander ES (1994) The diastrophic dysplasia gene encodes a novel sulfate transporter: positional cloning by fine-structure linkage disequilibrium mapping. Cell 78:1073–1087 (doi:10.1016/0092-8674(94)90281-X)
Hollox EJ, Poulter M, Zvarik M, Ferak V, Krause A, Jenkins T, Saha N, Kozlov AI, Swallow DM (2001) Lactase haplotype diversity in the Old World. Am J Hum Genet 68:160–172
Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G, Shlien A, Palsson ST, Frigge ML, Thorgeirsson TE, Gulcher JR, Stefansson K (2002) A high-resolution recombination map of the human genome. Nat Genet 31:241–247
Kretchmer N (1971) Memorial lecture: lactose and lactase—a historical perspective. Gastroenterology 61:805–813
Lewontin RC, Krakauer J (1973) Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics 74:175–195
Li WH (1975) The first arrival time and mean age of a deleterious mutant gene in a finite population. Am J Hum Genet 27:274–286
Mulcare CA, Weale ME, Jones AL, Connell B, Zeitlyn D, Tarekegn A, Swallow DM, Bradman N, Thomas MG (2004) The T allele of a single-nucleotide polymorphism 13.9 kb upstream of the lactase gene (LCT) (C−13.9kbT) does not predict or cause the lactase-persistence phenotype in Africans. Am J Hum Genet 74:1102–1110 (in this issue)
Nei M, Chesser R (1983) Estimation of fixation indices and gene diversities. Ann Hum Genet 47:253–259
Olds LC, Sibley E (2003) Lactase persistence DNA variant enhances lactase promoter activity in vitro: functional role as a cis regulatory element. Hum Mol Genet 12:2333–2340 (doi:10.1093/hmg/ddg244)
Osier MV, Pakstis AJ, Soodyall H, Comas D, Goldman D, Odunsi A, Okonofua F, Parnas J, Schulz LO, Bertranpetit J, Bonne-Tamir B, Lu RB, Kidd JR, Kidd KK (2002) A global perspective on genetic variation at the ADH genes reveals unusual patterns of linkage disequilibrium and diversity. Am J Hum Genet 71:84–99
Pagnier J, Mears JG, Dunda-Belkhodja O, Schaefer-Rego KE, Beldjord C, Nagel RL, Labie D (1984) Evidence for the multicentric origin of the sickle cell hemoglobin gene in Africa. Proc Natl Acad Sci USA 81:1771–1773
Parra EJ, Marcini A, Akey J, Martinson J, Batzer MA, Cooper R, Forrester T, Allison DB, Deka R, Ferrell RE, Shriver MD (1998) Estimating African American admixture proportions by use of population-specific alleles. Am J Hum Genet 63:1839–1851
Poulter M, Hollox E, Harvey CB, Mulcare C, Peuhkuri K, Kajander K, Sarner M, Korpela R, Swallow DM (2003) The causal element for the lactase persistence/non-persistence polymorphism is located in a 1 Mb region of linkage disequilibrium in Europeans. Ann Hum Genet 67:298–311 (doi:10.1046/j.1469-1809.2003.00048.x)
Reich DE, Goldstein DB (1998) Estimating the age of mutations using the variation at linked markers. In: Goldstein DB, Schlötterer C (eds) Microsatellites: evolution and applications. Oxford University Press, Oxford
Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW (2002) Genetic structure of human populations. Science 298:2381–2385 (doi:10.1126/science.1078311)
Rosner B (1982) Fundamentals of biostatistics. Duxbury Press, Boston, MA
Sabeti PC, Reich DE, Higgins JM, Levine HZ, Richter DJ, Schaffner SF, Gabriel SB, Platko JV, Patterson NJ, McDonald GJ, Ackerman HC, Campbell SJ, Altshuler D, Cooper R, Kwiatkowski D, Ward R, Lander ES (2002) Detecting recent positive selection in the human genome from haplotype structure. Nature 419:832–837 (doi:10.1038/nature01140)
Scrimshaw N, Murray E (1988) The acceptability of milk and milk products in populations with a high prevalence of lactose intolerance. Am J Clin Nutr 48:1079–1159
Simoons F (1969) Primary adult lactose intolerance and the milking habit: a problem in biologic and cultural interrelations. I. Review of the medical research. Am J Dig Dis 14:819–836
Simoons F (1970) Primary adult lactose intolerance and the milking habit: a problem in biologic and cultural interrelations. II. A culture historical hypothesis. Am J Dig Dis 15:695–710
Stephens JC, Reich DE, Goldstein DB, Shin HD, Smith MW, Carrington M, Winkler C, et al (1998) Dating the origin of the CCR5-Δ32 AIDS-resistance allele by the coalescence of haplotypes. Am J Hum Genet 62:1507–1515
Stephens M, Donnelly P (2003) A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73:1162–1169
Tishkoff SA, Varkonyi R, Cahinhinan N, Abbes S, Argyropoulos G, Destro-Bisol G, Drousiotou A, Dangerfield B, Lefranc G, Loiselet J, Piro A, Stoneking M, Tagarelli A, Tagarelli G, Touma EH, Williams SM, Clark AG (2001) Haplotype diversity and linkage disequilibrium at human G6PD: recent origin of alleles that confer malarial resistance. Science 293:455–462 (doi:10.1126/science.1061573)
Toomajian C, Ajioka RS, Jorde LB, Kushner JP, Kreitman M (2003) A method for detecting recent selection in the human genome from allele age estimates. Genetics 165:287–297
Wiuf C (2001) Do ΔF508 heterozygotes have a selective advantage? Genet Res 78:41–47 (doi:10.1017/S0016672301005195)

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics