Am J Hum Genet. Jul 11, 2008; 83(1): 132–135.
Published online Jul 3, 2008. doi:  10.1016/j.ajhg.2008.06.005
PMCID: PMC2443852

Long-Range LD Can Confound Genome Scans in Admixed Populations

Main Text

To the Editor: In the September 2007 issue of The Journal, Tang et al. analyzed data from 192 Puerto Ricans genotyped at 112,584 autosomal markers and identified three regions with a deficiency in the proportion of European ancestry. They concluded that recent selection occurred at these regions after the admixture of European, African, and Native American ancestors.1 These signals of selection are very strong: We estimate that they each correspond to selection coefficients of >0.08 per generation, which if confirmed would represent the three most powerful selective adaptations discovered to date in humans. Here, we demonstrate that on the basis of the method the authors applied, these signals of selection could be explained as artifacts of the unusual long-range linkage disequilibrium (LD) that occurs at these regions and that is not specific to Puerto Ricans. We failed to replicate the signal of selection in an independent and larger study of 364 Puerto Rican samples, when we applied a method that is not susceptible to this confounder. Our results highlight a complexity in the analysis of dense genotype data from recently admixed populations; this complexity needs to be taken into account not only in genome-wide screens for selection but also in genome-wide association studies to ensure that false-positive signals are avoided.

The signals of selection were identified with methods described in Tang et al.,2 which use an extension of a Hidden Markov Model (HMM) to infer segments of ancestry from dense genotype data. The authors note that the assumptions of an HMM “are violated when the marker map is dense and linkage disequilibrium (LD) exists within an ancestral population”; they partially address this confounder by modeling the LD between consecutive pairs of markers but describe this approach as a “compromise” because they do not account for higher order LD.2 Because nearby sites in a region may be in weak LD while more distant sites are in much stronger LD, modeling only the LD between consecutive markers is potentially inadequate.3 As we demonstrate below, local-ancestry estimates in regions where LD is not fully modeled will not only be overconfident but will also be systematically biased, thereby leading to false-positive deficiencies in the population contributing majority ancestry.
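To make the HMM assumption concrete, here is a minimal sketch of a generic two-state ancestry HMM for a single haploid chromosome. It is not the method of Tang et al.2 (which additionally models LD between consecutive markers); the function name, the constant `switch_prob`, and the toy frequencies are assumptions chosen for illustration. Duplicating a marker stands in for two markers in perfect LD that the model treats as conditionally independent given ancestry.

```python
def ancestry_posteriors(alleles, freqs, alpha1, switch_prob):
    """Posterior P(population 1) at each marker of one haploid chromosome.

    alleles:     list of 0/1 observed alleles (1 = the "A" allele)
    freqs:       list of (p1, p2), frequency of allele 1 in populations 1 and 2
    alpha1:      genome-wide ancestry proportion from population 1
    switch_prob: probability that ancestry is redrawn between adjacent markers
                 (a hypothetical constant; real maps use genetic distance)
    """
    prior = (alpha1, 1.0 - alpha1)
    n = len(alleles)

    def emit(i, k):
        p = freqs[i][k]
        return p if alleles[i] == 1 else 1.0 - p

    def trans(j, k):
        # Stay in the same state, or redraw ancestry from the genome-wide prior.
        stay = 1.0 - switch_prob if j == k else 0.0
        return stay + switch_prob * prior[k]

    def normalize(v):
        total = sum(v)
        return [x / total for x in v]

    # Forward pass, rescaled at every marker for numerical stability.
    fwd = [normalize([prior[k] * emit(0, k) for k in range(2)])]
    for i in range(1, n):
        prev = fwd[-1]
        fwd.append(normalize([
            emit(i, k) * sum(prev[j] * trans(j, k) for j in range(2))
            for k in range(2)
        ]))

    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0]]
    for i in range(n - 2, -1, -1):
        nxt = bwd[0]
        bwd.insert(0, normalize([
            sum(trans(j, k) * emit(i + 1, k) * nxt[k] for k in range(2))
            for j in range(2)
        ]))

    return [normalize([f[k] * b[k] for k in range(2)])[0]
            for f, b in zip(fwd, bwd)]


freq = (0.25, 0.75)  # toy frequencies of allele A in populations 1 and 2
one = ancestry_posteriors([1], [freq], 0.80, 0.0)
two = ancestry_posteriors([1, 1], [freq, freq], 0.80, 0.0)
print(round(one[0], 2), round(two[0], 2))  # prints: 0.57 0.31
```

In this toy setting, counting the same evidence twice moves the posterior for population 1 from 0.57 to 0.31 even though no new information was added, which is the kind of overconfidence that unmodeled LD can introduce.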

In a separate analysis focusing on long-range LD in European populations, we applied principal components analysis (PCA) to several genome-wide data sets and identified 24 autosomal long-range LD regions, each spanning >2 megabases (Mb) (Table 1). The functional basis for these regions is currently being explored. The 24 PCA regions were identified by running the EIGENSOFT software4,5 on a data set of 327 European Americans genotyped on the Illumina 550K array and identifying all regions where there was significant long-range LD extending >2 Mb that explained one of the top eigenvectors. The regions were independently replicated in 1593 European Americans from the Illumina iControl data set genotyped on the Illumina 550K array and in 1504 + 1500 British samples from the Wellcome Trust Case Control Consortium (1958 Birth Cohort and National Blood Service Cohorts, genotyped on the Affymetrix 500K array), confirming that these regions genuinely harbor long-range LD in European populations.

Table 1
Correspondence between Regions from Tang et al. and Regions of Extended LD in European Populations

Strikingly, all three of the signals of selection reported by Tang et al.1 lie within PCA regions (Table 1). Because the PCA regions comprise <4.7% of the autosomal genome, the hypothesis that the regions discussed in Tang et al.1 and the PCA regions are independent is rejected with a p value of $(0.047)^3 \approx 0.0001$. As we will show, the presence of long-range LD in populations ancestral to Puerto Ricans could explain both the signals from Tang et al.1 and the PCA results.
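The p value quoted above is a simple product of probabilities, which can be checked in a few lines. The sketch assumes, as the text does, that under the null hypothesis each of the three regions falls inside a PCA region independently with probability at most 0.047; the function name is ours.

```python
def overlap_p_value(fraction_covered: float, n_regions: int) -> float:
    """Probability that n independently placed regions all fall in the covered fraction."""
    return fraction_covered ** n_regions


p = overlap_p_value(0.047, 3)
print(f"{p:.6f}")  # prints: 0.000104
```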

Long-range LD can arise for reasons unrelated to selection. For example, inversions are known to suppress viable recombination, and a known inversion polymorphism at position 8–12 Mb on chromosome 8 has previously been shown to be the cause of long-range LD6 (also see Table 1). (Interestingly, this inversion polymorphism appears to produce a signal of unusual ancestry in Figure 1 of Tang et al.,1 in addition to the three regions highlighted in the same paper.) It is important for studies inferring the action of selection to rule out alternative explanations for the observed data. For the regions identified by Tang et al.,1 long-range LD that arose because of inversion polymorphism or other reasons provides a plausible alternative explanation.

LD that is not properly modeled impacts not only the uncertainty in local-ancestry estimates but also the expected value of these estimates, leading to large systematic biases in regions of long-range LD. To demonstrate this, we consider a hypothetical admixed population with ancestry $\alpha_1 = 80\%$ from ancestral population 1 and $\alpha_2 = 20\%$ from ancestral population 2. We then consider an A/C marker in which the A allele has frequency $p_1 = 25\%$ in population 1 and $p_2 = 75\%$ in population 2, so that its frequency in the admixed population is $p = \alpha_1 p_1 + \alpha_2 p_2 = 35\%$. Let $q_1 = 75\%$, $q_2 = 25\%$, and $q = 65\%$ denote the corresponding frequencies of the C allele. If local ancestry on a single-haploid chromosome is inferred with only information from that marker, we obtain $P(\text{population 1} \mid A) = \alpha_1 p_1/(\alpha_1 p_1 + \alpha_2 p_2) = 0.57$ and $P(\text{population 1} \mid C) = \alpha_1 q_1/(\alpha_1 q_1 + \alpha_2 q_2) = 0.92$, so that the expected value of the ancestry estimate is $E(P(\text{population 1})) = p\,P(\text{population 1} \mid A) + q\,P(\text{population 1} \mid C) = 0.80$, which is an unbiased estimate of $\alpha_1$. Now, we consider a second marker that has identical allele frequencies and that is in perfect LD with the first and suppose that the two markers are used to infer local ancestry, treating them as if they were unlinked (this could happen with the method of Tang et al.2 if the markers are nonconsecutive). The resulting local-ancestry estimates are $P(\text{population 1} \mid AA) = \alpha_1 p_1^2/(\alpha_1 p_1^2 + \alpha_2 p_2^2) = 0.31$ and $P(\text{population 1} \mid CC) = \alpha_1 q_1^2/(\alpha_1 q_1^2 + \alpha_2 q_2^2) = 0.97$, so that the expected value of the ancestry estimate is $E(P(\text{population 1})) = p\,P(\text{population 1} \mid AA) + q\,P(\text{population 1} \mid CC) = 0.74$, a downwardly biased estimate of $\alpha_1$.
More generally, when $n$ perfectly linked markers are used to infer ancestry and are treated as unlinked, for large $n$ (e.g., $n \geq 5$), the evidence of ancestry associated to a particular allele becomes overwhelming, and the estimated ancestry proportion will equal the allele frequency: $P(\text{population 1} \mid A^n) = \alpha_1 p_1^n/(\alpha_1 p_1^n + \alpha_2 p_2^n) \approx 0$ and $P(\text{population 1} \mid C^n) = \alpha_1 q_1^n/(\alpha_1 q_1^n + \alpha_2 q_2^n) \approx 1$, so that $E(P(\text{population 1})) = p\,P(\text{population 1} \mid A^n) + q\,P(\text{population 1} \mid C^n) = q = 0.65$. The deficiency of 15% local ancestry, compared to genome-wide ancestry of 80%, shows that the bias could produce effects as large as the 14% deficiencies in European ancestry reported by Tang et al.1; such deficiencies will persist when local-ancestry estimates are incorporated into an HMM. In a data set of 112,584 markers, the regions of long-range LD listed in Table 1 would be expected to contain at least 100 markers each. As in our example, unmodeled LD could bias ancestry estimates in the direction of allele frequencies, thereby favoring a deficiency of the population contributing majority ancestry, just as reported in Tang et al.1
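The worked example above can be reproduced numerically. The following sketch uses the same hypothetical values (80% and 20% ancestry, A-allele frequencies of 25% and 75%) and treats $n$ perfectly linked markers as if they were unlinked; the function names are ours and are not taken from any published software.

```python
def posterior_pop1(alpha1, freq1, freq2, n):
    """P(population 1 | same allele seen at n linked markers treated as unlinked)."""
    alpha2 = 1.0 - alpha1
    like1 = alpha1 * freq1 ** n
    like2 = alpha2 * freq2 ** n
    return like1 / (like1 + like2)


def expected_ancestry(alpha1, p1, p2, n):
    """Expected population-1 ancestry estimate, averaged over the A and C haplotypes."""
    alpha2 = 1.0 - alpha1
    p = alpha1 * p1 + alpha2 * p2  # frequency of A in the admixed population
    q = 1.0 - p                    # frequency of C in the admixed population
    post_a = posterior_pop1(alpha1, p1, p2, n)
    post_c = posterior_pop1(alpha1, 1.0 - p1, 1.0 - p2, n)
    return p * post_a + q * post_c


for n in (1, 2, 5, 20):
    print(n, round(expected_ancestry(0.80, 0.25, 0.75, n), 3))
```

With one marker the expected estimate recovers the true 80%; with two it drops to about 74%; as $n$ grows it approaches the C-allele frequency of 65%.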

In addition to their analysis of 112,584 markers, Tang et al.1 report evidence of selection in analyses of individual HLA markers (Table 1 of their paper). These single-marker analyses are immune to the effects of long-range LD but may be affected by their use of inaccurate ancestral populations to model Puerto Rican ancestry. In particular, the Native American ancestry of Puerto Ricans derives from the Taino, a Native South American population that is likely to be highly genetically diverged from the Native North American populations such as the Pima and Maya used by Tang et al.1 to model Native American ancestry.7 Frequency differences among Native American populations could explain why Table 1 of Tang et al.1 reports a 13% increase in Native American ancestry based on allele frequencies of individual markers at the HLA locus, whereas Figure 1 of Tang et al.1 reports no deviation in Native American ancestry at the same locus when flanking genomic data were used.2 We note that if single-marker analyses are affected by the use of inaccurate ancestral populations, analyses of individual markers in new samples from the same populations would not provide an independent replication because the genetic drift underlying the inaccuracy occurs at the population level, not at the individual level.

As an independent test for selection at the chromosome 6 locus, we analyzed 364 new Puerto Rican samples, consisting of 170 individuals with Crohn's disease and 194 matched controls recruited at the University of Puerto Rico School of Medicine. We genotyped these samples at 2459 autosomal markers from our published admixture map that were powerful for distinguishing African from non-African ancestry.8 (Most markers in the map have relatively similar frequencies in Europeans and Native Americans, with very different frequencies in Africans.) Genotyping was performed with the Illumina GoldenGate technology, and standard quality filters were applied.9 After additional filtering to exclude markers that were highly differentiated between Europeans and Native Americans (so as to ensure an effective two-way African versus non-African admixture analysis in a three-way admixed population7) and to disallow LD between markers in the ancestral populations,10 we retained 1438 markers for downstream analysis. We found that these markers were sufficient to generate useful ancestry estimates: Our calculations indicate that we capture 61% of maximum information about African versus non-African ancestry at the chromosome 6 region, so our effective sample size is (0.61)(364) = 223, which is larger than the sample size of 192 in Tang et al.1

By using the ANCESTRYMAP software11 to obtain local-ancestry estimates, we failed to replicate the finding of Tang et al.1 of an increase in African ancestry at chromosome 6 (Figure 1) and did not observe an unusual deviation in ancestry at any region of the genome. (These results do not shed light on selection signals at the chromosome 8 and 11 regions because Tang et al.1 reported deviations in European and Native American ancestry at these loci, whereas our 1438 markers only distinguish African versus non-African ancestry.) To test whether our negative result could be a consequence of low power, we simulated a data set of 364 samples from an admixed population that has 18% African ancestry genome wide but 32% at the chromosome 6 region.1 In detail, we simulated samples by generating ancestry segments and genotypes at the same set of 1438 markers (with the same pattern of missing data as our Puerto Rican samples) assuming 18% African ancestry, 82% European ancestry, and an average of nine generations since admixture (this quantity was inferred from the Puerto Rican data and is similar to values for other Latino populations7). We preferentially selected samples with African ancestry at marker rs451774 (position 28.6 Mb on chromosome 6) so as to achieve 32% African ancestry at this locus. By running ANCESTRYMAP on 364 simulated samples, we detected a large rise in African ancestry at the chromosome 6 region (Figure 1). Although the local estimate of 24% African ancestry at this region is less than the value of 32% used to simulate the data (because ANCESTRYMAP assumes the null model of no unusual deviation in local ancestry and thus imposes a strong prior of 18% African ancestry), the excess of African ancestry is more than twice what is observed anywhere else in the genome. Thus, our failure to identify a rise in African ancestry in Puerto Rican samples on chromosome 6 is not due to a lack of power.
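The exact selection procedure used in the simulation is not spelled out here, so the following is only one way such preferential selection could be sketched. It assumes independent haploid chromosomes whose ancestry at the focal marker is African with probability 18%, keeps every African chromosome, and keeps a non-African chromosome with a probability chosen so that the retained set has 32% African ancestry in expectation. The function names, the seed, and the haploid simplification are all assumptions.

```python
import random


def acceptance_prob(base: float, target: float) -> float:
    """Probability of keeping a non-African chromosome so that the kept set
    has `target` African ancestry in expectation, starting from `base`."""
    return base * (1.0 - target) / (target * (1.0 - base))


def enrich(n_keep: int, base: float, target: float, rng: random.Random) -> list:
    """Rejection-sample chromosomes until n_keep are retained.

    Returns a list of booleans: True means African ancestry at the focal marker.
    """
    keep_prob = acceptance_prob(base, target)
    kept = []
    while len(kept) < n_keep:
        is_african = rng.random() < base
        if is_african or rng.random() < keep_prob:
            kept.append(is_african)
    return kept


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
sample = enrich(2 * 364, 0.18, 0.32, rng)  # two chromosomes per simulated individual
print(round(acceptance_prob(0.18, 0.32), 3))  # prints: 0.466
print(sum(sample) / len(sample))       # fluctuates around 0.32
```

A full simulation would also generate ancestry segments along each chromosome and genotypes at all 1438 markers, as described above; this sketch covers only the enrichment step at the focal locus.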

Figure 1
A Replication Study in 364 Puerto Ricans Finds No Significant Rise in African Ancestry at the Chromosome 6 Locus

To test the robustness of our negative result, we reran our analysis of the 364 Puerto Rican samples with marker sets chosen to have different thresholds for maximum differentiation between Europeans and Native Americans and reran with all African and European allele-frequency data omitted to ensure that our results were not affected by inaccurate ancestral populations. We also reran with the control individuals only, to ensure that our results were not influenced by the inclusion of Crohn's disease cases. In none of these runs did we observe a signal of a rise in African ancestry at the chromosome 6 locus. The above runs used markers that are not in LD in ancestral populations, as required by ANCESTRYMAP. However, as a demonstration of the pitfalls of not accounting for LD between markers, we reran ANCESTRYMAP on a larger set of 1852 markers in which no constraint was applied to disallow LD in ancestral populations. African-ancestry estimates across the genome varied wildly from 15% to 54%, corresponding to large deficiencies in European ancestry analogous to the signals from Tang et al.1

Our analysis demonstrates that the signals of recent selection reported by Tang et al.1 could theoretically be explained as artifacts caused by regions of long-range LD (with which they strikingly coincide) and inaccurate ancestral populations. Furthermore, we empirically failed to replicate the finding of an unusual deviation in African ancestry at the chromosome 6 region in our analysis of a larger Puerto Rican sample set. We believe that the hypothesis of selection since admixture should therefore be viewed with caution. We note that in a joint analysis of more than 10,000 African American samples that we have scanned in admixture-mapping studies, we have not yet found a single locus at which there is signal of a local-ancestry deviation that is not specific to disease cases. We consider it unlikely that recent selection events could lead to three distinct local-ancestry deviations that are large enough to be detected with only 192 Puerto Rican samples, when we failed to detect any such effect in African Americans using >50-fold more samples.

These results also have methodological significance for genome-wide association studies in admixed populations such as Latinos and African Americans. To have maximum power, such studies need to take advantage of admixture association signals (deviations in local ancestry in disease cases compared to their genome-wide average) as well as case-control association signals. The method of Tang et al.2 has been shown to accurately infer ancestry in simulated data sets, but our results suggest that it may produce false-positive admixture association signals in regions of long-range LD in admixed populations. In association studies, such errors can be controlled by computation of local-ancestry estimates in both cases and controls. However, case-only admixture association analyses are known to provide higher statistical power.11 Thus, carrying out robust, fully powered genome-wide association studies in admixed populations will require methods that rigorously account for the confounding effects of long-range LD.

Acknowledgments

A.L.P. is supported by a Ruth Kirschstein National Research Service Award from the NIH. N.P. is supported by a K-01 career development award from the NIH. D.R. is supported by a Burroughs Wellcome Career Development Award in the Biomedical Sciences. This research was also supported by U-01 award HG004168 from the NIH (D.R. and N.P.), by NIDDK grant PO1DK46763 (J.I.R.), and by the Board of Governor's Chair in Medical Genetics at Cedars-Sinai Medical Center (J.I.R.). Genotyping of the Puerto Rican samples was supported in part by grant M01-RR00425 to the Cedars-Sinai GCRC genotyping core (K.D.T.) and by NIH grant DK62413 (K.D.T.).

References

1. Tang H., Choudhry S., Mei R., Morgan M., Rodriguez-Clinton W., Burchard E.G., Risch N.J. Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 2007;81:626–633.
2. Tang H., Coram M., Wang P., Zhu X., Risch N. Reconstructing genetic ancestry blocks in admixed individuals. Am. J. Hum. Genet. 2006;79:1–12.
3. Wall J.D., Pritchard J.K. Linkage disequilibrium in the human genome. Nat. Rev. Genet. 2003;4:587–597.
4. Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909.
5. Patterson N., Price A.L., Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190.
6. Tian C., Plenge R.M., Ransom M., Lee A., Villoslada P., Selmi C., Klareskog L., Pulver A.E., Qi L., Gregersen P.K. Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 2008;4:e4.
7. Price A.L., Patterson N., Yu F., Cox D.R., Waliszewska A., McDonald G.J., Tandon A., Schirmer C., Neubauer J., Bedoya G. A genomewide admixture map for Latino populations. Am. J. Hum. Genet. 2007;80:1024–1036.
8. Smith M.W., Patterson N., Lautenberger J.A., Truelove A.L., McDonald G.J., Waliszewska A., Kessing B.D., Malasky M.J., Scafe C., Le E. A high-density admixture map for disease gene discovery in African Americans. Am. J. Hum. Genet. 2004;74:1001–1013.
9. Fan J.B., Oliphant A., Shen R., Kermani B.G., Garcia F., Gunderson K.L., Hansen M., Steemers F., Butler S.L., Deloukas P. Highly parallel SNP genotyping. Cold Spring Harb. Symp. Quant. Biol. 2003;68:69–78.
10. Reich D., Patterson N., De Jager P.L., McDonald G.J., Waliszewska A., Tandon A., Lincoln R.R., DeLoa C., Fruhan S.A., Cabre P. A whole-genome admixture scan finds a candidate gene for multiple sclerosis susceptibility. Nat. Genet. 2005;37:1113–1118.
11. Patterson N., Hattangadi N., Lane B., Lohmueller K.E., Hafler D.A., Oksenberg J.R., Hauser S.L., Smith M.W., O'Brien S.J., Altshuler D. Methods for high-density admixture mapping of disease genes. Am. J. Hum. Genet. 2004;74:979–1000.

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics