Copyright © 2003 by The American Society of Human Genetics. All rights reserved.

Informativeness of Genetic Markers for Inference of Ancestry*

1Program in Molecular and Computational Biology, University of Southern California, Los Angeles; 2Institute of Biological Anthropology, University of Oxford, Oxford, United Kingdom; and 3Department of Human Genetics, University of Chicago, Chicago

Address for correspondence and reprints: Dr. Noah A. Rosenberg, Program in Molecular and Computational Biology, University of Southern California, 1042 West 36th Place, DRB 289, Los Angeles, CA 90089. E-mail: noahr/at/usc.edu

Received May 29, 2003; Accepted October 2, 2003.

Abstract

Inference of individual ancestry is useful in various applications, such as admixture mapping and structured-association mapping. Using information-theoretic principles, we introduce a general measure, the informativeness for assignment (In), applicable to any number of potential source populations, for determining the amount of information that multiallelic markers provide about individual ancestry. In a worldwide human microsatellite data set, we identify markers of highest informativeness for inference of regional ancestry and for inference of population ancestry within regions; these markers, which are listed in online-only tables in our article, can be useful both in testing for and in controlling the influence of ancestry on case-control genetic association studies. Markers that are informative in one collection of source populations are generally informative in others. Informativeness of random dinucleotides, the most informative class of microsatellites, is five to eight times that of random single-nucleotide polymorphisms (SNPs), but 2%–12% of SNPs have higher informativeness than the median for dinucleotides. Our results can aid in decisions about the type, quantity, and specific choice of markers for use in studies of ancestry.
Introduction Inference of individual ancestry from genetic markers is helpful in diverse situations, including admixture and association mapping, forensics, prediction of medical risks, wildlife management, and studies of dispersal, gene flow, and evolutionary history (Shriver et al. 1997; Davies et al. 1999; Primmer et al. 2000; Manel et al. 2002; Bamshad et al. 2003; Campbell et al. 2003; Ziv and Burchard 2003). Statistical methods for ancestry inference use multilocus genotypes and population allele frequencies, either specified in advance or estimated during the inference process, to assign populations of origin to individuals (Smouse et al. 1982; Paetkau et al. 1995; Rannala and Mountain 1997; Cornuet et al. 1999; Pritchard et al. 2000; Guinand et al. 2002). Because use of highly informative markers can reduce the amount of genotyping required for ancestry inference, it is desirable to measure the extent to which specific markers contribute to this inference. Several approaches have previously been used for measuring these locus contributions (table 1). However, despite their various features in specific scenarios, all of these measures are either difficult to compute, not designed specifically for estimating marker information content, or not applicable to sets with many potential source populations.
Here, using information-theoretic and decision-theoretic approaches, we introduce new criteria: the informativeness for assignment, the optimal rate of correct assignment, and the informativeness for ancestry coefficients. The choice of statistic for use in identifying markers for ancestry inference depends on the inference algorithm that is being used (table 1). The new statistics, as convenient and statistically motivated general measures applicable to any number of alleles and populations, may be useful both in admixed and in multisource human groups, such as those that have formed in the Western Hemisphere by the intermixing of Africans, Native Americans, and Europeans. We first define the statistics, consider their relationships with δ and Fst (two criteria that are often used to measure marker information content), and study the number of markers needed for inference. We demonstrate that the criteria are highly correlated and proceed using only the informativeness for assignment, or, simply, the informativeness. Informativeness rankings of loci in human microsatellite data are found to be robust, and use of markers of highest informativeness is observed to reduce the number of markers needed for inference of population structure. We consider the relationships of informativeness values in different subsets of the human population, and the relative informativeness of microsatellites and SNPs. Tables A, B, C, D, and E (online only) provide lists of informativeness ranks for various sets of source populations.
Theory

Consider populations i=1,2,…,K and loci l=1,2,…,L, with K ≥ 2 and L ≥ 1. Locus l has alleles $j=1,2,\ldots,N^{(l)}$. The relative frequency for allele j of locus l in population i is $p_{ij}^{(l)}$; this quantity represents a parametric rather than a sample frequency. The (parametric) average frequency of allele j at locus l is defined as

$$p_j^{(l)} = \frac{1}{K}\sum_{i=1}^{K} p_{ij}^{(l)}. \tag{1}$$
Informativeness for Assignment: the No-Admixture Model

In the no-admixture model, individuals are each assumed to originate from one of K populations. Suppose we are given a random individual, whose (random) population assignment is Q, with Q ∈ {1,2,…,K}. The probability that the individual belongs to population i is P(Q=i); we assume that each population has the same initial probability of being the source of an unknown individual, so that P(Q=i)=1/K for all i. The (random) genotype of one of the individual’s two alleles at locus l is $J^{(l)}$.

Our aim is to measure the amount of “information” gained about Q from knowledge of $J^{(l)}$ and to compare this quantity across different values of l. This measurement can be performed in a natural way, through use of an information-theoretic framework. If the value of $J^{(l)}$ is unknown, there is uncertainty, or entropy, regarding the value of the random variable Q. Once the value of $J^{(l)}$ is known, the entropy of Q decreases. The reduction in uncertainty about Q due to knowledge of $J^{(l)}$ is the mutual information, $I_n(Q;J^{(l)})=H_n(Q)-H_n(Q\mid J^{(l)})$, where $H_n(Q)$ is the initial entropy of Q, and $H_n(Q\mid J^{(l)})$ is the conditional entropy of Q given knowledge of $J^{(l)}$ (the subscript “n” refers to the no-admixture model). Using standard definitions (Cover and Thomas 1991, chapter 2), and leaving off superscripts for convenience, we have

$$H_n(Q) = -\sum_{i=1}^{K} P(Q=i)\log P(Q=i) = \log K, \tag{2}$$

$$H_n(Q\mid J) = -\sum_{j=1}^{N} P(J=j)\sum_{i=1}^{K} P(Q=i\mid J=j)\log P(Q=i\mid J=j) = \log K + \sum_{j=1}^{N}\left(p_j\log p_j - \sum_{i=1}^{K}\frac{p_{ij}}{K}\log p_{ij}\right), \tag{3}$$

where $P(J=j)=p_j$ and $P(Q=i\mid J=j)=p_{ij}/(Kp_j)$. Subtracting equation (3) from equation (2) gives

$$I_n(Q;J) = \sum_{j=1}^{N}\left(-p_j\log p_j + \sum_{i=1}^{K}\frac{p_{ij}}{K}\log p_{ij}\right), \tag{4}$$

with 0 log 0 taken to be 0.
In attains its maximal value, log K, when N ≥ K and no allele is found in more than one population.

This measure of the amount of “information” about the ancestry Q contained in the genotype J also arises from a likelihood approach. The expression for In(Q;J) also has a close correspondence with the G statistic (Sokal and Rohlf 1995, p. 737) obtained from a contingency table, each of whose N columns of K elements gives the relative frequencies of an allele in the K populations.
Note that In(Q;J) is a sum over alleles. The contribution of allele j to informativeness for assignment is

$$-p_j\log p_j + \sum_{i=1}^{K}\frac{p_{ij}}{K}\log p_{ij}. \tag{5}$$

For pj ≤ 1/K, the maximum, pj log K, occurs when pij=Kpj for exactly one value of i and pij=0 for all other values.

Similarly, it is possible to write In(Q;J) as a sum over populations, with the contribution of population i to the informativeness for assignment equaling

$$\frac{1}{K}\sum_{j=1}^{N} p_{ij}\log\frac{p_{ij}}{p_j}. \tag{6}$$
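The informativeness for assignment can be evaluated directly from a K×N table of allele frequencies. The following is a minimal sketch, assuming equal prior probabilities 1/K, natural logarithms, and the mutual-information form of In as a sum over alleles; the function names are illustrative and do not come from the article.

```python
import math


def allele_contributions(freqs):
    """Per-allele contributions to I_n for one locus.

    freqs[i][j] is the frequency of allele j in population i
    (each row sums to 1). Equal priors 1/K are assumed.
    """
    K = len(freqs)
    N = len(freqs[0])
    contributions = []
    for j in range(N):
        # Average frequency of allele j across the K populations.
        p_j = sum(freqs[i][j] for i in range(K)) / K
        term = -p_j * math.log(p_j) if p_j > 0 else 0.0
        for i in range(K):
            p_ij = freqs[i][j]
            if p_ij > 0:  # 0 log 0 is taken to be 0
                term += (p_ij / K) * math.log(p_ij)
        contributions.append(term)
    return contributions


def informativeness(freqs):
    """I_n(Q;J): the sum of the per-allele contributions."""
    return sum(allele_contributions(freqs))


if __name__ == "__main__":
    # Identical frequencies in both populations: no information.
    print(informativeness([[0.5, 0.5], [0.5, 0.5]]))
    # Private alleles: the value equals log K.
    print(informativeness([[1.0, 0.0], [0.0, 1.0]]), math.log(2))
```

The two printed cases illustrate the extremes of the statistic: 0 when all populations share the same frequencies, and log K when each allele occurs in only one population.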
General prior assignment probabilities, P(Q=i)=qi, can be accommodated by replacing P(Q=i) with qi in the derivations of equations (2) and (3) and 1/K with qi in equations (1), (4), (5), and (6).

In can also be extended for use in assignment to populations of multilocus diploid genotypes rather than of single alleles. For convenience, we treat diploid genotypes as being ordered so that (j1,j2) differs from (j2,j1) (an unordered genotype is assigned randomly to one of the possible ordered genotypes). If we assume both Hardy-Weinberg proportions and independence of loci within populations, equation (4) applies with $\prod_{l=1}^{L} p^{(l)}_{ij_1}p^{(l)}_{ij_2}$ in place of pij.

Relationship of In to δ and Fst

For the simplest case in which informativeness is of interest, namely K=N=2, it is possible to relate In to δ and to Fst. In the mathematical development, we let δ equal the signed difference between the frequencies of allele 1 in two populations, p11-p21, and, without loss of generality, we assume that p11 ≥ p21, so that δ ∈ [0,1]; when applied to data, it is implicit that δ refers to the absolute difference, |p11-p21|. Denoting σ=p11+p21, we must have σ ∈ [δ,2-δ]. Simplifying equation (4) in terms of δ and σ, with p11=(σ+δ)/2, p21=(σ-δ)/2, and p1=σ/2, we obtain (fig. 1)

$$I_n = -\frac{\sigma}{2}\log\frac{\sigma}{2} - \left(1-\frac{\sigma}{2}\right)\log\left(1-\frac{\sigma}{2}\right) + \frac{1}{2}\left[\frac{\sigma+\delta}{2}\log\frac{\sigma+\delta}{2} + \frac{\sigma-\delta}{2}\log\frac{\sigma-\delta}{2} + \left(1-\frac{\sigma+\delta}{2}\right)\log\left(1-\frac{\sigma+\delta}{2}\right) + \left(1-\frac{\sigma-\delta}{2}\right)\log\left(1-\frac{\sigma-\delta}{2}\right)\right]. \tag{7}$$
For K=2 and N=2, Fst for a locus (henceforth used interchangeably with F) can be written as (modified from p. 167 of Weir 1996)

$$F = \frac{\delta^2}{\sigma(2-\sigma)}. \tag{8}$$
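A form of F consistent with the bounds quoted in this section (F≈δ² at σ=1 and F≈δ/(2-δ) at σ=δ or σ=2-δ) is F=δ²/(σ(2-σ)). The sketch below assumes that form; the function names are hypothetical. It evaluates F and the two bounds, and can be used to check the largest gap between the bounds.

```python
import math


def fst_biallelic(delta, sigma):
    """F for K = N = 2, assuming F = delta^2 / (sigma * (2 - sigma)).

    delta = p11 - p21 (taken as positive), sigma = p11 + p21.
    """
    if not 0 < delta <= 1:
        raise ValueError("delta must lie in (0, 1]")
    if not delta <= sigma <= 2 - delta:
        raise ValueError("sigma must lie in [delta, 2 - delta]")
    return delta ** 2 / (sigma * (2 - sigma))


def fst_bounds(delta):
    """Lower bound (sigma = 1) and upper bound (sigma = delta) on F."""
    return delta ** 2, delta / (2 - delta)


if __name__ == "__main__":
    print(fst_biallelic(0.5, 1.0), fst_biallelic(0.5, 0.5))
    # The gap between the bounds is widest at delta = (3 - sqrt(5)) / 2.
    lower, upper = fst_bounds((3 - math.sqrt(5)) / 2)
    print(upper - lower)
```

For δ=0.5 the two evaluations reproduce the interval [0.250, 0.333] cited for Fst in the comparison of microsatellites and SNPs.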
For a fixed value of δ or F, a biallelic marker is best able to infer ancestry if one of its alleles is absent in one of the populations: if the random genotype J equals this allele, there is no uncertainty about the origin of an individual. In other words, for a fixed δ (Campbell et al. 2003) or F, the markers with the greatest ability to infer ancestry have a value of σ near one of the extremes. The statistic In captures this aspect of ancestry inference ability (fig. 1).
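The dependence of In on σ for a fixed δ can be checked numerically. This is a minimal sketch, assuming K=N=2, equal priors, natural logarithms, and the substitutions p11=(σ+δ)/2 and p21=(σ-δ)/2; the helper names are illustrative only.

```python
import math


def _xlogx(x):
    """x * log(x), with 0 log 0 taken to be 0."""
    return x * math.log(x) if x > 0 else 0.0


def in_biallelic(delta, sigma):
    """I_n for two populations and two alleles, in terms of delta and sigma."""
    p11 = (sigma + delta) / 2  # allele 1 frequency in population 1
    p21 = (sigma - delta) / 2  # allele 1 frequency in population 2
    p1 = sigma / 2             # average frequency of allele 1
    # Entropy of the allele under the average frequencies.
    avg_term = -_xlogx(p1) - _xlogx(1 - p1)
    # Minus the average within-population entropy.
    pop_term = 0.5 * (_xlogx(p11) + _xlogx(1 - p11)
                      + _xlogx(p21) + _xlogx(1 - p21))
    return avg_term + pop_term


if __name__ == "__main__":
    # Same delta, different sigma: the extremes are more informative.
    for sigma in (0.5, 0.75, 1.0, 1.25, 1.5):
        print(sigma, round(in_biallelic(0.5, sigma), 4))
```

With δ=0.5, the values at σ=1 and at σ=0.5 correspond to the interval [0.131, 0.216] that the text reports for In at this δ.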
A comparison of figure 1
An additional consequence of equation (8) and the requirement that σ ∈ [δ,2-δ] is that δ can be used to predict F fairly accurately, and vice versa (fig. 2): δ² ≤ F ≤ δ/(2-δ) ≤ δ²+0.0902. The maximal discrepancy between the two approximations, 0.0902, is attained when δ≈0.3820 (table 2). The mean of the two bounds, or (δ+2δ²-δ³)/(4-2δ), is always within 0.0451 of F. The accuracy of such simple approximations as F≈δ² and F≈δ/(2-δ) is perhaps somewhat surprising.

Predictions of δ from F are slightly less accurate than the reverse predictions. Given F, as σ ranges over [2F/(1+F),2/(1+F)], δ ranges from a minimum of 2F/(1+F), at the endpoints of this interval, to a maximum of √F, at σ=1.

Optimal Rate of Correct Assignment

If we use the no-admixture model, another way to measure marker information content is to pursue a decision-theoretic approach. We adopt an assignment rule in which observing that one of the two alleles of a random individual is j leads to assignment to population i with probability dij. With P(Q=i)=1/K for all i, the aim is to choose an assignment rule, or a set of values of dij, that minimizes the expected value of the loss (Weiss 1961, p. 69).
Similarly to In, ORCA attains its maximal value, 1, when N ≥ K and no allele is found in more than one population. Also similarly to In, general prior assignment probabilities, P(Q=i)=qi, can be accommodated by replacing 1/K with qi in equations (9) and (10). Note that, for K=N=2 with p11 ≥ p21, we obtain (p11+p22)/2 for ORCA, or, because p22=1-p21,

$$\mathrm{ORCA} = \frac{1+\delta}{2}. \tag{11}$$
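The optimal rate of correct assignment can be sketched in a few lines. The code below assumes equal priors 1/K and a rule that assigns each observed allele to a population in which its frequency is highest, so that ORCA is the sum over alleles of the largest pij, divided by K; this matches the value (p11+p22)/2 given for K=N=2, but the function name and layout are illustrative.

```python
def orca(freqs):
    """Optimal rate of correct assignment for one locus.

    freqs[i][j] is the frequency of allele j in population i.
    Assumes equal prior probabilities 1/K for the K populations.
    """
    K = len(freqs)
    N = len(freqs[0])
    # For each allele, the best rule picks the population with the
    # highest frequency of that allele.
    return sum(max(freqs[i][j] for i in range(K)) for j in range(N)) / K


if __name__ == "__main__":
    # K = N = 2 with delta = 0.5: ORCA should equal (1 + delta) / 2.
    print(orca([[0.75, 0.25], [0.25, 0.75]]))
    # Identical frequencies: no better than guessing, 1/K.
    print(orca([[0.5, 0.5], [0.5, 0.5]]))
```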
Similarly to In, ORCA can be extended for evaluation of sets of many loci. Because the maximal correct assignment probability when (Hardy-Weinberg) diploid genotypes rather than individual alleles are assigned to populations equals
Informativeness for Ancestry Coefficients: the Admixture Model We have introduced two new measures that can facilitate assignment of individuals to populations. Often, however, a goal of ancestry inference is to estimate “ancestry coefficients” for an individual whose ancestry is from two or more populations (Rannala and Mountain 1997; Pritchard et al. 2000; Anderson and Thompson 2002). Such an individual has a vector of K ancestry coefficients that sum to 1, where the coefficient for population i gives the fraction of the individual’s genome that derives from population i. Ancestry is now a random vector Q rather than a discrete random variable. The mutual information Ia(Q;J) that quantifies the amount of information about Q provided by knowledge of J is of interest. With Q=(Q1,Q2,…,QK), where Qi is the (random) ancestry coefficient for the ith population and
Using the definition of mutual information for continuous random variables (Cover and Thomas 1991, chapter 9), Ia(Q;J) equals (see the appendix)
Number of Markers The statistic Ia suggests a way to prioritize markers for use in inference of ancestry coefficients, but it does not have a simple relationship with the number of markers needed for this inference. However, using a maximum-likelihood approach, it is possible to approximate this number of markers. Equation (13) gives the likelihood of ancestry coefficients (q1,q2,…,qK-1) in a haploid one-locus model, with qK=1-q1-…-qK-1. The expected Fisher information matrix, U, for the likelihood function has dimensions (K-1)×(K-1) and, for each i and i′, the (i,i′)th element equals (Millar 1991, eq. [A.3])
3, a straightforward transformation of U can enable inclusion of qK in the matrix (Millar 1991); for K=2, the maximum-likelihood estimates Using equation (15), in the (diploid) case of K=2, we obtain
p21, the largest value of equation (17) occurs at q1=(1-2p21)/(2δ), producing an upper bound for the approximate variance equal to
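As a rough numerical companion to the variance bound, the sketch below makes several assumptions that are not taken from the article's equations: a two-population admixture model, independent biallelic loci that all share the same p11 and p21, two independent allele copies per locus, and the usual large-sample variance from the inverse Fisher information. Under these assumptions the approximate variance of the estimate of q1 is p(1-p)/(2Lδ²), with p=q1·p11+(1-q1)·p21, and it is largest when p=1/2, that is, at q1=(1-2p21)/(2δ). All function names are hypothetical.

```python
import math


def approx_variance(q1, p11, p21, n_loci=1):
    """Approximate variance of the ML estimate of q1 (assumed model).

    Uses the inverse Fisher information for 2 * n_loci independent
    allele copies, each carrying allele 1 with probability
    p = q1 * p11 + (1 - q1) * p21.
    """
    delta = p11 - p21
    p = q1 * p11 + (1 - q1) * p21
    return p * (1 - p) / (2 * n_loci * delta ** 2)


def variance_upper_bound(delta, n_loci=1):
    """Largest value of the approximate variance, reached at p = 1/2."""
    return 1 / (8 * n_loci * delta ** 2)


def markers_needed(delta, target_sd):
    """Smallest number of loci whose variance bound is <= target_sd^2."""
    return math.ceil(1 / (8 * delta ** 2 * target_sd ** 2))


if __name__ == "__main__":
    p11, p21 = 0.8, 0.3
    delta = p11 - p21
    q1_worst = (1 - 2 * p21) / (2 * delta)
    print(approx_variance(q1_worst, p11, p21), variance_upper_bound(delta))
    print(markers_needed(0.5, 0.25))
```

Because the count is driven by the upper bound, it is conservative for individuals whose ancestry coefficients are far from the worst-case value of q1.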
The number of independent biallelic markers required for accurate estimation of ancestry coefficients in the two-population admixture model (table 3) is considerably larger than the number required for assignment in corresponding no-admixture models (Risch et al. 2002; Campbell et al. 2003). However, our computations assume that estimation of ancestry coefficients occurs by maximum likelihood; other estimation procedures or use of dependencies between markers might reduce the number of markers needed. In addition, since it is based on an upper bound for the variance and not on the general expression, equation (19) might further overestimate the number of markers needed. Note that, because only the upper bound and not the general expression is directly related to δ, markers with high values of In and Ia rather than of δ might often produce smaller variances at values of q1 relevant to individuals under consideration.

Estimation of Informativeness

In, ORCA, and Ia have been defined parametrically, as inherent properties of a marker together with a set of populations. In practice, however, estimates made from data rather than parametric allele frequencies must be used. For a given locus, let the number of copies of allele j observed at the locus in population i equal nij, and let the total number of observations in population i equal ni. A simple estimator of informativeness statistics is the count estimate, in which pij is estimated by nij/ni, and the estimated frequencies are used in place of the parametric frequencies in the definitions of the statistics. This estimator can produce biased estimates; consider, for example, two samples taken from the same population. Since allele counts in the two samples are likely to differ by chance, markers will have positive estimated In for distinguishing the two samples and estimated ORCA >1/2 when parametric informativeness equals 0 and parametric ORCA is 1/2.
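The upward bias of the count estimate can be illustrated by simulation. This is a minimal sketch under assumed settings (one locus, two samples of 40 allele copies each drawn from the same population, equal priors, natural logarithms); the function names, sample sizes, and frequencies are illustrative, not values from the article.

```python
import math
import random


def in_from_counts(counts):
    """Count estimate of I_n; counts[i][j] = copies of allele j in sample i."""
    K = len(counts)
    freqs = [[c / sum(row) for c in row] for row in counts]
    total = 0.0
    for j in range(len(counts[0])):
        p_j = sum(f[j] for f in freqs) / K
        if p_j > 0:
            total -= p_j * math.log(p_j)
        total += sum(f[j] / K * math.log(f[j]) for f in freqs if f[j] > 0)
    return total


def mean_null_estimate(true_freqs, n_copies=40, n_reps=200, seed=0):
    """Average estimated I_n between two samples from one population.

    The parametric informativeness is 0 here, so any positive average
    reflects the bias of the count estimate.
    """
    rng = random.Random(seed)
    alleles = list(range(len(true_freqs)))
    total = 0.0
    for _ in range(n_reps):
        counts = []
        for _sample in range(2):
            draws = rng.choices(alleles, weights=true_freqs, k=n_copies)
            counts.append([draws.count(a) for a in alleles])
        total += in_from_counts(counts)
    return total / n_reps


if __name__ == "__main__":
    print(mean_null_estimate([0.5, 0.3, 0.2]))
    # Larger samples shrink the bias.
    print(mean_null_estimate([0.5, 0.3, 0.2], n_copies=400))
```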
This bias is not of major concern when the goal is to compare informativeness estimates for different loci through use of the same sample (Brenner 1998), since a systematic bias affects all loci in a similar manner. In addition, in comparisons of locus informativeness across different samples, sample-specific biases should affect all loci similarly, and the relationship between informativeness estimates of a locus in two samples is preserved even if the estimates are biased. In, ORCA, and Ia have been defined using frequencies of alleles rather than of diploid genotypes. If alleles within individuals are not independent, so that within-population genotype frequencies do not correspond to Hardy-Weinberg proportions, the definitions can be applied treating J as a random diploid genotype, and the count estimates of genotype frequencies can be used in estimation. We do not consider this issue further. Data We consider various subsets (table 4) of a data set of 377 microsatellite markers—45 dinucleotides, 58 trinucleotides, and 274 tetranucleotides—genotyped in 1,056 individuals from 52 human populations (Cann et al. 2002; Rosenberg et al. 2002; Zhivotovsky et al. 2003; Human Diversity Panel Genotypes Web site; Human STRP Screening Sets Web site). Names of regions and regional affiliations of populations are the same as in the article by Rosenberg et al. (2002). At least 50 of the 377 markers are among those reported in table 1 of the article by Collins-Schramm et al. (2002), and at least 212 of them are included in table 1 of the article by Smith et al. (2001).
To compare informativeness for microsatellites and SNPs, we consider SNPs that have been studied in African Americans, European Americans, and East Asians and that were found to have few enough errors for use in analysis of population divergence (Akey et al. 2002; Joshua Akey's Homepage). From the Akey et al. (2002) data, we exclude several types of SNPs: (1) SNPs genotyped by the Whitehead Institute, whose European American sample differed from that used by the other genotyping centers; (2) SNPs with unknown sample sizes or with sample size <40 (20 individuals) in at least one of the three groups; (3) SNPs with unknown, nonunique, or nonautosomal map positions; and (4) SNPs whose frequencies were obtained by DNA pooling or for which one or more of the reported allele frequencies could not be expressed as a rounded quotient of an integer and the reported sample size. The 8,714 SNPs we use were all genotyped by Celera, Motorola, or Orchid. Statistical Properties of Informativeness In this section, we demonstrate that the proposed informativeness statistics are, indeed, useful measures. First, we show that the statistics In, ORCA, and Ia produce similar estimates, so that we can proceed using only one statistic, the informativeness for assignment. Second, we demonstrate that the In statistic is robust, in that rankings of locus informativeness do not vary greatly across resamples of the data. Third, we show that the In statistic does indeed measure ability to infer ancestry, in that population structure inference using markers of high informativeness requires fewer markers than inference using markers of low informativeness. Relationship between In, ORCA, and Ia For each of the data sets in table 4, we computed In, ORCA, and Ia for each of the loci, using the allele count estimates with equations (4), (10), and (14). For each data set, Spearman rank correlation coefficients (Gibbons 1985, p. 226) of locus In and ORCA, In and Ia, and ORCA and Ia values were computed. 
Loci for which two or more populations had an identical allele frequency estimate were not used in the latter two calculations, to avoid obtaining denominators of 0 in the computation of Ia. For the World-52, Central/South Asia, and East Asia data sets, in which many populations had the same sample sizes and therefore had numerous opportunities to produce equal allele frequency estimates in two or more populations, there were many such loci, and the correlation coefficients involving Ia were not computed. Rankings of loci by In, ORCA, and Ia were all highly correlated, with the largest correlations observed between In and Ia (table 4). Thus, for convenience, in the remainder of this article we restrict attention to estimates of In, or, simply, the informativeness, and we assume that all three measures have similar properties.

Locus Informativeness Rankings

For each of the 10 data sets (table 4), markers were ranked from highest to lowest estimated informativeness (table A [online only]). To assess the robustness of these rankings, or the extent to which they are affected by the particular choice of individuals included in the data, we performed bootstrap replicates. Individuals were resampled with replacement within groups, holding group sample sizes fixed. For each replicate, informativeness was estimated for each locus, and loci were ranked by estimated informativeness. In some replicates, for at least one group and one locus, the resample included only individuals who did not have genotypes at the locus. This situation arose only for data sets in which some groups had small sample sizes (≤10). In these data sets, it was possible for a resample to consist solely of copies of a few individuals. Thus, if these few individuals had no data at a locus, the resample also had no data. These replicates were discarded, and, for each data set, 1,000 resamples were retained.
For data sets in which all groups had larger sample sizes (>10), it was not necessary to discard any replicates.To assess the variability of In values across bootstrap replicates, for each locus, we computed the ratio of the SD of the bootstrap values of In to the value estimated from the data, and we averaged this quantity across loci. Three statistics were used to compare the locus informativeness rankings that were estimated from the data and those that were obtained in the bootstrap replicates: (1) the mean across replicates of RVd,Vb, where Vd denotes the vector of informativeness ranks based on the data, Vb denotes the vector of ranks based on the bth bootstrap replicate, and R denotes the Spearman rank correlation coefficient; (2) the Kendall coefficient of concordance of the 1,000 bootstrap replicates (Gibbons 1985, p. 250); (3) the mean across loci of the mean across replicates of the absolute deviation between the rank of a locus in the data and its rank in the replicate. This third statistic was also computed using only the 50 loci of highest estimated informativeness. Although informativeness fluctuated noticeably across replicates for individual loci, rankings in different replicates were highly concordant with each other and were highly correlated with the rankings based on the estimates from the data (table 5). Similar patterns of correlation across bootstrap replicates were observed for all of the data sets. The World-52, World-5, and World-7 data sets, which contained the most data, produced the most robust informativeness ranks and values; the least robust were found for Oceania, the smallest data set.
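The bootstrap procedure for ranks can be sketched as follows. This is a simplified illustration, assuming complete genotype data (no missing values, so no replicates need to be discarded), untied ranks in the Spearman formula, and the count estimate of In; the data layout and function names are assumptions made for the example.

```python
import math
import random


def in_from_counts(counts):
    """Count estimate of I_n; counts[i][j] = copies of allele j in group i."""
    K = len(counts)
    freqs = [[c / sum(row) for c in row] for row in counts]
    total = 0.0
    for j in range(len(counts[0])):
        p_j = sum(f[j] for f in freqs) / K
        if p_j > 0:
            total -= p_j * math.log(p_j)
        total += sum(f[j] / K * math.log(f[j]) for f in freqs if f[j] > 0)
    return total


def locus_ranks(groups, n_alleles):
    """Rank loci by estimated I_n (1 = most informative).

    groups[i] is a list of individuals; each individual is a list with
    one (allele, allele) pair of allele indices per locus.
    """
    scores = []
    for l, n in enumerate(n_alleles):
        counts = [[0] * n for _ in groups]
        for i, group in enumerate(groups):
            for individual in group:
                for allele in individual[l]:
                    counts[i][allele] += 1
        scores.append(in_from_counts(counts))
    order = sorted(range(len(scores)), key=lambda l: -scores[l])
    ranks = [0] * len(scores)
    for rank, l in enumerate(order, start=1):
        ranks[l] = rank
    return ranks


def spearman(r1, r2):
    """Spearman rank correlation for two untied rank vectors."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n * n - 1))


def bootstrap_mean_correlation(groups, n_alleles, n_reps=100, seed=0):
    """Mean Spearman correlation between data ranks and bootstrap ranks."""
    rng = random.Random(seed)
    data_ranks = locus_ranks(groups, n_alleles)
    total = 0.0
    for _ in range(n_reps):
        # Resample individuals with replacement within each group,
        # holding group sample sizes fixed.
        resample = [[rng.choice(g) for _ in g] for g in groups]
        total += spearman(data_ranks, locus_ranks(resample, n_alleles))
    return total / n_reps
```

Calling `bootstrap_mean_correlation(groups, n_alleles)` on a data set in this layout returns the first of the three summary statistics described in the text; the Kendall coefficient of concordance and the mean absolute rank deviation would need additional code.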
The fluctuation of ranks of individual loci across replicates indicates that exact ranks of loci (such as in tables A, B, C, D, and E [online only]) should be regarded with caution. However, fluctuations were small enough that the markers of highest informativeness usually had low ranks in bootstrap replicates (rightmost column of table 5). Thus, confidence can be placed in general statements such as a locus being “among the most informative markers” in a data set. Performance of Markers of High Informativeness in Ancestry Inference One way to test the utility of In as a measurement of the ability of a marker to infer ancestry is to check whether the population structure inferred using the markers of highest In more closely approximates the population structure inferred using all of the markers than does the population structure inferred using the markers of lowest In. Using the computer program structure (Pritchard et al. 2000; available from the Pritchard Lab Web site), with five clusters and the full data of 1,056 individuals, we previously found that the genetically inferred population structure corresponded fairly closely to the five regions in the World-5 data set (Rosenberg et al. 2002). Thus, if In indeed measures ability to infer ancestry, informativeness of a locus in the World-5 data set should correlate well with the contribution of the locus to population structure inference using five clusters. We therefore ran structure with all 1,056 individuals in the data, using the markers of highest informativeness for the World-5 data set. For various choices of the number of markers, M, five structure runs were performed with the M markers of highest In, and five runs were performed with the M most heterozygous markers (table S4 of Rosenberg et al. [2002]). Expected heterozygosity was used for comparison, because, among several statistics studied in a previous analysis (Rosenberg et al. 
2001), it produced the greatest reduction in the number of markers needed for inference. One run was performed with each of 20 random sets of M markers; for each value of M, random sets were chosen independently of the sets that were selected for the other values of M. Five runs were performed with the M markers of lowest In. All structure runs used five clusters, and, as in the study by Rosenberg et al. (2002), they employed the admixture model for individual ancestry (Pritchard et al. 2000), the F model for allele frequency correlations (Falush et al. 2003), and a burn-in period of length 20,000 followed by 10,000 iterations. The similarity coefficient C (Rosenberg et al. 2002) was used to compare runs with subsets of the markers against 10 runs that employed all 377 markers and were performed by Rosenberg et al. (2002). As in that study, the normalization required in the computation of C was based on the runs that used all of the markers. For each value of M, M<377, each of the 10 runs that used all 377 markers was compared with each of the five runs that used the M markers of highest informativeness, for a total of 50 comparisons. For M=377, the 90 pairwise comparisons of the 10 full-data runs were performed. For each M, the first quartile, median, and third quartile of the distribution of the 50 values were obtained (90 values for M=377). Comparisons to the full-data runs were made in an analogous manner, using the runs based on the least informative, most heterozygous, and random markers. For the random markers and M<377, the similarity coefficient distribution was based on 200 comparisons. Figure 3
Comparison of Rankings across Data Sets For pairs of data sets, we computed correlation coefficients of locus informativeness (table 6). Most pairs of rankings had correlations of at least 0.2. Markers that had high informativeness for inference of regional ancestry tended to be informative for inference within several regions. One exception was that informativeness in the America data set was not correlated with informativeness in the World-5 and World-7 data sets.
The highest correlations for pairs of regions occurred for regions that were geographically proximate, such as Central/South Asia and East Asia. All correlations for pairs of regions, among those that included two of Africa, Europe, Middle East, Central/South Asia, and East Asia, were larger than correlations that involved Oceania or America. The smallest correlation for a pair of regions was between informativeness in Africa and informativeness in America. Larger absolute levels of informativeness in Africa, America, and Oceania (fig. 5)
Most loci were ranked poorly in at least one data set (table A [online only]). D11S2000, D16S3401, D16S422, D21S2055, and D3S2427 were the only markers to rank among the 75 most informative in all data sets; note that D21S2055 was one of three loci identified by Zhivotovsky et al. (2003) as unusually variable. D13S285 and D7S1804 were highly informative in all seven regional data sets (rank ≤75) but were less informative in at least one of the three worldwide data sets (rank >75). Conversely, D14S1007, D1S235, D22S683, D2S1356, D8S560, D9S1779, D9S1871, NA-D18S-2, and NA-D5S-1 were highly informative in the worldwide data sets (rank ≤25) but were less informative in one or more of the regional data sets (rank >75).

Microsatellites and SNPs

Dinucleotide loci, which show the most variation among the markers in these data (Zhivotovsky et al. 2003), were generally more informative than tetranucleotide loci (table 7), consistent with the generally greater differentiation of dinucleotides across human populations (Ruiz Linares 1999; Rosenberg et al. 2003). Dinucleotides were usually also more informative than trinucleotides, but, in many cases, trinucleotides and tetranucleotides had similar levels of informativeness. However, for the worldwide data sets and for Africa, tetranucleotides were by far the least informative class of microsatellite. For example, although 73% of the loci were tetranucleotides, in the World-7 data set, the 25 loci of highest informativeness included only 7 tetranucleotides. Of the 100 loci of lowest informativeness in this data set, 97 were tetranucleotides.
To compare informativeness of microsatellites and SNPs, we determined the informativeness of microsatellites for assignment with three source populations: Africans, Europeans, and East Asians. For these groups, we also determined informativeness for each pairwise combination of source populations (tables B and C [online only]). Similarly, we estimated informativeness of SNPs among African Americans, European Americans, and East Asians. Because the individuals and populations in the microsatellite and SNP data sets were not the same, our comparison of microsatellite and SNP informativeness can only be regarded as approximate. Inclusion of some extremely isolated populations in the microsatellite data but not in the SNP data might exaggerate the relative informativeness of microsatellites. However, this effect might be counteracted by a SNP ascertainment procedure that produced greater divergence across populations than is characteristic of randomly chosen SNP markers (J. Akey, personal communication); the microsatellite data likely show little or no such effect (Rosenberg et al. 2002). The small-sample upward bias in informativeness might also impact relative informativeness estimates. For each set of source populations, randomly chosen microsatellites had greater informativeness than random SNPs (fig. 6).
One threshold proposed for declaring a SNP to be highly informative is δ=0.5 (Shriver et al. 1997), a value exceeded by 1.9%, 4.6%, and 2.7% of the SNPs (among those polymorphic in the relevant pair of populations) for African Americans and European Americans, African Americans and East Asians, and European Americans and East Asians, respectively. The value δ=0.5 corresponds to Fst ∈ [0.250,0.333] and In ∈ [0.131,0.216] (table 2); for corresponding comparisons, averaging across the three classes of loci, 26.0%, 42.0%, and 12.4% of microsatellites exceed the lower bound of In, and 5.9%, 10.2%, and 1.5% exceed the upper bound.

Discussion

In this article, we have introduced new statistics, In, ORCA, and Ia, for measuring the information provided by loci about ancestry. In, which is highly correlated with ORCA and Ia (table 4), is robust, in that it gives similar results in bootstrap replicates (table 5). The statistic is effective for inference of ancestry, in that population structure is more easily inferred using markers that have high values of In than using those that have low values (fig. 3).

Use of markers of highest informativeness is desirable for reduction of genotyping effort in such situations as forensics (Shriver et al. 1997; Lowe et al. 2001), admixture mapping (Dean et al. 1994; McKeigue 1998), and structured-association mapping (Pritchard and Donnelly 2001; Hoggart et al. 2003). In these scenarios, it is desirable to maximize information about individual ancestry at minimal cost. For the case of admixture mapping, the additional constraint that loci must be located in candidate regions of the genome applies; unlike other ancestry inference scenarios, admixture mapping makes use not of the ancestry of an individual as a whole but of particular parts of an individual genome. Thus, ideal marker sets for admixture mapping must have representation in regions of interest as well as high informativeness.
Highly informative markers are also useful in testing for population stratification in case-control genetic association studies (Pritchard and Rosenberg 1999), although the test does not use individual ancestry estimates. The goal is to determine whether cases and controls differ in ancestry to such an extent that an excess number of random markers will, by chance, be associated with disease status. Because they have the greatest potential to differentiate among ancestry groups, the most informative markers offer the greatest power to reject the null hypothesis of no genomewide allele-frequency differences between cases and controls; thus, their use offers a cautious approach in dealing with population stratification. If allele-frequency differences are detected, these markers are ideal for structured-association methods that employ individual ancestry estimates to avoid identifying the associations that result from ancestry differences rather than from true association with disease status (Pritchard and Donnelly 2001). The number of these markers needed for desired precision in estimated ancestry coefficients can be approximated using the maximum-likelihood model. Consequently, the panels in tables A, B, C, D, and E (online only) can provide a resource for tests of population stratification. For example, in European Americans, the test might use markers that are most informative for distinguishing among various types of European ancestry (table A [online only]); in Hispanic Americans, it might employ markers that are most informative for distinguishing European from Native American ancestry (table B [online only]) or for distinguishing European, Native American, and African ancestry (table C [online only]). 
Note that panels in tables A–E utilize groups and classifications that might not be identical to those needed in applications: for example, if ancestry inference in African Americans is of interest, the African and European groups in our data do not fully represent the groups from which African Americans have descended. However, we observed that informativeness in one region was often highly correlated with informativeness in another region (table 6; fig. 5). Two exceptions to the general pattern of correlation across data sets were Oceania and America, in which informativeness was not very highly correlated with informativeness in other regions. The small correlations likely indicate that many of the markers that are extremely variable in other regions by chance must not have been highly variable in founder groups of Oceania and the Americas. That informativeness patterns across di-, tri-, and tetranucleotides were different in Oceania and America from those of the other data sets suggests that bottlenecks were strong enough to obscure the typical patterns of variation for these three classes of markers. Thus, to identify a panel of markers that are generally useful for inference of regional ancestry and for population ancestry inference within regions (Hoggart et al. 2003), it is most difficult to find markers that are informative both within continental Eastern Hemisphere regions and within Oceania and the Americas. We have identified a small number of generally informative markers; many more loci will need to be screened if markers that are informative in every region are to be found. Alternatively, a general panel might be assembled by collecting markers useful for inference between specific pairs of groups.
Such a procedure may be advantageous, because, unlike sequential accumulation of generally informative markers, it avoids duplication of effort by accounting for the possibility that markers of high informativeness can provide information about ancestry in different ways. A systematic procedure to identify maximally informative sets or loci that are conditionally optimal, given the markers that have already been chosen, might use multilocus In, multilocus ORCA (eq. [12]), or a decision tree (Guinand et al. 2002).
Although random microsatellites are considerably more informative than random SNPs for distinguishing among pairs of populations, and highly informative loci constitute a greater fraction of microsatellites than of SNPs, the right-hand tail of the distribution of SNP informativeness crosses that of microsatellites (fig. 6). For a locus with N alleles, if N is at least as large as the number of source populations K, the informativeness of the locus can potentially be as large as log K, whereas, if N=2, the maximal informativeness is no larger than log 2, regardless of the value of K. Because microsatellites in the data of Rosenberg et al. (2002) have an average of 12.4 alleles, for relatively large values of K, microsatellites have greater potential for higher information content than SNPs, most of which are biallelic. Thus, for large K, the relative performance of microsatellites compared with SNPs will likely be greater than is seen in figure 6. Thus, for inferring ancestry among groups such as African Americans, European Americans, and East Asians, for which genomewide SNP allele frequencies have already been obtained (Akey et al. 2002), use of the most informative known SNPs is likely to be most efficient.
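The two bounds quoted above, log K for loci with many alleles and log 2 for biallelic loci, can be illustrated numerically. The sketch below again assumes that In is the mutual information (natural logarithm) between population label and allele with equally weighted populations; the grid search and all function names are hypothetical choices made only for illustration.

```python
import itertools
import math

def informativeness(freqs):
    """Informativeness for assignment of one locus (assumed definition).

    freqs[i][j] is the frequency of allele j in population i;
    the K populations are weighted equally.
    """
    K = len(freqs)
    total = 0.0
    for j in range(len(freqs[0])):
        p_bar = sum(pop[j] for pop in freqs) / K
        if p_bar > 0:
            total -= p_bar * math.log(p_bar)
        for pop in freqs:
            if pop[j] > 0:  # 0 * log(0) is taken as 0
                total += pop[j] / K * math.log(pop[j])
    return total

def private_allele_locus(K):
    """K populations, each fixed for its own allele (so N = K alleles)."""
    return [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]

def max_biallelic_informativeness(K, steps=10):
    """Largest In over a grid of biallelic frequencies in K populations."""
    grid = [s / steps for s in range(steps + 1)]
    best = 0.0
    for ps in itertools.product(grid, repeat=K):
        best = max(best, informativeness([[p, 1.0 - p] for p in ps]))
    return best

K = 4
print("N = K alleles:", informativeness(private_allele_locus(K)), "log K =", math.log(K))
print("N = 2 alleles:", max_biallelic_informativeness(K), "log 2 =", math.log(2))
```

With K=4, the multiallelic locus attains log 4, whereas no biallelic configuration on the grid exceeds log 2, which is consistent with the argument that the potential advantage of multiallelic markers grows with the number of source populations.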
However, because informativeness for distinguishing among populations such as different Native American groups is not correlated with informativeness for distinguishing among major regional groups (table 6), SNPs chosen by their informativeness in other scenarios will likely be considerably less useful in these populations than randomly chosen microsatellites. Until the most informative SNPs are identified for a set of populations of interest, use of microsatellites, especially dinucleotides, may lead to greater statistical efficiency in inference of ancestry. Of course, technical problems associated with dinucleotides (Ghebranious et al. 2003) might outweigh the efficiency that derives from their use, and factors such as laboratory fixed costs and difficulties in multiplexing might make the application of less informative markers more economical. The decision about which markers to use for inference of ancestry in any particular context should incorporate a combination of economic, technical, and statistical concerns.
Acknowledgments
N.A.R. was supported by an NSF Postdoctoral Fellowship in Biological Informatics. We thank J. Akey for assistance with the SNP data set, P. Calabrese and D. Conti for discussions, and E. Ziv and two reviewers for thoughtful comments on the manuscript.
Appendix: Informativeness for Ancestry Coefficients
Define
Footnotes
*This article is dedicated to the memory of Ryk Ward.
Electronic-Database Information
The URLs for data presented herein are as follows:
Human Diversity Panel Genotypes, Center for Medical Genetics, http://research.marshfieldclinic.org/genetics/Freq/FreqInfo.htm (for microsatellite genotypes).
Human STRP Screening Sets, Center for Medical Genetics, http://research.marshfieldclinic.org/genetics/sets/combo.html (for Marshfield panel 10).
Joshua Akey’s Homepage, http://cgi.uc.edu/~jakey/ (for SNP allele frequencies).
Noah Rosenberg’s Homepage, http://www.cmb.usc.edu/~noahr/distruct.html (for distruct software).
Pritchard Lab, http://pritch.bsd.uchicago.edu/ (for structure software).
References
Abramowitz MA, Stegun IA (1965) Handbook of mathematical functions. Dover, New York.
Akey JM, Zhang G, Zhang K, Jin L, Shriver MD (2002) Interrogating a high-density SNP map for signatures of natural selection. Genome Res 12:1805–1814. doi: 10.1101/gr.631202.
Anderson EC, Thompson EA (2002) A model-based method for identifying species hybrids using multilocus genetic data. Genetics 160:1217–1229.
Bamshad MJ, Wooding S, Watkins WS, Ostler CT, Batzer MA, Jorde LB (2003) Human population genetic structure and inference of group membership. Am J Hum Genet 72:578–589. doi: 10.1086/368061.
Banks MA, Eichert W, Olsen JB (2003) Which genetic loci have greater population assignment power? Bioinformatics 19:1436–1438. doi: 10.1093/bioinformatics/btg172.
Barron A, Rissanen J, Yu B (1998) The minimum description length principle in coding and modeling. IEEE Trans Inform Theory 44:2743–2760. doi: 10.1109/18.720554.
Brenner CH (1998) Difficulties in the estimation of ethnic affiliation. Am J Hum Genet 62:1558–1560. doi: 10.1086/301856.
Buchanan FC, Adams LJ, Littlejohn RP, Maddox JF, Crawford AM (1994) Determination of evolutionary relationships among sheep breeds using microsatellites. Genomics 22:397–403. doi: 10.1006/geno.1994.1401.
Campbell D, Duchesne P, Bernatchez L (2003) AFLP utility for population assignment studies: analytical investigation and empirical comparison with microsatellites. Mol Ecol 12:1979–1991. doi: 10.1046/j.1365-294X.2003.01856.x.
Cann HM, de Toma C, Cazes L, Legrand M-F, Morel V, Piouffre L, Bodmer J, et al (2002) A human genome diversity cell line panel. Science 296:261–262. doi: 10.1126/science.296.5566.261b.
Chakraborty R, Weiss KM (1988) Admixture as a tool for finding linked genes and detecting that difference from allelic association between loci. Proc Natl Acad Sci USA 85:9119–9123.
Collins-Schramm HE, Phillips CM, Operario DJ, Lee JS, Weber JL, Hanson RL, Knowler WC, et al (2002) Ethnic-difference markers for use in mapping by admixture linkage disequilibrium. Am J Hum Genet 70:737–750. doi: 10.1086/339368.
Cornuet J-M, Piry S, Luikart G, Estoup A, Solignac M (1999) New methods employing multilocus genotypes to select or exclude populations as origins of individuals. Genetics 153:1989–2000.
Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York.
Davies N, Villablanca FX, Roderick GK (1999) Determining the source of individuals: multilocus genotyping in nonequilibrium population genetics. Trends Ecol Evol 14:17–21. doi: 10.1016/S0169-5347(98)01530-4.
Dean M, Stephens JC, Winkler C, Lomb DA, Ramsburg M, Boaze R, Stewart C, et al (1994) Polymorphic admixture typing in human ethnic populations. Am J Hum Genet 55:788–808.
Elandt-Johnson RC (1971) Probability models and statistical methods in genetics. Wiley, New York.
Excoffier L (2001) Analysis of population subdivision. In: Balding DJ, Bishop M, Cannings C (eds) Handbook of statistical genetics. John Wiley & Sons, Chichester, UK, pp 271–307.
Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164:1567–1587.
Ghebranious N, Vaske D, Yu A, Zhao C, Marth G, Weber JL (2003) STRP screening sets for the human genome at 5 cM density. BMC Genomics 4:6. doi: 10.1186/1471-2164-4-6.
Gibbons JD (1985) Nonparametric statistical inference. 2nd ed. Marcel Dekker, New York.
Gomulkiewicz R, Brodziak JKT, Mangel M (1990) Ranking loci for genetic stock identification by curvature methods. Can J Fish Aquat Sci 47:611–619.
Guinand B, Topchy A, Page KS, Burnham-Curtis MK, Punch WF, Scribner KT (2002) Comparisons of likelihood and machine learning methods of individual classification. J Hered 93:260–269. doi: 10.1093/jhered/93.4.260.
Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, Clayton DG, McKeigue PM (2003) Control of confounding of genetic associations in stratified populations. Am J Hum Genet 72:1492–1504. doi: 10.1086/375613.
Kullback S (1959) Information theory and statistics. Dover, Mineola, New York.
Lowe AL, Urquhart A, Foreman LA, Evett IW (2001) Inferring ethnic origin by means of an STR profile. Forensic Sci Int 119:17–22. doi: 10.1016/S0379-0738(00)00387-X.
Manel S, Berthier P, Luikart G (2002) Detecting wildlife poaching: identifying the origin of individuals with Bayesian assignment tests and multilocus genotypes. Conserv Biol 16:650–659. doi: 10.1046/j.1523-1739.2002.00576.x.
McKeigue PM (1998) Mapping genes that underlie ethnic differences in disease risk: methods for detecting linkage in admixed populations, by conditioning on parental admixture. Am J Hum Genet 63:241–251. doi: 10.1086/301908.
Millar RB (1991) Selecting loci for genetic stock identification using maximum likelihood, and the connection with curvature methods. Can J Fish Aquat Sci 48:2173–2179.
Molokhia M, Hoggart C, Patrick AL, Shriver M, Parra E, Ye J, Silman AJ, et al (2003) Relation of risk of systemic lupus erythematosus to west African admixture in a Caribbean population. Hum Genet 112:310–318.
Paetkau D, Calvert W, Stirling I, Strobeck C (1995) Microsatellite analysis of population structure in Canadian polar bears. Mol Ecol 4:347–354.
Primmer CR, Koskinen MT, Piironen J (2000) The one that did not get away: individual assignment using microsatellite data detects a case of fishing competition fraud. Proc R Soc Lond B 267:1699–1704. doi: 10.1098/rspb.2000.1197.
Pritchard JK, Donnelly P (2001) Case-control studies of association in structured or admixed populations. Theor Popul Biol 60:227–237. doi: 10.1006/tpbi.2001.1543.
Pritchard JK, Rosenberg NA (1999) Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet 65:220–228. doi: 10.1086/302449.
Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959.
Ramachandran S, Rosenberg NA, Zhivotovsky LA, Feldman MW. Robustness of the inference of human population structure: a comparison of X-chromosomal and autosomal microsatellites. Hum Genomics (in press).
Rannala B, Mountain JL (1997) Detecting immigration by using multilocus genotypes. Proc Natl Acad Sci USA 94:9197–9201. doi: 10.1073/pnas.94.17.9197.
Risch N, Burchard E, Ziv E, Tang H (2002) Categorization of humans in biomedical research: genes, race and disease. Genome Biol 3:comment2007.
Rosenberg N, Stong R (2003) Problem 11039. Amer Math Monthly 110:743.
Rosenberg NA, Burke T, Elo K, Feldman MW, Freidlin PJ, Groenen MAM, Hillel J, et al (2001) Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics 159:699–713.
Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW (2002) Genetic structure of human populations. Science 298:2381–2385. doi: 10.1126/science.1078311.
——— (2003) Response to comment on “Genetic structure of human populations.” Science 300:1877.
Ruiz Linares A (1999) Microsatellites and the reconstruction of the history of human populations. In: Goldstein DB, Schlötterer C (eds) Microsatellites: evolution and applications. Oxford University Press, Oxford, pp 183–197.
Shriver MD, Smith MW, Jin L, Marcini A, Akey JM, Deka R, Ferrell RE (1997) Ethnic-affiliation estimation by use of population-specific DNA markers. Am J Hum Genet 60:957–964.
Smith MW, Lautenberger JA, Shin HD, Chretien J-P, Shrestha S, Gilbert DA, O’Brien SJ (2001) Markers for mapping by admixture linkage disequilibrium in African American and Hispanic populations. Am J Hum Genet 69:1080–1094. doi: 10.1086/323922.
Smouse PE, Spielman RS, Park MH (1982) Multiple-locus allocation of individuals to groups as a function of the genetic variation within and differences among human populations. Am Nat 119:445–463. doi: 10.1086/283925.
Sokal RR, Rohlf FJ (1995) Biometry. 3rd ed. Freeman, New York.
Stephens JC, Smith MW, Shin HD, O’Brien SJ (1999) Tracking linkage disequilibrium in admixed populations with MALD using microsatellite loci. In: Goldstein DB, Schlötterer C (eds) Microsatellites: evolution and applications. Oxford University Press, Oxford, pp 211–224.
Weir BS (1996) Genetic data analysis II. Sinauer, Sunderland, MA.
Weiss L (1961) Statistical decision theory. McGraw-Hill, New York.
Zhivotovsky LA, Rosenberg NA, Feldman MW (2003) Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers. Am J Hum Genet 72:1171–1186. doi: 10.1086/375120.
Ziv E, Burchard EG (2003) Human population structure and genetic association studies. Pharmacogenomics 4:431–441.