
# Informativeness of Genetic Markers for Inference of Ancestry^{*}

^{1}Program in Molecular and Computational Biology, University of Southern California, Los Angeles;

^{2}Institute of Biological Anthropology, University of Oxford, Oxford, United Kingdom; and

^{3}Department of Human Genetics, University of Chicago, Chicago

## Abstract

Inference of individual ancestry is useful in various applications, such as admixture mapping and structured-association mapping. Using information-theoretic principles, we introduce a general measure, the informativeness for assignment (*I*_{n}), applicable to any number of potential source populations, for determining the amount of information that multiallelic markers provide about individual ancestry. In a worldwide human microsatellite data set, we identify markers of highest informativeness for inference of regional ancestry and for inference of population ancestry within regions; these markers, which are listed in online-only tables in our article, can be useful both in testing for and in controlling the influence of ancestry on case-control genetic association studies. Markers that are informative in one collection of source populations are generally informative in others. Informativeness of random dinucleotides, the most informative class of microsatellites, is five to eight times that of random single-nucleotide polymorphisms (SNPs), but 2%–12% of SNPs have higher informativeness than the median for dinucleotides. Our results can aid in decisions about the type, quantity, and specific choice of markers for use in studies of ancestry.

## Introduction

Inference of individual ancestry from genetic markers is helpful in diverse situations, including admixture and association mapping, forensics, prediction of medical risks, wildlife management, and studies of dispersal, gene flow, and evolutionary history (Shriver et al. ^{1997}; Davies et al. ^{1999}; Primmer et al. ^{2000}; Manel et al. ^{2002}; Bamshad et al. ^{2003}; Campbell et al. ^{2003}; Ziv and Burchard ^{2003}). Statistical methods for ancestry inference use multilocus genotypes and population allele frequencies, either specified in advance or estimated during the inference process, to assign populations of origin to individuals (Smouse et al. ^{1982}; Paetkau et al. ^{1995}; Rannala and Mountain ^{1997}; Cornuet et al. ^{1999}; Pritchard et al. ^{2000}; Guinand et al. ^{2002}).

Because use of highly informative markers can reduce the amount of genotyping required for ancestry inference, it is desirable to measure the extent to which specific markers contribute to this inference. Several approaches have previously been used for measuring these locus contributions (table 1). However, despite their various features in specific scenarios, all of these measures are either difficult to compute, not designed specifically for estimating marker information content, or not applicable to sets with many potential source populations.

Here, using information-theoretic and decision-theoretic approaches, we introduce new criteria: the *informativeness for assignment,* the *optimal rate of correct assignment,* and the *informativeness for ancestry coefficients*. The choice of statistic for use in identifying markers for ancestry inference depends on the inference algorithm that is being used (table 1). The new statistics, as convenient and statistically motivated general measures applicable to any number of alleles and populations, may be useful both in admixed and in multisource human groups, such as those that have formed in the Western Hemisphere by the intermixing of Africans, Native Americans, and Europeans. We first define the statistics, consider their relationships with δ and *F*_{st} (two criteria that are often used to measure marker information content), and study the number of markers needed for inference. We demonstrate that the criteria are highly correlated and proceed using only the informativeness for assignment, or, simply, the *informativeness*. Informativeness rankings of loci in human microsatellite data are found to be robust, and use of markers of highest informativeness is observed to reduce the number of markers needed for inference of population structure. We consider the relationships of informativeness values in different subsets of the human population, and the relative informativeness of microsatellites and SNPs. Tables A, B, C, D, and E (online only) provide lists of informativeness ranks for various sets of source populations.

## Theory

Consider populations *i*=1,2,…,*K* and loci *l*=1,2,…,*L*, with *K*≥2 and *L*≥1. Locus *l* has alleles *j*=1,2,…,*N*^{(l)}. The relative frequency for allele *j* of locus *l* in population *i* is *p*^{(l)}_{ij}; this quantity represents a parametric rather than a sample frequency. The (parametric) average frequency of allele *j* at locus *l* is defined as

$$p^{(l)}_{j} = \frac{1}{K}\sum_{i=1}^{K} p^{(l)}_{ij}. \tag{1}$$

We use “*log*” to denote the natural logarithm, with 0*log*0=0.

### Informativeness for Assignment: the No-Admixture Model

In the no-admixture model, individuals are each assumed to originate from one of *K* populations. Suppose we are given a random individual, whose (random) population assignment is *Q,* with *Q*∈{1,2,…,*K*}. The probability that the individual belongs to population *i* is ℙ(*Q*=*i*); we assume that each population has the same initial probability of being the source of an unknown individual, so that ℙ(*Q*=*i*)=1/*K* for all *i*. The (random) genotype of one of the individual’s two alleles at locus *l* is *J*^{(l)}.

Our aim is to measure the amount of “information” gained about *Q* from knowledge of *J*^{(l)} and to compare this quantity across different values of *l*. This measurement can be performed in a natural way, through use of an information-theoretic framework. If the value of *J*^{(l)} is unknown, there is uncertainty, or entropy, regarding the value of the random variable *Q*. Once the value of *J*^{(l)} is known, the entropy of *Q* decreases. The reduction in uncertainty about *Q* due to knowledge of *J*^{(l)} is the *mutual information,* *I*_{n}(*Q*;*J*^{(l)})=*H*_{n}(*Q*)-*H*_{n}(*Q*|*J*^{(l)}), where *H*_{n}(*Q*) is the initial entropy of *Q,* and *H*_{n}(*Q*|*J*^{(l)}) is the conditional entropy of *Q* given knowledge of *J*^{(l)} (the subscript “n” refers to the *no-admixture* model). Using standard definitions (Cover and Thomas ^{1991}, chapter 2), and leaving off superscripts for convenience, we have

$$H_n(Q) = -\sum_{i=1}^{K}\mathbb{P}(Q=i)\log\mathbb{P}(Q=i) = \log K, \tag{2}$$

$$H_n(Q|J) = -\sum_{j=1}^{N}\mathbb{P}(J=j)\sum_{i=1}^{K}\mathbb{P}(Q=i|J=j)\log\mathbb{P}(Q=i|J=j) = -\sum_{j=1}^{N}p_j\sum_{i=1}^{K}\frac{p_{ij}}{Kp_j}\log\frac{p_{ij}}{Kp_j}, \tag{3}$$

and

$$I_n(Q;J) = \sum_{j=1}^{N}\left(-p_j\log p_j + \sum_{i=1}^{K}\frac{p_{ij}}{K}\log p_{ij}\right). \tag{4}$$

We refer to *I*_{n}(*Q*;*J*) as the *informativeness for assignment*. For a given set of populations, the minimal *I*_{n} of 0 occurs when all alleles have equal frequencies in all populations. The maximal value, log *K*, occurs when *N*≥*K* and no allele is found in more than one population.
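As an illustration, the following Python sketch evaluates the informativeness for assignment for one locus. It assumes that equation (4) has the form *I*_{n} = Σ_{j}(−*p*_{j} log *p*_{j} + Σ_{i}(*p*_{ij}/*K*) log *p*_{ij}), with uniform priors; the function names and the example frequencies are hypothetical and are not taken from the data.

```python
import math

def xlogx(x):
    """Return x*log(x), using the convention 0*log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def informativeness_for_assignment(freqs):
    """I_n for one locus; freqs[i][j] is the frequency of allele j in population i."""
    K = len(freqs)
    total = 0.0
    for column in zip(*freqs):          # one column per allele j
        p_j = sum(column) / K           # average frequency of allele j
        total += -xlogx(p_j) + sum(xlogx(p) for p in column) / K
    return total

# Hypothetical examples: identical frequencies, and fully private alleles.
identical = [[0.3, 0.7], [0.3, 0.7]]
private = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
print(informativeness_for_assignment(identical))
print(informativeness_for_assignment(private), math.log(2))
```

The two examples correspond to the minimal value, 0, and the maximal value, log *K*, described above.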

This measure of the amount of “information” about the ancestry *Q* contained in the genotype *J* also arises from a likelihood approach. The quantity $\sum_{j=1}^{N}\sum_{i=1}^{K}(p_{ij}/K)\log p_{ij}$ can be viewed as the expected log-likelihood associated with drawing an allele randomly from the set of populations {1,2,…,*K*}. The term $\sum_{j=1}^{N}p_j\log p_j$ is the expected log-likelihood associated with drawing an allele from a hypothetical “average” population whose allele frequencies equal the mean across the *K* populations. Thus, equation (4) gives the expected logarithm of the likelihood ratio whose numerator is the likelihood that an allele is assigned to one of the populations and whose denominator is the likelihood that an allele is assigned to the “average” population. When the minimum description length principle (Barron et al. ^{1998}) is used, up to a constant, *I*_{n} also equals the expected reduction, upon observation of *J,* in the length of the optimal coding of the random variable *Q*. It gives the average (taken across populations) Kullback-Leibler information (Kullback ^{1959}, p. 6) for distinguishing population-specific allele frequency distributions from the distribution for the “average population.” For *K*=2, *I*_{n} is similar to a previously proposed statistic based on Kullback-Leibler information (table 1).

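The expression for *I*_{n}(*Q*;*J*) also has a close correspondence with the *G* statistic obtained from a contingency table, each of whose *N* columns of *K* elements gives the relative frequencies of an allele in the *K* populations. This *G* statistic is given by (Sokal and Rohlf ^{1995}, p. 737)

$$G_n = 2\left[\sum_{i=1}^{K}\sum_{j=1}^{N}p_{ij}\log p_{ij} - \sum_{i=1}^{K}\Big(\sum_{j=1}^{N}p_{ij}\Big)\log\Big(\sum_{j=1}^{N}p_{ij}\Big) - \sum_{j=1}^{N}\Big(\sum_{i=1}^{K}p_{ij}\Big)\log\Big(\sum_{i=1}^{K}p_{ij}\Big) + \Big(\sum_{i=1}^{K}\sum_{j=1}^{N}p_{ij}\Big)\log\Big(\sum_{i=1}^{K}\sum_{j=1}^{N}p_{ij}\Big)\right].$$

Using equation (1) and $\sum_{j=1}^{N}p_{ij}=1$, we can simplify to

$$G_n = 2K\,I_n(Q;J).$$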

Thus, for any number of populations and alleles, identifying loci of high informativeness for assignment is equivalent to identifying loci with large values of *G*_{n}.
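The proportionality between *G*_{n} and *I*_{n} can be checked numerically. The sketch below assumes the usual "sum of *f* log *f*" form of the *G* statistic for a two-way table (cells, minus row totals, minus column totals, plus grand total) applied directly to relative frequencies, and it assumes the constant of proportionality 2*K*; the frequencies are hypothetical.

```python
import math

def xlogx(x):
    """Return x*log(x), using the convention 0*log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def g_statistic(table):
    """G for a two-way table whose rows are populations and columns are alleles."""
    cells = sum(xlogx(f) for row in table for f in row)
    rows = sum(xlogx(sum(row)) for row in table)
    cols = sum(xlogx(sum(col)) for col in zip(*table))
    grand = xlogx(sum(sum(row) for row in table))
    return 2.0 * (cells - rows - cols + grand)

def informativeness_for_assignment(freqs):
    """I_n for one locus, with uniform priors on the K populations."""
    K = len(freqs)
    return sum(-xlogx(sum(col) / K) + sum(xlogx(p) for p in col) / K
               for col in zip(*freqs))

# Hypothetical frequencies for K = 3 populations and N = 3 alleles.
freqs = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5], [0.1, 0.1, 0.8]]
K = len(freqs)
print(g_statistic(freqs), 2 * K * informativeness_for_assignment(freqs))
```

Ranking loci by either quantity gives the same order, because the factor 2*K* does not depend on the locus.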

Note that *I*_{n}(*Q*;*J*) is a sum over alleles. The contribution of allele *j* to informativeness for assignment is

$$I_n(Q;J=j) = -p_j\log p_j + \sum_{i=1}^{K}\frac{p_{ij}}{K}\log p_{ij}. \tag{5}$$

For fixed *p*_{j}, the minimal allelic informativeness of 0 occurs when *p*_{ij}=*p*_{j} for all *i*. For fixed *p*_{j}≤1/*K*, the maximum, *p*_{j} log *K*, occurs when *p*_{ij}=*Kp*_{j} for exactly one value of *i* and *p*_{ij}=0 for all other values.

Similarly, it is possible to write *I*_{n}(*Q*;*J*) as a sum over populations, with the contribution of population *i* to the informativeness for assignment equaling

$$I_n(Q=i;J) = \frac{1}{K}\sum_{j=1}^{N}p_{ij}\log\frac{p_{ij}}{p_j}. \tag{6}$$

Equations (5) and (6) enable calculation of the specific contribution of an allele or population to *I*_{n}. These computations are useful, since alleles at a locus might differ in their importance for assignment of individuals to populations. Populations might also differ in their degree of difference from the “average” population, so that assignment to some populations is easier than assignment to others. Henceforth, we use “informativeness” only in relation to the *I*_{n} statistic (and the *I*_{a} statistic, to be defined later), although we use “informative” and “information” more generally, to describe “ability to infer ancestry.” Note that, if the prior assignment of individuals is not uniformly distributed, general priors, ℙ(*Q*=*i*)=*q*_{i}, can be accommodated by replacing ℙ(*Q*=*i*) with *q*_{i} in the derivations of equations (2) and (3) and 1/*K* with *q*_{i} in equations (1), (4), (5), and (6).
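A short Python sketch of the two decompositions follows. It assumes that the allelic contribution is −*p*_{j} log *p*_{j} + Σ_{i}(*p*_{ij}/*K*) log *p*_{ij} and that the population contribution is (1/*K*)Σ_{j} *p*_{ij} log(*p*_{ij}/*p*_{j}), both under uniform priors; function names and frequencies are hypothetical.

```python
import math

def xlogx(x):
    """Return x*log(x), using the convention 0*log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def allele_contributions(freqs):
    """Contribution of each allele j to I_n; freqs[i][j] as in the text."""
    K = len(freqs)
    return [-xlogx(sum(col) / K) + sum(xlogx(p) for p in col) / K
            for col in zip(*freqs)]

def population_contributions(freqs):
    """Contribution of each population i to I_n."""
    K = len(freqs)
    averages = [sum(col) / K for col in zip(*freqs)]
    result = []
    for row in freqs:
        # Terms with p_ij = 0 contribute 0 by the 0*log(0) convention.
        total = sum(p * math.log(p / avg) for p, avg in zip(row, averages) if p > 0)
        result.append(total / K)
    return result

# Hypothetical locus: allele 0 is private to population 0.
freqs = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
by_allele = allele_contributions(freqs)
by_population = population_contributions(freqs)
print(by_allele, by_population)
```

Both lists sum to the same value of *I*_{n}, so either view can be used to see which alleles or populations drive the informativeness of a locus.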

*I*_{n} can also be extended for use in assignment to populations of multilocus diploid genotypes rather than of single alleles. For convenience, we treat diploid genotypes as being ordered so that (*j*_{1},*j*_{2}) differs from (*j*_{2},*j*_{1}) (an unordered genotype is assigned randomly to one of the possible ordered genotypes). If we assume both Hardy-Weinberg proportions and independence of loci within populations, equation (4) applies with $\prod_{l=1}^{L}p^{(l)}_{ij^{(l)}_1}p^{(l)}_{ij^{(l)}_2}$ in place of *p*_{ij}, $\frac{1}{K}\sum_{i=1}^{K}\prod_{l=1}^{L}p^{(l)}_{ij^{(l)}_1}p^{(l)}_{ij^{(l)}_2}$ in place of *p*_{j}, and with the sum taken over all $\prod_{l=1}^{L}[N^{(l)}]^{2}$ possible multilocus genotypes, {(*j*^{(1)}_{1},*j*^{(1)}_{2}),(*j*^{(2)}_{1},*j*^{(2)}_{2}),…,(*j*^{(L)}_{1},*j*^{(L)}_{2})}. Although this sum is difficult to evaluate if the number of possible multilocus genotypes is large, it can, in principle, predict the informativeness of multilocus sets; note that informativeness is not additive over loci, since loci that are independent within populations may still contribute to ancestry inference in a correlated manner.
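The multilocus extension can be sketched by brute-force enumeration, which is feasible only for very small examples. The code below assumes Hardy-Weinberg proportions, independence of loci within populations, and ordered diploid genotypes, and it substitutes the product of allele frequencies for *p*_{ij} in the single-locus formula; the data layout `loci[l][i][j]` is a hypothetical choice.

```python
import itertools
import math

def xlogx(x):
    """Return x*log(x), using the convention 0*log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def multilocus_informativeness(loci):
    """I_n for ordered diploid multilocus genotypes.

    loci[l][i][j] is the frequency of allele j at locus l in population i.
    """
    K = len(loci[0])
    # All ordered pairs (j1, j2) of alleles at each locus.
    per_locus = [list(itertools.product(range(len(locus[0])), repeat=2))
                 for locus in loci]
    total = 0.0
    for genotype in itertools.product(*per_locus):
        probs = []
        for i in range(K):
            prob = 1.0
            for locus, (j1, j2) in zip(loci, genotype):
                prob *= locus[i][j1] * locus[i][j2]
            probs.append(prob)
        average = sum(probs) / K
        total += -xlogx(average) + sum(xlogx(p) for p in probs) / K
    return total

# Hypothetical fully diagnostic locus for two populations.
diagnostic = [[1.0, 0.0], [0.0, 1.0]]
one_locus = multilocus_informativeness([diagnostic])
two_loci = multilocus_informativeness([diagnostic, diagnostic])
print(one_locus, two_loci)
```

In this example a second diagnostic locus adds nothing, because the first already removes all uncertainty about *Q*; this is one simple way to see that informativeness is not additive over loci.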

### Relationship of *I*_{n} to δ and *F*_{st}

For the simplest case in which informativeness is of interest, namely *K*=*N*=2, it is possible to relate *I*_{n} to δ and to *F*_{st}. In the mathematical development, we let δ equal the signed difference between the frequencies of allele 1 in two populations, *p*_{11}-*p*_{21}, and, without loss of generality, we assume that *p*_{11}≥*p*_{21}, so that δ∈[0,1]; when applied to data, it is implicit that δ refers to the absolute difference, |*p*_{11}-*p*_{21}|. Denoting σ=*p*_{11}+*p*_{21}, we must have σ∈[δ,2-δ]. Simplifying equation (4) in terms of δ and σ, we obtain (fig. 1*A*)

$$I_n = \frac{1}{4}\Big[(\sigma+\delta)\log(\sigma+\delta) + (\sigma-\delta)\log(\sigma-\delta) + (2-\sigma+\delta)\log(2-\sigma+\delta) + (2-\sigma-\delta)\log(2-\sigma-\delta)\Big] - \frac{1}{2}\Big[\sigma\log\sigma + (2-\sigma)\log(2-\sigma)\Big]. \tag{7}$$

As it should be, *I*_{n} is invariant with respect to a transposition of alleles (δ→-δ and σ→2-σ), of populations (δ→-δ), or of both alleles and populations (σ→2-σ).

For *K*=2 and *N*=2, *F*_{st} for a locus (henceforth used interchangeably with *F*), can be written as (modified from p. 167 of Weir ^{1996})

$$F = \frac{\delta^{2}}{\sigma(2-\sigma)}. \tag{8}$$

Solving equation (8) for δ and using equation (7), we can express *I*_{n} in terms of *F* and σ (fig. 1*B*).

For a fixed value of δ or *F,* a biallelic marker is best able to infer ancestry if one of its alleles is absent in one of the populations: if the random genotype *J* equals this allele, there is no uncertainty about the origin of an individual. In other words, for a fixed δ (Campbell et al. ^{2003}) or *F,* the markers with the greatest ability to infer ancestry have a value of σ near one of the extremes. The statistic *I*_{n} captures this aspect of ancestry inference ability (fig. 1; table 2): for a fixed δ, informativeness declines from its maximum at σ=δ to its minimum at σ=1 and then climbs to a second maximum at σ=2-δ; for a fixed *F,* the minimal informativeness is at σ=1, and the maxima are at σ=2*F*/(1+*F*) and at σ=2/(1+*F*).
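The dependence of informativeness on σ for a fixed δ can be explored with a few lines of Python. The sketch recovers the allele frequencies from δ and σ (*p*_{11}=(σ+δ)/2, *p*_{21}=(σ-δ)/2) and applies the general single-locus formula; it also assumes that equation (8) has the form *F*=δ^{2}/[σ(2-σ)]. Function names are hypothetical.

```python
import math

def xlogx(x):
    """Return x*log(x), using the convention 0*log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def informativeness_biallelic(delta, sigma):
    """I_n for K = N = 2, parameterized by delta = p11 - p21 and sigma = p11 + p21."""
    p11 = (sigma + delta) / 2
    p21 = (sigma - delta) / 2
    freqs = [[p11, 1 - p11], [p21, 1 - p21]]
    return sum(-xlogx(sum(col) / 2) + sum(xlogx(p) for p in col) / 2
               for col in zip(*freqs))

def fst_biallelic(delta, sigma):
    """Assumed form of F for K = N = 2."""
    return delta ** 2 / (sigma * (2 - sigma))

delta = 0.4
for sigma in (0.4, 0.7, 1.0, 1.3, 1.6):
    print(sigma, informativeness_biallelic(delta, sigma), fst_biallelic(delta, sigma))
```

For δ=0.4 the printed values are largest at the two extremes, σ=δ and σ=2-δ, and smallest at σ=1, in line with the pattern described above.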

A comparison of figure 1*A* and 1*B* demonstrates that informativeness varies less across values of σ for a fixed *F* than for a fixed δ; thus, *I*_{n} is more closely related to *F* than to δ. By considering the difference between the maximal and minimal informativeness over values of σ (table 2), it can be shown that the value of *F* predicts the value of *I*_{n} to within 0.0417, whereas δ predicts *I*_{n} only to within 0.0849. The mean difference between the upper and lower bounds on *I*_{n}, given the value of *F,* is 0.0282; the corresponding mean difference between the upper and lower bounds on *I*_{n}, given δ, is more than twice as large, equaling 0.0569 (fig. 2*A* and 2*B*).

Figure 2. *I*_{n}, δ, and *F*_{st} for 8,714 SNPs, based on allele frequency estimates in African Americans and European Americans. *A,* *I*_{n} vs. δ. *B,* *I*_{n} vs. *F*_{st}. *C,* *F*_{st} vs. δ. Upper and lower bounds for the dependent …

An additional consequence of equation (8) and the requirement that σ∈[δ,2-δ] is that δ can be used to predict *F* fairly accurately, and vice versa (fig. 2*C*). This is useful in cases in which only one of these two measures has been reported. Given δ and allowing σ to vary over [δ,2-δ], *F* ranges from a minimum of δ^{2}, when σ=1, to a maximum of δ/(2-δ), when σ=δ or σ=2-δ. Thus, either δ^{2} or δ/(2-δ) can be regarded as a substitute for *F;* for any values of δ and σ, it can be shown that δ/(2-δ)-0.0902≤*F*≤δ^{2}+0.0902. The maximal discrepancy between the two approximations, 0.0902, is attained when δ≈0.3820 (table 2). The mean of the two bounds, or (δ+2δ^{2}-δ^{3})/(4-2δ), is always within 0.0451 of *F*. The accuracy of such simple approximations as *F*≈δ^{2} and *F*≈δ/(2-δ) is perhaps somewhat surprising.

Predictions of δ from *F* are slightly less accurate than the reverse predictions. Given *F,* as σ ranges over [2*F*/(1+*F*),2/(1+*F*)], δ ranges from a minimum of 2*F*/(1+*F*), at σ=2*F*/(1+*F*) or σ=2/(1+*F*), to a maximum of √*F*, at σ=1. The maximal width of this range of δ, given *F,* is 0.1349, and it is attained at *F*≈0.0874.
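The numerical constants quoted for these bounds can be reproduced with a simple grid search. The sketch assumes *F*=δ^{2}/[σ(2-σ)], so that, given δ, *F* lies between δ^{2} and δ/(2-δ), and, given *F*, δ lies between 2*F*/(1+*F*) and √*F*; the grid spacing of 0.0001 is an arbitrary choice.

```python
import math

def f_bounds_given_delta(delta):
    """Lower and upper bounds on F for a given delta."""
    return delta ** 2, delta / (2 - delta)

def delta_bounds_given_f(f):
    """Lower and upper bounds on delta for a given F."""
    return 2 * f / (1 + f), math.sqrt(f)

grid = [k / 10000 for k in range(1, 10000)]

# Largest gap between the two approximations to F, and where it occurs.
gap, delta_star = max(
    (f_bounds_given_delta(d)[1] - f_bounds_given_delta(d)[0], d) for d in grid)

# Largest width of the range of delta given F, and where it occurs.
width, f_star = max(
    (delta_bounds_given_f(f)[1] - delta_bounds_given_f(f)[0], f) for f in grid)

print(round(gap, 4), round(delta_star, 4))
print(round(width, 4), round(f_star, 4))
```

The first maximum can also be found analytically: setting the derivative of δ/(2-δ)-δ^{2} to zero gives δ=(3-√5)/2≈0.3820.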

### Optimal Rate of Correct Assignment

If we use the no-admixture model, another way to measure marker information content is to pursue a decision-theoretic approach. We adopt an assignment rule in which observing that one of the two alleles of a random individual is *j* leads to assignment to population *i* with probability *d*_{ij}, where $\sum_{i=1}^{K}d_{ij}=1$ for each *j*. We also choose a cost function *c*_{ii′}, which gives the penalty for assignment to population *i*^{′} when the correct population is *i*. Under the assumption of a uniform prior on the population of origin, or ℙ(*Q*=*i*)=1/*K* for all *i,* the aim is to choose an assignment rule, or a set of values of *d*_{ij}, that minimizes the expected value of the loss (Weiss ^{1961}, p. 69), or

$$\sum_{i=1}^{K}\frac{1}{K}\sum_{j=1}^{N}p_{ij}\sum_{i'=1}^{K}d_{i'j}\,c_{ii'}. \tag{9}$$

If *c*_{ii′} is taken to equal 0 when *i*=*i*^{′} and 1 otherwise, minimizing equation (9) is equivalent to maximizing the probability of correct assignment, or $\sum_{j=1}^{N}\sum_{i=1}^{K}\frac{p_{ij}}{K}d_{ij}$. The *optimal rate of correct assignment* (ORCA) is the probability of correct assignment when the optimal rule is used. To determine this rule, note that, for allele *j,* the maximum of the linear function $\sum_{i=1}^{K}\frac{p_{ij}}{K}d_{ij}$ over the set $\{(d_{1j},\ldots,d_{Kj}) : d_{ij}\ge 0,\ \sum_{i=1}^{K}d_{ij}=1\}$ must occur at one of the vertices of the set. Consequently, this maximum occurs when each allele is always assigned to the population in which it is most frequent, and it equals $\max_{i}\,p_{ij}/K$. Adding across alleles, we obtain

$$\mathrm{ORCA}(Q;J) = \sum_{j=1}^{N}\max_{i\in\{1,\ldots,K\}}\frac{p_{ij}}{K}. \tag{10}$$

Similarly to *I*_{n}, the minimal value of ORCA, 1/*K*, occurs when all alleles have equal frequencies in all populations, and the maximal value, 1, occurs when *N*≥*K* and no allele is found in more than one population. Also similarly to *I*_{n}, general prior assignment probabilities, ℙ(*Q*=*i*)=*q*_{i}, can be accommodated by replacing 1/*K* with *q*_{i} in equations (9) and (10). Note that, for *K*=*N*=2 with *p*_{11}≥*p*_{21}, we obtain (*p*_{11}+*p*_{22})/2 for ORCA, or

$$\mathrm{ORCA} = \frac{1+\delta}{2}. \tag{11}$$

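Similarly to *I*_{n}, ORCA can be extended for evaluation of sets of many loci. Because the maximal correct assignment probability when (Hardy-Weinberg) diploid genotypes rather than individual alleles are assigned to populations equals $\sum_{j_1=1}^{N}\sum_{j_2=1}^{N}\max_{i}\,p_{ij_1}p_{ij_2}/K$, when multilocus diploid genotypes (at loci that are independent within populations) are assigned, the corresponding probability is

$$\sum_{(j^{(1)}_1,j^{(1)}_2),\ldots,(j^{(L)}_1,j^{(L)}_2)}\ \max_{i\in\{1,\ldots,K\}}\ \frac{1}{K}\prod_{l=1}^{L}p^{(l)}_{ij^{(l)}_1}p^{(l)}_{ij^{(l)}_2}. \tag{12}$$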

Equation (12) can, in principle, predict the probabilities of correct assignment of sets of one or more loci in procedures (Buchanan et al. ^{1994}; Paetkau et al. ^{1995}; Banks et al. ^{2003}) that assign multilocus genotypes to their most likely source populations.
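For a single locus, the optimal rule and the resulting rate of correct assignment can be sketched in Python as follows. The code assumes uniform priors, 0-1 loss, and the form ORCA = Σ_{j} max_{i} *p*_{ij}/*K*; ties are broken in favor of the lowest population index, which is an arbitrary choice, and the frequencies are hypothetical.

```python
def optimal_rule(freqs):
    """For each allele j, the index of the population in which it is most frequent."""
    return [max(range(len(col)), key=lambda i: col[i]) for col in zip(*freqs)]

def orca(freqs):
    """Optimal rate of correct assignment for one locus; freqs[i][j] as in the text."""
    K = len(freqs)
    return sum(max(col) for col in zip(*freqs)) / K

# Hypothetical biallelic marker with p11 = 0.8 and p21 = 0.3, so delta = 0.5.
freqs = [[0.8, 0.2], [0.3, 0.7]]
print(optimal_rule(freqs), orca(freqs))
```

For this marker, allele 1 is assigned to the first population and allele 2 to the second, and the rate agrees with (1+δ)/2 for δ=0.5.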

### Informativeness for Ancestry Coefficients: the Admixture Model

We have introduced two new measures that can facilitate assignment of individuals to populations. Often, however, a goal of ancestry inference is to estimate “ancestry coefficients” for an individual whose ancestry is from two or more populations (Rannala and Mountain ^{1997}; Pritchard et al. ^{2000}; Anderson and Thompson ^{2002}). Such an individual has a vector of *K* ancestry coefficients that sum to 1, where the coefficient for population *i* gives the fraction of the individual’s genome that derives from population *i*. Ancestry is now a random vector *Q* rather than a discrete random variable. The mutual information *I*_{a}(*Q*;*J*) that quantifies the amount of information about *Q* provided by knowledge of *J* is of interest. With *Q*=(*Q*_{1},*Q*_{2},…,*Q*_{K}), where *Q*_{i} is the (random) ancestry coefficient for the *i*th population and $\sum_{i=1}^{K}Q_i=1$, we have

$$\mathbb{P}\big(J=j \mid Q=(q_1,q_2,\ldots,q_K)\big) = \sum_{i=1}^{K}q_i\,p_{ij}. \tag{13}$$

Similarly to the no-admixture case, any assumptions could be made for the initial probability distribution of *Q*. That is, any distribution defined on the set $\{(q_1,q_2,\ldots,q_K) : q_i\in[0,1],\ \sum_{i=1}^{K}q_i=1\}$ is suitable. For simplicity, we assume that this distribution is uniform: all collections of ancestry coefficients that sum to 1 are a priori equally likely.

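Using the definition of mutual information for continuous random variables (Cover and Thomas ^{1991}, chapter 9), *I*_{a}(*Q*;*J*) equals (see the appendix)

$$I_a(Q;J) = \sum_{j=1}^{N}\left(-p_j\log p_j + \frac{1}{K}\sum_{i=1}^{K}\frac{p_{ij}^{K}\log p_{ij}}{\prod_{i'\neq i}(p_{ij}-p_{i'j})}\right) + 1 - \frac{|S^{(2)}_{K+1}|}{K!}, \tag{14}$$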

where *S*^{(2)}_{K+1} is a Stirling number of the first kind (Abramowitz and Stegun ^{1965}, p. 833) (the first six values of |*S*^{(2)}_{K+1}|, starting at *K*=2, are 3, 11, 50, 274, 1,764, and 13,068). The quantity *I*_{a} is termed the *“informativeness for ancestry coefficients.”* *I*_{a} has a similar multilocus extension to that of *I*_{n}, and, for *K*=*N*=2, relationships of *I*_{a} to δ and *F* are qualitatively similar to the corresponding relationships of *I*_{n} to δ and *F* (not shown). For example, for fixed δ, the maxima of *I*_{a} over σ occur at σ=δ and σ=2-δ, and the minimum is at σ=1.
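A Python sketch of the informativeness for ancestry coefficients is given below. It assumes the closed form *I*_{a} = Σ_{j}[−*p*_{j} log *p*_{j} + (1/*K*)Σ_{i} *p*_{ij}^{K} log *p*_{ij}/Π_{i′≠i}(*p*_{ij}−*p*_{i′j})] + 1 − |*S*^{(2)}_{K+1}|/*K*!, obtained here by integrating over a uniform prior on the ancestry coefficients. It also uses the identity |*S*^{(2)}_{K+1}|/*K*! = 1 + 1/2 + … + 1/*K*, which reproduces the six values listed above. The formula is undefined when two populations share an allele frequency, and the sketch does not handle that case.

```python
import math

def xlogx(x):
    """Return x*log(x), using the convention 0*log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def informativeness_for_ancestry(freqs):
    """I_a for one locus under a uniform prior on ancestry coefficients.

    freqs[i][j] is the frequency of allele j in population i. Raises
    ZeroDivisionError if two populations have the same frequency for an allele.
    """
    K = len(freqs)
    harmonic = sum(1.0 / m for m in range(1, K + 1))  # equals |S(K+1, 2)| / K!
    total = 1.0 - harmonic
    for col in zip(*freqs):
        p_j = sum(col) / K
        term = 0.0
        for i, p in enumerate(col):
            if p > 0:  # terms with p_ij = 0 vanish
                denom = math.prod(p - other for k, other in enumerate(col) if k != i)
                term += p ** K * math.log(p) / denom
        total += -xlogx(p_j) + term / K
    return total

# Hypothetical fully diagnostic biallelic marker for two populations.
print(informativeness_for_ancestry([[1.0, 0.0], [0.0, 1.0]]), math.log(2) - 0.5)
```

Even a fully diagnostic marker yields only log 2 − 1/2 here, less than its *I*_{n} of log 2, because a single allele cannot pin down a continuous ancestry coefficient.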

### Number of Markers

The statistic *I*_{a} suggests a way to prioritize markers for use in inference of ancestry coefficients, but it does not have a simple relationship with the number of markers needed for this inference. However, using a maximum-likelihood approach, it is possible to approximate this number of markers. Equation (13) gives the likelihood of ancestry coefficients (*q*_{1},*q*_{2},…,*q*_{K-1}) in a haploid one-locus model, with *q*_{K}=1-*q*_{1}-…-*q*_{K-1}. The expected Fisher information matrix, *U*, for the likelihood function has dimensions (*K*-1)×(*K*-1) and, for each *i* and *i*^{′}, the (*i*,*i*^{′})th element equals (Millar ^{1991}, eq. [A.3])

$$U_{ii'} = \sum_{j=1}^{N}\frac{(p_{ij}-p_{Kj})(p_{i'j}-p_{Kj})}{\sum_{k=1}^{K}q_k\,p_{kj}}. \tag{15}$$

Multiplying equation (15) by 2, we can obtain the corresponding value for (Hardy-Weinberg) diploids. When standard maximum-likelihood theory is used (Elandt-Johnson ^{1971}), the variance-covariance matrix of the ancestry coefficient maximum-likelihood estimates is approximated by *U*^{-1}. For *K*≥3, a straightforward transformation of *U* can enable inclusion of *q*_{K} in the matrix (Millar ^{1991}); for *K*=2, the maximum-likelihood estimates $\hat{q}_1$ and $\hat{q}_2$ have equal variances.

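Using equation (15), in the (diploid) case of *K*=2, we obtain

$$\mathrm{Var}(\hat{q}_1) \approx \left[2\sum_{j=1}^{N}\frac{(p_{1j}-p_{2j})^{2}}{q_1p_{1j}+(1-q_1)p_{2j}}\right]^{-1}. \tag{16}$$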

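For biallelic markers (*N*=2), equation (16) reduces to

$$\mathrm{Var}(\hat{q}_1) \approx \frac{[q_1p_{11}+(1-q_1)p_{21}]\,[1-q_1p_{11}-(1-q_1)p_{21}]}{2\delta^{2}}. \tag{17}$$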

If we consider all possible values of *q*_{1} and assume that *p*_{11}>*p*_{21}, the largest value of equation (17) occurs at *q*_{1}=(1-2*p*_{21})/(2δ), producing an upper bound for the approximate variance equal to

$$\frac{1}{8\delta^{2}}. \tag{18}$$

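Because the information matrix for a set of loci that are independent within populations is the sum of the matrices for the individual loci, the number of independent markers, all with the same value δ, that are required to achieve $\mathrm{Var}(\hat{q}_1)\le\varepsilon$, where ε denotes the target variance, is

$$\frac{1}{8\delta^{2}\varepsilon}. \tag{19}$$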

Using equation (19), 35 biallelic markers with δ=0.6 are necessary for achieving an SD of 0.1, in agreement with a previous suggestion of ~40 such markers (Hoggart et al. ^{2003}).
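The marker count in this example can be reproduced directly. The sketch assumes that equation (19) has the form 1/(8δ^{2}ε), with ε the target variance (the square of the target SD), and it rounds up to the next whole marker, which is an assumption of this sketch; the function name is hypothetical.

```python
import math

def markers_needed(delta, target_sd):
    """Number of independent biallelic markers with the given delta needed so that
    the upper bound on the approximate variance of the estimated ancestry
    coefficient does not exceed target_sd squared (K = 2, diploid)."""
    return math.ceil(1.0 / (8.0 * delta ** 2 * target_sd ** 2))

print(markers_needed(0.6, 0.1))  # 35
```

Because the count scales as 1/δ^{2}, halving δ roughly quadruples the number of markers required for the same SD.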

The number of independent biallelic markers required for accurate estimation of ancestry coefficients in the two-population admixture model (table 3) is considerably larger than the number required for assignment in corresponding no-admixture models (Risch et al. ^{2002}; Campbell et al. ^{2003}). However, our computations assume that estimation of ancestry coefficients occurs by maximum likelihood; other estimation procedures or use of dependencies between markers might reduce the number of markers needed. In addition, since it is based on an upper bound for the variance and not on the general expression, equation (19) might further overestimate the number of markers needed. Note that, because only the upper bound and not the general expression is directly related to δ, markers with high values of *I*_{n} and *I*_{a} rather than of δ might often produce smaller variances at values of *q*_{1} relevant to individuals under consideration.
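The gap between the general variance expression and its upper bound can be examined numerically. The sketch assumes that the approximate variance for one diploid locus with *K*=2 is the inverse of twice the haploid Fisher information, Σ_{j}(*p*_{1j}−*p*_{2j})^{2}/[*q*_{1}*p*_{1j}+(1−*q*_{1})*p*_{2j}], and that the bound is 1/(8δ^{2}); the marker frequencies are hypothetical.

```python
def variance_q1_diploid(p1, p2, q1):
    """Approximate variance of the estimate of q1 from one diploid locus (K = 2).

    p1[j] and p2[j] are the frequencies of allele j in populations 1 and 2.
    """
    info = 0.0
    for a, b in zip(p1, p2):
        mixture = q1 * a + (1 - q1) * b   # allele frequency in the admixed individual
        if mixture > 0:
            info += (a - b) ** 2 / mixture
    return 1.0 / (2.0 * info)

def variance_upper_bound(delta):
    """Upper bound on the approximate variance for a biallelic marker."""
    return 1.0 / (8.0 * delta ** 2)

# Hypothetical biallelic marker with delta = 0.6.
p1, p2 = [0.8, 0.2], [0.2, 0.8]
for q1 in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(q1, variance_q1_diploid(p1, p2, q1), variance_upper_bound(0.6))
```

For this marker the bound is attained at *q*_{1}=0.5 and is noticeably conservative for individuals whose ancestry is mostly from one population.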

### Estimation of Informativeness

*I*_{n}*,* ORCA, and *I*_{a} have been defined parametrically, as inherent properties of a marker together with a set of populations. In practice, however, estimates made from data rather than parametric allele frequencies must be used. For a given locus, let the number of copies of allele *j* observed at the locus in population *i* equal *n*_{ij}*,* and let the total number of observations in population *i* equal *n*_{i}. A simple estimator of informativeness statistics is the count estimate, in which *p*_{ij} is estimated by *n*_{ij}/*n*_{i}, and the estimated values are inserted in place of the parametric values.

This estimator can produce biased estimates; consider, for example, two samples taken from the same population. Since allele counts in the two samples are likely to differ by chance, markers will have positive estimated *I*_{n} for distinguishing the two samples and estimated ORCA >1/2 when parametric informativeness equals 0 and parametric ORCA is 1/2. This bias is not of major concern when the goal is to compare informativeness estimates for different loci through use of the same sample (Brenner ^{1998}), since a systematic bias affects all loci in a similar manner. In addition, in comparisons of locus informativeness across different samples, sample-specific biases should affect all loci similarly, and the relationship between informativeness estimates of a locus in two samples is preserved even if the estimates are biased.
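The count estimate and its upward bias can be illustrated with a small example. The allele counts below are hypothetical and are meant to mimic two samples of 20 allele copies drawn from the same population; the formulas for *I*_{n} and ORCA assume uniform priors.

```python
import math

def xlogx(x):
    """Return x*log(x), using the convention 0*log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def count_estimates(counts):
    """Estimate p_ij by n_ij / n_i; counts[i][j] is the count of allele j in sample i."""
    return [[n / sum(row) for n in row] for row in counts]

def informativeness_for_assignment(freqs):
    K = len(freqs)
    return sum(-xlogx(sum(col) / K) + sum(xlogx(p) for p in col) / K
               for col in zip(*freqs))

def orca(freqs):
    K = len(freqs)
    return sum(max(col) for col in zip(*freqs)) / K

# Hypothetical counts: the two rows differ only by sampling chance.
counts = [[12, 8], [9, 11]]
estimated = count_estimates(counts)
print(informativeness_for_assignment(estimated), orca(estimated))
```

Although the parametric values for two samples from one population are 0 and 1/2, the estimates are positive and greater than 1/2, respectively.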

*I*_{n}*,* ORCA, and *I*_{a} have been defined using frequencies of alleles rather than of diploid genotypes. If alleles within individuals are not independent, so that within-population genotype frequencies do not correspond to Hardy-Weinberg proportions, the definitions can be applied treating *J* as a random diploid genotype, and the count estimates of genotype frequencies can be used in estimation. We do not consider this issue further.

## Data

We consider various subsets (table 4) of a data set of 377 microsatellite markers (45 dinucleotides, 58 trinucleotides, and 274 tetranucleotides) genotyped in 1,056 individuals from 52 human populations (Cann et al. ^{2002}; Rosenberg et al. ^{2002}; Zhivotovsky et al. ^{2003}; Human Diversity Panel Genotypes Web site; Human STRP Screening Sets Web site). Names of regions and regional affiliations of populations are the same as in the article by Rosenberg et al. (^{2002}). At least 50 of the 377 markers are among those reported in table 1 of the article by Collins-Schramm et al. (^{2002}), and at least 212 of them are included in table 1 of the article by Smith et al. (^{2001}).

To compare informativeness for microsatellites and SNPs, we consider SNPs that have been studied in African Americans, European Americans, and East Asians and that were found to have few enough errors for use in analysis of population divergence (Akey et al. ^{2002}; Joshua Akey's Homepage). From the Akey et al. (^{2002}) data, we exclude several types of SNPs: (1) SNPs genotyped by the Whitehead Institute, whose European American sample differed from that used by the other genotyping centers; (2) SNPs with unknown sample sizes or with sample size <40 (20 individuals) in at least one of the three groups; (3) SNPs with unknown, nonunique, or nonautosomal map positions; and (4) SNPs whose frequencies were obtained by DNA pooling or for which one or more of the reported allele frequencies could not be expressed as a rounded quotient of an integer and the reported sample size. The 8,714 SNPs we use were all genotyped by Celera, Motorola, or Orchid.

## Statistical Properties of Informativeness

In this section, we demonstrate that the proposed informativeness statistics are, indeed, useful measures. First, we show that the statistics *I*_{n}*,* ORCA, and *I*_{a} produce similar estimates, so that we can proceed using only one statistic, the *informativeness for assignment*. Second, we demonstrate that the *I*_{n} statistic is robust, in that rankings of locus informativeness do not vary greatly across resamples of the data. Third, we show that the *I*_{n} statistic does indeed measure ability to infer ancestry, in that population structure inference using markers of high informativeness requires fewer markers than inference using markers of low informativeness.

### Relationship between *I*_{n}, ORCA, and *I*_{a}

For each of the data sets in table 4, we computed *I*_{n}, ORCA, and *I*_{a} for each of the loci, using the allele count estimates with equations (4), (10), and (14). For each data set, Spearman rank correlation coefficients (Gibbons ^{1985}, p. 226) of locus *I*_{n} and ORCA, *I*_{n} and *I*_{a}, and ORCA and *I*_{a} values were computed. Loci for which two or more populations had an identical allele frequency estimate were not used in the latter two calculations, to avoid obtaining denominators of 0 in the computation of *I*_{a}. For the World-52, Central/South Asia, and East Asia data sets, in which many populations had the same sample sizes and therefore had numerous opportunities to produce equal allele frequency estimates in two or more populations, there were many such loci, and the correlation coefficients involving *I*_{a} were not computed.

Rankings of loci by *I*_{n}*,* ORCA, and *I*_{a} were all highly correlated, with the largest correlations observed between *I*_{n} and *I*_{a} (table 4). Thus, for convenience, in the remainder of this article we restrict attention to estimates of *I*_{n}*,* or, simply, the *informativeness,* and we assume that all three measures have similar properties.

### Locus Informativeness Rankings

For each of the 10 data sets (table 4), markers were ranked from highest to lowest estimated informativeness (table A [online only]). To assess the robustness of these rankings, or the extent to which they are affected by the particular choice of individuals included in the data, we performed bootstrap replicates.

Individuals were resampled with replacement within groups, holding group sample sizes fixed. For each replicate, informativeness was estimated for each locus, and loci were ranked by estimated informativeness. In some replicates, for at least one group and one locus, the resample included only individuals who did not have genotypes at the locus. This situation arose only for data sets in which some groups had small sample sizes (≤10). In these data sets, it was possible for a resample to consist solely of copies of a few individuals. Thus, if these few individuals had no data at a locus, the resample also had no data. For each data set, excluding these replicates, which were discarded, 1,000 resamples were performed. For data sets in which all groups had larger sample sizes (>10), it was not necessary to discard any replicates.
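The resampling scheme can be sketched as follows. The representation of an individual as a tuple of per-locus genotypes, with `None` marking a missing genotype, is a hypothetical choice for illustration, as are the group names and the small number of replicates; the seed is fixed only to make the sketch repeatable.

```python
import random

def bootstrap_within_groups(groups, rng):
    """Resample individuals with replacement within each group, keeping group sizes."""
    return {name: [rng.choice(members) for _ in members]
            for name, members in groups.items()}

def has_data_everywhere(resample, n_loci):
    """Return False if some group has no observed genotype at some locus."""
    for members in resample.values():
        for locus in range(n_loci):
            if all(individual[locus] is None for individual in members):
                return False
    return True

# Hypothetical data: two loci; group "B" is small and has a missing genotype.
groups = {
    "A": [((1, 2), (3, 3)), ((1, 1), (3, 4)), ((2, 2), (4, 4))],
    "B": [((1, 2), None), ((2, 2), (3, 4))],
}
rng = random.Random(0)
replicates = []
while len(replicates) < 5:          # the study used 1,000 retained replicates
    candidate = bootstrap_within_groups(groups, rng)
    if has_data_everywhere(candidate, n_loci=2):
        replicates.append(candidate)  # replicates lacking data are discarded
print(len(replicates))
```

Each retained replicate would then be passed to the informativeness estimator, and the loci ranked as for the original data.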

To assess the variability of *I*_{n} values across bootstrap replicates, for each locus, we computed the ratio of the SD of the bootstrap values of *I*_{n} to the value estimated from the data, and we averaged this quantity across loci. Three statistics were used to compare the locus informativeness rankings that were estimated from the data and those that were obtained in the bootstrap replicates: (1) the mean across replicates of *R*_{Vd,Vb}*,* where *V*_{d} denotes the vector of informativeness ranks based on the data, *V*_{b} denotes the vector of ranks based on the *b*th bootstrap replicate, and *R* denotes the Spearman rank correlation coefficient; (2) the Kendall coefficient of concordance of the 1,000 bootstrap replicates (Gibbons ^{1985}, p. 250); (3) the mean across loci of the mean across replicates of the absolute deviation between the rank of a locus in the data and its rank in the replicate. This third statistic was also computed using only the 50 loci of highest estimated informativeness.
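The three rank-comparison statistics can be sketched in plain Python. The code assumes no ties among informativeness values (ties would be broken by position here, whereas a full treatment would use midranks), uses the textbook no-ties formulas for the Spearman coefficient and the Kendall coefficient of concordance, and operates on hypothetical informativeness values.

```python
def ranks(values):
    """Rank 1 is the highest value; ties are broken by position (a simplification)."""
    order = sorted(range(len(values)), key=lambda k: -values[k])
    result = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        result[k] = rank
    return result

def spearman(r1, r2):
    """Spearman rank correlation for two rank vectors without ties."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall_w(rank_lists):
    """Kendall coefficient of concordance for m rankings of n loci (no ties)."""
    m, n = len(rank_lists), len(rank_lists[0])
    sums = [sum(col) for col in zip(*rank_lists)]
    mean = m * (n + 1) / 2
    s = sum((x - mean) ** 2 for x in sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

def mean_abs_rank_deviation(data_ranks, replicate_ranks):
    """Mean over loci and replicates of |rank in data - rank in replicate|."""
    n = len(data_ranks)
    total = sum(abs(a - b) for rep in replicate_ranks for a, b in zip(data_ranks, rep))
    return total / (n * len(replicate_ranks))

# Hypothetical informativeness values for four loci, in the data and in two replicates.
data_ranks = ranks([0.30, 0.10, 0.20, 0.05])
replicate_ranks = [ranks([0.28, 0.12, 0.21, 0.04]), ranks([0.19, 0.11, 0.25, 0.06])]
print(data_ranks, replicate_ranks)
print([spearman(data_ranks, r) for r in replicate_ranks])
print(kendall_w(replicate_ranks), mean_abs_rank_deviation(data_ranks, replicate_ranks))
```

Restricting the last function to the loci of highest estimated informativeness gives the variant of the third statistic computed for the top 50 loci.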

Although informativeness fluctuated noticeably across replicates for individual loci, rankings in different replicates were highly concordant with each other and were highly correlated with the rankings based on the estimates from the data (table 5). Similar patterns of correlation across bootstrap replicates were observed for all of the data sets. The World-52, World-5, and World-7 data sets, which contained the most data, produced the most robust informativeness ranks and values; the least robust were found for Oceania, the smallest data set.

The fluctuation of ranks of individual loci across replicates indicates that exact ranks of loci (such as in tables A, B, C, D, and E [online only]) should be regarded with caution. However, fluctuations were small enough that the markers of highest informativeness usually had low ranks in bootstrap replicates (rightmost column of table 5). Thus, confidence can be placed in general statements such as a locus being “among the most informative markers” in a data set.

### Performance of Markers of High Informativeness in Ancestry Inference

One way to test the utility of *I*_{n} as a measurement of the ability of a marker to infer ancestry is to check whether the population structure inferred using the markers of highest *I*_{n} more closely approximates the population structure inferred using all of the markers than does the population structure inferred using the markers of lowest *I*_{n}. Using the computer program *structure* (Pritchard et al. ^{2000}; available from the Pritchard Lab Web site), with five clusters and the full data of 1,056 individuals, we previously found that the genetically inferred population structure corresponded fairly closely to the five regions in the World-5 data set (Rosenberg et al. ^{2002}). Thus, if *I*_{n} indeed measures ability to infer ancestry, informativeness of a locus in the World-5 data set should correlate well with the contribution of the locus to population structure inference using five clusters.

We therefore ran *structure* with all 1,056 individuals in the data, using the markers of highest informativeness for the World-5 data set. For various choices of the number of markers, *M,* five *structure* runs were performed with the *M* markers of highest *I*_{n}*,* and five runs were performed with the *M* most heterozygous markers (table S4 of Rosenberg et al. [^{2002}]). Expected heterozygosity was used for comparison, because, among several statistics studied in a previous analysis (Rosenberg et al. ^{2001}), it produced the greatest reduction in the number of markers needed for inference. One run was performed with each of 20 random sets of *M* markers; for each value of *M,* random sets were chosen independently of the sets that were selected for the other values of *M*. Five runs were performed with the *M* markers of lowest *I*_{n}. All *structure* runs used five clusters, and, as in the study by Rosenberg et al. (^{2002}), they employed the admixture model for individual ancestry (Pritchard et al. ^{2000}), the *F* model for allele frequency correlations (Falush et al. ^{2003}), and a burn-in period of length 20,000 followed by 10,000 iterations.
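The four marker-selection criteria compared here (highest *I*_{n}, lowest *I*_{n}, highest expected heterozygosity, and random sets) can be sketched as below. The code only builds the marker lists; it does not run *structure*. The function names, the dictionary layout, and the toy scores are hypothetical, and expected heterozygosity is taken as 1 − Σ*p*^{2}, without any small-sample correction.

```python
import random


def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity 1 - sum(p^2) for one locus (no sample-size correction)."""
    return 1.0 - sum(p * p for p in allele_freqs)


def select_markers(scores, m, highest=True):
    """Return the m loci with the highest (or lowest) score; scores maps locus -> value."""
    ordered = sorted(scores, key=scores.get, reverse=highest)
    return ordered[:m]


def random_marker_sets(loci, m, n_sets=20, seed=0):
    """Draw n_sets independent random sets of m loci, one set per run."""
    rng = random.Random(seed)
    return [rng.sample(loci, m) for _ in range(n_sets)]


# Toy scores for three hypothetical loci.
informativeness_scores = {"locus1": 0.30, "locus2": 0.10, "locus3": 0.20}
most_informative = select_markers(informativeness_scores, 2)
least_informative = select_markers(informativeness_scores, 2, highest=False)
random_sets = random_marker_sets(sorted(informativeness_scores), 2)
```

Passing a fresh seed for each value of *M* would keep the random sets for different *M* independent of one another, as in the design described above.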

The similarity coefficient *C* (Rosenberg et al. ^{2002}) was used to compare runs with subsets of the markers against 10 runs that employed all 377 markers and were performed by Rosenberg et al. (^{2002}). As in that study, the normalization required in the computation of *C* was based on the runs that used all of the markers. For each value of *M,* *M*<377, each of the 10 runs that used all 377 markers was compared with each of the five runs that used the *M* markers of highest informativeness, for a total of 50 comparisons. For *M*=377, the 90 pairwise comparisons of the 10 full-data runs were performed. For each *M,* the first quartile, median, and third quartile of the distribution of the 50 values were obtained (90 values for *M*=377). Comparisons to the full-data runs were made in an analogous manner, using the runs based on the least informative, most heterozygous, and random markers. For the random markers and *M*<377, the similarity coefficient distribution was based on 200 comparisons.

Figure 3 indicates that, in general, fewer loci chosen according to the highest informativeness were required than random loci for inferring a population structure similar to that obtained with all the loci. This pattern was observed especially for small and intermediate values of *M;* although similarity coefficients at these *M* often varied considerably across runs, runs based on the markers of highest *I*_{n} generally produced greater similarity coefficients than those based on random or highly heterozygous markers. For larger values of *M*, the difference in similarity coefficients across criteria was less pronounced, partly because the sets of markers chosen by different criteria had greater overlap than for small values of *M*. However, runs that used the markers of lowest *I*_{n} produced similarity coefficients that were considerably smaller than those obtained by the other sets of markers. Many more of the markers of lowest *I*_{n} than of those of highest *I*_{n} were required to obtain inferred population structures that were visually similar to that inferred using the full data (fig. 4). Thus, high informativeness is a useful indicator of the ability of a marker to infer ancestry; more dramatically, low informativeness suggests that a locus is not of great utility for inference of ancestry.

*distruct* (available from Noah Rosenberg's Homepage). Each individual is represented by a thin vertical line, which is partitioned

**...**

## Comparison of Rankings across Data Sets

For pairs of data sets, we computed correlation coefficients of locus informativeness (table 6). Most pairs of rankings had correlations of at least 0.2. Markers that had high informativeness for inference of regional ancestry tended to be informative for inference within several regions. One exception was that informativeness in the America data set was not correlated with informativeness in the World-5 and World-7 data sets.
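A table of pairwise correlations such as table 6 can be computed along the following lines. The type of correlation coefficient is not specified in this passage, so the sketch assumes the Pearson coefficient, computed over the loci shared by each pair of data sets; the function names and input layout are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_table(informativeness_by_dataset):
    """Map each pair of data sets to the correlation of informativeness across shared loci.

    informativeness_by_dataset maps data set name -> {locus: informativeness}.
    """
    names = sorted(informativeness_by_dataset)
    table = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = sorted(set(informativeness_by_dataset[a]) & set(informativeness_by_dataset[b]))
            xs = [informativeness_by_dataset[a][locus] for locus in shared]
            ys = [informativeness_by_dataset[b][locus] for locus in shared]
            table[(a, b)] = pearson(xs, ys)
    return table
```

Replacing the informativeness values with their ranks before calling `pearson` would give a rank correlation instead, if that is the quantity of interest.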

The highest correlations for pairs of regions occurred for regions that were geographically proximate, such as Central/South Asia and East Asia. All correlations for pairs of regions, among those that included two of Africa, Europe, Middle East, Central/South Asia, and East Asia, were larger than correlations that involved Oceania or America. The smallest correlation for a pair of regions was between informativeness in Africa and informativeness in America. Larger absolute levels of informativeness in Africa, America, and Oceania (fig. 5) are consistent with the greater observed differentiation among populations in these regions (Rosenberg et al. ^{2002}).

Most loci were ranked poorly in at least one data set (table A [online only]). D11S2000, D16S3401, D16S422, D21S2055, and D3S2427 were the only markers to rank among the 75 most informative in all data sets; note that D21S2055 was one of three loci identified by Zhivotovsky et al. (^{2003}) as unusually variable. D13S285 and D7S1804 were highly informative in all seven regional data sets (rank ≤75) but were less informative in at least one of the three worldwide data sets (rank >75). Conversely, D14S1007, D1S235, D22S683, D2S1356, D8S560, D9S1779, D9S1871, NA-D18S-2, and NA-D5S-1 were highly informative in the worldwide data sets (rank ≤25) but were less informative in one or more of the regional data sets (rank >75).
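The rank-threshold filters used in this paragraph amount to simple set operations on per-data-set ranks. The sketch below is illustrative only; the function names, data set labels, and toy ranks are hypothetical and are not values from table A.

```python
def within_rank_everywhere(ranks_by_dataset, datasets, cutoff):
    """Loci whose rank is at most `cutoff` in every listed data set."""
    datasets = list(datasets)
    loci = set(ranks_by_dataset[datasets[0]])
    for name in datasets[1:]:
        loci &= set(ranks_by_dataset[name])
    return sorted(locus for locus in loci
                  if all(ranks_by_dataset[name][locus] <= cutoff for name in datasets))


def beyond_rank_somewhere(ranks_by_dataset, datasets, cutoff):
    """Loci whose rank exceeds `cutoff` in at least one listed data set."""
    loci = set()
    for name in datasets:
        loci |= {locus for locus, rank in ranks_by_dataset[name].items() if rank > cutoff}
    return sorted(loci)


# Toy ranks: two "regional" and one "worldwide" data set, three hypothetical loci.
toy_ranks = {
    "region1": {"locusA": 3, "locusB": 10, "locusC": 90},
    "region2": {"locusA": 7, "locusB": 40, "locusC": 12},
    "world": {"locusA": 5, "locusB": 80, "locusC": 2},
}
regional = ["region1", "region2"]
informative_in_all_regions = within_rank_everywhere(toy_ranks, regional, cutoff=75)
weak_worldwide = beyond_rank_somewhere(toy_ranks, ["world"], cutoff=75)
# Informative in every region but not in every worldwide data set.
regional_only = sorted(set(informative_in_all_regions) & set(weak_worldwide))
```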

## Microsatellites and SNPs

Dinucleotide loci, which show the most variation among the markers in these data (Zhivotovsky et al. ^{2003}), were generally more informative than tetranucleotide loci (table 7), consistent with the generally greater differentiation of dinucleotides across human populations (Ruiz Linares ^{1999}; Rosenberg et al. ^{2003}). Dinucleotides were usually also more informative than trinucleotides, but, in many cases, trinucleotides and tetranucleotides had similar levels of informativeness. However, for the worldwide data sets and for Africa, tetranucleotides were by far the least informative class of microsatellite. For example, although 73% of the loci were tetranucleotides, in the World-7 data set, the 25 loci of highest informativeness included only 7 tetranucleotides. Of the 100 loci of lowest informativeness in this data set, 97 were tetranucleotides.

To compare informativeness of microsatellites and SNPs, we determined the informativeness of microsatellites for assignment with three source populations: Africans, Europeans, and East Asians. For these groups, we also determined informativeness for each pairwise combination of source populations (tables B and C [online only]). Similarly, we estimated informativeness of SNPs among African Americans, European Americans, and East Asians. Because the individuals and populations in the microsatellite and SNP data sets were not the same, our comparison of microsatellite and SNP informativeness can only be regarded as approximate. Inclusion of some extremely isolated populations in the microsatellite data but not in the SNP data might exaggerate the relative informativeness of microsatellites. However, this effect might be counteracted by a SNP ascertainment procedure that produced greater divergence across populations than is characteristic of randomly chosen SNP markers (J. Akey, personal communication); the microsatellite data likely show little or no such effect (Rosenberg et al. ^{2002}). The small-sample upward bias in informativeness might also impact relative informativeness estimates.

For each set of source populations, randomly chosen microsatellites had greater informativeness than random SNPs (fig. 6). The ratios of median dinucleotide informativeness to median SNP informativeness were 7.8 (Africans vs. Europeans), 6.8 (Africans vs. East Asians), 5.1 (Europeans vs. East Asians), and 5.3 (Africans vs. Europeans vs. East Asians). The ratios of means were 4.3, 3.7, 2.8, and 3.8, respectively, and the 50th percentile of dinucleotide informativeness corresponded to the 96th, 95th, 88th, and 98th percentiles of SNP informativeness.
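The three summaries reported here (ratio of medians, ratio of means, and the SNP percentile matched by the dinucleotide median) can be computed as in the following sketch. The function name is hypothetical, the toy values are not data from the study, and the percentile is taken, as an assumption, to be the percentage of SNPs whose informativeness does not exceed the dinucleotide median.

```python
import statistics
from bisect import bisect_right


def compare_marker_classes(dinucleotide_values, snp_values):
    """Summarize informativeness of dinucleotides relative to SNPs."""
    median_dinuc = statistics.median(dinucleotide_values)
    sorted_snps = sorted(snp_values)
    # Percentage of SNPs with informativeness <= the dinucleotide median.
    percentile = 100.0 * bisect_right(sorted_snps, median_dinuc) / len(sorted_snps)
    return {
        "ratio_of_medians": median_dinuc / statistics.median(snp_values),
        "ratio_of_means": statistics.mean(dinucleotide_values) / statistics.mean(snp_values),
        "snp_percentile_of_dinuc_median": percentile,
    }


# Toy values chosen only to exercise the function.
summary = compare_marker_classes([0.2, 0.4, 0.6], [0.05, 0.1, 0.15, 0.5])
```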


One threshold proposed for declaring a SNP to be highly informative is δ=0.5 (Shriver et al. ^{1997}), a value exceeded by 1.9%, 4.6%, and 2.7% of the SNPs (among those polymorphic in the relevant pair of populations) for African Americans and European Americans, African Americans and East Asians, and European Americans and East Asians, respectively. The value δ=0.5 corresponds to *F*_{st}∈[0.250,0.333] and *I*_{n}∈[0.131,0.216] (table 2); for corresponding comparisons, averaging across the three classes of loci, 26.0%, 42.0%, and 12.4% of microsatellites exceed the lower bound of *I*_{n}*,* and 5.9%, 10.2%, and 1.5% exceed the upper bound.
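The intervals quoted for δ=0.5 can be checked numerically for a biallelic locus in two source populations. The sketch below assumes natural logarithms, *I*_{n} equal to the sum over the two alleles of −*p̄* log *p̄* + ½(*p*_{1} log *p*_{1} + *p*_{2} log *p*_{2}), and *F*_{st} equal to the variance of the allele frequency across the two populations divided by *p̄*(1−*p̄*); these definitions are assumptions in the sense that they are not restated in this passage, although they reproduce the quoted intervals. The function names are hypothetical.

```python
import math


def xlogx(x):
    """x * log(x), with the convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def i_n_biallelic(p1, p2):
    """I_n for a biallelic locus with allele frequencies p1 and p2 in two populations."""
    total = 0.0
    for a, b in ((p1, p2), (1.0 - p1, 1.0 - p2)):
        mean = (a + b) / 2.0
        total += -xlogx(mean) + 0.5 * (xlogx(a) + xlogx(b))
    return total


def fst_biallelic(p1, p2):
    """F_st as among-population variance over mean * (1 - mean)."""
    mean = (p1 + p2) / 2.0
    variance = ((p1 - mean) ** 2 + (p2 - mean) ** 2) / 2.0
    return variance / (mean * (1.0 - mean))


delta = 0.5
# All frequency pairs with p2 - p1 = delta: p1 runs from 0 to 0.5.
pairs = [(i / 1000.0, i / 1000.0 + delta) for i in range(501)]
i_n_values = [i_n_biallelic(p1, p2) for p1, p2 in pairs]
fst_values = [fst_biallelic(p1, p2) for p1, p2 in pairs]

i_n_range = (round(min(i_n_values), 3), round(max(i_n_values), 3))
fst_range = (round(min(fst_values), 3), round(max(fst_values), 3))
```

Both minima occur at the symmetric pair (0.25, 0.75) and both maxima at the boundary pairs (0, 0.5) and (0.5, 1), which illustrates how a fixed δ can correspond to different amounts of information depending on where the frequencies lie in the unit interval.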

## Discussion

In this article, we have introduced new statistics, *I*_{n}*,* ORCA, and *I*_{a}*,* for measuring the information provided by loci about ancestry. *I*_{n}*,* which is highly correlated with ORCA and *I*_{a} (table 4), is robust, in that it gives similar results in bootstrap replicates (table 5). The statistic is effective for inference of ancestry, in that population structure is more easily inferred using markers that have high values of *I*_{n} than using those that have low values (figs. 3 and 4). Although it is closely related to δ in the case of biallelic markers in two source populations (figs. 1 and 2; table 2), unlike δ, *I*_{n} captures the dependence of information content on the position of allele frequencies in the unit interval.

Use of markers of highest informativeness is desirable for reduction of genotyping effort in such situations as forensics (Shriver et al. ^{1997}; Lowe et al. ^{2001}), admixture mapping (Dean et al. ^{1994}; McKeigue ^{1998}), and structured-association mapping (Pritchard and Donnelly ^{2001}; Hoggart et al. ^{2003}). In these scenarios, it is desirable to maximize information about individual ancestry at minimal cost. For the case of admixture mapping, the additional constraint that loci must be located in candidate regions of the genome applies; unlike other ancestry inference scenarios, admixture mapping makes use not of the ancestry of an individual as a whole but of particular parts of an individual genome. Thus, ideal marker sets for admixture mapping must have representation in regions of interest as well as high informativeness.

Highly informative markers are also useful in testing for population stratification in case-control genetic association studies (Pritchard and Rosenberg ^{1999}), although the test does not use individual ancestry estimates. The goal is to determine whether cases and controls differ in ancestry to such an extent that an excess number of random markers will, by chance, be associated with disease status. Because they have the greatest potential to differentiate among ancestry groups, the most informative markers offer the greatest power to reject the null hypothesis of no genomewide allele-frequency differences between cases and controls; thus, their use offers a cautious approach in dealing with population stratification. If allele-frequency differences are detected, these markers are ideal for structured-association methods that employ individual ancestry estimates to avoid identifying the associations that result from ancestry differences rather than from true association with disease status (Pritchard and Donnelly ^{2001}). The number of these markers needed for desired precision in estimated ancestry coefficients can be approximated using the maximum-likelihood model.

Consequently, the panels in tables A, B, C, D, and E (online only) can provide a resource for tests of population stratification. For example, in European Americans, the test might use markers that are most informative for distinguishing among various types of European ancestry (table A [online only]); in Hispanic Americans, it might employ markers that are most informative for distinguishing European from Native American ancestry (table B [online only]) or for distinguishing European, Native American, and African ancestry (table C [online only]). Note that panels in tables A–E utilize groups and classifications that might not be identical to those needed in applications: for example, if ancestry inference in African Americans is of interest, the African and European groups in our data do not fully represent the groups from which African Americans have descended. However, we observed that informativeness in one region was often highly correlated with informativeness in another region (table 6; fig. 5). Thus, while the most informative markers in a data set need not be the most informative for use with a different collection of groups, this imperfect panel of markers is likely to be considerably more informative than a random panel. The observed pattern, in which informativeness correlations were highest for neighboring geographic regions, is likely to be a consequence of the correlation of allele frequencies that results from shared ancestry (Ramachandran et al., ^{in press}). Populations from neighboring regions typically share ancestors more recently, so that their allele frequencies are more strongly correlated.

Two exceptions to the general pattern of correlation across data sets were Oceania and America, in which informativeness was not very highly correlated with informativeness in other regions. The small correlations likely indicate that many of the markers that are extremely variable in other regions were, by chance, not highly variable in the founder groups of Oceania and the Americas. That informativeness patterns across di-, tri-, and tetranucleotides were different in Oceania and America from those of the other data sets suggests that bottlenecks were strong enough to obscure the typical patterns of variation for these three classes of markers.

Thus, to identify a panel of markers that are generally useful for inference of regional ancestry and for population ancestry inference within regions (Hoggart et al. ^{2003}), it is most difficult to find markers that are informative both within continental Eastern Hemisphere regions and within Oceania and the Americas. We have identified a small number of generally informative markers; many more loci will need to be screened if markers that are informative in every region are to be found. Alternatively, a general panel might be assembled by collecting markers useful for inference between specific pairs of groups. Such a procedure may be advantageous, because, unlike sequential accumulation of generally informative markers, it avoids duplication of effort by accounting for the possibility that markers of high informativeness can provide information about ancestry in different ways. A systematic procedure to identify maximally informative sets or loci that are conditionally optimal, given the markers that have already been chosen, might use multilocus *I*_{n}*,* multilocus ORCA (eq. [12]), or a decision tree (Guinand et al. ^{2002}).

Although random microsatellites are considerably more informative than random SNPs for distinguishing among pairs of populations, and highly informative loci constitute a greater fraction of microsatellites than of SNPs, the right-hand tail of the distribution of SNP informativeness crosses that of microsatellites (fig. 6), suggesting that, if enough SNPs are screened, a set with informativeness comparable to that of the set of the most informative microsatellites can be found. This observation may be less applicable to the problem of distinguishing among *K* source populations for *K*>2. For a locus with *N* alleles, if *N*≥*K*, the informativeness of the locus can potentially be as large as log *K*, whereas, if *N*=2, the maximal informativeness is no larger than log 2, regardless of the value of *K*. Because microsatellites in the data of Rosenberg et al. (^{2002}) have an average of 12.4 alleles, for relatively large values of *K,* microsatellites have greater potential for higher information content than SNPs, most of which are biallelic. Thus, for large *K,* the relative performance of microsatellites compared with SNPs will likely be greater than is seen in figure 6 for sets of two and three source populations.
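The two bounds used in this argument follow from writing *I*_{n} as a mutual information between the population of origin *Q* and the allele *J*. The short derivation below is a sketch under the assumption that the *K* source populations are equally likely a priori, so that the entropy of *Q* is log *K*; the entropy notation *H*(·) is introduced here for illustration.

```latex
\begin{align*}
I_n(Q;J) &= H(Q) - H(Q \mid J) \;\le\; H(Q) = \log K, \\
I_n(Q;J) &= H(J) - H(J \mid Q) \;\le\; H(J) \;\le\; \log N, \\
\text{so that}\quad I_n(Q;J) &\le \min(\log K,\ \log N).
\end{align*}
```

With *N*=2 the bound is log 2 (approximately 0.693 with natural logarithms) for every *K*, whereas a locus with *N*≥*K* alleles reaches log *K* when each population is fixed for a different allele.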

Thus, for inferring ancestry among groups such as African Americans, European Americans, and East Asians, for which genomewide SNP allele frequencies have already been obtained (Akey et al. ^{2002}), use of the most informative known SNPs is likely to be most efficient. However, because informativeness for distinguishing among populations such as different Native American groups is not correlated with informativeness for distinguishing among major regional groups (table 6), SNPs chosen by their informativeness in other scenarios will likely be considerably less useful in these populations than randomly chosen microsatellites. Until the most informative SNPs are identified for a set of populations of interest, use of microsatellites, especially dinucleotides, may lead to greater statistical efficiency in inference of ancestry. Of course, technical problems associated with dinucleotides (Ghebranious et al. ^{2003}) might outweigh the efficiency that derives from their use, and factors such as laboratory fixed costs and difficulties in multiplexing might make the application of less informative markers more economical. The decision about which markers to use for inference of ancestry in any particular context should incorporate a combination of economic, technical, and statistical concerns.

## Acknowledgments

N.A.R. was supported by an NSF Postdoctoral Fellowship in Biological Informatics. We thank J. Akey for assistance with the SNP data set, P. Calabrese and D. Conti for discussions, and E. Ziv and two reviewers for thoughtful comments on the manuscript.

## Appendix: Informativeness for Ancestry Coefficients

Define Δ_{K-1}={(*q*_{1},*q*_{2},…,*q*_{K}): *q*_{i}≥0, ∑_{i}*q*_{i}=1} and suppose that *Q* is uniformly distributed on Δ_{K-1}. The probability density function for *Q*=(*Q*_{1},*Q*_{2},…,*Q*_{K}), which we denote by *f*_{Q}(*q*), can be regarded as a function defined on Δ_{K-1}. An elementary calculation shows that ∫_{Δ_{K-1}}*dQ*_{1}…*dQ*_{K-1}=1/(*K*-1)!, and, therefore,

$$f_Q(q) = (K-1)! \tag{A1}$$

Similarly to the discrete no-admixture model, we can apply the continuous-variable analogue of mutual information (Cover and Thomas ^{1991}, chapter 9) to define informativeness in the admixture model, or *I*_{a}(*Q*;*J*) (the subscript “a” refers to the *admixture model*). The informativeness for ancestry coefficients is the difference of entropy *H*_{a}(*Q*) and conditional entropy *H*_{a}(*Q*|*J*). By use of equation (A1) and the definition, the entropy is

$$H_a(Q) = -\int_{\Delta_{K-1}} f_Q(q)\,\log f_Q(q)\,dq = -\log\,(K-1)! \tag{A2}$$

The conditional entropy of *Q* given *J* is given by

By use of equations (13) and (A1), the following integral can be evaluated:

Setting *a*_{ij}=*p*_{ij}(*K*-1)!/*p*_{j} and substituting equations (13), (A1), and (A4) into (A3), we have

If we assume that, for all *j,* if *i*≠*i*^{′}*,* then *p*_{ij}≠*p*_{i′j}*,* by applying the result of Rosenberg and Stong (^{2003}) with *K*-1 in place of *k* to the function *f*(*x*)=*x*^{K} log *x*/*K*!, it can be shown that the integral in equation (A5) evaluates to

where *S*^{(2)}_{K+1}=(-1)^{K+1}*K*!(1+2^{-1}+3^{-1}+…+*K*^{-1}) is a Stirling number of the first kind. Finally, inserting equation (A6) into (A5) and simplifying gives

The expression for the mutual information (eq. [14]) is obtained by use of *I*_{a}(*Q*;*J*)=*H*_{a}(*Q*)-*H*_{a}(*Q*|*J*) with equations (A2) and (A7). Note that the definition of *I*_{a} is sensible only if no two populations share the same frequency for any allele. It is appropriate to assume that parametric allele frequencies are unequal in different populations; however, when allele frequencies are estimated from samples of small and equal size, this assumption will often not be met.
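The closed form quoted above for the Stirling number of the first kind, *S*^{(2)}_{K+1}=(-1)^{K+1}*K*!(1+2^{-1}+…+*K*^{-1}), can be checked for small *K* against the standard recurrence *s*(*n*+1,*k*)=*s*(*n*,*k*-1)-*n* *s*(*n*,*k*) for signed Stirling numbers. The sketch below uses exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def stirling_first_kind(n, k):
    """Signed Stirling number of the first kind s(n, k), built row by row."""
    row = [1]  # s(0, 0) = 1
    for m in range(n):
        new_row = [0] * (m + 2)
        for j in range(m + 2):
            left = row[j - 1] if j >= 1 else 0
            same = row[j] if j <= m else 0
            new_row[j] = left - m * same  # s(m+1, j) = s(m, j-1) - m * s(m, j)
        row = new_row
    return row[k] if 0 <= k <= n else 0


def closed_form(K):
    """(-1)^(K+1) * K! * (1 + 1/2 + ... + 1/K), evaluated exactly."""
    harmonic = sum(Fraction(1, i) for i in range(1, K + 1))
    return (-1) ** (K + 1) * factorial(K) * harmonic


# The two expressions agree for K = 1, ..., 9.
agreement = all(stirling_first_kind(K + 1, 2) == closed_form(K) for K in range(1, 10))
```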

## Footnotes

^{*}This article is dedicated to the memory of Ryk Ward.

## Electronic-Database Information

The URLs for data presented herein are as follows:

Noah Rosenberg's Homepage (for *distruct* software)

Pritchard Lab Web site (for *structure* software)

## References

**American Society of Human Genetics**


Informativeness of Genetic Markers for Inference of Ancestry. American Journal of Human Genetics. Dec 2003; 73(6):1402.
