# Estimating Local Ancestry in Admixed Populations

^{1}Computer Science Deptartment, University of California, Berkeley, Berkeley, CA 94720, USA

^{2}Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA

^{3}International Computer Science Institute, Berkeley, CA 94704, USA

This document may be redistributed and reused, subject to certain conditions.

## Abstract

Large-scale genotyping of SNPs has shown a great promise in identifying markers that could be linked to diseases. One of the major obstacles involved in performing these studies is that the underlying population substructure could produce spurious associations. Population substructure can be caused by the presence of two distinct subpopulations or a single pool of admixed individuals. In this work, we focus on the latter, which is significantly harder to detect in practice. New advances in this research direction are expected to play a key role in identifying loci that are different among different populations and are still associated with a disease. We evaluated current methods for inference of population substructure in such cases and show that they might be quite inaccurate even in relatively simple scenarios. We therefore introduce a new method, LAMP (Local Ancestry in adMixed Populations), which infers the ancestry of each individual at every single-nucleotide polymorphism (SNP). LAMP computes the ancestry structure for overlapping windows of contiguous SNPs and combines the results with a majority vote. Our empirical results show that LAMP is significantly more accurate and more efficient than existing methods for inferrring locus-specific ancestries, enabling it to handle large-scale datasets. We further show that LAMP can be used to estimate the individual admixture of each individual. Our experimental evaluation indicates that this extension yields a considerably more accurate estimate of individual admixture than state-of-the-art methods such as STRUCTURE or EIGENSTRAT, which are frequently used for the correction of population stratification in association studies.

## Introduction

Recent advances in genotyping technologies have opened up unprecedented opportunities to improve our understanding of complex diseases through disease association studies. In these studies, a population of cases and controls are genotyped across the genome, and the allele frequencies are compared across these two groups. Currently, in a typical study, hundreds of thousands of single-nucleotide polymorphisms (SNPs) are genotyped for thousands of individuals.^{1} These numbers are expected to grow in the coming years because of the constant improvements in genotyping technologies.^{1}

A significant discrepancy between the allele frequencies in the cases and the controls gives evidence for an association between the SNP and the phenotype and therefore links the SNP to the disease. However, a growing concern is that many of the associations found are due to confounding effects. In particular, if the cases and the controls are not sampled from the same population, many spurious associations will be discovered because the two populations might have different allele frequencies at a SNP regardless of the disease status.^{2–10} This bias can be observed in diseases that are more prevalent in one population than in another. In such cases, the collection of the cases is a biased sample of the population.

Various methods have been proposed to deal with population substructure in association studies.^{2,11} One of the most intuitive approaches is to first find the population substructure within the cases and the controls by using a clustering algorithm such as STRUCTURE^{12} and then to correct it by using regression or other methods that take the subpopulation variable into account.^{13} The clustering algorithms need to be accurate enough, so that the signal obtained from the difference in population substructure will be weaker than the signal obtained from the difference in the disease status.

The problem of inferring the population substructure is especially challenging when recently admixed populations are involved. In these populations (e.g., African Americans and Latinos), two or more ancestral populations have been mixing for a relatively small number of generations, resulting in a new population in which the ancestry of every individual can be explained by different proportions of the original populations. Because of recombination events, even within the DNA of a single individual, different regions of the genome could originate from different ancestral populations. This adds to the complexity of the problem of finding the ancestral information of an individual because in nonadmixed populations, the whole genome can be used as evidence for the population membership of an individual, whereas in the admixed case, the genome of each individual is fragmented into shorter regions of different ancestry. It is therefore challenging to find the ancestral information of these individuals and, in particular, to find the locus-specific ancestries.

An accurate inference of locus-specific ancestry in admixed populations could lead to improved analysis of studies based on admixture mapping. In these studies, a set of cases from a recently admixed population is genotyped, and the genome is scanned for regions in which the proportions of ancestral populations are significantly different than the rest of the genome.^{14,15} Unfortunately, most of the current methods for inference of locus-specific ancestral information^{12,16–18} do not scale to large datasets. The only existing method that copes with large datasets is SABER,^{19} which is based on an extension of a Hidden Markov Model^{20} that deals with local haplotype blocks.

Here, we propose a new method, LAMP (Local Ancestry in adMixed Populations), for de novo estimation of the locus-specific ancestry in recently admixed populations (see Figure 1). Our method is based on the observation that previous methods that use a Hidden Markov Model, or extensions of it, are set to infer a very large set of parameters, including the exact position of the recombination events, making the search over the parameter space infeasible. Instead, our method operates on sliding windows of contiguous SNPs. We first calculate an optimal window length. Next, we use a clustering algorithm that operates on these windows and estimates each individual's ancestry. We then use a majority vote for each SNP, over all windows that overlap with the SNP, in order to decide the most likely ancestral populations at the SNP. This simple approach has two advantages over previous ones. First, we show analytically that the estimates of the algorithm are asymptotically correct across the entire genome. Second, it optimizes fewer parameters than previous methods, and hence the optimization is much faster and more robust than previous methods.

We tested LAMP extensively on various datasets of admixed populations generated from the HapMap resource. Our simulations show that LAMP is significantly more accurate than state-of-the-art methods such as SABER and STRUCTURE. In addition, LAMP is highly efficient, with a running time that is about 200 times faster than SABER and about 10^{4} times faster than STRUCTURE. The efficiency of LAMP allows us to estimate ancestries across the genome in several hours on a single computer.

An additional advantage of LAMP is that unlike previous methods, such as SABER, it does not require the ancestral genotypes to infer the locus-specific ancestries (though it can take advantage of these, if available). This might be crucial when the ancestral genotypes cannot be typed or are unknown. For instance, if one studies the population genetics of populations in remote geographic locations where historical admixing has not been recorded, a method such as LAMP could be used to reveal such recent admixing. Furthermore, even in cases where the history of admixing is known, it is not always possible to genotype all the ancestral populations because some of the subpopulations have become extinct and some have entirely mixed with other populations. On the other hand, as genotypes of major population groups become available, it would be beneficial to use LAMP-ANC (ANC: ancestral), which can take advantage of the pure genotypes.

Surprisingly, we find that in many cases where LAMP does not receive the genotypes of the ancestral populations as input, it performs considerably better than SABER. In particular, on a simulated dataset of African Americans, when measuring the percentage of individuals that are predicted with an accuracy of at least 90%, LAMP achieves high accuracies on 90% of the individuals, whereas SABER and STRUCTURE achieve less than 10%.

Finally, we used LAMP to estimate the individual admixture and showed empirically that this results in much more accurate estimates than methods such as STRUCTURE^{12} or EIGENSTRAT.^{2} This reduction in errors might be used to considerably reduce the rate of spurious association results in disease association studies.

## Material and Methods

The inference of locus-specific ancestry depends on the mathematical model representing the mixing process of the populations. We will first describe the model assumptions and then describe the inference algorithm under the model.

### Model Assumptions

We assume that there are *K* ancestral populations *A*_{1}, …, *A _{K}* that have been mixing for

*g*generations. If the populations have mixed at different times, then

*g*is taken to be an upper bound on the number of generations since the beginning of admixture. The fraction of population

*A*in the ancestral population that we call the admixture fraction is

_{i}*α*, where $\sum _{i}{\mathit{\alpha}}_{i}=1$. We assume for convenience that α

_{i}_{1}≥ α

_{2}… ≥

*α*. In each generation, we assume random mating within the combined pool of the

_{K}*k*populations. We denote the recombination rate at position

*j*by

*r*. Note that

_{j}*r*is the recombination rate at position

_{j}*j*at a specific meiosis (one generation), and not through history. We model the transmission of a chromosome from a parent to a child by walking along the chromosome from the 5′ end to the 3′ end, with crossovers between chromosomes occurring as a Poisson process with rate

*r*.

_{j}^{21}For simplicity of the presentation, we will assume a uniform recombination rate, i.e., that

*r*=

*r*for every position

_{j}*j*. The algorithm and analysis remain qualitatively the same when applied to nonuniform recombination rates.

We denote the genotype data of individual *i* at position *j* as *g _{ij}*, where

*g*{0, 1, 2} is the minor allele count at that position. At position

_{ij}*j*, the two alleles of individual

*i*have descended from one or two of the

*K*ancestral populations. We denote by ${a}_{\mathit{ij}}^{p}$ {0, 0.5, 1} the fraction of alleles descended from population

*p*at position

*j*in individual

*i*. The quantities ${a}_{\mathit{ij}}^{p}$ are unknown; the objective of this paper is to present a method LAMP that accurately estimates these quantities.

### The LAMP Framework

In this work, we consider a recently admixed population in which the number of generations *g* since the beginning of the mixing is small. Therefore, we expect the total number of recombinations in these *g* generations to be small, as well. The resulting chromosomes are mosaics of the *k* populations, where the ancestral breakpoints in which the chromosome ancestry changes from one population to the other are determined by the recombination events.

We assume that the quantities *g*, *α _{i}*, and

*r*are known for the admixed population. The basic idea in LAMP is to estimate the ancestries of the individuals in a sliding window that spans

*l*sites. We term

*l*the length of the window. The choice of the length

*l*will be discussed later. Intuitively, if

*l*is small enough, and the number of generations

*g*is not too large, a typical window of length

*l*will have almost no recombination events throughout history, and therefore almost no breakpoints. Therefore, within each window, it is reasonable to use an inference algorithm that assigns the sequence of genotypes in the window to one or two of the populations under the assumption that there are no breakpoints in any of the chromosomes. The latter is a simple clustering problem, although the accuracy of the inference in a given window improves when the number of SNPs

*l*in the window increases. We therefore search for a window length

*l*that is short enough so that most individuals have no breakpoints and large enough so that there is enough information to correctly cluster the individuals within the window. This procedure is repeated by sliding the window to cover all the SNPs on the genome. The windows that overlap a SNP are then combined into a single solution with a majority vote for the ancestry assignment. We note that unlike previous methods (e.g., SABER

^{19}or STRUCTURE

^{12}), we are not attempting to estimate the exact positions of the breakpoints; instead, we are trying to minimize the errors in the locus-specific ancestry prediction across the genome.

The LAMP algorithm works as follows. We first find the optimal window length on the basis of the parameters *g*, *α _{i}*, and

*r*. Then, we use a clustering algorithm that operates on a window and estimates for each individual

*i*, and for each ancestral populations

*A*,

_{j}*A*, the probability ${p}_{\mathit{jk}}^{i}$ for individual

_{k}*i*to have one chromosome descended from population

*A*at this window and another descended from population

_{j}*A*. We then use a majority vote for each SNP, over all windows that overlap with the SNP, in order to decide the most likely ancestral populations at the SNP. As we argue below, even though this scheme optimizes less parameters than previous methods, such as SABER, or a regular hidden Markov model (HMM), we show analytically and empirically that the estimates of the algorithm are asymptotically correct across the entire genome.

_{k}### Estimating the Ancestry in a Single Window

We assume that none of the individuals have a breakpoint within a window and estimate a single ancestral origin for each individual across the length of the window. This assumption is largely true if the length of the window is determined correctly (see Choosing the Window Length, as well as Estimate of Window Length in the Appendix). We further assume that the values α_{1}, …, *α _{K}* are known. These values are the admixture fractions of each of the populations across the whole genome, and they can be estimated with existing tools such as STRUCTURE.

^{12}In the Results section, we show that our method is robust to reasonable inaccuracies in the estimates of α

_{1}, …,

*α*.

_{K}#### Clustering Algorithm

We assume that subpopulation *A _{i}* has minor allele frequencies $\overrightarrow{{f}_{i}}={f}_{i1},\dots ,{f}_{\mathit{in}}$ for

*n*SNPs in a given window of length

*l*and that the different SNPs in the window are independent. The latter assumption can be achieved in practice by the greedy removal of SNPs having a high correlation coefficient (

*r*

^{2}> 0.1) from the window. We look for a classification function $\mathit{\theta}:I\to {\{1,\dots ,K\}}^{2}$, where

*I*is the set of individuals and the range corresponds to the possible pairs of subpopulations. In particular, we write θ(

*i*) = (θ

_{1}(

*i*), θ

_{2}(

*i*)) to denote the ancestries of the two chromosomes of individual

*i*in the current window. We use a clustering algorithm known as Iterated Conditional Modes (ICM)

^{22}to find an optimal classification of each individual in terms of the likelihood. For increased efficiency in the running time, we seed the algorithm with an initial classification as described in the Initializing the Clusters section.

The updates in the ICM algorithm differ from those in a traditional expectation maximization (EM) method only in the E step. In the latter, the E step consists of obtaining the expected classification θ, given the values $\overrightarrow{{f}_{i}}$. This would provide fractional class membership for each individual *i*. However, because we assume that the initial classification provides a reasonable solution, we find the maximum aposteriori estimate of θ as shown below. For brevity, we use ${G}_{i}$ to refer to the genotype (*g _{i}*

_{1}, …,

*g*) of the individual

_{in}*i*.

Because α_{1}, …, *α _{K}* are known, under the assumption of random mating, we can estimate the first term Pr $[\mathit{\theta}\left(i\right)={A}_{s}{A}_{t}|\overrightarrow{{f}_{1}},\dots ,\overrightarrow{{f}_{K}}]$ as Pr[θ(

*i*) =

*A*] = 2

_{s}A_{t}^{1−δ(s,}

^{t}^{)}

*α*where δ(

_{s}α_{t}*x*,

*y*) is 1 iff

*x*=

*y*and 0 otherwise.

The other term can be estimated as

In the M step, we obtain the maximum-likelihood estimate of $\overrightarrow{{f}_{1}},\dots ,\overrightarrow{{f}_{K}}$ by finding

If the phase of the individuals is known, then the maximum-likelihood estimate of $\overrightarrow{{f}_{1}},\dots ,\overrightarrow{{f}_{K}}$ could have been computed simply by the counting of the number of alleles in each of the subpopulations at every position. However, when the phase is not known, the problem becomes more complicated. Consider for instance a heterozygous site *j* in an individual *i*, with θ_{1}(*i*) ≠ θ_{2}(*i*). In this case, it is not clear whether the minor allele count should be added to ${f}_{{\mathit{\theta}}_{1}\left(i\right)j}$ or to ${f}_{{\mathit{\theta}}_{2}\left(i\right)j}$. To solve this problem, we introduce another classification function per site, $\overrightarrow{{\lambda}_{j}}:I\to {\{0,1\}}^{K}$. This function is defined on the set of SNPs for which the assignment of counts is ambiguous, i.e., heterozygous SNPs in individuals *i* with classification θ_{1}(*i*) ≠ θ_{2}(*i*). We denote this set of heterozygous SNPs ${H}_{i}$. The function $\overrightarrow{{\lambda}_{j}}$ is defined as

Here, ${\overrightarrow{e}}_{s}$ is the vector with 1 in coordinate *s* and 0 elsewhere.

For a heterozygous site *j* in individual *i* such that $j\in {H}_{i}$, we can now define

By using the assumption of independence of the SNPs and the $\overrightarrow{{\lambda}_{j}}$ just defined, we can rewrite Equation 2 as follows. The usefulness of this will be apparent later.

The MLE for ${\widehat{f}}_{ij},\dots ,{\widehat{f}}_{Kj}$ can be found with an EM algorithm where

Here, λ* _{j}*,

_{s}(

*i*) refers to coordinate

*s*of the vector λ

*(*

_{j}*i*). ${n}_{k,u}^{sj}$ refers to the number of individuals who have

*u*{0, 1, 2} minor alleles and

*k*{1, 2} copies of alleles from population

*A*at site

_{s}*j*. The counts of these individuals can be obtained on the basis of the classification θ(

*i*). Notice that the term corresponding to the heterozygous sites that have a single allele from population

*A*has its contribution modified by $\overline{{\lambda}_{j,s}}\left(i\right)$. We can now perform expectation-maximization iterations by using Equations 5 and 4. The convergence of these iterations provides us a maximum-likelihood estimate of $\overrightarrow{{f}_{1}},\dots ,\overrightarrow{{f}_{K}}$. These estimates can then be used in the next iteration to estimate θ with Equation 1.

_{s}#### Initializing the Clusters

We now describe how we obtain an initial setting of the parameters, i.e., the classification function θ or the allele frequencies $\overrightarrow{{f}_{1}},\dots ,\overrightarrow{{f}_{K}}$, which are used as starting points by the EM algorithm. We focus here on two specific scenarios. The first scenario is the case where there are two ancestral populations, i.e., *K* = 2, and unknown allele frequencies $\overrightarrow{{f}_{1}},\dots ,\overrightarrow{{f}_{K}}$. In this instance, we use an algorithm called MAXVAR to provide an initial solution to the EM algorithm. The main motivation behind MAXVAR is the quick production of a reasonable classification. The algorithm takes advantage of results computed from adjacent windows, and its running time grows linearly with the number of SNPs. We have also considered using spectral clustering, but in practice, we found that the final classification accuracy is nearly the same as MAXVAR though the running time is increased. The result from MAXVAR is a classification of the individuals, which is then used in Equation 2 of the EM.

The second scenario is the case where *K* ≥ 2 and estimates of the allele frequencies ${\widehat{f}}_{1},\dots ,{\widehat{f}}_{K}$ in the ancestral populations are known. In this case, these allele frequencies are used as a starting solution in Equation 1 of the EM algorithm.

*The MAXVAR Algorithm.* When we have two populations, we estimate a window length *l* such that most of the individuals have no breakpoints within a window. Thus, the ancestries of these individuals are *A*_{1}*A*_{1}, *A*_{1}*A*_{2}, or *A*_{2}*A*_{2}. We define α = α_{2} as the admixture fraction of the smaller of the two populations. We now describe a method for finding the ancestry of each individual in this window. We call this the MAXVAR algorithm.

We first define a similarity score S between a pair of individuals. For each SNP *j*, let ${\mathit{\mu}}_{j}=\frac{\sum _{i}{g}_{ij}}{n}$, where *n* is the number of individuals, and let ${\mathit{\sigma}}_{j}=\sqrt{\frac{\sum _{i}{({g}_{ij}-{\mu}_{j})}^{2}}{n}}$. For two individuals *i*_{1}, *i*_{2}, we define

For each *i* ≤ *n*, let $Var\left(i\right)=\underset{{i}^{\prime}:{i}^{\prime}\ne i}{\Sigma}S{(i,{i}^{\prime})}^{2}$ denote the similarity of all other individuals to individual *i*, and let *i*^{} = argmax* _{i}*{

*Var*(

*i*)}. The MAXVAR algorithm simply finds

*i*

^{}and clusters the individuals according to the values $S({i}^{\ast},i)$. In particular, we order the individuals according to these values, and the smallest (1 − α)

^{2}

*n*individuals are assigned the ancestry of

*A*

_{1}

*A*

_{1}, the largest α

^{2}

*n*individuals are assigned the ancestry of

*A*

_{2}

*A*

_{2}, and the rest are assigned

*A*

_{1}

*A*

_{2}. We provide a formal proof of correctness of the MAXVAR method in the Appendix (Correctness of MAXVAR).

*Known Ancestries.* The problem of estimating the ancestry is considerably simpler if we are provided estimates of the ancestral allele frequencies. In this case, as before, we first estimate the window length *l*. Within each window, we then estimate the ancestries by using the likelihood function given by Equation 1 with the given ancestries ${\widehat{f}}_{1},\dots ,{\widehat{f}}_{K}$ used as the starting solution. The ancestries predicted at each SNP are combined with a majority vote.

### Choosing the Window Length

In order for the local predictions to achieve reasonable accuracy, the length of the window *l* should be short enough so that most individuals do not have a breakpoint in the window and long enough so that the SNPs provide sufficient information for the observation of a difference between the populations. Note that we use the term breakpoint to refer to a recombination event that results in a change in ancestry of the adjacent SNPs. The power of our method stems from the fact that long windows provide much more information than any local behavior, provided that there are not too many individuals with breakpoints in the window. We are looking for the maximum window length *l* so that the errors in the classification that are due to breakpoints in the window are bounded. We present empirical results that validate the window-length estimates in the Estimate of Window Length section of the Appendix.

In each window, the errors in the classification depend on the length of the window, the number of individuals, and the distance between the populations. Evidently, it is hard to predict these errors because the distance between the populations is unknown, and the performance of the EM algorithm is unpredictable for a finite sample. To obtain a bound on the errors, we consider the most accurate classification of the individuals in a window. Such a classification is allowed to assign ancestries to the individuals in a window with knowledge of their true ancestral states ${a}_{ij}^{p}$ for p = 1, …, *K*. Thus an individual whose ancestry is *A _{s}A_{t}* over the length of the window is always classified correctly. The only errors made by such a classification are due to the locations of the breakpoints. In the presence of a breakpoint, an individual would be assigned an ancestry so that the number of errors is minimized. For instance, an individual with a breakpoint at position

*j*and ancestries ${A}_{{s}_{1}}$ and ${A}_{{s}_{2}}$ on either side of the breakpoint gets assigned the majority ancestry over the length of the window

*i.e.*, the individual gets classified as ${A}_{{s}_{1}}$ if $j>\lceil \frac{l}{2}\rceil $ and ${A}_{{s}_{2}}$ otherwise. It is easy to see that the larger the window size

*l*, the more likely it is for an individual to have a breakpoint, and hence, more errors are introduced in the optimal classification.

The number of recombination events throughout time along a specific window is assumed to be Poisson distributed with parameter (*g* − 1)*lr*. Therefore, as long as (*g* − 1)*lr* << 1, it can be verified that the probability to have a breakpoint in the window is upper bounded by $2(g-1)lr\sum _{i<j}{\alpha}_{i}{\alpha}_{j}$ under the assumption of random mating and that the admixture fractions of the population right before recombination are *α _{i}*. Therefore, the probability for a breakpoint on either chromosome is bounded by $\gamma =4(g-1)rl\sum _{i<j}{\alpha}_{i}{\alpha}_{j}$.

For a given window, the above analysis shows that the expected fraction of individuals with no breakpoints is 1 − γ. We can now use this to obtain a bound on the fraction of errors in a window. Let *X* be the fraction of errors in a window of an algorithm that makes the optimal classification. Let *I* be the number of breakpoints in the window. We compute

Note that *E*[*E*[*X*|*I* = 0]] = 0 because the optimal classification in this case makes no errors. When there is a single breakpoint *I* = 1, the breakpoint is distributed uniformly over the length of the window. We denote the position of the breakpoint *J* ~*Unif*(1, *l*). The fraction of errors in the presence of a single breakpoint at position *J* is

We now have

If *glr* << 1, we can ignore Pr [*I* > 1] so that

For a bound on the expected fraction of errors, we get γ < 4. Rewriting the window length *l* in terms of , we get

Although these arguments bound the errors in a single window, it is also possible to bound the errors due to overlapping windows at a SNP. In this case, the use of a majority vote can be shown to further improve the accuracy of the predictions. The details of this analysis can be found in the Appendix (Accuracy of the Window Length and the Majority Vote).

The analysis presented here is specific to the model of admixture described at the start of the Model Assumptions section. However, it is easy to see that the analysis can be extended to the case of nonuniform recombination rate, where the probability for a recombination in position *i* is *r _{i}*. In that case, the term (

*g*− 1)

*lr*should be replaced by $(g-1){\Sigma}_{i=0}^{l}{r}_{i}$.

The model considered so far does not take into account the distance between the ancestral populations while choosing the window length. When the ancestral genotypes are known, the window length can be chosen to trade off the accuracy in separating the ancestral genotypes with an increase in the errors due to breakpoints. A binary search over the window lengths can then pick the optimal window length, as discussed in the Appendix (Practical Issues in Implementing LAMP).

## Results

We empirically evaluated LAMP on various datasets and compared its performance with other tools that infer ancestry in admixed populations. When one is comparing this to previous methods, it is important to note that the inputs needed for the different methods are different. In particular, in SABER,^{19} the genotypes from the pure ancestral populations are assumed to be known, whereas in LAMP, we do not need this extra information. On the other hand, similar to SABER, LAMP assumes that the recombination rates across the genome and the admixture fraction (α_{1}, …, *α _{k}*) are known; the latter can be found with reasonable accuracy with existing methods such as STRUCTURE or EIGENSTRAT, wheras the former can be obtained from the previous estimates of recombination rates based on the HapMap data.

^{23}We also provide LAMP with an estimate of the number of generations

*g*of admixture, and this number can be approximated if the history of the admixed populations is known. We show below that our method is robust to deviations in the estimate of

*g*. For SABER, we set the parameter τ, which roughly corresponds to the number of generations since admixing, to

*g*. We found that allowing SABER to estimate the values of τ yielded much poorer estimates of ancestry.

### Simulated Datasets

We simulated admixed populations from the HapMap data in the following manner. We used the SNPs of chromosome 1 from the 500K Affymetrix GeneChip assay from each of the four HapMap populations: Yoruba people from Ibadan, Nigeria (YRI); Japanese from the Tokyo area (JPT); Han Chinese from Beijing (CHB); and Utah residents with ancestry from northern and western Europe (CEU). Overall, these span 38,864 SNPs for 60 unrelated individuals from CHB and YRI and 45 unrelated individuals from CHB and JPT.

For each pair of HapMap populations, we simulated admixed populations by random mating of individuals from the two populations across several generations. We started by joining a random set of *αn* individuals from the first population and (1 − α)*n* individuals from the second population. For the next generation, we repeatedly picked a random pair of individuals from the combined set of individuals and generated a child for this pair by transmitting one chromosome from each individual. We repeated this process for *g* generations. We set the recombination rate to be 10^{−8} per base pair per generation, consistent with previous studies.^{24} We note that this model is a worst-case scenario in the sense that in practice, the populations are expected to mix in a slower rate because individuals tend to mate with individuals from a similar ancestral background. We simulated admixture for various values of *g* and α; in the rest of this manuscript, the values of *g* and α are 7 and 0.2, unless stated otherwise. These parameters roughly match the nature of admixing in African American populations.^{16,17,25–27}

### LAMP's Performance and Accuracy

We evaluated the accuracy of the ancestry estimates inferred by LAMP. We consider the two versions of LAMP, i.e., the de novo inference of the local ancestries and the inference of the local ancestries based on genotype data of the original ancestral populations. We refer to the latter method as LAMP-ANC. For each individual *i* and SNP *j*, LAMP finds an estimate ${\widehat{a}}_{ij}^{p}\in \{0,0.5,1\}$ for the true ancestry ${a}_{ij}^{p}$ by a majority vote across the windows overlapping with position *j*. We measure the accuracy of LAMP as the fraction of triplets (*i*, *j*, *p*) for which ${a}_{ij}^{p}={\widehat{a}}_{ij}^{p}$.

We compared LAMP to two state-of-the-art methods: STRUCTURE^{12} and SABER.^{19} SABER requires the input genotypes, admixture fraction α, physical location of the SNPs, and the ancestral sequences that were used in the simulation (i.e., the original HapMap populations) and was also provided the number of generations *g*. For STRUCTURE, we only needed to provide the genotypes. We did not compare LAMP to methods such as AdmixMap^{18} and AncestryMap^{16} because the high density of markers made these methods infeasible.

Table 1 summarizes the prediction accuracies of LAMP, LAMP-ANC, SABER, and STRUCTURE. LAMP and LAMP-ANC were run on the set of 38,864 SNPs of chromosome 1. SABER and STRUCTURE were run on nonoverlapping windows of 4000 SNPs that included 36,000 of the original 38,864 SNPs. This was done because STRUCTURE got into numerical instabilities when a large number of SNPs were used and SABER crashed for an unknown reason when run on all of the SNPs over the set of 500 individuals. For STRUCTURE, the linkage model was used with 10,000 burn-in and 50,000 MCMC iterations. SABER was also seen to crash on some of the 4000 SNP blocks, and these were excluded from the analysis. The accuracy of the ancestry estimates were obtained on the SNPs for which all methods returned a result. From Table 1, it is clear that LAMP achieves considerable improvement over the YRI-CEU and the CEU-JPT datasets, when compared to SABER or STRUCTURE. For the JPT-CHB dataset, LAMP is worse than SABER, but LAMP-ANC achieves a higher accuracy than SABER.

The accuracy of each of the methods varies across the population. We therefore measured the average accuracy in predicting the ancestries for each of the individuals. Figure 2 shows the cumulative distribution function of the accuracies achieved by each of the methods across the set of 500 individuals. As can be seen from the figure, the improvement of LAMP compared to STRUCTURE and SABER is quite significant. For the YRI-CEU dataset, when measuring the percentage of individuals that are predicted with an accuracy of at least 90%, LAMP achieves 90%, whereas SABER and STRUCTURE achieve less than 10%. In general, the accuracy in the predictions that STRUCTURE makes has a higher variance than the predictions made by SABER and LAMP. On the CEU-JPT dataset, LAMP is more accurate than SABER. On the JPT-CHB dataset, SABER performs considerably better than LAMP; this is probably due to the fact that the ancestral populations, which are given to SABER but not to LAMP, are too similar to distinguish within a window; because LAMP-ANC uses the allele frequencies of the ancestral individuals as input while still inferring ancestries over entire windows, it is more accurate than SABER.

Table 1 also shows that LAMP achieves a gain in running time of at least two orders of magnitude. We found that on a single computer, LAMP and LAMP-ANC take less than 30 s to run on a 4000 SNP block and less than 7 min to run on the entire set of 38,864 SNPs.

These experiments suggest that LAMP is especially useful when the ancestral populations are sufficiently different from each other (e.g., CEU and YRI). In those cases, it is actually not essential to genotype the ancestral individuals because we observe that LAMP-ANC and LAMP achieve similar accuracy levels. When the populations are closer (e.g., CHB-JPT), even for a modest number of generations of mixing (in our case, seven generations), none of the methods performs well, even when the ancestral populations are given.

### Inferring Individual Admixture

Current studies often use the individual admixture of each individual across the genome to correct for population stratification.^{17,28–30} The individual admixture of an individual is defined by the proportion of ancestors of the individual from each of the ancestral populations. For instance, for an individual with a mother from CEU and a father from YRI, the individual admixture would be 50% YRI and 50% CEU.

Even though LAMP is designed to estimate the locus-specific ancestry, we can use it to find the individual admixture. We compare the estimates of the individual admixture obtained from LAMP with those from STRUCTURE. We used the YRI-CEU dataset with *g* = 7 and α = 0.20. We picked 4318 equally spaced SNPs from chromosome 1. This roughly matches the number of SNPs required to distinguish nonadmixed individuals from even very closely related subpopulations.^{31} We ran STRUCTURE on this set of SNPs with 10,000 burn-in iterations and 50,000 iterations with the NOLINKAGE model and the NOADMIX mode option set to 0. We ran LAMP on the entire chromosome and then calculated the locus-specific ancestry of each individual by averaging the ancestries predicted across the same set of 4318 SNPs given to STRUCTURE. As shown by Figure 3, LAMP consistently achieves considerably better estimates for the individual admixture. In particular, the average error rate for LAMP is 2.1%, whereas the average error rate for STRUCTURE is 5.4%. The difference in the performance between the methods is statistically significant (Wilcoxon signed rank test p value of 9.9 × 10^{−51}). This experiment suggests that because LAMP is more than 600 times faster than STRUCTURE (see Table 1), it would be better to use LAMP across the entire genome to infer the individual admixture than to use STRUCTURE across a smaller set of ancestry-informative markers (AIMs). We also inferred the individual admixture by using the LINKAGE model in STRUCTURE but found that this gave a significantly higher average error rate of 9.0%.

Another method for the inference of the individual admixture is EIGENSTRAT. We ran EIGENSTRAT on the SNPs used above and chose the largest eigenvector. We obtained the ancestries of the individuals by scaling the entries of the eigenvector to the interval [0, 1]. We found this procedure to result in individual admixtures with an average error rate of 13.4%. When we included ten ancestral individuals each from the HapMap YRI and CEU populations, the average error was reduced to 4.1% (Wilcoxon signed rank test p value of 1.3 × 10^{−83}). The use of all 38,864 SNPs decreased the average error to 11.1% and 3.6%, respectively.

### LAMP's Performance across Three Admixed Populations

When more than two populations are mixed, de novo inference of the locus-specific ancestry is a more challenging task. We therefore compared LAMP-ANC, which uses the genotypes from the ancestral populations, to SABER, on a dataset generated by the mixing of three populations (YRI, CEU, and JPT). We mixed these populations in the ratio 0.4:0.4:0.2 for seven generations. Figure 4 shows the ancestry estimates of LAMP-ANC for one of the individuals. LAMP-ANC accurately infers the ancestry over most of the chromosome, and it is clear that qualitatively the estimates are very close to the true ancestry. To give a more quantitative measure for the accuracy of LAMP-ANC, we calculated the cumulative distribution function of the accuracies for each individual of LAMP-ANC and of SABER (see Figure 5). Evidently, LAMP-ANC achieves a significantly better accuracy than SABER across the population (average accuracies of 92% and 74%, respectively).

### Empirical Robustness of LAMP

The performance of LAMP clearly depends on the nature of the data, on the number of generations *g*, and on α. We varied *g* for a simulated YRI-CEU admixed population with the fraction of CEU α = 0.20. As shown in Figure 6, even when *g* is as large as 20, LAMP reaches an accuracy of 88%, and LAMP-ANC reaches an accuracy of 93%. For more realistic values of *g*, (i.e., *g* < 10) the accuracy of LAMP is above 93%.

To measure the effect of α on the performance of LAMP, we measured the performance for simulated data with *g* = 7 for different values of α (see Figure 6). We observe that LAMP performs well for values of α of up to 0.40 with its accuracy remaining above 90%, and its performance drops sharply to a little above 50% accuracy for α = 0.5.

Finally, we measured the effect of the distance between the ancestral populations by comparing the accuracy of LAMP across the YRI-CEU, CEU-JPT, and the JPT-CHB datasets. As shown in Table 1 (see also Figure 6), LAMP is quite accurate on the CEU-JPT and the YRI-CEU datasets, but its performance is quite poor on the JPT-CHB dataset. In such a situation, the availability of allele frequencies is essential for accurate inference, because we observe that LAMP-ANC maintains an accuracy of around 70%.

### Robustness to Parameter Settings

Because LAMP requires as an input the values of α and *g*, verification that inaccurate estimates of these parameters do not affect the results significantly is important. We tested LAMP by benchmarking it over the simulated YRI-CEU dataset, with true values of *g* = 7 and α = 0.2. We ran LAMP on this dataset with different erroneous input values of *g* and α. In Figure 7, we observe that if the number of generations *g* is mistakenly given to LAMP as four or larger, then the accuracy of LAMP is kept reasonably high, and in particular it is at least 90%. On the other hand, it seems that if the input α is very different from the true α, LAMP can perform quite poorly. For instance, when the input α is set to 0.4 instead of 0.2, the accuracy level is about 85%. However, because α is a single parameter across all individuals, standard methods such as STRUCTURE^{12} give reasonable accuracy for α (e.g., the estimates for the YRI-CEU dataset are between 0.17 and 0.24 across ten runs), we can safely assume that the error in the prior estimate of α is within a factor of 0.5, in which case LAMP maintains a very good performance.

The model used in LAMP requires the SNPs to be independent. To ensure this, we discard SNPs with *r*^{2} above a threshold. We empirically chose a threshold of 0.10 for *r*^{2} so that we retain a majority of the SNPs. However, as shown in Figure 8, the accuracy of LAMP does not change much even when this threshold is raised so that the SNPs retained are no longer independent. The accuracy begins to decrease only at stringent thresholds below 0.005 because of the algorithm's tendency to discard informative SNPs. We also examined the impact of the sample size on the ancestry estimates. Although an increase in sample size might lead to SNPs being significantly linked even when the mutual *r*^{2} is small, for practical purposes, such SNPs are essentially uncorrelated. Thus, LAMP is also robust to the sample size, as shown in Figure 8.

Finally, we measured the effect of the method used to simulate the data on the different algorithms. To achieve this, we amplified the HapMap haplotypes for YRI and CEU populations by using the model of Li and Stephens.^{32} Briefly, the Li and Stephens model generates additional haplotypes based on the ones already observed. The newly generated haplotypes are composed from previous ones, assuming mutation and recombination. The recombination rate in this model depends on the number of observed haplotypes, such that the rate is higher when less haplotypes are observed. This reduces the recurrent sampling of haplotypes and, as was shown by Li and Stephens, mimics more accurately the generation of haplotypes. This resulted in a set of 10,000 ancestral individuals; this set then underwent admixture with *g* = 7 and α = 0.20, as described earlier. On this new dataset, the accuracies obtained by LAMP, LAMP-ANC, and SABER were 94.72%, 94.70%, and 89.09%, respectively. The accuracies are close to the accuracies obtained on the YRI-CEU dataset described in Table 1.

Because the ancestral allele frequencies used in LAMP-ANC were estimated from the same data that were used for the generation of the admixed datasets, there is a potential risk of over fitting. To make sure that this is not the case, we partitioned the fouding YRI and CEU populations into two equal-sized sets. We chose one of the two sets from each population to generate a YRI-CEU dataset with parameters *g* = 7 and α = 0.20. Ancestral allele frequencies were estimated from the other set. The accuracy of LAMP-ANC in this setting was 94.06%, which is very close to the previous estimates obtained. Running the same procedure on the amplified datasets gave an accuracy of 94.44%, and thus we conclude that the results were not due to over fitting.

## Discussion

We have presented a new method, LAMP, for de novo estimation of locus-specific ancestry in recently admixed populations. Unlike previous methods for locus-specific ancestry (e.g., SABER), LAMP does not use any information about the ancestral populations (i.e., it estimates the ancestries de novo). We show that LAMP is analytically justified and that it achieves significant improvements over existing methods both in terms of accuracy of prediction and speed. In particular, LAMP can easily be applied to whole-genome datasets, and the resulting locus-specific ancestries can be estimated within a few hours.

De novo estimation of the locus-specific ancestries is sometimes infeasible, especially when the ancestral populations are very close to each other (e.g., CHB and JPT). We therefore extended LAMP to a method called LAMP-ANC, which uses additional genotypes from the ancestral populations as priors. This approach has been shown to be useful before by methods such as SABER.

When compared to previous methods, LAMP is shown to achieve significantly better accuracy than other methods (SABER and STRUCTURE). The increase in accuracy might be crucial when one is trying to correct for population stratification in studies that involve recently admixed populations, as well as in studies that are based on admixed mapping. Furthermore, improved accuracy in the locus-specific ancestry estimation has potential applications in finding better signals for selection and other events across the genome.

Although LAMP relies on a knowledge of the parameters *g* and α, we have shown the robustness of the ancestry estimates to inaccuracies in these parameters. These parameters control the window size. As the window size is decreased, each window might contain fewer informative SNPs. On the other hand, errors in classifying individuals who have breakpoints within a window are reduced. This tradeoff is illustrated in Figure 7, where we see that the ancestry estimates are robust when *g* is overestimated. In practice, we would therefore recommend the use of an upper bound on *g* when *g* cannot be estimated accurately. Furthermore, *g* might actually be a more complex parameter—for example, if some portions of the admixed population have admixed for *g*_{1} generations and other portions have been admixed for only *g*_{2} generations, where *g*_{2} is smaller than *g*_{1}. In this case, *g* is set to be *g*_{1}, and more accurate results are expected than if the whole population has admixed for exactly g1 generation.

The fact that the LAMP algorithm performs better on the unbalanced case (α << 0.5) than on the balanced case seems counterintuitive at first. The reason for this phenomenon is the fact that a small α helps to break the symmetry. Even if all windows were perfectly clustered, the combination of the solutions of the different windows into one integrated solution is not a simple task when α = 0.5 because of symmetry. That is, after clustering the individuals in every given window, we are still left with the problem of deciding which cluster is population 1 and which one is population 2. If α < 0.5, then this decision is easier because the smaller cluster could be labeled as population 1 and the larger cluster as population 2.

Further, it is interesting to note that even though SABER models the linkage disequilibrium (LD) structure whereas LAMP does not, it appears that LAMP performs better than SABER. This could be attributed to several possible reasons. First, it might be that the LD structure only adds slightly to the information captured by the independent SNPs. Second, it might be that optimization of the model in SABER is a harder task than optimization of the model in LAMP because of the larger number of parameters, and thus SABER might potentially not converge to the global optimum of its parameter space.

A simple extension to LAMP can be used to infer the individual admixture. As we show here, the resulting estimates of the individual admixture are considerably better than the estimates achieved by STRUCTURE or EIGENSTRAT. A number of recent studies have produced panels of AIMs in admixed populations;^{33–36} AIMs are SNPs that have differing frequencies in the ancestral populations. It is possible that the AIMs might be used to improve the accuracy of individual admixture prediction done by STRUCTURE or other methods, including LAMP. However, the AIMs have disadvantages because there is a risk of over fitting, and the studied population might be somewhat different than the population for which the AIMs were found. As we show here, in an era where the genotyping technology is getting cheaper, it is useful to use the entire set of genotyped SNPs in the analysis of population stratification.

There are many possible improvements to this work, and in particular it would be important to improve the current methods in the case of very similar ancestral populations, or when more than two populations are involved. Furthermore, removing the dependency of the method on the input parameters (e.g., the number of generations *g*, or the admixture fraction α) might be quite useful for the generation of a rigorous statistical test for admixing. Additional improvements in the running time can be achieved by the parallelization of the LAMP algorithm. This is straightforward in our case because each window could be run independent of the others, and the results for windows overlapping a SNP could be aggregated in a final step.

## Appendix

### Correctness of MAXVAR

In this section, we analyze the correctness of the MAXVAR algorithm. We have two populations, *A*_{1} and *A*_{2}. We denote α = α_{2} as the admixture fraction of *A*_{2}—the smaller of the two populations. The MAXVAR algorithm classifies the individuals into three types of ancestries, i.e., *A*_{1}*A*_{1}, *A*_{1}*A*_{2}, and *A*_{2}*A*_{2}. The algorithm works by first picking a specific individual termed a pivot and then clustering individuals on the basis of their similarity to the pivot. We show that when the the populations are significantly different from each other, the pivot will have an ancestry *A*_{2}*A*_{2} with high probability. In this case, we show that one can define a similarity score S (as defined in the Material and Methods), such that the individuals who are also of ancestry *A*_{2}*A*_{2} have positive similarity score to the pivot, whereas those with ancestry *A*_{1}*A*_{1} have negative similarity scores in expectation. Thus, the individuals with the smallest (1 − α)^{2}*n* values of the similarity score are assigned an ancestry of *A*_{1}*A*_{1}, the largest α^{2}*n* values are assigned an ancestry of *A*_{2}*A*_{2}, and the rest are assigned *A*_{1}*A*_{2}.

Let ${p}_{{A}_{1}{A}_{1}},{p}_{{A}_{1}{A}_{2}},\phantom{\rule{0.25em}{0ex}}\text{and}\phantom{\rule{0.25em}{0ex}}{p}_{{A}_{2}{A}_{2}}$ be the frequencies of individuals of the three types in the population. We assume that ${p}_{{A}_{1}{A}_{1}}={(1-\alpha )}^{2}$, ${p}_{{A}_{2}{A}_{2}}={\alpha}^{2}$, and ${p}_{{A}_{1}{A}_{2}}=2\alpha (1-\alpha )$. Let *p _{k}* and

*q*be the minor allele frequencies of population

_{k}*A*

_{1}and

*A*

_{2}, respectively, in position

*k*. Furthermore, we assume that the values of

*μ*and

_{k}*σ*(as defined in the Material and Methods) are constants and that

_{k}*μ*= 2(1 − α)

_{k}*p*+ 2

_{k}*αq*, ${\sigma}_{k}^{2}=2{(1-\alpha )}^{2}{p}_{k}(1-{p}_{k})+2\alpha (1-\alpha )[{p}_{k}(1-{p}_{k})+{q}_{k}(1-{q}_{k})]+2{\alpha}^{2}{q}_{k}(1-{q}_{k}).$ Note that by simplifying the above, one gets that ${\sigma}_{k}^{2}=2(1-\alpha ){p}_{k}(1-{p}_{k})+2\alpha {q}_{k}(1-{q}_{k})$, and therefore it is of the same order of magnitude as

_{k}*μ*for common SNPs. If the number of individuals is large enough, the variance is quite low, and therefore this is not a restrictive assumption. We define the distance between the two populations as $W=\sum _{k}\frac{{\left({p}_{k}-{q}_{k}\right)}^{2}}{{\sigma}_{k}^{2}}$. Under these assumptions, it is easy to see that if

_{k}*aa*,

*ab*, and

*bb*are three given individuals with ancestry

*A*

_{1}

*A*

_{1},

*A*

_{1}

*A*

_{2}, and

*A*

_{2}

*A*

_{2}, respectively, in the window, then the expected similarity score S between pairs of individuals is

where *aa*′, *ab*′, and *bb*′ are individuals with ancestries *A*_{1}*A*_{1}, *A*_{1}*A*_{2}, and *A*_{2}*A*_{2}, but they are different individuals than *aa*, *ab*, and *bb*. From this, it is easy to verify that the sum of squares of expectations over all individuals that are different from *aa*, *ab*, and *bb* can be approximated as:

The only reason for the approximation is that the number of individuals with ancestry *A*_{2}*A*_{2} that are different from *bb* is (1 − α)^{2}*n* − 1, whereas we consider it as (1 − α)^{2}*n*. This approximation is not restrictive when the number of individuals is reasonably large.

Similarly, it is easy to verify that if *m* is the number of SNPs, then the following holds:

The bounds *O*(*m*) follow from the fact that ${\mu}_{k}=\Theta \left({\sigma}_{k}^{2}\right)$ for every common SNP with minor allele frequency bounded away from zero. Therefore, if we assume that $W>>\sqrt{m}$ (i.e., the populations are distant from each other), we get that for every individual *x*, $\sum _{i:i\ne x}E{\left[S(i,aa)\right]}^{2}>>\sum _{i:i\ne x}Var\left[S(i,aa)\right]$. On the other hand,

From Equation 10, we conclude that the expectation of the square of distances of the population from an individual from *A*_{2}*A*_{2} is larger than from and individual from *A*_{1}*A*_{2}, or *A*_{1}*A*_{1}. If the number of individuals is large enough, the distribution is concentrated around the mean, and thus we expect an *A*_{2}*A*_{2} individual to be chosen as the pivot. In that case, by Equation 9, the ordering of the individuals according to their similarity to the pivot should give the correct clustering with a fraction of errors exponentially small in W.

### Accuracy of the Window Length and the Majority Vote

For a given window, the analysis in the Choosing the Window Length section shows that the expected fraction of individuals with no breakpoints is 1 − γ. Here, we strengthen this analysis under the assumption that the errors in the predictions of the different windows are independent.

It is easy to see that the expected fraction of individuals with two or more breakpoints in a window is smaller than γ^{2}. For a given individual with a breakpoint in position *i*, we denote the ancestry by $({A}_{{s}_{1}},{A}_{{s}_{2}},i,{A}_{{s}_{3}})$, where ${A}_{{s}_{1}}$ is the ancestry of the nonrecombinant chromosome and ${A}_{{s}_{2}}$ and ${A}_{{s}_{3}}$ are the ancestries of the recombinant chromosome. We assume that the probability to classify such an individual as ${A}_{{s}_{1}}{A}_{{s}_{2}}$ is $\frac{i}{l}$ and the probability to classify it as ${A}_{{s}_{1}}{A}_{{s}_{3}}$ is $1-\frac{i}{l}$. There are *l* windows that overlap with any SNP. Consider a SNP that is a distance *d* away from a breakpoint. Let *X* be the number of times that the SNP is incorrectly classified as ${A}_{{s}_{1}}{A}_{{s}_{3}}$. Clearly,

With a Chernoff bound,^{37} the probability of incorrectly classifying this SNP after the majority vote is

In the case that there are no other breakpoints within distance *l* from the breakpoint considered, the expected number of errors around the breakpoint for the individual is bounded by

If there are breakpoints within distance *l* of each other, we take the worst-case assumption that all windows containing the two breakpoints make erroneous predictions over their entire length *l*. Because the expected fraction of breakpoints in an individual is $\frac{\gamma}{l}$ and the expected fraction of pairs of breakpoints that are of distance smaller than *l* is at most $\frac{{\gamma}^{2}}{l}$, we can bound the expected fraction of errors as $\frac{\gamma}{\sqrt{\pi l}}+{\gamma}^{2}=4\sum _{i<j}{\mathit{\alpha}}_{i}{\mathit{\alpha}}_{j}(g-1)r(\sqrt{\frac{l}{\pi}}+4\left(\sum _{i<j}{\mathit{\alpha}}_{i}{\mathit{\alpha}}_{j}\right)(g-1)r{l}^{2})$. On the basis of this analysis, a sufficient condition to achieve a desired error rate of is to have

### Estimate of Window Length

The window length derived in Equation 8 bounds the classification errors within each window to a desired error rate . Because all SNPs within a window are assigned the same ancestry, *any* algorithm that is used within this window will incur some errors in the presence of breakpoints. Hence, the window length was calculated under the assumption that the classification algorithm within the window was the most accurate possible, i.e., any errors in the classification were only a result of breakpoints within a window. Here, we empirically show that for the window lengths computed with Equation 8, the average classification error for a most accurate classification is bounded by the error rate , which is set to 0.10.

Within each window, the most accurate ancestry assignment is inferred with the assumption that the true ancestries are known. The most accurate assignment consists of assigning to an individual the ancestry found in a majority of the SNPs in that window. Thus, an individual who has no breakpoints is always correctly classified, whereas an individual with a breakpoint at position $i<\lceil \frac{l}{2}\rceil $ in a window of length *l* and ancestries ${A}_{{s}_{1}}$ and ${A}_{{s}_{2}}$ on either side of the breakpoint will have errors in positions {1,…,*i*}. The error rate for a window is the fraction of positions that are incorrectly classified in the window.

We computed the average of these errors in overlapping windows that span chromosome 1 of the YRI-CEU dataset for different values of *g* and α. We see in Figure 9 that the average error is below . However, the variance of the estimates (indicated by the minimum and the maximum fraction of errors) increases with larger *g* or with $\mathit{\alpha}\to 0.5$. The window-size estimates seem to provide a good bound on the average fraction of errors due to breakpoints.

### Practical Issues in Implementing LAMP

In this section, we describe some of the issues that we faced while implemeting LAMP. One of the issues that we needed to address was how to determine the degree of overlap between adjacent windows. An extreme degree of overlap would require adjacent windows to differ in a single SNP. In practice, we found that a smaller degree of overlap, where consecutive windows overlapped in a fraction *c* = 80% of their length, did not significantly change the accuracy while resulting in faster running times. The overlap between adjacent windows can be exploited to further improve the running time. Although using the MAXVAR algorithm to obtain an initial classification, each window requires a computation of the similarity score between all pairs of individuals. The similarity score is computed with an inner product of the normalized genotypes, as described in the Initializing the Clusters section. Instead of computing these similarity scores over entire windows of length *l*, we can compute these scores over chromosomes of length (1 − *c*)*l*. The similarity score in a new window can then be computed from that of the previous window by adjusting for the nonoverlapping regions.

As we mentioned at the end of the Choosing the Window Length section, the window length calculation should take into account the distance between the two populations. This is feasible when the ancestral genotypes are known. In this scenario, the accuracy of the classification for a given window length can be obtained by running LAMP-ANC on the ancestral genotypes. With an increase in the window length, this accuracy is exptected to increase. On the other hand, the errors due to breakpoints, as given in Equation 8, increase with window length. We can then search for the window length that maximizes the product of the fraction of individuals who do not have breakpoints and the fraction of these individuals who are accurately classified. For populations that are well separated, such as YRI-CEU and CEU-JPT, we find that the number of SNPs needed to accurately classify a nonadmixed individual is much smaller than the length of the window obtained from Equation 8, so that it suffices to simply set the window length to the latter estimate.

## Web Resources

The URLs for data presented herein are as follows:

- LAMP program, http://lamp.icsi.berkeley.edu/lamp/
- Hapmap project, http://www.hapmap.org
- Affymetrix GeneChip Human Mapping 500K Array Set, http://www.affymetrix.com/products/arrays/specific/500k.affx

## Acknowledgments

G.K, E.H, and S. Sankararaman were supported by National Science Foundation (NSF) grant IIS-0513599S. S. Sankararaman was also partially supported by grant R33 HG003070. G.K. was also partially supported by the Rothschild fellowship. S. Sridhar was supported by NSF grant IIS-0612099. E.H. was also partially supported by NSF grant IIS-0713254.

## References

**American Society of Human Genetics**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (873K) |
- Citation

- Using ancestry-informative markers to define populations and detect population stratification.[J Psychopharmacol. 2006]
*Enoch MA, Shen PH, Xu K, Hodgkinson C, Goldman D.**J Psychopharmacol. 2006 Jul; 20(4 Suppl):19-26.* - Population structure, admixture, and aging-related phenotypes in African American adults: the Cardiovascular Health Study.[Am J Hum Genet. 2005]
*Reiner AP, Ziv E, Lind DL, Nievergelt CM, Schork NJ, Cummings SR, Phong A, Burchard EG, Harris TB, Psaty BM, et al.**Am J Hum Genet. 2005 Mar; 76(3):463-77. Epub 2005 Jan 19.* - Matching strategies for genetic association studies in structured populations.[Am J Hum Genet. 2004]
*Hinds DA, Stokowski RP, Patil N, Konvicka K, Kershenobich D, Cox DR, Ballinger DG.**Am J Hum Genet. 2004 Feb; 74(2):317-25. Epub 2004 Jan 21.* - Design and analysis of admixture mapping studies.[Am J Hum Genet. 2004]
*Hoggart CJ, Shriver MD, Kittles RA, Clayton DG, McKeigue PM.**Am J Hum Genet. 2004 May; 74(5):965-78. Epub 2004 Apr 14.* - Analyses of genetic ancestry enable key insights for molecular ecology.[Mol Ecol. 2013]
*Gompert Z, Buerkle CA.**Mol Ecol. 2013 Nov; 22(21):5278-94. Epub 2013 Sep 19.*

- Local and Global Ancestry Inference, and Applications to Genetic Association Analysis for Admixed Populations[Genetic epidemiology. 2014]
*Thornton TA, Bermejo JL.**Genetic epidemiology. 2014 Sep; 38(0 1)S5-S12* - Vitamin D receptor (VDR) Polymorphisms and Risk of Ovarian Cancer in Caucasian and African American women[Gynecologic oncology. 2013]
*Grant DJ, Hoyo C, Akushevich L, Iversen ES, Whitaker R, Marks J, Berchuck A, Schildkraut JM.**Gynecologic oncology. 2013 Apr; 129(1)173-178* - Admixture mapping identifies a locus at 15q21.2-22.3 associated with keloid formation in African Americans[Human genetics. 2014]
*Velez Edwards DR, Tsosie K, Williams SM, Edwards TL, Russell SB.**Human genetics. 2014 Dec; 133(12)1513-1523* - Fast individual ancestry inference from DNA sequence data leveraging allele frequencies for multiple populations[BMC Bioinformatics. ]
*Bansal V, Libiger O.**BMC Bioinformatics. 16(1)4* - Lupus risk variants in the PXK locus alter B-cell receptor internalization[Frontiers in Genetics. ]
*Vaughn SE, Foley C, Lu X, Patel ZH, Zoller EE, Magnusen AF, Williams AH, Ziegler JT, Comeau ME, Marion MC, Glenn SB, Adler A, Shen N, Nath S, Stevens AM, Freedman BI, Tsao BP, Jacob CO, Kamen DL, Brown EE, Gilkeson GS, Alarcón GS, Reveille JD, Anaya JM, James JA, Moser KL, Criswell LA, Vilá LM, Alarcón-Riquelme ME, Petri M, Scofield RH, Kimberly RP, Ramsey-Goldman R, Binjoo Y, Choi J, Bae SC, Boackle SA, Vyse TJ, Guthridge JM, Namjou B, Gaffney PM, Langefeld CD, Kaufman KM, Kelly JA, Harley IT, Harley JB, Kottyan LC.**Frontiers in Genetics. 5450*

- Cited in BooksCited in BooksPubMed Central articles cited in books
- MedGenMedGenRelated information in MedGen
- PubMedPubMedPubMed citations for these articles

- Estimating Local Ancestry in Admixed PopulationsEstimating Local Ancestry in Admixed PopulationsAmerican Journal of Human Genetics. 2008 Feb 8; 82(2)290

Your browsing activity is empty.

Activity recording is turned off.

See more...