PLoS One. 2009; 4(12): e7928.
Published online Dec 1, 2009. doi:  10.1371/journal.pone.0007928
PMCID: PMC2779848

Laplacian Eigenfunctions Learn Population Structure

Dennis O'Rourke, Editor

Abstract

Principal components analysis has been used for decades to summarize genetic variation across geographic regions and to infer population migration history. More recently, with the advent of genome-wide association studies of complex traits, it has become a commonly used tool for detection and correction of confounding due to population structure. However, principal components are generally sensitive to outliers, and there has recently been concern about their interpretation. Motivated by geometric learning, we describe a method based on spectral graph theory. Regarding each study subject as a node with suitably defined weights for its edges to close neighbors, one can form a weighted graph. We suggest using the spectrum of the associated graph Laplacian operator, namely, Laplacian eigenfunctions, to infer population structure. In simulations and real data on a ring species of birds, Laplacian eigenfunctions reveal more meaningful and less noisy structure of the underlying population, compared with principal components. The proposed approach is simple and computationally fast. It is expected to become a promising and basic method for population genetics and disease association studies.

Introduction

Principal Components Analysis (PCA) is a classical statistical tool to achieve dimension reduction through consideration of linear combinations of the original variables. The top few principal components (PCs) are the linear combinations that explain the greatest amount of variation in the data. The use of PCA in population genetics has a long history, including early work of Cavalli-Sforza and colleagues [1], [2], who considered high dimensional genetic variants from population samples at many different continental locations and used the top PCs to summarize the genetic variation across space. While legitimate concerns have been raised about the interpretation of such PC maps [3], PCA can still provide useful information and is a commonly used tool in various contexts of genetic data analysis [4]. For example, there is known to be a close connection between the spectral decomposition of the migration matrix and that of the genetic covariance matrix [5]. More recently, in genome-wide disease association studies, PCA has been employed to detect and correct population stratification [6]–[8], in which systematic ancestry differences between cases and controls can lead to false positive association between phenotype and genotype. Such spurious associations [9]–[11] can occur when the disease frequency varies across subpopulations, resulting in affected individuals being more likely than unaffected individuals to be sampled from certain subpopulations [12]. Though this topic has been extensively studied, PCA has advantages [6] over other methods such as genomic control [13] and structured association [14].

Motivated by geometric learning [15], we describe LAPSTRUCT, a Laplacian eigenfunction approach based on graph theory, which we briefly introduced at Genetic Analysis Workshop (GAW) 16 [16]. One regards each subject as a vertex of a weighted graph [17], where the weight associated with the edge for each pair of subjects is chosen as a function of their genetic relatedness, with higher weight given when individuals are genetically closer (see Methods). Thus, in this context, one thinks of the distance between each pair of subjects as being based on their degree of genetic relatedness, not on their geographical proximity. The resulting adjacency graph approximates the underlying manifold of the dependence structure of the sample. The eigenfunctions of the Laplace-Beltrami operator [18] on the manifold are generalized geometric harmonic functions, which contain useful intrinsic geometric structure information on the population. The eigenvectors of the associated graph Laplacian matrix (see Methods) are first-order linear approximations of the Laplacian eigenfunctions, and they relate to the intrinsic dependence structure of the data. The Laplacian eigenmap formed by embedding each subject into a lower dimensional Euclidean space via the top few eigenfunctions has a locality preserving property, that is, the distance between a pair of subjects in the Laplacian eigenmap reflects the degree to which they are correlated. The more they are correlated, the closer together they are mapped. As a result, the Laplacian eigenmap leads to cluster-like structures for subjects who either come from the same discrete subpopulation or share more common ancestry in an admixed population.

The Laplacian eigenfunction method is part of a large class of spectral methods that includes PCA as a special case. However, the approach we use improves on PCA in that each vertex is connected by edges to only its close neighbors, rather than to all other individuals (where, here, closeness refers to genetic relatedness rather than physical proximity). A justification for this results from the connection between spectral clustering and approximate solutions to graph cut problems (see previous work [19], [20] for details). The result is that the Laplacian eigenfunction method tends to emphasize substructure that affects many data points rather than just a few extreme points, so the proposed nonlinear algorithm is robust to outliers, in contrast to PCA. Therefore we suggest using Laplacian eigenvectors instead of PCs to study population structure. A similar approach based on spectral graph theory is also treated by Lee et al. [20] with a nice illustration on the POPRES data [21], but with different choices of weight and data renormalization (see Methods and Discussion).

The proposed method, LAPSTRUCT, arose from the idea of studying the geometry of the intrinsic dependence structure of sample populations, which can be regarded as a weighted graph together with a metric measuring the degree of relatedness for each pair of individuals. The paradigm of the approach is that local infinitesimal structure integrates up to global macroscopic structure. Another interpretation of this is to define a random walk on the weighted graph constructed above, with a suitably normalized transition probability between two nodes reflecting their connectivity. Then one can use the top spectrum of the Markov transition matrix to map the data to a lower dimensional Euclidean space. This idea has clear antecedents in earlier work in population genetics (e.g. [5]).
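The random-walk reading can be made concrete with a short worked identity. The notation here is an assumption borrowed from the standard graph Laplacian setup rather than taken from this paragraph: $W$ is the symmetric matrix of edge weights, $D$ is the diagonal matrix of row sums of $W$, $L = D - W$, and $P = D^{-1}W$ is the transition matrix of the walk.

```latex
P = D^{-1} W = I - D^{-1} L,
\qquad
L f = \lambda D f
\;\Longleftrightarrow\;
P f = (1 - \lambda)\, f .
```

Under these assumptions the largest eigenvalues of $P$ pair with the smallest eigenvalues of the generalized Laplacian problem, so the top spectrum of the walk and the bottom spectrum of the Laplacian select the same eigenvectors.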

The results on both the Greenish warbler (a ring species) data set [3], [22] and a simulated data set with a spatially correlated population give better approximations to the true population structure than does PCA. Because Laplacian eigenfunctions are generalized harmonic functions, the patterns observed from the PC map on spatially correlated genetic data [3] are also present in the Laplacian eigenmap. Therefore, any hypotheses of historic migration suggested by LAPSTRUCT would require additional evidence before a conclusion is made.

Results

Simulation Study A

In our simulations, we compare the results of LAPSTRUCT with those of the PC-based method EIGENSTRAT [6]. Figure 1 illustrates the population structure detected by EIGENSTRAT and by LAPSTRUCT in the discrete population consisting of two subpopulations (see Methods). In this example, the population structure is perfectly captured by an ancestry vector with one entry per individual, having entry [formula missing in source] for each individual in population 1 and entry [formula missing in source] for each individual in population 2, where $n_1$ and $n_2$ are the total numbers of individuals from subpopulations 1 and 2, respectively, and [formula missing in source] (see Text S1 online for details). Both the PC and the Laplacian eigenvector appear to be approximating this ancestry vector, but the Laplacian approach is clearly giving a much more accurate approximation. While both approaches are effective at clustering the data, the more accurate approximation of the ancestry vector by the Laplacian approach suggests that ancestry should be more accurately accounted for in downstream analyses such as association mapping. In principle, this should increase power, though in our simulation the effect was slight (see Table 1). Figure 2 shows the population structure identified by EIGENSTRAT and by LAPSTRUCT in the admixed population. The PC map shows the expected uniform distribution of ancestry proportion. However, the Laplacian eigenmap shows a tendency to shrink the points toward two clear clusters, indicating the two ancestral populations. For disease association studies conducted in both simulations by simply replacing the PCs by Laplacian eigenfunctions in the regression setting introduced in reference [6], LAPSTRUCT performs as well as EIGENSTRAT (see Table 1).

Figure 1
Structure of a simulated discrete population.
Figure 2
Structure of a simulated admixed population.
Table 1
Simulated Association Testing.

Simulation Study B

The sensitivity of PC to outliers is illustrated by the analysis of the spatially correlated population that consists of subpopulations arranged on a circle and an additional isolated subpopulation. When 10 individuals from the isolated subpopulation are included in the sample, the top PC focuses on isolating those outliers, and the PC map based on the top two components does not capture the full structure of the data, missing the circle configuration of the population structure (see Figure 3). With the outliers removed from the sample, the PC map based on the top two PCs does give the ring shape of the population structure. In contrast, the Laplacian eigenmap based on two components identifies the full population structure even in the presence of outliers, demonstrating that it is much more robust to outliers than is PC. The additional smoothness in the Laplacian eigenmap compared to the PC map might be due to the fact that local correlation is weighted more highly, which gives a local smoothing effect.

Figure 3
Structure of a simulated ring population.

Phylloscopus trochiloides

Figure 4 below illustrates the population structure detected by the PCA and Laplacian methods, respectively, where one can more clearly observe the ring-shape structure in the Laplacian eigenmap, compared to the vague structure shown in the PC map.

Figure 4
Ring Structure of a real dataset.

Discussion

We have developed LAPSTRUCT, a Laplacian eigenfunction approach for detection and correction of population structure in genetic studies. LAPSTRUCT can be viewed as a robust alternative to PC-based methods such as EIGENSTRAT. Like PC, LAPSTRUCT naturally leads to population clusters according to the degree of genetic correlation among individuals. However, LAPSTRUCT is designed to be less sensitive to outliers than PC, emphasizing structure that affects many data points rather than just a few extreme points. LAPSTRUCT can reveal less noisy and richer structure at different scales by varying the parameters. It is expected to become a promising tool for population genetics.

In the simulation studies, the top Laplacian eigenfunctions identify the overall structure, while the PC approach has a tendency to highlight outliers, when they are present. For example, in the spatial simulation with outliers, PC requires three components to find the ring structure, while the Laplacian eigenfunction approach finds the ring structure with only two components. This suggests that the Laplacian eigenfunction approach could be more useful than the PC approach in contexts such as association mapping in which it is desirable to capture the population structure with as few components as possible, in order to preserve power. Additionally, only those eigenfunctions for which cases and controls have significantly different distributions need to be accounted for in the setting of association mapping, and including unnecessary eigenfunctions will lead to power loss. Further investigation in this direction is encouraged.

The Laplacian eigenmap approach we describe is part of a more general setting of spectrum-based dimension reduction techniques that includes the PC approach. The appropriate choice of the neighborhood parameter, $\epsilon$, is what causes the Laplacian eigenmap to be less sensitive to outliers than PC. When $\epsilon$ is sufficiently large, the Laplacian eigenmap approach and the PC approach can produce very similar results. As $\epsilon$ is decreased, the Laplacian eigenmap can capture the local dependence structure at different scales. In practice, $\epsilon$ should be chosen reasonably large to make the graph connected and maintain valid type I error for association studies. For example, $\epsilon$ could be chosen as a quantile at some suitable level [symbols missing in source]. An alternative on the scale of neighborhood is to select each subject's $k$ closest neighbors in terms of correlation for some reasonably large integer $k$. To avoid the issue of tuning parameter selection, Lee et al. [20] simply take [formula missing in source] if [condition missing in source], otherwise [formula missing in source]. Generally there is room for different choices of weights which may give close performance, and the optimal weight is worth further investigation. The thresholding technique seems appropriate and has been widely accepted. It reduces the noise from less correlated samples. We incorporate this idea in the renormalization of the genotype data, where each individual's SNP is normalized using the local SNP frequency estimated from only those closely correlated individuals. We note this is appropriate when the data are abundant, and one would certainly use all data instead if the sample size were relatively small.
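The two practical requirements on the neighborhood parameter (pick it as a quantile, keep the graph connected) can be sketched as follows. This is a minimal illustration under stated assumptions, not the LAPSTRUCT implementation: the function names, the use of pairwise distances as the quantity being ranked, and the strict inequality `d < eps` for neighbors are all hypothetical choices.

```python
import numpy as np

def epsilon_from_quantile(d, q):
    """Return the q-th quantile of the off-diagonal pairwise distances.

    d is a symmetric (n, n) distance matrix; q is a level in [0, 1].
    Ranking distances (rather than correlations) is an assumption.
    """
    d = np.asarray(d, dtype=float)
    upper = np.triu_indices_from(d, k=1)      # each pair counted once
    return float(np.quantile(d[upper], q))

def is_connected(d, eps):
    """Check whether the graph with edges {i, j : d[i, j] < eps} is connected."""
    d = np.asarray(d, dtype=float)
    adjacency = d < eps                       # assumed neighbor rule
    seen, stack = {0}, [0]
    while stack:                              # depth-first search from vertex 0
        i = stack.pop()
        for j in np.flatnonzero(adjacency[i]):
            j = int(j)
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == d.shape[0]
```

A possible use is to raise the quantile level until `is_connected` first returns `True`, which keeps the neighborhood as local as the connectivity requirement allows.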

Materials and Methods

Phylloscopus Trochiloides (Greenish Warblers) Data

Greenish warblers are most abundant in western and eastern Siberia, where they form a ring species complex. The complex consists of two main populations connected by gene flow via a narrow band of populations to the south that are arranged in a ring around the Tibetan plateau. There is no mating between the two main populations where they overlap geographically, so greenish warblers can be regarded as inhabiting a one-dimensional habitat. Irwin et al. [22] collected 105 individuals from 26 geographic sites and each individual was typed for presence or absence at 62 amplified fragment length polymorphism (AFLP) markers.

Laplacian Eigenfunctions

Regard each individual $i$ as a vertex $v_i$ in a weighted graph, where $i$ runs from 1 to $n$. Let the weight $w_{ij}$ between individuals $i$ and $j$ be a Gaussian kernel [formula missing in source] if [condition missing in source] and [condition missing in source], and zero otherwise. Here $\epsilon$ and $t$ are some selected positive real numbers. The parameter $\epsilon$ measures the size of each subject's neighborhood. The constant $t$ stands for the global diffusion scale on the graph, and we set $t$ to [value missing in source] in all the computations. (For information on the effects of $\epsilon$ and $t$ on detection of population structure, see Figure S1 online.) The quantity $d_{ij}$ measures the distance between vertices $v_i$ and $v_j$. We set the distance $d_{ij}$ to [formula missing in source], where $r_{ij}$ is the estimator of genetic correlation [6] between individuals $i$ and $j$. Specifically, let [symbol missing in source] denote the genotype [formula missing in source] of individual $i$ at a given SNP. We normalize the vector of genotypes for that SNP by subtracting off its average, [formula missing in source], and then dividing each entry by [formula missing in source], where [symbol missing in source] is an estimate of the allele frequency at the SNP given by [formula missing in source]. (All missing entries are excluded from the computation.) Let [symbol missing in source] be the resulting normalized genotype for the SNP in individual $i$. Then we set [formula missing in source].
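The construction above (normalize genotypes, estimate pairwise correlation, turn correlation into thresholded Gaussian weights) can be sketched in a few lines. Several specifics are assumptions, because the formulas are not available in this copy: the smoothed allele frequency `(1 + sum) / (2 + 2n)` follows the style of reference [6], the distance `d = 1 - r`, the kernel `exp(-d**2 / t)`, the rule `d < eps`, and the exclusion of self-loops are all illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def genetic_correlation(g):
    """Estimate pairwise genetic correlation from a genotype matrix.

    g has shape (n_individuals, n_snps) with entries 0, 1 or 2 and no
    missing values (missing-data handling is omitted in this sketch).
    """
    g = np.asarray(g, dtype=float)
    n, m = g.shape
    mu = g.mean(axis=0)                            # per-SNP average genotype
    p = (1.0 + g.sum(axis=0)) / (2.0 + 2.0 * n)    # ASSUMED smoothed frequency
    x = (g - mu) / np.sqrt(p * (1.0 - p))          # normalized genotypes
    return x @ x.T / m                             # average over SNPs

def gaussian_weights(r, eps, t):
    """Thresholded Gaussian edge weights from a correlation matrix r."""
    d = 1.0 - np.asarray(r, dtype=float)           # ASSUMED distance d = 1 - r
    w = np.exp(-d ** 2 / t)                        # ASSUMED kernel form
    w[d >= eps] = 0.0                              # keep only close neighbors
    np.fill_diagonal(w, 0.0)                       # ASSUMED: no self-loops
    return w
```

With these choices, decreasing `eps` removes edges between weakly correlated individuals, and `t` controls how quickly the weight decays with distance among the remaining neighbors.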

To avoid the effects of population structure in the allele frequency estimation, the same idea leads to an alternative approach based on local SNP frequency estimation and genotype updating. Instead of estimating a single allele frequency per marker, we compute a local SNP frequency [symbol missing in source] for each individual $i$ at each SNP simply by including only those individuals whose correlation with individual $i$ is larger than [threshold missing in source]. That is, [formula missing in source]. Next we obtain the updated genotype matrix $G$ from the original genotype matrix $g$ by [formula missing in source].
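The local frequency idea can be illustrated with a small sketch. It covers only the frequency step, since the genotype update formula is not available in this copy. The function name, the `threshold` argument, the choice to let each individual count toward its own estimate, and the plain average of allele counts are all assumptions made for illustration.

```python
import numpy as np

def local_allele_frequencies(g, r, threshold):
    """Per-individual allele frequencies from closely correlated individuals.

    g: (n, m) genotype matrix with entries 0, 1 or 2.
    r: (n, n) genetic correlation matrix.
    Returns an (n, m) array whose row i averages the allele counts of the
    individuals j with r[i, j] > threshold (plus individual i itself).
    """
    g = np.asarray(g, dtype=float)
    neighbors = (np.asarray(r, dtype=float) > threshold).astype(float)
    np.fill_diagonal(neighbors, 1.0)                # ASSUMED: include self
    counts = neighbors.sum(axis=1, keepdims=True)   # contributors per row
    return neighbors @ g / (2.0 * counts)           # two alleles per genotype
```

When the threshold is low enough that everyone is a neighbor of everyone, each row reduces to the ordinary sample allele frequency, which matches the remark that all data would be used for small samples.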

Let $D$ be a diagonal matrix of size $n \times n$ with entries $D_{ii} = \sum_{j} w_{ij}$, a natural measure on the vertices, where $n$ is the number of individuals and $W = (w_{ij})$ is the matrix of edge weights. The Laplacian matrix on the graph is defined to be $L = D - W$. Note that $L$ is a symmetric and positive semidefinite matrix, and we restrict to the normalized version $D^{-1}L$, which is no longer symmetric. The eigenfunctions of the normalized equation $Lf = \lambda Df$ are denoted by $f_k$ for each $k = 0, 1, \ldots, n-1$, ranked in increasing order of their corresponding eigenvalues (the reverse of the PCA convention), i.e., $0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1}$. It is easy to see that $0$ is always an eigenvalue with constant eigenvector consisting of all 1's. These eigenfunctions generalize the low frequency Fourier harmonics on a manifold approximated by the graph. To achieve dimension reduction, the Laplacian eigenmap with first $K$ (usually small, 2 or 3) eigenvectors is defined by $i \mapsto (f_1(i), \ldots, f_K(i))$ for individual $i$. Note that the situation here is different from PCA, where one takes the PCs corresponding to the largest eigenvalues which account for the largest amount of variation in the data. The justification is given below. We remark that a symmetrically normalized version of $L$ is given by $D^{-1/2} L D^{-1/2}$. The Laplacian eigenmap using the corresponding spectrum gives comparable performance. For the relationship between these two versions, see [19].
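A Laplacian eigenmap of this kind can be sketched as follows. This is a minimal illustration of the standard construction, not the LAPSTRUCT code: it assumes a symmetric weight matrix with strictly positive row sums, takes the Laplacian to be the degree matrix minus the weight matrix, and solves the generalized eigenproblem through the symmetrically normalized matrix so that only NumPy's symmetric eigensolver is needed. The function name and the argument `k` are hypothetical.

```python
import numpy as np

def laplacian_eigenmap(w, k=2):
    """Embed the vertices of a weighted graph with k Laplacian eigenvectors.

    w: symmetric (n, n) matrix of non-negative edge weights.
    Returns (eigenvalues, coordinates) for the k smallest nontrivial
    solutions of L f = lambda D f, with L = D - W and D the degree matrix.
    """
    w = np.asarray(w, dtype=float)
    deg = w.sum(axis=1)                        # diagonal of D
    if np.any(deg <= 0):
        raise ValueError("every vertex needs at least one neighbor")
    lap = np.diag(deg) - w                     # L = D - W
    inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization D^{-1/2} L D^{-1/2} has the same eigenvalues
    # as the generalized problem L f = lambda D f.
    sym = inv_sqrt[:, None] * lap * inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(sym)           # eigenvalues in ascending order
    f = inv_sqrt[:, None] * vecs               # map back: f = D^{-1/2} u
    # Column 0 is the constant eigenvector (eigenvalue 0) and is skipped.
    return vals[1:k + 1], f[:, 1:k + 1]
```

Skipping the first column assumes the graph is connected, so that the zero eigenvalue is simple. On a disconnected graph the leading columns instead pick out the connected components.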

The Laplacian eigenmap approach we describe is part of a more general setting of spectrum-based dimension reduction techniques that includes the PC approach. The appropriate choice of the neighborhood parameter, $\epsilon$, is what causes the Laplacian eigenmap approach to be less sensitive to outliers than PC. When $\epsilon$ is sufficiently large, the Laplacian eigenmap approach and the PC approach can produce very similar results. This is shown in Figure 5 for the simulated discrete population model. As $\epsilon$ is decreased, the Laplacian eigenmap can capture the local dependence structure at different scales. See Figure S2 online for an illustration.

Figure 5
QQ-plot of PCA and Laplacian.

To apply the Laplacian eigenmap method to disease association studies, one can follow a multiple regression approach as in [6]. For example, one could regress genotypes and phenotypes on the top $K$ Laplacian eigenvectors for each individual, and then compute the adjusted [symbol missing in source] statistic of the residuals. In the simulations, we set $K$ equal to [value missing in source], in order to make a comparison with EIGENSTRAT.
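The regression adjustment can be sketched as below. This is an illustration under assumptions rather than the procedure used in the paper: the scaling of the statistic as (n - K - 1) times the squared correlation of the residuals follows the form usually attributed to EIGENSTRAT [6] and should be checked against that reference, and the function name and the added intercept column are hypothetical choices.

```python
import numpy as np

def adjusted_association_statistic(genotype, phenotype, eigvecs):
    """Association statistic after regressing out K eigenvectors.

    genotype, phenotype: length-n arrays for one SNP and one trait.
    eigvecs: (n, K) array of Laplacian eigenvectors (or PCs).
    Returns (n - K - 1) * corr(residual genotype, residual phenotype)^2,
    an ASSUMED form of the adjusted statistic.
    """
    genotype = np.asarray(genotype, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    eigvecs = np.asarray(eigvecs, dtype=float)
    n, k = eigvecs.shape
    design = np.column_stack([np.ones(n), eigvecs])   # intercept + eigenvectors

    def residual(y):
        # Least-squares fit of y on the design matrix, then subtract the fit.
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        return y - design @ coef

    rg, rp = residual(genotype), residual(phenotype)
    corr = (rg @ rp) / np.sqrt((rg @ rg) * (rp @ rp))
    return (n - k - 1) * corr ** 2
```

Because the same function accepts any matrix of covariates, swapping PCs for Laplacian eigenvectors only changes the `eigvecs` argument, which mirrors the substitution described in the text.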

Justification of Weight Kernel and Laplacian Eigenmap

The selected Gaussian weight is optimal in a certain sense, and it has a deep connection to the heat kernel on a manifold that gives the general solution to the heat equation. In the discrete case, the Laplacian of a function can be expressed as combinations of heat kernels which locally approximate the Gaussian kernel. For the mathematical details, see references [15], [17]. The locality preserving property of the Laplacian eigenmap follows from the fact that the cost function of a map on a weighted graph equals the Laplacian quadratic form of the map function, that is, $\sum_{i,j} w_{ij}\,(f_i - f_j)^2 = 2 f^{\top} L f$, where $v_1, \ldots, v_n$ are the collection of nodes, $f_i = f(v_i)$, and the weights $w_{ij}$ and matrices $L$ and $D$ are as defined above. So the minimization problem reduces to finding $f$ that minimizes $f^{\top} L f$ subject to the constraint $f^{\top} D f = 1$, and this is equivalent to the generalized eigenvalue problem stated above. This also explains why the Laplacian eigenmap ranks the eigenvalues in increasing order.
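The step from the cost function to the eigenvalue problem can be written out explicitly. The derivation below is a sketch under the standard assumptions that the weights are symmetric, that $D$ is the diagonal matrix with entries $d_{ii} = \sum_j w_{ij}$, and that $L = D - W$; these symbols are assumed here rather than quoted from the text.

```latex
\begin{aligned}
\sum_{i,j} w_{ij}\,(f_i - f_j)^2
  &= \sum_{i,j} w_{ij}\,\bigl(f_i^2 + f_j^2\bigr) - 2\sum_{i,j} w_{ij} f_i f_j \\
  &= 2\sum_{i} d_{ii} f_i^2 - 2 f^{\top} W f \\
  &= 2 f^{\top} (D - W) f \;=\; 2 f^{\top} L f, \\[4pt]
\nabla_f \bigl[\, f^{\top} L f - \lambda\,(f^{\top} D f - 1) \,\bigr] = 0
  &\;\Longrightarrow\; L f = \lambda D f,
  \qquad f^{\top} L f = \lambda .
\end{aligned}
```

At a constrained stationary point the cost equals the eigenvalue itself, so the smallest nonzero eigenvalues give the maps that keep strongly weighted pairs closest together.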

Simulation Study A. Discrete and Admixed Populations

To simulate a discrete population consisting of two subpopulations, we follow a model of population structure used in reference [10] (see also [6]). Each subpopulation is generated by the Balding-Nichols model, but with each subpopulation having its own generalized $F_{ST}$ value (0.01 and 0.05, respectively, for subpopulations 1 and 2), instead of the same value for both subpopulations (see [10] for details). The population allele frequency of each random SNP is sampled uniformly from [interval missing in source]. The allele frequency within each subpopulation is drawn from a beta distribution, [parameters missing in source]. For each individual, 10,000 SNPs were generated. The sample consists of 500 cases and 500 controls, where 60% of cases and 40% of controls were from subpopulation 1 and the rest were sampled from subpopulation 2. For the admixed population with two ancestral populations, the ancestral populations' generalized $F_{ST}$ values were set equal to 0.01 and 0.09 respectively. For the admixed population, 1,000 individuals were sampled, half cases and half controls. The sample's ancestral proportions are assumed uniformly distributed from 0 to 1. For the causal allele, a risk model [6] with relative risk [value missing in source] was used for both the discrete population and the admixed population. The allele frequencies for highly differentiated SNPs are respectively set to 0.2 and 0.8 in the two subpopulations.
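The discrete-population simulation can be sketched with the standard Balding-Nichols parameterization, in which a subpopulation frequency is drawn from a beta distribution with parameters p(1 - F)/F and (1 - p)(1 - F)/F for ancestral frequency p and differentiation F. The ancestral frequency range `(0.1, 0.9)` is an assumption, since the interval is not available in this copy, and the function name, argument names and seed are hypothetical. Case and control labels and the causal SNP are not simulated here.

```python
import random

def simulate_discrete_population(n_per_pop, n_snps, fst, p_range=(0.1, 0.9), seed=0):
    """Simulate diploid genotypes for several discrete subpopulations.

    n_per_pop: number of individuals in each subpopulation.
    fst: one differentiation value per subpopulation (e.g. 0.01 and 0.05).
    p_range: ASSUMED range of the ancestral allele frequency.
    Returns a list of genotype rows (one per individual, entries 0/1/2),
    ordered by subpopulation.
    """
    rng = random.Random(seed)
    genotypes = [[] for _ in range(sum(n_per_pop))]
    for _ in range(n_snps):
        p = rng.uniform(*p_range)                 # ancestral allele frequency
        row = 0
        for size, f in zip(n_per_pop, fst):
            # Balding-Nichols draw of the subpopulation allele frequency.
            q = rng.betavariate(p * (1 - f) / f, (1 - p) * (1 - f) / f)
            for _ in range(size):
                # Genotype = number of copies of the allele in two draws.
                genotypes[row].append((rng.random() < q) + (rng.random() < q))
                row += 1
    return genotypes
```

For example, `simulate_discrete_population((500, 500), 10000, (0.01, 0.05))` would produce a sample of the size described above, although pure Python is slow at that scale.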

Simulation Study B. Spatially Correlated Population

Following reference [3], an equilibrium population is simulated using the software MS for population genetics developed by Hudson [23]. The population consists of 100 subpopulations equally spaced on a circle, with members of an additional isolated subpopulation as outliers. Each subpopulation is assumed to consist of an equal number of diploids. During each generation backward in time, a fraction [value missing in source] of each subpopulation along the circle is made up of migrants from each adjacent subpopulation, and there are no gamete swaps between non-adjacent subpopulations. 1,000 SNP loci were independently simulated with one segregating site per locus, and [number missing in source] individuals were sampled from each subpopulation.
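The migration scheme on the circle can be written down as a matrix, which may help when setting up a comparable simulation. This sketch only builds the backward migration matrix and does not call MS. The function name and the migration fraction `m` are hypothetical, and the value used in the paper is not available in this copy.

```python
def circular_migration_matrix(n_demes, m):
    """Backward migration matrix for demes equally spaced on a circle.

    Entry [i][j] is the fraction of deme i made up, each generation, of
    migrants from deme j. Each deme receives a fraction m from each of
    its two neighbors and nothing from non-adjacent demes.
    """
    if n_demes < 3:
        raise ValueError("a circle needs at least three demes")
    if not 0.0 <= m <= 0.5:
        raise ValueError("m must lie between 0 and 0.5")
    matrix = [[0.0] * n_demes for _ in range(n_demes)]
    for i in range(n_demes):
        matrix[i][i] = 1.0 - 2.0 * m              # residents that stay put
        matrix[i][(i - 1) % n_demes] += m         # migrants from one neighbor
        matrix[i][(i + 1) % n_demes] += m         # migrants from the other
    return matrix
```

The wrap-around indices make the first and last demes adjacent, which is what distinguishes the ring from a linear stepping-stone arrangement.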

URL. Software for running LAPSTRUCT on a Linux platform is available at http://galton.uchicago.edu/~junzhang/LAPSTRUCT.html.

Supporting Information

Text S1

Supporting Text

(0.09 MB PDF)

Figure S1

Here we consider the simulated discrete population consisting of two subpopulations, analyzed with ε = 1.0 in all cases. When the scale parameter t is sufficiently small, the Laplacian matrix L degenerates to the identity matrix I and no structure can be detected. When t = 0.1, the second Laplacian eigenfunction degenerates approximately to zero for one of the subpopulations. For larger t values, there is little difference in the detected structures.

(0.08 MB PDF)

Figure S2

Here we consider the simulated discrete population consisting of two subpopulations, and t = 1.0 in all cases. When ε = 0.96, the graph has two connected components representing two subpopulations and the top two Laplacian eigenfunctions degenerate to 0 and $-1/\sqrt{500} = -0.0447$. When ε≥1.0, the graph is connected. As ε increases, the local correlation structures revealed by the Laplacian eigenmap evolve to global structures which approximate PCs.

(0.11 MB PDF)
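
The value -1/√500 in the Figure S2 description is what one would expect from a unit-norm vector that is constant on one 500-node component and zero on the other. The sketch below checks this on a toy graph. It assumes the unnormalized Laplacian L = D - W with unit edge weights inside each component, which may differ from the normalization used in the paper; all function names are hypothetical.

```python
import math


def two_component_weights(size):
    """Weight matrix of two disjoint complete graphs, each with `size` nodes."""
    n = 2 * size
    return [
        [1.0 if i != j and (i < size) == (j < size) else 0.0 for j in range(n)]
        for i in range(n)
    ]


def laplacian(weights):
    """Unnormalized graph Laplacian L = D - W (assumed form, see lead-in)."""
    n = len(weights)
    L = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] = sum(weights[i][j] for j in range(n) if j != i)
    return L


def matvec(A, v):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


size = 4  # toy component size; the simulated graph would have 500 per component
L = laplacian(two_component_weights(size))

# Unit-norm vector: zero on the first component, constant on the second.
v = [0.0] * size + [-1.0 / math.sqrt(size)] * size
residual = max(abs(x) for x in matvec(L, v))

# Entry of the same kind of vector for a component of 500 nodes.
entry_500 = -1.0 / math.sqrt(500)
```

Here `residual` is zero, so `v` is an eigenvector with eigenvalue 0, and `entry_500` rounds to -0.0447, matching the value quoted for ε = 0.96.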

Acknowledgments

The authors are grateful to Matthew Stephens for numerous discussions and advice to improve the presentation, and to John Novembre for help with the spatial simulations. Thanks also go to Trevor Price and Darren Irwin for generously providing the greenish warbler data set.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: Support from NIH grant R01 HG001645 (to M.S.M.) is gratefully acknowledged. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Cavalli-Sforza L, Edwards AWF. Analysis of human evolution. Genetics Today. 3
2. Menozzi P, Piazza A, Cavalli-Sforza L. Synthetic maps of human gene frequencies in Europeans. Science. 1978;201:786–792.
3. Novembre J, Stephens M. Interpreting principal component analyses of spatial population genetic variation. Nature Genetics. 2008;40:646–649.
4. Reich D, Price A, Patterson N. Principal component analysis of genetic data. Nature Genetics. 2008;40:491–2.
5. Felsenstein J. Contrasts for a within-species comparative method. In: Slatkin M, Veuille M, editors. Modern Developments in Theoretical Population Genetics: the legacy of Gustave Malecot. New York: Oxford University Press; 2002. pp. 118–129.
6. Price AL, Patterson N, Plenge RM, Weinblatt ME, Shadick NA, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38:904–909.
7. Chen H, Zhu X, Zhao H, Zhang S. Qualitative semi-parametric test for genetic associations in case-control designs under structured populations. Ann Hum Genet. 2003;67:250–264.
8. Zhu X, Zhang S, Zhao H, Cooper R. Association mapping, using a mixture model for complex traits. Genet Epidemiol. 2002;23:181–196.
9. Lander E, Schork N. Genetic dissection of complex traits. Science. 1994;265:2037–2048.
10. Marchini J, Cardon LR, Phillips MS, Donnelly P. The effects of human population structure on large genetic association studies. Nat Genet. 2004;36:512–517.
11. Freedman M, et al. Assessing the impact of population stratification on genetic association studies. Nat Genet. 2004;36:388–393.
12. Pritchard J, Rosenberg N. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999;65:220–228.
13. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004.
14. Pritchard J, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959.
15. Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation. 2003;13:1373–1397.
16. Zhang J, Weng C, Niyogi P. Graphical analysis of population structure on rheumatoid arthritis data. BMC Proceedings. 2009; in press.
17. Chung FRK. Spectral Graph Theory. American Mathematical Society; 1997.
18. Rosenberg S. The Laplacian on a Riemannian Manifold. Cambridge University Press; 1997.
19. von Luxburg U. A tutorial on spectral clustering. Stat Comput. 2007;17:395–416.
20. Lee A, Luca D, Klei L, Devlin B, Roeder K. Discovering genetic ancestry using spectral graph theory. Genetic Epidemiology. 2009;33(5).
21. Nelson MR, Bryc K, King KS, Indap A, Boyko AR, et al. The population reference sample, POPRES: a resource for population, disease, and pharmacological genetics research. Am J Hum Genet. 2008;83:347–358.
22. Irwin DE, Bensch S, Irwin JH, Price TD. Speciation by distance in a ring species. Science. 2005;307:414–6.
23. Hudson RR. Generating samples under a Wright-Fisher neutral model. Bioinformatics. 2002;18:337–8.

Articles from PLoS ONE are provided here courtesy of Public Library of Science