Logo of plosgenPLoS GeneticsSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)View this Article
PLoS Genet. Oct 2009; 5(10): e1000686.
Published online Oct 16, 2009. doi:  10.1371/journal.pgen.1000686
PMCID: PMC2757795

A Genealogical Interpretation of Principal Components Analysis

Molly Przeworski, Editor

Abstract

Principal components analysis, PCA, is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to inform about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's fst and show that SNP ascertainment has a largely simple and predictable effect on the projection of samples. Using examples from human genetics, I discuss the application of these results to empirical data and the implications for inference.

Author Summary

Genetic variation in natural populations typically demonstrates structure arising from diverse processes including geographical isolation, founder events, migration, and admixture. One technique commonly used to uncover such structure is principal components analysis, which identifies the primary axes of variation in data and projects the samples onto these axes in a graphically appealing and intuitive manner. However, as the method is non-parametric, it can be hard to relate PCA to underlying process. Here, I show that the underlying genealogical history of the samples can be related directly to the PC projection. The result is useful because it is straightforward to predict the effects of different demographic processes on the sample genealogy. However, the result also reveals the limitations of PCA, in that multiple processes can give the same projections, it is strongly influenced by uneven sampling, and it discards important information in the spatial structure of genetic variation along chromosomes.

Introduction

The distribution of genetic variation across geographical location and ethnic background provides a rich source of information about the historical demographic events and processes experienced by a species. However, while colonization, isolation, migration and admixture all lead to a structuring of genetic variation, in which groups of individuals show greater or lesser relatedness to other groups, making inferences about the nature and timing of such processes is notoriously difficult. There are three key problems. First, there are many different processes that one might want to consider as explanations for patterns of structure in empirical data and efficient inference, even under simple models can be difficult. Second, different processes can lead to similar patterns of structure. For example, equilibrium models of restricted migration can give similar patterns of differentiation to non-equilibrium models of population splitting events (at least in terms of some data summaries such as Wright's An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e001.jpg). Third, any species is likely to have experienced many different demographic events and processes in its history and their superposition leads to complex patterns of genetic variability. Consequently, while there is a long history of estimating parameters of demographic models from patterns of genetic variation, such models are often highly simplistic and restricted to a subset of possible explanations.

An alternative approach to directly fitting models is to use dimension-reduction and data summary techniques to identify key components of the structure within the data in a model-free manner. Perhaps the most widely used technique, and the most important from a historical perspective, is principal components analysis (PCA). Technical descriptions of PCA can be found elsewhere, however, its key feature is that it can be used to project samples onto a series of orthogonal axes, each of which is made up of a linear combination of allelic or genotypic values across SNPs or other types of variant. These axes are chosen such that the projection of samples along the first axis (or first principal component) explains the greatest possible variance in the data among all possible axes. Likewise, projection of samples onto the second axis maximizes the variance for all possibles axes perpendicular to the first and so on for the subsequent components. Typically, the positions of samples along the first two or three axes are presented, although methods for obtaining the statistical significance of any given axis have been developed [1]. Beyond being non-parametric, PCA has many attractive properties including computational speed, the ability to identify structure caused by diverse processes and its ability to group or separate samples in a striking visual manner; for example, see [2]. PCA has also become widespread in the analysis of disease-association studies where the inclusion of the locations of samples on a limited number of axes as covariates can be used in an attempt to control for population stratification [3].

Although PCA is explicitly a non-parametric data summary, it is nevertheless attractive to attempt to use the projections to make inferences about underlying events and processes. For example, dispersion of sample projections along a line is thought to be diagnostic of the samples being admixed between the two populations at the ends of the line, though these need not always be present [1], while correlations between principal components and geographical axes have been interpreted as evidence for waves of migration [4],[5]. However, while simulation studies have shown that such patterns do occur when the inferred process has acted [1],[6], they can also be caused by other processes or even statistical artefacts. For example, clines in principal components result not just from waves of expansion, but also recurrent bottlenecks, admixture and equilibrium models of spatial structure [6][11].

In this paper I develop a framework for understanding how PCA relates to underlying processes and events. I show that the expected location of samples on the principal components can, for single nucleotide polymorphism (SNP) data, be predicted directly from the pairwise coalescence times between samples. Because it is often relatively easy to obtain analytical or numerical solutions to expected coalescence times under explicit population genetics models, it is also possible to obtain expressions for the PCA projections of samples under diverse scenarios, including island models, models with isolation and founder events and historical admixture. The result also highlights some key limitations of PCA. For example, it follows that PCA cannot be used to distinguish between models that lead to the same mean coalescence times (for example models with migration or isolation). Furthermore, PCA projections are strongly influenced by uneven sampling. Using examples from human genetics I discuss the implications of these results for making inferences from PCA of genetic variation data.

Results

PCA describes structure in the matrix of pairwise coalescence times

In this section I provide a brief summary of how PCA is carried out and describe the key result concerning the relationship between PCA and average coalescence time. In what follows I assume that An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e002.jpg haploid individuals have been sequenced with complete accuracy (diploid samples and the influence of SNP ascertainment will be discussed later). The only polymorphisms present are biallelic SNPs that are the result of a single historical mutation. Let An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e003.jpg be the allelic state for individual i at locus s (here I assume that the ancestral allele is defined as 0 and the derived allele as 1, however the following also applies for any coding, for example where the minor allele is coded as 1). After removing monomorphic sites the data, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e004.jpg, consist of an An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e005.jpg binary matrix (An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e006.jpg is the number of SNPs). In PCA, the first step is to zero-centre the data, so as to create a new matrix, X, where

equation image
(1)

At this stage, the data rows are often normalized so as to have equal variance, however, it is assumed that this is not the case (in practice normalization has little effect for SNP data, though will tend to up-weight the influence of rare variants). Each individual sample can be thought of as representing a point in L-dimensional space, where each dimension (or axis) represents a single SNP. The goal of PCA is to find a new set of orthogonal axes (the principal components), each of which is made up from a linear combination of the original axes, such that the projection of the original data onto these new axes leads to an efficient summary of the structure of the data. More formally, PCA defines a stretch and rotation transformation, expressed through the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e008.jpg, such that application of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e009.jpg to the original data (An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e010.jpg) leads to transformed data with the following properties.

  1. The transformed data matrix, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e011.jpg, has the same dimensions (An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e012.jpg) as the original data and the mean of each row is zero.
  2. The value associated with a given individual in An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e013.jpg, the ith row of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e014.jpg, represents the individual's position or projection on the ith principal component.
  3. The correlation between any two rows of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e015.jpg is zero.
  4. The sum of the variances of the rows equals the variance in the original data.
  5. The variances of the rows are monotonically decreasing.
  6. The variance of the first row is the largest of any possible projection of the original data on a linear combination of the SNPs.

The principal components can be obtained directly by finding the eigenvectors of the covariance matrix

equation image
(2)

such that the ith principal component (the ith row of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e017.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e018.jpg) is the ith eigenvector of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e019.jpg. However, because An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e020.jpg (of dimension An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e021.jpg) can be very large for genome-wide SNP data sets, it can be more convenient to use singular value decomposition (SVD) to find the principal components and individual projections. SVD, which exists for any An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e022.jpg real matrix (where An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e023.jpg) rewrites the original data in terms of three other matrices

equation image
(3)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e025.jpg is an orthogonal matrix (i.e. the dot-product between any two columns is zero) of dimension An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e026.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e027.jpg is a diagonal matrix of dimension An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e028.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e029.jpg is another orthogonal matrix of dimension An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e030.jpg. This is achieved by setting An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e031.jpg, the ith column of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e032.jpg, to be the ith eigenvector of the matrix

equation image
(4)

An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e034.jpg, the ith diagonal entry of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e035.jpg to be the square root of the corresponding eigenvalue and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e036.jpg, the ith column of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e037.jpg, to be the vector

equation image
(5)

PCA and SVD are, through construction, intimately related. Specifically, the projection of samples along the ith principal component is given by An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e039.jpg (note this is the ith row of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e040.jpg and the ith column of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e041.jpg) and the ith principal component is An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e042.jpg. For typical population genetics data sets, eigenvalue analysis of the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e043.jpg (of dimension An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e044.jpg) is computationally simpler than analysis of the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e045.jpg (typically hundreds or thousands of samples have been genotyped at hundreds of thousands or millions of SNPs). The above construction results in the projection of samples on the PCs being influenced by the number of SNPs (e.g. repeating the analysis on a data set in which every SNP is included twice will lead to projections that are a factor An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e046.jpg larger than previously). To correct for this, consider a slightly different definition of the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e047.jpg:

equation image
(6)

which is equivalent to dividing the data matrix by the square-root of the number of SNPs. It is worth noting that An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e049.jpg may either be a random variable as in the case of sequencing, or a fixed variable, as in the case of genotyping. Here, it will be treated as a fixed variable, though in practice this is of little importance.

An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e050.jpg is a stochastic matrix. However, it is possible to learn about the key structural features of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e051.jpg by considering its expectation. From above, it follows that

equation image
(7a)
equation image
(7b)
equation image
(7c)

Assuming that sites are identical in distribution (though not necessarily independent) the subscript An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e055.jpg can be dropped to give

equation image
(8)

where the terms such as An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e057.jpg indicate the expectation (for sample An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e058.jpg) is averaged over all individuals An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e059.jpg in the sample (note this includes self); i.e. An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e060.jpg. Because An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e061.jpg is either 0 or 1, the four terms in Equation 8 can be thought of as:

  1. The probability that samples An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e062.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e063.jpg both carry a derived mutation at a randomly chosen locus conditional on the locus being polymorphic in the sample.
  2. The probability that sample An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e064.jpg and another randomly chosen sample An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e065.jpg (which may include either An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e066.jpg or An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e067.jpg) both carry the derived mutation at a randomly chosen locus.
  3. The probability that sample An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e068.jpg and another randomly chosen sample An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e069.jpg (which may include either An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e070.jpg or An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e071.jpg) both carry the derived mutation at a randomly chosen locus.
  4. The probability that two samples, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e072.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e073.jpg, chosen at random with replacement both carry the derived mutation at a randomly chosen locus.

In the case of a low mutation rate, where polymorphic sites are the result of a single historical mutation, expressions can be obtained for the above quantities in terms of features of the genealogical tree [12][14]. Figure 1 shows how the probability of two samples both carrying a mutation depends on their time to a common ancestor relative to the time to the common ancestor of the whole sample. Let An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e074.jpg be the expected coalescence time for samples An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e075.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e076.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e077.jpg be the expected time to the most recent common ancestor of the sample, and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e078.jpg be the expected total branch length in the tree. The probability that two samples share a derived mutation (conditional on the site being segregating) is given by

equation image
(9a)
equation image
(9b)
equation image
(9c)

By writing similar expressions for the other terms in Equation 8 it follows that

equation image
(10)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e083.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e084.jpg. Note that these expressions include coalescence with self where the coalescence time is always zero; i.e. An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e085.jpg. In short, the expectation of the matrix whose eigenvectors give the projections of samples on the principal components can be written in terms of the mean coalescence times for pairs of samples. It is worth noting that An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e086.jpg (and the related quantities) can be interpreted either as the expected coalescence time under some model or else the average realized coalescent time across the genome. The difference between these quantities can be important in some settings, such as admixture models (see below).

Figure 1
Genealogical statistics.

For diploid individuals the genotypic value for an individual at a given SNP is typically given by the sum of the allelic values; i.e. An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e090.jpg, where the superscripts indicate the two alleles. By following the same argument as above it can be shown that for genotype data

equation image
(11)

where the superscripts again indicate the relevant allele in each individual. In the following I will assume that data consist of haplotypes, however Equation 11 makes it clear that essentially identical results will hold for genotype data.

An example using two geographically separated populations

The implication of Equation 10 is that if the structure of pairwise coalescence times in a given data set can be understood, then the projection of the samples on the principal components can be predicted directly. Two illustrate this idea consider the simple model of a population split shown in Figure 2A. Under this model the expected coalescence time for pairs of samples within either population is 1 (in units of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e092.jpg generations) and the expected coalescence time for pairs of samples from different populations is An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e093.jpg, where An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e094.jpg is the age of the population split (also in units of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e095.jpg generations). Suppose of the total sample size An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e096.jpg, a fraction An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e097.jpg are from population A. Define An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e098.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e099.jpg, it follows that for large An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e100.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e101.jpg has a simple block structure;

equation image
(12)

where the first An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e103.jpg rows and columns represent the samples from population A (here, for example, three samples from A and two from B are shown). What will the leading eigenvalue and associated eigenvector be for a matrix with this kind of block structure? Although it is simple to obtain eigenvectors numerically, it is also worth having some intuition about what they represent. Through the construction of SVD it follows that the leading eigenvector, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e104.jpg and eigenvector, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e105.jpg, are those that, through Equation 3, provide the best approximation to the original data in terms of least-squares error. Equivalently, the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e106.jpg is the best least-squares approximation to An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e107.jpg. Intuitively, the original data is well approximated by the average allele frequency in each population and the the block structure of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e108.jpg can be recovered by clustering samples from the two populations either side of the origin in An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e109.jpg. More formally, it can be shown that

equation image
(13a)
equation image
(13b)

Assuming that An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e112.jpg , the projection of the samples on the first principal component is given by the vector

equation image
(14)

Note that the sign of the projections is arbitrary. This result implies that the Euclidean distance between samples from the two populations on the first principal component will be An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e114.jpg and their position relative to the origin is determined by the relative sample size, with the larger sample lying closer to the origin. Figure 2B shows the expected projection of samples.

Figure 2
Principal component analysis of two populations.

These results refer explicitly to the expected value of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e128.jpg. However, it is also important to know whether stochasticity resulting from the finite size of the genome has a significant effect on the results. Theoretical work on the nature and size of the first principal component in random matrices [15],[16] has identified a critical signal to noise ratio below which the true structure of the signal cannot be recovered. In the context of a two-population model this equates to An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e129.jpg being greater than An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e130.jpg [1]. For example, with a sample size of 100 and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e131.jpg, the threshold is 100 SNPs. Simulations were carried out for different numbers of independent SNPs (Figure 2C). As expected, for 10 or 100 SNPs PCA fails to separate samples from the two populations, while for 1,000 SNPs or more samples from the two populations are distinct on the first PC and centre around the theoretical expectation.

PCA cannot distinguish between alternative models that have the same effect on mean coalescence time

A direct consequence of Equation 10 is that PCA predominantly reflects structure in the expected (or mean realized coalescent) time. Consequently, any two demographic models that give the same structure of expected coalescence times will also give the same projections. To illustrate this result, consider a fully general model with two homogeneous populations where the expected coalescence time for two samples from population A is An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e132.jpg, the expected coalescence time for two samples from population B is An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e133.jpg and the expected coalescence time for one sample from each population is An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e134.jpg. Define An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e135.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e136.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e137.jpg. It can be shown that

equation image
(15)

Again, only three samples from population A and two from population B are shown. For large An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e139.jpg, the leading eigenvalue and corresponding eigenvector of the above matrix are respectively

equation image
(16a)
equation image
(16b)

Consequently, the projection of the samples on the first principal component is given by the vector

equation image
(17)

Comparison of Equations 14 and 17 shows that the Euclidean distance between samples from the two populations on the first PC is a function of the difference between cross-population and within-population coalescence times and that the positioning of the populations relative to the origin simply reflects their relative sample size (as for the simpler two-population model). Consequently, any two models that give the same value of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e143.jpg will give the same expected projections of samples on the first PC.

One connection that is worth exploring further is the link between the results shown here and those of Slatkin [12] concerning An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e144.jpg. Slatkin showed that

equation image
(18)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e146.jpg is the average coalescence time for pairs of samples from the same population and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e147.jpg is the average coalescence time across all pairs of samples. In the notation used above it can be shown that

equation image
(19)

Now consider the PCA projection. The variance along the first axis is An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e149.jpg. The total variance in the sample is An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e150.jpg. Consequently, the fraction of the total variance explained by the first PC is equal to An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e151.jpg. Given that An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e152.jpg is defined as the fraction of the total variance that is explained by between-population differences this result is not surprising. Nevertheless, the result demonstrates a simple relationship between the Euclidean distance of populations in PCA space and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e153.jpg, at least in the case of two populations.

Uneven sampling has a strong influence on PCA projections

As has been shown previously [11], PCA projections can be strongly influenced by uneven sampling from a series of populations. The results described here provide an explanation. First, from Equation 10 it can be seen that the the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e154.jpg is influenced by the relative sample size from each population through the components An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e155.jpg. For instance, even if all populations are equally divergent from each other, those for which there are fewer samples will have larger values of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e156.jpg because relatively more pairwise comparisons are between populations. Second, even if the entries of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e157.jpg were not influenced by the relative sample size, its eigenvectors will be, simply because relative sample size will influence the structure of the genetic variance in the sample (see Figure 2). The influence of uneven sample size can be to bias the projection of samples on the first few PCs in unexpected ways, for example, where there is spatial structure to genetic variation. Consider a lattice arrangement of populations with equal migration between neighbouring populations. For this arrangement it is possible to obtain analytical expressions for the expected coalescence time for pairs of samples from the different populations (results not shown) and hence the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e158.jpg (up to an unknown scaling factor) and subsequently the projection of samples on the first few PCs under different assumptions about sample size and migration rate. If sample sizes from the different populations are equal, the spatial arrangement of the populations on the first two PCs mimics the structure of the migration matrix (Figure 3A). However, sample sizes differ between populations the effect is to distort the projection space (Figure 3B and 3C). This distortion of PC-space relative to the structure of the migration matrix is problematic for interpreting the location of samples on PCs. Sub-sampling from populations to achieve more equal representation, as in [2], is the only way to avoid this problem.

Figure 3
The effect of uneven sampling on PCA projection.

The projection of admixed individuals onto existing axes directly identifies admixture proportions

The principal components identified through PCA can be used to project not just those samples from which the PCs were obtained, but also additional samples. The appeal of such analyses is that it enables the analysis of structural features identified in one data set to be transferred to another. For example, where data from two source populations and a set of possibly admixed samples are available, projection of the admixed samples onto the axes defined by the source populations can identify the extent of mixed ancestry. The advantage of this approach rather than simply performing PCA on all samples together is that other structural features within the admixed samples (e.g. admixture from a third population or relatedness) will have little influence on the projection. In the light of the above results showing how the PCA projection of samples can be interpreted in terms of coalescence times, it is interesting to ask how the the projection of additional samples onto the same axes also relates to coalescence times.

Consider the case of the general two-population model where the positions of the samples on the first PC are An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e162.jpg for samples from population A and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e163.jpg for samples from population B. The first PC can be obtained as in Equation 5. For a given SNP, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e164.jpg, the expected loading for the first PC, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e165.jpg, is therefore

equation image
(20a)
equation image
(20b)
equation image
(20c)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e169.jpg is the number of samples carrying the derived allele in population A. By writing An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e170.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e171.jpg, such that An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e172.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e173.jpg are the frequencies of the derived alleles in populations A and B respectively, it follows that

equation image
(21)

The expected location of an additional sample, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e175.jpg, on the first PC is therefore

equation image
(22a)
equation image
(22b)
equation image
(22c)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e179.jpg (note this does not include the additional sample An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e180.jpg). Again, the subscript An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e181.jpg has been dropped by assuming that sites are identical in distribution. By noting that An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e182.jpg, where the expectation is over those samples from population A, it follows that similar arguments to those above can be made to relate the quantities in Equation 22 to coalescent times. Define An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e183.jpg as the average coalescent time between the additional sample and all samples from population A and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e184.jpg to be the equivalent for population B, it can be shown that

equation image
(23)

An important implication of Equation 23 is that if the additional sample is the result of an admixture event between the two populations with a fraction An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e186.jpg of its genome coming from population A then it follows that the location of the sample on the first PC is

equation image
(24)

In words, the admixture proportion of the individual can be directly inferred from their relative position along the first PC from the two source populations.

There are three important points to note when applying this result. First, only if the admixture event was very recent are the source populations likely to be available. Rather samples may be available for descendants of these source populations. Consequently, the average divergence between the population A part of an individuals genome and other samples from population A might typically be greater than for two samples taken directly from population A. However, this effect is likely to be very similar for the two source populations and, given Equation 23, these effects largely cancel out.

The second point to note is that if samples are admixed between more than two populations, the result generalises so that an individual whose genome is derived from several source populations will have a projected position (along each significant PC) defined by the weighted sum of the positions of its source populations. Informally, the result arises because of the linearity in Equation 22. Those parts of the genome with ancestry from a given population will have a PC projection that matches samples taken directly from the source population. If there is mixed ancestry, the effect is simply to average the PC projections.

Finally, it is important to note that projection of non-admixed individuals can also lead to their location being intermediate between the two original populations. For example, samples from a third population that either diverged from population A since the split with population B or that come from a population that diverged before the A/B split will (in both cases) be projected between the locations of samples from populations A and B. It may, however, be possible to distinguish between such cases by carrying out PCA on all data combined.

PCA carried out on admixed individuals can identify relative admixture proportion in the absence of source populations

As has already been shown through simulation [1], PCA carried out on samples that are the result of admixture events can identify admixed samples as lying along the axes between the two or more source populations, even if one or more of the source populations are absent. The results above shed some light onto when such analyses are expected to work and when they will fail.

Consider a sample of individuals who are the result of an historical admixture event between two populations A and B. In order to define the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e188.jpg for this sample it is necessary to know which part of their genome is derived from each of the source populations. Let An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e189.jpg be a series of indicator functions for each of the An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e190.jpg SNPs in individual An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e191.jpg that takes value 1 if that part of the individual's genome was derived from population A and 0 if it was derived from population B. The value An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e192.jpg can be obtained by comparing the value of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e193.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e194.jpg at each position and adding up the relevant contribution from each of An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e195.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e196.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e197.jpg. Note that here the achieved ancestry proportions are being used rather than their expectation under some model (which might be the same for all samples).

Given these considerations there are two situations under which none of the structure between the two source populations is expected to be reflected in the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e198.jpg. First, all individuals could have the same vector An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e199.jpg, which could occur if the admixture event were ancient and involved relatively few individuals such that the source population at every point in the genome were fixed (note this does not mean that there is no variation, simply that all individuals at this location have an ancestry from the same population). Second, individuals have different ancestry vectors, but the average value is the same for all individuals and the admixture chunks have been sufficiently broken up through historical recombination such that everyone is equally related to everyone else. Again, this scenario could occur if the admixture result were ancient. Note that all individuals having the same average ancestry proportions is, by itself, not sufficient to create this problem. To examine the rate at which admixture signal is lost, an admixed population was simulated forward in time and the projections of samples on the first PC were followed, along with the correlation between PC projection and individual ancestry. As shown in Figure 4, in which the population is chosen to have parameters comparable to humans, the initially strong correlation between ancestry proportion and location on the first PC is rapidly lost such that after only 15 generations there is essentially no signal remaining, even though locally within the genome admixture chunks are still very clear (i.e. there is still admixture LD) after 50 generations.

Figure 4
Identification of admixture proportions without source populations.

Discussion

The primary result of this paper is that the locations of samples on the principal components identified from genome-wide data on genetic variation can be predicted from an understanding of the average coalescent time for pairs of samples. This gives a direct route to understanding the influence various demographic scenarios can have on the relationships between samples identified from PCA and how PCA can be used to make inference about processes of interest such as admixture. However, the results also demonstrate the way in which sampling schemes can influence PC projections and how similar projections can arise from very different demographic scenarios. Consequently, using these results to motivate inference from PCA about underlying demographic process may prove difficult.

There are, however, situations in which PCA can be used to infer demographic parameters directly. For example, in cases of simple two- or three-way admixture, where populations close to the source populations can be identified and sampled from, estimation of admixture proportions can be achieved from projecting samples onto the PCs identified from the source populations. To illustrate this, Figure 5 shows the inferred ancestry proportions for a set of haplotypes (estimated from trio data) in 20 African Americans collected as part of the HapMap3 project. In this analysis, haplotypes (also inferred from trios) from the European ancestry population in Utah (CEU) and the Yoruba in Nigeria (YRI) are used to represent the source populations (note, as discussed above, the requirement is not that these are the source populations, simply that they are closely related to the source populations). By analysing each chromosome separately it can be shown that while each individual's average ancestry proportion across the genome is fairly constant (typically 70–90% African), there is considerable variation at the level of individual chromosomes, with some chromosomes appearing essentially European (for some individuals) and others essentially African (no chromosome shows an overall tendency to come from one population). Such information could be informative about processes such as the level of assortative mating and the rate of ongoing admixture.

Figure 5
Admixture proportions inferred from PCA projections.

One important issue in the application of these ideas to the analysis of empirical data is the extent to which SNP ascertainment will influence outcome. SNP discovery in a small panel will typically lead to the under-representation of rare SNPs in the genotyped data and, depending on the geographical distribution of the samples used for discovery, can also lead to biases in the representation of variation from different areas. The quantities in Equation 8 are therefore conditional not just on segregation in the genotyped sample, but also on segregation within the SNP discovery panel. Consider the joint genealogy of the genotyped and discovery samples shown in Figure 6A. The probability that a pair of samples, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e200.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e201.jpg share a derived mutation (in the genotyped samples) that also lies on the subtree of the discovery samples, An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e202.jpg is

equation image
(25)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e204.jpg is the first time at which the common ancestor of the samples An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e205.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e206.jpg is also a common ancestor of at least one of the discovery panel samples (An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e207.jpg), An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e208.jpg is the time to the more recent of the discovery or sample MRCAs and An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e209.jpg is the total time of the intersection between the discovery and genotyped samples' genealogies (Figure 6A). It follows that the equivalent expression for Equation 10 with SNP ascertainment will typically be larger than without SNP ascertainment because An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e210.jpg whereas the differences in the numerators will largely cancel each other out. Consequently, it is expected that, except for very strongly biased SNP discovery (e.g. a sample of two from one of a series of very divergent populations), that PCA projections from genotype data will be similar to PCA projections from resequencing data, but will typically be larger in magnitude (if the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e211.jpg is normalized by the number of SNPs) by a factor An external file that holds a picture, illustration, etc.
Object name is pgen.1000686.e212.jpg; a result confirmed by simulation (Figure 6B and 6C). For the example shown, this result holds even under the most extreme ascertainment scheme of two discovery samples from a single population. In short, SNP ascertainment will tend to have a simple and predictable effect on PC projections that has little influence on the relative placing of samples.

Figure 6
The effect of SNP ascertainment on PCA projection.

Finally, it is worth pointing out that because PCA effectively summarizes structure in the matrix of average pairwise coalescent times, but in a manner that is influenced by sample composition, more direct inferences can potentially be made from the matrix of pairwise differences (which are trivially related to pairwise coalescent times). This is not to say that eigenvalue analysis of the pairwise distance matrix will correct for the effects of biased sampling demonstrated in Figure 3. However, while readily-available alternatives to PCA, such as multidimensional scaling, seem to have properties similar to PCA, it is possible to envisage non-parametric methods for analysing the matrix of pairwise differences that identify structure without being influenced by sample size.

Methods

Coalescent simulations were carried out using scripts written by the author in the R language (www.r-project.org) and available on request. Principal component analysis of simulated data was carried out using the R function eigen. Phased haplotypes from the International HapMap Project (HapMap3 release 2) were used in the analysis of the CEU, YRI and ASW population (see ftp://ftp.hapmap.org/hapmap/phasing/2009-02_phaseIII/HapMap3_r2/).

Acknowledgments

Many thanks to Niall Cardin, Peter Donnelly, Stephen Leslie, Simon Myers, John Novembre, Nick Patterson, and Molly Przeworski for discussion and comments on the manuscript.

Footnotes

The author has declared that no competing interests exist.

The work was supported by the Leverhulme Trust (www.leverhulme.ac.uk) and the HFSP (www.hfsp.org; grant no. RGP0054/2006). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. doi: 10.1371/journal.pgen.0020190. [PMC free article] [PubMed]
2. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, et al. Genes mirror geography within Europe. Nature. 2008;456:98–101. [PMC free article] [PubMed]
3. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. [PubMed]
4. Cavalli-Sforza LL, Menozzi P, Piazza A. The History and Geography of Human Genes. New Jersey: Princeton; 1994.
5. Reich D, Price AL, Patterson N. Principal component analysis of genetic data. Nat Genet. 2008;40:491–492. [PubMed]
6. Klopfstein S, Currat M, Excoffier L. The fate of mutations surfing on the wave of a range expansion. Mol Biol Evol. 2006;23:482–490. [PubMed]
7. Barbujani G, Sokal RR, Oden NL. Indo-European origins: a computer-simulation test of five hypotheses. Am J Phys Anthropol. 1995;96:109–132. [PubMed]
8. Fix AG. Gene frequency clines produced by kin-structured founder effects. Hum Biol. 1997;69:663–673. [PubMed]
9. Chikhi L, Nichols RA, Barbujani G, Beaumont MA. Y genetic data support the Neolithic demic diffusion model. Proc Natl Acad Sci USA. 2002;99:11008–11013. [PMC free article] [PubMed]
10. Currat M, Excoffier L. The effect of the Neolithic expansion on European molecular diversity. Proc Biol Sci. 2005;272:679–688. [PMC free article] [PubMed]
11. Novembre J, Stephens M. Interpreting principal component analyses of spatial population genetic variation. Nat Genet. 2008;40:646–649. [PMC free article] [PubMed]
12. Slatkin M. Inbreeding coefficients and coalescence times. Genet Res. 1991;58:167–175. [PubMed]
13. Wilkinson-Herbots HM. Genealogy and subpopulation differentiation under various models of population structure. J Math Biol. 1998;37:535–585.
14. McVean GA. A genealogical interpretation of linkage disequilibrium. Genetics. 2002;162:987–991. [PMC free article] [PubMed]
15. Baik J, Ben Arous G, Péché S. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann Probability. 2005;33:1643–1697.
16. Debashis P. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica. 2007;17:1617–1642.
17. Schaffner S, Foo C, Gabriel S, Reich D, Daly MJ, et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15:1576–1583. [PMC free article] [PubMed]
18. The International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;27:1299–1320. [PMC free article] [PubMed]

Articles from PLoS Genetics are provided here courtesy of Public Library of Science
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

  • PubMed
    PubMed
    PubMed citations for these articles

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...