NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
Ann Hum Genet. Author manuscript; available in PMC Nov 1, 2010.
Published in final edited form as:
PMCID: PMC2764806

Tests of Association for Quantitative Traits in Nuclear Families Using Principal Components to Correct for Population Stratification


Traditional transmission disequilibrium test (TDT) based methods for genetic association analyses are robust to population stratification at the cost of a substantial loss of power. Here we describe a novel method for family-based association studies that corrects for population stratification using an extension of principal component analysis (PCA). Specifically, we perform PCA on the unrelated parents in each family. We then infer principal components for children from those of their parents through a TDT-like strategy. Two test statistics within the variance-components model are proposed for association tests. Simulation results show that the proposed tests have correct type I error rates regardless of population stratification, and have greatly improved power over two popular TDT-based methods: QTDT and FBAT. An application to the Genetic Analysis Workshop 16 (GAW16) data sets demonstrates the feasibility of the proposed method.

Keywords: Family Based Association Tests (FBATs), Transmission Disequilibrium Test (TDT), Principal Component Analysis (PCA), Variance-Components


Genetic association studies are powerful tools for detecting and evaluating the effects of genetic variants underlying complex traits and diseases (Risch & Merikangas, 1996). With the availability of high-density SNP maps from the HapMap project (Tanaka, 2005) and decreasing genotyping costs, genome-wide association studies (GWAS) have proven successful in finding genes with subtle to modest effects on complex diseases (WellcomeTrust, 2007).

Though most current experimental designs for association studies focus on recruiting unrelated subjects, studies on family samples continue (Benjamin et al., 2007, Cheung et al., 2005, Morley et al., 2004, Laird & Lange, 2006). A well-known example of this latter study design is the Framingham Heart Study (FHS) (Dawber et al., 1951), which recruited over 14,000 family members across three generations and genotyped each of them at approximately 550K SNPs across the genome. In cases where only a subset of family members are genotyped, a more cost-effective strategy is to impute missing genotypes for the other members of the family. The work of Chen and Abecasis (Chen & Abecasis, 2007) provides guidance on various genotyping strategies for obtaining optimal imputation efficiency.

When analyzing familial data for association, the transmission disequilibrium test (TDT) (Spielman et al., 1993) and its various extensions (Abecasis et al., 2000a, Allison, 1997, Fulker et al., 1999, Lake et al., 2000, Martin et al., 1997, Rabinowitz, 1997, Spielman & Ewens, 1998, Rabinowitz & Laird, 2000, Lange & Laird, 2002b, Lange & Laird, 2002a, Lange et al., 2002, Lange et al., 2003, Laird et al., 2000) form the basis of analytical methodology. TDT examines the transmission of alleles from heterozygous parents to offspring and is robust to population stratification of any kind. However, this robustness comes at the cost of a substantial loss of statistical power. The power loss is especially severe in genomic scans, where a signal must reach genome-wide statistical significance to be declared positive. When genotype data across the genome are available, a more cost-effective approach is to use well-tested methods that evaluate population structure and guard against population stratification using genotypes at multiple markers.

A variety of approaches (Devlin & Roeder, 1999, Pritchard et al., 2000b, Pritchard & Rosenberg, 1999, Chen et al., 2003, Zhang et al., 2003, Zhu et al., 2002, Price et al., 2006, Zhu et al., 2008) have been proposed for clustering population substructures and/or for controlling false-positive rates with the use of genotype data on multiple markers. Among these approaches, three are commonly used: genomic control (GC) (Devlin & Roeder, 1999), structured association (SA) (Pritchard et al., 2000a, Pritchard et al., 2000b), and principal component analysis (PCA) based adjustment (Price et al., 2006). GC computes an overall inflation factor on a set of unlinked markers and adjusts association statistics at each marker using this factor. This approach ignores the heterogeneity of ancestral allele distributions among markers and thus leads to a loss of power at some markers. Using a Markov chain Monte Carlo (MCMC) approach, SA infers the number of discrete sub-populations and the probability that an individual belongs to each sub-population. SA is computationally intensive (Pritchard et al., 2000a), especially when a large number of markers are involved, and may produce unreliable association results because the inferred number of sub-populations is uncertain. In the third approach, PCA is used to summarize individual genetic background information from multiple marker genotypes. The resulting principal components are then used to adjust individual genotypes and/or phenotypes to correct for population stratification. This approach has been shown to be both computationally fast and statistically powerful, and is especially suitable for association studies involving a large number of markers.

However, because it operates only on unrelated subjects, PCA cannot be applied directly to familial data sets. To solve this issue, Zhu et al. (Zhu et al., 2008) propose to summarize genetic backgrounds for only the unrelated individuals through a principal coordinate analysis (PCoA) (Gower, 1966, Bauchet et al., 2007) and to evaluate genetic backgrounds for the remaining related individuals from their genotypes and the calculated principal components. PCoA retrieves information equivalent to that from PCA (Gower, 1966). Unlike PCA, which calculates principal components on people (taking subjects as variables), PCoA calculates principal components on markers (taking markers as variables) and produces a variance-covariance matrix whose dimension equals the number of markers. Computing PCoA may therefore demand a large amount of computer memory when large numbers of markers, e.g., GWAS data, are involved. Recent work on fast matrix approximation may help speed up these calculations and reduce memory requirements (Drineas P, 2006, Paschou et al., 2008).
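The equivalence between the subject-based and marker-based decompositions can be illustrated with a small numerical sketch. The example assumes column-centered toy genotypes; the sizes and variable names are illustrative only and are not taken from the paper or from FAMCC. It checks that the two matrices share the same nonzero eigenvalues, while the marker-based matrix is far larger when markers outnumber subjects.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_markers = 6, 40            # toy sizes chosen only for illustration
freqs = rng.uniform(0.1, 0.9, n_markers)
G = rng.binomial(2, freqs, size=(n_subjects, n_markers)).astype(float)
X = G - G.mean(axis=0)                   # center each marker (column)

subject_matrix = X @ X.T                 # N x N: subjects as variables
marker_matrix = X.T @ X                  # M x M: markers as variables

# eigvalsh returns ascending eigenvalues; reverse to descending order
ev_subject = np.sort(np.linalg.eigvalsh(subject_matrix))[::-1]
ev_marker = np.sort(np.linalg.eigvalsh(marker_matrix))[::-1]

# The leading min(N, M) eigenvalues coincide; the rest are (numerically) zero.
shared = min(n_subjects, n_markers)
same_spectrum = np.allclose(ev_subject[:shared], ev_marker[:shared])
print(subject_matrix.shape, marker_matrix.shape, same_spectrum)
```

With GWAS-scale marker counts, the M x M matrix is the one that strains memory, which is the practical concern raised above.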

In this study, we describe a method that tests association for quantitative traits in nuclear families. We correct for population stratification with an extended PCA. Briefly, we perform a PCA on parental marker genotypes to summarize their genetic backgrounds (Price et al., 2006), and use the resulting principal components to adjust parental genotypes and phenotypes. Through a TDT-like strategy, we infer children's principal components from those of their parents. The strategy is also extended to accommodate missing parents. Children's principal components are subsequently used to adjust their genotypes and/or phenotypes. Association between residual genotypes and residual phenotypes is examined within the variance-components framework. We evaluate the statistical properties of the proposed tests via simulation studies, and demonstrate their applicability with an application to the Genetic Analysis Workshop 16 (GAW16) data sets.


Definitions and models

Assume that there are k nuclear families, and that family i (i = 1, …, k) contains ni related individuals, the first two of whom are the parents. Thus, the total number of individuals is $N = \sum_{i=1}^{k} n_i$, and the total number of parents is Nf = 2k. Assume that there are M SNP markers and that genotype data are available for all individuals at all markers. The genotype score gijm for the jth individual in the ith family at the mth SNP is defined as 0, 1, and 2 for genotypes “11”, “12”, and “22”, respectively.

Let yij and xij be the phenotypic value and the vector of covariates, respectively, for the jth (j = 1, …, ni) individual in family i. yij is modeled in the variance-components framework(Abecasis et al., 2000a, Amos, 1994) as (We omit the genotype subscript m for illustration simplicity)


where μ denotes the grand mean, βx the covariate effects, βg the additive major gene effect, αij the additive polygenic effect, and εij the residual effect. In addition, αij and εij are assumed to have zero means, leading to


For family i, Ωi, the ni x ni variance-covariance matrix, has elements


where $\sigma_g^2$, $\sigma_a^2$ and $\sigma_e^2$ are variance components accounting for the major gene effect, the polygenic effect, and the residual effect, respectively. $\pi_{ijl}$ denotes the identity-by-descent (IBD) coefficient between individuals j and l of family i at the tested locus, and $2\varphi_{ijl}$ denotes the expected kinship coefficient between the same two individuals.
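The display equations of this subsection did not survive in this copy. Based only on the symbol definitions given in the text, and on the standard variance-components form it cites, they can plausibly be written as below. This is a reconstruction under those assumptions, and the assignment of the labels (1) and (2) to the mean and the covariance is inferred from later references in the text.

```latex
\begin{align*}
y_{ij} &= \mu + \beta_x x_{ij} + \beta_g g_{ij} + \alpha_{ij} + \varepsilon_{ij}, \\
E(y_{ij}) &= \mu + \beta_x x_{ij} + \beta_g g_{ij}, \tag{1} \\
\Omega_{ijl} &=
\begin{cases}
\sigma_g^2 + \sigma_a^2 + \sigma_e^2, & j = l, \\
\pi_{ijl}\,\sigma_g^2 + 2\varphi_{ijl}\,\sigma_a^2, & j \neq l.
\end{cases} \tag{2}
\end{align*}
```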

Correcting for population stratification

When population stratification exists, the above model may no longer provide a valid test. One way to correct for population stratification is to adjust individual genotypes and/or phenotypes through linear regressions on principal components obtained from multiple marker genotypes, as proposed by Price et al.(Price et al., 2006) and described briefly below. Since a PCA on all family members without taking correlation among family members into account may result in biased estimates, we perform PCA on only unrelated individuals, i.e., the parents in each family, as suggested by Zhu et al.(Zhu et al., 2008). The process of principal component analysis is outlined in Appendix A. The obtained eigenvectors, or principal components, are arranged in a descending order based on the corresponding eigenvalues and are represented by e1,…, eNf. Notice that these eigenvectors represent new orthogonal axes of decreasing variability of genetic background of individuals, and were defined by Price et al.(Price et al., 2006) as the axes of variation. The jth coordinate of the lth eigenvector is interpreted as the value of ancestry of individual j along the lth axis of variation. Through linear regressions on the first L principal components, genotypes and phenotypes are adjusted. We here take L = 10 as suggested by Price et al.(Price et al., 2006). More specifically, we have




where eijl denotes the coordinate value along the lth principal component for the jth individual in family i (i = 1, …, k, j = 1, 2, l = 1, …, L), and δij and τij are residuals. Through classic linear regression theory, the least-squares estimators of β0, β1, …, βL, α0, α1, …, αL, denoted as $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_L, \hat{\alpha}_0, \hat{\alpha}_1, \ldots, \hat{\alpha}_L$, respectively, can be easily obtained. The residuals for each parent are then calculated by




Since children's principal components are unknown, the adjustments on them cannot be estimated through the above linear regression equations. We here propose a TDT-like approach to infer children's principal components and to estimate their adjustments. For a nuclear family trio, denote the genotype scores for the father, the mother, the child C, and a pseudo-child C' as gf, gm, gc, and gc', respectively, where C' possesses the two parental haplotypes not transmitted to C. Thus, gf + gm = gc + gc'. In addition, denote the principal component vectors for the father, the mother and C as ef, em, and ec, respectively. Through linear regression, the adjustment of genotype for C can then be expressed as


Here E(gc) denotes the part of the genotype score of C explained by principal components, that is, the effect of genetic background. Notice that C and C' are a complementary pair from the same two parents, so the effects of genetic background on C and C' are assumed to be the same, implying E(gc) = E(gc'); thus,


Comparing the above two expressions, we find


The principal component vector for each child for phenotypes can be similarly obtained, which is again the mean of parental principal components (Appendix B). Subsequently, the adjustment of genotype scores and phenotype for each child can be evaluated from Equations (3) and (4), where the estimates of regression coefficients from parental principal components are used.
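The regression and derivation steps of this subsection can be written out as follows. The display equations are missing from this copy, so the block below is a reconstruction from the surrounding definitions rather than the authors' typeset equations. The symbols follow the text, with $e_{cl}$, $e_{fl}$ and $e_{ml}$ denoting the lth coordinates of $e_c$, $e_f$ and $e_m$.

```latex
\begin{align*}
% Parental regressions on the first L principal components and their residuals
g_{ij} &= \beta_0 + \sum_{l=1}^{L} \beta_l e_{ijl} + \delta_{ij}, &
y_{ij} &= \alpha_0 + \sum_{l=1}^{L} \alpha_l e_{ijl} + \tau_{ij}, \\
\hat{\delta}_{ij} &= g_{ij} - \hat{\beta}_0 - \sum_{l=1}^{L} \hat{\beta}_l e_{ijl}, &
\hat{\tau}_{ij} &= y_{ij} - \hat{\alpha}_0 - \sum_{l=1}^{L} \hat{\alpha}_l e_{ijl}, \\
% Child C expressed through its own (unknown) principal components
E(g_c) &= \beta_0 + \sum_{l=1}^{L} \beta_l e_{cl}, \\
% Using g_f + g_m = g_c + g_{c'} and E(g_c) = E(g_{c'})
E(g_c) &= \tfrac{1}{2}\bigl[E(g_c) + E(g_{c'})\bigr]
        = \tfrac{1}{2}\bigl[E(g_f) + E(g_m)\bigr]
        = \beta_0 + \sum_{l=1}^{L} \beta_l \,\frac{e_{fl} + e_{ml}}{2}, \\
% Matching the two expressions for E(g_c) term by term
e_c &= \tfrac{1}{2}\,(e_f + e_m).
\end{align*}
```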

An immediate consequence of Equation (5) is that the principal components of all children in a family are equal. This is intuitive, since all children in a family are expected to share the same genetic background. Thus, when both parents are missing, we randomly select one sib to be included in the principal component analysis and the subsequent linear regressions. Principal components for the other sibs are set equal to those of the selected sib. When only one parent is available, we include that parent and one sib in the calculations. Principal components for the remaining sibs are again set equal to those of the selected sib.
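The adjustment procedure described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The PCA step is a simplified stand-in for Appendix A (plain column centering, no variance normalization), L is set to 2 rather than 10, the genotypes are random toy data with both parents available, and all function names are hypothetical.

```python
import numpy as np

def parental_pcs(G_parents, n_pcs):
    """Top eigenvectors of the parent-by-parent covariance matrix (simplified PCA)."""
    X = G_parents - G_parents.mean(axis=0)          # center each marker
    cov = X @ X.T / X.shape[1]                      # Nf x Nf matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_pcs]       # indices of the largest ones
    return eigvecs[:, order]                        # shape (Nf, n_pcs)

def fit_adjustment(values, pcs):
    """Least-squares intercept and slopes of `values` regressed on the PCs."""
    design = np.column_stack([np.ones(len(values)), pcs])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return coef

def residuals(values, pcs, coef):
    """Values minus the part explained by the PCs, given fitted coefficients."""
    design = np.column_stack([np.ones(len(values)), pcs])
    return values - design @ coef

rng = np.random.default_rng(1)
k, M, L = 50, 300, 2                                # families, markers, PCs (toy values)
freqs = rng.uniform(0.1, 0.9, M)
# Rows 2i and 2i+1 hold the father and mother of family i.
G_par = rng.binomial(2, freqs, size=(2 * k, M)).astype(float)
pcs_par = parental_pcs(G_par, L)

# Children's PCs: the mean of the two parents' PCs, one row per family.
pcs_child = (pcs_par[0::2] + pcs_par[1::2]) / 2.0

# Test marker: fit the regression on parents, reuse the coefficients for children.
g_par = rng.binomial(2, 0.3, size=2 * k).astype(float)
coef = fit_adjustment(g_par, pcs_par)
res_par = residuals(g_par, pcs_par, coef)

# One child per family; each parent transmits one allele with Mendelian probability.
g_child = (rng.binomial(1, g_par[0::2] / 2.0) + rng.binomial(1, g_par[1::2] / 2.0)).astype(float)
res_child = residuals(g_child, pcs_child, coef)
```

The key design point is that the regression coefficients are estimated once, on parents only, and then applied to the children through their inferred principal components.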

Tests of association

Tests of association are performed with the residuals of genotypes and of phenotypes by likelihood ratio test within variance-components model. The likelihood for the residual data is given by


where yi denotes the vector of residual phenotypes for the ith family. The statistic is defined as


where L0 is the maximum of L under the constraint βg = 0, and L1 is the maximum of L without constraints. TPCLRT asymptotically follows a chi-square distribution with 1 degree of freedom (df) under the null hypothesis of no association.
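As a small illustration of how such a statistic is turned into a p-value, the sketch below assumes the usual likelihood ratio form, twice the difference between the maximized log-likelihoods (the display equation itself is missing from this copy), and uses the closed-form tail probability of a chi-square distribution with 1 df. The function names and the log-likelihood values are hypothetical.

```python
import math

def lrt_statistic(loglik_null, loglik_alt):
    """Assumed form: T = 2 * (ln L1 - ln L0)."""
    return 2.0 * (loglik_alt - loglik_null)

def chi2_1df_pvalue(stat):
    """Upper tail of chi-square with 1 df: P(Z^2 > t) = erfc(sqrt(t / 2))."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Hypothetical maximized log-likelihoods for one SNP.
t_stat = lrt_statistic(loglik_null=-105.2, loglik_alt=-100.0)
p_value = chi2_1df_pvalue(t_stat)
print(t_stat, p_value)
```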

Computing TPCLRT, however, is time-consuming, since the likelihood function has to be maximized twice for each SNP. To speed up calculations for large-scale scans, we extend the score test proposed by Chen and Abecasis (Chen & Abecasis, 2007). Given the complete set of parameters for the likelihood equation, $\theta = [\mu, \beta_x, \beta_g, \sigma_a^2, \sigma_g^2, \sigma_e^2]$, the score statistic TPCSCR is defined as


where gi is a vector of residual genotype scores for all individuals in the ith family, $\bar{g}$ is the sample mean of the gi, and 1 is a vector of length ni with all elements equal to 1. In addition, $E(y_i)^{(\mathrm{base})}$ and $\Omega_i^{(\mathrm{base})}$ are, respectively, the vector of fitted values and the estimated variance-covariance matrix for each family, obtained by fitting the variance-components model with parameters μ, βx, $\sigma_a^2$ and $\sigma_e^2$ only (i.e., without the parameters βg and $\sigma_g^2$ of the complete parameter set θ). Using the statistic TPCSCR greatly speeds up calculations, since the likelihood function needs to be maximized only once for all markers. Under the null hypothesis of no association, TPCSCR again follows a chi-square distribution with 1 df, regardless of population stratification.
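The computation can be sketched as follows. Because the display equation is missing from this copy, the form used here is an assumption modeled on the variance-components score test the text cites: the squared sum over families of (g_i - ḡ1)' Ω_i⁻¹ (y_i - E(y_i)), divided by the sum of (g_i - ḡ1)' Ω_i⁻¹ (g_i - ḡ1), with the fitted means and covariance matrices taken from the base model. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def score_statistic(g_list, y_list, fitted_list, omega_list):
    """Assumed score statistic built from per-family residual genotypes and phenotypes."""
    g_bar = np.concatenate(g_list).mean()           # overall sample mean of genotype scores
    numerator = 0.0
    denominator = 0.0
    for g, y, fitted, omega in zip(g_list, y_list, fitted_list, omega_list):
        g_centered = g - g_bar
        weighted = np.linalg.solve(omega, g_centered)   # Omega^{-1} (g - g_bar), Omega symmetric
        numerator += weighted @ (y - fitted)
        denominator += weighted @ g_centered
    return numerator ** 2 / denominator

# Toy usage: two trios (father, mother, child), no covariates.
two_phi = np.array([[1.0, 0.0, 0.5],
                    [0.0, 1.0, 0.5],
                    [0.5, 0.5, 1.0]])               # twice the kinship coefficients in a trio
omega_base = 0.5 * two_phi + 0.5 * np.eye(3)        # sigma_a^2 = sigma_e^2 = 0.5 (arbitrary)
g_res = [np.array([-0.3, 0.7, 0.2]), np.array([0.6, -0.4, -0.8])]
y_res = [np.array([0.1, 0.9, 0.4]), np.array([0.5, -0.6, -0.7])]
fitted = [np.zeros(3), np.zeros(3)]
T = score_statistic(g_res, y_res, fitted, [omega_base, omega_base])
```

Since the base-model quantities do not depend on the tested marker, they can be computed once and reused for every SNP, which is the source of the speed-up described above.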

Data simulation

Simulation studies are conducted to explore the statistical properties of the proposed tests. Nuclear families are generated under three population scenarios: homogeneous population, two discrete populations, and one admixed population with two ancestral populations. The number of SNP markers varies from 200 to 2,000. In the first scenario, the allele frequency of each marker is drawn from the Uniform distribution U(0.1, 0.9). Parental marker genotypes are generated based on the corresponding allele frequencies with an assumption of linkage equilibrium between adjacent markers. Two randomly selected parents mate to produce offspring, with the number of offspring in each family following a Poisson distribution with mean 2 (families with no children were omitted). Offspring marker genotypes are generated according to parental genotypes with a uniform recombination rate 0.01 between adjacent markers.
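The family-generation step of the first scenario might look like the sketch below. It is an illustration under assumptions rather than the simulation code used in the study: parental haplotypes are drawn independently per marker (linkage equilibrium), each gamete switches between the two parental haplotypes with probability 0.01 between adjacent markers, and family sizes are redrawn until at least one child is obtained. Function names are hypothetical.

```python
import numpy as np

def draw_family_size(rng, mean=2.0):
    """Poisson number of children, redrawn until the family has at least one child."""
    n = 0
    while n == 0:
        n = int(rng.poisson(mean))
    return n

def make_gamete(hap_a, hap_b, recomb_rate, rng):
    """Copy alleles from one parental haplotype, switching strands at crossovers."""
    n_markers = len(hap_a)
    crossovers = rng.random(n_markers - 1) < recomb_rate
    start = rng.integers(0, 2)                      # haplotype copied at the first marker
    strand = (start + np.concatenate(([0], np.cumsum(crossovers)))) % 2
    return np.where(strand == 0, hap_a, hap_b)

rng = np.random.default_rng(3)
M = 200
freqs = rng.uniform(0.1, 0.9, M)                    # allele frequencies from U(0.1, 0.9)
father = rng.binomial(1, freqs, size=(2, M))        # two haplotypes per parent
mother = rng.binomial(1, freqs, size=(2, M))

n_children = draw_family_size(rng)
children = np.array([
    make_gamete(father[0], father[1], 0.01, rng) + make_gamete(mother[0], mother[1], 0.01, rng)
    for _ in range(n_children)
])                                                  # genotype scores 0, 1 or 2
```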

Two additional SNP markers with minor allele frequency (MAF) 0.3 are simulated as the causal site and the test site, respectively. The additive effect of the causal site explains 2% of the phenotypic variation, and polygenic effects explain an additional 50%. For simplicity, covariate effects are not considered, although they could be incorporated without difficulty. Phenotypes for each family are drawn from multivariate normal distributions with individual means from Equation (1) and variance-covariance matrix from Equation (2).

In the second scenario, equal numbers of nuclear families are sampled from two discrete populations (A and B). Allele frequencies are generated using the Balding-Nichols model (Balding & Nichols, 1995). Specifically, for each SNP, an ancestral allele frequency p is drawn from the Uniform distribution U(0.1, 0.9). The allele frequencies for populations A and B are then drawn from a Beta distribution with parameters p(1 - FST)/FST and (1 - p)(1 - FST)/FST, where FST is a measure of genetic distance between the two populations. We set FST to 0.05 to simulate moderate population stratification. Nuclear families in each population are generated as in scenario 1. MAFs for the causal site are set to pA = 0.2 in population A and pB = 0.4 in population B. Phenotypic means (μA and μB; μA > μB) are selected such that 20% of the phenotypic variation in the combined population is explained by stratification.
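The allele-frequency step of this scenario can be sketched with the standard library alone. The Beta parameters are exactly those given in the text; the function name, seed, and number of SNPs are illustrative assumptions. Under this parameterization a population frequency has mean p and variance FST · p(1 - p).

```python
import random

def balding_nichols_freqs(p, fst, n_pops, rng):
    """Draw one allele frequency per population from Beta(p(1-F)/F, (1-p)(1-F)/F)."""
    a = p * (1.0 - fst) / fst
    b = (1.0 - p) * (1.0 - fst) / fst
    return [rng.betavariate(a, b) for _ in range(n_pops)]

rng = random.Random(42)
FST = 0.05
ancestral = [rng.uniform(0.1, 0.9) for _ in range(1000)]    # ancestral frequencies
pop_freqs = [balding_nichols_freqs(p, FST, 2, rng) for p in ancestral]
freq_A = [f[0] for f in pop_freqs]                           # population A
freq_B = [f[1] for f in pop_freqs]                           # population B
```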

In the last scenario where an admixed population is involved, we adopt a continuous gene flow (CGF) model proposed by Zhu et al.(Zhu et al., 2004). Specifically, two panels of allele frequencies are produced for two ancestral populations A and B as that in scenario 2. At the first generation, 10,000 unrelated individuals are produced from population A. To produce the next generation, a proportion (λ) of randomly selected individuals from population A mate to individuals randomly produced from population B, with the remaining 1–λ mating among themselves. The number of children for each mating is drawn from a Poisson distribution with mean 2. The generated offspring comprise the new population A. We set λ to 0.05 and repeat the process 10 times, resulting in the current admixed population of approximately 60%/40% of ancestry from population A/B. Nuclear families are randomly drawn from the current admixed population. On simulating phenotypes, each individual has a specified population mean μ = αμA + (1-α)μB, where α is the individual's ancestry proportion of population A and (1 - α) is that of population B. The settings for causal and test sites and population means are the same as those in scenario 2.

We study both type I error rates and power for our proposed tests in all three scenarios. Type I error rates are estimated by setting the LD measure r2 between the causal and the test sites to 0.0, while powers are estimated by setting r2 to 1.0. All rates are estimated on 2,000 replicates.

We compare the performance of our proposed tests with two popular family-based methods: the quantitative transmission disequilibrium test (QTDT) proposed by Abecasis et al. (Abecasis et al., 2000a), and the method proposed by Rabinowitz and Laird (Rabinowitz & Laird, 2000) and implemented in the software FBAT (Laird et al., 2000). We also include the method of Zhu et al., implemented in the software FAMCC (Zhu et al., 2008), for comparison. We refer to these five tests as PCLRT, PCSCR, QTDT, FBAT, and FAMCC, respectively, throughout our studies.

HapMap datasets

Our simulations assume that markers are in linkage equilibrium with each other. To study the performance of the proposed method on real data sets, where markers may exhibit strong local LD patterns, we analyze a data set simulated from haplotype data of the HapMap project (Tanaka, 2005). We use only chromosome 1 haplotype data from the European (CEU) and African (YRI) populations, each of which consists of 30 parent-child trios. One hundred and twenty parental haplotypes for each population are downloaded from the HapMap project website (release 2007–08_rel22). The 167,161 SNPs present in both populations are used in subsequent analyses. For each of the 60 marriages, we simulate children, with the number of children following a Poisson distribution with mean 2. Phenotypic simulation is similar to that of scenario 2 described previously. A random SNP (rs1687827) is selected as the causal site, explaining 5% of the total phenotypic variation in both populations.

Genetic Analysis Workshop 16 (GAW16) datasets

As an application, we analyze the Genetic Analysis Workshop 16 (GAW16) Problem 3 simulated data sets with the proposed test. The GAW16 data sets consist of 6,476 subjects from Framingham Heart Study (FHS), where each subject has real genotypes at approximately 550,000 (549,872) SNPs and simulated phenotypes. Subjects are distributed among 3 generations and singletons. After dividing large families into smaller nuclear families and applying some quality controls to the data (for example, as the proposed test cannot analyze half-sibs, we delete one of half-sibs from the data), we finally identify 5,456 family members from a total of 1,815 nuclear families.

Six correlated traits, termed HDL, LDL, TG, CHOL, CAC, and MI, are simulated from the observed genetic variation in order to mimic the lipid pathway underlying the development of cardiovascular disease (Kannel et al., 1961). Phenotype data are simulated at three pseudo-visits 10 years apart to mimic a longitudinal study, and 200 replicate data sets are simulated for each visit. We focus on the trait HDL, which is influenced by five major genes, each contributing 0.3 to 1.0% of the phenotypic variation. The data set from the first replicate of the first visit is analyzed, as suggested by the workshop. The phenotype is adjusted for age and sex.


Performances of children's principal components

Children's top two principal components (right) when analyzing 1,000 SNPs are compared with those of parents in Figures 1A and 1B for simulation scenario 2 (Stratified) and scenario 3 (Admixture), respectively. Children's principal components show patterns similar to those of parents in both scenarios. In scenario 2, where two discrete populations are combined, the first principal component alone correctly groups both children and parents into two separate sub-populations. In scenario 3, where an admixed population from two ancestral populations is involved, the first principal component is highly correlated with individual ancestral proportions (Figure 1C). Note that the variation of children's principal components is slightly smaller than that of parents. This is not surprising, since children's principal components are calculated as the mean of those of their parents and thus have a smaller variance.
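The variance reduction noted above follows directly from the averaging step. As a worked illustration, assume for simplicity that the two parental coordinates on a given axis have a common variance $\sigma^2$ and correlation $\rho$ (these two symbols are introduced here and do not appear in the paper):

```latex
\begin{align*}
\operatorname{Var}(e_c)
  &= \operatorname{Var}\!\left(\frac{e_f + e_m}{2}\right)
   = \frac{1}{4}\bigl[\operatorname{Var}(e_f) + \operatorname{Var}(e_m)
      + 2\operatorname{Cov}(e_f, e_m)\bigr] \\
  &= \frac{\sigma^2 (1 + \rho)}{2} \;\le\; \sigma^2 .
\end{align*}
```

When mates come from the same sub-population, $\rho$ is close to 1 along the axis that separates the populations, so the reduction is small, which is consistent with the slight shrinkage described above.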

Figure 1
The performances of Children's Principal Components

Figure 1D displays the performance of PCA when applied to HapMap project data, where there are strong levels of LD among nearby markers. The results show a similar pattern with simulation scenario 2: the first principal component clearly distinguishes the two populations in both the parental and offspring generations. This principal component also explains most of genetic variation (>99%), suggesting that LD has limited influence on PCA and that the proposed method is applicable to real GWAS data. We also observe that the second principal component has a limited variation, which may be because each of the two populations is not homogeneous.

Type I error rates

Type I error rates of the various tests are investigated under a variety of settings, including different population models, different family structures, and different numbers of genotyped SNPs. The performances of QTDT and FBAT are not affected by the number of SNPs, so we report only one value for each of these two tests regardless of the number of SNPs considered. Table 1 lists the error rates when information for both parents is available. For all methods, the error rates are close to the nominal levels in all scenarios. For PCLRT and PCSCR, 1,000 SNPs seem to be sufficient to safeguard against population stratification, regardless of whether samples are stratified or admixed. When one or both parents in a family are not used, as shown in Tables 2 and 3, the error rates remain close to the nominal significance levels. These results indicate that our proposed tests are robust to population stratification.

Table 1
Type I Error Rates of Various Tests When Both Parents Are Used
Table 2
Type I Error Rates of Various Tests When One Parent Is Unused
Table 3
Type I Error Rates of Various Tests When Both Parents Are Unused

Power estimates

Powers of various tests are presented in Table 4 for various population models and family structures. For all conditions, PCLRT, our proposed likelihood ratio test and PCSCR, our proposed score test, have similar powers. When other conditions are fixed, both tests have higher powers in homogeneous populations than in stratified populations, and have lowest powers in admixed populations. Among the methods, both PCLRT and PCSCR have the highest power, followed by FAMCC, QTDT, and FBAT. This shows that our proposed method is more powerful than that of Zhu et al. (Zhu et al., 2008), and that traditional TDT-based FBATs are overly conservative. The advantage of using more practical approaches than TDT for controlling population stratification is obvious.

Table 4
Power Estimates of Various Methods

We also study the effects of two additional factors that may influence power: the level of LD and the number of children per family. Figure 2 shows that power increases with increasing values of r2 under the stratified population setting. Among the methods, the powers of PCLRT and PCSCR are again similar to each other and higher than those of the other methods. FAMCC has higher power than QTDT, and FBAT has the lowest power. Figure 3A shows the effect of the number of children per family on power when the total number of individuals is fixed at 800. Unlike QTDT and FBAT, whose power increases as the number of children per family increases, the powers of PCLRT, PCSCR and FAMCC decrease slightly with increasing numbers of children. Figure 3B presents power estimates for different numbers of children per family when the number of families is fixed at 200. All tests gain power as the number of children increases. However, the relative gain in power for PCLRT, PCSCR and FAMCC is smaller than that for either QTDT or FBAT. Again, the two proposed tests (PCLRT and PCSCR) have higher power than the other three methods, further indicating the advantage of the proposed method.

Figure 2
Power Estimation under Various Levels of LD
Figure 3
Power Estimation for Different Numbers of Children per Family


We first apply the proposed score test to the data sets simulated from the HapMap project. The computation takes less than 2 hours on a desktop computer with a single 2.8GHz Intel Xeon CPU. Figure 4A displays the raw p-values of this scan. The causal SNP (rs1687827) has the lowest p-value (2.70e-9), which remains significant after Bonferroni adjustment for multiple testing across the approximately 160K SNPs. FAMCC also ranks the causal SNP as the most significant, with a p-value of 4.63e-5. In contrast, QTDT and FBAT yield p-values of 0.012 and 0.010, respectively, at this SNP, neither of which is significant after accounting for multiple testing.
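The Bonferroni comparison can be checked with a few lines of arithmetic. The sketch assumes a family-wise error rate of 0.05, which is not stated in the text, and uses the 167,161 shared SNPs reported for the HapMap data set; the p-values are those quoted above.

```python
N_SNPS = 167_161      # SNPs present in both CEU and YRI, as reported in the text
ALPHA = 0.05          # family-wise error rate; an assumption, not stated in the text
threshold = ALPHA / N_SNPS

p_values = {"PCSCR": 2.70e-9, "FAMCC": 4.63e-5, "QTDT": 0.012, "FBAT": 0.010}
significant = {name: p < threshold for name, p in p_values.items()}

for name, p in p_values.items():
    print(f"{name}: p = {p:.2e}, passes Bonferroni threshold: {significant[name]}")
```

Under these assumptions only the proposed score test falls below the corrected threshold at the causal SNP.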

Figure 4
Applications of the proposed method to the HapMap project simulated datasets and to the GAW16 simulated datasets

We then apply the proposed score test to the GAW16 data sets described previously. The scan of 550K SNPs takes less than 20 hours on the same computer as above. Figure 4B presents the quantile-quantile (QQ) plot (left) and log-QQ plot (right), and Figure 4C displays raw p-values over the 22 autosomes. Overall, the p-values are approximately uniformly distributed between 0 and 1. The most significant SNP identified by the proposed method, rs10820738, is exactly the one that explains the most phenotypic variation (1.0%) in the simulation. The corresponding p-value of 2.21e-17 is significant at the genome-wide level. QTDT also detects this SNP as significant at the genome-wide level, with a p-value of 4.49e-10. FBAT instead yields a p-value of 6.57e-4, which is not significant at the genome-wide level. FAMCC, which does not allow missing parental information, could not analyze this data set. We also present the p-values for the other four major gene SNPs in Table 5. Notably, the proposed test has the lowest p-value among the methods at every one of these SNPs, though only one additional p-value from the proposed test reaches genome-wide significance. From the log-QQ plot, we observe that deviation from the theoretical values occurs when p-values are lower than approximately 1.0e-3. Notice that in the GAW16 simulation, 1,000 SNPs are set to be causal. Taking LD between causal SNPs and their nearby markers into account, we believe that a substantial proportion of the observed deviations may represent true positive signals.

Table 5
P Values at the Major Genes for the Various Tests When Analyzing GAW16 Simulated HDL Trait


In this study, we describe a new method for testing association for quantitative traits in nuclear families. Using large-scale marker genotypes, we adjust both genotype scores and phenotypes of parents in each family through principal components analysis. For children in each family, we adjust their genotype scores and phenotypes according to their parental adjustments through a TDT-like strategy. Association tests are performed with residual genotypes and residual phenotypes with the use of variance components model. Simulation results show that the proposed tests have greatly improved powers over two popular family-based methods: QTDT and FBAT, while correcting for population stratification.

The improved power of our proposed tests over the TDT-based tests primarily results from their different approaches for dealing with population stratification. In TDT based methods, the genotype score for a particular child at each marker is partitioned into a between-family component and a within-family component, where the former is sensitive to population stratification and the latter is only sensitive to linkage disequilibrium. The test then solely relies on within-family components to provide a safe-guard against population stratification. This partitioning approach considers each family as a sub-population. Specifically, within-family scores for all parents are always set to zero and do not contribute to test statistics. For our proposed tests, the genotype score for each individual is partitioned into a linear regression constituent, explained by principal components, and a residual part. The partitioning approach in our proposed tests implicitly increases power by using the information from both parents and children to contribute to the test statistics and by explicitly quantifying the effects of population substructures for all individuals.

In our analysis of the GAW16 simulated data sets, we split large pedigrees into small nuclear families to facilitate the analysis. Although the proposed method is currently developed only for data with nuclear families, its extension to extended pedigrees is theoretically possible. The extension with regard to correcting for population stratification is straightforward. For example, all founders can be collected to form an unrelated sample, on which the PCA is performed and the genotype and phenotype of each founder are adjusted accordingly. Then, for non-founder offspring, principal components are calculated from those of their parents through the TDT-like strategy we propose. The process is carried out recursively until all non-founders are adjusted. Some specialized algorithms, e.g., the one described in (Abecasis et al., 2000b), can be adopted in a relatively simple manner. However, the performance of the proposed method when applied to extended pedigrees is unknown; evaluating it is not trivial and deserves further work.

Applying PCA to the study of population structure has been a subject of research for several years (Bauchet et al., 2007, Price et al., 2006, Zhang et al., 2003). Since PCA requires independent samples, it cannot be applied directly to familial data because of genotypic correlations between family members. Here we perform PCA only on unrelated parents, and infer children's principal components from their parents' principal components. This approach is reasonable when both parents in each family are available, since parental genotype data already contain all the sample information reflecting the underlying genetic background of the family. Under this circumstance, the children's genotypes are merely a linear combination of the genotypes of their parents. When one or both parents are not available, however, we propose to randomly select one child for the PCA calculation, and infer the other children's principal components accordingly. This may introduce some estimation bias, since the selected child's genotype cannot fully represent the other children's genotypes. However, our simulation results do not show obvious deviations of type I error rates from target levels. This is partly because when the number of genotyped markers is large, each child in a family can, on average, adequately represent the underlying ancestral populations and the other children's genetic backgrounds.

Because PCA in the proposed method is computed across individuals, the covariance matrix becomes singular when the number of individuals exceeds the number of markers, e.g., 400 individuals versus 200 markers. Note that the eigenvalue decomposition procedure can proceed even with a singular matrix, so principal components can still be calculated. The simulation results show that the proposed method remains valid even when only 200 markers are involved, demonstrating that its performance is not compromised by the singular matrix. The explanation is that only the top L eigenvectors are used for analyses; eigenvectors with smaller eigenvalues (some exactly equal to 0) resulting from a singular matrix are not used, and thus have little influence on the performance of the proposed method.
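
The singular case can be checked numerically with a small sketch. The simulated genotypes, the random seed, the tolerance, and the scaling A = X'X / M are all assumptions made for illustration; only the 400-by-200 dimensions and the use of the top L eigenvectors come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, L = 200, 400, 10                      # markers, individuals, components kept

# Simulated genotypes (0/1/2) with marker-specific allele frequencies.
freq = rng.uniform(0.1, 0.9, size=(M, 1))
G = rng.binomial(2, freq, size=(M, N)).astype(float)

# Row-wise normalization as in Appendix A.
p = (1.0 + G.sum(axis=1, keepdims=True)) / (2.0 + 2.0 * N)
X = (G - G.mean(axis=1, keepdims=True)) / np.sqrt(p * (1.0 - p))

A = X.T @ X / M                             # N x N matrix with rank at most M < N

# eigh handles symmetric matrices, singular or not, and returns
# eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(A)
tol = 1e-8 * eigvals.max()
n_nonzero = int((eigvals > tol).sum())      # cannot exceed M
top = eigvecs[:, np.argsort(eigvals)[::-1][:L]]
print(n_nonzero, top.shape)
```

At least N − M eigenvalues are numerically zero here, yet the L leading eigenvectors are well defined, which is the behaviour the argument above relies on.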

We also studied the effects of family structure on the efficiency of the various methods through simulation. When the size of the genotyped sample is fixed, the proposed method favors smaller nuclear families, which provide more information on allelic distributions. In contrast, both QTDT and FBAT perform better with larger families, where the number of informative offspring increases. When the number of families is fixed, the power of all the methods increases as the number of genotyped offspring increases. However, investigators in real applications are often restricted by limited genotyping budgets. They then face a trade-off between the gain in power from genotyping more offspring and the genotyping cost. In such circumstances, an alternative strategy is to impute missing genotypes for a subset of family members instead of genotyping them. Various algorithms can impute missing genotypes in nuclear families with very high accuracy (Chen & Abecasis, 2007, Elston & Stewart, 1971, Lander & Green, 1987), and various schemes have been discussed for situations where genotyping resources in nuclear families are limited (Chen & Abecasis, 2007). For example, genotyping one parent and one child is most powerful if only two members are to be genotyped, and genotyping just three pre-selected individuals can recover > 90% of the information from a fully genotyped nuclear family. Combined with these genotype imputation methods and genotyping schemes, our proposed method will be helpful to investigators who plan to genotype a subset of family members, since incorporating individuals with available phenotypes and imputed genotypes will substantially improve the power of the proposed tests.

In summary, we propose a method for tests of association for quantitative traits in nuclear families that corrects for population stratification by an extended principal component analysis. Our proposed method gives investigators an opportunity to perform a more powerful GWAS than is possible with TDT-based methods. The Java program implementing our method is freely available upon request from the authors.


We acknowledge the Genetic Analysis Workshop 16 (GAW16) and the Framingham Heart Study (FHS) for permission to use the data. The study was partially supported by Xi'an Jiaotong University. The investigators also benefited from grants from the National Science Foundation of China, NIH (R01 AR050496, R21 AG027110, R01 AG026564, P50 AR055081 and R21 AA015973), Huo Ying Dong Education Foundation, Hunan Province, and the Ministry of Education of China. Computing support was partially provided by the High Performance Computing Cluster Center at Xi'an Jiaotong University.

Appendix A. The calculation of principal components from genotype data

Most of the material presented here is from the paper of Price et al. (Price et al., 2006). Readers are referred to that paper for more details.

For a sample of N unrelated individuals and M markers, let $g_{ij}$ be the genotype for SNP i and individual j, where i = 1 to M and j = 1 to N, and let $G_{M \times N}$ be the resulting genotype matrix. Each row i in G is normalized by subtracting the row mean $\mu_i = \left(\sum_j g_{ij}\right)/N$ from each entry and then dividing by the standard deviation $\sigma_i = \sqrt{p_i(1-p_i)}$, where $p_i = \left(1+\sum_j g_{ij}\right)/(2+2N)$. The resulting matrix is denoted as $X_{M \times N}$. We compute an $N \times N$ covariance matrix A from X. We then perform the eigenvalue decomposition procedure on A, to obtain,

$$A = V D V'$$

where $D_{N \times N}$ is the diagonal eigenvalue matrix, $V_{N \times N}$ is the orthogonal eigenvector matrix and $V'$ is the transpose of V. Each column of V is an eigenvector corresponding to the eigenvalue in D. The top L (e.g., 10) eigenvectors corresponding to the L largest eigenvalues are used for adjusting population stratification.
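
The calculation in this appendix can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the function name `top_principal_components` is hypothetical, and the covariance matrix is taken as A = X'X / M, since the text only states that A is computed from X.

```python
import numpy as np

def top_principal_components(G, L=10):
    """Compute the leading eigenvalues and eigenvectors from a genotype matrix.

    G : M x N array of genotypes coded 0/1/2 (rows = SNPs, columns = unrelated individuals)
    L : number of leading eigenvectors to keep
    Returns (eigenvalues, eigenvectors); eigenvectors has shape N x L,
    ordered from the largest eigenvalue downwards.
    """
    G = np.asarray(G, dtype=float)
    M, N = G.shape
    mu = G.mean(axis=1, keepdims=True)                          # row mean mu_i
    p = (1.0 + G.sum(axis=1, keepdims=True)) / (2.0 + 2.0 * N)  # smoothed frequency p_i
    X = (G - mu) / np.sqrt(p * (1.0 - p))                       # normalized matrix X
    A = X.T @ X / M                                             # N x N matrix (assumed scaling)
    eigvals, eigvecs = np.linalg.eigh(A)                        # A = V D V', ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:L]                       # indices of the L largest
    return eigvals[order], eigvecs[:, order]
```

Each row of the returned eigenvector array holds the L component scores of one individual, which are the quantities used as covariates when adjusting for stratification.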

Appendix B. Inference of children's principal components through phenotypic adjustments

For a nuclear family trio, denote the phenotype means for the father, the mother and the child as $\mu_f$, $\mu_m$ and $\mu_c$, respectively. According to the polygenic model, the child's phenotype mean should be the average of those of its parents, since the child inherits half of each parent's genes, that is

$$\mu_c = \frac{\mu_f + \mu_m}{2} \tag{A1}$$
Adjustment of phenotypes is regarded as shifting all members' phenotype means to a unified value, termed $\mu'$, so that $\Delta_f = \mu_f - \mu'$, $\Delta_m = \mu_m - \mu'$, and $\Delta_c = \mu_c - \mu'$, respectively. Together with (A1), it is easy to see that

$$\Delta_c = \frac{\Delta_f + \Delta_m}{2} \tag{A2}$$
Denote the principal component vectors for the father, the mother and the child as $e_f$, $e_m$, and $e_c$, respectively. From the linear regression equation we have




Substituting (A2) into the above expressions, it can be deduced that

$$e_c = \frac{e_f + e_m}{2}$$
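
One way to write out the intermediate step is sketched here. It assumes that the phenotype adjustment is a linear regression on the principal components with a coefficient vector shared by all family members; the symbol $\beta$ is introduced purely for illustration and is not taken from the original text. The second line uses the relation $\Delta_c = (\Delta_f + \Delta_m)/2$ implied by the averaging argument.

```latex
\begin{aligned}
\Delta_f &= \beta^{\top} e_f, \qquad \Delta_m = \beta^{\top} e_m, \qquad \Delta_c = \beta^{\top} e_c \\
\beta^{\top} e_c &= \tfrac{1}{2}\left(\beta^{\top} e_f + \beta^{\top} e_m\right)
                  = \beta^{\top}\,\tfrac{1}{2}\left(e_f + e_m\right) \\
e_c &= \tfrac{1}{2}\left(e_f + e_m\right)
\end{aligned}
```

The last line follows if the equality in the second line is required to hold whatever the value of $\beta$, so that the child's component vector is the mid-parent average.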

  • Abecasis GR, Cardon LR, Cookson WO. A general test of association for quantitative traits in nuclear families. Am J Hum Genet. 2000a;66:279–92. [PMC free article] [PubMed]
  • Abecasis GR, Cookson WO, Cardon LR. Pedigree tests of transmission disequilibrium. Eur J Hum Genet. 2000b;8:545–51. [PubMed]
  • Allison DB. Transmission-disequilibrium tests for quantitative traits. Am J Hum Genet. 1997;60:676–90. [PMC free article] [PubMed]
  • Amos CI. Robust variance-components approach for assessing genetic linkage in pedigrees. Am J Hum Genet. 1994;54:535–43. [PMC free article] [PubMed]
  • Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:3–12. [PubMed]
  • Bauchet M, Mcevoy B, Pearson LN, Quillen EE, Sarkisian T, Hovhannesyan K, Deka R, Bradley DG, Shriver MD. Measuring European population stratification with microarray genotype data. Am J Hum Genet. 2007;80:948–56. [PMC free article] [PubMed]
  • Benjamin EJ, Dupuis J, Larson MG, Lunetta KL, Booth SL, Govindaraju DR, Kathiresan S, Keaney JF Jr, Keyes MJ, Lin JP, Meigs JB, Robins SJ, Rong J, Schnabel R, Vita JA, Wang TJ, Wilson PW, Wolf PA, Vasan RS. Genome-wide association with select biomarker traits in the Framingham Heart Study. BMC Med Genet. 2007;8(Suppl 1):S11. [PMC free article] [PubMed]
  • Chen HS, Zhu X, Zhao H, Zhang S. Qualitative semi-parametric test for genetic associations in case-control designs under structured populations. Ann Hum Genet. 2003;67:250–64. [PubMed]
  • Chen WM, Abecasis GR. Family-based association tests for genomewide association scans. Am J Hum Genet. 2007;81:913–26. [PMC free article] [PubMed]
  • Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT. Mapping determinants of human gene expression by regional and genome-wide association. Nature. 2005;437:1365–9. [PMC free article] [PubMed]
  • Dawber TR, Meadors GF, Moore FE Jr. Epidemiological approaches to heart disease: the Framingham Study. Am J Public Health Nations Health. 1951;41:279–81. [PMC free article] [PubMed]
  • Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. [PubMed]
  • Drineas P, Kannan R, Mahoney MW. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing. 2006;36:184–206.
  • Elston RC, Stewart J. A general model for the genetic analysis of pedigree data. Hum Hered. 1971;21:523–42. [PubMed]
  • Fulker DW, Cherny SS, Sham PC, Hewitt JK. Combined linkage and association sib-pair analysis for quantitative traits. Am J Hum Genet. 1999;64:259–67. [PMC free article] [PubMed]
  • Gower JC. Some Distance Properties of Latent Root and Vector Methods Used in Multivariate Analysis. Biometrika. 1966;53:325–338.
  • Kannel WB, Dawber TR, Kagan A, Revotskie N, Stokes J 3rd. Factors of risk in the development of coronary heart disease--six year follow-up experience. The Framingham Study. Ann Intern Med. 1961;55:33–50. [PubMed]
  • Laird NM, Horvath S, Xu X. Implementing a unified approach to family-based tests of association. Genet Epidemiol. 2000;19(Suppl 1):S36–42. [PubMed]
  • Laird NM, Lange C. Family-based designs in the age of large-scale gene-association studies. Nat Rev Genet. 2006;7:385–94. [PubMed]
  • Lake SL, Blacker D, Laird NM. Family-based tests of association in the presence of linkage. Am J Hum Genet. 2000;67:1515–25. [PMC free article] [PubMed]
  • Lander ES, Green P. Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci U S A. 1987;84:2363–7. [PMC free article] [PubMed]
  • Lange C, Demeo D, Silverman EK, Weiss ST, Laird NM. Using the noninformative families in family-based association tests: a powerful new testing strategy. Am J Hum Genet. 2003;73:801–11. [PMC free article] [PubMed]
  • Lange C, Demeo DL, Laird NM. Power and design considerations for a general class of family-based association tests: quantitative traits. Am J Hum Genet. 2002;71:1330–41. [PMC free article] [PubMed]
  • Lange C, Laird NM. On a general class of conditional tests for family-based association studies in genetics: the asymptotic distribution, the conditional power, and optimality considerations. Genet Epidemiol. 2002a;23:165–80. [PubMed]
  • Lange C, Laird NM. Power calculations for a general class of family-based association tests: dichotomous traits. Am J Hum Genet. 2002b;71:575–84. [PMC free article] [PubMed]
  • Martin ER, Kaplan NL, Weir BS. Tests for linkage and association in nuclear families. Am J Hum Genet. 1997;61:439–48. [PMC free article] [PubMed]
  • Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG. Genetic analysis of genome-wide variation in human gene expression. Nature. 2004;430:743–7. [PMC free article] [PubMed]
  • Paschou P, Drineas P, Lewis J, Nievergelt CM, Nickerson DA, Smith JD, Ridker PM, Chasman DI, Krauss RM, Ziv E. Tracing sub-structure in the European American population with PCA-informative markers. PLoS Genet. 2008;4:e1000114. [PMC free article] [PubMed]
  • Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–9. [PubMed]
  • Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999;65:220–8. [PMC free article] [PubMed]
  • Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000a;155:945–59. [PMC free article] [PubMed]
  • Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association mapping in structured populations. Am J Hum Genet. 2000b;67:170–81. [PMC free article] [PubMed]
  • Rabinowitz D. A transmission disequilibrium test for quantitative trait loci. Hum Hered. 1997;47:342–50. [PubMed]
  • Rabinowitz D, Laird N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered. 2000;50:211–23. [PubMed]
  • Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–7. [PubMed]
  • Spielman RS, Ewens WJ. A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am J Hum Genet. 1998;62:450–8. [PMC free article] [PubMed]
  • Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52:506–16. [PMC free article] [PubMed]
  • Tanaka T. International HapMap project. Nippon Rinsho. 2005;63(Suppl 12):29–34. [PubMed]
  • Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–78. [PMC free article] [PubMed]
  • Zhang S, Zhu X, Zhao H. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet Epidemiol. 2003;24:44–56. [PubMed]
  • Zhu X, Cooper RS, Elston RC. Linkage analysis of a complex disease through use of admixed populations. Am J Hum Genet. 2004;74:1136–53. [PMC free article] [PubMed]
  • Zhu X, Li S, Cooper RS, Elston RC. A unified association analysis approach for family and unrelated samples correcting for stratification. Am J Hum Genet. 2008;82:352–65. [PMC free article] [PubMed]
  • Zhu X, Zhang S, Zhao H, Cooper RS. Association mapping, using a mixture model for complex traits. Genet Epidemiol. 2002;23:181–96. [PubMed]