# GCTA: A Tool for Genome-wide Complex Trait Analysis

^{1}Queensland Statistical Genetics Laboratory, Queensland Institute of Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia

^{2}Department of Food and Agricultural Systems, University of Melbourne, Parkville, Victoria 3010, Australia

^{3}Biosciences Research Division, Department of Primary Industries, Bundoora, Victoria 3086, Australia

## Abstract

For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called *g*enome-wide *c*omplex *t*rait *a*nalysis (GCTA), which implements a method we recently developed to address the “missing heritability” problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP with the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.

## Main Text

Despite the great success of genome-wide association studies (GWAS), which have identified hundreds of SNPs conferring the genetic variation of human complex diseases and traits,^{1} the genetic architecture of human complex traits still remains largely unexplained. For most traits, the associated SNPs from GWAS only explain a small fraction of the heritability.^{2,3} There has not been any consensus on the explanation of the “missing heritability.” Possible explanations include a large number of common variants with small effects, rare variants with large effects, and DNA structural variation.^{2,4} We recently proposed a method of estimating the total amount of phenotypic variance captured by all SNPs on the current generation of commercial genotyping arrays and estimated that ∼45% of the phenotypic variance for human height can be explained by all common SNPs.^{5} Thus, most of the heritability for height is hiding rather than missing because of many SNPs with small effects.^{5,6} In contrast to single-SNP association analysis, the basic concept behind our method is to fit the effects of all the SNPs as random effects by a mixed linear model (MLM),

$$\mathbf{y}=\mathbf{X}\mathbf{\beta}+\mathbf{W}\mathbf{u}+\mathbf{\varepsilon}\quad\text{with}\quad\mathrm{var}\left(\mathbf{y}\right)=\mathbf{V}=\mathbf{W}{\mathbf{W}}^{\prime}{\sigma}_{\text{u}}^{2}+\mathbf{I}{\sigma}_{\varepsilon}^{2}\qquad(1)$$

where **y** is an *n* × 1 vector of phenotypes with *n* being the sample size, **β** is a vector of fixed effects such as sex, age, and/or one or more eigenvectors from principal component analysis (PCA), **u** is a vector of SNP effects with $\mathbf{u}\sim N\left(0,\mathbf{I}{\sigma}_{\text{u}}^{2}\right)$, **I** is an *n* × *n* identity matrix, and **ɛ** is a vector of residual effects with $\mathbf{\varepsilon}\sim N\left(0,\mathbf{I}{\sigma}_{\varepsilon}^{2}\right)$. **W** is a standardized genotype matrix with the *ij*^{th} element ${w}_{ij}=\left({x}_{ij}-2{p}_{i}\right)/\sqrt{2{p}_{i}\left(1-{p}_{i}\right)}$, where *x*_{ij} is the number of copies of the reference allele for the *i*^{th} SNP of the *j*^{th} individual and *p*_{i} is the frequency of the reference allele. If we define $\mathbf{A}=\mathbf{W}{\mathbf{W}}^{\prime}/N$ and define ${\sigma}_{\text{g}}^{2}$ as the variance explained by all the SNPs, i.e., ${\sigma}_{\text{g}}^{2}=N{\sigma}_{\text{u}}^{2}$, with *N* being the number of SNPs, then Equation 1 will be equivalent to:^{7–9}

$$\mathbf{y}=\mathbf{X}\mathbf{\beta}+\mathbf{g}+\mathbf{\varepsilon}\quad\text{with}\quad\mathbf{V}=\mathbf{A}{\sigma}_{\text{g}}^{2}+\mathbf{I}{\sigma}_{\varepsilon}^{2}\qquad(2)$$

where **g** is an *n* × 1 vector of the total genetic effects of the individuals with $\mathbf{g}\sim N\left(0,\mathbf{A}{\sigma}_{\text{g}}^{2}\right)$, and **A** is interpreted as the genetic relationship matrix (GRM) between individuals. We can therefore estimate ${\sigma}_{\text{g}}^{2}$ by the restricted maximum likelihood (REML) approach,^{10} relying on the GRM estimated from all the SNPs. Here we report a versatile tool called *g*enome-wide *c*omplex *t*rait *a*nalysis (GCTA), which implements the method of estimating variance explained by all SNPs, and extend the method to partition the genetic variance onto each of the chromosomes and also to estimate the variance explained by the X chromosome and test for dosage compensation in females. We developed GCTA in five function domains: data management, estimation of the GRM from a set of SNPs, estimation of the variance explained by all the SNPs on a single chromosome or the whole genome, estimation of linkage disequilibrium (LD) structure, and simulation.

### Estimation of the Genetic Relationship from Genome-wide SNPs

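One of the core functions of GCTA is to estimate the genetic relationships between individuals from the SNPs. From the definition above, the genetic relationship between individuals *j* and *k* can be estimated by the following equation:

$${A}_{jk}=\frac{1}{N}\sum_{i=1}^{N}\frac{\left({x}_{ij}-2{p}_{i}\right)\left({x}_{ik}-2{p}_{i}\right)}{2{p}_{i}\left(1-{p}_{i}\right)}\qquad(3)$$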

We provide a function to iteratively exclude one individual of a pair whose relationship is greater than a specified cutoff value, e.g., 0.025, while retaining the maximum number of individuals in the data. For data collected from family or twin studies, we recommend that users estimate the genetic relationships with all of the autosomal SNPs and then use this option to exclude close relatives. The reason for exclusion is that the objective of the analysis is to estimate genetic variation captured by all the SNPs, just as GWAS does for single SNPs. Including close relatives, such as parent-offspring pairs and siblings, would result in the estimate of genetic variance being driven by the phenotypic correlations for these pairs (just as in pedigree analysis), and this estimate could be a biased estimate of total genetic variance, for example because of common environmental effects. Even if the estimate is not biased, its interpretation is different from the estimate from “unrelated” individuals: a pedigree-based estimator captures the contribution from all causal variants (across the entire allele frequency spectrum), whereas our method captures the contribution from causal variants that are in LD with the genotyped SNPs.
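
The estimator and the relatedness filter can be sketched in a few lines of Python. This is a minimal illustration, not GCTA's implementation: the function names `grm` and `prune_related` are hypothetical, genotypes are assumed to be complete and polymorphic, individuals are placed in rows, allele frequencies are taken from the sample, and the greedy removal order (drop the individual involved in the most above-cutoff pairs) is an assumption about how "retaining the maximum number of individuals" might be approximated.

```python
import numpy as np

def grm(x):
    """GRM from an n x N matrix of reference-allele counts (0, 1 or 2).

    Rows are individuals and columns are SNPs (an assumed layout).
    """
    x = np.asarray(x, dtype=float)
    p = x.mean(axis=0) / 2.0                          # sample allele frequencies
    w = (x - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))  # standardized genotypes
    return w @ w.T / x.shape[1]                       # A = W W' / N

def prune_related(a, cutoff=0.025):
    """Greedily drop individuals until no off-diagonal entry exceeds cutoff.

    Returns the indices of the individuals that are kept.
    """
    a = np.asarray(a, dtype=float)
    keep = list(range(a.shape[0]))
    while True:
        over = a[np.ix_(keep, keep)] > cutoff
        np.fill_diagonal(over, False)
        counts = over.sum(axis=1)
        if counts.size == 0 or counts.max() == 0:
            return keep
        keep.pop(int(counts.argmax()))                # most-connected individual
```

Because the genotypes are centered with sample frequencies in this sketch, every row of the resulting matrix sums to zero, which is a quick sanity check on the standardization.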

As a by-product, we provide a function in GCTA to calculate the eigenvectors of the GRM, which are asymptotically equivalent to those from the PCA implemented in EIGENSTRAT^{11} because the GRM (*A*_{jk}) defined in GCTA is approximately half of the covariance matrix (*Ψ*_{jk}) used in EIGENSTRAT. The only purpose of developing this function is to calculate eigenvectors and then include them in the model as covariates to capture variance due to population structure. More sophisticated analyses of the population structure can be found in programs such as EIGENSTRAT^{11} and STRUCTURE.^{12}
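
For illustration, the leading eigenvectors of a symmetric GRM can be obtained with a standard symmetric eigensolver. The helper name `grm_eigenvectors` and the choice of NumPy's `eigh` are assumptions made for this sketch; the text does not say which routine GCTA uses.

```python
import numpy as np

def grm_eigenvectors(a, k=2):
    """Return the k largest eigenvalues of a symmetric GRM and their eigenvectors.

    The eigenvectors (columns) can be added to the fixed-effect design
    matrix as covariates for population structure.
    """
    vals, vecs = np.linalg.eigh(np.asarray(a, dtype=float))  # ascending order
    order = np.argsort(vals)[::-1][:k]                       # largest first
    return vals[order], vecs[:, order]
```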

### Estimation of the Variance Explained by Genome-wide SNPs by REML

The GRM estimated from the SNPs can be fitted subsequently in an MLM to estimate the variance explained by these SNPs via the REML method.^{10} Previously, we included only one genetic factor in the model. Here we extend the model in a general form as

$$\mathbf{y}=\mathbf{X}\mathbf{\beta}+{\sum}_{i=1}^{r}{\mathbf{g}}_{i}+\mathbf{\varepsilon}\quad\text{with}\quad\mathbf{V}={\sum}_{i=1}^{r}{\mathbf{A}}_{i}{\sigma}_{i}^{2}+\mathbf{I}{\sigma}_{\varepsilon}^{2}$$

where ${\mathbf{g}}_{i}$ is a vector of random genetic effects, which could be the total genetic effects for the whole genome or for a single chromosome, and *r* is the number of genetic factors. In this model, the phenotypic variance (${\sigma}_{\text{P}}^{2}$) is partitioned into the variance explained by each of the genetic factors and the residual variance,

$${\sigma}_{\text{P}}^{2}={\sum}_{i=1}^{r}{\sigma}_{i}^{2}+{\sigma}_{\varepsilon}^{2}$$

where ${\sigma}_{i}^{2}$ is the variance of the *i*^{th} genetic factor with its corresponding GRM, **A**_{i}.

In GCTA, we provide flexible options to specify different genetic models. For example:

(1) To estimate the variance explained by all autosomal SNPs, we can specify the model as **y** = **Xβ** + **g** + **ɛ** with $\mathbf{V}={\mathbf{A}}_{\text{g}}{\sigma}_{\text{g}}^{2}+\mathbf{I}{\sigma}_{\varepsilon}^{2}$, where **g** is an *n* × 1 vector of the aggregate effects of all the autosomal SNPs for all of the individuals and **A**_{g} is the GRM estimated from these SNPs. This model is the same as Equation 2.

(2) To estimate the variance of genotype-environment interaction effects (${\sigma}_{\text{ge}}^{2}$), we can specify the model as **y** = **Xβ** + **g** + **ge** + **ɛ** with $\mathbf{V}={\mathbf{A}}_{\text{g}}{\sigma}_{\text{g}}^{2}+{\mathbf{A}}_{\text{ge}}{\sigma}_{\text{ge}}^{2}+\mathbf{I}{\sigma}_{\varepsilon}^{2}$, where **ge** is a vector of genotype-environment interaction effects for all of the individuals with **A**_{ge} = **A**_{g} for the pairs of individuals in the same environment and with **A**_{ge} = **0** for the pairs of individuals in different environments.

(3) To partition genetic variance onto each of the 22 autosomes, we can specify the model as $\mathbf{y}=\mathbf{X}\mathbf{\beta}+{\sum}_{i=1}^{22}{\mathbf{g}}_{i}+\mathbf{\varepsilon}$ with $\mathbf{V}={\sum}_{i=1}^{22}{\mathbf{A}}_{i}{\sigma}_{i}^{2}+\mathbf{I}{\sigma}_{\varepsilon}^{2}$, where ${\mathbf{g}}_{i}$ is a vector of genetic effects attributed to the *i*^{th} chromosome and **A**_{i} is the GRM estimated from the SNPs on the *i*^{th} chromosome.
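
The three model specifications above differ only in which relationship matrices enter **V**. The following sketch, with hypothetical helper names `gxe_grm` and `phenotypic_covariance`, builds the genotype-environment matrix from **A**_{g} and a vector of environment labels as described in model (2), and assembles **V** from any list of GRMs. It is meant as an illustration of the algebra only.

```python
import numpy as np

def gxe_grm(a_g, env):
    """A_ge equals A_g for pairs in the same environment and 0 otherwise."""
    env = np.asarray(env)
    same = env[:, None] == env[None, :]
    return np.where(same, np.asarray(a_g, dtype=float), 0.0)

def phenotypic_covariance(grms, variances, resid_var):
    """V = sum_i A_i * sigma_i^2 + I * sigma_e^2."""
    n = np.asarray(grms[0]).shape[0]
    v = resid_var * np.eye(n)
    for a, s2 in zip(grms, variances):
        v = v + s2 * np.asarray(a, dtype=float)
    return v
```

For the chromosome partition in model (3), `grms` would hold the 22 per-chromosome matrices and `variances` the 22 corresponding components.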

GCTA implements the REML method via the average information (AI) algorithm.^{13} In the REML iteration process, the estimates of variance components from the *t*^{th} iteration are updated by ${\mathbf{\theta}}^{\left(t+1\right)}={\mathbf{\theta}}^{\left(t\right)}+{\left(\mathbf{A}{\mathbf{I}}^{\left(t\right)}\right)}^{-1}\partial L/\partial \mathbf{\theta}{|}_{{\mathbf{\theta}}^{\left(t\right)}}$, where θ is a vector of variance components (${\sigma}_{1}^{2}$, …, ${\sigma}_{r}^{2}$ and ${\sigma}_{\varepsilon}^{2}$); *L* is the log likelihood function of the MLM (ignoring the constant), $L=-1/2\left(\mathrm{log}\left|\mathbf{V}\right|+\mathrm{log}\left|{\mathbf{X}}^{\prime}{\mathbf{V}}^{-1}\mathbf{X}\right|+{\mathbf{y}}^{\prime}\mathbf{P}\mathbf{y}\right)$ with $\mathbf{P}={\mathbf{V}}^{-1}-{\mathbf{V}}^{-1}\mathbf{X}{\left({\mathbf{X}}^{\prime}{\mathbf{V}}^{-1}\mathbf{X}\right)}^{-1}{\mathbf{X}}^{\prime}{\mathbf{V}}^{-1}$; **AI** is the average of the observed and expected information matrices, $\mathbf{A}\mathbf{I}=1/2\left[\begin{array}{cccc}{\mathbf{y}}^{\prime}\mathbf{P}{\mathbf{A}}_{1}\mathbf{P}{\mathbf{A}}_{1}\mathbf{P}\mathbf{y}& \cdots & {\mathbf{y}}^{\prime}\mathbf{P}{\mathbf{A}}_{1}\mathbf{P}{\mathbf{A}}_{r}\mathbf{P}\mathbf{y}& {\mathbf{y}}^{\prime}\mathbf{P}{\mathbf{A}}_{1}\mathbf{P}\mathbf{P}\mathbf{y}\\ \vdots & \vdots & \vdots & \vdots \\ {\mathbf{y}}^{\prime}\mathbf{P}{\mathbf{A}}_{r}\mathbf{P}{\mathbf{A}}_{1}\mathbf{P}\mathbf{y}& \cdots & {\mathbf{y}}^{\prime}\mathbf{P}{\mathbf{A}}_{r}\mathbf{P}{\mathbf{A}}_{r}\mathbf{P}\mathbf{y}& {\mathbf{y}}^{\prime}\mathbf{P}{\mathbf{A}}_{r}\mathbf{P}\mathbf{P}\mathbf{y}\\ {\mathbf{y}}^{\prime}\mathbf{P}\mathbf{P}{\mathbf{A}}_{1}\mathbf{P}\mathbf{y}& \cdots & {\mathbf{y}}^{\prime}\mathbf{P}\mathbf{P}{\mathbf{A}}_{r}\mathbf{P}\mathbf{y}& {\mathbf{y}}^{\prime}\mathbf{P}\mathbf{P}\mathbf{P}\mathbf{y}\end{array}\right]$; and $\partial L/\partial \mathbf{\theta}$ is a vector of first derivatives of the log 
likelihood function with respect to each variance component, $\partial L/\partial \mathbf{\theta}=-1/2\left[\begin{array}{c}\text{tr}\left(\mathbf{P}{\mathbf{A}}_{1}\right)-{\mathbf{y}}^{\prime}\mathbf{P}{\mathbf{A}}_{1}\mathbf{P}\mathbf{y}\\ \vdots \\ \text{tr}\left(\mathbf{P}{\mathbf{A}}_{r}\right)-{\mathbf{y}}^{\prime}\mathbf{P}{\mathbf{A}}_{r}\mathbf{P}\mathbf{y}\\ \text{tr}\left(\mathbf{P}\right)-{\mathbf{y}}^{\prime}\mathbf{P}\mathbf{P}\mathbf{y}\end{array}\right]$.^{13} At the beginning of the iteration process, all of the components are initialized by an arbitrary value, i.e., ${\sigma}_{i}^{2\left(0\right)}={\sigma}_{\text{P}}^{2}/\left(r+1\right)$, which is subsequently updated by the expectation maximization (EM) algorithm, ${\sigma}_{i}^{2\left(1\right)}=\left[{\sigma}_{i}^{4\left(0\right)}{\mathbf{y}}^{\prime}\mathbf{P}{\mathbf{A}}_{i}\mathbf{P}\mathbf{y}+\text{tr}\left({\sigma}_{i}^{2\left(0\right)}\mathbf{I}-{\sigma}_{i}^{4\left(0\right)}\mathbf{P}{\mathbf{A}}_{i}\right)\right]/n$. The EM algorithm is used as an initial step to determine the direction of the iteration updates because it is robust to poor starting values. After one EM iteration, GCTA switches to the AI algorithm for the remaining iterations until the iteration converges with the criterion of *L*^{(t+1)} – *L*^{(t)} < 10^{−4}, where *L*^{(t)} is the log likelihood of the *t*^{th} iteration. In the iteration process, any component that escapes from the parameter space (i.e., its estimate is negative) will be set to 10^{−6} × ${\sigma}_{\text{P}}^{2}$. If a component keeps escaping from the parameter space, it will be constrained at 10^{−6} × ${\sigma}_{\text{P}}^{2}$.
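
The iteration scheme above can be sketched compactly with dense matrices. This is an illustrative sketch under stated assumptions, not GCTA's code: the function name `ai_reml` and its return format are hypothetical, the phenotypic variance is taken as the sample variance of **y**, convergence is judged on the absolute change in log likelihood, and a component that turns negative is simply reset to 10^{−6} × ${\sigma}_{\text{P}}^{2}$ at every iteration rather than being permanently constrained.

```python
import numpy as np

def ai_reml(y, X, grms, tol=1e-4, max_iter=100):
    """One EM step followed by AI updates; theta = (sigma_1^2..sigma_r^2, sigma_e^2)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = y.size
    mats = [np.asarray(a, dtype=float) for a in grms] + [np.eye(n)]
    var_p = y.var(ddof=1)                 # assumed estimate of sigma_P^2
    floor = 1e-6 * var_p
    theta = np.full(len(mats), var_p / len(mats))

    def project(th):
        """Return P, Py and the log likelihood (constant ignored) at th."""
        V = sum(t * m for t, m in zip(th, mats))
        Vi = np.linalg.inv(V)
        ViX = Vi @ X
        XtViX = X.T @ ViX
        P = Vi - ViX @ np.linalg.solve(XtViX, ViX.T)
        Py = P @ y
        logL = -0.5 * (np.linalg.slogdet(V)[1]
                       + np.linalg.slogdet(XtViX)[1] + y @ Py)
        return P, Py, logL

    # Single EM update from the starting values.
    P, Py, logL = project(theta)
    theta = np.array([(t * t * (Py @ m @ Py) + n * t - t * t * np.sum(P * m)) / n
                      for t, m in zip(theta, mats)])
    theta = np.where(theta > 0, theta, floor)

    P, Py, logL = project(theta)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        APy = [m @ Py for m in mats]      # A_i P y
        PAPy = [P @ v for v in APy]       # P A_i P y
        AI = 0.5 * np.array([[a @ b for b in PAPy] for a in APy])
        score = -0.5 * np.array([np.sum(P * m) - Py @ v
                                 for m, v in zip(mats, APy)])
        theta_new = theta + np.linalg.solve(AI, score)
        theta_new = np.where(theta_new > 0, theta_new, floor)
        P, Py, logL_new = project(theta_new)
        converged = abs(logL_new - logL) < tol
        theta, logL = theta_new, logL_new
        if converged:
            break
    return {"theta": theta, "logL": logL, "n_iter": n_iter}
```

Each iteration inverts the *n* × *n* matrix **V** directly, so this sketch is only practical for modest sample sizes.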

From the REML analysis, GCTA has an option to provide the best linear unbiased prediction (BLUP) of the total genetic effect for all individuals. BLUP is widely used by plant and animal breeders to quantify the breeding value of individuals in artificial selection programs^{14} and also by evolutionary geneticists.^{15} Consider Equations 1 and 2, i.e., **y** = **Xβ** + **Wu** + **ɛ** and **y** = **Xβ** + **g** + **ɛ**. Because these two models are mathematically equivalent,^{7–9} the BLUP of **g** can be transformed to the BLUP of **u** by $\widehat{\mathbf{u}}={\mathbf{W}}^{\prime}{\mathbf{A}}^{-1}\widehat{\mathbf{g}}/N$. Here the estimate of *u*_{i} corresponds to the coefficient *w*_{ij}, which is then rescaled for the original *x*_{ij} by ${\widehat{u}}_{i}^{\ast}={\widehat{u}}_{i}/\sqrt{2{p}_{i}\left(1-{p}_{i}\right)}$. We could obtain the BLUP of SNP effects in a discovery set by GCTA and predict genetic values of the individuals in a validation set (${\widehat{\mathbf{g}}}_{\text{new}}={\mathbf{W}}_{\text{new}}\widehat{\mathbf{u}}$). For example, GCTA could be used to predict SNP effects in a discovery set, and the SNP effects could be used in PLINK to predict whole-genome profiles via the scoring approach in a validation set. If the predictions are unbiased, then the regression slope of the observed phenotypes on the predicted genetic values is 1.^{14} In that case, the genetic value calculated based on the BLUP of SNP effects is an unbiased predictor of the true genetic value in the validation set (**g**_{new}), in the sense that $\text{E}\left({\mathbf{g}}_{\text{new}}|{\widehat{\mathbf{g}}}_{\text{new}}\right)={\widehat{\mathbf{g}}}_{\text{new}}$.^{16,17} Prediction analyses of human complex traits have demonstrated that many SNPs that do not pass the genome-wide significance level have substantial contribution to the prediction.^{18,19} This option is therefore useful for the whole-genome prediction analysis with all of the SNPs, irrespective of their association p values.
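
As a hedged illustration of the BLUP step, the sketch below computes $\widehat{\mathbf{g}}$ and $\widehat{\mathbf{u}}$ for given variance components. The function name `blup` is hypothetical, **W** is assumed to have individuals in rows, and the SNP effects are computed as $\widehat{\mathbf{u}}=\left({\sigma}_{\text{g}}^{2}/N\right){\mathbf{W}}^{\prime}{\mathbf{V}}^{-1}\left(\mathbf{y}-\mathbf{X}\widehat{\mathbf{\beta}}\right)$. This form is algebraically identical to ${\mathbf{W}}^{\prime}{\mathbf{A}}^{-1}\widehat{\mathbf{g}}/N$ whenever **A** is invertible, and it is used here only because it avoids inverting **A**.

```python
import numpy as np

def blup(y, X, W, var_g, var_e):
    """BLUP of total genetic effects (g_hat) and SNP effects (u_hat).

    W is an n x N standardized genotype matrix (individuals in rows);
    var_g and var_e would typically come from a REML analysis.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    n, N = W.shape
    A = W @ W.T / N
    Vi = np.linalg.inv(var_g * A + var_e * np.eye(n))
    beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)  # GLS fixed effects
    r = Vi @ (y - X @ beta)                             # V^{-1}(y - X beta)
    g_hat = var_g * (A @ r)
    u_hat = (var_g / N) * (W.T @ r)
    return g_hat, u_hat
```

Genetic values in a validation set then follow as `W_new @ u_hat`, where `W_new` is standardized with the same allele frequencies as the discovery set.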

### Estimation of the Variance Explained by the SNPs on the X Chromosome

The method of estimating the genetic relationship from the X chromosome is different to that for the autosomal SNPs, because males have only one X chromosome. We modified Equation 3 for the X chromosome as:

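$${A}_{jk}^{\text{X}}=\begin{cases}\dfrac{1}{N}{\sum}_{i=1}^{N}\dfrac{\left({x}_{ij}^{\text{M}}-{p}_{i}\right)\left({x}_{ik}^{\text{M}}-{p}_{i}\right)}{{p}_{i}\left(1-{p}_{i}\right)} & \text{for a male-male pair}\\ \dfrac{1}{N}{\sum}_{i=1}^{N}\dfrac{\left({x}_{ij}^{\text{F}}-2{p}_{i}\right)\left({x}_{ik}^{\text{F}}-2{p}_{i}\right)}{2{p}_{i}\left(1-{p}_{i}\right)} & \text{for a female-female pair}\\ \dfrac{1}{N}{\sum}_{i=1}^{N}\dfrac{\left({x}_{ij}^{\text{M}}-{p}_{i}\right)\left({x}_{ik}^{\text{F}}-2{p}_{i}\right)}{\sqrt{2}{p}_{i}\left(1-{p}_{i}\right)} & \text{for a male-female pair}\end{cases}$$

where ${x}_{ij}^{\text{M}}$ and ${x}_{ij}^{\text{F}}$ are the number of copies of the reference allele for an X chromosome SNP for a male and a female, respectively.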

Assuming the male-female genetic correlation to be 1, the X-linked phenotypic covariance between a pair of individuals is:^{20}

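$${\mathrm{cov}}_{\text{X}}\left({y}_{j},{y}_{k}\right)=\begin{cases}{A}_{jk}^{\text{X}}{\sigma}_{\text{X}\left(\text{M}\right)}^{2} & \text{for a male-male pair}\\ {A}_{jk}^{\text{X}}{\sigma}_{\text{X}\left(\text{F}\right)}^{2} & \text{for a female-female pair}\\ {A}_{jk}^{\text{X}}{\sigma}_{\text{X}\left(\text{M}\right)}{\sigma}_{\text{X}\left(\text{F}\right)} & \text{for a male-female pair}\end{cases}$$

where ${\sigma}_{\text{X}\left(\text{M}\right)}^{2}$ and ${\sigma}_{\text{X}\left(\text{F}\right)}^{2}$ are the genetic variance attributed to the X chromosome for males and females, respectively.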

The relative values of ${\sigma}_{\text{X}\left(\text{M}\right)}^{2}$ and ${\sigma}_{\text{X}\left(\text{F}\right)}^{2}$ depend on the assumption made regarding dosage compensation for X chromosome genes. There are two alleles per locus in females, but only one in males. If we assume that each allele has a similar effect on the trait (i.e., no dosage compensation), the genetic variance on the X chromosome for females is twice that for males: i.e., ${\sigma}_{\text{X}}^{2}={\sigma}_{\text{X}\left(\text{F}\right)}^{2}=2{\sigma}_{\text{X}\left(\text{M}\right)}^{2}$. Thus,

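$${\mathrm{cov}}_{\text{X}}\left({y}_{j},{y}_{k}\right)=\begin{cases}\frac{1}{2}{A}_{jk}^{\text{X}}{\sigma}_{\text{X}}^{2} & \text{for a male-male pair}\\ {A}_{jk}^{\text{X}}{\sigma}_{\text{X}}^{2} & \text{for a female-female pair}\\ \frac{1}{\sqrt{2}}{A}_{jk}^{\text{X}}{\sigma}_{\text{X}}^{2} & \text{for a male-female pair}\end{cases}$$

This can be implemented by redefining the GRM for the X chromosome as ${\mathbf{A}}_{\text{X}}^{\text{ND}}=1/2{\mathbf{A}}_{\text{X}}$ for male-male pairs, ${\mathbf{A}}_{\text{X}}^{\text{ND}}={\mathbf{A}}_{\text{X}}$ for female-female pairs, and ${\mathbf{A}}_{\text{X}}^{\text{ND}}=1/\sqrt{2}{\mathbf{A}}_{\text{X}}$ for male-female pairs. If we assume that each allele in females has only half the effect of an allele in males (i.e., full dosage compensation), the X-linked genetic variance for females is half that for males: i.e., ${\sigma}_{\text{X}}^{2}={\sigma}_{\text{X}\left(\text{F}\right)}^{2}=1/2{\sigma}_{\text{X}\left(\text{M}\right)}^{2}$. Thus,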

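$${\mathrm{cov}}_{\text{X}}\left({y}_{j},{y}_{k}\right)=\begin{cases}2{A}_{jk}^{\text{X}}{\sigma}_{\text{X}}^{2} & \text{for a male-male pair}\\ {A}_{jk}^{\text{X}}{\sigma}_{\text{X}}^{2} & \text{for a female-female pair}\\ \sqrt{2}{A}_{jk}^{\text{X}}{\sigma}_{\text{X}}^{2} & \text{for a male-female pair}\end{cases}$$

Therefore, the raw **A**_{X} matrix should be parameterized as ${\mathbf{A}}_{\text{X}}^{\text{FD}}=2{\mathbf{A}}_{\text{X}}$ for male-male pairs, ${\mathbf{A}}_{\text{X}}^{\text{FD}}={\mathbf{A}}_{\text{X}}$ for female-female pairs, and ${\mathbf{A}}_{\text{X}}^{\text{FD}}=\sqrt{2}{\mathbf{A}}_{\text{X}}$ for male-female pairs. The third possibility is to assume equal genetic variance on the X chromosome for males and females, i.e., ${\sigma}_{\text{X}}^{2}={\sigma}_{\text{X}\left(\text{F}\right)}^{2}={\sigma}_{\text{X}\left(\text{M}\right)}^{2}$, in which case the **A**_{X} matrix is not redefined at all.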

We can estimate ${\sigma}_{\text{X}}^{2}$ by fitting the model $\mathbf{y}=\mathbf{X}\mathbf{\beta}+{\mathbf{g}}_{\text{X}}+\mathbf{g}+\mathbf{\varepsilon}$, where ${\mathbf{g}}_{\text{X}}$ is a vector of genetic effects attributable to the X chromosome, with $\mathrm{var}\left({\mathbf{g}}_{\text{X}}\right)={\mathbf{A}}_{\text{X}}^{\text{ND}}{\sigma}_{\text{X}}^{2}$ assuming no dosage compensation, $\mathrm{var}\left({\mathbf{g}}_{\text{X}}\right)={\mathbf{A}}_{\text{X}}^{\text{FD}}{\sigma}_{\text{X}}^{2}$ assuming full dosage compensation, and $\mathrm{var}\left({\mathbf{g}}_{\text{X}}\right)={\mathbf{A}}_{\mathrm{X}}{\sigma}_{\text{X}}^{2}$ assuming equal X-linked genetic variance for males and females. Test of dosage compensation can be achieved by comparing the likelihoods of model fitting under the three assumptions.
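
The three parameterizations of the X chromosome GRM amount to a per-individual scaling of the raw matrix, which a short sketch makes explicit. The function name `reparameterize_x_grm` and the model labels `"ND"`, `"FD"`, and `"equal"` are hypothetical choices for this illustration.

```python
import numpy as np

def reparameterize_x_grm(a_x, is_male, model="ND"):
    """Rescale the raw X-chromosome GRM under a dosage compensation model.

    "ND": no dosage compensation, "FD": full dosage compensation,
    "equal": equal X-linked variance in males and females.
    """
    a_x = np.asarray(a_x, dtype=float)
    if model == "equal":
        return a_x.copy()
    male_scale = {"ND": np.sqrt(0.5), "FD": np.sqrt(2.0)}[model]
    s = np.where(np.asarray(is_male, dtype=bool), male_scale, 1.0)
    # Male-male pairs pick up male_scale^2, male-female pairs male_scale,
    # and female-female pairs are left unchanged.
    return a_x * np.outer(s, s)
```

Fitting the model once with each of the three matrices and comparing the resulting log likelihoods corresponds to the test of dosage compensation described above.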

### Estimation of the Variance Explained by Genome-wide SNPs for a Case-Control Study

The methodology described above is also applicable for case-control data, for which the estimate of variance explained by the SNPs corresponds to variation on the observed 0–1 scale. Under the assumption of a threshold-liability model for a disease, i.e., disease liability on the underlying scale follows standard normal distribution,^{21} the estimate of variance explained by the SNPs on the observed 0–1 scale can be transformed to that on the unobserved continuous liability scale by a linear transformation.^{22} The relationship between additive genetic variance on the observed 0–1 and unobserved liability scales was proposed more than a half century ago,^{23,24} and we recently extended this transformation to account for ascertainment bias in a case-control study, i.e., a much higher proportion of cases in the sample than in the general population (unpublished data). We provide options in GCTA to analyze a binary trait and to transform the estimate on the 0–1 scale to that on the liability scale with an adjustment for ascertainment bias. There is an important caveat in applying the methods described herein to case-control data. Any batch, plate, or other technical artifact that causes allele frequencies between case and control on average to be more different than that under the null hypothesis stating that the samples come from the same population will contribute to the estimation of spurious genetic variation, because cases will appear to be more related to other cases than to controls. Therefore, stringent quality control is essential when applying GCTA to case-control data. Quantitative traits are less likely to suffer from technical genotyping artifacts because they will generally not lead to spurious association between continuous phenotypes and genotypes.
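
The transformation between scales can be illustrated with the classical linear relationship between the observed 0–1 scale and the liability scale, ${h}_{l}^{2}={h}_{o}^{2}K\left(1-K\right)/{z}^{2}$, where *K* is the prevalence and *z* is the standard normal density at the liability threshold. The text does not give the ascertainment adjustment, so the extra factor $K\left(1-K\right)/\left[P\left(1-P\right)\right]$ used below for a sample case proportion *P* is an assumption based on the commonly cited form; the function name `observed_to_liability` is likewise hypothetical.

```python
from statistics import NormalDist

def observed_to_liability(h2_obs, K, P=None):
    """Convert variance explained on the 0-1 scale to the liability scale.

    K is the population prevalence; P is the proportion of cases in the
    sample (assumed form of the ascertainment adjustment, see lead-in).
    """
    nd = NormalDist()
    t = nd.inv_cdf(1.0 - K)          # liability threshold
    z = nd.pdf(t)                    # normal density at the threshold
    factor = K * (1.0 - K) / z ** 2
    if P is not None:
        factor *= K * (1.0 - K) / (P * (1.0 - P))
    return h2_obs * factor
```

When the sample is not ascertained (`P` equal to `K`), the extra factor is 1 and the classical transformation is recovered.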

### Estimation of the Inbreeding Coefficient from Genome-wide SNPs

Apart from estimating the genetic relatedness between individuals, GCTA also has a function to estimate the inbreeding coefficient (*F*) from SNP data, i.e., the relationship between haplotypes within an individual. Two estimates have been used: one based on the variance of additive genetic values (diagonal of the SNP-derived GRM) and the other based on SNP homozygosity (implemented in PLINK).^{25} Let ${\left(1-{p}_{i}\right)}^{2}+{p}_{i}\left(1-{p}_{i}\right)F$, $2{p}_{i}\left(1-{p}_{i}\right)\left(1-F\right)$, and ${p}_{i}^{2}+{p}_{i}\left(1-{p}_{i}\right)F$ be the frequencies of the three genotypes of a SNP *i* and let ${h}_{i}=2{p}_{i}\left(1-{p}_{i}\right)$. The estimate based on the variance of additive genotype values is

$${\widehat{F}}_{i}^{\text{I}}={\left[{x}_{i}-E\left({x}_{i}\right)\right]}^{2}/{h}_{i}-1={\left({x}_{i}-2{p}_{i}\right)}^{2}/{h}_{i}-1$$

where *x*_{i} is the number of copies of the reference allele for the *i*^{th} SNP. This is a special case of Equation 3 for a single SNP when *j* = *k*. The estimate based upon excess homozygosity is

$${\widehat{F}}_{i}^{\text{II}}=\left[O\left(\#\,\text{hom}\right)-E\left(\#\,\text{hom}\right)\right]/\left[1-E\left(\#\,\text{hom}\right)\right]=1-{x}_{i}\left(2-{x}_{i}\right)/{h}_{i}$$

where *O*(# hom) and *E*(# hom) are the observed and expected number of homozygous genotypes in the sample, respectively. Both estimators are unbiased estimates of *F* in the sense that $E\left({\widehat{F}}_{i}^{\text{I}}|F\right)=E\left({\widehat{F}}_{i}^{\text{II}}|F\right)=F$, but their sampling variances are dependent on allele frequency, i.e., $\mathrm{var}\left({\widehat{F}}_{i}^{\text{I}}\right)=\mathrm{var}\left({\widehat{F}}_{i}^{\text{II}}\right)=\left(1-{h}_{i}\right)/{h}_{i}$ if *F* = 0. In addition, the covariance between the two estimators is $\left(3{h}_{i}-1\right)/{h}_{i}+\left(1-2{h}_{i}\right)F/{h}_{i}-{F}^{2}$, so that the sampling covariance between the estimators is $\left(3{h}_{i}-1\right)/{h}_{i}$ and the sampling correlation is $\left(3{h}_{i}-1\right)/\left(1-{h}_{i}\right)$ when *F* = 0. We proposed an estimator based upon the correlation between uniting gametes:^{5}

$${\widehat{F}}_{i}^{\text{III}}=\left[{x}_{i}^{2}-\left(1+2{p}_{i}\right){x}_{i}+2{p}_{i}^{2}\right]/{h}_{i}$$

${\widehat{F}}_{i}^{\text{III}}$ is also an unbiased estimator of *F* in the sense that $E\left({\widehat{F}}_{i}^{\text{III}}|F\right)=F$. If *F* = 0, $\mathrm{var}\left({\widehat{F}}_{i}^{\text{III}}\right)=1$ regardless of allele frequency, which is smaller than the sampling variance of ${\widehat{F}}_{i}^{\text{I}}$ and ${\widehat{F}}_{i}^{\text{II}}$, i.e., $1\le \left(1-{h}_{i}\right)/{h}_{i}$. When 0 < *F* < 1/3, ${\widehat{F}}_{i}^{\text{III}}$ also has a smaller variance than ${\widehat{F}}_{i}^{\text{I}}$ and ${\widehat{F}}_{i}^{\text{II}}$. In GCTA, we use 1 + ${\widehat{F}}_{i}^{\text{III}}$ rather than 1 + ${\widehat{F}}_{i}^{\text{I}}$ to calculate the diagonal of the GRM. For multiple SNPs, we average the estimates over all of the SNPs, i.e., $\widehat{F}=1/N{\sum}_{i=1}^{N}{\widehat{F}}_{i}$.
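
The stated means and variances of the three per-SNP estimators can be checked by exact enumeration over the three genotypes. The sketch below assumes the estimators take the forms written in the code comments (the variance-based estimator, the excess-homozygosity estimator, and the uniting-gametes estimator, which here equals the average of the first two); the function names are hypothetical.

```python
def inbreeding_estimates(x, p):
    """Per-SNP estimates (F_I, F_II, F_III) for a genotype x in {0, 1, 2}."""
    h = 2.0 * p * (1.0 - p)
    f1 = (x - 2.0 * p) ** 2 / h - 1.0                      # variance based
    f2 = 1.0 - x * (2.0 - x) / h                           # excess homozygosity
    f3 = (x * x - (1.0 + 2.0 * p) * x + 2.0 * p * p) / h   # uniting gametes
    return f1, f2, f3

def genotype_frequencies(p, f):
    """Frequencies of genotypes with 0, 1 and 2 reference alleles given F."""
    q = 1.0 - p
    return {0: q * q + p * q * f, 1: 2.0 * p * q * (1.0 - f), 2: p * p + p * q * f}

def moments(p, f):
    """Exact mean and variance of each estimator over the genotype distribution."""
    freqs = genotype_frequencies(p, f)
    means = [sum(w * inbreeding_estimates(x, p)[k] for x, w in freqs.items())
             for k in range(3)]
    variances = [sum(w * (inbreeding_estimates(x, p)[k] - means[k]) ** 2
                     for x, w in freqs.items()) for k in range(3)]
    return means, variances
```

For example, with *p* = 0.2 and *F* = 0 the enumeration gives a variance of (1 – *h*) / *h* = 2.125 for the first two estimators and 1 for the third.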

### Estimating LD Structure

In a standard GWAS, particularly with a large sample size, the mean (λ_{mean}) or median (λ_{median}) of the test statistics for single-SNP associations often deviates from its expected value under the null hypothesis of no association between any SNP and the phenotype, which is usually interpreted as the effect due to population stratification and/or cryptic relatedness.^{11,26,27} An alternative explanation is that polygenic variation causes the observed inflated test statistic.^{18} To predict the genomic inflation factors, λ_{mean} and λ_{median}, from polygenic parameters such as the total amount of variance that is explained by all SNPs, we need to quantify the LD structure between SNPs and putative causal variants (unpublished data). GCTA provides a function to search for all the SNPs in LD with the “causal variants” (mimicked by a set of SNPs chosen by the user). Given a causal variant, we use simple regression to test for SNPs in LD with the causal variant within *d* Mb distance in either direction. PLINK has an option (“show targets”) to select SNPs in LD with a set of target SNPs with LD *r*^{2} larger than a user-specified cutoff value. The PLINK option is very useful to distinguish independent association signals but less suited to predict λ_{mean} and λ_{median}, because the test statistics of the SNPs in modest LD with causal variants (SNPs at Mb distance with low *r*^{2}) will also be inflated to a certain extent, and these test statistics will contribute to the genomic inflation factors.

### GWAS Simulation

We provided a function to simulate GWAS data based on the observed genotype data. For a quantitative trait, the phenotypes are simulated by the simple additive genetic model **y** = **Wu** + **ɛ**, where the notation is the same as above. Given a set of SNPs assigned as causal variants, the effects of the causal variants are generated from a standard normal distribution, and the residual effects are generated from a normal distribution with mean of 0 and variance of ${\sigma}_{g}^{2}\left(1/{h}^{2}-1\right)$, where ${\sigma}_{g}^{2}$ is the empirical variance of **Wu** and *h*^{2} is the user specified heritability. For a case-control study, assuming a threshold-liability model, disease liabilities are simulated in the same way as that for the phenotypes of a quantitative trait. Any individual with disease liability exceeding a certain threshold *T* is assigned to be a case and a control otherwise, where *T* is the threshold of normal distribution truncating the proportion of *K* (disease prevalence). The only purpose of this function is to do a simple simulation based on the observed genotype data. More complicated simulation can be performed with programs such as ms,^{28} GENOME,^{29} FREGENE,^{30} and HAPGEN.^{31}
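
The simulation scheme can be sketched as follows. The function name `simulate_gwas`, its arguments, and the standardization of the liability before applying the normal threshold are assumptions made for this illustration; only the additive model, the standard normal causal effects, the residual variance ${\sigma}_{g}^{2}\left(1/{h}^{2}-1\right)$, and the threshold rule come from the text.

```python
import numpy as np
from statistics import NormalDist

def simulate_gwas(W, causal, h2, K=None, rng=None):
    """Simulate y = W u + e from standardized genotypes W (n x N).

    causal: column indices of the SNPs assigned as causal variants.
    h2: user-specified heritability. K: disease prevalence; if given,
    a 0/1 case-control status is returned instead of the quantitative trait.
    Returns (phenotype, genetic values).
    """
    rng = np.random.default_rng() if rng is None else rng
    W = np.asarray(W, dtype=float)
    u = rng.standard_normal(len(causal))       # causal effects ~ N(0, 1)
    g = W[:, causal] @ u
    var_g = g.var(ddof=1)                      # empirical variance of W u
    e = rng.normal(0.0, np.sqrt(var_g * (1.0 / h2 - 1.0)), size=g.size)
    y = g + e
    if K is None:
        return y, g
    liability = (y - y.mean()) / y.std(ddof=1)  # standardized liability
    T = NormalDist().inv_cdf(1.0 - K)           # normal threshold for K
    return (liability > T).astype(int), g
```

With a few thousand individuals, the ratio of the variance of the genetic values to the variance of the simulated trait should be close to the requested *h*^{2}, and the proportion of cases close to *K*.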

### Data Management

We chose the PLINK^{25} compact binary file format (\*.bed, \*.bim, and \*.fam) as the input data format for GCTA because of its popularity in the genetics community and its efficiency of data storage. For the imputed dosage data, we use the output files of the imputation program MACH^{32} (\*.mldose.gz and \*.mlinfo.gz) as the inputs for GCTA. For the convenience of analysis, we provide options to extract a subset of individuals and/or SNPs and to filter SNPs based on certain criteria, such as chromosome position, minor allele frequency (MAF), and imputation *R*^{2} (for the imputed data). However, we do not provide functions for a thorough quality control (QC) of the data, such as Hardy-Weinberg equilibrium test and missingness, because these functions have been well developed in many other genetic analysis packages, e.g., PLINK, GenABEL,^{33} and SNPTEST.^{34} We assume that the data have been cleaned by a standard QC process before entering into GCTA.

### Estimating Total Heritability

The method implemented in GCTA is to estimate the variance explained by chromosome- or genome-wide SNPs rather than the trait heritability. Estimating the heritability (i.e., variance explained by all the causal variants), however, relies on the genetic relationship at causal variants that is predicted with error by the genetic relationship derived from the SNPs as a result of imperfect tagging. We have previously established that the prediction error is *c* + 1 / *N*, with *c* depending on the distribution of the MAF of causal variants. We therefore developed a method based on simple regression to correct for the prediction error by

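$${A}_{jk}^{\ast}=\begin{cases}1+\beta \left({A}_{jk}-1\right) & \text{for } j=k\\ \beta {A}_{jk} & \text{for } j\ne k\end{cases}$$

where $\beta =1-(c+1/N)/\mathrm{var}\left({A}_{jk}\right)$. The estimate of variance explained by all of the SNPs after such adjustment is an unbiased estimate of heritability only if the assumption about the MAF distribution of causal variants is correct.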

### Efficiency of GCTA Computing Algorithm

GCTA implements the REML method based on the variance-covariance matrix **V** and the projection matrix **P**. In some of the mixed model analysis packages, such as ASREML,^{35} to avoid the inversion of the *n* × *n* **V** matrix, people usually use Gaussian elimination of the mixed model equations (MME) to obtain the **AI** matrix based on sparse matrix techniques. The SNP-derived GRM matrix, however, is typically dense, so the sparse matrix technique will bring an extra cost of memory and CPU time. Moreover, the dimension of MME depends on the number of random effects in the model, whereas the **V** matrix does not. For example, when fitting the 22 chromosomes simultaneously in the model, the dimension of MME is 22*n* × 22*n* (ignoring the fixed effects), whereas the dimension of **V** matrix is still *n* × *n*. We compared the computational efficiency of GCTA and ASREML. When the sample size is small, e.g., n < 3000, both GCTA and ASREML take a few minutes to run. When the sample size is large, e.g., n > 10,000, especially when fitting multiple GRMs, it takes days for ASREML to finish the analysis, whereas GCTA needs only a few hours.

### System Requirements

We have released executable versions of GCTA for the three major operating systems: MS Windows, Linux/Unix, and Mac OS. We have also released the source codes so that users can compile them for some specific platforms. GCTA requires a large amount of memory when calculating the GRM or performing an REML analysis with multiple genetic components. For example, it requires ∼4.8 GB memory to calculate the GRM for a data set with 3925 individuals genotyped by 294,831 SNPs, and it takes ∼4 CPU hours (AMD Opteron 2.8 GHz) to finish the computation. We therefore recommend using the 64-bit version of GCTA for large memory support.

### Nonadditive Genetic Variance

The analysis approach we have adapted is a logical extension of estimation methods based on pedigrees. It allows estimation of additive genetic variation that is captured by SNP arrays and is therefore informative with respect to the genetic architecture of complex traits. The estimate of variance captured by all of the SNPs obtained in GCTA is directly comparable to the heritability estimated from pedigree analysis in family and twin studies, as well as the variance explained by GWAS hits, so that missing and hiding heritability can be quantified.^{5} Other sources of genetic variations such as dominance, gene-gene interaction, and gene-environment interaction are also important for complex trait variation but are less relevant to the “missing heritability” problem if the total heritability refers to the narrow-sense heritability, i.e., the proportion of phenotypic variance due to additive genetic variance. The current version of GCTA only provides functions to estimate and partition the variances of additive and additive-environment interaction effects. It is technically feasible to extend the analysis to include dominance and/or gene-gene interaction effects in the future. However, the power to detect the high-order genetic variation will be limited, i.e., the sampling variance of estimated variance components will be very large. Future developments will also include options to do multivariate analyses, to read genotype or imputed probability data in different formats, and to implement other applications of whole-genome or chromosome segment approaches.

In summary, we have developed a versatile tool to estimate genetic relationships from genome-wide SNPs that can subsequently be used to estimate variance explained by SNPs via a mixed model approach. We provide flexible options to specify different genetic models to partition genetic variance onto each of the chromosomes. We developed methods to estimate genetic relationships from the SNPs on the X chromosome and to test the hypotheses of dosage compensation. GCTA is not limited to the analysis of data on human complex traits, but in this report we only use examples and specifications (e.g., the number of autosomes) for humans.

## Acknowledgments

We thank Bruce Weir for discussions on the sampling variance of estimators of inbreeding coefficients. We thank Allan McRae and David Duffy for discussions and Anna Vinkhuyzen for software testing. We acknowledge funding from the Australian National Health and Medical Research Council (grants 389892 and 613672) and the Australian Research Council (grants DP0770096 and DP1093900).

## Web Resources

The URLs for data presented herein are as follows:

- Genome-wide Complex Trait Analysis (GCTA), http://gump.qimr.edu.au/gcta
- MACH 1.0: A Markov Chain-based haplotyper, http://www.sph.umich.edu/csg/yli/mach

## References

**American Society of Human Genetics**


- GCTA: A Tool for Genome-wide Complex Trait Analysis. American Journal of Human Genetics. 2011 Jan 7; 88(1):76.
