# Genome-wide Efficient Mixed Model Analysis for Association Studies

^{1}Department of Human Genetics; University of Chicago, Chicago, IL 60637

^{2}Department of Statistics; University of Chicago, Chicago, IL 60637

## Abstract

Linear mixed models have attracted considerable recent attention as a powerful and effective tool for accounting for population stratification and relatedness in genetic association tests. However, existing methods for exact computation of standard test statistics are computationally impractical for even moderate-sized genome-wide association studies. To deal with this, several approximate methods have been proposed. Here, we present an efficient exact method that makes these approximations unnecessary in many settings. This method is roughly *n* times faster than the widely-used exact method EMMA, where *n* is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.

## INTRODUCTION

There is an increasing interest in using linear mixed models (LMMs, also known as mixed linear models, or MLMs) to test for association in genome-wide association studies (GWAS), because of their demonstrated effectiveness in accounting for relatedness among samples and in controlling for population stratification and other confounding factors^{1–7}. However, these models present substantial computational challenges. For example, at the time this work was submitted for publication, the most efficient algorithm for computing (effectively) exact association test statistics (either the Wald test or the likelihood ratio test), implemented in the Efficient Mixed Model Association (EMMA) software^{3}, had a per-SNP computational time that increases with the cube of the number of individuals (*n*). As a result, a medium size GWAS with a few thousand individuals and half a million SNPs would take years of CPU time to analyze^{1,7}. (While this paper was in review, Lippert et al (2011)^{8} also published an efficient algorithm for this model, implemented in software FaST-LMM; the relationship between this algorithm and ours is discussed later.)

Several approximation methods have been proposed to make genome-wide analysis using linear mixed models possible. Probably the simplest and fastest of these approximations, GRAMMAR (Genome-wide Rapid Association using Mixed Model And Regression), implemented in the software GenABEL^{9}, first estimates the residuals from the LMM under the null model, and then treats these residuals as phenotypes for further genome-wide analysis by a standard linear model^{10}. This substantially reduces per-SNP computation time, making it linear in the number of individuals. More recently two more-sophisticated approximate approaches have been suggested. Zhang et al^{7} use P3D (Population Parameters Previously Determined) which avoids repeatedly estimating variance components when performing each test by simply using the pre-estimated variance components from the null model; their method is implemented in the software TASSEL. Kang et al^{1} also avoid repeatedly estimating variance components by a slightly different strategy, which keeps the heritability estimated from the null model fixed when testing individual SNPs. Their approach is implemented in the software EMMAX (EMMA eXpedited). (This approximation, and related ideas, was also considered by previous authors, including^{10,11}.) Both these last two approximations have per-SNP computation time that increases quadratically with the number of individuals, which makes them practical, on a single desktop computer, for GWAS involving thousands of individuals.
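The two-step idea behind a GRAMMAR-style approximation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the GenABEL implementation: the variance ratio `lam0` is assumed to have been estimated already under the null model, the function names `null_residuals` and `residual_scan` are hypothetical, the data are synthetic, and every SNP is assumed to be polymorphic.

```python
import numpy as np

def null_residuals(y, W, K, lam0):
    """Residuals of the null LMM y = W a + u + e, with Var(u) proportional to lam0 * K.

    lam0 is treated as known here (assumed pre-estimated under the null model).
    """
    n = y.size
    H = lam0 * K + np.eye(n)
    Hinv = np.linalg.inv(H)
    alpha = np.linalg.solve(W.T @ Hinv @ W, W.T @ Hinv @ y)  # GLS fixed effects
    e = y - W @ alpha
    u_hat = lam0 * K @ (Hinv @ e)                            # predicted random effect
    return e - u_hat

def residual_scan(resid, G):
    """Simple linear regression of the residuals on each SNP column of G.

    Each SNP costs O(n), which is why the scan is linear in sample size.
    """
    n = resid.size
    r = resid - resid.mean()
    Gc = G - G.mean(axis=0)
    sxx = np.sum(Gc ** 2, axis=0)
    sxy = Gc.T @ r
    slope = sxy / sxx
    sse = np.sum(r ** 2) - slope * sxy
    se = np.sqrt(sse / (n - 2) / sxx)
    return slope, slope / se

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n, m = 40, 20
G = rng.binomial(2, 0.5, size=(n, m)).astype(float)
Gc = G - G.mean(axis=0)
K = Gc @ Gc.T / m
y = rng.normal(size=n)
W = np.ones((n, 1))

resid = null_residuals(y, W, K, lam0=0.5)
slope, t_stat = residual_scan(resid, G)
```

Because the variance components are never revisited once the residuals are formed, the per-SNP work reduces to an ordinary regression, which is the source of both the speed and the potential inaccuracy discussed in the text.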

Although in some settings the approximate methods described above provide results almost identical to those of the exact method^{1,7}, this is not guaranteed in general, and in practice it is hard to know how accurate the approximations will be without running an exact calculation. One possible consequence of inaccuracy in the approximation could be a reduction in power compared with exact methods. For these reasons, the ability to perform exact calculations remains of interest. Here, we present a new, more efficient, method for exact calculations that provides numerically identical results to EMMA (i.e. exact Wald or likelihood ratio test statistics) but is roughly *n* times faster (computation time per SNP, when using the usual genome-wide relatedness matrix, is quadratic in the number of individuals, with run time similar to EMMAX). This makes exact calculations feasible for large GWAS, obviating the need for approximate methods in most common settings.

## RESULTS

The method and its computational complexity are described and derived in detail in the Online Methods section. Briefly, the method requires complete or imputed genotype data^{12,13} for all SNPs, and involves only one eigen-decomposition of the relatedness matrix at the beginning (computational complexity *O(n^{3})*). For each SNP tested, it effectively replaces the expensive additional eigen-decomposition step in EMMA with one matrix and vector multiplication (computational complexity *O(n^{2})*). After this, like EMMA, each iteration of the following optimization step requires cheap operations (complexity *O(n)*) to evaluate both first and second derivatives of the target functions. We refer to our method as Genome-wide Efficient Mixed Model Association (GEMMA) because it builds on EMMA and facilitates its genome-wide application.
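The structure of this computation can be sketched as follows. The code is an illustrative assumption rather than the GEMMA implementation: it uses a standard single-variance-component parameterization in which the phenotype covariance is proportional to `lam * K + I`, the function names are hypothetical, the data are synthetic, and a plain grid search over `lam` stands in for the Newton-Raphson iteration on first and second derivatives described in the paper.

```python
import numpy as np

def eigen_rotate(K, y, W):
    """One-off O(n^3) step: eigen-decompose K and rotate y and the covariates W."""
    d, U = np.linalg.eigh(K)              # K = U diag(d) U^T
    return d, U, U.T @ y, U.T @ W

def profile_loglik(lam, d, y_rot, X_rot):
    """Log-likelihood at variance ratio lam, maximised over effects and scale.

    After rotation the covariance is diagonal, so no n-by-n algebra is needed.
    """
    n = y_rot.size
    w = 1.0 / (lam * d + 1.0)             # inverse of the diagonal covariance
    Xw = X_rot * w[:, None]
    beta = np.linalg.solve(Xw.T @ X_rot, Xw.T @ y_rot)   # weighted least squares
    resid = y_rot - X_rot @ beta
    rss = np.sum(w * resid ** 2)
    return (0.5 * n * np.log(n / (2.0 * np.pi)) - 0.5 * n
            - 0.5 * np.sum(np.log(lam * d + 1.0)) - 0.5 * n * np.log(rss))

def fit_snp(x, d, U, y_rot, W_rot, lam_grid):
    """Per-SNP step: one O(n^2) rotation of x, then cheap evaluations over lam."""
    X_rot = np.column_stack([W_rot, U.T @ x])
    ll = np.array([profile_loglik(lam, d, y_rot, X_rot) for lam in lam_grid])
    best = int(np.argmax(ll))
    return lam_grid[best], ll[best]

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n, m = 60, 300
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
Gc = G - G.mean(axis=0)
K = Gc @ Gc.T / m
y = rng.normal(size=n)
W = np.ones((n, 1))                       # intercept only

d, U, y_rot, W_rot = eigen_rotate(K, y, W)
lam_grid = np.exp(np.linspace(-10.0, 5.0, 61))
lam_hat, ll_hat = fit_snp(G[:, 0], d, U, y_rot, W_rot, lam_grid)
print(lam_hat, ll_hat)
```

The eigen-decomposition is done once and reused for every SNP; only the product `U.T @ x` and the evaluations over `lam` are repeated per SNP.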

We illustrate our method and compare the analysis results with the exact method EMMA and the approximation methods EMMAX and GRAMMAR, using two examples, a mouse GWAS for high-density lipoprotein cholesterol (HDL-C) levels from the Hybrid Mouse Diversity Panel (HMDP)^{14} and a human GWAS for Crohn's disease from the Wellcome Trust Case Control Consortium (WTCCC)^{15}. The size of this second study makes it computationally impractical to analyze with EMMA^{3}. Table 1 summarizes the computational complexity for the four methods along with CPU time for the two data sets on a single desktop CPU. Table 1 also includes results for the recently-published FaST-LMM^{8}, which can produce identical *p* values to EMMA and GEMMA in the same time complexity as GEMMA; see below for further discussion. As expected GEMMA is comparable in speed with EMMAX, completing the larger (WTCCC) example in under 4 hours.

**...**

To verify the correctness of our algorithm and implementation we first validate it by comparing *p* values calculated by GEMMA with those from EMMA on a subset of SNPs from both data sets. For all SNPs examined the *p* values from the two methods match exactly (Wald test results shown in Figure 1a and 1b; Likelihood ratio test not shown).

**Figure 1.** … log_{10} *p* values obtained from GEMMA with those from EMMA (a, b), and EMMAX and GRAMMAR (c, d). In (a) and (b) the *p* values are shown for the top 10,000 markers and top 100 markers respectively. In (c) and (d) the *p* values are shown for all …

Since GEMMA provides exact computations in essentially the same time as EMMAX, the accuracy of the approximations in EMMAX and other methods may seem moot. However, in some settings, and specifically for mixed models with more than one random effect (variance component), the computational trick used by GEMMA does not apply, and approximations along the lines of EMMAX may remain necessary. For this reason the accuracy of different approximation methods remains of some potential interest, and so we present a comparison between the (Wald test) *p* values from GEMMA, EMMAX and GRAMMAR, genome-wide, on both the HMDP and WTCCC data sets above.

The HMDP GWAS represents a situation where approximation methods such as EMMAX or GRAMMAR may yield inaccurate test statistics. In particular, because individuals in the data set are closely related, and the strongly associated SNPs contribute to a significant proportion of phenotypic variation in HDL-C^{13}, using estimates of variance components or fitted residuals from the null model for testing may be expected to yield conservative *p* values, leading to a potential loss of power. Our empirical comparison (Figure 1c) confirms this: in this case, approximation by EMMAX leads to systematic and appreciable underestimation of the most significant *p* values (almost two orders of magnitude), while approximation by GRAMMAR leads to dramatic underestimation of all *p* values. Indeed, in contrast to the exact *p* values, no *p* values generated by EMMAX are significant at the conventional 0.05 level after Bonferroni correction, and no *p* values generated by GRAMMAR are significant even before Bonferroni correction. The fact that the exact *p* values for the most significant results are substantially more significant than the approximate *p* values from EMMAX suggests that, in this type of setting, the exact *p* values may produce a more powerful test; simulation results confirm this (Supplementary Fig. 1).
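The arithmetic behind the Bonferroni comparison can be made concrete with a short sketch. All numbers here are hypothetical and chosen only for illustration; they are not taken from the HMDP analysis, and the helper names are assumptions.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

def passes(p_value, alpha, n_tests):
    """True if p_value is significant after Bonferroni correction."""
    return p_value <= bonferroni_threshold(alpha, n_tests)

# Hypothetical numbers: 100,000 tests, and an approximate p value that is
# two orders of magnitude less significant than the exact one.
n_tests = 100_000
p_exact = 1e-8
p_approx = p_exact * 100

print(bonferroni_threshold(0.05, n_tests))
print(passes(p_exact, 0.05, n_tests), passes(p_approx, 0.05, n_tests))
```

With these illustrative values the exact *p* value clears the corrected threshold while the approximate one does not, which is the kind of discrepancy that turns an inaccurate approximation into a loss of power.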

In contrast, the WTCCC example represents a very different situation where the approximations may be expected to yield accurate test statistics. This is because there is relatively little population stratification in these data (the individuals are all from the UK, and the relatedness matrix is approximately diagonal), and the effect sizes of the most strongly associated SNPs for Crohn's disease are small compared with the effect sizes in the HMDP data above^{14}. Both conditions favor the approximation assumptions in EMMAX and GRAMMAR. Empirical comparisons (Figure 1d) show that, for this particular data set, the *p* values from EMMAX differ negligibly from the exact values. However, the *p* values from GRAMMAR still depart notably from the exact values.

Taken together, the above results confirm that approximation by EMMAX is appreciably more accurate than GRAMMAR, even in cases, such as the WTCCC data, where the sample structure is subtle. The comparisons also demonstrate that the accuracy of the EMMAX approximation can vary from case to case. Consequently, the potential gain in power from doing exact vs approximate tests will also vary among datasets. For the HMDP data, the potential gain in power from the exact calculations appears considerable, and this is confirmed by simulations (Supplementary Fig. 1). For the WTCCC Crohn's disease data the power gain is negligible, and as noted in ref^{1} only a small gain in power is generally expected at SNPs with small effect size. Of course, one nice feature of being able to do the exact tests is that it obviates the need to consider which approximations work best under what circumstances, or to consider ways in which the approximations could be improved. We also note that the computational tricks employed here also apply to other settings, including the combined “variable selection plus random effects” model that has been widely studied for phenotype and breeding value prediction^{16}, but which, without the trick used here, is computationally challenging to fit.

## DISCUSSION

In summary, we have presented an efficient method for computing exact values of standard test statistics in linear mixed models. This method is comparable in speed with approximation methods such as EMMAX while yielding exact test statistics. Using two examples we illustrate our method, and show that the approximation methods can yield inaccurate *p* values when the sample structure is strong and/or when the marker effect size is large. We also find that the approximation by EMMAX is more accurate than the approximation by GRAMMAR genome-wide (a comparison made possible only by the availability of an efficient exact method).

While this work was in review, Lippert et al^{8} also published an efficient method for computing likelihoods for LMMs that, like our method, requires only one singular value decomposition of the relatedness matrix. They use this method, in combination with Brent's optimization algorithm, to produce an algorithm for computing exact test statistics with effectively the same computational complexity as GEMMA: *O(mn^{2} + cn^{2} + pn^{2} + ptc^{2}n)*, as in Table 1. (Lippert et al^{8} also suggest a further innovation, using a low-rank relatedness matrix in place of the usual relatedness matrix computed from all SNPs genome-wide, that produces an algorithm that is linear in *n*, and so feasible for very large GWAS samples containing more than 100,000 individuals; however changing the relatedness matrix in this way changes the resulting *p* values appreciably, and in this sense this linear complexity algorithm is not directly comparable with either GEMMA or EMMA; see below for further discussion.) The main additional contribution of our work here beyond that in Lippert et al is that we provide, and make use of, efficient methods for evaluation of not only the likelihood, but also both its first and second derivatives. This allows us to make use of the Newton-Raphson optimization method, which has better theoretical convergence properties than Brent's algorithm (quadratic, vs super-linear), potentially reducing per-SNP computation time by reducing the number of iterations required for convergence, *t*. The practical effect of this is expected to depend on the sample size *n*. Examining the theoretical computational complexity, if *p* is large (and we assume the simplest case with no additional covariates, so *c=1*) then the per-SNP complexity of the algorithms is *O(n^{2} + tn)*. Thus if *n* is large then the *n^{2}* term will dominate and the number of iterations will have only a small effect on computation time; if *n* is moderate then the number of iterations may play a more important role. Consistent with this, we found GEMMA to be 12 times faster than the Lippert et al algorithm, implemented in FaST-LMM, for the smaller HMDP dataset (33 minutes vs 6.8 hours), but only 2 times faster for the WTCCC data (3.3 hours vs 6.2 hours). It is possible that implementational issues, which are important but conceptually less fundamental, also contribute to this difference in speed. Besides this difference in speed, which might be considered a minor issue, by providing efficient methods to compute derivatives our work here lays the foundations for similar efficient analyses for LMMs with multivariate phenotypes^{17}, where multidimensional optimization is required and evaluating the target functions alone is unlikely to suffice.
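A rough way to see how the iteration count *t* trades off against *n* is to evaluate the leading-order per-SNP term, *n^{2} + tc^{2}n*, for a few sample sizes. The sketch below is only a cost model under the assumption that all constant factors hidden by the *O(·)* notation are equal to one; the values of `n` and `t` are hypothetical and the function names are illustrative.

```python
def per_snp_cost(n, t, c=1):
    """Leading-order per-SNP operation count: n^2 (rotation) + t * c^2 * n (iterations)."""
    return n ** 2 + t * c ** 2 * n

def iteration_share(n, t, c=1):
    """Fraction of the per-SNP cost attributable to the optimization iterations."""
    return t * c ** 2 * n / per_snp_cost(n, t, c)

# Hypothetical sample sizes and iteration counts, for illustration only.
for n in (500, 5000, 50000):
    for t in (5, 50):
        print(n, t, round(iteration_share(n, t), 4))
```

With *c = 1* the share simplifies to *t / (n + t)*, so it shrinks as *n* grows. Real run times also depend on constant factors and implementation details that this model ignores.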

Here we have focused on computations using the usual relatedness matrix, computed from all SNPs genome-wide, whose rank, *r*, is typically equal to the number of individuals *n*. However, as noted by Lippert et al^{8}, if a lower-rank relatedness matrix is used then this reduces computing time (computational complexity of the singular value decomposition can scale with *nr^{2}*) and in some cases memory requirements (e.g. Lippert et al^{8} suggest using a relatedness matrix based on only a few thousand SNPs; this has the nice property that required singular value decompositions can be done without computing the *n* by *n* relatedness matrix itself). Using the usual full-rank relatedness matrix, our current implementation of GEMMA can handle approximately 23,000 individuals on a machine with 64 Gb memory (in double precision); using a lower-rank relatedness matrix, much larger problems could be tackled. However, we note that changing the relatedness matrix can produce much larger changes in *p* values than, for example, the differences between EMMAX and exact calculations (e.g. Supplementary Fig. 2), and for both the HMDP and WTCCC data using a lower-rank relatedness matrix seems to compromise the ability of the LMM to control for sample structure (Supplementary Table 1). Thus choice of relatedness matrix could affect statistical efficiency (both power, and correct control of type I error due to stratification or relatedness) as well as computational efficiency. Interestingly, statistical and computational considerations may not necessarily conflict: for example, Zhang et al^{7} suggest that use of compressed MLM, which yields a lower-rank relatedness matrix by clustering individuals, can both reduce computation and increase power compared with the full-rank matrix. The general question of which low-rank relatedness matrices produce the best combination of computational and statistical performance seems to be an interesting avenue for further study.
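The remark that the decomposition can be obtained without forming the *n* by *n* matrix can be illustrated with a thin singular value decomposition. This is a minimal sketch, assuming the relatedness matrix has the form `K = Z Z^T / s` for an `n`-by-`s` matrix `Z` of standardized genotypes with `s < n`; the function name `lowrank_eigen` is hypothetical and the random matrix is only a stand-in for genotype data.

```python
import numpy as np

def lowrank_eigen(Z):
    """Nonzero eigenvalues and eigenvectors of K = Z Z^T / s, computed from Z alone.

    A thin SVD of the n-by-s matrix Z costs O(n s^2) and never builds K.
    """
    n, s = Z.shape
    U, sv, _ = np.linalg.svd(Z, full_matrices=False)   # Z = U diag(sv) V^T
    return sv ** 2 / s, U

# Synthetic stand-in for standardized genotypes, for illustration only.
rng = np.random.default_rng(0)
n, s = 200, 20
Z = rng.normal(size=(n, s))
d, U = lowrank_eigen(Z)
print(d.shape, U.shape)
```

For scale, a single dense *n* by *n* matrix in double precision with *n* = 23,000 occupies 23,000² × 8 bytes, roughly 4.2 GB, so avoiding that matrix matters for memory as well as time.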

### URLs

Our method is implemented in software GEMMA, freely available at http://stephenslab.uchicago.edu/software.html.

## ACKNOWLEDGMENT

This research is supported in part by NIH grant HL092206 (PI Y Gilad) and NIH grant HG02585 to MS. We thank A. J. Lusis for making the mouse genotype and phenotype data available. This study also makes use of data generated by the Wellcome Trust Case-Control Consortium^{14}. A full list of the investigators who contributed to the generation of the data is available from http://www.wtccc.org.uk. Funding for the WTCCC project was provided by the Wellcome Trust under award 085475.

## Footnotes

**AUTHOR CONTRIBUTIONS** X.Z. and M.S. designed the study, developed methods and wrote the manuscript. X.Z. implemented software and analyzed data.

**COMPETING FINANCIAL INTERESTS** The authors declare no competing financial interests.
