Nat Genet. 2018 Sep;50(9):1335-1341. doi: 10.1038/s41588-018-0184-y. Epub 2018 Aug 13.

Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies.

Author information

1. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
2. Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI, USA.
3. Department of Internal Medicine, Division of Cardiology, University of Michigan Medical School, Ann Arbor, MI, USA.
4. K. G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, Norwegian University of Science and Technology, Trondheim, Norway.
5. Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI, USA.
6. Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA.
7. Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA.
8. HUNT Research Centre, Department of Public Health and General Practice, Norwegian University of Science and Technology, Levanger, Norway.
9. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA. cristen@umich.edu.
10. Department of Internal Medicine, Division of Cardiology, University of Michigan Medical School, Ann Arbor, MI, USA. cristen@umich.edu.
11. Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI, USA. cristen@umich.edu.
12. Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI, USA. leeshawn@umich.edu.
13. Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI, USA. leeshawn@umich.edu.

Abstract

In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-the-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for >1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
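
The calibration idea named in the abstract can be illustrated with a minimal sketch. This is not the SAIGE implementation: it assumes independent samples (no relatedness, no mixed model), a score statistic S = sum_i g_i (y_i - mu_i) with Bernoulli outcomes y_i, and a one-sided upper-tail P value from the Barndorff-Nielsen form of the saddlepoint approximation. The function names, the cutoff of 2 standard deviations for falling back to the normal approximation, and the toy data are all assumptions made for illustration.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def _sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def _softplus(a):
    # Numerically stable log(1 + exp(a)).
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def cgf_terms(t, g, mu):
    """K(t), K'(t), K''(t): cumulant generating function of S and derivatives."""
    k0 = k1 = k2 = 0.0
    for gi, mi in zip(g, mu):
        a = gi * t + math.log(mi / (1.0 - mi))
        p = _sigmoid(a)  # exponentially tilted success probability
        k0 += math.log(1.0 - mi) + _softplus(a) - t * gi * mi
        k1 += gi * (p - mi)
        k2 += gi * gi * p * (1.0 - p)
    return k0, k1, k2


def spa_upper_tail(s, g, mu, cutoff=2.0, tol=1e-10):
    """Approximate P(S > s) by the saddlepoint method (hypothetical helper)."""
    var = sum(gi * gi * mi * (1.0 - mi) for gi, mi in zip(g, mu))
    z = s / math.sqrt(var)
    if abs(z) < cutoff:
        # Near the mean the normal approximation is adequate and the
        # saddlepoint formula is numerically delicate (sketch assumption).
        return normal_upper_tail(z)
    s_min = sum(min(-gi * mi, gi * (1.0 - mi)) for gi, mi in zip(g, mu))
    s_max = sum(max(-gi * mi, gi * (1.0 - mi)) for gi, mi in zip(g, mu))
    if not s_min < s < s_max:
        raise ValueError("s must lie strictly inside the support of S")
    # K'(t) is increasing in t, so bracket the root of K'(t) = s and bisect.
    lo, hi = -1.0, 1.0
    while cgf_terms(lo, g, mu)[1] > s:
        lo *= 2.0
    while cgf_terms(hi, g, mu)[1] < s:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cgf_terms(mid, g, mu)[1] < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    k0, _, k2 = cgf_terms(t, g, mu)
    w = math.copysign(math.sqrt(2.0 * (t * s - k0)), t)
    v = t * math.sqrt(k2)
    return normal_upper_tail(w + math.log(v / w) / w)


# Toy data (assumed): 1,000 samples, 1% case probability, unit genotypes,
# so S + 10 is Binomial(1000, 0.01) and an exact tail is easy to compute.
n, prevalence = 1000, 0.01
g = [1.0] * n
mu = [prevalence] * n
s = 25 - n * prevalence  # 25 observed versus 10 expected
p_spa = spa_upper_tail(s, g, mu)
p_normal = normal_upper_tail(s / math.sqrt(n * prevalence * (1 - prevalence)))
print(f"saddlepoint: {p_spa:.3g}  normal: {p_normal:.3g}")
```

In this toy setting the normal approximation understates the tail probability of the skewed statistic, which is the kind of miscalibration that inflates type I error for unbalanced traits. Unit genotypes are used only so that an exact binomial tail is available for comparison; they are not a realistic genotype vector.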
