Bioinformatics. Feb 15, 2010; 26(4): 580–581.
Published online Dec 29, 2009. doi:  10.1093/bioinformatics/btp710
PMCID: PMC2852219

GWAF: an R package for genome-wide association analyses with family data

Abstract

Summary: GWAF (Genome-Wide Association analyses with Family data) is an R package for genome-wide association analyses of family data. It implements association tests between a batch of genotyped or imputed single nucleotide polymorphisms (SNPs) and a binary or continuous trait under a user-specified genetic model, and summarizes the results of the analyses. GWAF also provides functions to visualize the results. We evaluated GWAF using a simulated continuous trait and a binary trait dichotomized from that continuous trait, together with real genotype data from the Framingham Heart Study's SNP Health Association Resource project.

Availability: http://cran.r-project.org/web/packages/GWAF/

Contact: qyang@bu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The completion of the Human Genome Project and the advances of the International HapMap Consortium have made genome-wide association (GWA) scans possible. Many software packages are designed only for GWA studies of unrelated individuals. In family data, unexplained familial correlation among individuals in a pedigree may cause false positives if it is ignored. To address this, the GWA analyses with family data (GWAF) package uses functions in existing R packages to properly model the residual correlations within families when testing genotype–phenotype association. GWAF is a wrapper that lets users analyze a batch of single nucleotide polymorphisms (SNPs) under a user-specified genetic model (additive, recessive, dominant or general) with covariates, and that automatically summarizes the results in an informative and convenient format. The genotypes can be observed or imputed.

GWAF was developed from functions that have been empirically tested through simulations and used in many GWA publications from the SNP Health Association Resource (SHARe) project of the Framingham Heart Study (FHS), http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000007.v6.p3. With SHARe data available to the general scientific community, this package should be a useful resource for investigators analyzing SHARe data or other GWA studies consisting of related individuals.

2 IMPLEMENTATION

2.1 Input files

GWAF requires three input files: (i) a pedigree file; (ii) a genotype file; and (iii) a phenotype file. The pedigree file must contain the family ID, individual ID, father ID, mother ID and sex. Each genotype file must contain the individual ID and the genotype data for an arbitrary number of SNPs. The phenotype file must contain the individual ID, the phenotype of interest and covariate data. Please see the package documentation (http://cran.r-project.org/web/packages/GWAF/) or the Supplementary Material for more details.
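The relationship between the three inputs can be sketched as a join on individual ID. The example below is written in Python rather than R and does not call GWAF. The comma-separated layout, the column names (famid, id, fa, mo, sex, trait, age) and the SNP names (rs0001, rs0002) are assumptions made for illustration, not the package's documented format.

```python
import csv
import io

# Hypothetical file contents. Column names and the CSV layout are assumptions
# for illustration only; consult the GWAF documentation for the real format.
PEDIGREE = """famid,id,fa,mo,sex
1,101,0,0,1
1,102,0,0,2
1,103,101,102,1
"""

GENOTYPE = """id,rs0001,rs0002
101,0,1
102,1,2
103,1,
"""

PHENOTYPE = """id,trait,age
101,1.2,54
102,0.7,51
103,1.9,23
"""


def read_by_id(text):
    """Parse CSV text into a dict of rows keyed by individual ID."""
    return {row["id"]: row for row in csv.DictReader(io.StringIO(text))}


def merge_inputs(ped_text, geno_text, pheno_text):
    """Join pedigree, phenotype and genotype rows for individuals present in all three."""
    ped, geno, pheno = (read_by_id(t) for t in (ped_text, geno_text, pheno_text))
    shared_ids = sorted(set(ped) & set(geno) & set(pheno))
    return [{**ped[i], **pheno[i], **geno[i]} for i in shared_ids]


records = merge_inputs(PEDIGREE, GENOTYPE, PHENOTYPE)
for record in records:
    print(record)
```

An empty genotype field (individual 103 at rs0002 here) stands for a missing call; how missing values are actually encoded is also an assumption.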

2.2 Continuous traits

To account for familial correlation within pedigrees, GWAF uses a linear mixed effects (LME) model, implemented in the lmekin() function of the kinship package (http://cran.r-project.org/web/packages/kinship/), to test the association between a continuous trait and each SNP in the genotype file under a user-specified genetic model. The only difference between the lmekin() functions in kinship and in GWAF is that, under the general model, GWAF uses a Wald chi-square test that provides a global test of genotype effects, whereas kinship provides only a t-test for each comparison of two genotypes. GWAF models the within-pedigree correlation matrix using the kinship coefficient matrix (Abecasis et al., 2001), which can be generated with the kinship package. For genotyped SNPs, the genetic model can be additive, dominant, recessive or general. The general model uses a two degrees of freedom Wald chi-square test; the other models use a one degree of freedom Wald chi-square test. For imputed SNPs, the genetic model option is not available, and GWAF analyzes the imputed genotype data without recoding. Finally, GWAF provides an estimate of the proportion of phenotypic variance explained by the tested SNP, which is computed by

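$$\frac{\left(\sigma^2_{G.\mathrm{null}} + \sigma^2_{e.\mathrm{null}}\right) - \left(\sigma^2_{G.\mathrm{full}} + \sigma^2_{e.\mathrm{full}}\right)}{\mathrm{Var}(y)}$$

where $\mathrm{Var}(y)$ is the total phenotypic variance, $\sigma^2_{G.\mathrm{null}}$ and $\sigma^2_{e.\mathrm{null}}$ are the polygenic variance and error variance when modeling without the tested SNP, and $\sigma^2_{G.\mathrm{full}}$ and $\sigma^2_{e.\mathrm{full}}$ are the polygenic variance and error variance when modeling with the tested SNP.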
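The genetic models and the degrees of freedom of the Wald test described in this section can be illustrated with a short sketch. It is written in Python, is not GWAF code, and rests on two assumptions: that genotypes are stored as the count (0, 1 or 2) of a coded allele, and that the dominant and recessive models are defined relative to that allele. The function names are hypothetical.

```python
import math


def code_genotype(allele_count, model):
    """Return the regression covariate(s) for a genotype given as 0, 1 or 2 copies
    of the coded allele (an assumed encoding)."""
    if allele_count not in (0, 1, 2):
        raise ValueError("allele_count must be 0, 1 or 2")
    if model == "additive":
        return (allele_count,)
    if model == "dominant":
        return (1 if allele_count >= 1 else 0,)
    if model == "recessive":
        return (1 if allele_count == 2 else 0,)
    if model == "general":
        # Two indicator covariates, hence the two degrees of freedom test.
        return (1 if allele_count == 1 else 0, 1 if allele_count == 2 else 0)
    raise ValueError("unknown genetic model: " + model)


def wald_p_value(chi_square, df):
    """Upper-tail P-value of a chi-square statistic with 1 or 2 degrees of freedom."""
    if chi_square < 0:
        raise ValueError("chi_square must be non-negative")
    if df == 1:
        # P(X > x) for X ~ chi-square(1) equals P(|Z| > sqrt(x)) for standard normal Z.
        return math.erfc(math.sqrt(chi_square / 2.0))
    if df == 2:
        # Chi-square(2) is an exponential distribution with mean 2.
        return math.exp(-chi_square / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")


for model in ("additive", "dominant", "recessive", "general"):
    print(model, [code_genotype(g, model) for g in (0, 1, 2)])
print(wald_p_value(3.841459, 1), wald_p_value(5.991465, 2))
```

The closed forms avoid any dependency on a statistics library; a general implementation would use a chi-square survival function instead.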

2.3 Dichotomous traits

For a dichotomous trait, GWAF uses logistic regression via generalized estimating equations (GEE; Liang and Zeger, 1986), implemented in the gee() function of the gee package (http://cran.r-project.org/web/packages/gee/), to test the association between the phenotype of interest and each SNP in a genotype file under a user-specified genetic model. In the GEE analysis, GWAF uses an independence working correlation matrix, with each family treated as a cluster in the robust variance estimate for the genotype effects. As for continuous traits, GWAF uses a Wald chi-square test for the main effect: the general model uses a two degrees of freedom test, and the other models use a one degree of freedom test. In addition, Fisher's exact test is carried out to test for differential missingness between affected and unaffected samples, which can indicate a discrepancy in genotyping quality between cases and controls.
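The differential missingness check can be sketched as a two-sided Fisher's exact test on a 2×2 table of affection status by genotype missingness. The Python code below is an illustration using only the standard library, not GWAF's implementation; the table layout, the function name and the example counts are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout: rows are affected / unaffected, columns are
    genotype missing / genotype called.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Hypothetical counts: 12 of 400 cases and 10 of 3600 controls lack a genotype call.
p_value = fisher_exact_two_sided(12, 388, 10, 3590)
print(p_value)
```

A small P-value for a top SNP would suggest that the genotype calls differ in quality between cases and controls and that the association deserves closer inspection.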

The gee() function in the gee package can fail to converge or hang (loop indefinitely), mainly for SNPs with zero or low counts in the disease-genotype contingency table, which occurs frequently for SNPs with low minor allele frequency (MAF). In such cases, GWAF may employ logistic regression instead of GEE and provide a remark. Users should check the remarks for the top SNPs, or filter out SNPs with low MAF.
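A pre-filter of the kind suggested above can be sketched as follows. The Python code is illustrative only and is not part of GWAF. It assumes genotypes are coded as 0, 1 or 2 copies of one allele with None for a missing call, and it borrows the MAF cut-off of 0.01 used in Section 3 as a default; the function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF from allele counts (0, 1, 2); None marks a missing call (assumed encoding)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    frequency = sum(called) / (2 * len(called))
    return min(frequency, 1.0 - frequency)


def min_cell_count(genotypes, affected):
    """Smallest cell of the 2x3 disease-by-genotype contingency table."""
    counts = {(status, g): 0 for status in (0, 1) for g in (0, 1, 2)}
    for g, status in zip(genotypes, affected):
        if g is not None:
            counts[(status, g)] += 1
    return min(counts.values())


def keep_snp(genotypes, maf_threshold=0.01):
    """Keep a SNP only if its MAF reaches the threshold."""
    return minor_allele_frequency(genotypes) >= maf_threshold


rare = [0] * 99 + [1]          # MAF = 1 / 200 = 0.005
common = [0, 1, 2, 1, 0, None]  # MAF = 4 / 10 = 0.4
print(keep_snp(rare), keep_snp(common))
print(min_cell_count([0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1]))
```

Inspecting the smallest cell count in addition to the MAF can flag SNPs that pass the frequency filter but still have an empty disease-genotype cell.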

2.4 Output file

Please see the manual for our functions at http://cran.r-project.org/web/packages/GWAF/ for details. Sample outputs are provided in Supplementary Tables S1–S6.

2.5 Genome-wide P-values plot and quantile–quantile P-values plot

To visualize and examine the results once the GWA analyses are complete, GWAF provides two functions that produce plots in bitmap format: GWplot() for a genome-wide plot of −log10(P-values) versus genomic position, and qq() for a quantile–quantile (QQ) plot of observed versus expected −log10(P-values). GWplot() requires the P-value, the chromosome number and the physical position of each SNP. In both plots, the P-values are transformed to −log10(P). In GWplot(), users can specify two P-value cut-offs: one indicates genome-wide significance, with a default of 5E−8, and the other indicates suggestive genome-wide significance, with a default of 4E−7. SNPs with genome-wide significance are plotted in red, and SNPs between the two cut-offs are plotted in blue. The qq() function plots the P-values against a uniform (0,1) distribution. The QQ plot also shows the genomic control parameter λ (Devlin and Roeder, 1999), which indicates systematic inflation in the GWA results and is computed from the one degree of freedom chi-square statistics corresponding to the P-values.
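The quantities behind the two plots can be sketched without any plotting code. The Python example below is an illustration, not the GWplot() or qq() implementation. It assumes that a P-value strictly below a cut-off counts as passing it, and it computes λ as the median of the one degree of freedom chi-square statistics implied by the P-values divided by the median of the chi-square distribution with one degree of freedom. The function names are hypothetical.

```python
import math
from statistics import NormalDist, median

STANDARD_NORMAL = NormalDist()
# Median of the chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = STANDARD_NORMAL.inv_cdf(0.75) ** 2


def neg_log10(p_value):
    """Transform a P-value to the -log10 scale used in both plots."""
    return -math.log10(p_value)


def significance_colour(p_value, genome_wide=5e-8, suggestive=4e-7):
    """Classify a SNP by the two cut-offs; strict inequalities are an assumption."""
    if p_value < genome_wide:
        return "red"
    if p_value < suggestive:
        return "blue"
    return "other"


def genomic_control_lambda(p_values):
    """Genomic control factor from P-values in the interval (0, 1]."""
    # A two-sided P-value p corresponds to the 1-df chi-square statistic z**2,
    # where z is the standard normal quantile at p / 2.
    chi_squares = [STANDARD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi_squares) / CHI2_1DF_MEDIAN


# P-values spread evenly over (0, 1) give a lambda of 1, i.e. no inflation.
uniform_p = [i / 1000 for i in range(1, 1000)]
print(genomic_control_lambda(uniform_p))
print(significance_colour(1.64e-22), significance_colour(1e-7), neg_log10(5e-8))
```

Because λ depends only on the median, a few very small P-values in the tail leave it almost unchanged.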

3 EXAMPLE

We applied GWAF to a simulated continuous trait and a binary trait, using real 550K genotype data from the Framingham Heart Study's SHARe project. Phenotypes were simulated for 8481 individuals, genotyped with a call rate >97%, from 1494 real pedigrees. The continuous traits were randomly generated from a multivariate normal distribution with a quantitative trait locus (QTL), a polygenic and a residual variance component, using the program SOLAR (Almasy and Blangero, 1998). SNP rs1570092 on chromosome 1, which has good genotype quality, was selected as the single QTL explaining 1% of the total phenotypic variance (=1), and the polygenic heritability was set to 0.3. An additive genetic model was used in both the simulation and the association analyses.

To create binary traits, we dichotomized the simulated continuous traits, assuming a population prevalence of 10% and an additive genetic model with a genotype relative risk of 1.3. The additive genetic model was also used in the association analyses.

The results of LME and GEE are presented in the genome-wide P-values plots and QQ plots (Supplementary Fig. S1). Both genome-wide P-values plots show that the genome-wide significant SNPs were all close to the QTL, rs1570092. The genomic control factors (λ) are 1 and 1.02 for LME and GEE, respectively, showing no global inflation of false positives. Note that λ is computed as the empirical median of the test statistics divided by the median expected under the $\chi^2_1$ distribution. Thus, λ does not reflect how far the tail deviates from the 45° line: in our example, a notable deviation in the tail is observed even though λ is close to 1. For the GEE results, 21 SNPs with MAF <0.01 were excluded. Both LME and GEE identified rs1570092 as the most significant SNP [LME P-value=1.64E−22 (explained 1.28% of the phenotypic variance); GEE P-value=6.52E−10, odds ratio=1.427]. For a batch of 1000 genotyped SNPs, GEE takes ~4.5 min and LME takes ~1.8 h to complete the analyses on a Linux cluster with 2× Dual-Core AMD Opteron(tm) Processor 2218 HE and 12 GB of RAM in total, running the Rocks 4.3 Linux Cluster Distribution from the San Diego Supercomputer Center. LME, which uses 405 MB of RAM, takes longer because of its model complexity and because estimating the proportion of phenotypic variance explained by the tested SNP requires two analyses (under the null and full models) for each SNP. Analyzing imputed genotypes takes less time because no recoding is needed.
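The per-batch timings above allow a rough estimate of the cost of a full scan. The Python sketch below is back-of-the-envelope arithmetic and not a benchmark. It assumes that "550K" means 550,000 SNPs, that run time scales linearly with the number of 1000-SNP batches, and that batches can be spread evenly over independent jobs; the function name and the job count of 50 are hypothetical.

```python
import math


def scan_hours(n_snps, minutes_per_batch, batch_size=1000, n_jobs=1):
    """Estimated wall-clock hours, assuming linear scaling and evenly split batches."""
    batches = math.ceil(n_snps / batch_size)
    batches_per_job = math.ceil(batches / n_jobs)
    return batches_per_job * minutes_per_batch / 60.0


N_SNPS = 550_000  # assumed reading of "550K"

gee_serial = scan_hours(N_SNPS, 4.5)        # ~4.5 min per 1000 SNPs
lme_serial = scan_hours(N_SNPS, 1.8 * 60)   # ~1.8 h per 1000 SNPs
lme_parallel = scan_hours(N_SNPS, 1.8 * 60, n_jobs=50)

print(f"GEE, one job:  {gee_serial:.2f} h")
print(f"LME, one job:  {lme_serial:.0f} h")
print(f"LME, 50 jobs:  {lme_parallel:.1f} h")
```

Under these assumptions, a serial LME scan would take on the order of weeks, which is why splitting the genotype data into batches that run as separate cluster jobs is the practical way to use the package.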

4 FUTURE WORK

We are expanding GWAF to perform gene–environment interaction analyses and to handle rare variants better.

ACKNOWLEDGEMENTS

The authors thank Dr Josée Dupuis, Dr Kathryn L. Lunetta, Dr L. Adrienne Cupples, Dr Martin G. Larson, Dr Anita L. DeStefano and Dr Jemma B. Wilk for their helpful comments on the package. The authors also thank Dr Jinghua Zhao for his help with the kinship package, and Alisa N. Manning, Denver J. Lybarger and Andi Broka for their assistance. This research was conducted in part using data and resources from the Framingham Heart Study of the National Heart Lung and Blood Institute of the National Institutes of Health and Boston University School of Medicine.

Funding: National Heart, Lung and Blood Institute's Framingham Heart Study (contract no. N01-HC-25195) and its contract with Affymetrix, Inc for genotyping services (contract no. N02-HL-6-4278). A portion of this research utilized the Linux Cluster for Genetic Analysis (LinGA-II) funded by the Robert Dawson Evans Endowment of the Department of Medicine at Boston University School of Medicine and Boston Medical Center.

Conflict of Interest: none declared.

REFERENCES

  • Abecasis GR, et al. Association analysis in a variance components framework. Genet. Epidemiol. 2001;21(Suppl. 1):S341–S346.
  • Almasy L, Blangero J. Multipoint quantitative-trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 1998;62:1198–1211.
  • Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004.
  • Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
