NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
Genet Epidemiol. Author manuscript; available in PMC Jul 26, 2011.
Published in final edited form as:
PMCID: PMC3143715
NIHMSID: NIHMS306122

A Comparison of Approaches to Account for Uncertainty in Analysis of Imputed Genotypes

Abstract

The availability of extensively genotyped reference samples, such as “The HapMap” and 1,000 Genomes Project reference panels, together with advances in statistical methodology, has allowed for the imputation of genotypes at single nucleotide polymorphism (SNP) markers that are untyped in a cohort or case-control study. These imputation procedures facilitate the interpretation and meta-analyses of genome-wide association studies. A natural question when implementing these procedures concerns how best to take into account uncertainty in imputed genotypes. Here we compare the performance of the following three strategies: least-squares regression on the “best-guess” imputed genotype; regression on the expected genotype score or “dosage”; and mixture regression models that more fully incorporate posterior probabilities of genotypes at untyped SNPs. Using simulation, we considered a range of sample sizes, minor allele frequencies, and imputation accuracies to compare the performance of the different methods under various genetic models. The mixture models performed the best in the setting of a large genetic effect and low imputation accuracies. However, for most realistic settings, we find that regressing the phenotype on the estimated allelic or genotypic dosage provides an attractive compromise between accuracy and computational tractability.

Keywords: GWAS, genotype imputation, mixture models

INTRODUCTION

The shared ancestry of chromosomes in a population results in haplotype stretches shared by different individuals. Making use of these shared haplotype stretches, and thereby accounting for the correlation of alleles at nearby markers (linkage disequilibrium, LD), statistical algorithms can make inferences about unobserved alleles. To estimate a missing allele at a specific single nucleotide polymorphism (SNP) on a haplotype, these algorithms compare flanking markers with those from other haplotypes in the sample to find appropriate “template” or reference haplotypes to inform an estimate of the missing allele.

Recently there has been considerable interest in the imputation of missing genotype data for the analysis of genome-wide association (GWA) studies. The availability of panels of extensively genotyped reference samples, such as those from The International HapMap Project [HapMap; International HapMap Consortium, 2007] and now the 1,000 Genomes Project, has allowed for the indirect measurement of SNP genotypes that were not directly typed in a genetic association study but typed in the reference samples. This strategy has aided the discovery of multiple loci associated with disease [e.g. Barrett et al., 2008; Scott et al., 2007; The Wellcome Trust Case Control Consortium, 2007] or quantitative traits [Lettre et al., 2008; Loos et al., 2008; Willer et al., 2008]. For example, in Willer et al. [2008], the LDLR (cholesterol receptor) signal was detected only after imputation was performed, since the associated variant (rs6511720) was poorly tagged in samples genotyped with the Affymetrix 500K array set (maximum R² ≈ 0.21).

This imputation-based mapping protocol is a 2-step process. First, unmeasured genotypes are imputed in the GWA data. Then, imputed genotypes are tested for association with phenotypes. Multiple methods exist for imputing genotypes from population genetic data [Browning and Browning, 2007; Greenspan and Geiger, 2004; Li et al., 2006; Marchini et al., 2007; Scheet and Stephens, 2006; Stephens and Scheet, 2005]; for a review see Browning [2008]. Here we focus on the second step, testing the imputed genotypes for association with a trait of interest.

Specifically, we aim to evaluate the relative performance of several strategies for analyzing the distribution of imputed genotypes in downstream analyses. One summary of these probabilities comes from imputing a “best-guess” genotype for each individual, which corresponds to the marginal mode of the posterior distribution of the unmeasured genotype. This approach ignores the uncertainty in the imputed genotype. When imputation is accurate, the correspondence between the true and imputed genotype is strong and an analysis of the imputed genotypes might result in little loss in power compared with the true genotypes. However, if imputation accuracy is low there may be a weak correlation between the true genotypes and the guesses, which will mask any real association between genotype and phenotype.

We also consider two approaches that attempt to account for this uncertainty. The first of these uses the mean of the distribution of imputed genotypes, which corresponds to an expected allelic or genotypic count, or “dosage”, for each individual. This approach may do well, relative to using the “best guess” genotype, when there is some uncertainty about the true genotype, since it retains more of the available information, differentiating genotypes that were imputed with high confidence from those with greater uncertainty.

A final approach uses mixture regression models to take full advantage of the individual genotype posterior probabilities. This approach should be superior when there is uncertainty in the imputed genotypes, and information about the relationship between genotype and phenotype is not well summarized by a single average genotype. For example, this may occur when the posterior probabilities are high for the two homozygote genotypes, yet an average or allelic dosage would indicate that the unmeasured genotype was a heterozygote.

We find that for most realistic settings of GWA studies, such as modest genetic effects, large sample sizes and high average imputation accuracies, the strategy of regressing the phenotype on the genetic dosages provides adequate performance. In fact, for these settings, small gains from using the full mixture models are rendered negligible by the increased model complexity and associated “cost” of estimating additional parameters. However, when the effect size is large, and imputation accuracy sufficiently poor, we demonstrate an increase in power when utilizing all available information in the posterior distribution in the form of mixture models.

METHODS

OVERVIEW

To simulate data from realistic cohort-based association studies, we first generated dense genotype data from a coalescent model. Then, conditional on these genotypes, we simulated quantitative trait data for all individuals in each cohort. In order to mimic the marker density from a GWA study, we masked a fraction of the SNPs and then imputed these genotypes, conditional on a set of simulated reference haplotypes and the remaining observed SNPs. Finally, we performed analyses to test for association between imputed genotype and phenotype.

SIMULATIONS

Genotype data

For each of 100 one-megabase (1-Mb) regions, we simulated 10,000 chromosomes from a coalescent model that mimics LD in real data, accounts for variation in local recombination rates, and models population history consistent with the HapMap CEU and YRI analysis panels [Schaffner et al., 2005].

For each 1-Mb region, we then took a random subset of 120 simulated chromosomes to generate a region-specific “pseudo HapMap”. We randomly paired (assuming Hardy-Weinberg equilibrium) a random subset of 2,000 chromosomes of the remaining 9,880 chromosomes to create 1,000 diploid pseudo individuals.
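The sampling scheme just described can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: the function name `split_region` and its parameters are hypothetical, chromosomes are represented by integer indices, and the coalescent simulation itself is not reproduced.

```python
import random


def split_region(n_chromosomes=10_000, n_reference=120, n_individuals=1_000, seed=0):
    """Split chromosome indices into a reference panel and paired individuals.

    Hypothetical helper: chromosomes are integer indices into the
    simulated sample for one region.
    """
    rng = random.Random(seed)
    indices = list(range(n_chromosomes))
    rng.shuffle(indices)

    # The first n_reference shuffled chromosomes form the "pseudo HapMap".
    reference = indices[:n_reference]

    # Take 2 * n_individuals of the remaining chromosomes. Because the list
    # is shuffled, pairing adjacent entries is a random pairing, which
    # corresponds to random mating (Hardy-Weinberg equilibrium).
    cohort = indices[n_reference:n_reference + 2 * n_individuals]
    individuals = [(cohort[2 * i], cohort[2 * i + 1]) for i in range(n_individuals)]
    return reference, individuals


reference, individuals = split_region()
print(len(reference), len(individuals))
```

With the default arguments, 7,880 of the 9,880 non-reference chromosomes remain unused, matching the description above.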

For the simulated HapMap data, polymorphic sites were ascertained and thinned to match the corresponding (CEU) Phase II HapMap [International HapMap Consortium, 2007] marker density, allele frequency spectrum and LD patterns, resulting in ≈1,000 SNPs for each region for the panel of 120 HapMap chromosomes. Based on the thinned HapMap panel, we selected a set of 100 tagSNPs for each region that included the 90 tagSNPs with the largest number of proxies and 10 additional SNPs picked at random among those remaining [Carlson et al., 2004]. The tagSNP selection approach taken above resulted in tagSNP sets that captured ≈78% of the common variants (MAF >5%) in the simulated CEU HapMap, similar to the observed performance of the Illumina HumanHap300 Beadchip SNP genotyping platform. The genotypes at these 100 tagSNPs constituted the observed data for each simulated sample.

QUANTITATIVE TRAIT

We generated phenotype values on each of the n individuals for a large and small sample (n = 1,000, 50), conditional on their simulated genotypes. We simulated trait values separately for four genetic models, with varying degrees of dominance, and also for a null model where genotypes and phenotypes were independent.

At each SNP, the genotype label (0, 1, 2) is represented by the count of an arbitrarily chosen allele. Table I contains a summary of notation for the frequencies and genetic effect sizes (“phenotypic deviations”) of each genotype. Since allele frequency affects the power to detect phenotype association, we adjust the phenotypic deviations separately for each SNP, so that we may tabulate results over all SNPs. To accomplish this, we maintain constant genetic variance attributable to the marker VG, which we calculate with the following formula from Equation [8.8] (p. 129) of Falconer [1989]:

$$V_G = 2pq\,[a + d(q - p)]^2 + [2pqd]^2, \tag{1}$$

where p and q = 1 − p are allele frequencies, and a and d are additive and dominance effects (see Table I). We report genetic variance as a percent of total phenotypic variation (heritability), fixing this at 2.8% for n = 1,000, and 59.8% for n = 50. These values were calculated so as to achieve approximately 90% power at type-I error of 5 × 10⁻⁵ when analyzing the simulated genotypes under an additive genetic model with equal allele frequencies of one-half.
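As a rough illustration of how Equation (1) can be used to rescale effects per SNP, the Python sketch below computes the genetic variance and solves for the additive effect a that yields a target heritability. It rests on assumptions not spelled out in the text: heritability is taken as V_G / (V_G + residual variance) with a residual variance of 1, the dominance effect is taken proportional to a (d = dominance_ratio × a), and the genotypic values are −a, d, and +a as in Falconer's convention. All function and parameter names are hypothetical.

```python
import math


def genetic_variance(p, a, d):
    """Genetic variance at one marker, Equation (1)."""
    q = 1.0 - p
    return 2.0 * p * q * (a + d * (q - p)) ** 2 + (2.0 * p * q * d) ** 2


def additive_effect_for_heritability(p, h2, dominance_ratio=0.0, residual_variance=1.0):
    """Solve Equation (1) for a, assuming d = dominance_ratio * a.

    Assumes h2 = V_G / (V_G + residual_variance).
    """
    target_vg = residual_variance * h2 / (1.0 - h2)
    # With d proportional to a, V_G scales with a**2, so it is enough to
    # evaluate Equation (1) once at a = 1 and rescale.
    unit_vg = genetic_variance(p, 1.0, dominance_ratio)
    return math.sqrt(target_vg / unit_vg)


# Additive model, allele frequency one-half, heritability 2.8%.
a = additive_effect_for_heritability(p=0.5, h2=0.028)
vg = genetic_variance(0.5, a, 0.0)
print(a, vg / (vg + 1.0))
```

Because the required a grows as 2pq shrinks, rarer alleles receive larger per-allele effects under this scheme.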

TABLE I
Genotype and phenotype values

We performed the above trait simulations for 83,327 SNPs in turn for the following genetic models: additive (d = 0); partially dominant (d = (1/2)a); dominant (d = a); and over-dominant (d = (6/5)a), corresponding to a heterozygote value that exceeds the value of the higher homozygote by 10% of the difference between the two homozygotes.

To simulate trait data yi for individual i (1,…,n) at a single SNP, we used the following model:

$$y_i = \mu + (-a)\,I\{g_i = 0\} + (d)\,I\{g_i = 1\} + (a)\,I\{g_i = 2\} + \varepsilon_i, \tag{2}$$

where gi is the true genotype for individual i, a and d are chosen according to (1), the indicator variable I{A} is one if A is true and zero otherwise, and εi ~N(0, 1).
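The trait model in Equation (2) can be sketched as follows. This is a minimal illustration, not the simulation code used in the study: it assumes genotypic deviations of −a, d, and +a for genotype labels 0, 1, and 2, draws genotypes under Hardy-Weinberg equilibrium for demonstration only (in the study they come from the coalescent simulation), and uses hypothetical function names.

```python
import random


def simulate_genotype(p, rng):
    """Count of the allele with frequency p, assuming Hardy-Weinberg equilibrium."""
    return int(rng.random() < p) + int(rng.random() < p)


def simulate_trait(genotypes, a, d, mu=0.0, rng=None):
    """Trait values from Equation (2) with standard normal residuals."""
    rng = rng or random.Random()
    # Phenotypic deviation attached to each genotype label (assumed convention).
    deviation = {0: -a, 1: d, 2: a}
    return [mu + deviation[g] + rng.gauss(0.0, 1.0) for g in genotypes]


rng = random.Random(1)
genotypes = [simulate_genotype(0.5, rng) for _ in range(1000)]
trait = simulate_trait(genotypes, a=0.24, d=0.0, rng=rng)
print(genotypes[:10], [round(y, 2) for y in trait[:5]])
```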

REAL GENOTYPE DATA

We also obtained data from a GWA study of Type II diabetes in individuals of European descent [FUSION; Scott et al., 2007]. In 538 control samples, additional genotyping was conducted in a region of chromosome 14. This resulted in 521 markers for which we had both imputed genotypes from the HapMap, as well as genotypes typed on a custom microarray from Illumina. Imputation in this data set was carried out with MaCH 1.0 [Li et al., 2010], using the data from an Illumina 317K microarray platform as “tag SNPs”. Conditional on the typed genotypes at these 521 markers, we simulated quantitative trait data as above, keeping the genetic variance fixed to yield informative and interpretable summaries of power across multiple markers with varying allele frequencies. In addition to simulating phenotypes on the full set (538 individuals), we also simulated a larger effect on a subset of 50 individuals, selected at random. We repeated these simulations 100 times to obtain simulated phenotypes for 52,100 SNPs (100 × 521), for both “large” and small sample sizes.

GENOTYPE IMPUTATION

To obtain posterior probabilities and imputed genotypes (Fig. 1), we used the software package fastPHASE [Scheet and Stephens, 2006]. For each simulated region, we fit the LD model to the reference chromosomes only, and then applied this fitted model to the pseudo individuals in the simulated cohort. (For convenience we set the number of haplotype clusters K to be 20.) We assess imputation accuracy with the square of the Pearson correlation coefficient between the true and best-guess genotypes (R²), which is more informative about power at different allele frequencies than a simple genotype imputation error rate measure. For our simulations, the median R² for these data was 0.90 and the mean was 0.75.
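The accuracy measure described above is straightforward to compute per SNP. The sketch below is a plain-Python version of the squared Pearson correlation; the function name is hypothetical, and returning NaN for a monomorphic vector (where the correlation is undefined) is a choice made here rather than something specified in the text.

```python
import math


def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0.0 or syy == 0.0:
        # Undefined when either vector has no variation (e.g. every
        # best-guess genotype is the same).
        return math.nan
    return sxy * sxy / (sxx * syy)


true_genotypes = [0, 1, 2, 1, 0, 2]
best_guess_genotypes = [0, 1, 2, 1, 1, 2]
print(squared_correlation(true_genotypes, best_guess_genotypes))
```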

Fig. 1
Example of posterior probability summaries. Here we present a didactic illustration of the three summaries of the full posterior probabilities for imputed genotypes. From the set of Reference Haplotypes, the missing genotype (denoted with two ? symbols) ...

REGRESSION ANALYSIS

We used regression analysis to test the effectiveness of multiple summaries of the imputed genotypes. Let p_ki denote the conditional (“posterior”) probabilities for the imputed genotypes of individual i (1,…,n), where k (0,1,2) indexes the genotype by its label. We evaluated the performance of the following three summaries of the genotype probabilities conditional on the observed data:

  1. Best guess—maximum a posteriori (“MAP”) genotype;
  2. Dosage—estimated (expected) allelic or genotypic counts; and
  3. Posterior probabilities—probabilities of the three possible genotypes obtained from imputation.

For comparison, we also analyzed the true (simulated or typed) genotypes.
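For concreteness, the three summaries can be computed from one individual's posterior probabilities as in the Python sketch below. The function names are hypothetical, and breaking ties in favor of the lowest genotype label is an arbitrary choice made here. The second example shows the situation where the dosage points to a heterozygote although only the two homozygotes have any posterior support.

```python
def best_guess(probs):
    """MAP genotype: label with the largest posterior probability.

    Ties are broken in favor of the lowest label (an arbitrary choice).
    """
    return max(range(3), key=lambda k: probs[k])


def allelic_dosage(probs):
    """Expected allele count, p1 + 2 * p2."""
    return probs[1] + 2.0 * probs[2]


def genotypic_dosage(probs):
    """Pair (p1, p2) used by the two-parameter (non-additive) models."""
    return (probs[1], probs[2])


confident = (0.02, 0.08, 0.90)   # posterior for genotypes 0, 1, 2
ambiguous = (0.50, 0.00, 0.50)   # mass split between the two homozygotes

for probs in (confident, ambiguous):
    print(best_guess(probs), allelic_dosage(probs), genotypic_dosage(probs))
```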

First we give the models used for ordinary least squares (OLS) regression. Then, we explain the use of mixture models for regression. For each method, we consider both additive (1-parameter or 1-degree-of-freedom, “1 df”) and non-additive (2-parameter, 2 df) regression models for analysis. In what follows, let yi denote the quantitative trait value for individual i at a SNP.

ORDINARY REGRESSION ON GENOTYPES AND ALLELIC DOSAGE

Additive

Let xi represent a particular feature of the imputation procedure or the true genotype (gi) at a SNP under consideration, i.e.

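$$x_i = \begin{cases} \arg\max_{k \in \{0,1,2\}} \{p_{ki}\} & \text{best-guess genotype} \\ p_{1i} + 2p_{2i} & \text{allelic dosage} \\ g_i & \text{true genotype.} \end{cases}$$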

The additive model is written as

$$y_i = \mu + \beta x_i + \varepsilon_i, \tag{3}$$

where εi ~ N(0, σ²), independently for all i. We use OLS regression to test the null hypothesis H0: β = 0 vs. H1: β ≠ 0. To evaluate significance, we compute an F-statistic.
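A minimal sketch of the 1-df test follows, written in plain Python with closed-form simple linear regression. The predictor x may be the best-guess genotype, the allelic dosage, or the true genotype. The function name is hypothetical, and no p-value is computed because significance in this study is judged against empirical thresholds.

```python
def additive_f_statistic(x, y):
    """Slope estimate and 1-df F-statistic for y = mu + beta * x + error."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0.0:
        raise ValueError("predictor has no variation; slope is not estimable")

    beta = sxy / sxx
    explained = beta * sxy        # regression sum of squares (1 df)
    residual = syy - explained    # residual sum of squares (n - 2 df)
    f_stat = explained / (residual / (n - 2))
    return beta, f_stat


dosage = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
trait = [0.1, 1.0, 2.1, -0.1, 1.0, 1.9]
print(additive_f_statistic(dosage, trait))
```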

Non-additive

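Under a non-additive model, we expand xi to be composed of two components, (x_i^{(1)}, x_i^{(2)}), as follows:

$$(x_i^{(1)}, x_i^{(2)}) = \begin{cases} (I\{x_i = 1\}, I\{x_i = 2\}) & \text{best-guess genotype} \\ (p_{1i}, p_{2i}) & \text{genotypic dosage} \\ (I\{g_i = 1\}, I\{g_i = 2\}) & \text{true genotype.} \end{cases}$$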

We write the dominance model as

$$y_i = \mu + \beta_1 x_i^{(1)} + \beta_2 x_i^{(2)} + \varepsilon_i, \tag{4}$$

where εi ~ N(0, σ²), as above. Again we evaluate the null hypothesis that there is no effect for any genotype, i.e. H0: β1 = 0, β2 = 0 vs. H1: β1 ≠ 0 or β2 ≠ 0. We apply OLS regression and compute an F-statistic.

MIXTURE OF REGRESSION MODELS

To investigate the approach of multiple-imputation, we fit a mixture of regression models to the phenotype data and posterior genotype probabilities. The composite regression model may be written as

$$y_i = \sum_{k=0}^{2} p_{ki}\, f_k(\mu, \beta, \varepsilon_i), \tag{5}$$

where the regression function fk(·) is a function of the assumed genetic model, i.e. additive or non-additive (see below).

For each assumed model below, we construct likelihood ratio statistics to test for statistical significance. To estimate the parameters (μ, β), we maximize the log-likelihood function using the Nelder-Mead Simplex Method [Nelder and Mead, 1965], as implemented in the R function optim.

Additive

Under an assumption of additivity of the allelic effects, the regression function fk(·) is

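$$f_k(\mu, \beta, \varepsilon_i) = \begin{cases} \mu + \varepsilon_i, & k = 0 \\ \mu + \beta + \varepsilon_i, & k = 1 \\ \mu + 2\beta + \varepsilon_i, & k = 2, \end{cases} \tag{6}$$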

where εi ~ N(0, σ²). To test the hypothesis H0: β = 0 vs. H1: β ≠ 0, we construct a likelihood ratio test.

Non-additive

Relaxing the assumption of additivity (allowing for dominance) of the allelic effects, we expand β to be (β1, β2), and the regression function fk(·) is

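$$f_k(\mu, \beta_1, \beta_2, \varepsilon_i) = \begin{cases} \mu + \varepsilon_i, & k = 0 \\ \mu + \beta_1 + \varepsilon_i, & k = 1 \\ \mu + \beta_2 + \varepsilon_i, & k = 2, \end{cases} \tag{7}$$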

where εi ~ N(0, σ²). To test the hypothesis H0: β1 = 0, β2 = 0 vs. H1: β1 ≠ 0 or β2 ≠ 0, we construct a likelihood ratio test.
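The likelihood behind these tests is not written out in the text. The Python sketch below shows one plausible reading, stated here as an assumption: each trait value follows a three-component normal mixture whose weights are the individual's posterior genotype probabilities and whose component means follow Equation (6) or (7). All function names are hypothetical, and the numerical maximization over (μ, β, σ) is deliberately left out, so any general-purpose optimizer would have to be supplied separately.

```python
import math


def normal_pdf(y, mean, sigma):
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def additive_means(mu, beta):
    """Genotype-specific means under Equation (6)."""
    return (mu, mu + beta, mu + 2.0 * beta)


def nonadditive_means(mu, beta1, beta2):
    """Genotype-specific means under Equation (7)."""
    return (mu, mu + beta1, mu + beta2)


def mixture_log_likelihood(y, probs, means, sigma):
    """Log-likelihood of a normal mixture weighted by posterior probabilities.

    probs[i] holds (p0, p1, p2) for individual i; means holds the three
    genotype-specific means. Assumed form, not taken from the paper.
    """
    total = 0.0
    for yi, pi in zip(y, probs):
        density = sum(pk * normal_pdf(yi, mk, sigma) for pk, mk in zip(pi, means))
        total += math.log(density)
    return total


def null_log_likelihood(y):
    """Maximized log-likelihood when all genotype means are equal.

    With equal means the mixture collapses to a single normal, so the
    maximum is available in closed form.
    """
    n = len(y)
    mean_y = sum(y) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var_y) + 1.0)


def likelihood_ratio_statistic(ll_alternative, ll_null):
    return 2.0 * (ll_alternative - ll_null)
```

Given a maximized alternative log-likelihood `ll_alt`, the test statistic would be `likelihood_ratio_statistic(ll_alt, null_log_likelihood(y))`. When every posterior is concentrated on a single genotype, this likelihood reduces to that of ordinary regression on the known genotypes.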

RESULTS

Here we present results on simulated phenotypes from both simulated and real genotype data, as well as imputed genotypes from standard software packages. For various settings (sample sizes, effect sizes, real and simulated genotypes, genetic models), we tabulate power results overall, as well as plot them by imputation accuracy and allele frequency (calculated from the full cohort of 1,000 individuals).

LARGE SAMPLE SIZE WITH SMALL EFFECTS

We computed power empirically, based on the analysis of ≈1 million null data sets (where there was no association between phenotype and genotype) from which we obtained empirical significance thresholds. Results from analyses of our various imputation strategies and regression models, for the large sample of 1,000 individuals in the simulated studies, are reported in Table II.
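The empirical calibration can be sketched as follows: a significance threshold is read off the upper tail of the null test statistics, and power is the fraction of statistics from data simulated under association that exceed it. This is a minimal Python illustration with hypothetical function names and toy numbers; the exact procedure used to obtain the thresholds in the study may differ in detail.

```python
def empirical_threshold(null_statistics, alpha):
    """Threshold exceeded by at most a fraction alpha of the null statistics."""
    ordered = sorted(null_statistics, reverse=True)
    # At most k null statistics can be strictly larger than ordered[k].
    k = min(int(alpha * len(ordered)), len(ordered) - 1)
    return ordered[k]


def empirical_power(alternative_statistics, threshold):
    """Fraction of statistics under the alternative that exceed the threshold."""
    hits = sum(1 for s in alternative_statistics if s > threshold)
    return hits / len(alternative_statistics)


# Toy example; the study used about one million null data sets.
null_stats = list(range(100))
threshold = empirical_threshold(null_stats, alpha=0.25)
print(threshold, empirical_power([70, 80, 90, 10], threshold))
```

Using the same empirical threshold for every analysis strategy keeps the type-I error rate comparable across methods even when the nominal reference distributions are only approximate.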

TABLE II
Power results for large sample size and small effects

In general, there was a consistent gain in performance achieved from using the dosage summaries or mixture models in comparison to using the best guess genotypes. This improvement was larger for the two-parameter regression models, regardless of the underlying genetic model, with absolute gains in power of ≈14%. For additive or one-parameter models, the average gain was more modest (3–4%). All differences between the dosage and mixture model strategies were small (<2%).

We also examined the effect of imputation accuracy and allele frequencies on the power to detect association in Figure 2. We summarized accuracy at each SNP with the square of the Pearson correlation coefficient between the imputed and true genotypes (coded as 0, 1, or 2), which we refer to as R2.

Fig. 2
Power vs. accuracy and allele frequency for large sample size and small effects. For each summary and the true genotypes, both an additive (solid line) and dominant (dotted line) model were analyzed. (A) and (C) are based on data simulated with an additive ...

When the accuracy is high (R² > 0.9), using the best-guess genotype from the imputation procedure results in little loss of power. The gain from using a dosage or mixture model is greatest at intermediate accuracies, since posterior probabilities are informative about the underlying genetic variation, even if they do not allow accurate “best-guess imputation” of genotypes. For all three strategies, at low imputation accuracies, the lines of the additive regression models converge, as do the lines of the dominant regression models.

An important factor in overall power summaries, such as those in Tables II and III (below), is the allele frequency distribution of SNPs present in the reference panel, at which genotypes are being imputed in the study samples, since the tables are constructed with averages over all SNPs. In Figures 2C and 3C, where phenotypes were simulated from an additive genetic model, powers for all regression models decrease substantially when minor allele frequencies are relatively low. This may reflect the relative difficulty of accurate imputation at SNPs with a lower MAF. (Under the correct additive model, power for the true genotypes is unaffected, since we attempted to make power independent of allele frequency for the purposes of aggregating results across SNPs for general comparisons among analysis strategies; see Methods.) For data simulated under a dominant genetic model, methods that assume the correct dominant model for analysis are superior at a greater range of allele frequencies.

Fig. 3
Power vs. accuracy and allele frequency for small sample size and large effects. Power was computed at a fixed type-I error rate (α) of 5 × 10⁻⁵. The sample size was 50. For each summary and the true genotypes, both an additive ...
TABLE III
Power results for large effects and small sample size

SMALL SAMPLE SIZE WITH LARGE EFFECTS

For SNPs with modest genetic effects, as above, there is little gain from the increased computational demands of applying mixture models for the analyses. To examine a scenario where the mixture models might offer an advantage, we repeated the above simulations with larger genetic effects (and thus smaller sample sizes so that power was below 100%). This situation might be found in expression quantitative trait loci (eQTL) mapping studies, for example. These results are in Table III.

Here, the advantage of applying mixture models is apparent, with average power gains of 10–12%. The contrast is greater at lower imputation accuracies (top row of Fig. 3) and is maintained even when we applied the incorrect additive regression model to data simulated with a strong dominant effect (Fig. 3B) (i.e. the green solid line is well above the blue dotted line at modest accuracies).

It is worth noting that although we attempted to simulate phenotypes so that results may be tabulated across allele frequencies, by keeping heritability constant per Equation (1), this does not guarantee that power will be independent of allele frequency. Heritability represents, in some sense, the amount of information about the phenotype and genotype relationship. How that relates to power is not completely predictable and will depend on additional factors, such as analysis methods, genetic model (e.g. dominance here), and particularly imputation accuracy. However, even for the true genotypes, it is difficult to calibrate power at low allele frequencies in samples of finite size. For example, in Figures 2D and 3D, power is reduced at low frequencies of the dominant allele. This is due to the requirement of having a homozygote (a rare genotype at low allele frequency and thus less likely to be observed near its expected proportion) for a shift in phenotypic mean; e.g., this phenomenon is more pronounced in Figure 3D where the sample size is 50. In fact, it is for this reason that we included results for true genotypes (where the “correction” for allele frequency, etc., was not perfect).

COMPUTATION

The mixture-model-based procedures were considerably more computationally demanding, based on our implementations in the R software package. Per-marker run times for the mixture-models averaged approximately 4 sec for 1-df (about 300 times longer than for the best-guess and dosage methods) and 20 sec for 2-df regression models. However, calculations for methods applied in this study can be conducted in parallel. We estimate that an application of mixture models to poorly imputed SNPs in a GWA study could be completed in a couple of days using tens of CPUs.

REAL DATA WITH LARGE AND SMALL EFFECTS

We confirmed the general applicability of our results to real genotype data, by applying our methods to 538 control samples from a GWA case-control study of Type II diabetes (FUSION). We studied the following two scenarios: (1) all 538 samples and a modest effect (single-marker heritability of 4.3%); and (2) small sample size of 50 individuals and a large effect (single-marker heritability of 59.8%). To examine the phenomenon of seeing greatly increased power for the mixture models at sites with poor imputation accuracy, we report results for small sample size by low imputation accuracy (R² < 0.56) and “high” accuracy (R² ≥ 0.56). (Due to the constraints of the real data, there does not exist a full spectrum of allele frequencies for plots by allele frequency. The cutoff of 0.56 was chosen based on a visual examination of Fig. 3.) In all scenarios, the power from using mixture models equals or exceeds that of the dosage and best-guess summaries, although only the scenario of low imputation accuracy and large effects shows a pronounced difference. Results are displayed in Table IV.

TABLE IV
Power results for various effect and sample sizes: application to real data

DISCUSSION

Several software packages have been developed to impute and test SNPs that were not typed directly, such as BIMBAM [Servin and Stephens, 2007], IMPUTE [Marchini et al., 2007], MaCH [Li et al., 2009, 2010], and Beagle [Browning and Browning, 2009]. Two of these methods (BIMBAM and IMPUTE) assess association between genotype and phenotype with a Bayes Factor. We do not consider the Bayesian approach here, but this is discussed by Guan and Stephens [2008].

Multiple factors will impact power of imputation-based strategies for the analysis of GWA studies, including differences in the patterns of LD and allele frequencies between the study and reference populations. However, for the single-marker analyses examined in our study, the impact of these factors can be measured via their effect on imputation accuracy, since the missing (unmeasured) genotypes are the quantities of interest for analysis. Different imputation algorithms will lead to slightly different accuracies. However, our aim here was not to compare these accuracies but to condition on the sorts of accuracies that might be expected from typical marker densities and patterns of LD.

The SNP density targeted in our simulations was motivated by analyses of existing GWA studies. Increased densities should result in more information about LD to increase imputation accuracies. For this reason, we plotted results by imputation accuracies in addition to the tables which integrate over the distributions of LD patterns and allele frequencies. For the same reason, our results should be applicable to imputation from low-coverage sequencing data. Although the distributions of allele frequencies of interrogated SNPs will shift to lower values, and imputation accuracies may vary in a manner different from those encountered in this study, our results plotted by these features (frequency and accuracy) should apply to other raw data sources. (Our tabulated summaries may in fact change under these different conditions, since the results are integrated over particular distributions of allele frequencies and accuracies, dependent on the simulations and imputation methods employed herein.)

We applied methods to a sample size of 1,000 individuals. While this size is somewhat smaller than for some GWA studies, and much smaller than associated meta-analyses, it is sufficiently large to illustrate comparisons of methods for effect sizes that correspond to intermediate power to detect association. Larger sizes, with similarly sized effects, will simply result in increased power regardless of methods. Smaller sample sizes will require stronger genetic effects for there to exist sufficient power to detect association. Examples of such scenarios may come from studies of pharmacogenetics or mapping eQTLs.

Here, we have made no attempt to model the correlation of genotypes among SNPs during analysis. To detect interactions among genotypes at nearby SNPs, it may be beneficial to model this dependence during imputation and analysis. The imputation procedures mentioned above may obtain correlated genotypes by sampling entire chromosomes of untyped SNPs, instead of the data at each SNP, marginally.

It may be possible to do better in such a setting by using genuine “multiple imputation” methods. However, in our setting, by applying a mixture of regression models, we hope to capture a range of possible phenotype-genotype relationships, and the gain from multiple imputation over the mixture model should not be large. Therefore, we felt that the mixture model provided a close approximation to an optimal analysis procedure.

In our most relevant comparisons with modest effects and large sample sizes, use of the dosage summaries was as powerful as using the mixture model methods, at a fraction of the computational cost. The exception to this result is apparent only at SNPs with very large genetic effects. In such situations of large effects, most methods will be effective at detecting an association. This difference is most pronounced at poorly imputed SNPs. In practice, many researchers routinely exclude results from poorly imputed SNPs, such as those below an R2 threshold of, say, 30%. Application of this quality-control filter to our results would tend to mitigate (tabulated) differences in power between the mixture and standard regression methods in the setting of large effect sizes. In fact, it may be fruitful, in some cases, to devote additional computational resources to some of these SNPs, such as application of mixture models. However, for the majority of settings and effect sizes detected and verified in GWA studies, use of dosage quantities appears to be effective and efficient to account for the uncertainty in the imputed genotypes.

Acknowledgments

Contract grant sponsors: National Human Genome Research Institute; National Heart Lung and Blood Institute; Pew Scholarship for the Biomedical Sciences; Rackham One-Term Dissertation Fellowship; NIH; Contract grant numbers: HL084729-02; 3-R01-CA082659-11S1.

We thank M. Boehnke and the FUSION investigators for sharing with us their data. This research was supported by grants from the National Human Genome Research Institute and the National Heart Lung and Blood Institute (for J.Z. and G.R.A.), the award of a Pew Scholarship for the Biomedical Sciences (to G.R.A.), a Rackham One-Term Dissertation Fellowship (for Y.L.), and NIH grant HL084729-02 (for G.R.A. and P.S.). Y.L. is partially supported by NIH grant 3-R01-CA082659-11S1.

APPENDIX A: EFFECT SIZES FOR SIMULATIONS

Phenotypes were simulated as described in Methods, using Equations (1) and (2). Here, we focus on the values used for the more realistic scenario of a larger sample size (1,000), i.e. a constant genetic variance with heritability of 2.8%, chosen for adequate power to facilitate comparisons among methods. In Figure A1 we show the actual effect sizes (the values for a and d used in Equation (2)) as they vary with allele frequency. Note the frequency of the recessive allele is plotted on the horizontal axis, and thus for the purely additive model (no dominance) the effect size is symmetric about an allele frequency of 0.5.

Fig. A1

Summary of effect sizes for phenotype simulations. Values for the effect size (a) are plotted against allele frequencies of the recessive allele (allele “A” in Table I). Values of d are given as in Tables II and III, i.e. 0 (Additive), (1/2)a (Partially dominant), a (Dominant), and (6/5)a (Overdominant).

References

  • Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, Brant SR, Silverberg MS, Taylor KD, Barmada MM, Bitton A, Dassopoulos T, Datta LW, Green T, Griffiths AM, Kistner EO, Murtha MT, Regueiro MD, Rotter JI, Schumm LP, Steinhart AH, Targan SR, Xavier RJ, Libioulle C, Sandor C, Lathrop M, Belaiche J, Dewit O, Gut I, Heath S, Laukens D, Mni M, Rutgeerts P, Van Gossum A, Zelenika D, Franchimont D, Hugot JP, de Vos M, Vermeire S, Louis E, Cardon LR, Anderson CA, Drummond H, Nimmo E, Ahmad T, Prescott NJ, Onnie CM, Fisher SA, Marchini J, Ghori J, Bumpstead S, Gwilliam R, Tremelling M, Deloukas P, Mansfield J, Jewell D, Satsangi J, Mathew CG, Parkes M, Georges M, Daly MJ. Belgian-French IBD Consortium, Wellcome Trust Case Control Consortium. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn’s disease. Nat Genet. 2008;40:955–962.
  • Browning SR. Missing data imputation and haplotype phase inference for genome-wide association studies. Hum Genet. 2008;124:439–450.
  • Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007;81:1084–1097.
  • Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009;84:210–223.
  • Carlson C, Eberle M, Rieder M, Yi Q, Kruglyak L, Nickerson D. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet. 2004;74:106–120.
  • Falconer DS. Introduction to Quantitative Genetics. 3rd ed. New York: Longman Scientific & Technical; 1989.
  • Greenspan G, Geiger D. Model-based inference of haplotype block variation. J Comput Biol. 2004;11:493–504.
  • Guan Y, Stephens M. Practical issues in imputation-based association mapping. PLoS Genet. 2008;4:e1000279.
  • International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861.
  • Lettre G, Jackson AU, Gieger C, Schumacher FR, Berndt SI, Sanna S, Eyheramendy S, Voight BF, Butler JL, Guiducci C, Illig T, Hackett R, Heid IM, Jacobs KB, Lyssenko V, Uda M, Boehnke M, Chanock SJ, Groop LC, Hu FB, Isomaa B, Kraft P, Peltonen L, Salomaa V, Schlessinger D, Hunter DJ, Hayes RB, Abecasis GR, Wichmann HE, Mohlke KL, Hirschhorn JN. Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet. 2008;40:584–591.
  • Li Y, Ding J, Abecasis GR. Mach 1.0: rapid haplotype reconstruction and missing genotype inference. Am J Hum Genet. 2006;79:S2290.
  • Li Y, Willer C, Sanna S, Abecasis GR. Genotype imputation. Annu Rev Genom Hum Genet. 2009;10:387–406.
  • Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34:816–834.
  • Loos RJF, Lindgren CM, Li S, Wheeler E, Zhao JH, Prokopenko I, Inouye M, Freathy RM, Attwood AP, Beckmann JS, Berndt SI, Bergmann S, Bennett AJ, Bingham SA, Bochud M, Brown M, Cauchi S, Connell JM, Cooper C, Smith GD, Day I, Dina C, De S, Dermitzakis ET, Doney ASF, Elliott KS, Elliott P, Evans DM, Farooqi IS, Froguel P, Ghori J, Groves CJ, Gwilliam R, Hadley D, Hall AS, Hattersley AT, Hebebrand J, Heid IM, KORA, Herrera B, Hinney A, Hunt SE, Jarvelin MR, Johnson T, Jolley JDM, Karpe F, Keniry A, Khaw KT, Luben RN, Mangino M, Marchini J, McArdle WL, McGinnis R, Meyre D, Munroe PB, Morris AD, Ness AR, Neville MJ, Nica AC, Ong KK, O’Rahilly S, Owen KR, Palmer CNA, Papadakis K, Potter S, Pouta A, Qi L, Randall JC, Rayner NW, Ring SM, Sandhu, Scherag A, Sims MA, Song K, Soranzo N, Speliotes EK, Syddall HE, Teichmann SA, Timpson NJ, Tobias JH, Uda M, Vogel CIG, Wallace C, Waterworth DM, Weedon MN, Willer CJ, FUSION, Wraight VL, Yuan X, Zeggini E, Hirschhorn JN, Strachan DP, Ouwehand WH, Caulfield MJ, Samani NJ, Frayling TM, Vollenweider P, Waeber G, Mooser V, Deloukas P, McCarthy MI, Wareham NJ, Barroso I. Nurses’ Health Study; Diabetes Genetics Initiative; The SardiNIA Study; The Wellcome Trust Case Control Consortium. Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet. 2008;40:768–775.
  • Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007;39:906–913.
  • Nelder JA, Mead R. A simplex algorithm for function minimization. Comput J. 1965;7:308–313.
  • Schaffner S, Foo C, Gabriel S, Reich D, Daly M, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15:1576–1583.
  • Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78:629–644.
  • Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU, Prokunina-Olsson L, Ding CJ, Swift AJ, Narisu N, Hu T, Pruim R, Xiao R, Li XY, Conneely KN, Riebow NL, Sprau AG, Tong M, White PP, Hetrick KN, Barnhart MW, Bark CW, Goldstein JL, Watkins L, Xiang F, Saramies J, Buchanan TA, Watanabe RM, Valle TT, Kinnunen L, Abecasis GR, Pugh EW, Doheny KF, Bergman RN, Tuomilehto J, Collins FS, Boehnke M. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science. 2007;316:1341–1345.
  • Servin B, Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 2007;3:e114.
  • Stephens M, Scheet P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am J Hum Genet. 2005;76:449–462.
  • The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678.
  • Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, Clarke R, Heath SC, Timpson NJ, Najjar SS, Stringham HM, Strait J, Duren WL, Maschio A, Busonero F, Mulas A, Albai G, Swift AJ, Morken MA, Narisu N, Bennett D, Parish S, Shen H, Galan P, Meneton P, Hercberg S, Zelenika D, Chen WM, Li Y, Scott LJ, Scheet PA, Sundvall J, Watanabe RM, Nagaraja R, Ebrahim S, Lawlor DA, Ben-Shlomo Y, Davey-Smith G, Shuldiner AR, Collins R, Bergman RN, Uda M, Tuomilehto J, Cao A, Collins FS, Lakatta E, Lathrop GM, Boehnke M, Schlessinger D, Mohlke KL, Abecasis GR. Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet. 2008;40:161–169.