Bioinformatics. Feb 15, 2011; 27(4): 516–523.
Published online Dec 14, 2010. doi:  10.1093/bioinformatics/btq688
PMCID: PMC3105480

The Bayesian lasso for genome-wide association studies

Abstract

Motivation: Despite their success in identifying genes that affect complex disease or traits, current genome-wide association studies (GWASs) based on a single SNP analysis are too simple to elucidate a comprehensive picture of the genetic architecture of phenotypes. A simultaneous analysis of a large number of SNPs, although statistically challenging, especially with a small number of samples, is crucial for genetic modeling.

Method: We propose a two-stage procedure for multi-SNP modeling and analysis in GWASs, by first producing a ‘preconditioned’ response variable using a supervised principal component analysis and then formulating Bayesian lasso to select a subset of significant SNPs. The Bayesian lasso is implemented with a hierarchical model, in which scale mixtures of normals are used as prior distributions for the genetic effects and exponential priors are considered for their variances, and then solved by using the Markov chain Monte Carlo (MCMC) algorithm. Our approach obviates the choice of the lasso parameter by imposing a diffuse hyperprior on it and estimating it along with other parameters and is particularly powerful for selecting the most relevant SNPs for GWASs, where the number of predictors exceeds the number of observations.

Results: The new approach was examined through a simulation study. By using the approach to analyze a real dataset from the Framingham Heart Study, we detected several significant genes that are associated with body mass index (BMI). Our findings support the previous results about BMI-related SNPs and, meanwhile, gain new insights into the genetic control of this trait.

Availability: The computer code for the approach developed is available at Penn State Center for Statistical Genetics web site, http://statgen.psu.edu.

Contact: rwu@hes.hmc.psu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Recent genotyping technologies allow the fast and accurate collection of genotype data throughout the entire genome for many subjects. By genome-wide association studies (GWASs), the genetic variants associated with a complex disease or trait, their chromosomal distribution and individual effects, can be identified. GWASs are based on either case–control cohorts to test the associations between SNPs and diseases or population cohorts to estimate genetic effects of SNPs on traits. In both cases, there are hundreds of thousands of SNPs genotyped on samples involving thousands of subjects. This typical problem, having the number of predictors far exceeding the number of observations, makes it impossible to analyze the data using traditional multivariate regression. In current GWASs, simple univariate linear regression that analyzes one SNP at a time is usually used and, by adjusting for multiple comparisons, the significance levels of the detected genes are then calculated (McCarthy et al., 2008).

These single SNP-based GWASs have been instrumental for reproducibly detecting significant genes for various complex diseases or traits (Donnelly, 2008). However, such strategies have three major disadvantages, limiting the future applications of GWAS. First, because most complex traits are polygenic, a single SNP analysis can only detect a very small portion of genetic variation and, also, may not be powerful for identifying weaker associations (Hoggart et al., 2008). Second, different genes may interact with each other to form a complex network of genetic interactions, which cannot be characterized from a single SNP analysis. Third, many GWASs analyze genetic associations separately for different environments, such as males and females, and then make an across-environment comparison in genetic effects. This analysis is neither powerful nor precise for the identification of gene–environment interactions. Because of these limitations, many authors have developed various approaches for simultaneously analyzing multiple SNPs for GWASs (Logsdon et al., 2010; Wu et al., 2009; Yang et al., 2010), although most approaches focus on case–control cohorts.

There is a pressing need for a variable selection model that identifies SNPs with significant effects on quantitative traits in population cohorts and estimates all selected predictors simultaneously. Traditionally, a subset of predictors in a regression model is obtained by forward selection, backward elimination or stepwise selection, but these approaches are computationally expensive and unstable even when the number of predictors is not large. Recently, alternative approaches have been developed, including ridge regression, bridge regression (Frank and Friedman, 1993), least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), elastic net (Zou and Hastie, 2005) and the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001). For the case in which the number of variables is much larger than the number of subjects, as commonly seen in GWASs, Fan and Lv (2008) proposed a two-stage procedure for variable selection by first suppressing the high dimensionality of response into its low-dimensional representation and then finding a subset of predictors that can predict the suppressed response. A similar two-stage approach was also developed by Paul et al. (2008).

In this article, we integrate, for the first time, Paul et al.'s preconditioning procedure into LASSO to develop a two-stage strategy for identifying important SNPs in GWASs. In step one, we find a linear combination of predictors that are strongly correlated with the response by a supervised principal component analysis and get a consistent ‘preconditioned’ estimate of the response variable. In step two, we implement the Bayesian lasso (Park and Casella, 2008) for variable selection based on the ‘preconditioned’ response that mitigates the observational noise. The Markov chain Monte Carlo (MCMC) algorithm is used to estimate all the parameters. The Bayesian hierarchical model is implemented to control the over-fitting that arises when too many associations are included. Our model is flexible enough to fit many SNPs and many covariates at the same time. The statistical properties of the model were tested through simulation studies. We used a real GWAS dataset from the Framingham Heart Study (FHS) to validate the usefulness and utility of the new model.

2 BAYESIAN GWAS MODEL

2.1 Preconditioning

When the number of predictors far exceeds the number of observations, preconditioning via a supervised principal component analysis is recommended to reduce the effect of observational noise on model selection (Paul et al., 2008). In a GWAS of n subjects, we express a response variable y (assumed to be normally distributed) as a function of p SNPs genotyped throughout the entire genome using a linear model

$$
y = W b + \varepsilon \tag{1}
$$

where W = (w1,…, wn)T is a (n × p) design matrix, b = (b1,…, bp)T is the vector of regression coefficients and ε~Nn(0, σ2In) is the residual error.

The design matrix is reduced to one that consists of only those predictors whose estimated regression coefficients exceed a threshold θ in absolute value. Thus, the reduced design matrix Wreduced consists of the j'-th columns of W, where $j' \in \{j : |\hat{b}_j| > \theta\}$. The principal components of Wreduced, called supervised principal components, are computed. The first m supervised principal components can serve as independent variables in a linear regression model, from which a consistent predictor $\hat{y}$ of the true response is obtained. In practice, we select θ by 5-fold cross-validation. Since only the first few components are useful for prediction, in the following examples we consider the first three principal components. Next, a standard variable selection procedure will be conducted for the preconditioned response variable $\hat{y}$.
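
The preconditioning step described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function name `precondition`, the use of raw (unstandardized) univariate slopes for thresholding, and the centring of columns are all assumptions made for the example.

```python
import numpy as np

def precondition(W, y, theta, m=3):
    """Supervised-PCA preconditioning sketch: returns (y_hat, kept_columns)."""
    Wc = W - W.mean(axis=0)              # centre each SNP column
    yc = y - y.mean()
    # Univariate least-squares slope of y on each SNP, one column at a time.
    ss = (Wc ** 2).sum(axis=0)
    ss[ss == 0] = np.inf                 # constant columns get slope 0
    b_hat = Wc.T @ yc / ss
    keep = np.abs(b_hat) > theta         # columns whose |slope| exceeds theta
    if not keep.any():
        raise ValueError("theta removes every predictor")
    # Principal components of the reduced design matrix via the SVD.
    U, s, _ = np.linalg.svd(Wc[:, keep], full_matrices=False)
    k = min(m, U.shape[1])
    pcs = U[:, :k] * s[:k]               # first k supervised principal components
    # Regress y on those components; the fitted values are the preconditioned response.
    coef, *_ = np.linalg.lstsq(pcs, yc, rcond=None)
    return y.mean() + pcs @ coef, keep
```

In practice θ would be chosen by cross-validation, as stated above; the sketch takes it as a given argument.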

2.2 Lasso penalized regression

Given phenotypical measurements and genotype information, we could obtain the preconditioned response $\hat{y}$ based on the generic form of linear regression (1). However, in GWASs, a number of covariates, which are either discrete or continuous, may be measured for each subject. In order to estimate genetic effects precisely by adjusting for these covariates, a GWAS model that takes into account the effects of important covariates would be more appropriate. Therefore, we describe the preconditioned value $\hat{y}_i$ of a quantitative trait for subject i as

$$
\hat{y}_i = \mu + X_i^T \alpha + Z_i^T \beta + \xi_i^T a + \zeta_i^T d + \varepsilon_i \tag{2}
$$

where μ is the overall mean, Xi is the d1-dimensional vector of discrete covariates for subject i, α = (α1,…, αd1)T is the vector of regression coefficients for discrete covariates, Zi is the d2-dimensional vector of continuous covariates for subject i, β = (β1,…, βd2)T is the vector of regression coefficients for continuous covariates, a = (a1,…, ap)T and d = (d1,…, dp)T are the p-dimensional vectors of the additive and dominant effects of SNPs, respectively, ξi and ζi are the indicator vectors of the additive and dominant effects of SNPs for subject i, and εi is the residual error assumed to follow a N(0, σ2) distribution. The j-th elements of ξi and ζi are defined as

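$$
\xi_{ij} = \begin{cases} 1, & \text{if the genotype of SNP } j \text{ is } AA \\ 0, & \text{if the genotype of SNP } j \text{ is } Aa \\ -1, & \text{if the genotype of SNP } j \text{ is } aa \end{cases}
$$

$$
\zeta_{ij} = \begin{cases} 1, & \text{if the genotype of SNP } j \text{ is } Aa \\ 0, & \text{if the genotype of SNP } j \text{ is } AA \text{ or } aa \end{cases}
$$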
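
As a small illustration of how such indicator vectors can be built from genotype data, the sketch below assumes the common coding in which the additive indicator is 1, 0 and −1 for genotypes AA, Aa and aa, and the dominant indicator is 1 only for the heterozygote. The function name `code_genotypes` and the input convention (number of copies of allele A) are hypothetical.

```python
import numpy as np

def code_genotypes(n_copies_A):
    """Map counts of allele A (0, 1 or 2) to additive and dominant indicators."""
    g = np.asarray(n_copies_A)
    xi = g - 1                       # AA -> 1, Aa -> 0, aa -> -1
    zeta = (g == 1).astype(int)      # heterozygote Aa -> 1, homozygotes -> 0
    return xi, zeta
```

For example, `code_genotypes([2, 1, 0])` gives additive codes 1, 0, −1 and dominant codes 0, 1, 0.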

Despite p ≫ n in the GWAS, most of the regression coefficients in (2) are expected to have no or only weak effects on the phenotype. To identify a few SNPs that may have notable effects and enhance prediction performance, we put L1 lasso penalties on the sizes of the additive and dominant effects and encourage sparse solutions using

$$
\sum_{j=1}^{p} |a_j| \le t \quad \text{and} \quad \sum_{j=1}^{p} |d_j| \le t^* \tag{3}
$$

where t and t* are values chosen to penalize the additive and dominant effects, respectively. Thus, parameters in Equation (2) are estimated by the penalized least squares

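$$
\min_{\mu, \alpha, \beta, a, d} \; (\hat{y} - \mu - X\alpha - Z\beta - \xi a - \zeta d)^T (\hat{y} - \mu - X\alpha - Z\beta - \xi a - \zeta d) + \lambda \sum_{j=1}^{p} |a_j| + \lambda^* \sum_{j=1}^{p} |d_j| \tag{4}
$$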

where $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_n)^T$, μ = (μ,…, μ)T, X = (X1,…, Xn)T, Z = (Z1,…, Zn)T, ξ = (ξ1,…, ξn)T, ζ = (ζ1,…, ζn)T and λ and λ* are tuning parameters or lasso parameters that control the degrees of shrinkage in the estimate of the genetic effects.

2.3 Bayesian hierarchical model and prior distributions

Noting the form of the L1-penalty term in (4), Tibshirani (1996) suggested that lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent and identical Laplace (i.e. double-exponential) priors. Therefore, when lasso penalties are imposed on the additive and dominant effects of SNPs, the conditional prior for aj is a Laplace distribution with the scale parameter σ/λ:

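$$
\pi(a \mid \sigma^2) = \prod_{j=1}^{p} \frac{\lambda}{2\sigma} \exp\left(-\frac{\lambda |a_j|}{\sigma}\right) \tag{5}
$$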

Similarly, the conditional Laplace prior for dj is

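$$
\pi(d \mid \sigma^2) = \prod_{j=1}^{p} \frac{\lambda^*}{2\sigma} \exp\left(-\frac{\lambda^* |d_j|}{\sigma}\right) \tag{6}
$$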

Since the Laplace distribution can be represented as a scale mixture of a normal distribution with an exponential distribution (Andrews and Mallows, 1974), we have the following hierarchical representation of the penalized regression model:

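$$
\begin{aligned}
\hat{y} \mid \mu, \alpha, \beta, a, d, \sigma^2 &\sim N_n(\mu + X\alpha + Z\beta + \xi a + \zeta d, \; \sigma^2 I_n) \\
a \mid \sigma^2, \tau_1^2, \ldots, \tau_p^2 &\sim N_p(0, \; \sigma^2 \, \mathrm{diag}(\tau_1^2, \ldots, \tau_p^2)) \\
d \mid \sigma^2, \tau_1^{*2}, \ldots, \tau_p^{*2} &\sim N_p(0, \; \sigma^2 \, \mathrm{diag}(\tau_1^{*2}, \ldots, \tau_p^{*2})) \\
\tau_j^2 \mid \lambda &\sim \frac{\lambda^2}{2} \exp\left(-\frac{\lambda^2 \tau_j^2}{2}\right), \quad j = 1, \ldots, p \\
\tau_j^{*2} \mid \lambda^* &\sim \frac{\lambda^{*2}}{2} \exp\left(-\frac{\lambda^{*2} \tau_j^{*2}}{2}\right), \quad j = 1, \ldots, p
\end{aligned}
$$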

After integrating out τ21,…, τ2p and τ*21,…, τ*2p, the conditional priors on a and d have the desired forms (5) and (6), respectively. We assign conjugate normal priors to α and β because they are low dimensional and not the parameters of interest. Finally, since the data are usually sufficient to estimate μ and σ, we can use an independent, flat prior π(μ)=1 for μ and a non-informative scale-invariant prior π(σ2)=1/σ2 for σ2.
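
The scale-mixture identity behind this hierarchy can be checked numerically. The following sketch draws a variance from an exponential distribution with rate λ²/2, then a normal effect with variance σ² times that draw, and compares the result with direct draws from a Laplace distribution with scale σ/λ. The particular values of `lam`, `sigma` and the number of draws are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, sigma, n_draws = 2.0, 1.5, 200_000

# tau^2 ~ Exponential with rate lam^2 / 2 (NumPy is parameterized by scale = 1 / rate).
tau2 = rng.exponential(scale=2.0 / lam**2, size=n_draws)
# a | tau^2, sigma^2 ~ N(0, sigma^2 * tau^2)
a_mix = rng.normal(0.0, sigma * np.sqrt(tau2))
# Direct draws from the Laplace prior with scale sigma / lam.
a_lap = rng.laplace(0.0, sigma / lam, size=n_draws)

# Both sample means of |a| should be close to the Laplace scale sigma / lam.
print(np.mean(np.abs(a_mix)), np.mean(np.abs(a_lap)), sigma / lam)
```

Both Monte Carlo averages of |a| should agree with σ/λ up to simulation error, which is what the marginalization over the variances implies.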

The tuning parameters of the ordinary lasso can be prespecified by cross-validation, generalized cross-validation or the idea based on Stein's unbiased risk estimate. However, in the Bayesian lasso, λ and λ* can be estimated along with other parameters by assigning appropriate hyperpriors to them. This procedure avoids the choice of lasso parameters and allows us to determine the amount of shrinkage from the data. In particular, we consider the conjugate gamma priors on λ2/2 and λ*2/2,

$$
\lambda^2 / 2 \sim \mathrm{Gamma}(a, b)
$$

$$
\lambda^{*2} / 2 \sim \mathrm{Gamma}(a^*, b^*)
$$

where a, b, a* and b* are small values so that the priors are essentially non-informative. With this specification, the lasso parameters can be treated like the other parameters and estimated by the Gibbs sampler.

3 POSTERIOR COMPUTATION AND INTERPRETATION

3.1 MCMC algorithm

We estimate the parameters by sampling from their conditional posterior distributions through the MCMC algorithm. The joint posterior distribution can be expressed as:

equation image

Two-level hierarchical modeling allows us to easily derive the conditional posterior distributions of parameters and hyperparameters, from which the Gibbs sampler draws posterior samples. Conditional on the parameters (a, d, τ21,…, τ2p, τ*21,…, τ*2p), the model is the standard linear regression and, thus, the conditional posterior distributions of (α, β, σ2) are

equation image

equation image

equation image

Conditional on the parameters (τ21,…, τ2p, τ*21,…, τ*2p, α, β), the model becomes the weighted linear regression, and thus the conditional posterior distributions of (a, d) are

equation image

equation image

Moreover, the full conditionals for τ21,…, τ2p, τ*21,…, τ*2p are conditionally independent, with

equation image

and

equation image

Finally, with the conjugate priors Gamma(a, b) and Gamma(a*, b*), the conditional posterior distributions of the hyperparameters are gammas

equation image

and

equation image
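
To make the structure of such a sampler concrete, here is a minimal sketch for a simplified model with additive effects only and no covariates. It is not the authors' implementation: the full conditionals are those of the Bayesian lasso in Park and Casella (2008), the gamma prior on λ²/2 is assumed to use a shape/rate parameterization, and the names `bayesian_lasso_gibbs`, `r` and `delta` (standing in for the gamma hyperparameters) are hypothetical.

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, n_iter=2000, burn_in=500, r=0.1, delta=0.1, seed=0):
    """Gibbs sampler sketch for y = mu + X a + eps with Laplace priors on a."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)          # centring removes the intercept mu
    yc = y - y.mean()
    XtX, Xty = Xc.T @ Xc, Xc.T @ yc
    a = np.zeros(p)
    sigma2, inv_tau2, lam2 = yc.var(), np.ones(p), 1.0
    draws = np.empty((n_iter - burn_in, p))
    for it in range(n_iter):
        # a | rest ~ N(A^{-1} X'y, sigma2 * A^{-1}), with A = X'X + diag(1 / tau_j^2)
        A = XtX + np.diag(inv_tau2)
        L = np.linalg.cholesky(A)
        mean = np.linalg.solve(A, Xty)
        a = mean + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(p))
        # sigma2 | rest ~ Inverse-Gamma((n - 1 + p) / 2, (RSS + a' D^{-1} a) / 2)
        resid = yc - Xc @ a
        scale = 0.5 * (resid @ resid + a @ (inv_tau2 * a))
        sigma2 = scale / rng.gamma(0.5 * (n - 1 + p))
        # 1 / tau_j^2 | rest ~ Inverse-Gaussian(sqrt(lam2 * sigma2 / a_j^2), lam2)
        ig_mean = np.sqrt(lam2 * sigma2 / np.maximum(a ** 2, 1e-12))
        inv_tau2 = rng.wald(ig_mean, lam2)
        # lam2 / 2 | rest ~ Gamma(shape = p + r, rate = sum_j tau_j^2 + delta)
        lam2 = 2.0 * rng.gamma(p + r, 1.0 / (np.sum(1.0 / inv_tau2) + delta))
        if it >= burn_in:
            draws[it - burn_in] = a
    return draws
```

The returned array holds one posterior draw of the effect vector per row, so posterior medians and credible intervals can be read off column by column. Extending the sketch to dominant effects would repeat the same updates with a second set of variances and a second shrinkage parameter.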

An efficient Gibbs sampler based on these full conditionals proceeds to draw posterior samples from each full conditional posterior distribution, given the current values of all other parameters and the observed data. This process continues until all chains converge. We use the potential scale reduction factor $\hat{R}$ to assess convergence (Gelman and Rubin, 1992). Once $\hat{R}$ is sufficiently close to 1 for all scalar estimands of interest, we continue to draw 15 000 iterations to obtain samples from the joint posterior distribution.
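
For reference, a basic version of the potential scale reduction factor can be computed from several parallel chains as below. This sketch uses the simple between/within variance ratio and omits the degrees-of-freedom correction of the original 1992 paper; the function name is hypothetical.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Basic Gelman-Rubin R-hat for one scalar; `chains` has shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # average within-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled estimate of the posterior variance
    return np.sqrt(var_plus / W)
```

Values near 1 indicate that the chains agree with each other; values well above 1 indicate that they have not yet mixed.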

3.2 Posterior interpretation

The proposed MCMC algorithm for our Bayesian lasso model can provide posterior median estimates of the additive effects and dominant effects of individual SNPs, while adjusting for the effects of all other SNPs and covariates. Furthermore, using the posterior samples of a, d, and the observed genotypes, we can calculate the proportion of the phenotypic variance explained by a particular SNP, i.e. heritability, by

equation image

where $\hat{p}_j$ is the estimated allele frequency for A, and $\hat{q}_j$ is the estimated allele frequency for a, âj is the median estimate of the additive effect for SNP j and $\hat{d}_j$ is the median estimate of the dominant effect for SNP j. Since heritability estimates are unitless, they could guide variable selection and identify SNPs that have relatively large effects on the phenotype.
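
One standard way to compute such a per-SNP proportion is the textbook quantitative-genetics decomposition into additive and dominance variance under Hardy-Weinberg equilibrium. The sketch below uses that decomposition as an assumption; it may differ in detail from the formula used in this article, and the function names and the `var_phenotype` argument are hypothetical.

```python
def snp_variance_components(a, d, p):
    """Additive and dominance variance of one biallelic SNP under Hardy-Weinberg."""
    q = 1.0 - p
    alpha = a + (q - p) * d               # average effect of allele substitution
    var_add = 2.0 * p * q * alpha ** 2
    var_dom = (2.0 * p * q * d) ** 2
    return var_add, var_dom

def snp_heritability(a, d, p, var_phenotype):
    """Share of the phenotypic variance attributed to one SNP."""
    var_add, var_dom = snp_variance_components(a, d, p)
    return (var_add + var_dom) / var_phenotype
```

With a = d = 0.4 and an allele frequency of 0.5, the two components are 0.08 and 0.04, so a dominant effect of the same size as an additive effect contributes less variance.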

4 RESULTS

4.1 Worked example

We used the newly developed model to analyze a real GWAS dataset from the FHS, a cardiovascular study based in Framingham, Massachusetts, supported by the National Heart, Lung and Blood Institute, in collaboration with Boston University (Dawber et al., 1951). Recently, 550 000 SNPs have been genotyped for the entire Framingham cohort (Jaquish, 2007), from which 418 males and 559 females were chosen for our data analysis. These subjects were measured for body mass index (BMI) at different ages from 29 to 61 years. As is standard practice, SNPs with minor allele frequency <10% were excluded from data analysis. The numbers and percentages of non-rare allele SNPs vary among chromosomes, ranging from 4417 to 28 771 and from 64% to 72%, respectively.

In principle, our approach can handle an extremely large number of SNPs at the same time. To save computing time, however, we use only those SNPs that cannot be neglected according to a simple single SNP analysis. We chose the phenotypic data of BMI at the middle measurement age of each subject for a single SNP analysis, separately for males and females. Supplementary Figure S1 gives −log10 P-values for each SNP in the two sexes, from which 1837 SNPs with a −log10 P-value of >3.5 in at least one sex were selected for Bayesian lasso analysis. Before this analysis, we imputed missing genotypes for a small proportion of SNPs (5.16%) according to the distribution of genotypes in the population. A preconditioning analysis with m = 3 and θ = 0.426 was used to mitigate observational noise, leading to the preconditioned phenotypes. Like the original measures, the preconditioned BMI also displays a normal distribution (Fig. 1), which meets the normality assumption required by the new approach.

Fig. 1.
The histograms of original and preconditioned BMI.

By treating sex as a discrete covariate and age as a continuous covariate, we imposed lasso penalties on the additive effects a1,…, ap and dominant effects d1,…, dp to identify those SNPs with notable effects on BMI. We employ the proposed MCMC algorithms to estimate all parameters and implement variable selection, where Σα = 1, Σβ = 1 and all parameters in the conjugate gamma hyperpriors are 0.1. In unreported tests, we find that the posteriors are not sensitive to these prior specifications, as long as a and b are small values so that the priors are relatively flat (Park and Casella, 2008; Yi and Xu, 2008). Figure 2 plots the estimated additive and dominant effects of each SNP after adjusting for the effects of other SNPs and covariates. The heritability explained by each SNP is shown in Figure 3. The Bayesian hierarchical model automatically shrinks small coefficients toward zero, and hence the posterior estimates of a, d and hj2 can guide variable selection. We claim that a genetic effect is significant if its 95% posterior credible interval does not contain zero. Alternatively, Hoti and Sillanpaa (2006) suggested presetting a threshold value, c, such that a SNP is included in the final model if the heritability explained by this SNP is greater than c. We usually report the SNPs with high heritabilities and, thus, this threshold can be chosen on more subjective grounds.
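
The credible-interval rule stated above can be applied directly to a matrix of posterior draws. The sketch below assumes equal-tailed intervals and draws arranged with one row per MCMC iteration and one column per effect; the function name is hypothetical.

```python
import numpy as np

def select_by_credible_interval(draws, level=0.95):
    """Flag effects whose equal-tailed posterior interval excludes zero."""
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(draws, [tail, 1.0 - tail], axis=0)
    return (lower > 0) | (upper < 0)
```

The result is a Boolean mask over effects, which can be used to index the SNP list.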

Fig. 2.
Estimated additive (A) and dominant effects (B) of 1837 SNPs from the Framingham Heart Study.
Fig. 3.
Estimated additive (A) and dominant effects (B) based on 50 simulations.

Table 1 tabulates the names and positions of SNPs with the heritability (hj2) greater than 0.5, as well as the estimated additive effects and heritabilities. We do not report the estimated dominant effects since they are relatively low in this example. The Bayesian lasso tends to shrink small effects of genes toward zero. Assuming that a=d=0.4 for a SNP with allele frequencies of 0.5 in a population, the additive and dominant variances explained by this SNP are $2(0.5)(0.5)(0.4)^2 = 0.08$ and $[2(0.5)(0.5)(0.4)]^2 = 0.04$, respectively. Thus, there is a possibility that the dominant effects are shrunk to a greater extent than the additive effects if they are of similar size. This may partly explain why the dominant effects estimated for significant SNPs are much smaller than the additive effects. The amount of shrinkage in the estimates of additive and dominant effects is quantified by two hyperparameters λ and λ* determined from the data. The posterior medians for λ and λ* are 54.474 and 54.523, respectively, with the 95% posterior intervals being [53.325, 55.626] and [53.359, 55.678], respectively. These suggest that the tuning parameters for the additive and dominant effects can be estimated precisely.

Table 1.
Information about significant SNPs

Since five significant SNPs are selected from chromosome 1, and four from chromosome 10, we will further examine the correlations among the significant SNPs from the same chromosome, as suggested by a referee. The correlation matrix of five significant SNPs from chromosome 1 is given by

equation image

where a star denotes a significant correlation at the 1% significance level. Clearly, these SNPs can be classified into two groups, and within each group, SNPs are highly correlated. The correlation matrix of four significant SNPs from chromosome 10

equation image

also suggested that these SNPs are closely linked to each other.

4.2 Computer simulation

The new approach is investigated through simulation studies. We generate data according to the model (2) with μ = 0, σ2 = 10 and n = 500. For ease of simulation, ξij is derived from uij, where each uij has a standard normal distribution marginally, and ρ = cov(uij, uik) = 0.1. Then, to mimic a SNP with equal allele frequencies, we set

$$
\xi_{ij} = \begin{cases} 1, & u_{ij} > c \\ 0, & -c \le u_{ij} \le c \\ -1, & u_{ij} < -c \end{cases}
$$

where −c is the first quartile of a standard normal distribution. Finally, ζij is derived from ξij. We assume that there are 1000 SNPs from which 20 are significant for a phenotypic trait. The positions and additive and dominant effects of the individual SNPs are given in Table 2. It is assumed that the trait is measured at a subject-specific age, following the data structure of the FHS.
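
The genotype-generation scheme just described can be sketched as follows. The equicorrelated normals are built from one shared factor per subject, which is one simple way (an assumption of this sketch) to obtain a pairwise covariance of ρ; the function name and the choice of taking the dominant indicator to be 1 exactly when the additive indicator is 0 are also assumptions.

```python
import numpy as np

def simulate_genotypes(n, p, rho=0.1, seed=0):
    """Simulate additive (xi) and dominant (zeta) indicators for n subjects, p SNPs."""
    rng = np.random.default_rng(seed)
    # Equicorrelated standard normals: shared factor plus independent noise.
    shared = rng.standard_normal((n, 1))
    u = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * rng.standard_normal((n, p))
    c = 0.6744897501960817               # third quartile of N(0, 1); -c is the first
    xi = np.where(u > c, 1, np.where(u < -c, -1, 0))
    zeta = (xi == 0).astype(int)         # heterozygote indicator derived from xi
    return xi, zeta
```

Thresholding at the quartiles yields genotype frequencies of about 1/4, 1/2 and 1/4, which matches a SNP with equal allele frequencies under Hardy-Weinberg equilibrium.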

Table 2.
Genetic effects of 20 assumed SNPs for data simulation

Figure 3 gives the estimated additive and dominant genetic effects of different SNPs over 50 simulations, and Figure 4 plots the heritability explained by each SNP. It is clear that lasso penalties shrink small genetic effects to zero, resulting in sparse solutions of the regression coefficients. In general, the 20 assumed SNPs can be well identified and their additive and dominant effects well estimated. Also, the two hyperparameters λ and λ*, which influence the degree of shrinkage, can be well estimated. In Supplementary Figure 2, the histograms of these two hyperparameters are shown.

Fig. 4.
Estimated heritability explained by each SNP based on 50 simulations.

Then, we carry out another simulation study to compare the performance of preconditioned Bayesian lasso, Bayesian lasso without preconditioning and the traditional single SNP analysis. Without loss of generality, only the additive model is considered. Specifically, we generate data on n = 200 and p = 500 or 1000 according to the model (2), with μ = 0, σ2 = 10, ρ = 0.1, aj = 1 for 1≤j≤20 and aj = 0 for j > 20.

We apply three methods to the 100 simulated datasets: single SNP analysis (SSA), standard Bayesian lasso (B-lasso) and the Bayesian lasso applied to the preconditioned response from supervised principal components (PB-lasso). In single SNP analysis, we reject the null hypothesis that the genetic effect of an individual SNP equals zero at the significance level of 5% with the FDR adjustment. For the Bayesian lasso and preconditioned Bayesian lasso, we reject the null hypothesis based on 95% Bayesian credible intervals. To ameliorate the bias of the parameter estimates introduced by lasso penalties, we always refit the linear regression model without the penalty term using only those SNPs selected by the model selection procedure.
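
The text does not say which FDR procedure was used for the single SNP analysis. As an illustration only, the sketch below implements the widely used Benjamini-Hochberg step-up rule; treating it as the adjustment applied here is an assumption, and the function name is hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()       # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject
```

The mask is returned in the original order of the P-values, so it can be aligned with the SNP list without re-sorting.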

For each estimated genetic effect obtained from each method, we calculate the average bias and empirical standard error over 100 simulations. Since the first 20 genetic effects are non-zero with the same true value, in Table 3 we report the average values over the first 20 SNPs and over the rest of the SNPs separately. The standard error of each average is in parentheses. In the column labeled ‘Aver. Nonzeros’, we present the average number of non-zero coefficients correctly identified to be non-zero, or the average number of zero coefficients incorrectly estimated to be non-zero in 100 replications. In the column ‘Proportion of Correct-fit’, we present the proportion of replications in which the exact true model was identified.

Table 3.
Simulation results for three methods based on 100 simulations

As can be seen from Table 3, the single SNP analysis tends to overestimate the genetic effect, since when we test a SNP for association with the phenotype, we assume the genetic variation is solely due to this particular SNP, and ignore the effects from all other SNPs. Therefore, in terms of parameter estimates, model selection methods that simultaneously estimate the genetic effects associated with all SNPs outperform the traditional single SNP analysis. In terms of variable selection, although preconditioned Bayesian lasso has a slightly higher false positive rate due to the preconditioning step, it greatly improves the probability of correctly identifying regression coefficients with non-zero effects. Moreover, as the number of SNPs gets larger, single SNP analysis detects fewer important SNPs, since this method is subject to severe multiple comparison adjustment. However, preconditioned Bayesian lasso is still able to identify non-zero coefficients and zero coefficients correctly in almost every simulation. Supplementary Table 1 displays the simulation results when ρ = 0.5, which are consistent with our findings.

Since our method takes the Bayesian perspective and is based on the Gibbs sampler, the computational time is relatively high. For example, each replicate in this simulation study takes on average 439 s when p = 1000 and 109 s when p = 500 on a 2.0 GHz PC.

5 DISCUSSION

When the number of predictors p is much larger than the number of observations n, highly regularized approaches, such as penalized regression models, are needed to identify non-zero coefficients, enhance model predictability and avoid over-fitting (Hastie et al., 2009). The L1 penalized regression, or lasso, is one of the most popular such techniques. In this article, we presented a Bayesian hierarchical model with lasso penalties to simultaneously fit and estimate all possible genetic effects associated with all SNPs in a GWAS, adjusting for both discrete and continuous covariates. Lasso penalties are imposed on the additive and dominant effects, and implemented by assigning double-exponential priors to their regression coefficients. The model shrinks small effects toward zero and produces sparse solutions. In this framework, SNPs with significant genetic effects can be identified more accurately.

We fit the model in a fully Bayesian approach, employing the MCMC algorithm to generate posterior samples from the joint posterior distribution, which can be used to make various posterior inferences. Although computationally intensive, it is easy to implement and provides not only point estimates but also interval estimates of all parameters. The Bayesian lasso treats tuning parameters as unknown hyperparameters and generates their posterior samples when estimating other parameters. This technique avoids the choice of tuning parameters, and automatically accounts for the uncertainty in their selection that affects the estimation of the final model. In contrast, standard lasso algorithms usually select tuning parameters by K-fold cross-validation, which involves partitioning the whole dataset and refitting the model many times. This process may result in unstable tuning parameter estimates.

In order to improve the performance of lasso when p is greater than n, preconditioning is considered before variable selection. Preconditioning encourages the principal components of a reduced design matrix to be highly correlated with the response, and thus in most cases only the first or first few components tend to be useful for prediction. It denoises the response variable so that variable selection becomes more efficient. Our simulation demonstrated that when p greatly exceeds n, preconditioned Bayesian lasso could successfully identify almost all the SNPs with true genetic effects. By analyzing real data, our approach is shown to produce biologically relevant results. For example, the approach detected a significant SNP ss66171460 at position 22580931 of chromosome 20 associated with BMI. It is interesting to note that this SNP is within 500 kb of the FOXA2 (Forkhead Box A2) gene, an important genetic variant that regulates obesity (Wolfrum et al., 2003).

One simulation example of Paul et al. (2008) implies that, in the context of GWASs, SNPs that are marginally independent of the phenotype could be screened out by preconditioning, but can be identified by standard variable selection techniques such as lasso or Bayesian lasso. In theory, if SNPs are correlated with the phenotype through marginal correlations, we believe the preconditioning step is worthwhile to identify more important SNPs. However, in reality, since different SNPs may display interactions, this approach may not work perfectly. In any case, this two-step variable selection procedure should always be advantageous over a single SNP analysis, because we are always testing the marginal correlation between the predictor and response when one SNP is analyzed at a time.

Motivated by Tibshirani (1996), Park and Casella (2008) developed the Bayesian lasso and demonstrated it on the diabetes data (Efron et al., 2004) with p = 10 and n = 442. We applied the Bayesian lasso to the high-dimensional regression problem, and improved it by preconditioning. We have concentrated on the preconditioned Bayesian lasso method for continuous traits in GWASs. The proposed preconditioning procedure and MCMC algorithm can be readily extended to survival data analysis and lasso penalized logistic regression in case–control disease gene mapping. Also, we may look for gene–gene interaction effects after identifying main effects, as suggested by Wu et al. (2009). A model with the capacity to identify epistatic interactions will enable geneticists to decipher a detailed picture of the genetic architecture of a complex trait.

Funding: NSF/NIH Mathematical Biology grant (No. 0540745); NIDA; NIH grants (R21 DA024260 and R21 DA024266). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH.

Conflict of Interest: none declared.

REFERENCES

  • Andrews D.F., et al. Scale mixture of normal distributions. J. R. Stat. Soc. Ser. B. 1974;36:99–102.
  • Dawber T., et al. Epidemiological approaches to heart disease: the Framingham study. Am. J. Public Health. 1951;41:279–286.
  • Donnelly P. Progress and challenges in genome-wide association studies in humans. Nature. 2008;456:728–731.
  • Efron B., et al. Least angle regression (with discussion). Ann. Stat. 2004;32:407–499.
  • Fan J., Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001;96:1348–1360.
  • Fan J., Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Stat. Soc. Ser. B. 2008;70:849–911.
  • Frank I.E., Friedman J.H. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–148.
  • Gelman A., Rubin D.B. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992;7:457–511.
  • Hastie T., et al. The Elements of Statistical Learning. 2nd edn. New York: Springer; 2009. High-dimensional problems: p ≫ N.
  • Hoggart C., et al. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 2008;4:e1000130.
  • Hoti F., Sillanpaa M.J. Bayesian mapping of genotype × expression interactions in quantitative and qualitative traits. Heredity. 2006;97:4–18.
  • Jaquish C. The Framingham heart study, on its way to becoming the gold standard for cardiovascular genetic epidemiology? BMC Med. Genet. 2007;8:63.
  • Logsdon B.A., et al. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinformatics. 2010;11:58.
  • McCarthy M., et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 2008;9:356–369.
  • Park T., Casella G. The Bayesian lasso. J. Am. Stat. Assoc. 2008;103:681–686.
  • Paul D., et al. Preconditioning for feature selection and regression in high-dimensional problems. Ann. Stat. 2008;36:1595–1618.
  • Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. 1996;58:267–288.
  • Wolfrum C., et al. Role of Foxa-2 in adipocyte metabolism and differentiation. J. Clin. Invest. 2003;112:345–356.
  • Wu T.T., et al. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25:714–721.
  • Yang J., et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569.
  • Yi N., Xu S. Bayesian lasso for quantitative trait loci mapping. Genetics. 2008;179:1045–1055.
  • Zou H., Hastie T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. 2005;67:301–320.

Articles from Bioinformatics are provided here courtesy of Oxford University Press