- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

- Journal List
- Bioinformatics
- PMC3025714

# A variable selection method for genome-wide association studies

## Abstract

**Motivation:** Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).

**Results:** We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false positive findings. Simulation studies demonstrate that the GWASelect performs well under a wide spectrum of linkage disequilibrium patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, the regression models based on the GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of the GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.

**Availability:** The software implementing GWASelect is available at http://www.bios.unc.edu/~lin.

Access to WTCCC data: http://www.wtccc.org.uk/

**Contact:** ude.cnu.soib@nil

**Supplementary information:** Supplementary data are available at *Bioinformatics* Online.

## 1 INTRODUCTION

Genome-wide association studies (GWAS) have become increasingly popular for studying complex human diseases. Within the last several years, the number of single nucleotide polymorphisms (SNPs) per DNA array has grown from 10 000 to 1 million (Altshuler *et al.*, 2008). Despite the very large number of SNPs that are genotyped in a study, GWAS data are commonly analyzed one SNP at a time. Indeed, the Armitage trend test (ATT) is used almost exclusively.

There are at least two strong reasons for considering all the SNPs or at least a large subset of them simultaneously. First, the marginal effects of SNPs (i.e. the effect of each SNP on disease when it is considered alone) may be quite different from their joint effects: (i) a SNP that is not related to disease but is correlated with a causal SNP will be marginally associated with disease; (ii) some SNPs may have weak marginal effects but strong joint effects. Conditional on causal SNPs that are already in the model, false positive signals tend to be weakened while marginally uncorrelated causal SNPs have a better chance of being selected. Second, the predictive power of a single SNP tends to be very low. The accuracy of prediction can be improved substantially by utilizing a large number of relevant SNPs.

It is extremely challenging to decide which set of SNPs should be included in the joint analysis because the number of SNPs in a GWAS is much larger than the sample size. This is commonly referred to as the ‘small *n*, large *p*’ problem. A major difficulty in this problem is that the number and extent of spurious associations between predictors and response increase rapidly with increasing *p*. Weak effects of causal variants and strong linkage disequilibrium (LD) among SNPs present additional challenges.

There is a large body of literature on variable selection methods, including bridge regression (Frank and Friedman, 1993), least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996), smoothly clipped absolute deviation (SCAD; Fan and Li, 2001), elastic net (Zou and Hastie, 2005) and adaptive lasso (Zou, 2006). However, these methods were designed for a moderate number of predictors (i.e. tens or hundreds). For ultra-high *p*, these methods may be computationally infeasible and statistically inaccurate.

Recently, Fan and Lv (2008) developed the so-called sure independence screening (SIS) strategy for high-dimensional statistical modeling. The idea is to first reduce the dimension from a very large scale to a moderate scale that is below sample size by univariate correlation learning, and then select important predictors by a moderate-scale variable selection method, such as the LASSO or SCAD. In a similar spirit, Wu *et al.* (2009) reduced the dimension of SNPs in a GWAS to several hundreds using a simple score criterion and applied the LASSO to the reduced set of SNPs. A drawback of this approach is that important features that are marginally uncorrelated with response are bound to be missed because the univariate screening step is based entirely on marginal correlations. Fan and Lv (2008) suggested the iterative sure independence screening (ISIS) procedure, which iterates the SIS procedure conditional on the previously selected features so as to capture important features that are marginally uncorrelated with response. Fan and Lv's work is confined to linear regression of a continuous response, and the number of features they considered is merely thousands.

In this article, we extend Fan and Lv's ISIS idea to logistic regression of case–control GWAS data. This extension is challenging for several reasons. First, the ISIS performs linear regression analysis of residuals, but residuals cannot be used as response variables in logistic regression. Second, prediction errors tend to be much higher for binary outcomes than continuous outcomes. Third, the number of SNPs can be extremely large, typically more than half a million. Fourth, the effects of causal SNPs on complex diseases tend to be small to modest, so the signal-to-noise ratio in GWAS data is low. Fifth, the LD among SNPs is extensive and can be extremely high in certain regions.

A separate challenge is that the false discovery rate (FDR) associated with the ISIS, and indeed with any existing variable selection method, tends to be high. Recently, Meinshausen and B¨hlmann (2010) proposed the stability selection strategy to reduce the FDR. The idea is to repeatedly subsample the original data and perform variable selection on each subsample. The features selected frequently among the subsamples tend to be truly associated with outcome and thus should be included in the final model. In this article, we integrate stability selection into our ISIS procedure to develop a new approach, GWASelect, for genome-wide variable selection.

We describe our approach in the next section. In Section 3, we demonstrate through simulation studies that GWASelect has robust performance under a variety of LD structures and can substantially increase the power and reduce the FDR compared with existing methods. In addition, the regression models generated by GWASelect significantly improve prediction accuracy. In Section 4, we apply GWASelect to the GWAS data from the Wellcome Trust Case-Control Consortium (WTCCC, 2007) and show that it yields several novel discoveries and improves prediction accuracy.

## 2 METHODS

Our ISIS method consists of one marginal SIS and two rounds of conditional SIS. We first describe the marginal SIS procedure. The data contain *n* subjects and *p* SNPs. The genotypes of each SNP are standardized by its sample SD. The SIS theory suggests to reduce the original set of features to a small subset whose dimension is in the order of *n*/log*n*. Since binary outcomes generally contain less information than continuous outcomes, we shrink the dimension of SNPs from *p* to *n*/(4log*n*). The SIS theory also suggests to use a large proportion of the subset for the marginal SIS; therefore, we choose to use *t* SNPs, where *t* is the integer part of 0.9*n*/(4log*n*). That is, we perform the ATT (under the additive model) on each SNP and select the *t* most significant SNPs to form a set _{1}. Then we apply the LASSO to _{1} as follows.

For *i* = 1,…, *n*, let *Y*_{i} denote the disease status (1 = case, 0 = control), and *X*_{i} denote the (*t* + 1)-vector consisting of 1 and the genotypes of the *t* SNPs in _{1}. The genotype of each SNP is represented by the number of minor alleles. It is natural to assume the logistic regression model

where β = (β_{0}, β_{1},…, β_{t})^{T} denotes the vector of unknown regression coefficients. The penalized log-likelihood function takes the form

where λ is the tuning parameter.

We adopt the cyclic coordinate decent (CCD) algorithm (Friedman *et al.*, 2010; Genkin *et al.*, 2007), which is tantamount to maximizing in a component-wise manner. Cross-validation can be used to determine the tuning parameter (and consequently the model size), but for now, we set the model size to a user-specified number, say *d*. (We will show later how to determine the model size adaptively.) That is, we run the LASSO on a dense grid of λ until it generates a model containing *d* predictors. If the exact number of *d* cannot be achieved, we choose the model whose size is right below *d*. This model is labeled _{1}.

To reduce potential collinearity, we prune _{1} using pairwise correlations. Our analysis revealed that 99.9% of the pairwise correlations among the Illumina300K SNPs have absolute values less than 0.8 (corresponding to *r*^{2} of 0.64). Thus, we set the pruning threshold for *r*^{2} to 0.64 so as to minimize the loss of information due to pruning. The pruned model is labeled _{1}^{*}. This marks the end of the marginal SIS.

Assuming that _{1}^{*} contains *t*_{1} SNPs, we label the set of the remaining (*p* − *t*_{1}) SNPs as . We use the conditional SIS described below to capture important SNPs in that are marginally uncorrelated with disease. The first step is to screen all the SNPs in to identify a small set of candidate SNPs that are correlated with *Y* conditional on _{1}^{*}. This step is computationally challenging because the cardinality of is close to *p*, which can be 1 million. We develop the following conditional score test to accomplish this task in a very efficient manner.

For the *i*-th subject, let *W*_{i} be the (*t*_{1} + 1)-vector consisting of 1 and the genotypes of the *t*_{1} SNPs in _{1}^{*}. Let *Z*_{j} be the *j*-th SNP in , and *Z*_{ji} be the value of *Z*_{j} on the *i*-th subject, where *j* = 1,…, *p* − *t*_{1}. We assume the logistic regression model:

where γ and η are unknown regression coefficients. We are interested in testing the null hypothesis *H*_{0} : γ = 0. It is computationally intensive to fit the above model for each of the (*p* − *t*_{1}) SNPs. To bypass this difficulty, we perform the conditional score test. Specifically, we calculate *S* = *U*/*V*^{1/2}, where

and is the maximum likelihood estimator of η under *H*_{0}. Note that and *I*_{ηη} do not involve any data in and thus need to be calculated only once at the outset of the conditional SIS. Given and *I*_{ηη}^{−1}, we calculate the test statistic *S* for each of the (*p* − *t*_{1}) SNPs in . In vein with the SIS theory, we choose the most significant *q* SNPs, where *q* is the integer part of 0.05*n*/(4log*n*), and call this set of SNPs _{2}. (We use 0.05 since 0.9 + 0.05 + 0.05 = 1, where 0.9 pertains to the marginal SIS, and (0.05 + 0.05) to the two rounds of conditional SIS.)

The first step of the conditional SIS is aimed at identifying important SNPs that are marginally uncorrelated (but conditionally correlated) with disease while weakening the priority of those unimportant SNPs that are highly associated with disease through their correlations with the SNPs in _{1}^{*}. In the second step, we combine _{2} with _{1}^{*} and run the LASSO to select a model _{2} with *d* SNPs. During this process, new SNPs may be selected, and previously selected SNPs have a chance to be removed from the model. We prune _{2} to form a new model _{2}^{*}. This completes the conditional SIS.

To increase the opportunities of capturing important SNPs, we repeat the conditional SIS once and call the final model _{3}^{*}. We refer to _{3}^{*} as the ISIS model.

To reduce the FDR, we combine the extended ISIS procedure with the stability selection strategy (Meinshausen and Bühlmann, 2010) to create the GWASelect method, as illustrated in Figure 1. Specifically, we randomly obtain half of the cases and half of the controls from the GWAS data to form a subsample and then run the ISIS on this subsample. The resulting model is named _{1}. Repeating this subsampling concatenated with the ISIS 50 times, we obtain _{1},…, _{50}. Let = _{j=1}^{50} _{j}, and denote = {*v*_{1},…, *v*_{L}}. We then calculate the selection probabilities for the *L* SNPs in

where *I*(·) is the indicator function. We choose the *d* SNPs with the highest selection probabilities from to form the GWASelect model.

It is sometimes desirable to determine the model size adaptively from the data. To this end, we develop dynamic-GWASelect (d-GWASelect), which contains two modifications to the GWASelect. The first modification is that cross-validation is used to determine the tuning parameter for the LASSO embedded in the ISIS. Specifically, we divide the data randomly into five equal parts, with the *k*-th (*k* = 1,…, 5) part being the testing data and the remaining four parts being the training data. For a given tuning parameter λ, we apply the LASSO to the training data and select the SNPs that have non-zero regression coefficients. We calculate the liability score (i.e. the linear predictor) for each testing subject. Let _{1} denote the set of subjects with the highest δ × 100% liability scores, and _{2} the set with the lowest δ × 100%, where δ is a user-specified number between 0 and 0.5. We then calculate the δ-error rate, defined as , where is the number of subjects in the testing data. We choose the value of λ that minimizes the δ-error rate averaged over the five testing datasets for δ = 0.1.

The second modification is that, instead of fixing the model size at *d*, we specify a selection threshold ξ and select all SNPs with selection probabilities ≥ξ. As shown in the next section, the influence of ξ on the final model is typically small.

## 3 SIMULATION STUDIES

Each simulated dataset contained 2000 cases and 2000 controls. For each subject, we simulated 20 chromosomes, each containing 3000 SNPs. The disease status was generated from the logistic regression model containing 10 causal variants, *G*_{1},…, *G*_{10}, with the vector of log odds ratios β*.

We considered three simulation schemes for the causal SNPs. In the first scheme, we simulated 10 independent causal SNPs that are located on 10 different chromosomes, with minor allele frequencies (MAFs) of 0.3. We set β* = (−0.35, −0.35, 0.35, 0.35, 0.35, 0.35, 0.35, −0.35, −0.35, −0.35)^{T}.

In the second scheme, we let {*G*_{1},…, *G*_{10}} reside on one chromosome and have a special correlation structure such that the correlation between any two causal variants is nearly 0.6. We set β* =(0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3)^{T}.

In the third scheme, multiple causal SNPs were generated to be marginally uncorrelated with *Y*. We let the first causal SNP be independent of the other nine causal SNPs. The latter were simulated to form three clusters, {*G*_{2}, *G*_{3}, *G*_{4}}, {*G*_{5}, *G*_{6}, *G*_{7}} and {*G*_{8}, *G*_{9}, *G*_{10}}, each cluster residing on one chromosome. The three clusters are independent of each other, but within each cluster, SNPs have a compound symmetry correlation structure with correlation 0.5. We set β* =(0.5, −0.5, −0.5, 0.5, 0.5, −0.5, 0.5, −0.5, 0.5, −0.5)^{T}. Under this scheme, corr(*Y*, *G*_{4}), corr(*Y*, *G*_{6}) and corr(*Y*, *G*_{9}) are equal to 0.

For all three schemes, the positions of the causal variants on the chromosomes were randomly chosen. Thus, our simulation results would not be affected by any local LD patterns.

It is not trivial to simulate non-causal SNPs as they are desired to mimic the actual LD structure of human population. There exist several genome simulators based on the coalescent approach (Hudson, 2002), for which the users have to arbitrarily specify a number of parameters. As an alternative, the GWAsimulator of Li and Li (2008) employs a moving-window mechanism and can simulate genotypes based on the Illumina HumanHap300 chip data. We adopted the latter approach. Because the average distance between two SNPs for the Illumina HumanHap300 chip data is roughly 10 kb, the total length we simulated is approximately 600 Mb, which accounts for 1/5 of the whole genome. The LD was well preserved and no trimming was done for the simulated data.

Hoggart *et al.* (2008) explored variable selection from a Bayesian point of view by imposing the Laplace prior or the normal exponential gamma prior on each SNP. The former prior yields the LASSO procedure, while the latter generates a more sparse model and is called hyper-LASSO (HLASSO). We included the HLASSO in our simulation studies. Thus, we analyzed the simulated data by five methods: (i) the ATT method, for which the threshold for declaring significance was set to 0.05/60000 (i.e. Bonferroni correction); (ii) the method by Wu *et al.* (2009); (iii) the (extended) ISIS method; (iv) the HLASSO; and (v) the GWASelect method. We set the model sizes to 15 for all methods (except the ATT) because most biology labs are likely to restrict their resources to a small number of top SNPs.

There are different criteria to evaluate a variable selection method. We chose to use the true discovery rate (TDR) and false discovery rate (FDR; Benjamini and Hochberg, 1995) because the main goal of GWAS is to identify causal variants. For genetic studies, how to define the true discovery and false discovery is a delicate issue. This is because once a SNP is declared to be significant, all SNPs that are close to and in LD with that SNP will be followed up. We defined the true positive and false positive as follows. If a captured SNP was no more than 50 SNPs away from a true causal SNP and had *r*^{2} > 0.05 with that same causal SNP, then we classified it as a true positive. [Our experiments revealed that replacing 50 with 20 yielded similar results; Hoggart *et al.* (2008) provided a rationale for choosing 0.05 for *r*^{2}.] If more than one SNP satisfied these conditions, we counted them only as one true positive cluster. The remaining captured SNPs were classified as false positives. If two false positive SNPs were no more than 10 SNPs apart (i.e. within 100 kb in distance), we counted them as only one false positive cluster. The calculations of the TDR and FDR were based on clusters, rather than on individual SNPs. For each simulation scheme, the number of replications was set to 200. The results are shown in Table 1.

Scheme 1 was designed to compare the five methods under a scenario where all causal variants are independent and their effects are moderate. Under this scheme, all five methods yield high TDRs (> 95%), but the FDRs are highly variable. Despite a large model size, the ATT method has the lowest FDR. This seemingly paradoxical phenomenon is explained by the fact that most of the SNPs in the ATT model are highly clustered due to strong LD. The GWASelect model has an elevated FDR, but far lower than the ISIS and the HLASSO, and slightly lower than the Wu *et al.* model. This demonstrates that, by repeated subsampling and variable selection, GWASelect is able to remove many noise features from the model. The ATT method appears to be a good option when causal variants are independent with moderate effects, but if one wishes to achieve higher power without too many false discoveries, the GWASelect method would be a reasonable choice.

In scheme 2, all 10 causal variants are correlated with each other, which makes variable selection more challenging. It can be shown that under this scheme, the marginal effects of the causal SNPs are much higher than their joint effects. For variable selection, this has the undesired effect of selecting unimportant SNPs that are in proximity of the causal SNPs. Reflecting this fact, the ATT, the ISIS, and the HLASSO all have FDRs above 30%. The GWASelect is able to keep the FDR at a low level and preserve most of the power because of stability selection. The Wu *et al.* method has high power and a relatively low FDR, suggesting that this method is particularly capable of distinguishing causal SNPs from unimportant SNPs that are in LD with them.

Scheme 3 represents a more complex correlation structure in which the three causal SNPs (i.e. the fourth, sixth and ninth SNPs) are marginally uncorrelated with *Y*. As expected, the methods that are strongly driven by marginal correlations, such as the ATT and the Wu *et al.* method, almost completely missed *G*_{4}, which drives down their power to 70%. Both the ISIS and the HLASSO methods achieved higher power, but at the price of high FDRs (around 30%). The GWASelect model offers a more balanced solution in terms of the TDR and FDR.

In summary, only the HLASSO and the GWASelect were able to keep their power above 90% under all three schemes, and the latter appears to have a much lower FDR. The other three methods either lack power under some schemes or entail high FDRs in others.

Next, we investigated the prediction accuracy of the five methods. For each scheme, we further simulated 2000 testing subjects under the prospective sampling. To avoid numerical instabilities, we pruned the obtained models and used the pruned models for prediction. We calculated the true liability score and the estimated liability score for each subject and used the correlation between the two scores as a measure of prediction accuracy. We also calculated the absolute difference between the model-predicted and true disease probabilities, termed as p-diff, to measure the prediction error. The results are shown in Table 2.

The Wu *et al.* method excels under scheme 2, consistent with its high TDR and low FDR under this scheme. However, both the Wu *et al.* and the ATT are less accurate than the other methods under scheme 3 because they missed those marginally uncorrelated SNPs. The HLASSO performs well under schemes 2 and 3, suggesting that high prediction power can be achieved even if some noise features are included in the model. Only the HLASSO and the GWASelect have prediction accuracy above 0.9 under all three schemes.

To assess data-adaptive choice of model size, we repeated the above simulation studies but now incorporated a 5-fold cross-validation into all the methods (except the ATT) by using the 10% error rate as the evaluation criterion (see Section 2). For d-GWASelect, we set the selection threshold ξ to 0.3. All effect sizes were set to be moderate. For both schemes 1 and 3, β* = (0.4, −0.4, −0.4, 0.4, 0.5, −0.5, 0.5, −0.6, 0.6, −0.6)^{T}. For scheme 2, β* = (0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2)^{T}. The results are shown in Tables 3 and and44.

The d-GWASelect remains a robust variable selection method under all three schemes and indeed appears to have a better performance than its counterpart with a fixed model size. The Wu *et al.*, the ISIS and the HLASSO now entail extremely high FDRs and poor prediction accuracy. The reason is that cross-validation often favors a large model size for logistic regression, especially when the signal–noise ratio is low. The d-GWASelect method, however, has a well-controlled model size because stability selection sifts away many noise features. Overall, the d-GWASelect enjoys low FDR, high TDR and excellent prediction performance. Replacing the selection threshold with 0.4 yielded highly similar results (data not shown). We also explored the use of deviance (instead of the 10% error rate) as the evaluation criterion for cross-validation, and the d-GWASelect remains more favorable than the other methods (Supplementary Tables S1 and S2).

## 4 ANALYSIS OF WTCCC DATA

The WTCCC study examined approximately 2000 subjects for each of seven common diseases and a shared set of approximately 3000 controls. Each subject was genotyped on the Affymetrix GeneChip 500K Mapping Array Set. We provide in this section a detailed analysis of the data on Type II diabetes [T2D (MIM 125853, http://www.ncbi.nlm.nih.gov/omim)] and Type I diabetes [T1D (MIM 222100)]; the analysis for the other five diseases is presented in Supplementary Tables S3–S7.

We excluded a small number of subjects according to the sample exclusion lists provided by the WTCCC. In addition, we excluded a SNP if (i) it is on the SNP exclusion list provided by the WTCCC; (ii) it has a poor cluster plot as defined by the WTCCC; (iii) its MAF <0.01 in both cases and controls; or (iv) it has extreme departure from Hardy–Weinberg equilibrium (*P* < 10^{−4}). Approximately 390 000 SNPs were used in the analysis, and there were 2938 controls, 1924 T2D cases and 1963 T1D cases.

Figure 2 indicates the SNPs selected by the ATT, Wu *et al.*, HLASSO and GWASelect for T2D; the details are shown in Table 5 and Supplementary Table S9. Under the ATT method, 15 SNPs reach the genome-wide significance of *P* < 10^{−7}. The most significant one is rs4506565 (*P*-value = 7.5 × 10^{−13}), which is located in gene *TCF7L2*. The other 14 SNPs are clustered within either *TCF7L2* or *FTO*. These results are consistent with the WTCCC's findings. The HLASSO model is essentially identical to the ATT model, albeit with a smaller model size.

For the Wu *et al.* and GWASelect methods, we set the model sizes to 20. Both methods successfully detected *TCF7L2* and *FTO*. They also identified a locus that spans *TSPAN8/LGR5*, which was one of the most significant loci reported in a recent meta-analysis of 10 128 subjects (Zeggini *et al.*, 2008). This finding demonstrates empirically that regression-based variable selection methods can be more powerful than the ATT method.

It is interesting to compare the GWASelect and Wu *et al.* models. Five SNPs, rs11688935, rs6846031, rs6872465, rs2389591 and rs10435018, show up only in the GWASelect model. Among these SNPs, rs6846031 was selected partly due to its conditional correlation with T2D, underscoring the importance of conditional screening in variable selection. This finding also indicates that genetic factors underlying T2D are not simply in parallel with each other, but rather form a complex structure that needs to be carefully dissected.

Several SNPs in our GWASelect model have not been reported in the literature on T2D. Some of them are plausibly related to T2D. For example, GULP1 is an adaptor protein that binds and directs the trafficking of LRP1 (Su *et al.*, 2002), a protein that has been shown to play a critical role in adipocyte energy homeostasis and insulin sensitivity (Hofmann *et al.*, 2007). Thus, genetic variants in *GULP1* may potentially influence the amount of LRP1 in adipocyte cells and thereby modulate a person's risk to T2D. As another example, the CREB5 was recently found to be downregulated along with other members of the insulin signaling cascade when stimulated by a ligand of PPARγ, which is known to be associated with T2D (Herrmann *et al.*, 2009). This suggests that CREB5 is closely related to PPARγ and the insulin pathway. The other SNPs do not have known connections with T2D, but further investigation of those loci may reveal novel mechanisms or pathways related to T2D.

For prediction of T2D, the δ-error rates (with δ = 0.1) of all four models are over 40%, suggesting that T2D is greatly influenced by other types of genetic variations and environmental factors. Since it is not very meaningful to compare prediction errors at such a high level, we turned our attention to the T1D data because it is well-known that T1D is genetically more homogeneous than T2D.

For the T1D data, we used cross-validation to choose the tuning parameter for the d-GWASelect method and set the selection threshold ξ to 0.20. For the Wu *et al.* method, we set the model size to 15. The results are shown in Figure 3, Table 6 and Supplementary Table S8. The d-GWASelect model contains 14 SNPs, among which *ADAM29*, *SYNGAP1*, *CUX2* and *ALDH2* do not appear in any of the other three models. The gene *SYNGAP1* was observed to have strong conditional correlation with T1D, demonstrating again that selection solely based on marginal correlation is insufficient. Searching the T1DBase (http://www.t1dbase.org) revealed that all four genes have expressions in pancreas, although none has been previously considered as strong candidates for T1D. Interestingly, the *CUX2* has been shown to directly regulate the expression of *NeuroD* (Lulianella *et al.*, 2008), a gene that can cause T1D if mutated.

Finally, we compared the prediction accuracy of the four methods. We randomly divided the data into three parts, two as the training data and one as the testing data. Since the training dataset contains only 2/3 of the original data, we reduced ξ from 0.20 to 0.10 to ensure that a similar number of loci are included in the d-GWASelect model. Since the true liability scores and disease probabilities are unknown in real data, we measured the prediction errors by the δ-error rates for δ = 0.1, 0.15 and 0.25 (see Section 2 for detail). Considering that pruning was done before each model was used for prediction, we report the actual (i.e. effective) number of SNPs used by each model for prediction. Under default settings, the effective model sizes of the Wu *et al.*, the HLASSO and the d-GWASelect are 14, 4 and 21, respectively. Since the former two models are much smaller, we also evaluated their prediction accuracy with 21 effective SNPs. (We were not able to evaluate the ATT with 21 effective SNPs due to numerical instabilities.) The results are reported in Table 7. Clearly, the d-GWASelect performs the best or nearly the best for all three δ-error rates. We have also calculated the area under the ROC curve for the four methods, and GWASelect achieves the highest value (Supplementary Table S11).

## 5 DISCUSSION

We have developed a new tool, GWASelect, for variable selection at the genome-wide level. This regression-based method has the ability to capture both marginally correlated and marginally uncorrelated causal SNPs and has low FDR. The advantages over the existing methods have been demonstrated through simulated and real data. Our method has two versions. The first version requires the specification of the model size *d*, for which we suggest to choose a number that is consistent with the current biological knowledge of the studied disease. The second version (d-GWASelect) does not require the specification of the model size, and this is the version we recommend for general use.

The correlation structures for causal variants used in our simulation studies have biological relevance. Scheme 2 mimics a scenario in which the causal variants form a gene cluster that contributes synergistically to the disease outcome, while scheme 3 reflects a scenario in which several biological pathways (or networks) affect the disease development.

We did not include least angle regression (LARS) in our studies because it has been shown to have highly similar performance to LASSO (Hastie *et al.*, 2009). Indeed, LASSO can be implemented by LARS with a small modification. Wu and Lange (2008) demonstrated that CCD is ‘considerably faster and more robust than LARS’ and is ‘more successful than LARS in model selection’.

The HLASSO adopts a concave penalty function, but the CCD algorithm may not converge for non-convex penalty functions (Friedman *et al.*, 2010; Wu *et al.*, 2009). A valid algorithm to implement concave penalty functions is local linear approximation (Zou and Li, 2008), which amounts to multiple rounds of CCD and would make the HLASSO computation prohibitively expensive. For the WTCCC T1D data, running the CCD version of the HLASSO with 10 iterations on an Intel Quadcore Nehalem processor (2.4 GHz, 16 GB memory) requires 67.5 to 175 h, depending on the value of the tuning parameter. In contrast, we have been running the GWASelect in a parallel computing environment, and the same analysis can be completed within several hours on 16 processors.

In an independent effort, Fan *et al.* (2009) developed an ISIS method for generalized linear models in the context of microarray data analysis. In their method, the conditional screening procedure requires fitting a separate regression model for each feature, which would create heavy computational burden for GWAS data. In addition, their method tends to have high FDR. They observed that cross-validation tends to yield large models for logistic regression, resonating our findings.

We can extend our methods to select interactions. Instead of considering all possible interaction terms, we may incorporate known biological network information (Franke *et al.*, 2006) into our selection procedure. Another approach is to first extend the existing genetic network identification tools, such as the liquid association (Li, 2002) and bounded mode stochastic search (Dobra, 2007), to infer SNP interactions and then incorporate such information into our GWASelect procedure. Recently, Han *et al.* (2010) proposed a Markov blanket-based method to evaluate epistatic interactions for GWAS data. It will be interesting to compare to that method when we extend our work to interaction effects.

How to obtain *P*-values for high-dimensional variable selection is an active research area. The stochastic error introduced by the selection process makes it very difficult to assign *P*-values to the selected features. Meinshausen *et al.* (2009) offered one possible solution by ‘aggregating’ *P*-values from stability selection, but our experiments indicated that this procedure is too conservative for SNP data, likely due to the ultra-high dimension and strong LD.

The prediction of genetic risk using GWAS data has drawn considerable attention in recent years. Wray *et al.* (2007) pioneered this area of research. Their approach selected genetic predictors by a univariate screening method. As shown in this article, our GWASelect method tends to provide more accurate prediction than univariate screening when the SNPs are in strong LD. Wei *et al.* (2009) explored genetic risk prediction through a Support Vector Machine algorithm. It is difficult to compare our results directly with theirs because (i) their analysis involved two other datasets besides the WTCCC T1D data; (ii) our testing samples are far smaller than theirs; and (iii) interaction effects are not considered in our current work.

*Funding*: National Cancer Institute grant (1-P01-CA142538-01).

*Conflict of Interest*: none declared.

## REFERENCES

- Altshuler D, et al. Genetic mapping in human disease. Science. 2008;322:881–888. [PMC free article] [PubMed]
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. 1995;57:289–300.
- Dobra A. Variable selection and dependency networks for genomewide data. Biostatistics. 2007;8:1–18. [PMC free article] [PubMed]
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Ass. 2001;96:1348–1360.
- Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B. 2008;70:849–911. [PMC free article] [PubMed]
- Fan J, et al. Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res. 2009;10:2013–2038. [PMC free article] [PubMed]
- Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–135.
- Franke L, et al. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 2006;78:1011–1025. [PMC free article] [PubMed]
- Friedman J, et al. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010;33:1–22. [PMC free article] [PubMed]
- Genkin A, et al. Large-scale Bayesian logistic regression for text categorization. Technometrics. 2007;49:291–304.
- Han B, et al. A Markov blanket-based method for detecting causal SNPs in GWAS. BMC Bioinform. 2010;11(Suppl. 3):S5. [PMC free article] [PubMed]
- Hastie T, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2. New York: Springer; 2009. pp. 75–76.
- Herrmann J, et al. Isomer-specific effects of CLA on gene expression in human adipose tissue depending on PPARγ P12A polymorphism: a double blind, randomized, controlled cross-over study. Lipids Health Dis. 2009;8:35. [PMC free article] [PubMed]
- Hoggart CJ, et al. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 2008;4:e1000130. [PMC free article] [PubMed]
- Hofmann SM, et al. Adipocyte LDL receptor-related protein1 expression modulates postprandial lipid transport and glucose homeostasis in mice. J. Clin. Invest. 2007;117:3271–3282. [PMC free article] [PubMed]
- Hudson RR. Generating samples under a Wright-Fisher neutral model. Bioinformatics. 2002;18:337–378. [PubMed]
- Li C, Li M. GWAsimulator: a rapid whole-genome simulation program. Bioinformatics. 2008;24:140–142. [PubMed]
- Li K.-C. Genome-wide coexpression dynamics: theory and application. Proc. Natl Acad. Sci. USA. 2002;99:16875–16880. [PMC free article] [PubMed]
- Lulianella A, et al. Cux2 (Cutl2) integrates neural progenitor development with cell-cycle progression during spinal cord neurogenesis. Development. 2008;135:729–741. [PMC free article] [PubMed]
- Meinshausen N, et al. P-values for high-dimensional regression. J. Am. Stat. Assoc. 2009;104:1671–1681.
- Meinshausen N, Bühlmann P. Stability selection. J. R. Stat. Soc. Ser. B. 2010;72:417–448.
- Su HP, et al. Interaction of CED-6/GULP, an adapter protein involved in engulfment of apoptotic cells with CED-1 and CD91/low density lipoprotein receptor-related protein (LRP) J. Biol. Chem. 2002;281:12081–12092. [PubMed]
- Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. 1996;58:267–288.
- Wei R, et al. From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes. PLos Genet. 2009;5:e1000678. [PMC free article] [PubMed]
- Wellcome Trust Case Control Consortium. Genome-wide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature. 2007;447:661–678. [PMC free article] [PubMed]
- Wray NR, et al. Prediction of individual risk to disease from genome-wide association studies. Genome Res. 2007;17:1520–1528. [PMC free article] [PubMed]
- Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2008;2:224–244.
- Wu TT, et al. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25:714–721. [PMC free article] [PubMed]
- Zeggini E, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nature Genet. 2008;40:638–645. [PMC free article] [PubMed]
- Zou H, Hastie R. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. 2005;67:301–320.
- Zou H. The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 2006;101:1418–1429.
- Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 2008;36:1509–1533. [PMC free article] [PubMed]

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (329K)

- Finding type 2 diabetes causal single nucleotide polymorphism combinations and functional modules from genome-wide association data.[BMC Med Inform Decis Mak. 2013]
*Kang C, Yu H, Yi GS.**BMC Med Inform Decis Mak. 2013; 13 Suppl 1:S3. Epub 2013 Apr 5.* - SNP-based pathway enrichment analysis for genome-wide association studies.[BMC Bioinformatics. 2011]
*Weng L, Macciardi F, Subramanian A, Guffanti G, Potkin SG, Yu Z, Xie X.**BMC Bioinformatics. 2011 Apr 15; 12:99. Epub 2011 Apr 15.* - Ultrahigh-dimensional variable selection method for whole-genome gene-gene interaction analysis.[BMC Bioinformatics. 2012]
*Ueki M, Tamiya G.**BMC Bioinformatics. 2012 May 3; 13:72. Epub 2012 May 3.* - HAPGEN2: simulation of multiple disease SNPs.[Bioinformatics. 2011]
*Su Z, Marchini J, Donnelly P.**Bioinformatics. 2011 Aug 15; 27(16):2304-5. Epub 2011 Jun 8.* - Increasing power of genome-wide association studies by collecting additional single-nucleotide polymorphisms.[Genetics. 2011]
*Kostem E, Lozano JA, Eskin E.**Genetics. 2011 Jun; 188(2):449-60. Epub 2011 Apr 5.*

- A Discrete Wavelet Based Feature Extraction and Hybrid Classification Technique for Microarray Data Analysis[The Scientific World Journal. 2014]
*Bennet J, Arul Ganaprakasam C, Arputharaj K.**The Scientific World Journal. 2014; 2014195470* - Analyzing Genome-Wide Association Studies with an FDR Controlling Modification of the Bayesian Information Criterion[PLoS ONE. ]
*Dolejsi E, Bodenstorfer B, Frommlet F.**PLoS ONE. 9(7)e103322* - Cloud computing for detecting high-order genome-wide epistatic interaction via dynamic clustering[BMC Bioinformatics. ]
*Guo X, Meng Y, Yu N, Pan Y.**BMC Bioinformatics. 15102* - Integrated Enrichment Analysis of Variants and Pathways in Genome-Wide Association Studies Indicates Central Role for IL-2 Signaling Genes in Type 1 Diabetes, and Cytokine Signaling Genes in Crohn's Disease[PLoS Genetics. 2013]
*Carbonetto P, Stephens M.**PLoS Genetics. 2013 Oct; 9(10)e1003770* - Predicting Disease Risk Using Bootstrap Ranking and Classification Algorithms[PLoS Computational Biology. 2013]
*Manor O, Segal E.**PLoS Computational Biology. 2013 Aug; 9(8)e1003200*

- A variable selection method for genome-wide association studiesA variable selection method for genome-wide association studiesBioinformatics. Jan 1, 2011; 27(1)1PMC

Your browsing activity is empty.

Activity recording is turned off.

See more...