
# Genome-wide association analysis by lasso penalized logistic regression

^{1}Yi Fang Chen, ^{2}Trevor Hastie, ^{2,3}Eric Sobel, ^{4}and Kenneth Lange^{4,5,*}

^{1}Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, ^{2}Department of Statistics, ^{3}Department of Biostatistics, Stanford University, Stanford, CA 94305, ^{4}Department of Human Genetics and ^{5}Department of Biomathematics, University of California, Los Angeles, CA 90095

## Abstract

**Motivation:** In ordinary regression, imposition of a lasso penalty makes continuous model selection straightforward. Lasso penalized regression is particularly advantageous when the number of predictors far exceeds the number of observations.

**Method:** The present article evaluates the performance of lasso penalized logistic regression in case–control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors. The strength of the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs and other predictors. For a given value of the tuning constant, the penalized likelihood is quickly maximized by cyclic coordinate ascent. Once the most potent marginal predictors are identified, their two-way and higher order interactions can also be examined by lasso penalized logistic regression.

**Results:** This strategy is tested on both simulated and real data. Our findings on coeliac disease replicate the previous SNP results and shed light on possible interactions among the SNPs.

**Availability:** The software discussed is available in Mendel 9.0 at the UCLA Human Genetics web site.

**Contact:** klange@ucla.edu

**Supplementary information:** Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

The recent successes in association mapping of disease genes have been propelled by logistic regression using cases and controls. In most ways this represents a step down from the computational complexities of linkage analysis performed on large pedigrees. The most novel feature of these genome-wide association studies is their sheer scale. Hundreds of thousands of SNPs (single nucleotide polymorphisms) are now being typed on samples involving thousands of individuals. This avalanche of data creates new problems in data storage, manipulation and analysis. Size does matter. For instance, with hundreds of thousands of predictors, the standard methods of multivariate regression break down. These methods involve matrix inversion or the solution of linear equations for a very large number of predictors *p*. Since these operations scale as *p*^{3}, it is hardly surprising that geneticists have opted for univariate linear regression SNP by SNP. This simplification goes against the grain of most statisticians, who are trained to consider predictors in concert. In this article, we explore an intermediate strategy that permits fast computation while preserving the spirit of multivariate regression.

The lasso penalty is an effective device for continuous model selection, especially in problems where the number of predictors *p* far exceeds the number of observations *n* (Chen *et al.*, 1998; Claerbout and Muir, 1973; Santosa and Symes, 1986; Taylor *et al.*, 1979; Tibshirani, 1996). Several authors have explored lasso penalized ordinary regression (Daubechies *et al.*, 2004; Friedman *et al.*, 2007; Fu, 1998; Wu and Lange, 2008) in both the ℓ_{1} and ℓ_{2} settings. Let *y*_{i} be the response for case *i*, *x*_{ij} the *j*-th predictor for case *i*, β_{j} the regression coefficient corresponding to the *j*-th predictor and μ the intercept. For notational convenience also let θ=(μ, β_{1},…, β_{p})^{t} and *x*_{i}=(*x*_{i1},…, *x*_{ip})^{t}. In ordinary linear regression, the objective function is *f*(θ)=∑_{i=1}^{n}(*y*_{i}−μ−*x*_{i}^{t}β)^{2}. In ℓ_{1} regression one replaces squares by absolute values. Lasso penalized regression is implemented by minimizing the modified objective function

$$g(\theta) = f(\theta) + \lambda \sum_{j=1}^{p} |\beta_j|. \tag{1}$$

Note that the intercept μ is ignored in the lasso penalty λ ∑_{j=1}^{p}|β_{j}|. The tuning constant λ controls the strength of the penalty, which shrinks each β_{j} toward the origin and enforces sparse solutions. A ridge penalty λ∑_{j=1}^{p}β_{j}^{2} also shrinks parameter estimates, but it is not as effective in actually forcing many estimates to vanish. This defect of the ridge penalty reflects the fact that |*b*| is much larger than *b*^{2} for small *b*.
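The contrast between the two penalties can be seen in a one-dimensional toy problem, which is an illustration and not taken from the article: minimize ½(*b*−*z*)^{2} plus a penalty. With the lasso penalty λ|*b*| the minimizer is the soft-thresholded value of *z*, which is exactly zero whenever |*z*|≤λ; with the ridge penalty λ*b*^{2} the minimizer is *z*/(1+2λ), which shrinks but never vanishes unless *z*=0. The function names below are hypothetical.

```python
def lasso_1d(z: float, lam: float) -> float:
    """Minimizer of 0.5*(b - z)**2 + lam*abs(b): soft thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small signals are set exactly to zero


def ridge_1d(z: float, lam: float) -> float:
    """Minimizer of 0.5*(b - z)**2 + lam*b**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)


if __name__ == "__main__":
    for z in (0.3, 0.8, 2.0):
        print(z, lasso_1d(z, 1.0), ridge_1d(z, 1.0))
```

For λ=1 the lasso estimate stays at zero for every |*z*|≤1, whereas the ridge estimate is non-zero for every non-zero *z*.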

Many diseases are believed to stem from the interaction of risk factors. This further complication can also be handled by lasso penalization if we proceed in two stages. In the first stage, we select the important marginal predictors; in the second stage, we look for interactions among the supported predictors. In both stages, we adjust the penalty constant to give a fixed number of supported predictors. In most genetic studies, researchers have a general idea of how many true predictors to expect. Our software encourages experimentation and asks the user to decide on the right balance between model completeness and quick computation.

This article, like most papers, has its antecedents. In particular, Shi *et al.* (2006, 2007, 2008); Uh *et al.* (2007) and Park and Hastie (2008) make substantial progress in adapting the lasso to logistic regression and to the discovery of interactions. Malo *et al.* (2008) apply ridge regression to distinguish causative from non-causative SNPs in a small region. Schwender and Ickstadt (2008) and Kooperberg and Ruczinski (2005) identify interactions using logic regression. These and other relevant papers are reviewed by Liang and Kelemen (2008). We focus on a coordinate descent algorithm because it appears to be the fastest available. Competing algorithms for lasso penalized logistic regression include non-negative quadratic programming (Sha *et al.*, 2007), quadratic approximations (Lee *et al.*, 2006) and interior point methods (Koh *et al.*, 2007). Friedman *et al.* (2008) compared coordinate descent with several competing algorithms and concluded that it performs the best.

The specific contributions made in this article include (i) the consistent use of the lasso penalty for both marginal and interaction predictors, (ii) selection of the tuning constant to give a fixed number of predictors, (iii) application of cyclic coordinate ascent in maximizing the lasso penalized loglikelihood, (iv) rigorous pre-selection of a working set of predictors and (v) application of false discovery rates for global significance. Our overall strategy combines fast computing with good recovery of the dominant predictors.

In the remainder of the article, Section 2 fleshes out our statistical approach to data. In particular, it covers the lasso penalized logistic model, selection of the tuning constant, cyclic coordinate ascent and assessment of significance for both marginal and interaction predictors. The procedures are summarized as follows:

- prescreening by a score criterion (Section 2.6);
- selection of the tuning constant λ for a fixed number of predictors by bracketing and bisection (Section 2.2);
- parameter estimation via cyclic coordinate descent (Section 2.5);
- significance assessment based on leave-one-out (LOO) indices (Section 2.3) and false discovery rate (FDR) (Section 2.7);
- lasso identification and quantification of interactions among previously selected features (Section 2.4).

Section 3 evaluates the method on simulated data. Section 4 applies the method to real data on coeliac disease. Finally, Section 5 summarizes the advantages and limitations of lasso penalized logistic regression in association testing, puts our specific findings into the larger context of current research and mentions the availability of relevant software.

## 2 METHODS

### 2.1 Lasso penalized logistic regression

In case–control studies, the dichotomous response variable *y*_{i} is typically coded as 1 for cases and 0 for controls. By analogy to ordinary linear regression, in linear logistic regression we write the probability *p*_{i}=Pr(*y*_{i}=1) of case *i* given the predictor vector *x*_{i} as

$$p_i = \frac{e^{\mu + x_i^t \beta}}{1 + e^{\mu + x_i^t \beta}}. \tag{2}$$

The parameter vector θ=(μ, β_{1},…, β_{p})^{t} is usually estimated by maximizing the loglikelihood

$$L(\theta) = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \right]. \tag{3}$$

To encourage sparse solutions, we subtract a lasso penalty from the loglikelihood as just suggested. For the purposes of this article, we consider only additive models where the range of the predictors *x*_{ij} is restricted to the three values −1, 0 and 1, corresponding to the three SNP genotypes 1/1, 1/2 and 2/2, respectively. A dominant model can be achieved by collapsing the genotypes 1/1 and 1/2, and a recessive model can be achieved by collapsing genotypes 1/2 and 2/2. In both models the assigned quantitative values are −1 and 1. In our experience, the set of markers entering the model is relatively insensitive to the genetic model assumptions. We recommend standardizing all non-SNP quantitative predictors to have mean 0 and variance 1.
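As a minimal sketch of the objective described in this section, the following evaluates the logistic loglikelihood minus the lasso penalty for additively coded genotypes. The names `ADDITIVE_CODE` and `penalized_loglikelihood` are hypothetical, and the data layout (a list of rows) is an assumption made for illustration.

```python
import math

# Additive coding from the text: genotypes 1/1, 1/2, 2/2 -> -1, 0, 1.
ADDITIVE_CODE = {"1/1": -1.0, "1/2": 0.0, "2/2": 1.0}


def penalized_loglikelihood(mu, beta, X, y, lam):
    """Logistic loglikelihood minus lam * sum_j |beta_j| (intercept unpenalized).

    X is a list of predictor rows, y a list of 0/1 responses.
    """
    loglik = 0.0
    for x, yi in zip(X, y):
        eta = mu + sum(b * xj for b, xj in zip(beta, x))
        # y*eta - ln(1 + e^eta) equals y*ln(p) + (1 - y)*ln(1 - p);
        # ln(1 + e^eta) is computed in a numerically stable way.
        loglik += yi * eta - max(eta, 0.0) - math.log1p(math.exp(-abs(eta)))
    return loglik - lam * sum(abs(b) for b in beta)


if __name__ == "__main__":
    X = [[ADDITIVE_CODE[g]] for g in ("1/1", "2/2", "1/2")]
    y = [0, 1, 1]
    print(penalized_loglikelihood(0.0, [0.0], X, y, lam=1.0))
```

With all parameters at zero every fitted probability is 1/2, so the value printed is 3 ln(1/2) and the penalty contributes nothing.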

### 2.2 Selection of the tuning constant λ

For a given value of the tuning constant λ, maximizing the penalized loglikelihood singles out a certain number of predictors with non-zero regression coefficients. Let *r*(λ) denote the number of predictors selected. If we reduce λ and relax the penalty, then more predictors can enter the model. Although minor exceptions occasionally occur, *r*(λ) is basically a decreasing function of λ with jumps of size 1. Hence, once a predictor enters the model, it usually remains in the model as λ decreases. Although a predictor's order of entry tends to be correlated with its marginal significance, violations of this rule of thumb occur with correlated predictors. For every integer *s*≤*p*, we assume that there is an interval *I*_{s} on which *r*(λ)=*s*. One can quickly find a point in *I*_{s} by a combination of bracketing and bisection. In bracketing, we start with a guess λ. If *r*(λ)=*s*, we are done. If *r*(λ)<*s* and *a*∈(0, 1), then there is a positive integer *j* such that *r*(*a*^{j} λ)≥*s*. If *r*(λ)>*s* and *b*>1, then there is a positive integer *k* such that *r*(*b*^{k} λ)≤*s*. In practice, we set *a*=1/2 and *b*=2 and take the smallest integer *j* or *k* yielding the second bracketing point. Once we have a bracketing interval [λ_{l}, λ_{u}], we employ bisection. This involves testing the midpoint λ_{m}=(λ_{l}+λ_{u})/2. There are three possibilities: if *r*(λ_{m})=*s*, we are done; if *r*(λ_{m})<*s*, we replace λ_{u} by λ_{m}; and if *r*(λ_{m})>*s*, we replace λ_{l} by λ_{m}. In either of the latter two cases, we bisect again and continue. As soon as we hit a point in *I*_{s}, we halt the process.

The primary danger in bracketing is visiting a λ with *r*(λ) very large. To limit the damage from a poor choice of λ, we abort optimization of the objective function whenever the search process encounters too many non-zero predictors. Since predictors can enter and leave the model repeatedly prior to convergence, this check is delayed for several iterations, say 10. As a further safeguard, we set the maximum number of non-zero predictors allowed well above the desired number of predictors *s*. In practice we use *s*+10.
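The bracketing and bisection search can be sketched as follows. Here `r` is a hypothetical callable that returns the number of selected predictors for a given λ (in practice each call would run a full penalized fit), and the sketch assumes that `r` is non-increasing in λ. The early abort when too many predictors enter is not modelled.

```python
def find_lambda(r, s, lam, a=0.5, b=2.0, max_steps=60):
    """Return a lambda with r(lambda) == s by bracketing, then bisection."""
    count = r(lam)
    if count == s:
        return lam
    lower = upper = lam
    if count < s:
        # Too few predictors: shrink lambda by the factor a until r >= s.
        for _ in range(max_steps):
            upper, lower = lower, lower * a
            count = r(lower)
            if count >= s:
                break
        else:
            raise RuntimeError("bracketing failed")
        if count == s:
            return lower
    else:
        # Too many predictors: inflate lambda by the factor b until r <= s.
        for _ in range(max_steps):
            lower, upper = upper, upper * b
            count = r(upper)
            if count <= s:
                break
        else:
            raise RuntimeError("bracketing failed")
        if count == s:
            return upper
    # Bisection on [lower, upper], where r(lower) > s > r(upper).
    for _ in range(max_steps):
        mid = 0.5 * (lower + upper)
        count = r(mid)
        if count == s:
            return mid
        if count < s:
            upper = mid
        else:
            lower = mid
    raise RuntimeError("bisection failed")


if __name__ == "__main__":
    # Toy r(lambda): predictor j enters once lambda drops below its threshold.
    thresholds = [9.0, 7.5, 4.0, 3.9, 1.2, 0.5]
    r = lambda lam: sum(t > lam for t in thresholds)
    lam = find_lambda(r, 3, 20.0)
    print(lam, r(lam))
```

The toy thresholds were chosen without ties so that an interval with exactly three predictors exists; with tied entry points the search would exhaust `max_steps` and raise.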

In simpler settings, cross-validation is used to find the best value of λ. Recall that in *k*-fold cross-validation, one divides the data into *k* equal batches (subsamples) and estimates parameters *k* times, leaving one batch out per time. The testing error for each omitted batch is computed using the estimates derived from the remaining batches, and the cross-validation curve *c*(λ) is computed by averaging testing error across the *k* batches. The curve *c*(λ) can be quite ragged, and many values of λ must be tried to find its minimum. To avoid this time-consuming process, we let the desired number of predictors drive statistical analysis. In actual gene mapping studies, geneticists would be thrilled to map even 5 or 10 genes. In our coeliac disease example, it is necessary to consider a larger number of predictors to uncover the full biological truth.

### 2.3 Assessing significance

When SNPs are tested one by one, it is easy to assign a *P*-value to a SNP by conducting a likelihood ratio test. If we ignore non-genetic predictors such as age, sex and diet, then the only relevant parameters are the intercept μ and the slope β of the SNP. The null hypothesis β=0 can be tested by maximizing the loglikelihood under the null and alternative hypotheses and forming twice the difference in maximum loglikelihoods. This statistic is asymptotically distributed as a χ^{2}-distribution with 1 degree of freedom. Collectively, the *P*-values must be corrected for multiple testing, either by a Bonferroni correction or some version of a FDR correction. The latter choice is more appropriate when we anticipate a fairly large number of true positives. We will say more about FDR corrections later. A more compelling concern is that proceeding SNP by SNP omits the impact of other SNPs. Most statisticians prefer to assess significance in the context of multiple linear regression rather than simple linear regression. They resist this natural impulse in association studies because of the computational barriers and the mismatch between numbers of observations and predictors.

In our multivariate setting, we compare the standard SNP by SNP *P*-values with alternative *P*-values generated by considering the *s* selected predictors as a whole. Once we have selected the *s* model predictors, we discard the non-selected predictors and re-estimate parameters for the selected predictors with λ=0. Since *s* is small, say 10 to 20 in our numerical studies, re-estimation is now a fully determined problem. We then undertake *s* further rounds of estimation, omitting each of the selected predictors in turn. These actions put us into position to conduct likelihood ratio tests by leaving one predictor out at a time. It is tempting to assign *P*-values by comparing the resulting likelihood ratio statistics to the percentile points of a χ^{2}-distribution with 1 degree of freedom. This is invalid because it neglects the complex selection procedure for defining the reduced model in the first place. Nonetheless, these LOO *P*-values are helpful in assessing the correlations between the retained predictors in the reduced model. To avoid confusion, we will refer to the LOO *P*-values as LOO indices. The contrast between the univariate *P*-values and the LOO indices is instructive. Although both of these measures are defective and should not be taken too seriously, they are defective in different ways and together give a better idea of the truth.
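A minimal sketch of the LOO computation follows. It refits the unpenalized model (λ=0) by cyclic one-dimensional Newton steps, omits each selected predictor in turn, and converts twice the drop in maximum loglikelihood into a χ^{2} tail probability with 1 degree of freedom, using the identity Pr(χ^{2}_{1}>*x*)=erfc(√(*x*/2)). The function names are hypothetical, the solver has no safeguards against separated data, and, as the text stresses, the resulting numbers are indices rather than valid *P*-values.

```python
import math


def _sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def fit_logistic(X, y, sweeps=500, tol=1e-10):
    """Maximum loglikelihood of the unpenalized model with intercept.

    X is a list of rows (lists, possibly empty), y a list of 0/1 responses.
    """
    n, p = len(y), len(X[0])
    # The first column is the intercept; the rest are the predictors.
    cols = [[1.0] * n] + [[row[j] for row in X] for j in range(p)]
    eta = [0.0] * n  # linear predictors
    for _ in range(sweeps):
        biggest = 0.0
        for col in cols:
            probs = [_sigmoid(e) for e in eta]
            grad = sum(x * (yi - pi) for x, yi, pi in zip(col, y, probs))
            hess = sum(x * x * pi * (1.0 - pi) for x, pi in zip(col, probs))
            if hess <= 0.0:
                continue
            step = grad / hess
            eta = [e + step * x for e, x in zip(eta, col)]
            biggest = max(biggest, abs(step))
        if biggest < tol:
            break
    # y*eta - ln(1 + e^eta) equals y*ln(p) + (1 - y)*ln(1 - p).
    return sum(yi * e - max(e, 0.0) - math.log1p(math.exp(-abs(e)))
               for yi, e in zip(y, eta))


def loo_indices(X, y):
    """One LOO index per column of X (the s selected predictors)."""
    full = fit_logistic(X, y)
    indices = []
    for j in range(len(X[0])):
        reduced = fit_logistic([row[:j] + row[j + 1:] for row in X], y)
        stat = max(0.0, 2.0 * (full - reduced))  # likelihood ratio statistic
        indices.append(math.erfc(math.sqrt(stat / 2.0)))
    return indices
```

Calling `loo_indices(X, y)` on the *s* selected columns requires *s*+1 fits, which is cheap because *s* is small.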

### 2.4 Interaction effects

As mentioned previously, we advocate testing for interactions after identifying main effects. This strategy is prompted by the sobering number of interactions possible. With *p* predictors, there are $\binom{p}{k}$ *k*-way interactions, and 2^{p} interactions in all. With hundreds of thousands of SNPs, it is impossible even to examine all two-way interactions. These problems disappear once we focus on a handful of interesting marginal predictors. However, our commitment to a two-stage strategy brings in its wake certain technical problems.

First, there is the combinatorial question of how to generate all subsets of {1,…, *r*} up to a given size. Fortunately, good algorithms for this task already exist. Minor changes to the NEXKSB code in Nijenhuis and Wilf (1978) permit one to generate one subset after another, with smaller subsets coming before larger subsets. Thus, when the number of predictors *r* retained from stage one is too large to generate all subsets, one can easily visit all lower order interactions and bypass higher order interactions. Second, there is the problem of storing the interaction predictors. We finesse this problem by computing interaction products on the fly. Third, there is the question of how to integrate SNP predictors with other predictors such as sex, age and environmental exposures. Since this is largely a programming problem, we omit further discussion of it. Fourth, our interactions do not involve any self-interactions. Inclusion of self-interactions would force us to pass from subsets to multisets. For SNPs the potential gain seems worth less than the bother. Other predictors such as age have a richer range of values, so it may be useful to add predictors such as age squared, age cubed and so forth to the original list of predictors. Finally, there are the problems of model selection and hypothesis testing for the interaction effects. Here, again it seems reasonable to rely on lasso penalized estimation and LOO indices.
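The subset enumeration and the on-the-fly products can be sketched with the Python standard library. This uses `itertools.combinations` as a stand-in for the modified NEXKSB routine, not a port of it, and the function names are hypothetical. Indices are 0-based, and subsets of size 1 (the marginal predictors) are included so that main effects stay in contention.

```python
from itertools import combinations
from math import prod


def interaction_subsets(r, max_order):
    """Yield subsets of {0, ..., r-1} of size 1..max_order, smaller ones first."""
    for size in range(1, max_order + 1):
        yield from combinations(range(r), size)


def interaction_value(row, subset):
    """Product of the selected predictors for one subject, computed on demand."""
    return prod(row[j] for j in subset)


if __name__ == "__main__":
    row = [1, -1, 0]  # one subject's values for r = 3 retained predictors
    for subset in interaction_subsets(3, 2):
        print(subset, interaction_value(row, subset))
```

Because the generator never materializes the full list of subsets and the products are recomputed when needed, no interaction columns have to be stored.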

### 2.5 Cyclic coordinate ascent algorithm

In linear logistic regression, maximum likelihood estimates are usually found by the scoring algorithm. This requires the score and observed information

$$\nabla L(\theta) = \sum_{i=1}^{n} (y_i - p_i) \begin{pmatrix} 1 \\ x_i \end{pmatrix}, \qquad -d^2 L(\theta) = \sum_{i=1}^{n} p_i (1 - p_i) \begin{pmatrix} 1 \\ x_i \end{pmatrix} \begin{pmatrix} 1 \\ x_i \end{pmatrix}^t$$

of the loglikelihood (3). Because scoring coincides with Newton's method, it is fast and reliable, and most statisticians would agree that it is the method of choice for low-dimensional problems. Its Achilles heel is the need to invert the observed information at each iteration. If we add to this drawback the complication of dealing with the non-differentiable lasso penalty, then it becomes abundantly clear that competing algorithms should be considered in association analysis.

The oldest and simplest alternative, coordinate ascent, updates one parameter at a time. Coordinate ascent comes in two flavors, cyclic and greedy (Wu and Lange, 2008). In cyclic coordinate ascent, each parameter is updated in turn; in greedy coordinate ascent, the parameter leading to the greatest increase in the objective function is updated. Although greedy coordinate ascent makes faster initial progress in logistic regression, it suffers from excess overhead. For this reason we will confine our attention to cyclic coordinate ascent.

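Although the logistic loglikelihood (3) is non-linear, it has the compensating property of concavity. Concavity fortunately carries over to the lasso penalized loglikelihood

$$g(\theta) = L(\theta) - \lambda \sum_{j=1}^{p} |\beta_j| \tag{4}$$

because the sum of two concave functions is concave. The objective function *g*(θ) is non-differentiable, but it does possess a directional derivative along each forward or backward coordinate direction. For instance, if *u*_{j} is the coordinate direction along which β_{j} varies, then

$$d_{u_j} g(\theta) = \lim_{\tau \downarrow 0} \frac{g(\theta + \tau u_j) - g(\theta)}{\tau} = d_{u_j} L(\theta) + \begin{cases} -\lambda & \beta_j \ge 0 \\ \lambda & \beta_j < 0, \end{cases}$$

and for *v*_{j}=−*u*_{j}

$$d_{v_j} g(\theta) = \lim_{\tau \downarrow 0} \frac{g(\theta + \tau v_j) - g(\theta)}{\tau} = d_{v_j} L(\theta) + \begin{cases} \lambda & \beta_j > 0 \\ -\lambda & \beta_j \le 0. \end{cases}$$

When a function such as *L*(θ) is differentiable, its directional derivative along *u*_{j} coincides with its ordinary partial derivative, and its directional derivative along *v*_{j}=−*u*_{j} coincides with the negative of its ordinary partial derivative.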

To update a single parameter of the objective function *g*(θ), we use one-dimensional scoring. This works well for the intercept parameter μ because there is no lasso penalty. For a slope parameter β_{j}, the lasso penalty intervenes, and particular care must be exercised near the origin. In fact, it simplifies matters to start scoring at the origin. Here, we test the directional derivatives $d_{u_j} g(\theta)$ and $d_{v_j} g(\theta)$. If both are non-positive, then *g*(θ) cannot be increased by moving away from the origin. This claim follows from the concavity of *g*(θ). If one of the directional derivatives $d_{u_j} g(\theta)$ and $d_{v_j} g(\theta)$ is positive and the other is non-positive, then progress can be made along the corresponding arm of *g*(θ), and scoring is commenced until convergence is achieved along that arm. Concavity rules out the possibility that both directional derivatives are positive. A simple sketch of a concave function will convince the reader of this assertion.

In practice, we start all parameters at the origin. In underdetermined problems, the vast majority of slopes β_{j} are permanently parked there. Only those with considerable evidence in their favor can overcome the pressure of the lasso pushing them toward the origin. Even those that escape this pressure can be forced back to the origin as other more potent predictors enter the model. It is clearly computationally beneficial to organize parameter updates by tracking the linear predictor μ+*x*_{i}^{t}β of each case. These start at 0, and when a single component of θ is updated, it is trivial to update the linear predictors.

### 2.6 The score criterion and efficient computations

In Section 2.2, we demonstrated that the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs. Once the value of the tuning constant λ is fixed, the penalized likelihood is quickly maximized by cyclic coordinate ascent to give us the desired number of non-zero coefficients. However, since we face a very large number of SNP predictors, it would be much more efficient if we could start our search procedure by focusing on a substantially smaller set of features that are more likely to be associated with the response. We accomplish this by a ‘swindle’ that screens the predictors according to a simple score criterion.

The score equations of the loglikelihood (4) for linear logistic regression define part of the Karush–Kuhn–Tucker (KKT) conditions (Lange, 2004)

$$\sum_{i=1}^{n} x_{ij} \left[ y_i - p_i(\lambda) \right] = \lambda \, \mathrm{sgn}(\beta_j), \qquad \beta_j \ne 0, \tag{5}$$

$$\left| \sum_{i=1}^{n} x_{ij} \left[ y_i - p_i(\lambda) \right] \right| \le \lambda, \qquad \beta_j = 0, \tag{6}$$

for optimality in the penalized model. Here *p*_{i}(λ) is the fitted probability for observation *i*, fit using the indicated value of λ. For very large λ, all the β_{j} are estimated as zero, and the only non-trivial parameter is the intercept μ, which is unpenalized. If *p*_{0} is the overall proportion of cases in the data, then the intercept is estimated as $\hat{\mu} = \ln[p_0/(1-p_0)]$ for large λ.

We accordingly define the following initial absolute score:

$$a_j = \left| \sum_{i=1}^{n} x_{ij} (y_i - p_0) \right| \tag{7}$$

for each predictor. Note that *a*_{j} determines the standard score statistic for testing the null model β_{j}=0 with μ fixed at $\hat{\mu}$. The first predictor to enter the lasso penalized model as λ decreases is the predictor with the largest value of *a*_{j}.

These considerations suggest a screening device for models with large numbers of SNPs. Because we insist on tuning the lasso penalty to select just a handful of predictors, the final absolute scores are apt to correlate strongly with the precomputed absolute scores. Thus, if we desire *s* predictors, we take *k* to be a reasonably large multiple of *s*, say *k*=10*s*, sort the *a*_{j}, and extract the *k* predictors with the largest values of *a*_{j}. Call this subset *S*_{k}. We now subject *S*_{k} to our estimation procedure and choose a value λ_{k} to give us exactly *s* predictors. The selected predictors satisfy the KKT conditions (5) and (6). If the predictors omitted from *S*_{k} also satisfy the KKT condition (6), then we have found the global maximum for the given value λ_{k} and stop. If one of the omitted predictors fails the KKT condition (6), we replace *k* by 2*k*, say, and repeat the process. Eventually, the KKT conditions are satisfied by all predictors. Since the KKT conditions are sufficient as well as necessary for a global maximum, this process legalizes the swindle. Often the value 10*s* works. When it does not, usually just a few doublings suffice. For example, if the desired number of predictors is *s*=10, in stage one we fit a model with 100 predictors. When stage two is needed, we fit a model with 200 predictors, and so forth. If there are hundreds of thousands of SNPs, our swindle saves an enormous amount of computing with no loss in rigor.

Of course, the swindle sets *S*_{k} may contain highly correlated features with redundant information. This turns out to be the case with the HLA SNPs in our coeliac example. Fortunately, most of the redundant features are discarded by the lasso penalty. Our numerical results, for instance those displayed in Table 1, confirm that the swindle dramatically speeds up computation while preserving model selection results.
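The screening loop can be sketched as follows. The initial score of a predictor is taken to be the absolute value of ∑_{i}*x*_{ij}(*y*_{i}−*p*_{0}), consistent with the description in this section, and an omitted predictor is flagged when the absolute value of ∑_{i}*x*_{ij}[*y*_{i}−*p*_{i}(λ)] exceeds λ. The callback `fit` is hypothetical: it stands for any solver that, given the screened columns, returns the tuning constant it settled on and the fitted probabilities. The small `slack` tolerance is also an assumption.

```python
def initial_scores(X, y):
    """Absolute null-model score of each predictor, with p0 the case fraction."""
    p0 = sum(y) / len(y)
    return [abs(sum(row[j] * (yi - p0) for row, yi in zip(X, y)))
            for j in range(len(X[0]))]


def kkt_violators(X, y, probs, lam, omitted, slack=1e-8):
    """Omitted predictors (beta_j = 0) whose absolute score exceeds lam."""
    bad = []
    for j in omitted:
        score = sum(row[j] * (yi - pi) for row, yi, pi in zip(X, y, probs))
        if abs(score) > lam + slack:
            bad.append(j)
    return bad


def screened_fit(X, y, s, fit, multiple=10):
    """Fit on the k best-scoring predictors, doubling k until all omitted
    predictors pass the KKT check.

    fit(X_sub, y, s) -> (lam, probs) is a hypothetical solver callback.
    Returns the indices of the working set and the tuning constant.
    """
    p = len(X[0])
    scores = initial_scores(X, y)
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    k = min(p, multiple * s)
    while True:
        keep = ranked[:k]
        lam, probs = fit([[row[j] for j in keep] for row in X], y, s)
        if not kkt_violators(X, y, probs, lam, ranked[k:]):
            return keep, lam
        k = min(p, 2 * k)  # once k reaches p nothing is omitted and we stop
```

The check over omitted predictors costs one pass through the data per predictor, which is far cheaper than including them all in the coordinate ascent iterations.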

### 2.7 Computation of FDR

The score swindle also has implications for the assessment of the FDR for the univariate *P*-values. We will not pursue these delicate connections here because in practice most geneticists demand that all univariate tests be done. Fortunately, it takes just a few minutes of computing time to carry out the univariate logistic regressions encountered in a modern association study. Even substituting likelihood ratio tests for score tests does not change this fact.

In the Simes procedure highlighted by Benjamini and Hochberg (1995) in their analysis of FDR, there are *n* null hypotheses *H*_{1},…, *H*_{n} and *n* corresponding *P*-values *P*_{1},…, *P*_{n}. The latter are replaced by their order statistics *P*_{(1)},…, *P*_{(n)}. If for a given α≥0, we choose the largest integer *j* such that *P*_{(i)}≤(*i*/*n*)α for all *i*≤*j*, then we can reject the hypotheses *H*_{(1)},…, *H*_{(j)} at an FDR of α or better. This procedure is justified in theory when the tests are independent or positively correlated. In the presence of linkage equilibrium, association tests are independent; in the presence of linkage disequilibrium, they are positively correlated. For a more detailed discussion of the multiple testing issues in SNP studies, see Nyholt (2004).
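The rule as worded above can be sketched in a few lines. The function name is hypothetical, and the sketch follows the text literally by requiring the inequality for every *i*≤*j*, so it stops at the first ordered *P*-value that fails its threshold.

```python
def simes_rejections(pvalues, alpha):
    """Indices of the rejected hypotheses, in order of increasing P-value.

    Finds the largest j such that P_(i) <= (i/n)*alpha for all i <= j.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    j = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / n * alpha:
            j = rank
        else:
            break  # the condition must hold for every i <= j
    return order[:j]


if __name__ == "__main__":
    pvals = [0.001, 0.8, 0.012, 0.04, 0.3]
    print(simes_rejections(pvals, alpha=0.05))
```

In this example the thresholds are 0.01, 0.02, 0.03, and so on; the two smallest *P*-values pass and the third (0.04) fails, so the hypotheses with indices 0 and 2 are rejected.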

## 3 ANALYSIS OF SIMULATED DATA

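To evaluate the performance of lasso penalized regression in association testing, we focus on underdetermined problems where the number of predictors *p* far exceeds the number of observations *n*. Our simulation model

$$\ln \frac{p_i}{1 - p_i} = \mu + \sum_{j=1}^{p} \beta_j x_{ij} + \sum_{k<l} \eta_{kl} x_{ik} x_{il} \tag{8}$$

involves both marginal effects and two-way interactions. For ease of simulation, we assume that each predictor vector *x*_{i} is derived from a realization of a multivariate normal vector *Y*_{i} whose marginals are standard normal and whose covariances are

$$\mathrm{Cov}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & j = k \\ \rho & j \ne k,\ j \le 10 \text{ and } k \le 10 \\ 0 & \text{otherwise.} \end{cases}$$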

Thus, only the first 10 predictors are correlated. To mimic a SNP with equal allele frequencies, we set *x*_{ij} equal to −1, 0 or 1 according to whether *Y*_{ij}<−*c*, −*c*≤*Y*_{ij}≤*c*, or *Y*_{ij}>*c*. The cutoff −*c* is the first quartile of a standard normal distribution. In every simulation, we set μ=1, β_{j}=1 for 1≤*j*≤5, and β_{j}=0 for *j*>5. We also set η_{kl}=0 except for the special cases η_{12}=η_{34}=0.5. These substantial effect sizes allow us to discern signal from noise in fairly small samples.
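One way to generate data of this kind is sketched below. The shared-factor construction for the equicorrelated block (√ρ times a common normal deviate plus √(1−ρ) times an independent one, valid for 0≤ρ≤1) is an assumption of this sketch and need not match the authors' simulation code; the response is drawn from a logistic model with the stated μ, β_{j} and η_{kl}. All function names are hypothetical, and at least five predictors are assumed.

```python
import math
import random
from statistics import NormalDist


def simulate_genotypes(n, p, rho, rng):
    """n rows of p pseudo-SNPs coded -1, 0, 1; the first 10 have correlation rho
    on the underlying normal scale."""
    c = NormalDist().inv_cdf(0.75)  # -c is the first quartile of N(0, 1)
    X = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # common factor for the correlated block
        row = []
        for j in range(p):
            z = rng.gauss(0.0, 1.0)
            if j < 10:
                z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * z
            row.append(-1 if z < -c else (1 if z > c else 0))
        X.append(row)
    return X


def simulate_response(X, rng, mu=1.0):
    """Case/control status: beta_j = 1 for the first five predictors and
    eta_12 = eta_34 = 0.5 (0-based indices 0..4, pairs (0, 1) and (2, 3))."""
    y = []
    for x in X:
        eta = mu + sum(x[:5]) + 0.5 * x[0] * x[1] + 0.5 * x[2] * x[3]
        prob = 1.0 / (1.0 + math.exp(-eta))
        y.append(1 if rng.random() < prob else 0)
    return y


if __name__ == "__main__":
    rng = random.Random(0)
    X = simulate_genotypes(500, 5000, 0.5, rng)
    y = simulate_response(X, rng)
    print(sum(y), "cases among", len(y), "subjects")
```

With the quartile cutoff, the three codes occur with probabilities 1/4, 1/2 and 1/4, which are the Hardy–Weinberg proportions for a SNP with equal allele frequencies.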

To ameliorate the shrinkage of the non-zero estimates for a particular λ, we always re-estimate the selected parameters in the final model, omitting the non-selected parameters and the lasso penalty. This yields better parameter estimates for testing purposes. We compute LOO indices as mentioned earlier and contrast them to univariate *P*-values based on estimating the impact of each predictor without reference to the other predictors.

We analyzed the simulated data in two stages. In stage one, we considered only main effects and selected *s*_{1} predictors. In stage two, we discarded the non-selected predictors and sought *s*_{2} marginal effects or interactions among the selected predictors. The sensible choice *s*_{2}≥*s*_{1} permits all predictors singled out in stage one to remain in contention as marginal effects in stage two. Because virtually all association studies yield only a handful of predictors that can be replicated, we took *s*_{1} and *s*_{2} small and considered the specific pairs (*s*_{1}, *s*_{2})=(10, 10), (10, 20), (20, 10), (20, 20). Table 1 summarizes our results over 50 random replicates for various choices of the number of predictors *p*, the number of subjects *n* and the correlation coefficient ρ. Table 1 reports the average values of the tuning constants λ_{1} and λ_{2}, the average number of true predictors *K*_{true,1} and *K*_{true,2} found and the average computing times in seconds. The subscripts 1 and 2 refer to the first and second stages. The standard error of each average appears in parentheses.

The last two columns of Table 1 summarize computing times with and without our computational swindle. Forgoing the swindle inflates all times in Table 1. For *p*=5000 the differences are not too noticeable, but for *p*=100000 it takes 10 to 20 times longer to reach the lasso solution without the swindle.

The results in Table 1 for the choice (*s*_{1}, *s*_{2})=(10, 20) appear best. In general, we recommend using a substantially larger *s*_{2} than *s*_{1}. Performance degrades as we pass from uncorrelated to highly correlated predictors. More iterations are needed for convergence, and the fraction of true predictors captured falls. With a large enough sample size, performance is perfect. Table 1 in our submitted Supplementary Materials displays our results for a single representative sample with *p*=50 000, *n*=2000, ρ=0 and (*s*_{1}, *s*_{2})=(10, 20). At stage one, all five true predictors are correctly selected with impressive univariate *P*-values and LOO indices. At stage two, all five main effects and both interaction effects are selected. In both instances, the univariate *P*-values and LOO indices of the true predictors are much smaller than the corresponding values for the false predictors.

It is also instructive to consider what happens in the simulated data with *p*=5000, *n*=500 and ρ=0 when the stage one tuning constant λ_{1} varies. Figure 1 plots six things as a function of λ_{1}: (i) the number of predictors selected at stage one, (ii) the number of predictors selected at stage two, (iii) the number of true predictors selected at stage one, (iv) the number of true predictors selected at stage two, (v) the FDR at stage one and (vi) the FDR at stage two. In stage two, we set the tuning constant λ_{2}=25. In counting true predictors, we consider only marginal predictors at stage one and marginal plus interaction predictors at stage two. When we know the true predictors, estimating FDR is trivial, and the Simes procedure can be ignored. Inspection of the six plots shows that all true predictors are recovered for a fairly broad range of λ_{1} values. As λ_{1} decreases, more predictors enter the model, and FDR increases.

## 4 ANALYSIS OF COELIAC DATA

### 4.1 Data description

In the British coeliac data of van Heel *et al.* (2007), *p*=310 637 SNPs are typed on *n*=2200 subjects (938 males and 1262 females). Controls outnumber cases 1422 to 778. Across the sample, an impressive 99.875% of all genotypes are assigned; no individual has more than 10% missing data. We impute missing genotypes at a SNP by the method sketched in Ayers and Lange (2008). Only 32 SNPs show a minor allele frequency below 1%; these are dropped from further analysis.

### 4.2 Simulation study based on coeliac data

We also tested our method by conducting a simulation study based on the coeliac data. Here in model (8), we took μ=−3, β_{j}=1 for gender, rs3737728 (SNP2), rs9651273 (SNP4) and rs4970362 (SNP9), and β_{j}=0 for the remaining SNPs. We also set η_{kl}=2 for the interaction of gender and rs3934834 (SNP1) and the interaction of SNP4 and SNP9; all other η_{kl} we set to 0. Notice that SNP1 has no marginal effect even though it interacts with gender in determining the response. The lower right-hand block of the correlation matrix

indicates fairly strong linkage disequilibrium among the three marginally important SNPs. Table 2 summarizes Fisher's exact test for Hardy–Weinberg equilibrium on the four SNPs (Lazzeroni and Lange, 1998). A total of 10 000 random tables were sampled to approximate *P*-values at each SNP.

Following our previous plan of analysis, we varied the numbers of predictors (*s*_{1}, *s*_{2}) in the model. The best results summarized in Table 3 reflect the sensible choice (*s*_{1}, *s*_{2})=(10, 20). At stage one, all four true predictors are correctly selected. In stage two all four main effects are selected, and both interaction effects are selected for the vast majority of the 50 random replicates.

Our success with the additive model was partially replicated when we simulated under dominant and recessive models. In the dominant model, we score a SNP predictor as 1 if the number of minor alleles is 1 or 2; otherwise we score it as −1. In the recessive model, we score a SNP predictor as 1 if the number of minor alleles is 2; otherwise we score it as −1. The last two rows of Table 3 report our analysis results for the dominant and recessive models. The results under the dominant model are nearly as good as those under the additive model. Since the numbers of predictor values equal to 1 and −1 are better balanced under the dominant model, it is hardly surprising that the recessive model does worse.
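The three codings, and the balance argument in the last sentence, can be illustrated as follows. The sketch assumes that the additive code is the minor allele count minus 1 (that is, allele 2 is the minor allele) and that genotypes are in Hardy–Weinberg proportions when computing the expected fraction of subjects scored 1; both assumptions and the function names are for illustration only.

```python
def code_genotype(minor_alleles: int, model: str = "additive") -> int:
    """Score a genotype from its number of minor alleles (0, 1 or 2)."""
    if minor_alleles not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1 or 2")
    if model == "additive":
        return minor_alleles - 1
    if model == "dominant":
        return 1 if minor_alleles >= 1 else -1
    if model == "recessive":
        return 1 if minor_alleles == 2 else -1
    raise ValueError(f"unknown model: {model}")


def fraction_scored_one(q: float, model: str) -> float:
    """Expected fraction of subjects scored 1 for minor allele frequency q,
    assuming Hardy-Weinberg proportions."""
    if model == "dominant":
        return 1.0 - (1.0 - q) ** 2
    if model == "recessive":
        return q ** 2
    raise ValueError("defined only for the dominant and recessive models")


if __name__ == "__main__":
    for model in ("dominant", "recessive"):
        print(model, fraction_scored_one(0.3, model))
```

For a minor allele frequency of 0.3, about 51% of subjects are scored 1 under the dominant coding but only 9% under the recessive coding, which shows how much less balanced the recessive predictor is.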

### 4.3 Results of real data analysis

Replicating earlier results with antigenic markers, van Heel *et al.* (2007) find overwhelming evidence for association in the human leukocyte antigen (HLA) region of chromosome 6. SNP rs2187668 in the first intron of HLA-DQA1 has the strongest association, followed by SNPs rs9357152 and rs9275141 within or adjacent to HLA-DQB1. van Heel *et al.* also identify a more weakly associated region on chromosome 4 centered on SNPs rs13119723 and rs6822844 in the *KIAA1109-TENR-IL2-IL21* linkage disequilibrium block. Their results are reproduced in our supplementary Table 2. The *P*-values listed in the table are univariate *P*-values taking one SNP at a time.

We now examine several models with different numbers of desired predictors. Since the grand mean μ always enters the model first, we omit it from further discussion. In model 0 with one predictor mandated, SNP rs2187668 in the HLA region of chromosome 6 is selected. This SNP has the smallest univariate *P*-value (9.48 × 10^{−191}) among all the 310 605 SNPs tested. In model 1 with five predictors mandated, we identify four HLA SNPs in addition to rs2187668. In model 2 with 10 predictors mandated, once again we recover only HLA SNPs from chromosome 6; these results are summarized in Table 3 of our submitted Supplementary Materials. Univariate *P*-values appear in column 4 and LOO indices in column 5 of the table. It is striking how different the univariate *P*-values and LOO indices are for these SNPs. This phenomenon is just another manifestation of the high linkage disequilibrium among the SNPs. The estimated FDRs for the selected SNPs are all much smaller than 0.01. In model 3 with 50 predictors mandated, we finally see predictors outside the HLA region. Table 4 records the non-HLA predictors identified. Here, univariate *P*-values differ less from LOO indices because the SNPs are largely uncorrelated.

We find similarities and differences between the van Heel *et al.* (2007) results and our results. Almost all of the SNPs in Table 4 with univariate *P*-values below 10^{−4} are singled out by van Heel *et al.* (2007). The one exception is SNP rs1499447 on chromosome 8, which they dismiss because of irregularities in genotyping. We find different SNPs in the KIAA1109-TENR-IL2-IL21 block on chromosome 4. This is the region that replicates well in their Dutch and Irish samples. Our failure to identify the same SNPs in the KIAA1109-TENR-IL2-IL21 block is hardly a disaster; the region and ultimately the underlying gene are more important than the individual SNPs. It is noteworthy that among the 1000 most significant SNPs listed by van Heel *et al.* (2007), 979 are in the HLA region. Since SNPs in the HLA region on chromosome 6 are highly correlated with coeliac disease, model 4 with 10 mandated predictors removes the HLA SNPs, with the aim of finding associated SNPs outside the HLA region. Table 4 in our submitted Supplementary Materials now picks up SNPs on chromosomes 9, 11, 14 and 18 that do not appear in Table 4. Removing all chromosome 6 SNPs rather than just HLA SNPs leads to virtually the same results as displayed in Supplementary Table 4.

To test for interactions, we take the *s*_{1}=50 predictors selected in model 3 and examine all marginal and two-way effects. The total number of predictors is $50+\binom{50}{2}=1275$, and we keep *s*_{2}=50 predictors in the model. Most of the 50 selected predictors have LOO indices close to one. Table 5 lists the marginal and interaction predictors with LOO indices less than 0.01. Several of these interactions are interesting. Given the predominance of female patients, the interaction between gender and one of the HLA SNPs is credible. The interactions between two HLA SNPs and SNPs on chromosomes 2, 3 and 8 are more surprising. It is particularly noteworthy that the univariate *P*-values for these three SNPs as marginal effects (Table 4) are far less impressive than their univariate *P*-values as interaction effects (Table 5).
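The size of the second-stage model follows from simple counting. Assuming the candidate set consists of exactly the *s*_{1}=50 selected predictors, there are 50 marginal terms plus one product term for each unordered pair. The sketch below enumerates them; the predictor labels and helper names are placeholders invented for the example.

```python
from itertools import combinations
from math import comb


def expand_with_interactions(names):
    """List every marginal term followed by every two-way product term."""
    marginal = [(name,) for name in names]
    pairwise = list(combinations(names, 2))
    return marginal + pairwise


def term_value(term, row):
    """Value of a marginal or product term for one subject's predictor values."""
    value = 1
    for name in term:
        value *= row[name]
    return value


names = [f"x{i}" for i in range(1, 51)]      # placeholder labels for 50 predictors
terms = expand_with_interactions(names)
print(len(terms), 50 + comb(50, 2))
```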

## 5 DISCUSSION

Our analysis of simulated data demonstrates that lasso penalized regression is easily capable of identifying pertinent predictors in grossly underdetermined problems. Computational speed is impressive. If predictors are uncorrelated, then interaction effects can be found readily as well. As one might expect, correlations among important predictors degrade computational speed and the recognition of interactions. For very large datasets involving more than, say, 10^{9} total SNP genotypes, data compression is mandatory. Repeated decompression of chunks of the data then slows computation. Our computational swindle circumvents this problem because all of the working predictors easily fit within memory.

The coeliac dataset of van Heel *et al.* (2007) is challenging for two reasons. First, the overwhelming HLA signal masks the weaker signals coming from other chromosome regions. Second, the HLA SNPs are in strong linkage disequilibrium and hence highly correlated. Linkage disequilibrium manifests itself as increased LOO indices and significant two-way interactions. Despite these handicaps, lasso penalized regression identifies several promising non-HLA regions and interaction effects. Our results for chromosome 4 differ slightly from those of van Heel *et al.* (2007) because we impute missing genotypes differently. Ayers and Lange (2008) introduce a new penalized method of haplotype frequency estimation that enforces parsimony and achieves both speed and accuracy. When phase can be deduced from relatives, this extra information can be included in estimation. Finally, it is noteworthy that van Heel *et al.* have validated the chromosome 4 association on two further datasets.

One can quibble with our method of picking candidate predictors for interaction modeling. An obvious alternative would be to look for two-way interactions between the top *s* predictors and all other predictors. This tactic requires little change in our numerical methods.

Readers may want to compare our approach with the approach of Shi *et al.* (2006, 2007, 2008). One major difference is our application of cyclic coordinate ascent. A second major difference is that we always select a fixed number of predictors. These choices allow us to quickly process a very large number of SNPs or interactions among SNPs. The path-following algorithm of Park and Hastie (2008) has the advantage of revealing the exact sequence in which predictors enter the model. Path following is more computationally demanding than simply finding the best *r* predictors, but note that their software [glmpath in R, Park and Hastie (2007)] can quickly post-process the best *r* predictors discovered.
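To make the two design choices concrete, here is a small, self-contained sketch of cyclic coordinate ascent for lasso penalized logistic regression combined with a search over the penalty constant that stops once a requested number of predictors is nonzero. It is an illustration under stated assumptions, not the Mendel code: each coordinate update maximizes a quadratic lower bound based on the curvature bound p(1−p) ≤ 1/4 (a simple substitute for whatever one-dimensional update the production software uses), the search over the penalty is plain bisection, and the function names and toy data are invented.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso_logistic(X, y, lam, sweeps=200):
    """Cyclic coordinate ascent; the intercept mu is not penalized."""
    n, p = len(X), len(X[0])
    ybar = sum(y) / n
    mu = math.log(ybar / (1.0 - ybar))       # intercept-only fit; needs both classes
    beta = [0.0] * p
    eta = [mu] * n                           # current linear predictors
    curv = [sum(row[j] ** 2 for row in X) / 4.0 for j in range(p)]
    for _ in range(sweeps):
        grad = sum(yi - sigmoid(e) for yi, e in zip(y, eta))
        step = grad / (n / 4.0)
        mu += step
        eta = [e + step for e in eta]
        for j in range(p):
            if curv[j] == 0.0:
                continue
            grad = sum(row[j] * (yi - sigmoid(e))
                       for row, yi, e in zip(X, y, eta))
            new = soft_threshold(curv[j] * beta[j] + grad, lam) / curv[j]
            delta = new - beta[j]
            if delta != 0.0:
                beta[j] = new
                eta = [e + delta * row[j] for e, row in zip(eta, X)]
    return mu, beta


def select_predictors(X, y, r, iters=40):
    """Bisect on lam until exactly r slopes are nonzero (if attainable)."""
    n, p = len(X), len(X[0])
    ybar = sum(y) / n
    # At or above this value of lam every slope is zero.
    hi = max(abs(sum(row[j] * (yi - ybar) for row, yi in zip(X, y)))
             for j in range(p))
    lo = 0.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        _, beta = fit_lasso_logistic(X, y, lam)
        count = sum(b != 0.0 for b in beta)
        if count == r:
            break
        if count > r:
            lo = lam                         # too many predictors: penalize more
        else:
            hi = lam                         # too few predictors: penalize less
    return lam, [j for j, b in enumerate(beta) if b != 0.0]


# Toy data: only the first predictor is associated with the outcome.
x0 = [1, 1, 1, -1, -1, -1, -1, 1]
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
X = list(zip(x0, x1, x2))
y = [1, 1, 1, 1, 0, 0, 0, 0]
lam, chosen = select_predictors(X, y, r=1)
print(lam, chosen)
```

Because each fit touches one coefficient at a time and most coefficients stay at zero, the cost per sweep is modest; requesting a fixed count *r* replaces the choice of a penalty value with a quantity that is easier to interpret.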

We have featured univariate *P*-values and LOO indices in this article, but neither measure is ideal. Although FDR analysis is valuable, no one has said the last word on multiple testing (Balding, 2006; Kimmel and Shamir, 2006). For instance, some form of generalized cross-validation may ultimately prove useful. As a matter of principle, most geneticists would not accept a single study as definitive. All important findings are subject to replication. This attitude, whether justified or not, puts the onus on finding the most important SNPs rather than on declaring their global significance. Our approach to data analysis is motivated by this consideration. The software discussed here will be made available in the next release of Mendel.

*Funding*: USPHS (GM53275, MH59490 to K.L. in part).

*Conflict of Interest*: none declared.

## REFERENCES

- Ayers KL, Lange K. Penalized estimation of haplotype frequencies. Bioinformatics. 2008;24:1596–1602. [PubMed]
- Balding DJ. A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 2006;7:781–791. [PubMed]
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B. 1995;57:289–300.
- Chen SS, et al. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 1998;20:33–61.
- Claerbout JF, Muir F. Robust modeling with erratic data. Geophysics. 1973;38:826–844.
- Daubechies I, et al. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 2004;57:1413–1457.
- Friedman J, et al. Pathwise coordinate optimization. Ann. Appl. Stat. 2007;1:302–332.
- Friedman J, et al. Regularized Paths for Generalized Linear Models Via Coordinate Descent. Department of Statistics, Stanford University; 2008. [PMC free article] [PubMed]
- Fu WJ. Penalized regressions: the bridge versus the lasso. J. Comput. Graph. Stat. 1998;7:397–416.
- Kimmel G, Shamir R. A fast method for computing high-significance disease association in large population-based studies. Am. J. Hum. Genet. 2006;79:481–492. [PMC free article] [PubMed]
- Koh K, et al. An interior-point method for large-scale l1-regularized logistic regression. J. Mach. Learn. Res. 2007;8:1519–1555.
- Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genet. Epidemiol. 2005;28:157–170. [PubMed]
- Lange K. Optimization. New York: Springer; 2004.
- Lazzeroni LC, Lange K. A conditional inference framework for extending the transmission/disequilibrium test. Hum. Hered. 1998;48:67–81. [PubMed]
- Lee S-I, et al. Efficient *L*_{1} regularized logistic regression. Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06). 2006. Available at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.64.1993.
- Liang Y, Kelemen A. Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases. Stat. Surv. 2008;2:43–60.
- Malo N, et al. Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am. J. Hum. Genet. 2008;82:375–385. [PMC free article] [PubMed]
- Nijenhuis A, Wilf HS. Combinatorial Algorithms for Computers and Calculators. 2. New York: Academic Press; 1978.
- Nyholt DR. A simple correction for multiple testing for SNPs in linkage disequilibrium with each other. Am. J. Human. Genet. 2004;74:765–769. [PMC free article] [PubMed]
- Park MY, Hastie T. L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. R package. 2007 Available at http://bm2.genes.nig.ac.jp/RGM2/pkg.php?p=glmpath.
- Park MY, Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9:30–50. [PubMed]
- Santosa F, Symes WW. Linear inversion of band-limited reflection seismograms. SIAM J. Sci. Stat. Comput. 1986;7:1307–1330.
- Schwender H, Ickstadt K. Identification of SNP interactions using logic regression. Biostatistics. 2008;9:187–198. [PubMed]
- Sha F, et al. Multiplicative Updates for L1-Regularized Linear and Logistic Regression, Lecture Notes in Computer Science. Springer; 2007.
- Shi W, et al. Technical Report 1131. Madison: University of Wisconsin; 2006. Lasso-Patternsearch Algorithm with Application to Ophthalmology Data.
- Shi W, et al. Detecting disease causing genes by LASSO-patternsearch algorithm. BMC Proc. 2007;1(Suppl. 1):S60. [PMC free article] [PubMed]
- Shi W, et al. Technical Report 1141. Madison: University of Wisconsin; 2008. LASSO-Patternsearch Algorithm with Applications to Ophthalmology and Genomic Data.
- Taylor HL, et al. Deconvolution with the ℓ_{1} norm. Geophysics. 1979;44:39–52.
- Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. 1996;58:267–288.
- Uh H-W, et al. Model selection based on logistic regression in a highly correlated candidate gene region. BMC Proc. 2007;1(Suppl. 1):S114. [PMC free article] [PubMed]
- van Heel D, et al. A genome-wide association study for celiac disease identifies risk variants in the region harboring IL2 and IL21. Nat. Genet. 2007;39:827–829. [PMC free article] [PubMed]
- Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2008;2:224–244.

**Oxford University Press**

Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009 Mar 15; 25(6):714.
