
# HIGH DIMENSIONAL VARIABLE SELECTION

^{*}The authors thank the Associate Editor and referees for helpful comments.

## Abstract

This paper explores the following question: what kind of statistical guarantees can be given when doing variable selection in high dimensional models? In particular, we look at the error rates and power of some multi-stage regression methods. In the first stage we fit a set of candidate models. In the second stage we select one model by cross-validation. In the third stage we use hypothesis testing to eliminate some variables. We refer to the first two stages as “screening” and the last stage as “cleaning.” We consider three screening methods: the lasso, marginal regression, and forward stepwise regression. Our method gives consistent variable selection under certain conditions.

**Keywords:** Lasso, Stepwise Regression, Sparsity

## 1. Introduction

Several methods have been developed lately for high dimensional linear regression such as the lasso (Tibshirani 1996), Lars (Efron et al. 2004) and boosting (Bühlmann 2006). There are at least two different goals when using these methods. The first is to find models with good prediction error. The second is to estimate the true “sparsity pattern,” that is, the set of covariates with nonzero regression coefficients. These goals are quite different and this paper will deal with the second goal. (Some discussion of prediction is in the appendix.) Other papers on this topic include Meinshausen and Bühlmann (2006), Candes and Tao (2007), Wainwright (2006), Zhao and Yu (2006), Zou (2006), Fan and Lv (2008), Meinshausen and Yu (2008), Tropp (2004, 2006), Donoho (2006) and Zhang and Huang (2006). In particular, the current paper builds on ideas in Meinshausen and Yu (2008) and Meinshausen (2007).

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be iid observations from the regression model

$$Y_i = X_i^T\beta + \epsilon_i, \qquad i = 1, \ldots, n,$$

where $\epsilon_i \sim N(0, \sigma^2)$, $X_i = (X_{i1}, \ldots, X_{ip})^T \in \mathbb{R}^p$ and $p = p_n > n$. Let $X$ be the $n \times p$ design matrix with $j$th column $X_{\bullet j} = (X_{1j}, \ldots, X_{nj})^T$ and let $Y = (Y_1, \ldots, Y_n)^T$. Let

$$D = \{j : \beta_j \neq 0\}$$

be the set of covariates with nonzero regression coefficients. Without loss of generality, assume that $D = \{1, \ldots, s\}$ for some $s$. A variable selection procedure $\widehat{D}_n$ maps the data into subsets of $S = \{1, \ldots, p\}$.

The main goal of this paper is to derive a procedure $\widehat{D}_n$ such that

$$\limsup_{n\to\infty} \mathbb{P}(\widehat{D}_n \cap D^c \neq \emptyset) \le \alpha,$$

that is, the asymptotic type I error is no more than $\alpha$. Note that throughout the paper we use ⊂ to denote non-strict set inclusion. Moreover, we want $\widehat{D}_n$ to have nontrivial power. Meinshausen and Bühlmann (2006) control a different error measure. Their method guarantees $\limsup_{n\to\infty} \mathbb{P}(\widehat{D}_n \cap V \neq \emptyset) \le \alpha$ where $V$ is the set of variables not connected to $Y$ by any path in an undirected graph.

Our procedure involves three stages. In stage I we fit a suite of candidate models, each model depending on a tuning parameter *λ*,

In stage II we select one of those models $\widehat{S}_n$ using cross-validation to select $\widehat{\lambda}$. In stage III we eliminate some variables by hypothesis testing. Schematically:

Genetic epidemiology provides a natural setting for applying screen and clean. Typically the number of subjects, *n*, is in the thousands, while *p* ranges from tens of thousands to hundreds of thousands of genetic features. The number of genes exhibiting a detectable association with a trait is extremely small. Indeed, for Type I diabetes only ten genes have exhibited a reproducible signal (Wellcome Trust 2007). Hence it is natural to assume that the true model is sparse. A common experimental design involves a 2-stage sampling of data, with stages 1 and 2 corresponding to the screening and cleaning processes, respectively.

In stage 1 of a genetic association study, *n*_{1} subjects are sampled and one or more traits such as bone mineral density are recorded. Each subject is also measured at *p* locations on the chromosomes. These genetic covariates usually have two forms in the population due to variability at a single nucleotide and hence are called single nucleotide polymorphisms (SNPs). The distinct forms are called alleles. Each covariate takes on a value (0, 1 or 2) indicating the number of copies of the less common allele observed. For a well designed genetic study, individual SNPs are nearly uncorrelated unless they are physically located in very close proximity. This feature makes it much easier to draw causal inferences about the relationship between SNPs and quantitative traits. It is standard in the field to infer that an association discovered between a SNP and a quantitative trait implies a causal genetic variant is physically located near the one exhibiting association. In stage 2, *n*_{2} subjects are sampled at a subset of the SNPs assessed in stage 1. SNPs measured in stage 2 are often those that achieved a test statistic that exceeded a predetermined threshold of significance in stage 1. In essence, the two stage design pairs naturally with a screen and clean procedure.

For the screen and clean procedure it is essential that $\widehat{S}_n$ has two properties as $n \to \infty$

and

where $|M|$ denotes the number of elements in a set $M$. Condition (3) ensures the validity of the test in stage III while condition (4) ensures that the power of the test is not too small. Without condition (3), the hypothesis test in stage III would be biased. We will see that the power goes to 1, so taking $\alpha = \alpha_n \to 0$ implies consistency: $\mathbb{P}(\widehat{D}_n = D) \to 1$. For fixed $\alpha$, the method also produces a confidence sandwich for $D$, namely,

To fit the suite of candidate models, we consider three methods. In Method 1,

$$\widehat{S}_n(\lambda) = \{j : \tilde{\beta}_j(\lambda) \neq 0\},$$

where $\tilde{\beta}_j(\lambda)$ is the lasso estimator, the value of $\beta$ that minimizes

In Method 2, take $\widehat{S}_n(\lambda)$ to be the set of variables chosen by forward stepwise regression after $\lambda$ steps. In Method 3, marginal regression, we take

$$\widehat{S}_n(\lambda) = \{j : |\widehat{\mu}_j| \ge \lambda\},$$

where $\widehat{\mu}_j$ is the marginal regression coefficient from regressing $Y$ on $X_j$. (This is equivalent to ordering by the absolute t-statistics since we will assume that the covariates are standardized.) These three methods are very similar to basis pursuit, orthogonal matching pursuit and thresholding; see, for example, Tropp (2004, 2006) and Donoho (2006).

### Notation

Let $\psi = \min_{j \in D} |\beta_j|$. Define the loss of any estimator $\widehat{\beta}$ by

$$L(\widehat{\beta}) = (\widehat{\beta} - \beta)^T\, \widehat{\Sigma}_n\, (\widehat{\beta} - \beta) \tag{5}$$

where $\widehat{\Sigma}_n = n^{-1} X^T X$. For convenience, when $\widehat{\beta} \equiv \widehat{\beta}(\lambda)$ depends on $\lambda$ we write $L(\lambda)$ instead of $L(\widehat{\beta}(\lambda))$. If $M \subset S$, let $X_M$ be the design matrix with columns $(X_{\bullet j} : j \in M)$ and let $\widehat{\beta}_M = (X_M^T X_M)^{-1} X_M^T Y$ denote the least squares estimator, assuming it is well-defined. Note that our use of $X_{\bullet j}$ differs from standard ANOVA notation. Write $X_\lambda$ instead of $X_M$ when $M = \widehat{S}_n(\lambda)$. When convenient, we extend $\widehat{\beta}_M$ to length $p$ by setting $\widehat{\beta}_M(j) = 0$ for $j \notin M$. We use the norms:

If *C* is any square matrix, let *φ*(*C*) and Φ(*C*) denote the smallest and largest eigenvalues of *C*. Also, if *k* is an integer define

We will write $z_u$ for the upper quantile of a standard Normal, so that $\mathbb{P}(Z > z_u) = u$ where $Z \sim N(0, 1)$.

Our method will involve splitting the data randomly into three groups. For ease of notation, assume the total sample size is 3*n* and that the sample size of each group is *n*.

### Summary of Assumptions

We will use the following assumptions throughout except in Section 8.

- (A1) $Y_i = X_i^T\beta + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$, for $i = 1, \ldots, n$.
- (A2) The dimension $p_n$ of $X$ satisfies $p_n \to \infty$ and $p_n \le c_1 e^{n^{c_2}}$ for some $c_1 > 0$ and $0 \le c_2 < 1$.
- (A3) $s \equiv |\{j : \beta_j \neq 0\}| = O(1)$ and $\psi = \min\{|\beta_j| : \beta_j \neq 0\} > 0$.
- (A4) There exist positive constants $C_0$, $C_1$ and $\kappa$ such that $\mathbb{P}(\limsup_{n\to\infty} \Phi_n(n) \le C_0) = 1$ and $\mathbb{P}(\liminf_{n\to\infty} \phi_n(C_1 \log n) \ge \kappa) = 1$. Also, $\mathbb{P}(\phi_n(n) > 0) = 1$ for all $n$.
- (A5) The covariates are standardized: $\mathbb{E}(X_{ij}) = 0$ and $\mathbb{E}(X_{ij}^2) = 1$. Also, there exists $0 < B < \infty$ such that $\mathbb{P}(|X_{jk}| \le B) = 1$.

For simplicity, we include no intercepts in the regressions. The assumptions can be weakened at the expense of more complicated proofs. In particular, we can let *s* increase with *n* and *ψ* decrease with *n*. Similarly, the Normality and constant variance assumptions can be relaxed.

## 2. Error Control

Define the type I error rate $q(\widehat{D}_n) = \mathbb{P}(\widehat{D}_n \cap D^c \neq \emptyset)$ and the asymptotic error rate $\limsup_{n\to\infty} q(\widehat{D}_n)$. We define the power $\pi(\widehat{D}_n) = \mathbb{P}(D \subset \widehat{D}_n)$ and the average power

It is well known that controlling the error rate is difficult for at least three reasons: correlation of covariates, high dimensionality of the covariate and unfaithfulness (cancellations of correlations due to confounding). Let us briefly review these issues.

It is easy to construct examples where $q(\widehat{D}_n) \le \alpha$ implies that $\pi(\widehat{D}_n) \approx \alpha$. Consider two models for random variables $Z = (Y, X_1, X_2)$:

| Model 1 | Model 2 |
| --- | --- |
| $X_1 \sim N(0, 1)$ | $X_2 \sim N(0, 1)$ |
| $Y = \psi X_1 + N(0, 1)$ | $Y = \psi X_2 + N(0, 1)$ |
| $X_2 = \rho X_1 + N(0, \tau^2)$ | $X_1 = \rho X_2 + N(0, \tau^2)$ |

Under models 1 and 2, the marginal distribution of *Z* is *P*_{1} = *N* (**0**, Σ_{1}) and *P*_{2} = *N* (**0**, Σ_{2}) where

Given any $\epsilon > 0$ we can choose $\rho$ sufficiently close to 1 and $\tau$ sufficiently close to 0 such that $\Sigma_1$ and $\Sigma_2$ are as close as we like and hence $d(P_1^n, P_2^n) < \epsilon$ where $d$ is total variation distance. It follows that

Thus, if *q* ≤ *α* then the power is less than *α* + *ε*.
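As a numerical illustration of why the two models are hard to tell apart, the following Python sketch computes the covariance matrices of $Z = (Y, X_1, X_2)$ implied by the model equations in the table and reports how far apart they are. The function names and the choice $\psi = 1$ are assumptions made for this illustration only; the matrices are derived here from the table, not copied from the paper.

```python
import numpy as np

def sigma_model1(psi, rho, tau):
    """Covariance of (Y, X1, X2) when X1 ~ N(0,1), Y = psi*X1 + N(0,1),
    X2 = rho*X1 + N(0, tau^2)."""
    return np.array([
        [psi**2 + 1.0, psi, psi * rho],
        [psi,          1.0, rho],
        [psi * rho,    rho, rho**2 + tau**2],
    ])

def sigma_model2(psi, rho, tau):
    """Same structure with the roles of X1 and X2 exchanged."""
    s = sigma_model1(psi, rho, tau)
    swap = [0, 2, 1]  # reorder (Y, X2, X1) back to (Y, X1, X2)
    return s[np.ix_(swap, swap)]

def max_abs_gap(psi, rho, tau):
    """Largest entrywise difference between the two covariance matrices."""
    diff = sigma_model1(psi, rho, tau) - sigma_model2(psi, rho, tau)
    return float(np.max(np.abs(diff)))

if __name__ == "__main__":
    for rho, tau in [(0.5, 0.5), (0.9, 0.1), (0.999, 0.001)]:
        print(rho, tau, max_abs_gap(1.0, rho, tau))
```

The gap shrinks as $\rho \to 1$ and $\tau \to 0$, which is the sense in which no procedure can reliably tell which of $X_1$ or $X_2$ is the relevant covariate.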

Dimensionality is less of an issue thanks to recent methods. Most methods, including those in this paper, allow $p_n$ to grow exponentially. But all the methods require some restrictions on the number $s$ of nonzero $\beta_j$'s. In other words, some sparsity assumption is required. In this paper we take $s$ fixed and allow $p_n$ to grow.

False negatives can occur during screening due to cancellations of correlations. For example, the correlation between $Y$ and $X_1$ can be 0 even when $\beta_1$ is huge. This problem is called unfaithfulness in the causality literature; see Spirtes, Glymour and Scheines (2001) and Robins, Spirtes, Scheines and Wasserman (2003). False negatives during screening can lead to false positives during the second stage.

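Let $\widehat{\mu}_j$ denote the regression coefficient from regressing $Y$ on $X_j$. Fix $j \le s$ and note that

$$\mu_j \equiv \mathbb{E}(\widehat{\mu}_j) = \beta_j + \sum_{k \in D,\, k \neq j} \beta_k \rho_{kj},$$

where $\rho_{kj} = \mathrm{corr}(X_k, X_j)$. If

$$\sum_{k \in D,\, k \neq j} \beta_k \rho_{kj} \approx -\beta_j$$

then $\mu_j \approx 0$ no matter how large $\beta_j$ is. This problem can occur even when $n$ is large and $p$ is small.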

For example, suppose that $\beta = (10, -10, 0, 0)$ and that $\rho(X_i, X_j) = 0$ except that $\rho(X_1, X_2) = \rho(X_1, X_3) = \rho(X_2, X_4) = 1 - \epsilon$ where $\epsilon > 0$ is small. Then

Marginal regression is extremely susceptible to unfaithfulness. The lasso and forward stepwise, less so. However, unobserved covariates can induce unfaithfulness in all the methods.
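The arithmetic behind this example can be checked with a short sketch. It assumes standardized covariates, so that the population marginal coefficient is $\mu_j = \sum_k \beta_k \rho_{kj}$; the helper name `marginal_coefficients` and the value $\epsilon = 0.01$ are illustrative assumptions. The sketch only carries out this arithmetic and does not check that the stated correlations form a valid correlation matrix.

```python
import numpy as np

def marginal_coefficients(beta, corr):
    """Population marginal coefficients mu_j = sum_k beta_k * rho_kj,
    assuming standardized covariates (an assumption of this sketch)."""
    return np.asarray(corr, dtype=float) @ np.asarray(beta, dtype=float)

eps = 0.01                      # illustrative value of the small epsilon
r = 1.0 - eps
beta = np.array([10.0, -10.0, 0.0, 0.0])
corr = np.eye(4)
for a, b in [(0, 1), (0, 2), (1, 3)]:   # rho(X1,X2), rho(X1,X3), rho(X2,X4)
    corr[a, b] = corr[b, a] = r

mu = marginal_coefficients(beta, corr)
# The relevant covariates X1, X2 get tiny marginal coefficients,
# while the irrelevant X3, X4 get large ones.
print(mu)
```

Under these assumptions the two relevant covariates have marginal coefficients of size $10\epsilon$, while the two irrelevant ones have size $10(1 - \epsilon)$, so a marginal screen would rank the wrong variables first.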

## 3. Loss and Cross-validation

Let $X_\lambda = (X_{\bullet j} : j \in \widehat{S}_n(\lambda))$ denote the design matrix corresponding to the covariates in $\widehat{S}_n(\lambda)$ and let $\widehat{\beta}(\lambda)$ be the least squares estimator for the regression restricted to $\widehat{S}_n(\lambda)$, assuming the estimator is well defined. Hence, $\widehat{\beta}(\lambda) = (X_\lambda^T X_\lambda)^{-1} X_\lambda^T Y$. More generally, $\widehat{\beta}_M$ is the least squares estimator for any subset of variables $M$. When convenient, we extend $\widehat{\beta}(\lambda)$ to length $p$ by setting $\widehat{\beta}_j(\lambda) = 0$ for $j \notin \widehat{S}_n(\lambda)$.
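The restricted least squares estimator and its zero-padded extension to length $p$ can be sketched as follows. The function name `restricted_least_squares` is hypothetical, and the sketch uses NumPy's `lstsq` instead of forming $(X_M^T X_M)^{-1}$ explicitly, which gives the same estimate when $X_M$ has full column rank.

```python
import numpy as np

def restricted_least_squares(X, y, subset):
    """Least squares fit using only the columns in `subset`,
    returned as a length-p vector with zeros outside `subset`."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    subset = sorted(subset)
    beta = np.zeros(X.shape[1])
    if subset:
        # lstsq solves the least squares problem for the selected columns
        coef, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
        beta[subset] = coef
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 6))
    y = 2.0 * X[:, 1] - 3.0 * X[:, 4]      # noiseless toy response
    print(restricted_least_squares(X, y, {1, 4}))
```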

### 3.1. Loss

Now we record some properties of the loss function. The first part of the following lemma is essentially Lemma 3 of Meinshausen and Yu (2008).

#### Lemma 3.1

Let $\mathcal{M}_m^+ = \{M \subset S : |M| \le m,\ D \subset M\}$. Then,

Let $\mathcal{M}_m^- = \{M \subset S : |M| \le m,\ D \not\subset M\}$. Then,

### 3.2. Cross-validation

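Recall that the data have been split into three groups, each of size $n$. Construct $\widehat{\beta}(\lambda)$ from the first group and let

$$\widehat{L}(\lambda) = \frac{1}{n} \sum_{i} \left(Y_i - X_i^T \widehat{\beta}(\lambda)\right)^2,$$

where the sum runs over the observations in the second group.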

We would like $\widehat{L}(\lambda)$ to order the models the same way as the true loss $L(\lambda)$ (defined after equation (5)). This requires that, asymptotically, $\widehat{L}(\lambda) - L(\lambda) \approx \delta_n$ where $\delta_n$ does not involve $\lambda$. The following bounds will be useful. Note that $L(\lambda)$ and $\widehat{L}(\lambda)$ are both step functions that only change value when a variable enters or leaves the model.
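A minimal sketch of the cross-validation step, assuming that $\widehat{L}(\lambda)$ is the mean squared prediction error on the second group (as the proof of Theorem 3.2 suggests, where $\delta_n = \|\tilde{\epsilon}\|^2/n$). The function names and the dictionary of candidate fits are hypothetical.

```python
import numpy as np

def cv_loss(X_val, y_val, beta):
    """Mean squared prediction error of `beta` on the held-out group."""
    resid = np.asarray(y_val, dtype=float) - np.asarray(X_val, dtype=float) @ beta
    return float(np.mean(resid ** 2))

def select_by_cv(X_val, y_val, candidates):
    """Return the tuning value whose fitted coefficients minimize the
    held-out loss. `candidates` maps each lambda to a length-p vector
    fitted on the first group."""
    losses = {lam: cv_loss(X_val, y_val, b) for lam, b in candidates.items()}
    best = min(losses, key=losses.get)
    return best, losses
```

Because every candidate is scored on the same held-out observations, the noise term is common to all of them and only the ordering of the losses matters for the selection.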

#### Theorem 3.2

Suppose that $\max_{\lambda \in \Lambda_n} |\widehat{S}_n(\lambda)| \le k_n$. Then there exists a sequence of random variables $\delta_n = O_P(1)$ that do not depend on $\lambda$ or $X$, such that, with probability tending to 1,

## 4. Multi-Stage Methods

The multi-stage methods use the following steps. As mentioned earlier, we randomly split the data into three parts, which we take to be of equal size.

- Stage I. Use the first part to find $\widehat{S}_n(\lambda)$ for each $\lambda$.
- Stage II. Use the second part to find $\widehat{\lambda}$ by cross-validation and let $\widehat{S}_n = \widehat{S}_n(\widehat{\lambda})$.
- Stage III. Use the third part to find the least squares estimate $\widehat{\beta}$ for the model $\widehat{S}_n$. Let

$$\widehat{D}_n = \{j \in \widehat{S}_n : |T_j| > c_n\},$$

where $T_j$ is the usual t-statistic, $c_n = z_{\alpha/(2m)}$ and $m = |\widehat{S}_n|$.
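The cleaning stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `clean` is hypothetical, the variance is estimated by the usual residual mean square with $n - m$ degrees of freedom, and the critical value is taken from the standard Normal via `statistics.NormalDist`, matching $c_n = z_{\alpha/(2m)}$.

```python
import numpy as np
from statistics import NormalDist

def clean(X, y, selected, alpha=0.05):
    """Stage III sketch: fit least squares on the screened columns and keep
    those whose |t-statistic| exceeds c_n = z_{alpha/(2m)}, m = len(selected).
    Requires more observations than selected columns."""
    selected = sorted(selected)
    m = len(selected)
    if m == 0:
        return set()
    XM = np.asarray(X, dtype=float)[:, selected]
    y = np.asarray(y, dtype=float)
    n = XM.shape[0]
    gram_inv = np.linalg.inv(XM.T @ XM)
    beta_hat = gram_inv @ XM.T @ y
    resid = y - XM @ beta_hat
    sigma2_hat = resid @ resid / (n - m)        # usual unbiased variance estimate
    se = np.sqrt(sigma2_hat * np.diag(gram_inv))
    t_stats = beta_hat / se
    c_n = NormalDist().inv_cdf(1.0 - alpha / (2.0 * m))  # upper alpha/(2m) quantile
    return {j for j, t in zip(selected, t_stats) if abs(t) > c_n}
```

The data passed to `clean` should be the third part of the split, so that the t-statistics are computed on observations that played no role in choosing `selected`.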

### 4.1. The Lasso

The lasso estimator (Tibshirani 1996) $\tilde{\beta}(\lambda)$ minimizes

and let $\widehat{S}_n(\lambda) = \{j : \tilde{\beta}_j(\lambda) \neq 0\}$. Recall that $\widehat{\beta}(\lambda)$ is the least squares estimator using the covariates in $\widehat{S}_n(\lambda)$.
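A minimal sketch of lasso screening follows. It assumes the objective has the form $\sum_i (Y_i - X_i^T\beta)^2 + \lambda \sum_j |\beta_j|$; the display did not survive in this copy, so that scaling is an assumption of the sketch. It uses plain cyclic coordinate descent, not any particular library solver, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for ||y - X b||^2 + lam * ||b||_1
    (assumed objective scaling; columns of X must be nonzero)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    resid = y.copy()                       # residual y - X b for b = 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]        # remove coordinate j from the fit
            rho = X[:, j] @ resid
            b[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * b[j]        # put the updated coordinate back
    return b

def lasso_screen(X, y, lam):
    """Indices with a nonzero lasso coefficient, i.e. the screened set."""
    return {j for j, bj in enumerate(lasso_cd(X, y, lam)) if bj != 0.0}
```

Running `lasso_screen` over a grid of `lam` values gives the suite of candidate sets $\widehat{S}_n(\lambda)$; larger values of `lam` give smaller sets.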

Let $k_n = A \log n$ where $A > 0$ is a positive constant.

#### Theorem 4.1

Assume that (A1)–(A5) hold. Let $\Lambda_n = \{\lambda : |\widehat{S}_n(\lambda)| \le k_n\}$. Then:

1. The true loss overfits: $\mathbb{P}(D \subset \widehat{S}_n(\lambda_*)) \to 1$ where $\lambda_* = \mathrm{argmin}_{\lambda \in \Lambda_n} L(\lambda)$.
2. Cross-validation also overfits: $\mathbb{P}(D \subset \widehat{S}_n(\widehat{\lambda})) \to 1$ where $\widehat{\lambda} = \mathrm{argmin}_{\lambda \in \Lambda_n} \widehat{L}(\lambda)$.
3. Type I error is controlled: $\limsup_{n\to\infty} \mathbb{P}(D^c \cap \widehat{D}_n \neq \emptyset) \le \alpha$.

If we let $\alpha = \alpha_n \to 0$ then $\widehat{D}_n$ is consistent for variable selection.

#### Theorem 4.2

Assume that (A1)–(A5) hold. Let $\alpha_n \to 0$ and $\sqrt{n}\,\alpha_n \to \infty$. Then, the multi-stage lasso is consistent,

$$\mathbb{P}(\widehat{D}_n = D) \to 1.$$

The next result follows directly. The proof is thus omitted.

#### Theorem 4.3

Assume that (A1)–(A5) hold. Let $\alpha$ be fixed. Then $(\widehat{D}_n; \widehat{S}_n)$ forms a confidence sandwich:

#### Remark 4.4

This confidence sandwich is expected to be conservative in the sense that the coverage can be much larger than 1 − α.

### 4.2. Stepwise Regression

The version of stepwise regression we consider is as follows. Let $k_n = A \log n$ for some $A > 0$.

1. Initialize: Res $= Y$, $\lambda = 0$, $\widehat{Y} = 0$, and $\widehat{S}_n(\lambda) = \emptyset$.
2. Let $\lambda \leftarrow \lambda + 1$. Compute $\widehat{\mu}_j = n^{-1} \langle X_j, \mathrm{Res} \rangle$ for $j = 1, \ldots, p$.
3. Let $J = \mathrm{argmax}_j |\widehat{\mu}_j|$. Set $\widehat{S}_n(\lambda) = \widehat{S}_n(\lambda - 1) \cup \{J\}$. Set $\widehat{Y} = X_\lambda \widehat{\beta}(\lambda)$ where $\widehat{\beta}(\lambda) = (X_\lambda^T X_\lambda)^{-1} X_\lambda^T Y$ and let Res $= Y - \widehat{Y}$.
4. If $\lambda = k_n$ stop. Otherwise, go to step 2.
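The steps above can be sketched in Python as follows. The function name `forward_stepwise` is hypothetical, and the guard that prevents a column from being selected twice is an addition of this sketch (after an exact refit the residual is orthogonal to the selected columns, so the guard only matters for floating point noise).

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedy path: at each step add the column with the largest
    |n^{-1} <X_j, Res>|, then refit by least squares on the selected columns.
    Returns the list of selected sets S(1), ..., S(k)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    selected, path = [], []
    resid = y.copy()                     # Res = Y at initialization
    for _ in range(k):
        mu_hat = X.T @ resid / n
        if selected:
            mu_hat[selected] = 0.0       # guard: never re-pick a column
        j = int(np.argmax(np.abs(mu_hat)))
        selected.append(j)
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        resid = y - X[:, selected] @ coef
        path.append(set(selected))
    return path
```

The returned path plays the role of $\widehat{S}_n(\lambda)$ for $\lambda = 1, \ldots, k$.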

For technical reasons, we assume that the final estimator $x^T \widehat{\beta}$ is truncated to be no larger than $B$. Note that $\lambda$ is discrete and $\Lambda_n = \{0, 1, \ldots, k_n\}$.

#### Theorem 4.5

With $\widehat{S}_n(\lambda)$ defined as above, the statements of Theorems 4.1, 4.2 and 4.3 hold.

### 4.3. Marginal Regression

This is probably the oldest, simplest and most common method. It is quite popular in gene expression analysis. It used to be regarded with some derision but has enjoyed a revival. A version appears in a recent paper by Fan and Lv (2008). Let $\widehat{S}_n(\lambda) = \{j : |\widehat{\mu}_j| \ge \lambda\}$ where $\widehat{\mu}_j = n^{-1} \langle Y, X_{\bullet j} \rangle$.

Let $\mu_j = \mathbb{E}(\widehat{\mu}_j)$ and let $\mu_{(j)}$ denote the values of $\mu$ ordered by their absolute values:

#### Theorem 4.6

Let $k_n \to \infty$ with $k_n = o(\sqrt{n})$. Let $\Lambda_n = \{\lambda : |\widehat{S}_n(\lambda)| \le k_n\}$. Assume that

Then, the statements of Theorems 4.1, 4.2 and 4.3 hold.

The assumption (12) limits the degree of unfaithfulness (small partial correlations induced by cancellation of parameters). Large values of $k_n$ weaken assumption (12), thus making the method more robust to unfaithfulness, but at the expense of lower power. Fan and Lv (2008) make similar assumptions. They assume that there is a $C > 0$ such that $|\mu_j| \ge C|\beta_j|$ for all $j$, which rules out unfaithfulness. However, they do not explicitly relate the values of $\mu_j$ for $j \in D$ to the values outside $D$ as we have done. On the other hand, they assume that $Z = \Sigma^{-1/2} X$ has a spherically symmetric distribution. Under this assumption and their faithfulness assumption, they deduce that the $\mu_j$'s outside $D$ cannot strongly dominate the $\mu_j$'s within $D$. We prefer to simply make this an explicit assumption without placing distributional assumptions on $X$. At any rate, any method that uses marginal regressions as a starting point must make some sort of faithfulness assumptions to succeed.
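The marginal regression screen of this section can be sketched as follows. The function names are hypothetical; `top_k_screen` is an equivalent view in which the threshold is chosen so that $k$ columns are kept (ties aside).

```python
import numpy as np

def marginal_coefs(X, y):
    """mu_hat_j = n^{-1} <Y, X_j> for every column j."""
    X = np.asarray(X, dtype=float)
    return X.T @ np.asarray(y, dtype=float) / X.shape[0]

def marginal_screen(X, y, lam):
    """Screened set {j : |mu_hat_j| >= lam}."""
    mu_hat = marginal_coefs(X, y)
    return {int(j) for j in np.flatnonzero(np.abs(mu_hat) >= lam)}

def top_k_screen(X, y, k):
    """Keep the k columns with the largest |mu_hat_j|."""
    order = np.argsort(-np.abs(marginal_coefs(X, y)))
    return {int(j) for j in order[:k]}
```

Since each coefficient is computed from one column at a time, the screen never sees cancellations between correlated covariates, which is why some faithfulness assumption is needed.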

### 4.4. Modifications

Let us now discuss a few modifications of the basic method. First, consider splitting the data into only two groups. Then do these steps:

- Stage I. Find $\widehat{S}_n(\lambda)$ for $\lambda \in \Lambda_n$, where $|\widehat{S}_n(\lambda)| \le k_n$ for each $\lambda \in \Lambda_n$, using the first group.
- Stage II. Find $\widehat{\lambda}$ by cross-validation and let $\widehat{S}_n = \widehat{S}_n(\widehat{\lambda})$, using the first group.
- Stage III. Find the least squares estimate $\widehat{\beta}_{\widehat{S}_n}$ using the second group. Let $\widehat{D}_n = \{j \in \widehat{S}_n : |T_j| > c_n\}$ where $T_j$ is the usual t-statistic.

#### Theorem 4.7

Choosing

controls asymptotic type I error.

The critical value in (13) is hopelessly large and it does not appear it can be substantially reduced. We present this mainly to show the value of the extra data-splitting step. It is tempting to use the same critical value as in the tri-split case, namely, $c_n = z_{\alpha/(2m)}$ where $m = |\widehat{S}_n|$, but we suspect this will not work in general. However, it may work under extra conditions.

## 5. Application

As an example we illustrate an analysis based on part of the Osteoporotic Fractures in Men Study (MrOS, Orwoll et al. 2005). A sample of 860 men were measured at a large number of genes and outcome measures. We consider only 296 SNPs which span 30 candidate genes for bone mineral density. An aim of the study was to identify genes associated with bone mineral density that could help in understanding the genetic basis of osteoporosis in men. Initial analyses of this subset of the data revealed no SNPs with a clear pattern of association with the phenotype; however, three SNPs, numbered (67, 277, 289), exhibited some association in the screening of the data. To further explore the efficacy of the lasso screen and clean procedure we modified the phenotype to enhance this weak signal and then reanalyzed the data to see if we could detect this planted signal.

We were interested in testing for main effects and pairwise interactions in these data; however, including all interactions results in a model with 43,660 additional terms, which is not practical for this sample size. As a compromise we selected 2 SNPs per gene to model potential interaction effects. This resulted in a model with a total of 2066 potential coefficients, including 296 main effects and 1770 interaction terms. With this model our initial screen detected 10 terms, including the three enhanced signals, 2 other main effects and 5 interactions. After cleaning, the final model detected the 3 enhanced signals, and no other terms.
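The term counts quoted above follow from simple combinatorics, as this short check shows. The variable names are illustrative; the inputs (296 SNPs, 30 genes, 2 SNPs per gene) are taken from the text.

```python
from math import comb

n_snps = 296
n_genes = 30
snps_per_gene = 2          # SNPs retained per gene for interaction modelling

all_pairs = comb(n_snps, 2)                       # every pairwise interaction
kept_pairs = comb(n_genes * snps_per_gene, 2)     # interactions among retained SNPs
total_terms = n_snps + kept_pairs

print(all_pairs, kept_pairs, total_terms)  # 43660 1770 2066
```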

## 6. Simulations

To further explore the screen and clean procedures, we conducted simulation experiments with four models. For each model $Y_i = X_i^T\beta + \epsilon_i$ where the measurement errors, $\epsilon_i$ and $\epsilon_{ij}^*$, are iid Normal(0, 1) and the covariates $X_{ij}$'s are Normal(0, 1) (except for model D). Models differ in how $Y_i$ is linked to $X_i$ and the dependence structure of the $X_i$'s. Models A, B and C explore scenarios with moderate and large $p$, while Model D focuses on confounding and unfaithfulness.

- A. Null model: $\beta = (0, \ldots, 0)$ and the $X_{ij}$'s are iid.
- B. Triangle model: $\beta_j = \delta(10 - j)$, $j = 1, \ldots, 10$, $\beta_j = 0$, $j > 10$ and the $X_{ij}$'s are iid.
- C. Correlated Triangle model: as B, but with $X_{i(j+1)} = \rho X_{ij} + (1 - \rho^2)^{1/2} \epsilon_{ij}^*$ for $j > 1$, and $\rho = 0.5$.
- D. Unfaithful model: $Y_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i$, for $\beta_1 = -\beta_2 = 10$, where the $X_{ij}$'s are iid for $j \in \{1, 5, 6, 7, 8, 9, 10\}$, but $X_{i2} = \rho X_{i1} + \tau \epsilon_{i2}^*$, $X_{i3} = \rho X_{i1} + \tau \epsilon_{i10}^*$, and $X_{i4} = \rho X_{i2} + \tau \epsilon_{i11}^*$, for $\tau = 0.01$ and $\rho = 0.95$.
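Data from the triangle model and its correlated version could be generated along the following lines. This is a sketch under assumptions: the function names, the seed handling and the choice to start the Markov recursion at the first column are not from the paper, and the values of $n$, $p$ and $\delta$ are left to the caller.

```python
import numpy as np

def triangle_beta(p, delta):
    """beta_j = delta * (10 - j) for j = 1..10 and 0 afterwards (1-based j)."""
    beta = np.zeros(p)
    j = np.arange(1, min(p, 10) + 1)
    beta[: len(j)] = delta * (10 - j)
    return beta

def markov_design(n, p, rho, rng):
    """Columns follow X_{j+1} = rho * X_j + sqrt(1 - rho^2) * noise, which
    keeps unit variance; rho = 0 gives iid columns."""
    X = np.empty((n, p))
    X[:, 0] = rng.standard_normal(n)
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
    return X

def simulate(n, p, delta, rho, seed=0):
    """Draw (X, Y, beta) from the triangle model (rho = 0) or its
    correlated version (rho > 0), with Normal(0, 1) errors."""
    rng = np.random.default_rng(seed)
    X = markov_design(n, p, rho, rng)
    beta = triangle_beta(p, delta)
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta
```

Setting `delta = 0` in `simulate` with `rho = 0` reproduces the null model, since all coefficients are then zero and the columns are iid.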

We used a maximum model size of $k_n = n^{1/2}$ which technically goes beyond the theory but works well in practice. Prior to analysis the covariates are scaled so that each has mean 0 and variance 1. The tests were initially performed using a third of the data for each of the three stages of the procedure (Table 1, top half, 3 splits). For models A, B and C each approach has Type I error less than $\alpha$, except the stepwise procedure which has trouble with model C when $n = p = 100$. We also calculated the false positive rate and found it to be very low (about $10^{-4}$ when $p = 100$ and $10^{-5}$ when $p = 1000$) indicating that even when a Type I error occurs, only a very small number of terms are included erroneously. The lasso screening procedure exhibited a slight power advantage over the stepwise procedure. Both methods dominated the marginal approach. The Markov dependence structure in model C clearly challenged the marginal approach. For Model D none of the approaches controlled the Type I error.


To determine the sensitivity of the approach to using distinct data for each stage of the analysis, simulations were conducted screening on the first half of the data and cleaning on the second half (2 splits). The tuning parameter was selected using leave-one-out cross validation (Table 1, bottom half). As expected this approach led to a dramatic increase in the power of all the procedures. More surprising is the fact that the Type I error was near *α* or below for models A, B and C. Clearly this approach has advantages over data splitting and merits further investigation.

A natural competitor to the screen and clean procedure is a two-stage adaptive lasso (Zou, 2006). In our implementation we split the data and used one half for each stage of the analysis. At stage one, leave-one-out cross validation lasso screens the data. In stage two, the adaptive lasso, with weights $w_j = |\widehat{\beta}_j|^{-1}$, cleans the data. The tuning parameter for the lasso was again chosen using leave-one-out cross validation. Table 2 provides the size, power and false positive rate (FPR) for this procedure. Naturally, the adaptive lasso does not control the size of the test, but the FPR is small. The power of the test is greater than we found for our lasso screen and clean procedure, but this extra power comes at the cost of a much higher Type I error rate.

## 7. Proofs

Recall that if *A* is a square matrix then *φ*(*A*) and Φ(*A*) denote the smallest and largest eigenvalues of *A*. Throughout the proofs we make use of the following fact. If *v* is a vector and *A* is a square matrix then

We use the following standard tail bound: if $Z \sim N(0, 1)$ then $\mathbb{P}(|Z| > t) \le t^{-1} e^{-t^2/2}$. We will also use the following results about the lasso from Meinshausen and Yu (2008). Their results are stated and proved for fixed $X$ but, under the conditions (A1)–(A5), it is easy to see that their conditions hold with probability tending to one and so their results hold for random $X$ as well.
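The tail bound can be checked numerically with the standard library alone. The sketch below compares the exact two-sided tail probability, computed from the complementary error function, with the bound $t^{-1} e^{-t^2/2}$ at a few illustrative values of $t$; the function names are hypothetical.

```python
import math

def two_sided_tail(t):
    """Exact P(|Z| > t) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))

def tail_bound(t):
    """The bound t^{-1} * exp(-t^2 / 2) used in the proofs (t > 0)."""
    return math.exp(-t * t / 2.0) / t

for t in (0.5, 1.0, 2.0, 4.0):
    print(t, two_sided_tail(t), tail_bound(t))
```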

### Theorem 7.1 (Meinshausen and Yu, 2008)

Let $\tilde{\beta}(\lambda)$ be the lasso estimator.

- The squared error satisfies:

  $$\mathbb{P}\left(\|\tilde{\beta}(\lambda) - \beta\|_2^2 \le \frac{2\lambda^2 s}{n^2 \kappa^2} + \frac{c\, m \log p_n}{n\, \phi_n^2(m)}\right) \to 1 \tag{15}$$

  where $m = |\widehat{S}_n(\lambda)|$ and $c > 0$ is a constant.
- The size of $\widehat{S}_n(\lambda)$ satisfies

  $$\mathbb{P}\left(|\widehat{S}_n(\lambda)| \le \frac{\tau^2 C n^2}{\lambda^2}\right) \to 1 \tag{16}$$

  where $\tau^2 = \mathbb{E}(Y_i^2)$.

#### Proof of Lemma 3.1

Let $D \subset M$ and $\phi = \phi(n^{-1} X_M^T X_M)$. Then

where $Z_j = n^{-1/2} X_{\bullet j}^T \epsilon$. Conditional on $X$, $Z_j \sim N(0, a_j^2)$ where $a_j^2 = n^{-1} \sum_{i=1}^n X_{ij}^2$. Let $A_n^2 = \max_{1 \le j \le p_n} a_j^2$. By Hoeffding's inequality, (A2) and (A5), $\mathbb{P}(E_n) \to 1$ where $E_n = \{A_n \le \sqrt{2}\}$. So

But $\sum_{j \in M} Z_j^2 \le m \max_{1 \le j \le p_n} Z_j^2$ and (6) follows.

Now we lower bound $L(\widehat{\beta}_M)$. Let $M$ be such that $D \not\subset M$. Let $A = \{j : \widehat{\beta}(j) \neq 0\} \cup D$. Then $|A| \le m + s$. Therefore, with probability tending to 1,

#### Proof of Theorem 3.2

Let $\tilde{Y}$ denote the responses, and $\tilde{X}$ the design matrix, for the second half of the data. Then $\tilde{Y} = \tilde{X}\beta + \tilde{\epsilon}$. Now

and

where $\delta_n = \|\tilde{\epsilon}\|^2 / n$, and $\widehat{\Sigma}_n = n^{-1} X^T X$ and $\tilde{\Sigma}_n = n^{-1} \tilde{X}^T \tilde{X}$. By Hoeffding's inequality

for some $c > 0$ and so

Choose $\epsilon_n = 4/(c\, n^{1 - c_2})$. It follows that

Note that

Hence, with probability tending to 1,

for all $\lambda \in \Lambda_n$, where

and $\mu_i(\lambda) = \tilde{X}_i^T(\widehat{\beta}(\lambda) - \beta)$. Now $\|\widehat{\beta}(\lambda) - \beta\|_1^2 = O_P((k_n + s)^2)$ since $\|\widehat{\beta}(\lambda)\|^2 = O_P(k_n/\phi(k_n))$. Thus, $\|\widehat{\beta}(\lambda) - \beta\|_1 \le C(k_n + s)$ with probability tending to 1, for some $C > 0$. Also, $|\mu_i(\lambda)| \le B \|\widehat{\beta}(\lambda) - \beta\|_1 \le BC(k_n + s)$ with probability tending to 1. Let $W \sim N(0, 1)$. Conditional on ,

so $\sup_{\lambda \in \Lambda_n} |\xi_n(\lambda)| = O_P(k_n/\sqrt{n})$.

#### Proof of Theorem 4.1

(1) Let $\lambda_n = \tau n \sqrt{C/k_n}$, $M = \widehat{S}_n(\lambda_n)$ and $m = |M|$. Then, $\mathbb{P}(m \le k_n) \to 1$ due to (16). Hence, $\mathbb{P}(\lambda_n \in \Lambda_n) \to 1$. From (15),

Hence, $\|\tilde{\beta}(\lambda_n) - \beta\|_\infty^2 = o_P(1)$. So, for each $j \in D$,

and hence $\mathbb{P}(\min_{j \in D} |\tilde{\beta}_j(\lambda_n)| > 0) \to 1$. Therefore, $\Gamma_n = \{\lambda \in \Lambda_n : D \subset \widehat{S}_n(\lambda)\}$ is nonempty. By Lemma 3.1,

On the other hand, from Lemma 3.1,

Now, $n \phi_n(k_n)/(k_n \log p_n) \to \infty$ and so, (17) and (18) imply that

Thus, if $\lambda_*$ denotes the minimizer of $L(\lambda)$ over $\Lambda_n$, we conclude that $\mathbb{P}(\lambda_* \in \Gamma_n) \to 1$ and hence, $\mathbb{P}(D \subset \widehat{S}_n(\lambda_*)) \to 1$.

(2) This follows from part (1) and Theorem 3.2.

(3) Let $A = \widehat{S}_n \cap D^c$. We want to show that

Now,

Conditional on the data in the first two groups, $\widehat{\beta}_A$ is Normally distributed with mean 0 and variance matrix $\sigma^2 (X_A^T X_A)^{-1}$ when $D \subset \widehat{S}_n$. Recall that

where $M = \widehat{S}_n$, $s_j^2 = \widehat{\sigma}^2 e_j^T (X_M^T X_M)^{-1} e_j$ and $e_j = (0, \ldots, 0, 1, 0, \ldots, 0)^T$ where the 1 is in the $j$th coordinate. When $D \subset \widehat{S}_n$, each $T_j$, for $j \in A$, has a t-distribution with $n - m$ degrees of freedom where $m = |\widehat{S}_n|$. Also, $c_n / t_{\alpha/(2m)} \to 1$ where $t_u$ denotes the upper tail critical value for the t-distribution. Hence,

where $a_n = o(1)$, since $|A| \le m$. It follows that

#### Proof of Theorem 4.2

From Theorem 4.1, $\mathbb{P}(\widehat{D}_n \cap D^c \neq \emptyset) \le \alpha_n$ and so $\mathbb{P}(\widehat{D}_n \cap D^c \neq \emptyset) \to 0$. Hence, $\mathbb{P}(\widehat{D}_n \subset D) \to 1$. It remains to be shown that

$$\mathbb{P}(D \subset \widehat{D}_n) \to 1. \tag{19}$$

The test statistic for testing $\beta_j = 0$ when $\widehat{S}_n = M$ is

For simplicity in the proof, let us take $\widehat{\sigma} = \sigma$, the extension to unknown $\sigma$ being straightforward. Let $j \in D$, $\mathcal{M} = \{M : |M| \le k_n,\ D \subset M\}$. Then,

Conditional on the data in the first two groups, for each $M \in \mathcal{M}$, $T_j(M) = (\beta_j/s_j) + Z$ where $Z \sim N(0, 1)$. Without loss of generality assume that $\beta_j > 0$. Hence,

Fix a small $\epsilon > 0$. Note that $s_j^2 \le \sigma^2/(n\kappa)$. It follows that, for all large $n$, $c_n - \beta_j/s_j < -\epsilon\sqrt{n}$. So,

The number of models in $\mathcal{M}$ is

where we used the inequality

So,

by (A2). We have thus shown that $\mathbb{P}(j \notin \widehat{D}_n) \to 0$ for each $j \in D$. Since $|D|$ is finite, it follows that $\mathbb{P}(j \notin \widehat{D}_n \text{ for some } j \in D) \to 0$ and hence (19).

#### Proof of Theorem 4.5

A simple modification of Theorem 3.1 of Barron, Cohen, Dahmen and DeVore (2008) shows that

(The modification is needed because Barron, Cohen, Dahmen and DeVore (2008) require $Y$ to be bounded while we have assumed that $Y$ is Normal. By a truncation argument, we can still derive the bound on $L(k_n)$.) So

Hence, for any $\epsilon > 0$, with probability tending to 1, $\|\widehat{\beta}(k_n) - \beta\|^2 < \epsilon$ so that $|\widehat{\beta}_j| > \psi/2 > 0$ for all $j \in D$. Thus, $\mathbb{P}(D \subset \widehat{S}_n(k_n)) \to 1$. The remainder of the proof of part 1 is the same as in Theorem 4.1. Part 2 follows from the previous result together with Theorem 3.2. The proof of Part 3 is the same as for Theorem 4.1.

#### Proof of Theorem 4.6

Note that $\widehat{\mu}_j - \mu_j = n^{-1} \sum_{i=1}^n X_{ij} \epsilon_i$. Hence, $\widehat{\mu}_j - \mu_j \sim N(0, 1/n)$. So, for any $\delta > 0$,

By (12), conclude that $D \subset \widehat{S}_n(\lambda)$ when $\lambda = \widehat{\mu}_{(k_n)}$. The remainder of the proof is the same as the proof of Theorem 4.5.

#### Proof of Theorem 4.7

Let *A* = *Ŝ _{n}* ∩

*D*. We want to show that

^{c}For fixed *A*, *$\widehat{\beta}$ _{A}* is Normal with mean 0 but this is not true for random

*A*. Instead we need to bound

*T*. Recall that

_{j}

where *M* = *Ŝ _{n}*,
${s}_{j}^{2}={\widehat{\sigma}}^{2}{e}_{j}^{T}{({X}_{M}^{T}{X}_{M})}^{-1}{e}_{j}$ and

*e*= (0, …, 0, 1, 0, …, 0)

_{j}*where the 1 is in the*

^{T}*j*

^{th}coordinate. The probabilities that follow are conditional on but this is supressed for notational convenience. First, write

When $D \subset \widehat{S}_n$,

where $Q_{\widehat{S}_n} = ((1/n) X_{\widehat{S}_n}^T X_{\widehat{S}_n})^{-1}$, $\gamma_{\widehat{S}_n} = n^{-1} X_{\widehat{S}_n}^T \epsilon$, and $\beta_{\widehat{S}_n}(j) = 0$ for $j \in A$. Now, $s_j^2 \ge \widehat{\sigma}^2/(nC)$ so that

for $j \in \widehat{S}_n$. Therefore,

Let $\gamma = n^{-1} X^T \epsilon$. Then,

It follows that

since $\kappa > 0$. So,

Note that $\gamma_j \sim N(0, \sigma^2/n)$ and hence

There exists $\epsilon_n \to 0$ such that $\mathbb{P}(B_n) \to 1$ where $B_n = \{(1 - \epsilon_n) \le \widehat{\sigma}/\sigma \le (1 + \epsilon_n)\}$. So,
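For concreteness, the statistics $T_j$ bounded in this proof can be computed as in the sketch below. It assumes that $T_j$ is the usual least squares t-statistic $\widehat{\beta}_j / s_j$ on the screened design $X_M$, with $s_j^2 = \widehat{\sigma}^2 e_j^T (X_M^T X_M)^{-1} e_j$, and that $\widehat{\sigma}^2$ is the residual variance with $n - |M|$ degrees of freedom. Both choices, the function name `cleaning_t_stats` and the simulated data are assumptions made for illustration.

```python
import numpy as np

def cleaning_t_stats(X_M, y):
    """t-statistics T_j = beta_hat_j / s_j for least squares on the columns of X_M.

    Assumption: sigma_hat^2 = RSS / (n - |M|); the paper's estimator may differ.
    """
    n, m = X_M.shape
    gram_inv = np.linalg.inv(X_M.T @ X_M)          # (X_M^T X_M)^{-1}
    beta_hat = gram_inv @ X_M.T @ y                # least squares fit
    resid = y - X_M @ beta_hat
    sigma2_hat = resid @ resid / (n - m)
    s = np.sqrt(sigma2_hat * np.diag(gram_inv))    # s_j from the diagonal entries
    return beta_hat / s

# Simulated example: the first screened covariate is relevant, the other three are not.
rng = np.random.default_rng(3)
n = 500
X_M = rng.standard_normal((n, 4))
y = 1.5 * X_M[:, 0] + rng.standard_normal(n)

T = cleaning_t_stats(X_M, y)
print(np.round(T, 2))
```

The cleaning stage would then compare each $|T_j|$ with a critical value; that thresholding step is omitted here.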

## 8. Discussion

The multi-stage method presented in this paper successfully controls type I error while giving reasonable power. The lasso and forward stepwise regression have similar performance. Although the theoretical results assume independent data for each of the three stages, simulations suggest that leave-one-out cross-validation leads to valid type I error rates and greater power. Screening the data in one phase of the experiment and cleaning in a followup phase leads to an efficient experimental design. This approach deserves further theoretical investigation; in particular, optimality remains an open question.

The literature on high dimensional variable selection is growing quickly. The most important deficiency in much of this work, including this paper, is the assumption that the model $Y = X^T\beta + \epsilon$ is correct. In reality, the model is at best an approximation. It is possible to study linear procedures when the linear model is not assumed to hold as in Greenshtein and Ritov (2004). We discuss this point in the appendix. Nevertheless, it seems useful to study the problem under the assumption of linearity to gain insight into these methods. Future work should be directed at exploring the robustness of the results when the model is wrong.

Other possible extensions include: dropping the Normality of the errors, permitting non-constant variance, investigating the optimal sample sizes for each stage, and considering other screening methods besides cross-validation.

Finally, let us note that the example involving unfaithfulness, that is, cancellations of parameters that make the marginal correlation much different from the regression coefficient, poses a challenge for all the methods and deserves more attention even in cases of small $p$.

## Acknowledgments

The authors are grateful for the use of a portion of the sample from the Osteoporotic Fractures in Men (MrOS) Study to illustrate their methodology. MrOS is supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS), the National Institute on Aging (NIA), and the National Cancer Institute (NCI) through grants U01 AR45580, U01 AR45614, U01 AR45632, U01 AR45647, U01 AR45654, U01 AR45583, U01 AG18197, and M01 RR000334. Genetic analyses in MrOS were supported by R01-AR051124. This work was supported by NIH grant MH057881. We also thank two referees and an AE for helpful suggestions.

## Appendix

#### Prediction

Realistically, there is little reason to believe that the linear model is correct. Even if we drop the assumption that the linear model is correct, sparse methods like the lasso can still have good properties as shown in Greenshtein and Ritov (2004). In particular, they showed that the lasso satisfies a risk consistency property. In this appendix we show that this property continues to hold if *λ* is chosen by cross-validation.

The lasso estimator is the minimizer of $\sum_{i=1}^{n}(Y_i - X_i^T\beta)^2 + \lambda\|\beta\|_1$. This is equivalent to minimizing $\sum_{i=1}^{n}(Y_i - X_i^T\beta)^2$ subject to $\|\beta\|_1 \le \Omega$, for some $\Omega$. (More precisely, the set of estimators as $\lambda$ varies is the same as the set of estimators as $\Omega$ varies.) We use this second version throughout this section.

The predictive risk of a linear predictor $\ell(x) = x^T\beta$ is $R(\beta) = \mathbb{E}(Y - \ell(X))^2$ where $(X, Y)$ denotes a new observation. Let $\gamma = \gamma(\beta) = (-1, \beta_1, \ldots, \beta_p)^T$ and let $\Gamma = \mathbb{E}(ZZ^T)$ where $Z = (Y, X_1, \ldots, X_p)$. Then we can write $R(\beta) = \gamma^T\Gamma\gamma$. The lasso estimator can now be written as $\widehat{\beta}(\Omega_n) = \operatorname{argmin}_{\beta \in B(\Omega_n)} \widehat{R}(\beta)$ where $\widehat{R}(\beta) = \gamma^T\widehat{\Gamma}\gamma$ and $\widehat{\Gamma} = n^{-1}\sum_{i=1}^{n} Z_i Z_i^T$.
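The quadratic-form representation of the risk can be checked numerically in its empirical version: $\gamma^T\widehat{\Gamma}\gamma$ should equal the average squared residual $n^{-1}\sum_i (Y_i - X_i^T\beta)^2$. The sketch below does this on simulated data; the data and the particular value of $\beta$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)              # no linear model is assumed here
beta = np.array([0.5, -1.0, 2.0])       # arbitrary coefficient vector

Z = np.column_stack([Y, X])             # row i is Z_i = (Y_i, X_i1, ..., X_ip)
Gamma_hat = Z.T @ Z / n                 # n^{-1} * sum_i Z_i Z_i^T
gamma = np.concatenate([[-1.0], beta])  # gamma(beta) = (-1, beta_1, ..., beta_p)^T

risk_quadratic = gamma @ Gamma_hat @ gamma
risk_direct = np.mean((Y - X @ beta) ** 2)
print(risk_quadratic, risk_direct)
```

The two numbers agree up to floating point error because $\gamma^T Z_i = -(Y_i - X_i^T\beta)$, so the quadratic form is the mean of the squared residuals.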

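Define

$$\beta_* = \operatorname{argmin}_{\beta \in B(\Omega_n)} R(\beta)$$

where

$$B(\Omega_n) = \{\beta : \|\beta\|_1 \le \Omega_n\}.$$

Thus, $\ell_*(x) = x^T\beta_*$ is the best linear predictor in the set $B(\Omega_n)$. The best linear predictor is well defined even though $\mathbb{E}(Y \mid X)$ is no longer assumed to be linear. Greenshtein and Ritov (2004) call an estimator $\widehat{\beta}_n$ persistent, or predictive risk consistent, if

$$R(\widehat{\beta}_n) - R(\beta_*) \stackrel{P}{\to} 0$$

as $n \to \infty$.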

The assumptions we make in this section are:

- (B1) $p_n \le e^{n^{\xi}}$ for some $0 \le \xi < 1$,
- (B2) the elements of $\widehat{\Gamma}$ satisfy an exponential inequality: $$\mathbb{P}(|\widehat{\Gamma}_{jk} - \Gamma_{jk}| > \epsilon) \le c_3 e^{-n c_4 \epsilon^2}$$ for some $c_3, c_4 > 0$, and
- (B3) there exists $B_0 < \infty$ such that, for all $n$, $\max_{j,k} \mathbb{E}(|Z_j Z_k|) \le B_0$.

Condition (B2) can easily be deduced from more primitive assumptions as in Greenshtein and Ritov (2004) but for simplicity we take (B2) as an assumption. Let us review one of the results in Greenshtein and Ritov (2004). For the moment, replace (B1) with the assumption that $p_n \le n^b$ for some $b$. Under these conditions, it follows that

Hence,

The latter term is $o_P(1)$ as long as $\Omega_n = o((n/\log n)^{1/4})$. Thus we have:

##### Theorem 8.1 (Greenshtein and Ritov 2004)

If $\Omega_n = o((n/\log n)^{1/4})$ then the lasso estimator is persistent.

For future reference, let us state a slightly different version of their result that we will need. We omit the proof.

##### Theorem 8.2

Let $\gamma > 0$ be such that $\xi + \gamma < 1$. Let $\Omega_n = O(n^{(1-\xi-\gamma)/4})$. Then, under (B1) and (B2),

for some c > 0.

The estimator $\widehat{\beta}(\Omega_n)$ lies on the boundary of the ball $B(\Omega_n)$ and is very sensitive to the exact choice of $\Omega_n$. A potential improvement, and something that reflects actual practice, is to compute the set of lasso estimators $\widehat{\beta}(\ell)$ for $0 \le \ell \le \Omega_n$ and then select from that set based on cross validation. We now confirm that the resulting estimator preserves persistence. As before we split the data into two parts. Construct the lasso estimators $\{\widehat{\beta}(\ell) : 0 \le \ell \le \Omega_n\}$. Choose $\widehat{\ell}$ by cross validation using the held-out part. Let $\widehat{\beta} = \widehat{\beta}(\widehat{\ell})$.
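The procedure just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the constrained lasso is solved by projected gradient descent onto the $\ell_1$ ball (a swapped-in solver chosen only because it needs nothing beyond NumPy), "cross validation" is taken to mean a single held-out split, and the radius is chosen from a finite grid. All function names and the simulated data are hypothetical.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {b : ||b||_1 <= radius}."""
    if radius <= 0:
        return np.zeros_like(v)
    abs_v = np.abs(v)
    if abs_v.sum() <= radius:
        return v.copy()
    u = np.sort(abs_v)[::-1]                  # magnitudes, largest first
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)   # soft-threshold level
    return np.sign(v) * np.maximum(abs_v - theta, 0.0)

def constrained_lasso(X, y, radius, n_iter=500):
    """Projected gradient descent for min ||y - X b||^2 subject to ||b||_1 <= radius."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = project_l1_ball(b - step * grad, radius)
    return b

def cv_lasso(X_fit, y_fit, X_val, y_val, radii):
    """Fit beta_hat(l) for each radius l, then pick the radius with smallest held-out loss."""
    fits = [constrained_lasso(X_fit, y_fit, ell) for ell in radii]
    losses = [np.mean((y_val - X_val @ b) ** 2) for b in fits]
    best = int(np.argmin(losses))
    return radii[best], fits[best], losses

# Simulated example: 20 covariates, two of them relevant.
rng = np.random.default_rng(2)
n, p = 200, 20
beta = np.zeros(p)
beta[:2] = [2.0, -1.0]
X_fit, X_val = rng.standard_normal((n, p)), rng.standard_normal((n, p))
y_fit = X_fit @ beta + rng.standard_normal(n)
y_val = X_val @ beta + rng.standard_normal(n)

radii = np.linspace(0.0, 5.0, 11)             # grid of candidate radii up to Omega_n = 5
ell_hat, beta_hat, losses = cv_lasso(X_fit, y_fit, X_val, y_val, radii)
print("chosen radius:", ell_hat)
```

Restricting the search to a finite grid of radii mirrors the discretization used in the proof of Theorem 8.3; in practice a path algorithm such as Lars would compute the whole family of estimators more efficiently.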

##### Theorem 8.3

Let $\gamma > 0$ be such that $\xi + \gamma < 1$. Under (B1), (B2) and (B3), if $\Omega_n = O(n^{(1-\xi-\gamma)/4})$, then the cross validated lasso estimator $\widehat{\beta}$ is persistent. Moreover,

##### Proof

Let $\beta_*(\ell) = \operatorname{argmin}_{\beta \in B(\ell)} R(\beta)$. Define $h(\ell) = R(\beta_*(\ell))$, $g(\ell) = R(\widehat{\beta}(\ell))$ and $c(\ell) = \widehat{L}(\widehat{\beta}(\ell))$. Note that, for any vector $b$, we can write $R(b) = \tau^2 + b^T\Sigma b - 2b^T\rho$ where $\rho = (\mathbb{E}(YX_1), \ldots, \mathbb{E}(YX_p))^T$.

Clearly, $h$ is monotone nonincreasing on $[0, \Omega_n]$. We claim that $|h(\ell + \delta) - h(\ell)| \le c\,\Omega_n\delta$ where $c$ depends only on $\Gamma$. To see this, let $u = \beta_*(\ell)$, $v = \beta_*(\ell + \delta)$ and $a = \ell\,\beta_*(\ell + \delta)/(\ell + \delta)$ so that $a \in B(\ell)$. Then,

where $C = \max_{j,k}|\Gamma_{j,k}| = O(1)$.

Next we claim that $g(\ell)$ is Lipschitz on $[0, \Omega_n]$ with probability tending to 1. Let $\widehat{\beta}(\ell) = \operatorname{argmin}_{\beta \in \widehat{B}(\ell)} \widehat{R}(\beta)$ denote the lasso estimator and set $\widehat{u} = \widehat{\beta}(\ell)$ and $\widehat{v} = \widehat{\beta}(\ell + \delta)$. Let $\epsilon_n = n^{-\gamma/4}$. From (20), the following chain of equations holds except on a set of exponentially small probability:

A similar argument can be applied in the other direction. Conclude that

except on a set of small probability.

Now let $A = \{0, \delta, 2\delta, \ldots, m\delta\}$ where $m$ is the smallest integer such that $m\delta \ge \Omega_n$. Thus, $m \sim \Omega_n/\delta_n$. Choose $\delta = \delta_n = n^{-3(1-\xi-\gamma)/8}$. Then $\Omega_n\delta_n \to 0$ and $\Omega_n/\delta_n \le n^{3(1-\xi-\gamma)/4}$. Using the same argument as in the proof of Theorem 3.2,

where $\sigma_n = o_P(1)$. Then,

and persistence follows. To show the second result, let $\tilde{\beta} = \operatorname{argmin}_{0 \le \ell \le \Omega_n} g(\ell)$ and $\bar{\beta} = \operatorname{argmin}_{\ell \in A} g(\ell)$. Then,

and the claim follows.
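The rates used for the grid $A$ in this proof can be checked with a short calculation. Writing $a = 1 - \xi - \gamma$, the choices $\Omega_n = n^{a/4}$ and $\delta_n = n^{-3a/8}$ give $\Omega_n\delta_n = n^{-a/8} \to 0$ and $\Omega_n/\delta_n = n^{5a/8} \le n^{3a/4}$. The sketch below evaluates these quantities for a few sample sizes; it assumes the constant in $\Omega_n = O(n^{a/4})$ equals 1, and the values of $\xi$ and $\gamma$ are purely illustrative.

```python
import math

def grid_parameters(n, xi, gamma):
    """Return (Omega_n, delta_n, m) for the grid A = {0, delta, ..., m * delta}.

    Assumption: Omega_n = n^{(1 - xi - gamma)/4} exactly (constant 1 in the O(.)).
    """
    a = 1.0 - xi - gamma
    omega_n = n ** (a / 4.0)
    delta_n = n ** (-3.0 * a / 8.0)
    m = math.ceil(omega_n / delta_n)   # smallest integer with m * delta_n >= Omega_n
    return omega_n, delta_n, m

xi, gamma = 0.2, 0.2                   # illustrative values with xi + gamma < 1
rows = []
for n in (10**3, 10**6, 10**9):
    omega_n, delta_n, m = grid_parameters(n, xi, gamma)
    bound = n ** (3.0 * (1.0 - xi - gamma) / 4.0)
    rows.append((n, omega_n * delta_n, omega_n / delta_n, bound, m))
    print(f"n={n:>10}  Omega*delta={omega_n * delta_n:.4f}  "
          f"Omega/delta={omega_n / delta_n:.1f}  bound={bound:.1f}  m={m}")
```

The product $\Omega_n\delta_n$ shrinks as $n$ grows while the number of grid points $m$ stays below the polynomial bound, which is what allows a union bound over the grid.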

## References

- Barron A, Cohen A, Dahmen W, DeVore R. Approximation and learning by greedy algorithms. The Annals of Statistics. 2008;36:64–94.
- Bühlmann P. Boosting for high-dimensional linear models. The Annals of Statistics. 2006;34:559–583.
- Candes E, Tao T. The Dantzig selector: statistical estimation when *p* is much larger than *n*. The Annals of Statistics. 2007;35:2313–2351.
- Donoho D. For Most Large Underdetermined Systems of Linear Equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics. 2006;59:797–829.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32:407–499.
- Fan J, Lv J. Sure independence screening for ultra-high dimensional feature space. To appear: *Journal of the Royal Statistical Society, Series B*. 2008.
- Greenshtein E, Ritov Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
- Hoeffding W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association. 1963;58:13–30.
- Meinshausen N. Relaxed Lasso. Computational Statistics and Data Analysis. 2007;52:374–393.
- Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
- Meinshausen N, Yu B. Lasso-type recovery of sparse representations of high-dimensional data. To appear: *The Annals of Statistics*. 2008.
- Orwoll E, Blank JB, Barrett-Connor E, Cauley J, Cummings S, Ensrud K, Lewis C, Cawthon PM, Marcus R, Marshall LM, McGowan J, Phipps K, Sherman S, Stefanick ML, Stone K. Design and baseline characteristics of the osteoporotic fractures in men (MrOS) study: a large observational study of the determinants of fracture in older men. Contemp Clin Trials. 2005;26:569–585.
- Robins J, Scheines R, Spirtes P, Wasserman L. Uniform consistency in causal inference. Biometrika. 2003;90:491–515.
- Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search. MIT Press; 2001.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B. 1996;58:267–288.
- Tropp JA. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory. 2004;50:2231–2242.
- Tropp JA. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory. 2006;52:1030–1051.
- Wainwright M. Sharp thresholds for high-dimensional and noisy recovery of sparsity. arxiv.org/math.ST/0605740. 2006.
- Wellcome Trust. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678.
- Zhao P, Yu B. On model selection consistency of lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
- Zhang CH, Huang J. Model selection consistency of the lasso in high-dimensional linear regression. To appear: *The Annals of Statistics*. 2006.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
