Proc Am Stat Assoc. Author manuscript; available in PMC 2016 Jan 26.
Published in final edited form as:
Proc Am Stat Assoc. 2010 Jul-Aug; 2010: 5295–5309.
PMCID: PMC4727967
NIHMSID: NIHMS746164
PMID: 26823659

Regularized Variance Estimation and Variance Stabilization of High Dimensional Data

Abstract

Among the problems posed by high-dimensional datasets (the so-called p ≫ n paradigm) are that variable-specific estimators of variances are not reliable and that test statistics have low power, both due to a lack of degrees of freedom. In addition, the variance is observed to be a function of the mean. We introduce a non-parametric adaptive regularization procedure that uses the information contained in the mean to jointly generate local shrinkage estimators of the mean and variance. Regularized t-like statistics derived from these shrinkage estimators have significantly more statistical power than their standard sample counterparts, than regular common-value shrinkage estimators, or than statistics that ignore the information contained in the sample mean. These estimators feature interesting variance stabilization and normalization properties that can be used for preprocessing high-dimensional multivariate data.

Keywords: Bioinformatics, Inadmissibility, Regularization, Shrinkage Estimators, Normalization, Variance Stabilization

1 Introduction: Estimation of Population Variances

We introduce a regularization and variance stabilization method for parameter estimation, normalization, and inference on data with many continuous variables. In the typical setting we have in mind, the number of input variables (genes, peptides, proteins, etc.) is much larger than the number of samples (the so-called p ≫ n paradigm), as in high-throughput “omics” data. The data may be any kind of continuous or discrete covariates.

We and others have observed that variables in high-dimensional data often exhibit a complex mean-variance dependency, with standard deviations increasing severely with the means [13, 20]. Since statistical procedures rely on assumptions of equality/homogeneity of variances and of independence, these issues become crucial for making accurate inferences. In addition, as the number of variables always dominates the number of samples in large high-throughput datasets, empirical evidence from univariate modeling approaches suggests that the usual variable-wise estimators of variance recurrently lead to false positives [10, 29] and suffer from a lack of statistical power due to small sample sizes [21, 24].

A large majority of authors have used non-parametric regularization techniques for variance estimation by borrowing information (pooling) across similar variables, and have shown that the derived shrinkage estimators can significantly improve the accuracy of inferences. Jain et al. [15] proposed a local-pooled error estimation procedure, which borrows strength from variables in local intensity regions to estimate variability. Shrinkage estimation was used by Wright & Simon [32], Cui et al. [6] and Ji & Wong [16]. Tong and Wang proposed a family of optimal shrinkage estimators for variances raised to a fixed power [28] by borrowing information across variables. The idea of borrowing strength across variables was also recently exploited in gene-set enrichment analyses [9].

Shrinkage estimators have also been successfully combined with empirical Bayes approaches, where posterior estimators have been shown to follow distributions with augmented degrees of freedom, yielding greater statistical power and far more stable inferences in the presence of few samples [18, 21]. Following this approach, Baldi & Long estimated population variances by a weighted mixture of the individual variable sample variance and an overall inflation factor selected using all variables [1]. Lonnstedt & Speed [18] and later Smyth [21] proposed an empirical Bayes approach that combines information across variables. Kendziorski et al. extended the empirical Bayes method using hierarchical gamma-gamma and log-normal-normal models [17].

In a similar vein, shrinkage estimation was also used to generate (Bayesian or not) “moderated” statistics. There, variable-specific variance estimators are inflated by using an overall offset. Efron et al. derived a t-test that estimates the offset by using a percentile of the distribution of sample standard deviations [10]. Tusher et al. [29] and Storey & Tibshirani [25] added a small constant to the variable-specific variance estimators in their t-test to stabilize the small variances (SAM). Smyth and Cui et al. proposed regularized t-tests and F-tests by replacing the usual variance estimator with, respectively, a Bayesian-adjusted denominator [21] or a James-Stein-based shrinkage estimator [6].

A commonality to all of the previous shrinkage methods is that they shrink the sample variance toward a single global value that is used for all variables. However, the assumption that one pooled estimator can be shared by all variables seems unrealistic, and test statistics based on these estimators are thus likely to give misleading results [6]. We propose an alternative type of shrinkage, more akin to joint adaptive local shrinkage, based on a joint and adaptive regularization of the location and scale parameters, as follows.

First, if one clusters variables by their individual parameter estimates, it becomes possible to obtain improved cluster-pooled parameter estimates by using information from the cluster, and to make inferences about each variable individually with higher accuracy. Essentially, this amounts to generating an adaptive or local-pooled shrinkage estimator, similarly to Jain et al. [15] and Papana & Ishwaran [19]. Second, given the observation that the variance is an unknown function of the mean [13, 20] and C. Stein's inadmissibility result for variance estimators [23], it is clear that shrinkage variance estimators should improve if the information contained in the sample mean is known or used. Along these lines, Wang et al. recently proposed a constant coefficient of variation model and a quadratic variance-mean model for estimating the variance as a function of an unknown mean [30].

By simultaneously compensating for the parameter dependency and for the lack of statistical power through local pooling of information from similar variables, we obtain jointly regularized shrinkage estimates of population means and variances. We show that these estimators not only stabilize the variance of each variable, but also allow novel regularized test statistics (e.g. t-tests) with greater statistical power and improved overall accuracy.

Section 2 lays out the principle and notations of regularization and joint estimation of the population mean and variance in a single-group or multi-group situation. Section 3 introduces the so-called similarity statistic as the basis of our clustering algorithm. Section 4 shows how to derive regularized statistics from our approach, and how to use them in significance tests to make inferences. With the help of a synthetic example, Section 5 demonstrates the adequacy of our procedure and its performance in inference (hypothesis testing in a classification problem) and in variance stabilization, in comparison with competing estimators and other normalization/variance stabilization methods.

2 Adaptive Regularization Via Clustering

2.1 Introduction

Let $Y_{i,j}$ be the individual response (expression level, signal, intensity, …) of variable j ∈ {1, …, p} (gene, peptide, protein, …) in sample i ∈ {1, …, n}. Assume that the individual responses for any given variable j follow some unknown continuous location-scale family distribution: $Y_{i,j} \overset{iid}{\sim} \mathcal{D}(\mu_j, \sigma_j^2)$. Generating variable-by-variable z-scores, i.e. standardizing the response by the individual sample mean and sample standard deviation of variable j as $Y_{i,j}^{*} = (Y_{i,j} - \hat\mu_j)/\hat\sigma_j$, gives a transformed mean and standard deviation $(\hat\mu_j^{*}, \hat\sigma_j^{*}) = (0, 1)$ common to every variable j. This is likely to be a poor transformation of the data, because even if an equal mean/variance model were true, we would still expect sampling variability in $(\hat\mu_j, \hat\sigma_j)$. It can also lead to misleading inferences, mostly due to over-fitting [10, 29] and lack of statistical power [21, 24].

By identifying clusters in which variables tend to have similar population parameter estimates, one can derive cluster-pooled versions of these estimates, which, in turn, are used to standardize each variable individually within its cluster. In a related approach, Papana & Ishwaran proposed a strategy to generate an equal variance model [19], later used in Bayesian model selection [14]. Theirs is a variance stabilization procedure achieved by quantile regularization of sample standard deviations, by means of a recursive partitioning (CART-like) algorithm. Our strategy distinguishes itself in that it provides joint regularized estimates of the population mean and variance.

To simultaneously (i) borrow information across variables and (ii) use the information contained in the estimated population mean, our idea is to perform a bi-dimensional clustering of the variables in their individual parameter estimate space. This amounts to finding the clusters that gather variables with similar sample means and sample standard deviations, from which the population variance $\sigma_j^2$ and population mean $\mu_j$ of each variable j are estimated using cluster-pooled parameter estimates.

2.2 Single Group Situation

Suppose the variables assume C categorical values (clusters). Let $C_l$ denote the l-th cluster for l ∈ {1, …, C}. Let $l_j \in \{C_l\}_{l=1}^{C}$ for j ∈ {1, …, p} be the cluster variable j belongs to, i.e. the cluster membership indicator of variable j: $l_j = \sum_{l=1}^{C} C_l \cdot I(\mathrm{cluster}(j) = C_l)$, for l ∈ {1, …, C} and j ∈ {1, …, p}, where I(·) denotes the indicator function throughout the paper. Let $\hat\mu(l_j)$ and $\hat\sigma^2(l_j)$ for cluster $l_j$ be the cluster mean of the sample means and the cluster mean of the sample variances, respectively. In practice, they are given by:

$$\hat\mu(l_j) = \frac{1}{\#\{j : l_j = l\}} \sum_{\{j : l_j = l\}} \hat\mu_j \qquad\qquad \hat\sigma^2(l_j) = \frac{1}{\#\{j : l_j = l\}} \sum_{\{j : l_j = l\}} \hat\sigma_j^2$$
(1)

where $l_j$ denotes the cluster membership indicator of variable j, $\hat\mu_j$ is the usual sample mean of variable j, and $\hat\sigma_j^2$ is the usual unbiased sample variance of variable j, defined as:

$$\hat\mu_j = \frac{1}{n} \sum_{i=1}^{n} Y_{i,j} \qquad\qquad \hat\sigma_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(Y_{i,j} - \hat\mu_j\right)^2$$
(2)

Then, use these within-cluster shared estimates $\{\hat\mu(l_j), \hat\sigma^2(l_j)\}_{j=1}^{p}$ to standardize all the variables.
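
The single-group scheme can be sketched in a few lines of R. This is only a minimal illustration, not the implementation of the R package 'MVR' [33]: the toy data, the fixed number of clusters C, and the object names are assumptions made for the example, whereas in practice C would be chosen with the similarity statistic of Section 3.

```r
## Minimal sketch of the single-group scheme (illustrative only, not the 'MVR' package code)
set.seed(1)
n <- 10; p <- 200
Y <- matrix(rnorm(n * p, mean = rep(runif(p, 0, 5), each = n),
                  sd   = rep(runif(p, 0.5, 2), each = n)), nrow = n)   # toy data, n x p

mu.hat     <- colMeans(Y)                     # per-variable sample means, eq. (2)
sigma2.hat <- apply(Y, 2, var)                # per-variable unbiased sample variances, eq. (2)

C  <- 5                                       # number of clusters, assumed known here
km <- kmeans(cbind(mu.hat, sqrt(sigma2.hat)), centers = C, nstart = 100)
l  <- km$cluster                              # cluster membership l_j of each variable

mu.pool     <- tapply(mu.hat,     l, mean)[l] # cluster mean of sample means,     eq. (1)
sigma2.pool <- tapply(sigma2.hat, l, mean)[l] # cluster mean of sample variances, eq. (1)

## standardize every variable by its cluster-pooled estimates
Y.std <- sweep(sweep(Y, 2, mu.pool, "-"), 2, sqrt(sigma2.pool), "/")
```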

2.3 Multiple Groups Situation

Let $G_k$ denote the k-th sample group for k ∈ {1, …, G}. Let $k_i \in \{G_k\}_{k=1}^{G}$ for i ∈ {1, …, n} be the group sample i belongs to, i.e. the group membership indicator of sample i: $k_i = \sum_{k=1}^{G} G_k \cdot I(\mathrm{group}(i) = G_k)$, for k ∈ {1, …, G} and i ∈ {1, …, n}. To deal with multiple groups of samples and the issue of differing variances across groups, one initially performs a separate bi-dimensional clustering of the variables within each group k ∈ {1, …, G}, based on the individual variable sample means and sample variances, as in the case of a single group (G = 1). Next, one obtains an optimal cluster configuration $\mathcal{C}_k$ for each group k ∈ {1, …, G} and merges the cluster configurations $\{\mathcal{C}_k\}_{k=1}^{G}$ into a refined cluster configuration $\mathcal{C}$ (usually containing more clusters), following Papana et al.'s scheme [19]. Finally, this new cluster configuration is used to generate regularized population estimates as before.

For instance, in the case of two groups (G = 2) the merging scheme is as follows:

Group k = 1: Config. $\mathcal{C}_1$:  XX | XX | XXX
Group k = 2: Config. $\mathcal{C}_2$:  XXX | XXXX
Merged Config. $\mathcal{C}$:  XX | X | X | XXX

If we let $n_k = |G_k|$, with $n_k \neq 1$ for k ∈ {1, …, G} and $\sum_{k=1}^{G} n_k = n$, then, proceeding as before (see 2.2) with the cluster membership indicator $l_j$ of variable j, the cluster mean of the pooled sample means and the cluster mean of the pooled sample variances are:

$$\hat\mu(l_j) = \frac{1}{\#\{j : l_j = l\}} \sum_{\{j : l_j = l\}} \hat\mu_j \qquad\qquad \hat\sigma^2(l_j) = \frac{1}{\#\{j : l_j = l\}} \sum_{\{j : l_j = l\}} \hat\sigma_j^2$$
(3)

where $\hat\mu_j$ and $\hat\sigma_j^2$ are respectively the usual pooled sample mean and pooled sample variance across groups for each individual variable j:

$$\hat\mu_j = \frac{1}{n} \sum_{k=1}^{G} n_k\, \hat\mu_{k,j} \qquad\qquad \hat\sigma_j^2 = \frac{1}{n - G} \sum_{k=1}^{G} (n_k - 1)\, \hat\sigma_{k,j}^2$$
(4)

and $\hat\mu_{k,j}$ and $\hat\sigma_{k,j}^2$ are respectively the usual sample mean and the unbiased sample variance of each individual variable j in group k:

$$\hat\mu_{k,j} = \frac{1}{n_k} \sum_{\{i : k_i = k\}} Y_{i,j} \qquad\qquad \hat\sigma_{k,j}^2 = \frac{1}{n_k - 1} \sum_{\{i : k_i = k\}} \left(Y_{i,j} - \hat\mu_{k,j}\right)^2$$
(5)

Finally, use these within-cluster shared estimates $\{\hat\mu(l_j), \hat\sigma^2(l_j)\}_{j=1}^{p}$ to standardize all the variables. Note that this approach is required whenever an equal variance model cannot be assumed across groups. In practice, this situation arises very often.
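
A rough R sketch of the multi-group scheme follows, under the assumption that Papana et al.'s merging step amounts to taking the common refinement of the per-group configurations (two variables share a merged cluster only if they share a cluster in every group), which is what the schematic above illustrates. The per-group cluster counts are fixed here purely for illustration.

```r
## Minimal multi-group sketch (assumed common-refinement merge, fixed cluster counts)
set.seed(1)
n1 <- 5; n2 <- 5; p <- 200; G <- 2
grp <- rep(1:2, c(n1, n2))
Y <- rbind(matrix(rnorm(n1 * p, 2, 1), n1),   # toy two-group data, n x p
           matrix(rnorm(n2 * p, 4, 2), n2))

cluster.group <- function(Yk, C) {            # bi-dimensional clustering within one group
  kmeans(cbind(colMeans(Yk), apply(Yk, 2, sd)), centers = C, nstart = 100)$cluster
}
l1 <- cluster.group(Y[grp == 1, ], C = 4)     # configuration C_1
l2 <- cluster.group(Y[grp == 2, ], C = 3)     # configuration C_2
l  <- as.integer(interaction(l1, l2, drop = TRUE))   # merged (refined) configuration C

mu.k     <- sapply(1:G, function(k) colMeans(Y[grp == k, ]))       # group sample means,     eq. (5)
sigma2.k <- sapply(1:G, function(k) apply(Y[grp == k, ], 2, var))  # group sample variances, eq. (5)
nk <- c(n1, n2); n <- sum(nk)
mu.hat     <- as.vector(mu.k %*% nk) / n                     # pooled sample mean,     eq. (4)
sigma2.hat <- as.vector(sigma2.k %*% (nk - 1)) / (n - G)     # pooled sample variance, eq. (4)

mu.pool     <- tapply(mu.hat,     l, mean)[as.character(l)]  # eq. (3)
sigma2.pool <- tapply(sigma2.hat, l, mean)[as.character(l)]  # eq. (3)
Y.std <- sweep(sweep(Y, 2, mu.pool, "-"), 2, sqrt(sigma2.pool), "/")
```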

3 Clustering by Similarity Statistic

3.1 Similarity Statistic

Recall that we need to perform a bi-dimensional clustering of the individual variables in the mean-variance space into C clusters. Any clustering algorithm may be used at this point. We chose here a partitioning clustering approach, e.g. the K-means algorithm. Variables are clustered by their means and standard deviations, with 1000 replicated random start seedings. A major challenge in any cluster analysis is the estimation of the true number of clusters (or centroids) in a dataset. In order to determine an estimate of C in the combined set $\{\hat\mu_j, \hat\sigma_j^2\}_{j=1}^{p}$ of sample means and variances, we devised a similarity statistic, which is a modified version of the gap statistic introduced by Tibshirani et al. [27].

Suppose we have clustered the variables into C clusters. Let $p_l = |C_l|$ for l ∈ {1, …, C}, such that $\sum_{l=1}^{C} p_l = p$. Assume for a given cluster configuration $\mathcal{C}$ that the data have been centered and standardized to have within-cluster means and standard deviations of 0 and 1, respectively. Let $D_l = \sum_{j, j' \in C_l} d_{j,j'}$ be the sum of pairwise distances (typically taken to be Euclidean distances) of all variables in cluster $C_l$. Most methods for estimating the true number of clusters in a data set are based on the pooled within-cluster dispersion, defined as the pooled within-cluster sum of squares around the cluster means: $W_p(l) = \sum_{l'=1}^{l} \frac{1}{2 p_{l'}} D_{l'}$.

An estimate of the true number of clusters is usually obtained by identifying a “kink” in the plot of $W_p(l)$ as a function of l ∈ {1, …, C}. The gap statistic is a method for identifying this kink. The idea is to standardize the curve of $\log\{W_p(l)\} = g(l)$ by comparing it with its expectation under an appropriate null reference distribution. Our version of the similarity statistic compares the curve of $\log\{W_p(l)\}$ to its expected value under an appropriate null reference distribution with true (i) mean 0 and (ii) standard deviation 1 (e.g. a standard Gaussian distribution N(0, 1)). We define the corresponding similarity statistic as the absolute value of the gap between the two curves: $\mathrm{Gap}_p(l) = \left| E_p\!\left[\log\{W_p(l)\}\right] - \log\{W_p(l)\} \right|$, where $E_p$ and $\log\{W_p(l)\}$ denote respectively the expectation and the pooled within-cluster dispersion as a function of l under a sample of size p from the reference distribution. By sampling from the null distribution, we account for sampling variability even when the true parent distribution has the desired true moments. Our estimate $\hat{l}$ of the true number of clusters of variables is the smallest value of l for which the similarity between the two distributions is maximal, i.e. for which the gap statistic $\mathrm{Gap}_p(l)$ between the two curves is minimal after assessing its sampling distribution: $\hat{l} = \min_l \left[\operatorname{arg\,min}_l \{\mathrm{Gap}_p(l)\}\right]$.

3.2 Estimation

In practice, we estimate $E_p[\log\{W_p(l)\}]$ and the sampling distribution of $\mathrm{Gap}_p(l)$ by drawing, say, B = 100 Monte-Carlo replicates from our standard Gaussian reference distribution. If we let the estimate of $E_p[\log\{W_p(l)\}]$ be $\hat{E}_p[\log\{W_p(l)\}] = \frac{1}{B} \sum_{b=1}^{B} \log\{W_p^{b}(l)\}$, denoted by $\bar{L}$, then the corresponding gap statistic estimate is: $\widehat{\mathrm{Gap}}_p(l) = \left| \bar{L} - \log\{W_p(l)\} \right|$.

A usual stopping rule to estimate the true number $\hat{l}$ of clusters is to take the smallest value of l for which $\widehat{\mathrm{Gap}}_p(l)$ is minimal up to one standard deviation: $\hat{l} = \min_l \left\{ l : \widehat{\mathrm{Gap}}_p(l) \le \widehat{\mathrm{Gap}}_p(l+1) + \widehat{sd}_p(l+1)\sqrt{1 + 1/B} \right\}$, where $\widehat{sd}_p(l) = \sqrt{\frac{1}{B} \sum_{b=1}^{B} \left[\log\{W_p^{b}(l)\} - \bar{L}\right]^2}$ denotes the estimated standard deviation of $\log\{W_p^{b}(l)\}$ as a function of l under a sample of size p. User input is required to specify the range l ∈ {1, …, C} of numbers of clusters over which the gap statistic is estimated; in practice, we entertain a range of, say, 1–20 (see the R package ‘MVR’ for more details [33]). The advantage of the gap statistic is that it works well even if the estimate is $\hat{l} = 1$, where most other methods are usually undefined [27]. On the other hand, the procedure tends to be conservative and can be computationally intensive.
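
The estimation of the similarity statistic can be sketched in R as below. This is illustrative only (the 'MVR' package [33] should be used in practice): squared Euclidean pairwise distances and a standard Gaussian N(0, 1) reference are assumed, and the function names W.p and sim.stat are made up for the example.

```r
## Pooled within-cluster dispersion W_p(l) for a given cluster labeling
W.p <- function(X, labels) {
  sum(sapply(split(seq_len(nrow(X)), labels), function(idx) {
    ## dist() counts each unordered pair once, so this equals D_l / (2 * p_l)
    sum(dist(X[idx, , drop = FALSE])^2) / length(idx)
  }))
}

## Similarity (gap-like) statistic and the one-standard-deviation stopping rule
sim.stat <- function(X, l.max = 20, B = 100) {
  logW.obs <- numeric(l.max)
  logW.ref <- matrix(NA_real_, B, l.max)
  for (l in 1:l.max) {
    logW.obs[l] <- log(W.p(X, kmeans(X, l, nstart = 25)$cluster))
    for (b in 1:B) {                          # Monte Carlo draws from the N(0, 1) reference
      Xb <- matrix(rnorm(length(X)), nrow(X))
      logW.ref[b, l] <- log(W.p(Xb, kmeans(Xb, l, nstart = 25)$cluster))
    }
  }
  gap <- abs(colMeans(logW.ref) - logW.obs)   # |E_p[log W_p(l)] - log W_p(l)|
  sdp <- apply(logW.ref, 2, sd) * sqrt(1 + 1 / B)
  ## smallest l with gap(l) <= gap(l + 1) + sd(l + 1) * sqrt(1 + 1/B)
  l.hat <- which(gap[-l.max] <= (gap + sdp)[-1])[1]
  if (is.na(l.hat)) l.hat <- which.min(gap)
  list(gap = gap, l.hat = l.hat)
}

## Example: variables described by their centered/scaled (mean, sd) pairs
X <- scale(cbind(rnorm(200), rchisq(200, df = 4)))
sim.stat(X, l.max = 8, B = 20)$l.hat
```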

4 Regularized Test Statistics Under Unequal Group Variance Model

4.1 Introduction

In high dimensional data, there is typically a very large number of variables and a relatively small number of replications. Among the many challenges this situation presents to standard statistical methods, standard test statistics usually have low power. One of the reasons is that conventional variance estimators that are used e.g. in the t-test statistic and other statistics are unreliable owing to the small number of replications [6, 11, 18, 21, 24]. As mentioned above, a variety of methods have been proposed in the literature to overcome this problem of lack of degrees of freedom.

Recently, Wang et al. explained that, because the means are unknown and estimated with few degrees of freedom, naive methods that use the sample mean instead of a better estimate of the true mean are generally biased because of the errors-in-variables phenomenon [30]. They proposed three methods for overcoming this bias. The joint regularized population estimators that we propose here aim to overcome this bias and to avoid the aforementioned unrealistic assumptions of homoscedasticity and independence. We show in the next section that they result in greater statistical power when used to derive regularized test statistics.

Also, recall that in the general, real-data scenario we are dealing with multiple groups of samples (G) with a priori unequal group variances. Under this model, the standard or regularized t-test statistics above will also suffer from violation of the equal group distribution/variance assumption. Ideally, therefore, one would want to address all of these issues simultaneously.

4.2 Two-group Setup

Using the previous notations, with G groups of samples and C clusters of variables, define the cluster mean of the group sample means and the cluster mean of the group sample variances for cluster $l_j$ and group k:

$$\hat\mu(l_{k,j}) = \frac{1}{\#\{j : l_j = l\}} \sum_{\{j : l_j = l\}} \hat\mu_{k,j} \qquad\qquad \hat\sigma^2(l_{k,j}) = \frac{1}{\#\{j : l_j = l\}} \sum_{\{j : l_j = l\}} \hat\sigma_{k,j}^2$$
(6)

where $l_{k,j}$ is the cluster membership indicator of variable j in the l-th cluster and k-th group, $\hat\mu_{k,j}$ is the group sample mean of variable j in group k (Subsection 2.3), and $\hat\sigma_{k,j}^2$ is the unbiased group sample variance of variable j in group k (Subsection 2.3).

Considering the case of a two-sample group problem (G = 2), define a Mean-Variance Regularized unequal group variance t-test statistic, further denoted tMVR, as follows:

$$t_j = \frac{\hat\mu(l_{1,j}) - \hat\mu(l_{2,j})}{\sqrt{\dfrac{\hat\sigma^2(l_{1,j})}{n_1} + \dfrac{\hat\sigma^2(l_{2,j})}{n_2}}}$$
(7)
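
A minimal R sketch of statistic (7) is given below, assuming G = 2 groups and per-group cluster counts supplied by the user; in practice the cluster memberships $l_{k,j}$ would come from the similarity statistic of Section 3, and the function name t.mvr is illustrative only.

```r
## Mean-Variance Regularized unequal group variance t statistic, eq. (7) -- illustrative sketch
t.mvr <- function(Y, grp, C = c(5, 5)) {
  stopifnot(length(unique(grp)) == 2)
  stats <- lapply(1:2, function(k) {
    Yk <- Y[grp == sort(unique(grp))[k], , drop = FALSE]
    m  <- colMeans(Yk); v <- apply(Yk, 2, var)
    l  <- kmeans(cbind(m, sqrt(v)), centers = C[k], nstart = 100)$cluster
    list(n  = nrow(Yk),
         mu = tapply(m, l, mean)[as.character(l)],   # cluster mean of group sample means,     eq. (6)
         v  = tapply(v, l, mean)[as.character(l)])   # cluster mean of group sample variances, eq. (6)
  })
  (stats[[1]]$mu - stats[[2]]$mu) /
    sqrt(stats[[1]]$v / stats[[1]]$n + stats[[2]]$v / stats[[2]]$n)
}
```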

4.3 Inferring Significance

The aforementioned standard or regularized test statistics (Subsection 4.1) generally assume equal group sample distributions/variances (with the exception of Welch's t-test, which uses an approximation to the degrees of freedom) in order to use the pooled sample variance across groups for each individual variable j as a population variance estimate. However, one does not necessarily want to treat the group sample variances as being equal. In fact, regular estimates of the group sample variance $\hat\sigma_{k,j}^2$ of variable j in group k are generally not equal/similar across groups. Further, even though a regularization procedure such as MVR or CVSR will make the pooled sample variance $\hat\sigma_j^2$ follow a homoscedastic model (as shown e.g. in Figure 3 of the next section), regularized estimates of the group sample variance $\hat\sigma^2(l_{k,j})$ (such as those used in our regularized t-test (7)) will not necessarily be identically distributed. Our unpaired two-sided test statistic (7) does not make this assumption of sample group homoscedasticity. This is an important relaxation, as one does not want to make this assumption in reality.

Figure 3. Comparative quantile-quantile plots to verify the goodness of fit between observed and expected target moments under an arbitrary centered homoscedastic model (here from a reference standard Gaussian distribution N(0, 1)). Synthetic dataset results are shown for Model (8b). When the fit is good, the estimated quantiles from the observed data line up with those of the reference distribution. Observed transformed means (top row) and standard deviations (bottom row) by the MVR and CVSR regularization methods. Red line: inter-quartile line. One-sample two-sided Kolmogorov-Smirnov test statistics are reported in boxes.

For the computation of test-statistic p-values, when the total number of permutations cannot be enumerated, Monte Carlo (approximate) permutation tests are generally used. Given that the underlying exchangeability assumption no longer holds under a heteroscedastic model for the sample group variances, one has to resort to non-exact tests such as the bootstrap test, which entails less stringent assumptions [12]. The estimated p-values provided by bootstrap methods (with replacement) are less exact than p-values obtained from permutation tests (without replacement) [7], but they can be used to test the null hypothesis of no difference between the means of two statistics [8] without assuming that the distributions are otherwise equal [2].

Approximate p-values of our unpaired two-sided tMVR test statistic should be computed as follows. For each variable j ∈ {1, …, p}, B′ bootstrap sets are generated by Monte Carlo sampling with replacement of the sample group labels of the i ∈ {1, …, n} response (expression) values $Y_{i,j}$. For each bootstrap set b ∈ {1, …, B′} and each j, the corresponding null test statistic, denoted $t_j^{b}$, is computed to obtain the corresponding approximate null bootstrap distribution $\{t_j^{1}, \dots, t_j^{B'}\}$. Then, for each j, the p-value is estimated by the proportion $\hat{p}(j) = \frac{1}{B'} \sum_{b=1}^{B'} I\!\left(|t_j^{b}| > |t_j|\right)$.
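
The bootstrap p-value computation can be sketched as follows, assuming a statistic function of the form of the t.mvr() sketch given after equation (7); the argument B plays the role of B′ here.

```r
## Approximate bootstrap p-values, one per variable (illustrative sketch)
boot.pvalues <- function(Y, grp, stat.fun, B = 1000) {
  t.obs  <- stat.fun(Y, grp)                   # observed statistics t_j
  t.null <- replicate(B, {
    g <- sample(grp, replace = TRUE)           # resample group labels with replacement
    while (length(unique(g)) < 2 || min(table(g)) < 2)
      g <- sample(grp, replace = TRUE)         # keep both group variances estimable
    stat.fun(Y, g)                             # null statistic t_j^b
  })
  rowMeans(abs(t.null) > abs(t.obs))           # p.hat(j) = (1/B') * sum_b I(|t_j^b| > |t_j|)
}
```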

5 Simulation Study

5.1 Setup

To explore and compare the performance of our regularization and variance stabilization procedure, we consider a simulation study in which the data, referred to as the synthetic dataset, mimic a typical real-data situation where (i) a complex mean-variance dependency exists, and (ii) the variances are unequal across sample groups.

The most popular parametric models for the variance function in the high-throughput data analysis literature include the constant coefficient of variation model and the quadratic variance-mean model (reviewed in [30]). In the latter model, the variance is assumed to be a quadratic function of the mean, to account for the commonly observed positive dependency of the variance on the mean. In addition, to overcome the problem of low response values (e.g. low measured intensity signals) relative to the background, Rocke and Durbin [20], Chen et al. [4] and Strimmer [26] specified additive and multiplicative error components for this model.

Considering the previous model in a multi-group situation, the individual response $Y_{i,j}$ can be written for each variable j ∈ {1, …, p} and each group k ∈ {1, …, G} as (8a), where, for group k, $\mu_{k,j}$ is the true group mean, $\nu_k^2$ is the true group variance, $\alpha_k$ represents the group mean background noise (i.e. the mean response value of unexpressed variables in expression experiments), and $\rho_k$ and $\nu_k$ are error coefficients; $\varepsilon_{i,j}$ and $\eta_{i,j}$ are the error terms, assumed independent and identically distributed as N(0, 1) [20]. Following Papana et al.'s setup [19], we also considered a slight variation of model (8a), in which the mean background noise $\beta_k$ of group k is now subject to the multiplicative error (8b):

$$Y_{i,j} = \alpha_k + \mu_{k,j} \cdot e^{\rho_k \eta_{i,j}} + \nu_k \cdot \varepsilon_{i,j} \qquad \text{for } \{i : k_i = k\}$$
(8a)
$$Y_{i,j} = \mu_{k,j} + (\mu_{k,j} + \beta_k) \cdot e^{\rho_k \eta_{i,j}} + \nu_k \cdot \varepsilon_{i,j} \qquad \text{for } \{i : k_i = k\}$$
(8b)

In both models, the independence and normality assumptions on the error terms are made for convenience. This is a reasonable assumption in practice [20]. It can be shown from model (8b), and similarly from model (8a), that either model ensures two things simultaneously:

  • the sample variance of a variable is proportional to the square of its mean: indeed, using the delta method, one can derive the expectation and variance of the individual response $Y_{i,j}$ under the current assumptions, from which it follows that:

    $$\begin{aligned} \mathrm{Var}(Y_{i,j}) &= \{E(Y_{i,j}) - \mu_{k,j}\}^2 \, \frac{\mathrm{Var}(e^{\rho_k \eta_{i,j}})}{\{E(e^{\rho_k \eta_{i,j}})\}^2} + \nu_k^2 = \{E(Y_{i,j}) - \mu_{k,j}\}^2 \, \left(e^{\rho_k^2} - 1\right) + \nu_k^2 \\ &\Longrightarrow \mathrm{Var}(Y_{i,j}) \propto \{E(Y_{i,j})\}^2 \qquad \text{for } \{i : k_i = k\} \end{aligned}$$

    In fact, for small values of $\eta_{i,j}$ the signal $Y_{i,j}$ is approximately normally distributed, while for large values of $\eta_{i,j}$ the signal $Y_{i,j}$ is approximately log-normally distributed:

    $$\begin{cases} Y_{i,j} \;\underset{\eta_{i,j} \to 0}{\approx}\; 2\mu_{k,j} + \beta_k + \nu_k \varepsilon_{i,j} \;\underset{\eta_{i,j} \to 0}{\sim}\; N\!\left(2\mu_{k,j} + \beta_k,\ \nu_k^2\right) \\[1ex] Y_{i,j} \;\underset{\eta_{i,j} \to \infty}{\approx}\; (\mu_{k,j} + \beta_k)\, e^{\rho_k \eta_{i,j}} \;\underset{\eta_{i,j} \to \infty}{\sim}\; \mathrm{Lognormal}\!\left[\log(\mu_{k,j} + \beta_k),\ \rho_k^2\right] \end{cases}$$

    When $\eta_{i,j}$ falls between these two extremes, all terms in model (8a) or (8b) play a significant role [20]. In this case, the signal $Y_{i,j}$ is approximately distributed as a linear combination of both distributions. What this means for the response is that (i) for small values of $\eta_{i,j}$ its variance is approximately independent of its mean, by the property of the normal distribution (known to be the only distribution for which the standard deviation is independent of its mean); (ii) while for large values of $\eta_{i,j}$ its variance is approximately proportional to its squared mean.

  • the sample variances of a variable across groups are unequal for k ≠ k′:

    $$\mathrm{Var}(Y_{i,j}) = \begin{cases} (\mu_{k,j} + \beta_k)^2\, \mathrm{Var}(e^{\rho_k \eta_{i,j}}) + \nu_k^2 & \text{for } \{i : k_i = k\} \\ (\mu_{k',j} + \beta_{k'})^2\, \mathrm{Var}(e^{\rho_{k'} \eta_{i,j}}) + \nu_{k'}^2 & \text{for } \{i : k_i = k'\} \end{cases} \;\Longrightarrow\; \mathrm{Var}(Y_{i : k_i = k,\, j}) \neq \mathrm{Var}(Y_{i : k_i = k',\, j})$$

In this simulation, we consider a balanced two-group situation (G = 2) from, e.g., model (8b), with sample sizes $n_1 = n_2 = 5$ and dimensionality p = 1000 variables. Using a Bernoulli distribution with probability parameter 1/5, we selected 20% of the variables as significant, as follows. Let $\mathbf{d}^{T} = [d_j]_{j=1}^{p}$ be the p-vector of indicators of significant variables, where $d_j \sim \mathrm{Bernoulli}(1/5)$ for j ∈ {1, …, p}. With 80% probability (corresponding to non-significant variables) we set $\{\mu_{1,j} = \mu_{2,j} = 0\}_{\{j : d_j = 0\}}$, while the other 20% of the time (corresponding to significant variables) $\{\mu_{1,j}\}_{\{j : d_j = 1\}}$ and $\{\mu_{2,j}\}_{\{j : d_j = 1\}}$ were independently sampled from exponential densities with means $\lambda_1$ and $\lambda_2$ respectively, where $\lambda_1$ and $\lambda_2$ were independently sampled from the uniform distribution U(1, 10): $\{\mu_{1,j}\}_{\{j : d_j = 1\}} \sim \mathrm{Exp}(\lambda_1)$ and $\{\mu_{2,j}\}_{\{j : d_j = 1\}} \sim \mathrm{Exp}(\lambda_2)$, with $\lambda_1 \sim U(1, 10)$ and $\lambda_2 \sim U(1, 10)$. For our simulation, we set $\beta_1 = \beta_2 = 15$, $\rho_1 = 0.1$, $\rho_2 = 0.2$, $\nu_1 = 1$ and $\nu_2 = 3$. In this particular setting, even though we set a common mean background noise ($\beta_1 = \beta_2$), because $\nu_1 \neq \nu_2$ and $\rho_1 \neq \rho_2$ this represents a realistic situation where variances are unequal across groups. The following subsections describe results for the second model (8b) only.
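
For concreteness, one synthetic dataset following this design, under model (8b) as written above and with the parameter values stated in the text, can be drawn with the R sketch below; it is an illustration of the setup rather than the authors' simulation code.

```r
## Simulate one synthetic dataset from model (8b): two balanced groups, 20% significant variables
set.seed(1)
p <- 1000; n1 <- n2 <- 5
beta <- c(15, 15); rho <- c(0.1, 0.2); nu <- c(1, 3)

d      <- rbinom(p, 1, 1/5)                   # indicator of truly significant variables
lambda <- runif(2, 1, 10)                     # group-specific exponential means
mu <- matrix(0, 2, p)                         # true group means mu_{k,j}
mu[1, d == 1] <- rexp(sum(d), rate = 1 / lambda[1])   # Exp with mean lambda_1
mu[2, d == 1] <- rexp(sum(d), rate = 1 / lambda[2])   # Exp with mean lambda_2

sim.group <- function(k, nk) {                # draw the n_k samples of group k from (8b)
  eta  <- matrix(rnorm(nk * p), nk, p)        # multiplicative errors eta_{i,j} ~ N(0, 1)
  eps  <- matrix(rnorm(nk * p), nk, p)        # additive errors epsilon_{i,j} ~ N(0, 1)
  mu.k <- matrix(mu[k, ], nk, p, byrow = TRUE)
  mu.k + (mu.k + beta[k]) * exp(rho[k] * eta) + nu[k] * eps
}
Y   <- rbind(sim.group(1, n1), sim.group(2, n2))      # n x p response matrix
grp <- rep(1:2, c(n1, n2))
```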

5.2 Standardization and Transformation Results

We compared our Mean-Variance Regularization and variance stabilization procedure (hereafter referred to as MVR) to several standardization, normalization or variance stabilization transformations, using exploratory and diagnostic plots: (i) log transformation of the data, hereafter referred to as LOG; (ii) Papana & Ishwaran's CART Variance Stabilization and Regularization [19] (CVSR); (iii) the generalized log transformation $\mathrm{glog}_2(e^{b} \cdot x + a) + c$, where $\mathrm{glog}_2(u) = \log_2\!\left(u + \sqrt{u^2 + 1}\right) = \frac{1}{\log(2)}\operatorname{arcsinh}(u)$, as described e.g. in Huber et al. [13] (VSN); (iv) a robust locally weighted regression [5] (LOESS); (v) a natural cubic smoothing splines transformation [31] (CSS); (vi) the quantile normalization method, which is designed to combine the features of quantile normalization and, supposedly, variance stabilization at the same time [3] (QUANT).
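
For reference, the generalized-log transform used by VSN can be written as a small helper; in this sketch the offsets a, b, c are left as plain arguments, whereas VSN would normally calibrate them from the data, and the function name glog2.transform is illustrative only.

```r
## Generalized log transform: glog2(exp(b) * x + a) + c, with glog2(u) = asinh(u) / log(2)
glog2.transform <- function(x, a = 0, b = 0, cc = 0) {
  u <- exp(b) * x + a
  log2(u + sqrt(u^2 + 1)) + cc
}
```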

The similarity statistic profiles of Figure 1 show that the true number of variable clusters per sample group was estimated to be 14 and 7 for sample groups #1 (G1) and #2 (G2), respectively. Notice that beyond these estimates, the goodness of fit of the transformed data relative to the hypothesized underlying parent distribution degrades; this can be viewed as a form of over-fitting. Using the estimated true numbers of variable clusters found for each sample group (Figure 1), we derived population mean and variance estimates according to the multi-group scheme of Subsection 2.3 and used them to standardize the variables within the clusters.

Figure 1. First column: similarity statistic profiles giving the estimated number of variable mean-standard deviation clusters by sample group G1 and G2 in the input space of the synthetic dataset (log-transformed data). The K-means partitioning clustering algorithm was run with 100 random start seedings. Red arrows indicate the results of the stopping rule (i.e. the smallest value of l for which $\widehat{\mathrm{Gap}}_p(l)$ is minimal up to one standard deviation). Second and third columns: distributions of means and standard deviations before and after multi-group mean-variance regularization for the synthetic dataset (log-transformed data). Model (8b) shown; results without log transformation are similar.

The result of the transformation of the data in Figure 1 clearly shows that the standardization algorithm is effective in terms of centering and scaling the data. Note, however, that because this standardization is performed cluster-wise (i.e. not variable-wise), MVR does not intend to transform the data to achieve exact variable z-scores or mean-0, standard deviation-1 distribution parameters.

In addition, after a successful variance stabilization transformation, the variance should be approximately independent of the mean. Note especially how the variance increases as a function of the mean in the raw data (Figure 2), mimicking a real data situation (data not shown), and how adaptive regularization techniques, including ours, perform very well. In contrast, the comparatively poor performance of the VSN transformation is surprising, given that this non-parametric procedure is specifically designed to estimate mean-variance relationships of the form described in models (8a) and (8b). Figure 2 also makes amply clear that a standard normalization procedure does not necessarily stabilize the variance across variables.

Figure 2. Mean-SD scatter-plots for the synthetic dataset. The success of variance stabilization is usually assessed by M/A plots or Mean-SD plots. The Mean-SD scatter-plot allows one to visually verify whether the variance depends on the mean. Plotted are the pooled sample standard deviations across groups (pooled sd) against the ranks of the pooled sample means across groups (rank(pooled mean)) for each individual variable j under the various transformations and variance stabilization procedures. The black dotted curve depicts the running median estimator (equal window-span of 0.5 for all procedures). If there is no variance-mean dependence, then this curve should be approximately horizontal. Model (8b) shown; results without log transformation are similar.

To emphasize the point, we tested the hypotheses that the true distributions of the transformed mean and standard deviation of each variable after a local adaptive regularization follow those of data under an arbitrary centered homoscedastic model. Using the previous notations, in the case of standard normality N(0, 1) of the data and assuming independence of the first two moments, the theoretical sampling distributions of the mean $\hat\mu_j$ and standard deviation $\hat\sigma_j$ of each variable j follow a standard normal N(0, 1) distribution and the square root of a $\chi^2_{n-G}/(n - G)$ distribution, respectively. The tested null hypotheses were that the empirical (continuous) distributions of the transformed mean $\hat\mu_j$ and standard deviation $\hat\sigma_j$ of each variable j follow the theoretical null distributions, assuming the data come from, e.g., a reference standard Gaussian distribution N(0, 1):

$$\begin{cases} H_0^1 : \mu_1 = \mu_2 = \cdots = \mu_p = 0 \\ H_0^2 : \sigma_1 = \sigma_2 = \cdots = \sigma_p = 1 \end{cases}$$

We carried out these tests under several transformation or regularization procedures, using, e.g., a one-sample two-sided Kolmogorov-Smirnov test statistic ($D_n$; Table 1).
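
The two goodness-of-fit tests can be sketched in R as below, under the assumption stated above that the null distribution of the transformed standard deviations is the square root of a chi-square with n − G degrees of freedom divided by n − G; the function name ks.fit is illustrative.

```r
## One-sample two-sided KS tests of H0^1 (means ~ N(0,1)) and H0^2 (sds ~ sqrt(chi2_{n-G}/(n-G)))
ks.fit <- function(Y.std, grp) {
  n <- nrow(Y.std); G <- length(unique(grp))
  m <- colMeans(Y.std)                               # transformed per-variable (pooled) means
  v <- Reduce(`+`, lapply(unique(grp), function(k) { # pooled within-group variance, eq. (4)
    Yk <- Y.std[grp == k, , drop = FALSE]
    (nrow(Yk) - 1) * apply(Yk, 2, var)
  })) / (n - G)
  D1 <- ks.test(m, "pnorm")$statistic                # H0^1: means follow N(0, 1)
  ## if S ~ sqrt(chi^2_nu / nu) with nu = n - G, then P(S <= q) = pchisq(nu * q^2, nu)
  D2 <- ks.test(sqrt(v), function(q) pchisq((n - G) * q^2, df = n - G))$statistic
  c(H01 = unname(D1), H02 = unname(D2))
}
```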

Table 1

One-sample two-sided Kolmogorov-Smirnov test statistics $D_n$ for the null hypothesis on the transformed means ($H_0^1$) and the null hypothesis on the transformed standard deviations ($H_0^2$) under several transformation or regularization methods. Note that $D_n \in [0, 1]$, and a larger value indicates more evidence against the null hypothesis.

        RAW     LOG     MVR     CVSR    VSN     LOESS   CSS     QUANT
H01     1.000   1.000   0.095   0.526   1.000   1.000   1.000   1.000
H02     0.966   0.978   0.093   0.084   0.894   0.995   1.000   0.991

We also estimated the sample quantiles of the observed transformed pooled means and standard deviations $\{(\hat\mu_j, \hat\sigma_j)\}_{j=1}^{p}$ of each variable j ∈ {1, …, p} and those expected under our reference distribution (Figure 3). Notice the goodness of fit of the standard deviations in Table 1 for the MVR and CVSR regularization procedures only, and in the corresponding quantile-quantile plots of standard deviations (bottom row of Figure 3). These results clearly confirm that standard transformation procedures tend to violate the homoscedasticity assumption across variables, and may thereby be inappropriate for making inferences. The test also confirms the poor improvement of the variance stabilizing transformation (VSN) over the other transformations.

5.3 Regularized Test Statistics Results

Here, we used the previously simulated dataset from model (8b) to assess the performance of our regularized variance estimator and its derived t-test statistic (tMVR) against standard and various modified t-test statistics: (i) Welch's two-sample unequal-variances (tREG) t-statistic; (ii) Papana & Ishwaran's CART-Variance Stabilized and Regularized (tCVSR) t-statistic [19]; (iii) Baldi et al.'s Hierarchical Bayesian Regularized (tHBR) t-statistic [1]; (iv) Efron's Empirical Bayes Regularized (tEBR) z-statistic [10]; (v) Tusher et al.'s regularized SAM (tSR) t-statistic [25, 29]; (vi) Smyth's Bayesian Regularized (tBR) t-statistic [21]; and (vii) Cui et al.'s James-Stein shrinkage-estimator-based Regularized (tJSR) t-statistic [6]. The new statistic is also compared to standard Welch's two-sample unequal-variances t-test statistics computed under the previous common variance stabilization and/or normalization procedures (Subsection 5.2), denoted tLOESS, tCSS, tQUANT and tVSN. Here, the performance is assessed in terms of classification errors, i.e. the statistical power to discriminate truly significant variables from truly non-significant ones, as measured by:

$$\begin{aligned} \text{False Positive: } \widehat{FP} &= \#\{\text{variable called significant} \mid \text{variable is truly non-significant}\} \\ \text{False Negative: } \widehat{FN} &= \#\{\text{variable called non-significant} \mid \text{variable is truly significant}\} \\ \text{Total Misclassification: } \widehat{M} &= \widehat{FP} + \widehat{FN} \end{aligned}$$

We ordered all the variables by their t-test statistics. Because each test/procedure has a different cutoff value for identifying significant variables, comparisons were calibrated by using the top significant variables, ranked by the absolute value of their t-test statistics (e.g. variables whose t-test statistic falls in the top 20%; see Section 5). Note that this is not equivalent to comparing the significant tests to the truth, across the various t-test statistics, at a common significance level α. Table 2 reports the False Positive, False Negative, and Total Misclassification Monte Carlo estimates, based on B = 128 replicated synthetic datasets generated according to model (8b). Note that by using a probability of success (1/5) for the number of significant tests in each simulation (Section 5), we avoid the restriction that the number of tests found significant must equal the number of truly significant ones (i.e. that the estimated model size equals the true model size). This also avoids the restriction that the number of False Positives equal the number of False Negatives in a two-class situation.
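
The classification-error counts for one replicate can be sketched as follows, assuming the truth indicator d from the simulation design and the top-20% calling rule used for calibration above; the function name misclass is illustrative.

```r
## False Positive, False Negative and Total Misclassification counts for one replicate
misclass <- function(t.stat, d, top = 0.20) {
  called <- abs(t.stat) >= quantile(abs(t.stat), probs = 1 - top)  # top 20% by |t|
  FP <- sum(called  & d == 0)     # called significant, truly non-significant
  FN <- sum(!called & d == 1)     # called non-significant, truly significant
  c(FP = FP, FN = FN, M = FP + FN)
}
```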

Table 2

False Positive ($\widehat{FP}$), False Negative ($\widehat{FN}$) and Total Misclassification ($\widehat{M}$) Monte Carlo estimates (s.e.) for each procedure, based on B = 128 replicated synthetic datasets on the raw and log scales. Model (8b) shown. With the previous abbreviations (Subsection 5.2): LOG: log-transformed scale; RAW: untransformed scale; tREG: regular two-sample unequal-variances (Welch) t-test; tMVR: Mean-Variance Regularized t-test; tCVSR: CART Variance Stabilization Regularized t-test; tHBR: Baldi's Hierarchical Bayesian Regularized t-test; tEBR: Efron's Empirical Bayes Regularized z-test; tSR: SAM Regularized t-test; tBR: Smyth's Bayesian Regularized t-test; tJSR: Cui et al.'s James-Stein Regularized t-test; tVSN: VSN-transformed t-test; tLOESS: LOESS-transformed t-test; tCSS: CSS-transformed t-test; tQUANT: QUANT-transformed t-test.

            tREG          tMVR          tCVSR         tHBR          tEBR          tSR
LOG  FP     66.5 (1.5)    59.2 (1.8)    71.5 (1.4)    58.8 (1.4)    66.3 (1.5)    71.1 (1.4)
     FN     66.6 (1.5)    59.3 (1.8)    71.6 (1.5)    58.9 (1.4)    66.3 (1.5)    71.1 (1.5)
     M     133.1 (2.7)   118.5 (3.4)   143.0 (2.6)   117.7 (2.6)   132.6 (2.7)   142.2 (2.6)
RAW  FP     64.3 (1.3)    53.8 (1.6)    55.6 (1.3)    59.2 (1.3)    61.5 (1.3)    59.7 (1.2)
     FN     64.6 (1.1)    54.1 (1.4)    55.9 (1.1)    59.5 (1.1)    61.8 (1.1)    60.0 (1.0)
     M     128.9 (2.2)   107.8 (2.8)   111.5 (2.2)   118.7 (2.2)   123.3 (2.1)   119.6 (2.0)

            tBR           tJSR          tVSN          tLOESS        tCSS          tQUANT
LOG  FP     68.1 (1.4)    68.0 (1.5)    60.7 (1.6)    69.3 (1.7)    68.5 (1.5)    69.9 (1.6)
     FN     68.2 (1.5)    68.1 (1.5)    60.8 (1.7)    69.4 (1.8)    68.6 (1.6)    70.0 (1.6)
     M     136.2 (2.7)   136.1 (2.7)   121.5 (3.1)   138.7 (3.3)   137.0 (2.9)   139.9 (2.9)
RAW  FP     56.5 (1.3)    56.4 (1.3)    60.5 (1.2)    66.2 (1.5)    65.7 (1.5)    67.8 (1.5)
     FN     56.8 (1.1)    56.7 (1.1)    60.8 (1.1)    66.5 (1.5)    66.0 (1.3)    68.1 (1.4)
     M     113.4 (2.2)   113.0 (2.2)   121.3 (2.2)   132.7 (2.8)   131.6 (2.6)   135.9 (2.6)

Overall, our Mean-Variance Regularized t-test statistic (tMVR) outperforms all other t-test statistics when both the raw and log scales are considered simultaneously (Table 2). Also notice in Table 2 the loss of accuracy occurring: (i) with common normalization procedures that do not guarantee a stabilization of the variance (tREG, tLOESS, tCSS, or tQUANT vs. e.g. tVSN, tMVR, or tCVSR); (ii) with global variance stabilization procedures as compared to adaptive local regularization techniques (tVSN vs. e.g. tMVR or tCVSR); (iii) between regularization techniques that use different population mean estimates (tMVR vs. tCVSR, tHBR, tEBR, tSR, tBR, tJSR). Overall, this shows how sensitive these inferences are to a loss of power due to small sample sizes and to violations of assumptions in dependency and heteroscedastic situations.

5.4 Competitive Variance Stabilization & Regularization Methods

Of all the normalization procedures commonly used in omics data preprocessing, Huber et al. and Durbin et al.'s variance stabilization procedure (VSN, Subsection 5.2) yields classification errors in a tVSN test statistic comparable to those of the regularized test statistics tMVR or tCVSR (Table 2). Yet, surprisingly, VSN stabilizes the variance poorly in this simulation (Figures 2, 3), especially for the higher means (Figure 2). In general, regularization procedures such as CVSR, HBR, EBR, SR, BR, and JSR, as well as MVR, tend to be simultaneously the most efficient at stabilizing the variance (Figures 2, 3) and the most powerful in hypothesis testing (Table 2). These regularization procedures therefore turn out to be the only true competitors to each other.

While the CVSR and MVR regularizations produce near-perfect variance stabilization results (Figures 2, 3), the regularized test statistic tMVR nearly systematically outperforms tCVSR, tHBR, tEBR, tSR, tBR, and tJSR (Table 2). To elucidate what makes the MVR procedure more efficient than its counterparts, we compared Monte Carlo estimates of the regularized test statistics tMVR and, e.g., tCVSR, from B = 128 replicated synthetic datasets. A striking pattern arises when plotting the absolute values of the test statistics |tMVR| vs. |tCVSR| and their quantiles against each other (Figure 4). What can be gathered from these plots is that the tMVR test statistic is systematically larger in absolute value than tCVSR for the variables that are truly significant (blue dots, Figure 4), and conversely for the variables that are truly non-significant (red dots, Figure 4).

Figure 4. Comparison of the performance of our regularized test statistic tMVR with its best competitor, tCVSR, on the simulated dataset. Shown are Monte Carlo estimates of the regularized test statistics in absolute value, |tMVR| and |tCVSR|, based on B = 128 replicated synthetic datasets on the log scale. Model (8b) shown. Black solid line: identity line. Left: scatter-plot of |tMVR| vs. |tCVSR| test statistics. Middle: quantile-quantile plot of |tMVR| vs. |tCVSR| for non-significant variables (red dots). Right: quantile-quantile plot of |tMVR| vs. |tCVSR| for significant variables (blue dots). Blue and red dashed lines are LOESS curves with a span of 0.3.

In the joint estimation of population means and variances, the population estimates used in tMVR tend to shift away from the identity line in opposite directions for the truly significant vs. truly non-significant variables. This is not so for tCVSR. The reason is that a variance estimator used alone does not necessarily lead to a more powerful test. This points to the recent work of Wang et al., who showed that because the true means are unknown and estimated with few degrees of freedom, naive methods that use the sample mean in place of a better estimate are generally biased because of the errors-in-variables phenomenon [30]. This is precisely what we observe for the means after transformation by CVSR but not by MVR: notice the lack of fit of the transformed means under CVSR vs. MVR (top row of Figure 3; the same holds in real data, not shown). Therefore, when the mean and variance are jointly included in the regularization procedure, both the numerator and the denominator of the tMVR test statistic tend to be less sensitive to bias, which ultimately translates into more statistical power.

6 Conclusion

To avoid unrealistic assumptions and pitfalls in inferences in high dimensional data, parameter estimation must be done carefully by taking into account the mean-variance dependency structure and the lack of degrees of freedom. Our non-parametric MVR regularization procedure performs well under a wide range of assumptions about variance heterogeneity among variables or sample groups in the multi-group situation, and avoids by nature the problem of model mis-specification. In addition, it performs as well on either raw or log scales, which makes it altogether robust, versatile and highly promising.

The improved performance of our joint regularized shrinkage estimators builds on Charles Stein's inadmissibility results [22, 23] and on Tong and Wang's recent extension [28]. Essentially, when p ≥ 3 parameters are estimated simultaneously from a multivariate normal distribution with unknown mean vector, a combined estimator can be more accurate than any estimator that handles the parameters separately, in that there exist alternative estimators which always achieve lower risk under a quadratic loss function (i.e. Mean Squared Error), even though the parameters and the measurements might be statistically independent [22]. Stein later showed that the ordinary decision rule for estimating a single variance of a normal distribution with unknown mean is also inadmissible [23]. In addition, Tong and Wang recently showed that Stein's result for multiple means [22] extends to multiple variances as well [28]. Our work represents a direct consequence of these combined inadmissibility results, in that the standard sample variance estimator is improved by a shrinkage estimator (i) when the information in the sample mean is known or used [23], and/or (ii) when regularization is used [28].

From these joint regularized shrinkage estimators, we showed that regularized t-like statistics offer significantly more statistical power in hypothesis testing than their standard sample counterparts, than regular common-value shrinkage estimators, or than statistics that simply ignore the information contained in the sample mean. This result is a direct consequence of the strong mean-variance dependency and of the size/shape inherent to high-dimensional data. If one wants to jointly estimate the means and the variances in this type of data, the number of parameters to be estimated simultaneously can be as large as 2p. As Charles Stein noted, the possible improvement over the usual estimators can be quite substantial when p is large or p > n [23].

Acknowledgments

We thank Dr. Hemant Ishwaran for helpful discussion and for providing the R code of his CART Variance Stabilization and Regularization procedure (CVSR). This work was supported in part by the National Institutes of Health [P30-CA043703 to J-E.D., R01-GM085205 to J.S.R.]; and the National Science Foundation [DMS-0806076 to J.S.R.].

Footnotes

Conflict of Interest: None declared.

References

1. Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics. 2001;17:509–19.
2. Bickel D. Microarray gene expression analysis: Data transformation and multiple comparison bootstrapping. In: 34th Symposium on the Interface, Computing Science and Statistics. Vol. 34. Montreal, Quebec, Canada; 2002. pp. 383–400.
3. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–93.
4. Chen Y, Kamat V, Dougherty ER, Bittner ML, Meltzer PS, Trent JM. Ratio statistics of gene expression levels and applications to microarray data analysis. Bioinformatics. 2002;18:1207–15.
5. Cleveland W. Robust locally weighted regression and smoothing scatterplots. J Amer Stat Assoc. 1979;74:829–836.
6. Cui X, Hwang JT, Qiu J, Blades NJ, Churchill GA. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005;6:59–75.
7. Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica. 2002;12:111–139.
8. Efron B, Tibshirani R. An Introduction to the Bootstrap. London: Chapman & Hall/CRC; 1993.
9. Efron B, Tibshirani R. On testing the significance of sets of genes. The Annals of Applied Statistics. 2007;1:107–129.
10. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Amer Stat Assoc. 2001;96:1151–1160.
11. Ge Y, Dudoit S, Speed TP. Resampling-based multiple testing for microarray data analysis. Test. 2003;12:1–77.
12. Good I. Extensions of the concept of exchangeability and their applications. J Modern Appl Statist Methods. 2002;1:243–247.
13. Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002;18(Suppl 1):S96–104.
14. Ishwaran H, Rao JS. Spike and slab gene selection for multigroup microarray data. J Amer Stat Assoc. 2005;100:764–780.
15. Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee JK. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics. 2003;19:1945–51.
16. Ji H, Wong WH. TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics. 2005;21:3629–36.
17. Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med. 2003;22:3899–914.
18. Lonnstedt I, Speed TP. Replicated microarray data. Statistica Sinica. 2002;12:31–46.
19. Papana A, Ishwaran H. CART variance stabilization and regularization for high-throughput genomic data. Bioinformatics. 2006;22:2254–61.
20. Rocke DM, Durbin B. A model for measurement error for gene expression arrays. J Comput Biol. 2001;8:557–69.
21. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3: Article 3.
22. Stein C. Inadmissibility of the usual estimator for the mean of a multivariate distribution. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. University of California Press; 1956. pp. 197–206.
23. Stein C. Inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean. Ann Inst Stat Math. 1964;16:155–160.
24. Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. J R Statist Soc B. 2003;66:187–205.
25. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100:9440–5.
26. Strimmer K. Modeling gene expression measurement error: a quasi-likelihood approach. BMC Bioinformatics. 2003;4:10.
27. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Statist Soc B. 2001;63:411–423.
28. Tong T, Wang Y. Optimal shrinkage estimation of variances with applications to microarray data analysis. J Amer Stat Assoc. 2007;102:113–122.
29. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98:5116–21.
30. Wang Y, Ma Y, Carroll RJ. Variance estimation in the analysis of microarray data. J R Statist Soc B. 2009;71:425–445.
31. Workman C, Jensen LJ, Jarmer H, Berka R, Gautier L, Nielser HB, Saxild HH, Nielsen C, Brunak S, Knudsen S. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol. 2002;3: research0048.
32. Wright GW, Simon RM. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics. 2003;19:2448–55.
33. Dazard JE, Xu H, Santana A. Contributed R package MVR: Mean-Variance Regularization. The Comprehensive R Archive Network. 2011. https://cran.r-project.org/web/packages/MVR/index.html