Proc Am Stat Assoc. Author manuscript; available in PMC 2016 Jan 25.
Published in final edited form as:
Proc Am Stat Assoc. 2011 Jul-Aug; 2011: 3849–3863.
PMCID: PMC4725579
NIHMSID: NIHMS746138
PMID: 26819572

R package MVR for Joint Adaptive Mean-Variance Regularization and Variance Stabilization

Abstract

We present an implementation in the R language for statistical computing of our recent non-parametric joint adaptive mean-variance regularization and variance stabilization procedure. The method is specifically suited for handling difficult problems posed by high-dimensional multivariate datasets (p ≫ n paradigm), such as in ‘omics’-type data, among which are that the variance is often a function of the mean, variable-specific estimators of variances are not reliable, and test statistics have low power due to a lack of degrees of freedom. The implementation offers a complete set of features including: (i) normalization and/or variance stabilization function, (ii) computation of mean-variance-regularized t- and F-statistics, (iii) generation of diverse diagnostic plots, (iv) synthetic and real ‘omics’ test datasets, (v) computationally efficient implementation, using C interfacing, and an option for parallel computing, (vi) manual and documentation on how to set up a cluster. To make each feature as user-friendly as possible, only one subroutine per functionality is to be handled by the end-user. It is available as an R package, called MVR (‘Mean-Variance Regularization’), downloadable from CRAN.

Keywords: R package, Parallel Programming, Mean-Variance Estimation, Regularization and Variance Stabilization, Regularized Test-statistics, High-Dimensional Data

1 Joint Adaptive Mean-Variance Regularization

1.1 Scope - Motivation

We introduce an implementation in the R language of a regularization and variance stabilization method for parameter estimation, normalization and inference on data with many continuous variables. In a typical setting, this method applies to high-dimensional high-throughput ‘omics’-type data, where the number of variable measurements or input variables (gene, peptide, protein, etc …) hugely dominates the number of samples (the so-called p ≫ n paradigm). The data may be any kind of continuous covariates.

In high-dimensional settings, it is common to deal with the following issues:

  1. A severe lack of degrees of freedom, generally due to tiny sample sizes, where usual variable-wise estimators lack statistical power [29, 31, 34, 36] and lead to false positives [9, 35].

  2. Spurious correlation and collinearity between many variables (p ≫ 1), in part due to the nature of the data, but mostly due to an artifact of dimensionality (see [4, 11] for a detailed discussion). In addition, False Discovery Rates (FDR) increase in part because of the regression-to-the-mean effect induced by correlated parameter estimates [19].

  3. Variables in high-dimensional data recurrently exhibit a complex mean-variance dependency with standard deviations severely increasing with the means [15, 28], while statistical procedures usually assume their independence.

Since statistical procedures rely at least in part on the above assumptions for making inferences, these issues become crucial. They make usual assumptions unrealistic, usual moment estimators unreliable (generally biased and inconsistent), and inferences inaccurate. The goal of this software is to implement an approach that achieves lower estimation errors of mean and variance population parameters and more accurate inferences in high-throughput data.

1.2 Estimation Issues in High-Dimensional Setting

A large majority of authors have used regularization techniques for estimating population parameters in high-dimensional data. The premise is that because many variables are measured simultaneously, it is likely that most of them will behave similarly and share similar parameters. The idea is to take advantage of the parallel nature of the data by borrowing information (pooling) across similar variables to overcome the problem of lack of degrees of freedom.

Non-parametric regularization techniques for variance estimation have shown that shrinkage estimators can significantly improve the accuracy of inferences. Shrinkage estimation was used by Wright & Simon [37]; Jain et al. [20]; Cui et al. [5]; Ji & Wong [21]; and Tong & Wang [34]. The idea of borrowing strength across variables was also recently exploited in geneset enrichment analyses [8]. Shrinkage estimators have also been successfully combined with empirical Bayes approaches, where posterior estimators have been shown to follow distributions with augmented degrees of freedom, greater statistical power, and far more stable inferences in the presence of few samples [2, 22, 23, 29]. In a similar vein, shrinkage estimation was also used to generate “moderated” t- and F-test statistics. There, variable-specific variance estimators are inflated by an overall offset, derived with or without Bayesian approaches [9, 29, 32, 35], or by a James-Stein-based shrinkage estimator [5].

A commonality to all previous methods is that (i) they focus on variance estimation alone, and (ii) they involve shrinkage of the sample variance towards a global value, which is used for all variables. In our method, based on the idea of joint adaptive shrinkage (see next and [6]), we explain in more detail why, in large datasets (when p ≫ n), inferences based on common-global-value shrinkage estimators or on variable-specific estimators are not reliable, mostly due to violation of the mean-variance independence assumption, overfitting [5, 9, 18, 19, 27, 35] and/or lack of degrees of freedom [29, 32, 34, 36].

1.3 Idea

Let $y_{i,j}$ be the individual response (expression level, signal, intensity, …) of variable j ∈ {1, …, p} (gene, peptide, protein, …) in sample i ∈ {1, …, n}, and let $\hat{\mu}_j = \frac{1}{n}\sum_{i=1}^{n} y_{i,j}$ and $\hat{\sigma}_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_{i,j}-\hat{\mu}_j)^2$ be the usual sample mean and sample variance estimates, respectively. We propose an alternative type of shrinkage, more akin to joint adaptive shrinkage, which combines the following two ideas:

  1. Borrow information across variables in a local manner, i.e. look for local homogeneity of parameters of interest across variables to group them in clusters, using as few clusters as possible, and where each cluster has a unique parameter value. This is an important example of regularization.

  2. Use the information contained in the estimated population mean μ̂j to get a better estimate of the population standard deviation σ̂j (and vice-versa) for all j ∈ {1, …, p}. This is a direct consequence of Charles Stein's second result on the inadmissibility of the usual estimator of a single variance when the mean is unknown [30], which has recently been extended to the case of multiple variances in the presence of unknown multiple means [34].

By combining the above ideas of joint estimation and of local pooling of information from similar variables, we generate joint adaptive regularized shrinkage estimators of the mean and the variance. How we achieve this specifically is described in more detail in [6], and is the purpose of this software.

1.4 Mean-Variance Regularization in Single or Multiple Group Designs

Our approach is to simultaneously (i) borrow information across (similar) variables and (ii) use the information contained in the estimated population mean, by performing a bi-dimensional clustering of the variables in the mean-variance parameter space. By identifying those clusters where variables tend to have similar location and scale parameter estimates, one can derive cluster-pooled versions of these estimates, which, in turn, can be used to standardize each variable individually within them.

Suppose that the variables assume a certain cluster configuration 𝒸 with C clusters. The goal is to find the clusters $\{\mathcal{C}_l\}_{l=1}^{C}$, from which cluster means of sample variances and of sample means are used as estimators shared by all variables within a cluster. This is an important type of joint and local regularization technique of the mean and variance parameters, which we coined Joint Adaptive Mean-Variance Regularization. Denoting by {μ̂(lj), σ̂2(lj)} the cluster mean of sample means and the cluster mean of sample variances for j ∈ {1, …, p}, where $l_j \in \{\mathcal{C}_l\}_{l=1}^{C}$ denotes the cluster membership indicator of variable j, we use these within-cluster shared estimates to standardize each response variable j ∈ {1, …, p} individually. Note that in the case of multiple (G) sample groups, denoted $\{\mathcal{G}_k\}_{k=1}^{G}$, a refined cluster configuration 𝒸 is generated after merging the cluster configurations $\{𝒸_k\}_{k=1}^{G}$ from all groups k ∈ {1, …, G}. Special expressions of μ̂(lj), σ̂2(lj) are derived in the case of multi-group designs (see [6] for details).

Clustering in our method can be done using any algorithm. Given a clustering algorithm, a major challenge in every cluster analysis is the estimation of the true number of clusters in the data. To estimate the true number l̂ of clusters in the combined set $\{\hat{\mu}_j, \hat{\sigma}_j^2\}_{j=1}^{p}$, we designed a measure of significance, called the Similarity Statistic and denoted $\mathrm{Sim}_p(l)$, for each cluster configuration 𝒸 with l clusters [6]. Our objective criterion determines l̂ by minimizing the Similarity Statistic up to one standard deviation, using the usual one-standard-deviation rule [14] (Algorithm 1):


Algorithm 1 Joint Adaptive Mean-Variance Regularization

1. for l = 1 to C do
  • Select an appropriate variable cluster configuration 𝒸 with l clusters (it is not unique, due to the NP-hardness of the clustering problem; see subsection 3.2)

  • Standardize the variables within each cluster of $\{\mathcal{C}_l\}_{l=1}^{C}$ with the estimates {μ̂(lj), σ̂2(lj)}

  • Compute the corresponding Similarity Statistic estimate $\widehat{\mathrm{Sim}}_p(l)$ as detailed in [6]

2. The optimal cluster configuration 𝒸 with l̂ clusters is the one giving the smallest Similarity Statistic up to one standard deviation: $\hat{l} = \min_{l}\left[\operatorname{argmin}_{l}\{\widehat{\mathrm{Sim}}_p(l)\}\right]$
3. Re-standardize the variables using this optimal cluster configuration 𝒸. Afterwards, all means and variances of the transformed data are assumed to match the target moments, i.e. {0, 1} respectively, up to sampling variability (see [6]).

In practice, the user is required to specify the range l ∈ {1, …, C} of numbers of clusters over which the Similarity Statistic is estimated. Empirically, we observed that a range of, say, {1, …, 30} is sufficient in most datasets where p < 10000, after log2-transformation, and if the signal/noise ratio is not too low. In other situations, larger values of nc.max may be required. The cluster diagnostic end-user function in the R package { MVR} helps make this assessment accurately (see subsection 2.2.3 and [7]). In short, these plots help determine whether a large enough number of clusters has been explored to find the optimal cluster configuration.

Over/under-regularization (i.e. over/under-fitting) is directly dependent on how many clusters are explored in the search for the optimal cluster configuration. A cluster configuration with too small a number of clusters would lead to under-fitting, while over-fitting would start to occur at larger numbers (see Figure 1). Empirically-determined factors at play are (among others): (i) the dimensionality p (larger p ⇒ larger l̂), (ii) the signal/noise ratio (a lower ratio tends to create some over-regularization), and (iii) a prior log2-transformation, which tends to create some under-regularization. The advantage of the Similarity Statistic is that it works well even if the estimate is l̂ = 1, where most other methods are usually undefined [33]. On the other hand, the Similarity Statistic tends to be conservative and profiling it may be computationally intensive (see subsection 3.1).

Figure 1. Typical Similarity Statistic profile giving the estimated number of clusters for the mean-variance clustering. The vertical red arrow indicates the result of the stopping rule: the smallest number l̂ of clusters for which the Similarity Statistic is minimal up to one standard deviation. Horizontal red arrows indicate the directions of over/under-fitting.

1.5 Regularized Test Statistics

In high-dimensional data, there is often a relatively small number of samples, mostly due to the costly nature of assessing many variables simultaneously. As a result, standard test statistics that use conventional parameter estimators usually have low power and are unreliable (see for instance [5, 12, 23, 29, 31]). As mentioned above, our Joint Adaptive Mean-Variance Regularization intends in part to overcome this problem of lack of degrees of freedom [6]. Using the previous notations, consider the case of a multi-group design with, say, G groups of samples (and C clusters of variables). Using our joint regularized estimates of the population mean and variance of each variable, one can not only stabilize the variance, but also derive so-called joint regularized test statistics in an attempt to improve statistical power. For instance, in the particular case of a two-sample-group problem (G = 2), one can define a Mean-Variance Regularized, unequal group variance, t-like test statistic, further denoted t-MVR, as follows:

$$t\text{-}\mathrm{MVR}_j = \frac{\hat{\mu}(l_{1,j}) - \hat{\mu}(l_{2,j})}{\sqrt{\dfrac{\hat{\sigma}^2(l_{1,j})}{n_1} + \dfrac{\hat{\sigma}^2(l_{2,j})}{n_2}}}$$
(1)

where $l_{k,j}$ is the cluster membership indicator of variable j in the lth cluster and kth group, μ̂(lk,j) is the cluster mean of the group sample means, and σ̂2(lk,j) is the unbiased cluster mean of the group sample variances for variable j in group k (see [6] for details).
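
To make equation (1) concrete, the computation it describes can be sketched in a few lines of R; the function name t.mvr and the toy numbers below are ours (not part of the package), where in practice the cluster-pooled quantities are produced by the package itself (section 2):

# Toy illustration of equation (1); all names and values below are hypothetical.
# mu1, mu2   : cluster means of the group sample means for variable j
# s2.1, s2.2 : cluster means of the group sample variances for variable j
# n1, n2     : sample sizes of the two groups
t.mvr <- function(mu1, mu2, s2.1, s2.2, n1, n2) {
  (mu1 - mu2) / sqrt(s2.1 / n1 + s2.2 / n2)
}
t.mvr(mu1 = 1.2, mu2 = 0.4, s2.1 = 0.9, s2.2 = 1.1, n1 = 6, n2 = 8)  # approx. 1.49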

1.6 Summary

Our Joint Adaptive Mean-Variance Regularization procedure provides accurate population parameter estimation by taking into account the mean-variance dependency structure and the lack of degrees of freedom in high dimensional data. The procedure performs well under a wide range of assumptions about variance heterogeneity between variables or between sample groups in multi-group designs. It also performs as well on either raw or log scales, which makes it altogether robust, versatile and promising [6].

In addition, the procedure avoids unrealistic assumptions and pitfalls when making inferences. Using our joint regularized shrinkage estimators, we showed that Mean-Variance Regularized t-like statistics offer significantly more statistical power in hypothesis testing than their standard sample counterparts, than regular common-value shrinkage estimators, or than when the information contained in the sample mean is simply ignored [6].

These results are a direct consequence of the strong mean-variance dependency and of the size/shape inherent to high-dimensional data. Our method benefits from the combined results on C. Stein's inadmissibility and shrinkage estimators, in that standard estimators are improved (i) when the information in the sample mean is known or used [30], and (ii) when regularization is used [34].

2 Package Overview

2.1 Package Design

The R package { MVR} implements our non-parametric method for joint adaptive mean-variance regularization and variance stabilization of high-dimensional data [7]. It is particularly suited for handling “omics”-type datasets generated by high-throughput technologies. Key features include:

  • Normalization and/or variance stabilization of the data

  • Computation of mean-variance-regularized t- and F-statistics

  • Generation of diverse diagnostic plots

  • Synthetic and Real test datasets

  • Computationally efficient implementation, using C interfacing, and an option for parallel computing, for a fast and easy experience in the R environment.

The following describes the end-user functions (entered in R-like syntax), which are needed for running and testing a complete Mean-Variance Regularization and Variance Stabilization procedure. Other internal subroutines are not to be called by the end-user at any time (see the manual of [7]). For computational efficiency, end-user regularization functions offer the option to configure a cluster; this is indicated by an asterisk (*). For further details on each function/dataset and how to set up a Rocks™ cluster (http://www.rocksclusters.org/), see the manual and vignette of the { MVR} package [7]. The R functions and datasets are currently categorized as follows:

  1. END-USER REGULARIZATION & VARIANCE STABILIZATION FUNCTION

    mvr (.) (*)
    

  2. END-USER REGULARIZED TESTS-STATISTICS FUNCTIONS

    mvrt.test (.) (*)
    

  3. END-USER DIAGNOSTIC PLOTS FOR QUALITY CONTROL

    cluster.diagnostic (.)
    target.diagnostic (.)
    stabilization.diagnostic (.)
    normalization.diagnostic (.)
    

  4. END-USER DATASETS

    Synthetic
    Real
    

2.2 Details of End-User Functions

2.2.1 mvr(.)

This is the unique end-user function for Mean-Variance Regularization and Variance Stabilization by Similarity Statistic under the sample-group homoscedasticity or heteroscedasticity assumption. It returns an object of class “mvr”. The function takes advantage of the R package { snow}, which allows users to create a cluster of workstations on local or remote machines, enabling parallel execution of this function and scaling up with the number of CPU cores available. See the vignette of the { MVR} package and [7] for more details. It is used as follows:

mvr(data,
    block = rep(1,nrow(data)),
    tolog = FALSE,
    nc.min = 1,
    nc.max = 30,
    probs = seq(0, 1, 0.01),
    B = 100,
    parallel = FALSE,
    conf = NULL,
    verbose = TRUE)

Below are the details of the arguments used in the mvr (.) end-user function:

  • data is a numeric matrix of untransformed (raw) data, where samples are by rows and variables (to be clustered) are by columns, or an object that can be coerced to such a matrix (such as a numeric vector or a data.frame with all numeric columns). Missing values (NA), Not-a-Number values (NaN) and infinite values (Inf) are not allowed.

  • block is a character or numeric vector, or a factor, giving the grouping/blocking variable. It defaults to the single-group situation, i.e. to the assumption of equal variance between sample groups. It must be of length equal to the sample size, with as many distinct character or numeric values as the number of levels or sample groups. All group sample sizes must be greater than 1.

  • tolog is a logical scalar specifying whether the data ought to be log2-transformed first (optional, defaulting to FALSE). Note that negative or zero values will be changed to 1 before the log2-transformation is taken.

  • nc.min is a positive integer scalar of the minimum number of clusters, defaults to 1.

  • nc.max is a positive integer scalar of the maximum number of clusters, defaults to 30. See Figure 1 and discussion in 1.4 and 2.2.3 for more details.

  • probs is a numeric vector of probabilities for quantile diagnostic plots. Defaults to seq(0, 1, 0.01).

  • B is a positive integer scalar of the number of Monte Carlo replicates computed in the internal similarity statistic function sim.dis (.) (see { MVR} package in [7]). It is reset to conf$cpus*ceiling(B/conf$cpus) in case the cluster is used (i.e. conf argument below is non NULL), where conf$cpus denotes the total number of CPUs to be used (see below).

  • parallel is a logical scalar specifying whether parallel computing ought to be performed (optional, defaulting to FALSE).

  • conf is a list of parameters for cluster configuration. These are inputs of the function makeCluster (.) of the R package { snow} for cluster setup. Optional, defaults to NULL. For details see the parallelization subsection 3.3 and the { MVR} package in [7].

  • verbose is a logical scalar specifying whether the output ought to be verbose (optional, defaulting to TRUE).
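
For illustration, a minimal call sketch is given below; the simulated matrix X and grouping vector groups are our own toy inputs (not part of the package), laid out as documented above (samples by rows, variables by columns):

library(MVR)

set.seed(1234)
n <- 10                                   # samples (rows)
p <- 500                                  # variables (columns)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
groups <- rep(1:2, each = n / 2)          # two sample groups of size 5

fit <- mvr(data = X,
           block = groups,                # omit for the single-group case
           tolog = FALSE,
           nc.min = 1,
           nc.max = 15,
           probs = seq(0, 1, 0.01),
           B = 100,
           parallel = FALSE,
           conf = NULL,
           verbose = TRUE)                # returns an object of class "mvr"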

2.2.2 mvrt.test(.)

This is the unique end-user function for computing the Mean-Variance Regularized t-like test statistic and its significance (p-value) under the sample-group homoscedasticity or heteroscedasticity assumption. It returns an object of class “mvrt.test”. The function takes advantage of the R package { snow}, which allows users to create a cluster of workstations on local or remote machines, enabling parallel execution of this function and scaling up with the number of CPU cores available. See the vignette of the { MVR} package and [7] for more details. It is used as follows:

mvrt.test(data,
          obj = NULL,
          block,
          tolog = FALSE,
          nc.min = 1,
          nc.max = 30,
          pval = FALSE,
          replace = FALSE,
          n.resamp = 100,
          parallel = FALSE,
          conf = NULL,
          verbose = TRUE)

In addition to the above (2.2.1), the following arguments apply to mvrt.test (.):

  • obj is an object of class “mvr” returned by mvr (2.2.1). It defaults to NULL. To save unnecessary computations, a previously computed MVR clustering from mvr (2.2.1) can be provided through this option. If obj is fully specified as an “mvr” object, the arguments data, block, tolog, nc.min and nc.max are ignored, and the MVR clustering provided by obj will be used for the computation of the Mean-Variance Regularized t-like test statistic.

  • block is a character or numeric vector, or a factor, giving the grouping/blocking variable. It must be of length equal to the sample size, with as many distinct character or numeric values as the number of levels or sample groups. The number of sample groups must be greater than or equal to 2 and all group sample sizes must be greater than 1.

  • pval is a logical scalar specifying whether p-values ought to be computed; if not, n.resamp and replace will be ignored. If pval=FALSE (the default), only the test statistic will be computed. If pval=TRUE, exact (permutation test) or approximate (bootstrap test) p-values will be computed.

  • replace is a logical scalar specifying whether a permutation test (the default) or a bootstrap test ought to be computed. If replace=FALSE (the default), a permutation test will be computed with a null permutation distribution. If replace=TRUE, a bootstrap test will be computed with a null bootstrap distribution.

  • n.resamp is a positive integer scalar of the number of resamplings to be done (default 100), by permutation or bootstrap. In case the cluster is used (i.e. conf is non-NULL), it is reset to conf$cpus*ceiling(n.resamp/conf$cpus), where conf$cpus denotes the total number of CPUs to be used.
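
Continuing the toy sketch of subsection 2.2.1 (the objects X, groups and fit are ours), regularized test statistics with permutation p-values could be requested as follows:

# Reuse the previously fitted "mvr" object to avoid recomputing the clustering;
# when obj is supplied, data/block/tolog/nc.min/nc.max are ignored (see above).
tst <- mvrt.test(data = X,
                 obj = fit,
                 block = groups,
                 pval = TRUE,             # permutation p-values since replace = FALSE
                 replace = FALSE,
                 n.resamp = 100,
                 parallel = FALSE,
                 conf = NULL,
                 verbose = TRUE)          # returns an object of class "mvrt.test"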

2.2.3 Cluster Diagnostic

This is the first of a series of four end-user functions for plotting diagnostic plots. The function plots Similarity Statistic profiles and empirical quantile profiles of means and standard deviations to check that the optimal cluster configuration has been reached.

The quantile diagnostic plots check how close the empirical quantiles of the first and second moments of the MVR-transformed data are to those of their respective theoretical null distributions for each cluster configuration. The optimal cluster configuration found is indicated by the most horizontal red dotted curve, which should converge towards its target null (solid green line); before it, under-fitting occurs, and after it, over-fitting starts to occur (the single-cluster configuration, corresponding to no transformation, is the most vertical curve, while the largest cluster-number configuration reaches horizontality). The quantile diagnostic plots use theoretical null distributions for the mean and variance, assuming normality of the data under the null. We do not require normality of the data in general; this is just a convenient distribution to draw from. The subroutine internally generates null distributions of the data with target mean 0 and standard deviation 1 (e.g. N(0, 1)). Under the assumption of standard normality and independence of the data, the theoretical null distributions of the means and of the standard deviations are respectively N(0, 1) and $\sqrt{\chi^2_{n-G}/(n-G)}$, where G denotes the number of sample groups (see [6] for more details). Both cluster diagnostic plots in Figure 2 help determine whether the values of the nc.min and nc.max parameters have been set appropriately in the mvr (.) (2.2.1) or mvrt.test (.) (2.2.2) functions. The minimum of the Similarity Statistic profile has to be reached within the range nc.min:nc.max; otherwise, run the procedure again with a wider range until this is the case.
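
As an illustration (not package code), the theoretical null quantiles referred to above can be written down directly in R; the values of n, G and probs below are placeholders matching the notation of this subsection:

# Theoretical null quantiles used by the diagnostic, assuming standard normal,
# independent data: means ~ N(0, 1) and standard deviations ~
# sqrt(chi-square with (n - G) degrees of freedom, divided by (n - G)).
n <- 10; G <- 2
probs <- seq(0, 1, 0.01)
q.mean <- qnorm(probs)
q.sd   <- sqrt(qchisq(probs, df = n - G) / (n - G))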

Figure 2. Similarity Statistic profiles (top) showing the optimal clustering configuration for an example with a single-group design. The red dashed line depicts the LOESS scatterplot smoother. The optimal configuration is indicated by the red arrow. Empirical quantile profiles of means (middle) and standard deviations (bottom) for each clustering configuration (black dotted lines) check that the distributions of the first and second moments of the MVR-transformed data approach their respective theoretical null distributions under a given cluster configuration. Notice how the empirical quantiles (dashed red lines) of transformed pooled means and transformed pooled standard deviations converge to their theoretical null distributions (solid green lines) for the optimal configuration.

cluster.diagnostic (obj,
                    title = "",
                    span = 0.75,
                    degree = 2,
                    family = "gaussian",
                    device = NULL,
                    file = "Cluster Diagnostic Plots")

In this and the subsequent diagnostic plot functions, the arguments are as follows:

  • obj is an object of class “ mvr” returned by mvr 2.2.1.

  • title is the title of the plot (defaults to the empty string).

  • pal is any color palette, e.g. as provided by R package { RColorBrewer}.

  • span is the span parameter of the loess (.) function (R package { stats}), which controls the degree of smoothing (defaults to 0.75).

  • degree is the degree parameter of the loess (.) function (R package { stats}), which controls the degree of the polynomials to be used (normally 1 or 2, defaults to 2). Degree 0 is also allowed, but see the note in { stats} package.

  • family is the distribution family, in {“gaussian”, “symmetric”}, of the loess (.) function (R package { stats}), used for local fitting. If “gaussian”, fitting is by least squares; if “symmetric”, a re-descending M-estimator is used with Tukey's biweight function.

  • device is the graphic display device in {NULL, “PS”, “PDF”} (defaults to NULL, i.e. the screen). Currently implemented graphic display devices are “PS” (Postscript) or “PDF” (Portable Document Format).

  • file is the file name for output graphic (defaults to appropriate name depending on the function). The option file is used only if device is specified (i.e. non NULL).
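
For instance, reusing the fitted object fit from the sketch of subsection 2.2.1, the cluster diagnostic plots could be written to a file as follows (a sketch only; the device and file values are one possible choice):

# device = "PDF" sends the plots to the file named by 'file' instead of the screen.
cluster.diagnostic(obj = fit,
                   title = "Synthetic example",
                   span = 0.75,
                   degree = 2,
                   family = "gaussian",
                   device = "PDF",
                   file = "Cluster Diagnostic Plots")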

2.2.4 Target Diagnostic

This is the second end-user diagnostic plot function for plotting comparative density distribution of means and standard deviations of the data before and after Mean-Variance Regularization. It checks for location shifts between observed first and second moments and their expected target values under a centered homoscedastic model.

Notice how the transformed pooled means and transformed pooled standard deviations distribute nicely around their target moments (0, 1). In the general case, the variables are not normally distributed, and not even independent and identically distributed, before and/or after MVR-transformation. Therefore, the actual distributions of untransformed and/or MVR-transformed first and second moments are usually unknown and differ from their respective theoretical distributions, i.e., from N(0, 1) for the means, and from $\sqrt{\chi^2_{n-G}/(n-G)}$ for the standard deviations, where G denotes the number of sample groups (see [6] for more details). This is reflected in the QQ plots, where observed quantiles do not necessarily align with theoretical values.

target.diagnostic(obj,
                  title = "",
                  device = NULL,
                  file = "Target Moments Diagnostic Plots")

2.2.5 Stabilization Diagnostic

This is the third end-user diagnostic plot function for plotting comparative variance-mean scatters. It checks the variance stabilization across the means for all variables before and after Mean-Variance Regularization.

stabilization.diagnostic(obj,
                         title = "",
                         span = 0.5,
                         degree = 2,
                         family = "gaussian",
                         device = NULL,
                         file = "Stabilization Diagnostic Plots")

In the plots of standard deviations vs. means, the standard deviations and means are pooled versions, calculated in a feature-wise manner. The scatterplot allows one to visually verify whether the standard deviation (or variance) depends on the mean. If there is no such dependence, the LOESS smoother line should be approximately horizontal. Notice the change of scale between the untransformed data and the MVR-transformed data.
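
The idea behind this plot can also be checked directly on the raw data; the sketch below (illustration only, not package code, reusing the simulated matrix X of subsection 2.2.1) draws the feature-wise mean vs. standard deviation scatter with a smoother:

# Feature-wise (column-wise) means and standard deviations of the raw data,
# with a lowess smoother to reveal any mean-variance dependency.
mu <- colMeans(X)
sdev <- apply(X, 2, sd)
plot(mu, sdev, pch = 20, cex = 0.3,
     xlab = "Feature mean", ylab = "Feature standard deviation")
lines(lowess(mu, sdev), col = "red")   # roughly horizontal if there is no dependency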

2.2.6 Normalization Diagnostic

This is the fourth end-user diagnostic plot function for plotting comparative Box-Whisker and heatmap plots of variables across samples. It checks the effectiveness of normalization before and after Mean-Variance Regularization.

normalization.diagnostic(obj,
                         title = "",
                         pal,
                         device = NULL,
                         file = "Normalization Diagnostic Plots")
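
Analogously to the cluster diagnostic, the three remaining diagnostics can be called on the same fitted object; in this sketch (assuming the defaults shown above suffice) the RColorBrewer palette is just one possible choice for the pal argument:

library(RColorBrewer)

target.diagnostic(obj = fit)
stabilization.diagnostic(obj = fit)
normalization.diagnostic(obj = fit, pal = brewer.pal(9, "YlOrRd"))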

3 Implementation Design

3.1 Code Profiling

To speed up the overall runtime of our program, our R package calls two internal functions, withinsumsq (.) and km.clustering (.), both of which use a modified version of MacQueen's K-means clustering algorithm [24] as currently implemented in the C interface of the R distribution (see the R function kmeans (.) and its internal K-means clustering C function R_kmeans_MacQueen (.) from the R package { stats}). Our C function, used in place of the standard R distribution's, has the following signature (for the complete code see the MVRc.c source file in the { MVR} package [7]):

void MVR_withinsumsq(int* pn,
                     int* pp,
                     int* pk,
                     int* pB,
                     double* lWk_bo,
                     int* pnstart,
                     int* pmaxiter,
                     int* perror)

The C function MVR_withinsumsq (.) overcomes the overhead of generating random seeds for K-means clustering and improves on the performance of the R function kmeans (.) from { stats}. This is achieved (i) by generating random seeds internally, within the .C interface, instead of outside of it as is currently done with the R function sample (.) in the { stats} package, and especially (ii) by avoiding redundancy checks of elements/rows in the input data, as is currently done with the R function unique (.) of the { base} package. The fact that the latter function is admittedly computationally expensive (see the R documentation of unique (.) in { base}) is of particular concern when multiple calls to it must be made (as in our case), e.g. to generate multiple random datasets by re-sampling techniques. However, even after rewriting the R function kmeans (.) of { stats} in C, profiling of the R and C code shows that most of the CPU time is spent in K-means clustering (Table 1). Note that i.i.d. N(0, 1) deviates were generated from a source of uniformly distributed random numbers using the Box-Muller transformation [3]. Independent uniformly distributed random numbers were generated with a precision of 10⁻¹⁶ from a source of pseudo-random integers in the range 0 to 2¹⁵ using the standard C++ function rand (.).
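
For reference, an R-level profile similar to the one summarized in Table 1 can be gathered with base R's profiler; the sketch below is our own and reuses the toy objects X and groups of subsection 2.2.1:

# Profile a single serial run of mvr(.) and list the subroutines that dominate
# total CPU time (compare with Table 1).
Rprof("mvr-profile.out")
fit <- mvr(data = X, block = groups, nc.min = 1, nc.max = 15)
Rprof(NULL)
head(summaryRprof("mvr-profile.out")$by.total, 7)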

Table 1

R and C code profiling results, without parallelization, on an example from the synthetic dataset in a multi-group (2) design, assuming unequal variance between groups and for { nc.min:nc.max} = {1, …, 15}. Results are ordered by total CPU time of the internal subroutines of the primary end-user function mvr (.). Only the top 7 are shown. For details see the { MVR} package in [7]. The profiling of .C (.) is broken down into CPU time for K-means clustering (K-means) and for random number and random seed generation (RNG), respectively.

Subroutine         | total time (s) | total time (%) | self time (s) | self time (%)
mvr (.)            | 71.61          | 100.00         | 00.01         | 00.01
mvrt (.)           |                |                |               |
MeanVarReg (.)     | 70.55          | 98.52          | 00.02         | 00.03
.C (.)             | 65.61          | 91.62          | 65.61         | 91.62
    K-means        | (65.27)        | (99.48)        |               |
    RNG            | (00.34)        | (00.52)        |               |
sim.dis (.)        | 61.51          | 85.90          | 00.00         | 00.00
withinsumsq (.)    | 55.41          | 77.38          | 00.00         | 00.00
km.clustering (.)  | 11.85          | 16.55          | 00.00         | 00.00

To further speed up the program, the K-means clustering step could be optimized by using recently developed fast K-means clustering algorithms. These algorithms have been claimed to achieve worthwhile speedups for a variety of data distributions, including uniformly distributed data. For instance, Charles Elkan reports gains greater than 1.50 in up to 1000 dimensions [10], and Greg Hamerly reports that his algorithm performs competitively in up to 50 dimensions [13]. It has been noted, however, that little or no acceleration can be achieved in K-means clustering for uniformly distributed data in high dimensions. As Moore stated: ‘If there is no underlying structure in the data (e.g. if it is uniformly distributed) there will be little or no acceleration in high dimensions no matter what we do. This gloomy view, supported by recent theoretical work in computational geometry, means that we can only accelerate datasets that have interesting internal structure.’ [17, 26]. While this result is almost certainly true asymptotically as the dimension of a dataset tends to infinity, Elkan claims that a worthwhile speedup can still be obtained on uniform data in up to at least 1000 dimensions [10]. Since our random dataset is sampled from the standard normal reference distribution [6, 7], we believe that there is room for improvement in high dimension, and we plan to implement these fast K-means clustering algorithms in C/C++ in a future version of our R package for further speed gains.

3.2 Computational Complexity Considerations

In our procedure, the clustering of variables relies upon a clustering algorithm, which can be of any type for that matter, such as a partitioning algorithm (like the current default, MacQueen's K-means [24]) or any hierarchical algorithm. The clustering problem is NP-hard in a general Euclidean space of dimensionality d, even for only C = 2 clusters [1], and for a general number of clusters C, even in the d = 2 plane [25].

The computational complexity of our procedure is therefore completely dictated by the computational complexity of the clustering algorithm. If C and d are fixed, the problem can be exactly solved in $O(n^{dC+1}\log n)$ time, where n is the number of entities to be clustered [16]. In our case, the entities to be clustered for profiling the Similarity Statistic are the p variables (so n ← p). This is done for l ∈ {1, …, C} cluster configurations and r random start seedings for each of them, once in the bi-dimensional mean-variance space (d ← 2), and a second time in the sample space (so d ← n). In conclusion, our MVR procedures can be solved in $\sum_{l=1}^{C} O(r\, p^{\,nl+1}\log p) = O(r\, p^{\,nC+1}\log p)$, i.e. in time exponential in n. Fortunately, n is usually small in p ≫ n settings.

3.3 Computational Parallelization

To overcome these computational bottlenecks, one workaround is to take advantage of parallel computing and parallel Random Number Generation (RNG) within an R session. Among the various ways of doing this, we chose to set up a cluster of workstations and use the interface provided by the R package { snow} for its simplicity and flexibility (see e.g. the designer's webpage at http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html). Among other things, it allows (i) the specification of individual local and remote host machines, (ii) multiple cluster configurations as long as slave hosts are not shared, (iii) heterogeneous local and/or remote machines, and (iv) PVM, MPI, and socket types of cluster communication mechanisms. Note that the actual creation of the cluster, its initialization, and its stopping are all done internally by our end-user functions mvr (.) and mvrt.test (.). In addition, when random number generation is needed, separate streams of parallel random numbers (SPRNG) are created per node internally, by distributing the stream states to the nodes.

To run a parallel session of the { MVR} procedures mvr (.) and mvrt.test (.), the argument parallel is to be set to TRUE, and the argument conf is to be specified (i.e. non-NULL). It must list the specifications of the following parameters for cluster configuration: ("cpus", "type", "homo", "script", "outfile"), matching the arguments and options described in the function makeCluster (.) of the R package { snow}. For more details see the vignette of the { MVR} package in [7] and the function makeCluster (.) from the R package { snow}.
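
A plausible configuration for a socket cluster on a single 8-core machine is sketched below; the field values are our own assumptions, and the values actually accepted are those of makeCluster (.) in { snow} (see the { MVR} vignette [7]):

# Hypothetical cluster configuration: 8 CPUs, socket ("SOCK") communication,
# homogeneous hosts, no startup script, default output file.
conf <- list(cpus = 8, type = "SOCK", homo = TRUE, script = "", outfile = "")

fit.par <- mvr(data = X, block = groups,
               nc.min = 1, nc.max = 15,
               parallel = TRUE, conf = conf)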

If p-values are desired in the mvrt.test (.) function ( pval=TRUE), the use of the cluster is highly recommended. Parallel computing is ideal for embarrassingly parallel tasks such as permutation or bootstrap resampling loops. Note that, in order to maximize computational efficiency when p-values are desired, and to avoid multiple configurations (a cluster can only be configured and used in one session at a time, which would otherwise result in a run stop), the cluster configuration is only used for the parallel computation of p-values, not for the MVR clustering computation underlying the regularized test statistics.

3.4 Computational Scalability

We compared the performance of parallel and serial computations and the amount of computational scalability, i.e. parallelization efficiency, that can be achieved. To get an approximate and comparable measure of scalability, we define a dimensionless Scalability Ratio as:

$$SR = \frac{T(N_1)/T(N_2)}{N_2/N_1}$$
(2)

where Ni denotes the number of CPU cores and T(Ni) the total elapsed CPU time of the R process. The index i stands either for a serial computation on a single CPU core (i = 1 and N1 = 1) or for a parallel situation (i = 2, N2 ≥ 2). Here, we used a small Rocks™ cluster (http://www.rocksclusters.org/) for developing and testing our own applications/extensions. It consists of a 4-node cluster (1 master + 3 slaves), each node equipped with 2 Quad processors, amounting to N = 4 × 2 × 4 = 32 CPU cores in total. Results show that there is a fair amount of scalability when using an increasing number of CPU cores, while keeping a balanced load. This is especially true in the case of embarrassingly parallel situations: compare the rows of mvr (.) vs. mvrt.test (.) for either the synthetic or the real dataset test (Table 2). This tends to be moderated by the overhead and networking load when multiple slave nodes are used: compare the columns of N = 8 vs. N = 32 CPU cores (Table 2).
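
As a worked check of equation (2), the Scalability Ratio of the mvrt.test (.) run on the synthetic dataset with N2 = 8 cores (Table 2) can be recomputed as follows:

# T(1) = 2h 6m 22s on one core, T(8) = 0h 16m 29s on 8 cores (Table 2).
T1 <- 2 * 3600 + 6 * 60 + 22      # 7582 seconds
T8 <- 16 * 60 + 29                # 989 seconds
SR <- (T1 / T8) / (8 / 1)         # approx. 0.958, i.e. ~95.8% (Table 2: 95.77%)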

Table 2

Timings (T) and Scalability Ratios (SR) for the synthetic and real datasets in a multi-group (2) design, assuming unequal variance between groups and for { nc.min:nc.max} = {1, …, 15}. The master node was equipped with 64Gb RAM and each slave node with 32Gb RAM, amounting to 160Gb RAM. Each Quad processor is an Intel® Xeon® Processor E5430 2.66GHz (12M Cache, 1333 MHz FSB). For a detailed description of the datasets see the { MVR} package in [7].

Dataset   | Function      | T (h m s), N = 1 | T (h m s), N = 8 (1 master node) | SR (%) | T (h m s), N = 32 (4 nodes) | SR (%)
Synthetic | mvr (.)       | 0h 1m 18s        | 0h 0m 36s                        | 26.7   | 0h 0m 32s                   | 7.50
Synthetic | mvrt.test (.) | 2h 6m 22s        | 0h 16m 29s                       | 95.77  | 0h 5m 24s                   | 73.02
Real      | mvr (.)       | 0h 20m 26s       | 0h 6m 47s                        | 37.61  | 0h 5m 30s                   | 11.61
Real      | mvrt.test (.) | 33h 59m 19s      | 4h 27m 12s                       | 95.40  | 1h 23m 15s                  | 76.54

Figure 3. Density distribution plots of means (top) and standard deviations (bottom) checking that the means and standard deviations of the MVR-transformed data have the correct distributions around their target moments (0, 1). Captions show the p-values from the parametric two-sample two-sided t-tests for the equality of the expected parameters (black dashed lines) to their observations (red dashed lines). Comparative QQ scatterplots look at the departures of the means (top) and standard deviations (bottom) between their observed distributions after MVR-transformation and their theoretical distributions under a true mean-0, standard-deviation-1 model. Each black dot represents a variable. The red solid line depicts the interquartile line. Captions show the p-values from the nonparametric two-sample two-sided Kolmogorov-Smirnov tests of the null hypothesis that a parameter distribution equals its theoretical distribution.

Figure 4. Mean-Variance (standard deviation) plots of the untransformed data (left) and MVR-transformed data (right). Means are ordered by increasing values. Each red dot represents a variable. The black dotted line depicts the LOESS scatterplot smoother estimator. Notice the change of scale between the untransformed data (left) and MVR-transformed data (right).

Figure 5. Comparative Box-Whisker and heatmap plots of variables across samples. Notice the change of scale between the untransformed data (left) and MVR-transformed data (right).

Acknowledgments

We thank Hemant Ishwaran for helping with the implementation of the R package { MVR} and Alberto Santana for the setup of the Rocks™ cluster. This work was supported in part by the National Institutes of Health [P30-CA043703 to J-E.D., R01-GM085205 to J.S.R.]; and the National Science Foundation [DMS-0806076 to J.S.R.].

Footnotes

Conflict of Interest: None declared.

References

1. Aloise D, Deshpande A, Hansen P, Popat P. NP-hardness of Euclidean sum-of-squares clustering. Mach Learn. 2009;75:245–248.
2. Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics. 2001;17:509–19.
3. Box G, Muller M. A Note on the Generation of Random Normal Deviates. Ann Math Statist. 1958;29:610–611.
4. Cai T, Lv J. Discussion: The Dantzig Selector: Statistical Estimation when p is much larger than n. The Annals of Statistics. 2007;35:2365–2369.
5. Cui X, Hwang JT, Qiu J, Blades NJ, Churchill GA. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005;6:59–75.
6. Dazard JE, Rao JS. Joint Adaptive Mean-Variance Regularization and Variance Stabilization of High Dimensional Data. Comput Statist Data Anal. 2012;56(7):2317–2333.
7. Dazard JE, Xu H, Santana A. Contributed R Package MVR: Mean Variance Regularization. The Comprehensive R Archive Network. 2011. https://cran.r-project.org/web/packages/MVR/index.html.
8. Efron B, Tibshirani R. On Testing the Significance of Sets of Genes. The Annals of Applied Statistics. 2007;1:107–129.
9. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Amer Stat Assoc. 2001;96:1151–1160.
10. Elkan C. Using the Triangle Inequality to Accelerate K-Means. In: Fawcett T, Mishra N, editors. Twentieth International Conference on Machine Learning (ICML-2003). AAAI Press; 2003. pp. 147–153.
11. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Statist Soc. 2008;70:849–911.
12. Ge Y, Dudoit S, Speed TP. Resampling-based Multiple Testing for Microarray Data Analysis. Test. 2003;12:1–77.
13. Hamerly G. Making K-means Even Faster. International Conference on Data Mining (Society for Industrial and Applied Mathematics). 2010:130–140.
14. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer Science; 2009.
15. Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002;18(Suppl 1):S96–104.
16. Inaba M, Katoh N, Imai H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. 10th ACM Symposium on Computational Geometry. 1994:332–339.
17. Indyk P, Amir A, Efrat A, Samet H. Efficient algorithms and regular data structures for dilation location and proximity problems. In: 40th Annual Symposium on Foundations of Computer Science. Los Alamitos, CA: IEEE Computer Society; 1999. pp. 160–170.
18. Ishwaran H, Rao JS. Detecting differentially expressed genes in microarrays using Bayesian model selection. J Amer Stat Assoc. 2003;98:438–455.
19. Ishwaran H, Rao JS. Spike and slab gene selection for multigroup microarray data. J Amer Stat Assoc. 2005;100:764–780.
20. Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee JK. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics. 2003;19:1945–51.
21. Ji H, Wong WH. TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics. 2005;21:3629–36.
22. Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med. 2003;22:3899–914.
23. Lonnstedt I, Speed TP. Replicated microarray data. Statistica Sinica. 2002;12:31–46.
24. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J, editors. Fifth Berkeley Symp on Math Statist and Prob. Vol. 1. Univ. of Calif. Press; 1967. pp. 281–297.
25. Mahajan M, Nimbhorkar P, Varadarajan K. The Planar k-Means Problem is NP-Hard. In: Proceedings of the 3rd International Workshop on Algorithms and Computation. Springer-Verlag; 2009. pp. 274–285.
26. Moore A. The Anchors Hierarchy: Using the Triangle Inequality to Survive High-Dimensional Data. In: Sixteenth Conference on Uncertainty in Artificial Intelligence. San Francisco, CA: Morgan Kaufmann; 2000. pp. 397–405.
27. Papana A, Ishwaran H. CART variance stabilization and regularization for high-throughput genomic data. Bioinformatics. 2006;22:2254–61.
28. Rocke DM, Durbin B. A model for measurement error for gene expression arrays. J Comput Biol. 2001;8:557–69.
29. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article 3.
30. Stein C. Inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean. Ann Inst Stat Math (Springer Netherlands). 1964;16.
31. Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach. J R Statist Soc. 2003;66:187–205.
32. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100:9440–5.
33. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Statist Soc B. 2001;63:411–423.
34. Tong T, Wang Y. Optimal shrinkage estimation of variances with applications to microarray data analysis. J Amer Stat Assoc. 2007;102:113–122.
35. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98:5116–21.
36. Wang Y, Ma Y, Carroll R. Variance estimation in the analysis of microarray data. J R Statist Soc B. 2009;71:425–445.
37. Wright GW, Simon RM. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics. 2003;19:2448–55.