- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

- Journal List
- BMC Bioinformatics
- v.8; 2007
- PMC1929126

# A robust measure of correlation between two genes on a microarray

^{1}Department of Mathematics, Pomona College, Claremont, CA 91711, USA

^{2}Department of Mathematics, Pitzer College, Claremont, CA 91711, USA

^{3}Department of Statistics, University of Nebraska, Lincoln, NE 68588, USA

^{4}Department of Mathematics, Lewis and Clark College, Portland, OR 97219, USA

^{}Corresponding author.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

### Background

The underlying goal of microarray experiments is to identify gene expression patterns across different experimental conditions. Genes that are contained in a particular pathway or that respond similarly to experimental conditions could be co-expressed and show similar patterns of expression on a microarray. Using any of a variety of clustering methods or gene network analyses we can partition genes of interest into groups, clusters, or modules based on measures of similarity. Typically, Pearson correlation is used to measure distance (or similarity) before implementing a clustering algorithm. Pearson correlation is quite susceptible to outliers, however, an unfortunate characteristic when dealing with microarray data (well known to be typically quite noisy.)

### Results

We propose a resistant similarity metric based on Tukey's biweight estimate of multivariate scale and location. The resistant metric is simply the correlation obtained from a resistant covariance matrix of scale. We give results which demonstrate that our correlation metric is much more resistant than the Pearson correlation while being more efficient than other nonparametric measures of correlation (e.g., Spearman correlation.) Additionally, our method gives a systematic gene flagging procedure which is useful when dealing with large amounts of noisy data.

### Conclusion

When dealing with microarray data, which are known to be quite noisy, robust methods should be used. Specifically, robust distances, including the biweight correlation, should be used in clustering and gene network analysis.

## 1 Background

One of the primary goals of experiments involving DNA microarrays is to find genes which are somehow similar across various experimental conditions. "Similar" is usually taken to mean co-expressed, but it can be measured in several different ways. The distance (usually one minus similarity) measure most commonly used is Pearson correlation, though Euclidean distance, cosine-angle metric, Spearman rank correlation, and jackknife correlation are also used frequently. (Note that correlation and cosine-angle metrics do not fulfill the triangle inequality, so they are not true distance metrics. However, they are used to measure distance in many applications.) For example, [1-4] use Pearson correlation in their gene network analysis; [5-13] use Pearson correlation (or a modification) to cluster gene expression data. Once the similarity or distance measure is chosen, the relationship between the genes is given by some sort of clustering algorithm (e.g., k-means, hierarchical clustering, *k *nearest neighbors) or gene network analysis.

Clustering results can be highly dependent on the choice of similarity measure (particularly when comparing genes whose similarities are based on tens of samples instead of comparing samples whose similarities are based on thousands of genes); one or two outlying values can produce large changes in the value of some similarity measures. Outlying data points can be real or noise, though microarray data are known to have substantial noise. The noise can occur during any of the stages in the experimental process, and the effect can be in any direction. For example, a large outlier might cause co-expressed genes to seem dissimilar while a different large outlier might cause dissimilar genes to look co-expressed. Sometimes the outlying value is meaningful and important in which case the data should be included in the correlation. Our flagging procedure lets the practitioner determine whether or not a flagged value should be removed.

The goal in our paper is to give a resistant correlation measure that can be used as a distance metric in any clustering or gene network algorithm which calls for some type of distance or similarity measure in order to identify the relationship between a pair of genes, across gene modules, or within a cluster of genes. Tukey's biweight [14] has been well established as a resistant measure of location and scale for multivariate data [15-17]. When considering 2 genes on *n *samples, the 2 × 2 biweight covariance matrix that results from the biweight measurement of multivariate scale can be thought of as a resistant covariance between two genes (or of *n *points in dimension 2). Translating a 2 × 2 biweight covariance matrix into a biweight correlation measure is simply a matter of taking the biweight covariance divided by the product of the individual gene biweight standard deviations (analogous to computing the Pearson correlation from a standard covariance matrix.) Tukey's biweight is a type of M-estimate, a class of estimators which has been used in robust correlation estimates (for example, Mosteller and Tukey defined the cob [18] and Wilcox defined the percentage bend correlation [19].) M-estimates are consistent estimates of multivariate location and shape, so the biweight correlation is estimating the same parameter as the Pearson correlation. We show that our robust correlation based on the biweight M-estimate is intuitive, flexible, and performs well under a variety of data distributions.

When considering the correlation between each pair of genes, we find that, although the biweight correlation and the Pearson correlation usually agree, when they do not agree, there are often problems with the gene's (or genes') data which may indicate to the biologist that the gene should be removed from further study. Our biweight correlation method provides two novel applications particularly suited to microarray analysis: 1. We have created a similarity measure that is resistant to outlying data points (an important feature in analyzing microarray data), and 2. By investigating gene pairs that have discrepant correlation values, we create a diagnostic procedure to identify values which may need to be flagged (i.e., removed or else further investigated.)

In the remainder of the paper, we provide details of the method and results. First, in section 1.1 we discuss microarrays and their particular need for resistant measures. In section 1.2 we explain Tukey's biweight (its computation is given in the appendix, section 8.) We give our results in section 2, showing that the biweight correlation can be used as a resistant similarity measure or a diagnostic procedure for flagging data. We then demonstrate, in section 2.4, that our method is more efficient than Spearman correlation (another resistant correlation method.) In section 2.5 we show that the biweight correlation has empirically low bias and is superior to other robust measures. We conclude with some ideas of how to further develop our methods for other microarray applications.

### 1.1 Why resistance is important in microarray analysis

Microarray technology requires biologists and statisticians to work side by side in analyzing gene expression information. Gene microarray chips measure, simultaneously, the expression levels of thousands of genes in an organism. Comprehensive gene expression data is useful if one wishes to find clusters of genes with similar function. Microarrays have been used to study the gene expression trends (for example across time) in diseases and even to classify and diagnose different types of diseases, such as cancerous tumors [20]. For some organisms, microarrays enable biologists to monitor the entire genome of interest on a single chip in order to create a large picture of the interactions among thousands of genes simultaneously. A microarray is an orderly arrangement of spots that provides a medium for measuring known and unknown DNA pieces (genes) based on base-pairing rules. Each microarray measures thousands of genes simultaneously, so resulting microarray data is typically on the order of thousands of genes by tens of samples.

Although microarray technology has been very useful in discovering changes in gene expression, limitations of the technology have been observed: dye bias and relative gene expression levels having different sample variances due to differences in experimental conditions [21]; differences due to laboratories and platforms [22,23]; pixel saturation [24]; low signal/noise ratio [25]; and differences due to image analysis techniques [26-29]. Researchers have worked to address the particular problems inherent in microarray analyses, but even after novel techniques (of, for example, normalization or filtering) have been applied, microarray data remain noisy [30,31].

Some work has been done showing the need for resistant correlation metrics as similarity measures. In particular, Heyer et al. give a jackknife correlation that is more resistant than the Pearson correlation. However, as they state in their paper, the jackknife correlation is only resistant to single outliers [32].

### 1.2 Biweight as a resistant correlation measure

Tukey's biweight has been used as a resistant estimator of location and scatter as well as a resistant estimator of regression parameters in a wide range of applications (see [33] for an overview of Tukey's work in resistant statistics). The former approach has been used by Affymetrix to normalize microarray data [34] but not in applications of data distances.

M-estimators are a class of estimators of multi-dimensional location and scatter that provide for flexibility, efficiency, and resistance. The key to M-estimation is the ability of the estimator to down-weight points that are far from the data center with respect to the data scatter. Because of the weighting, M-estimates are more resistant to outlying values than standard estimates (like the mean or the Pearson correlation.) Additionally, M-estimates use the actual data values in constructing location and scatter estimates and are therefore more efficient than estimators based on rank (like the median or the Spearman rank correlation.) M-estimators are defined iteratively using a weight function which down-weights data values that are far from the center of the data. We use the M-estimate of 2-dimensional scatter (i.e., covariance) to calculate a biweight correlation. Details for the biweight are given in the appendix, and R code for the biweight is available from the corresponding author (some of the R code is taken from Wilcox [19].)

An important aspect of M-estimators is their resistance to outlying data values. One measure of the resistance of an estimator is its replacement breakdown, which is the smallest fraction of a data set that one could replace with corrupt data in such a way as to take the estimator over all bounds [17]. Unlike the mean (breakdown = 0) or the median (breakdown close to $\frac{1}{2}$), the biweight is parameterized so that the breakdown can be adjusted over a range of values. Adjusting the breakdown value will have implications in flagging data values (discussed further in section 2.3).

The results of the biweight iteration scheme are a multivariate location estimate, $\tilde{T}$, and shape estimate, $\tilde{S}$. Letting ${\tilde{s}}_{jl}$ be the (*j, l*)^{th }element of $\tilde{S}$, we can think of ${\tilde{s}}_{jl}$ as a resistant estimate of cov (**X**_{j}, **X**_{l}) (where **X**_{j }and **X**_{l }are two *n*-vectors of interest.) Consequently,

is the biweight correlation between vectors *j *and *l *and is a more resistant estimate of correlation than the Pearson correlation (denoted by *r*_{jl}.) Because the components (center and shape parameters) are estimated using resistant techniques (unlike the Pearson correlation), we know the biweight correlation will be more resistant than the Pearson correlation. Note that |${\tilde{r}}_{jl}$| ≤ 1.

Using the biweight correlation ($\tilde{r}$) as a resistant estimate of the correlation measure, we can incorporate $\tilde{r}$ into clustering algorithms which depend on similarities or 1 - $\tilde{r}$ into clustering algorithms that depend on distances. In the next section we will demonstrate that the biweight correlation is clearly a better choice for a distance (or similarity) measure than the Pearson correlation (*r*).

## 2 Results

Because our methods are most valuable when applied to noisy data, we applied our technique to a real microarray data set. The data set was chosen because it has been used widely in clustering applications [5-7] as well as gene network applications [1-3]. The data are taken from an experiment on *Saccharomyces cerevisiae *created to describe yeast genes with periodically varying transcript levels within the cell cycle [35]. The cell cycle data are based on a time course experiment, and so they are not independent and identically distributed (iid.) However, they are typical of many microarray data sets which are also not iid. The data are publicly available from the Stanford Microarray Database (SMD) http://smd.stanford.edu and include 25 samples on over 6000 genes. We kept the default filters from SMD, including using "Log (base2) of R/G Normalized Ratio (Mean)" as our value of interest (that is, we worked with a value that is the log (base2) transformation of the normalized ratio of the average red signal ("R") and the average green signal ("G").) Typically, the red signal measures the amount of gene expression activity under an experimental condition, and the green signal measures the gene expression activity for a control. The value of interest is the relative expression measured by the ratio log_{2 }(R/G). The only additional filtering we did was to eliminate genes that had more than ten missing values (correlation was computed on the remaining values for those genes with minor missing data.) Note, also, that we have applied similar techniques to multiple other independent data sets, and the results are consistent across platforms (e.g., oligonucleotide or cDNA), organisms, and normalization techniques (results not shown.)

### 2.1 Biweight correlation as a resistant similarity measure

To demonstrate the difference between Pearson correlation (PC) and biweight correlation (BWC), we computed both correlations (BWC based on breakdown of 0.2) on all $\left(\begin{array}{c}1000\\ 2\end{array}\right)$ pairs of genes from the top 2 most 1000 variable genes (in terms of standard deviation.) A scatterplot with all $\left(\begin{array}{c}1000\\ 2\end{array}\right)$ pairs of genes is given in figure figure11 (the horizontal axis is BWC, the vertical axis is PC.) The PC and BWC are highly positively correlated, with most of the correlations in relative agreement. However, in the corners and on the edges, we see numerous strong discrepancies between the PC and the BWC. A further investigation into those edge points gives clear evidence of why PC and BWC values differ.

**...**

Before discussing the particular pairs of interest, we will break down the plot into four (not well defined) groups:

1. gene pairs that give "consistent" PC and BWC

2. gene pairs that give "opposite" PC and BWC

3. gene pairs that give PC ≈ 0 and large |BWC|

4. gene pairs that give large |PC| and BWC ≈ 0

We will discuss group 1 further in section 2.3.

In groups 2–4, the inability to consistently measure gene correlation can generate serious problems in clustering algorithms. We argue that for gene pairs in groups 2–4, the BWC is a much better measure of distance than the PC.

Consider points e, j, d, and k from figure figure11 (group 2 points). For each pair of genes, there is an extreme outlying value causing the PC to be manipulated in the outlier's direction. The panel of plots in figure figure22 shows the clear outlying values for each of the points in figure figure11 identified as being in group 2.

*log*

_{2 }(

*R/G*) value for the specified genes on a particular array. The Pearson correlation and the biweight correlation give opposite values for these group 2 pairs. Due to the outlying values, it is clear that the Pearson correlation

**...**

Consider points i, a, b, and c from figure figure11 (group 3 points). For each pair of genes, there appears to be an outlying point which is nullifying the existing (strong) correlation. The panel of plots in figure figure33 shows the existing correlation that has been calculated as low (using PC) because of the outlier(s).

*log*

_{2 }(

*R/G*) value for the specified genes on a particular array. The Pearson correlation is close to zero, and the biweight correlation gives a relatively high value for these group 3 pairs.

Consider points f, g, h, and l from figure figure11 (group 4 points). For each pair of genes (seen in figure figure4),4), there appears to be virtually no correlation though the PC calculates a strong correlation. The highly influential points in group 4 are those that we are most worried about. Points in group 4 will show up as strongly co-expressed in either a cluster or a gene network and will give researchers misleading results. Because of the high dimensionality of microarray data, already we often come across false positive results (even without outlying values). Using BWC instead of PC will help to reduce the number of false positives in any given application that are due to outlying values which produce misleadingly high PC.

### 2.2 Using the biweight correlation to flag low quality data

Our method of comparing PC and BWC can also be used as a data flagging method. For gene pairs that produce a relatively high correlation (by at least one of the methods) and highly discrepant correlations across the two methods, we flag the gene pair for further investigation. Note that we require the gene pair to yield a high correlation value (in either direction), because we are not particularly interested in genes whose value flips, for example, from *r *= 0.3 to $\tilde{r}$ = - 0.2; such a change will not have a strong impact in a clustering scheme because the similarity across the two genes is weak in both measures. We will flag points as outlying if

1. |*r*| > 0.85 OR |$\tilde{r}$| > 0.85

2. AND |*r *- $\tilde{r}$| > 1.0

Depending on how strict one wants to be at flagging possible outlying values, one may want to adjust the cutoffs in the above procedure. The quality of data will affect the size of the absolute difference |*r *- $\tilde{r}$|. Therefore, if a data set has low quality (for example, if GeneSpring output files give evidence of low quality), a resistant metric should be used and/or claims should only be made about genes for which the resistant and non-resistant metrics give similar results.

Using the outlying flag procedure defined above, we identified 12 gene pairs (see figure figure5.)5.) We can see that in the 12 pairs of genes, genes YOL087C, YDR388W, and YGL157W were identified three times each. Genes which get flagged as outlying when compared to multiple other genes are likely to have some poorly measured data or other outlying virtues. In order to find those genes which are repeatedly seen as outlying, we reduce the flag procedure to 1. a single correlation being at least 0.8 and 2. a difference of at least 0.65. We then identify 593 gene pairs (which represent 363 unique genes) as possibly outlying. By tabulating the frequency of the 363 unique genes in the 593 plots of gene pairs, we are able to find genes which are repeatedly identified (eight genes that showed up in at least 20 of the 593 pairs.) The four most common gene outliers (YLR328W, YDR388W, YJL042W, YJL188C) all showed up in our more stringent outlier detection plot (see figure figure5).5). These gene outliers could represent either noise or truly large values. Follow-up experiments or other datasets would be needed to distinguish these two possibilities.

### 2.3 Effect of breakdown in biweight correlation

As mentioned in section 1.2, the breakdown controls the resistance of the estimator. For example, setting a breakdown at 0.2 allows for up to 20% of the data values to be manipulated without being able to take the estimator across all bounds. Naturally, the lower the breakdown, the less resistant the estimator. In our case, a breakdown of zero will give a BWC estimate almost equivalent to the PC (the very slight difference is due to the BWC weight scheme weighting points differently while PC weights every point $\frac{1}{n}$.) In a plot (not shown) of the PC vs. BWC at zero breakdown there are no points in the groups we had previously defined as 2, 3, and 4 (see section 2.1.) Conversely, the higher the breakdown, the more discrepant the PC and BWC will be (data not shown.) A higher breakdown will lead to flagging more genes as possibly low quality. Depending on the noise level of the data, one may want to adjust the breakdown. Ideally, the breakdown should be set only as high as the percentage of data which is outlying. A breakdown of 0.2 gives a good balance between resistance and ability to make use of the bulk of the data in the estimation process. The breakdown value will have an effect on the correlation value that is used as a similarity measure. If the breakdown is not high enough, the metric will not be resistant. If the breakdown is too high, the technique will lose power. The effect of the similarity metric on a clustering or gene network will depend on the particular algorithm. However, if the similarities between two genes is estimated to be 0.9 with the first measure and 0.1 with the second measure (or vice versa, both situations seen in figure figure1),1), we would expect clustering algorithms to link the two genes with the first measure and not with the second (or vice versa.) The effect of non-resistant similarity metrics on clustering results can be disastrous.

### 2.4 Efficiency of the biweight correlation

We have demonstrated that the biweight correlation is effective as a resistant correlation as well as a tool for flagging low quality data (both valuable in analyzing microarray data.) The Spearman correlation, based on the ranks of the data, is also a resistant correlation technique. However, because the biweight incorporates the actual data values (instead of just their order), the biweight correlation is more efficient than the Spearman correlation. The efficiency of the biweight as a location and scale estimator has been well studied [17,36]. Table Table44 gives the efficiencies (versus the Pearson correlation) for both the biweight and the Spearman correlations. The efficiency is calculated from 10,000 bivariate samples of a given size and correlation. The table values are each the ratio of the variance of the biweight (or Spearman) correlation across the 10,000 samples versus the variance of the Pearson correlation across the 10,000 samples. Particularly for high correlations (those in which we are interested), the biweight correlation is substantially more efficient than the Spearman correlation.

Additionally, in a plot of Spearman vs. Pearson correlations (see figure figure6)6) for the yeast data, we see that the Spearman and Pearson correlations are more consistent with each other than the biweight versus Pearson correlations were (see figure figure1)1) for this noisy data. With clean data we would expect the Pearson and biweight to be more consistent than the Pearson and Spearman. However, because the biweight is able to capture the large correlations that are seen as small with Pearson, the biweight actually seems more different from the Pearson than the Spearman does. The efficiency of the biweight correlation helps us discover large correlations that the Spearman correlation measure misses.

### 2.5 Empirical consistency of biweight correlation under non-normal distributions

In order to assess the performance of the biweight correlation under different situations, we ran a series of simulations computing the empirical correlation for each of different true correlations, sample sizes, distributions, and correlation metrics. For a given true correlation, sample size, and distribution a pair of data were simulated; the empirical correlation was then calculated. In each of the simulations except the one-wild, the correlation structure was imposed after the data were simulated. For the one-wild distribution, the data were simulated with the appropriate correlation structure and then the wild observation was substituted randomly. The process was repeated 10,000 times. The average of the 10,000 simulations (with standard deviations) are seen in tables tables1,1, ,2,2, and and3.3. Three of the correlation metrics (Pearson, Spearman, and biweight) have already been discussed. The fourth, the percentage bend correlation [19], is an additional robust correlation metric based on M-estimation using Huber's *Ψ*-function. The percentage bend correlation has an advantage over the Spearman in that it uses the weighting of the M-estimation instead of just the ranks of the data, but it has a disadvantage over the biweight because the percentage bend is not iterative and can fail to be as resistant as the biweight.

The distributions of data are meant to cover a variety of situations. The Lognormal data are skewed; the Beta(2,2) data have light tails. The slash distribution is created by dividing a standard normal deviate by an independent uniform (0,1) deviate and has much heavier tails than normal while being less pathological than the Cauchy distribution. The one-wild distribution is a contaminated standard normal such that one value (in only one dimension) is replaced with a random deviate from a uniform (5,10) distribution. From tables tables1,1, ,2,2, and and33 we make the following observations:

• Pearson correlation is seriously affected by heavy tails (slash distribution) and outliers (one-wild distribution.)

• Spearman and percentage bend correlations are quite resistant, but they tend to underperform the biweight (at any breakdown).

• The biweight correlation performs well consistently across different distributions and sample sizes.

• Though some efficiency is lost when the biweight is compared to the Pearson, the improvement in performance for non-normal data is essential for applications to microarray data.

• The breakdown parameter alters the biweight correlation performance only slightly. As long as the breakdown percentage covers the amount of the contamination, the biweight correlation will have low bias and high efficiency.

### 2.6 Convergence of the biweight

As described in section 1.2, the biweight correlation is an iterative estimator. For a (normal) sample size of 25, it takes about 43 seconds to compute 100,000 pairwise biweight correlations on a Pentium 4, 3GHz computer with 2GB RAM running Windows XP (compared to less than a second for the Pearson correlation and the Spearman correlation, and about 2 seconds for the percentage bend correlation.) Admittedly, the computation time is the shortcoming for the biweight correlation when computing all pairwise correlations across hundreds or thousands of genes. All simulations are done using R and would likely be considerably faster using a different programming language.

As mentioned in the appendix, the biweight correlation is computed by first finding an initial estimate of location and scatter. We have found that initializing the biweight using robust estimates of location and scatter of the median and MAD (Median of the Absolute Deviations from the median) converge to the same biweight estimates as using the Minimum Covariance Determinant [37] location and scatter estimates (which are slightly better multivariate estimates but slower to compute.) Additionally, we have found (simulations not shown) that running the iteration for 5–10 steps gives equivalent results (biweight correlations). Typically, for noisy data, the iteration scheme will take between 10–25 steps to converge fully. For a slash distribution with a sample of size 25, it takes 95 seconds to compute 100,000 pairwise biweight correlations which completely converge and 42 seconds when the iterations are capped at 8.

## 3 Discussion

We have provided a novel resistant estimate of correlation based on a well-known multivariate location estimator of location and scale. Tukey's biweight has been used as a resistant estimator in diverse contexts such as regression, analysis of variance, time series, and control charts to monitor product quality [33] because of its resistance and efficiency properties. We have shown that the biweight is also a powerful technique to use when computing correlations between pairs of genes regardless of whether there is a significant amount of contamination or not.

Additionally, the tuning parameters of the biweight allow for the estimates to be minimally or largely resistant to outlying values; the breakdown of the correlation (which defines the tuning parameters) can be set to allow for a degree of resistance suitable for the analysis. We have found that setting our breakdown to 0.2 works well in most situations.

Not only does the biweight correlation give a resistant measure of correlation, but it also provides a data flagging method that (a) finds pairs of genes which give misleading Pearson correlations, and (b) finds genes that, when compared to many other genes, consistently give misleading Pearson correlations. The data flagging method can be used to improve the accuracy of secondary analyses (e.g., clustering or gene network analyses) and to decrease the rate of false positives. The high dimensionality of microarray data produces a need for automated data cleaning, and we provide one way of examining the data for outlying values before analyses are performed.

Because the biweight estimator is iterative, it is computationally more time intensive than either the Pearson or the Spearman correlation estimates. To save computation time, one might use the biweight as an initial estimate and an outlier detection method, and then progress to one of the other methods for analyses that require computing correlations multiple times in a row.

We have used microarray data to illustrate our methods. However, the methods can easily be applied to any data set, and they will be particularly useful for data sets where there is a large amount of noise and many distance pairs are being calculated. For example, we could use this method on other high throughput data like proteomic or metabolomic data. Additionally, other disciplines with large data sets like Astronomy and Econometrics will also value a robust and systematic procedure for calculating distances. Many supervised discrimination techniques use metrics/statistics which are similar to correlations. For example, Fisher's Linear Discriminant Analysis (LDA) is based on the mahalanobis squared distance. Because the biweight is inherently a multivariate estimator, one could easily use the biweight to measure resistant mahalanobis squared distances to use in LDA. Additionally, popular methods like Classification and Regression Trees (CART) use regression models to partition samples into groups. Biweight regression methods could also be used to make for more resistant partitioning in CART methods.

## 4 Conclusion

Tukey's biweight has been well established as a resistant estimation method in many fields. It has played a small role in the analyses of microarray data. However, the need for resistant methods in microarray data is great, and the biweight is a powerful tool that can provide improved methodology and results in many applications of microarray analyses. The methods shown here use Tukey's biweight to give a robust and efficient estimate of distance between two genes on a microarray.

## 5 Availability

R code is available from the authors as well as in a supplementary file to this article.

## 6 Authors' contributions

JH conceived of the study, participated in its design and coordination, and wrote the manuscript. AM, LH, and BVK contributed to writing the computer code, validating the method, and editing the manuscript.

## 7 Appendix

Multivariate estimates of location and scatter given by constrained M-estimates are defined iteratively using a weight function, *w *(·), to down-weight data values that are far from the bulk of the data. Using initial estimates of the location, $\tilde{T}$, and shape, $\tilde{S}$, we calculate the distance of each point to the center of the data set,

For a given objective function, *ρ *(·), the constraint, parameterized by *k*, (for the constrained M-estimator) is (see [38] for details)

where *b*_{o }= *E *[*ρ *(*d/k*)] and *n *is the number of samples (*d *is defined originally as above in equation (2) and subsequently as below in equation (6), and *k *is found using equation (3) after *b*_{o }and *d*_{j }are determined.) *b*_{o }is given as the product of the specified breakdown and the maximum value of *ρ *[39]. To find *k*, the expected value (*b*_{o}) is calculated under the assumption of multivariate normality (*d *will have a chi-square distribution if the data are normally distributed). Though we retain the convention, we do not presume to think that microarray data are normally distributed. The implication of an incorrect normality assumption will be a breakdown value slightly different from what we set. Because our work does not focus on a particular breakdown value of interest (and instead focuses on the general idea of having a resistant estimation procedure), we are not bothered by a slight miscalculation of the breakdown value.

Subject to the constraint in equation (3), the M-estimators of location and scale are given by the following iterative equations [40,41]

(alternating with equation (3) to determine *k*, note that the value for *k *changes at each iteration) where **X**_{j }is a gene of interest and

*v*(

*d*) =

*d ψ*(

*d*)

The objective function for Tukey's biweight [42] is given by:

Because *E *[*ρ *(·)] is a function of *c *and the breakdown, we can use the Newton-Raphson method to find *c *using *E *[*ρ *(·)] and the breakdown. For example, given a sample in dimension two (*p *= 2), with a breakdown of 0.2, *c *will be 5.07.

The iterative scheme has the potential for the existence of multiple solutions, although multiple solutions essentially never happen in practice. After our iterative scheme has converged, we are left with a location vector and shape matrix. As seen in equation (1), the biweight correlation will be the biweight covariance divided by the product of the individual gene biweight standard deviations.

## Supplementary Material

**Additional file 1:**

R-code for translated biweight. Annotated R code for computing the translated biweight. Includes both code and a worked example.

^{(4.4K, r)}

## 8 Acknowledgements

We thank K. Kafadar for sharing her profound knowledge on Tukey's biweight; S. Horvath for introducing us to the need for resistant correlations in the field of gene network analysis; and D.M. Rocke for insightful conversations and help with R code of the biweight. Funding for AM, LH, and BVK was provided by a Research Experience for Undergraduates grant from the National Science Foundation awarded to the Claremont Colleges.

## References

- Zhang B, Horvath S. A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology. 2005;4 Article 17. [PubMed]
- Davidson G, Wylie B, Boyack K. Cluster Stability and the Use of Noise in Interpretation of Clustering. Proceedings of the IEEE Symposium on Information Visualization. 2001. pp. 23–30.
- Bergmann S, Ihmels J, Barkai N. Similarities and Differences in Genome-Wide Expression Data of Six Organisms. PLOS Biology. 2004;2:85–93. [PMC free article] [PubMed]
- Carter S, Brechbühler C, Griffin M, Bond A. Gene Co-Expression Network Topology Provides a Framework for Molecular Characterization of Cellular State. Bioinformatics. 2004;20:2242–2250. [PubMed]
- Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. PNAS. 1998;95:14863–14868. [PMC free article] [PubMed]
- Qin J, Lewis D, Noble W. Kernel Hierarchical Gene Clustering from Microarray Expression Data. Bioinformatics. 2003;19:2097–2104. [PubMed]
- Gat-Vilks I, Sharan R, Shamir R. Scoring Clustering Solutions by their Biological Relevance. Bioinformatics. 2003;19:2381–2389. [PubMed]
- Dudoit S, Fridlyand J, Speed T. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002;97:77–87.
- Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown P, Herskowitz I. The Transcriptional Program of Sporulation in Budding Yeast. Science. 1998;282:699–705. [PubMed]
- Bar-Joseph Z, Demaine E, Gifford D, Srebro N, Hamel A, Jaakkola T.
*K*-ary Clustering with Optimal Leaf Ordering for Gene Expression Data. Bioinformatics. 2003;19:1070–1078. [PubMed] - Jiang D, Tang C, Zhang A. Cluster Analysis for Gene Expression Data: A Survey. IEEE Transactions on Knowledge and Data Engineering. 2004;16:1370–1386.
- Datta S, Datta S. Comparison and Validation of Statistical Clustering Techniques for Microarray Gene Expression Data. Bioinformatics. 2003;19:459–466. [PubMed]
- Yeung K, Medvedovic M, Bumgarner R. From Co-Expression to Co-Regulation: How Many Microarray Experiments Do We Need? Genome Biology. 2004;5:R48. [PMC free article] [PubMed]
- Tukey J. Data ananlysis, computation, and mathematics. Quarterly of Applied Mathematics. 1972;30:51–65.
- Rousseeuw P, Leroy A. Robust Regression and Outlier Detection. John Wiley; 1987.
- Huber P. Robust Statistics. John Wiley; 1981.
- Hoaglin DC, Mosteller F, Tukey JW, Eds . Understanding robust and exploratory data analysis. Wiley Classics Library, Wiley-Interscience, New York; 2000. [Revised and updated reprint of the 1983 original].
- Mosteller F, Tukey J. Data Analysis and Regression: a second course in statistics. Addison Wesley; 1977.
- Wilcox R. Introduction to Robust Estimation and Hypothesis Testing. Elsevier Academic Press; 2005.
- Golub T, et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science. 1999;286:531–537. [PubMed]
- Yang Y, Dudoit S, Luu P, Lin D, Peng V, Ngai J, Speed T. Normalization for cDNA Microarray Data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research. 2002;30:e15. [PMC free article] [PubMed]
- Toxicogenomics Research Consortium Standardizing global gene expression analysis between laboratories and across platforms. Nature Methods. 2005;2:351–356. [PubMed]
- Wang X, He X, Band M, Wilson C, Liu L. A study of inter-lab and inter-platform agreement of DNA microarray data. BMC Genomics. 2005;6:#71. [PMC free article] [PubMed]
- Dodd L, Korn E, McShane L, Chandramouli G, Chuang E. Correcting log ratios for signal saturation in cDNA microarrays. Bioinformatics. 2004;20:2685–2693. [PubMed]
- Wang X, Istepanian R, Song Y. Microarray image enhancement by denoising using stationary wavelet transform. IEEE Transactions on Nanobioscience. 2003;2:184–189. [PubMed]
- Glasbey C, Ghazal P. Combinatorial image analysis of DNA microarray features. Bioinformatics. 2003;19:194–203. [PubMed]
- Brown C, Goodwin P, Sorger P. Image metrics in the statistical analysis of DNA microarray data. PNAS. 2001;98:8944–8949. [PMC free article] [PubMed]
- Schadt E, Li C, Ellis B, Wong W. Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. Journal of Cellular Biochemistry Supplement. 2001;37:120–125. [PubMed]
- Yang Y, Buckley M, Dudoit S, Speed T. Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics. 2002;11:108–136.
- Marshall E. Getting the Noise Out of Gene Arrays. Science. 2004;306:630–631. [PubMed]
- Ioannidis J. Micorarrays and Molecular Research: Noise Discovery? Lancet. 2005;365:454–455. [PubMed]
- Heyer L, Kruglyak S, Yooseph S. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Research. 1999;9:1106–1115. [PMC free article] [PubMed]
- Kafadar K. John Tukey and Robustness. Statistical Science. 2003;18:319–331.
- Hubbel E, Liu W, Mei R. Robust estimators for expression analysis. Bioinformatics. 2002;18:1585–1592. [PubMed]
- Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Fitcher B. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast
*Saccharomyces cervisiae*by Microarray Hybridization. Molecular Biology of the Cell. 1998;9:3273–3297. [PMC free article] [PubMed] - Kafadar K. The Efficiency of the Biweight as a Robust Estimator of Location. Journal of Research of the National Bureau of Standards. 1983;88:105–116.
- Rousseeuw P. Least Median of Squares Regression. Journal of the American Statistical Association. 1984;79:871–880.
- Rocke DM. Robustness properties of s-estimators of multivariate location and shape in high dimension. The Annals of Statistics. 1996;24:1327–1345.
- Lopuhaä H, Rousseeuw P. Breakdown of Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices. The Annals of Statistics. 1991;19:229–248.
- Rocke D, Woodruff D. Computation of Robust Estimates of Multivariate Location and Shape. Statistica Neerlandica. 1993;47:27–42.
- Lopuhaa HP. On the relation between s-estimators and m-estimators of multivariate location and covariance. The Annals of Statistics. 1989;17:1662–1683.
- Beaton A, Tukey J. The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data. Technometrics. 1974;16:147–185.

**BioMed Central**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (395K)

- Novel implementation of conditional co-regulation by graph theory to derive co-expressed genes from microarray data.[BMC Bioinformatics. 2008]
*Rawat A, Seifert GJ, Deng Y.**BMC Bioinformatics. 2008 Aug 12; 9 Suppl 9:S7. Epub 2008 Aug 12.* - Clustering of gene expression data: performance and similarity analysis.[BMC Bioinformatics. 2006]
*Yin L, Huang CH, Ni J.**BMC Bioinformatics. 2006 Dec 12; 7 Suppl 4:S19. Epub 2006 Dec 12.* - Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory.[BMC Bioinformatics. 2007]
*Luo F, Yang Y, Zhong J, Gao H, Khan L, Thompson DK, Zhou J.**BMC Bioinformatics. 2007 Aug 14; 8:299. Epub 2007 Aug 14.* - Comparing algorithms for clustering of expression data: how to assess gene clusters.[Methods Mol Biol. 2009]
*Yona G, Dirks W, Rahman S.**Methods Mol Biol. 2009; 541:479-509.* - Manipulating large-scale Arabidopsis microarray expression data: identifying dominant expression patterns and biological process enrichment.[Methods Mol Biol. 2009]
*Orlando DA, Brady SM, Koch JD, Dinneny JR, Benfey PN.**Methods Mol Biol. 2009; 553:57-77.*

- CorSig: A General Framework for Estimating Statistical Significance of Correlation and Its Application to Gene Co-Expression Analysis[PLoS ONE. ]
*Wang HQ, Tsai CJ.**PLoS ONE. 8(10)e77429* - Differential network analysis for the identification of condition-specific pathway activity and regulation[Bioinformatics. 2013]
*Gambardella G, Moretti MN, de Cegli R, Cardone L, Peron A, di Bernardo D.**Bioinformatics. 2013 Jul 15; 29(14)1776-1785* - Comparison of co-expression measures: mutual information, correlation, and model based indices[BMC Bioinformatics. ]
*Song L, Langfelder P, Horvath S.**BMC Bioinformatics. 13328* - Evaluation of Gene Association Methods for Coexpression Network Construction and Biological Knowledge Discovery[PLoS ONE. ]
*Kumari S, Nie J, Chen HS, Ma H, Stewart R, Li X, Lu MZ, Taylor WM, Wei H.**PLoS ONE. 7(11)e50411* - Fast R Functions for Robust Correlations and Hierarchical Clustering[Journal of statistical software. ]
*Langfelder P, Horvath S.**Journal of statistical software. 46(11)i11*

- PubMedPubMedPubMed citations for these articles
- TaxonomyTaxonomyRelated taxonomy entry
- Taxonomy TreeTaxonomy Tree