Comparison of statistical data models for identifying differentially expressed genes using a generalized likelihood ratio test.

Currently, statistical techniques for analysis of microarray-generated data sets have deficiencies due to limited understanding of errors inherent in the data. A generalized likelihood ratio (GLR) test based on an error model has been recently proposed to identify differentially expressed genes from microarray experiments. However, the use of different error structures under the GLR test has not been evaluated, nor has this method been compared to commonly used statistical tests such as the parametric t-test. The concomitant effects of varying data signal-to-noise ratio and replication number on the performance of statistical tests also remain largely unexplored. In this study, we compared the effects of different underlying statistical error structures on the GLR test's power in identifying differentially expressed genes in microarray data. We evaluated these variants of the GLR test as well as the one-sample t-test on simulated data by means of receiver operating characteristic (ROC) curves. Further, we used bootstrapping of ROC curves to assess statistical significance of differences between the areas under the curves. Our results showed that i) the GLR tests outperformed the t-test for detecting differential gene expression, ii) the identity of the underlying error structure was important in determining the GLR tests' performance, and iii) signal-to-noise ratio was a more important contributor than sample replication in identifying statistically significant differential gene expression.


Introduction
The development of microarray technology has been phenomenal in the past decade, with transcriptional profiling now a standard tool in many genomics research laboratories (Ginsberg and Mirnics, 2006; Rosa et al. 2006). The rapid development and acceptance of this method can be attributed to the fact that microarrays permit the simultaneous measurement of thousands of gene expressions on a single platform instead of analyzing them on a gene-by-gene basis. One major application of this technology is the identification of genes that are differentially expressed across various experimental conditions. Such differentially expressed genes may be implicated in biological pathways of interest and can help our understanding of disease mechanisms and treatment strategies (Phan et al. 2006; Cowell and Hawthorn, 2007).
A popular method to detect differences in gene expression has been the use of fold-change cutoffs (Chattopadhyay et al. 2007; Shimada et al. 2007). This approach seeks genes whose expression intensities change, for example, by a factor of two or more between control and treatment samples. However, the fixed threshold cutoff method is not based on specific data modeling assumptions and is statistically inefficient because it cannot account for the numerous systemic and biological variations inherent in a microarray experiment (Jaluria et al. 2007). Another commonly used method is the traditional parametric t-test. The performance of the t-test depends on the sample size, and whether the expression intensities can be assumed to be normally distributed (Riva et al. 2005).
To address the need for a better statistical framework for microarray data analysis, various investigators have proposed the quantification of measurement errors associated with gene expression intensities (Ideker et al. 2000; Tu et al. 2002). In particular, parameters of a statistical data model, which account for potential error sources, can be estimated using the maximum-likelihood estimation (MLE) method. A generalized likelihood ratio (GLR) test can then be applied to identify genes whose expression levels are statistically different. A crucial step in the GLR test lies in the selection of the underlying error structure summarizing the influence of multiple sources of variation in microarray studies. Several models have been proposed for measurement errors in microarray data (Ideker et al. 2000; Rocke and Durbin, 2001; Tu et al. 2002). All of these models account for the observation that the variance of expression data of a gene increases with its mean. Ideker et al. have shown that a model that reflects two types of error, one additive and one multiplicative, can adequately model microarray data at varying intensity levels (Ideker et al. 2000). The multiplicative error term accounts for intra-array variance that influences individual study parameter estimates. The additive error component captures the influence of inter-array variations during replicate experiments. This GLR model has been applied to non-logarithm-transformed intensity levels from cDNA microarray experiments.
Although various other implementations of error structures under the GLR test have been presented, no systematic comparative studies of their performance have been published. It has been reported that logarithm transformation can improve the normality of expression levels and help equalize variance, since raw intensities follow a lognormal distribution (Baldi and Long, 2001; Quackenbush, 2002). To the best of our knowledge, no statistical data model expressed in the log scale has been implemented in the GLR test. Furthermore, it is unclear how the GLR method compares to a traditional statistical test such as the parametric t-test in detecting differential gene expression. The primary aim of this paper was to assess the performance of several variants of the GLR test (with error models fitting raw and log-transformed expression intensities) and the one-sample t-test using simulated data derived from actual cDNA microarray experiments. Our null hypothesis for the implemented tests was that each gene was not differentially expressed under control and experimental conditions. When identifying differentially expressed genes based on the expression levels of thousands of genes, however, statistical difficulties can arise due to massive multiple hypothesis testing, where the finite significance of each test produces many false positives overall (Norris and Kahn, 2006). Though the issue of multiple testing is crucial and warrants reliable correction techniques, we have restricted the scope of our present work to comparing tests for assessing marginal (i.e. per gene) significance without multiplicity adjustment. We compared the power of these statistical tests by receiver operating characteristic (ROC) curves (Hanley and McNeil, 1982). Specifically, we determined the ROC summary index (area under the ROC curve) and its confidence interval using the bootstrap technique (Efron and Gong, 1983; Efron and Tibshirani, 1986).
A secondary objective of the present study was to investigate the relative importance of signal-to-noise ratio and replication in the determination of differential gene expression.

Experimental study
Gene expression data were generated from lung tissues of mice exposed to chronic hypoxia. During chronic hypoxia, mice develop pulmonary hypertension and pulmonary vascular remodeling (Gharib et al. 2005). However, this in vivo perturbation leads to only a modest degree of differential gene expression, with relatively few genes changing by more than two-fold (Gharib et al. 2005). In contrast, a typical yeast experiment such as response to galactose stimulation results in a much more profound differential gene expression profile (Ideker et al. 2000).
Four 8-week-old male Balb/C mice (The Jackson Laboratory, Bar Harbor, ME) were exposed to 21 days of hypobaric hypoxia (0.5 atm). Four control mice were housed at sea level for the duration of the experiment (normoxia). On day 21, all mice were sacrificed, whole lungs removed, total RNA isolated and hybridized to cDNA microarrays consisting of 5313 murine genes and expressed sequence tags. The RNA from the four control mice was pooled and used as reference for all microarray experiments. Four microarray replications with dye swapping were performed for three of the hypoxic mice; the fourth hypoxic mouse was studied in triplicate for a total of 15 labeling experiments. By performing replications for each animal and using multiple animals, our experiment design captured both the biological and technical noise in differential gene expression during hypoxic exposure.

Data simulation
The first step in our data simulation strategy was to construct a realistic statistical model based on the experimental microarray data, and then to use this model to generate artificial gene expression values with varying statistical characteristics. Note that our statistical model described data from a cDNA microarray experiment in which control and treatment samples are hybridized on the same array. Based on exploratory data analysis of our 15 replicate microarray experiments, we observed that the variability of each gene's intensity under hypoxic or control conditions in the log scale was approximately Gaussian. In addition, we also noted that two intensity measurements (under hypoxia and normoxia) of the same spot were highly correlated. These observations were in agreement with published findings (Ideker et al. 2000; Baldi and Long, 2001; Rocke and Durbin, 2001; Tu et al. 2002).
We simulated 2000 genes in this study: 1000 genes were defined as differentially expressed (diffgenes) and 1000 genes as unchanged (nullgenes) during hypoxia relative to the unperturbed normoxic condition. Each diffgene had a unique expected value under hypoxic and normoxic conditions, in addition to a noise-induced error per replication. By definition, each nullgene had the same expected mean intensity during hypoxia and normoxia but different measurable values from one replication to another due to experimental variability. We used the following model for simulating the observed paired intensities, (x, y), of background-subtracted and normalized diffgenes and nullgenes during replicate experiments:

x_ik = µ_h^diffgene,i + ε_h,ik,    y_ik = µ_n^diffgene,i + ε_n,ik    (1a)
x_jk = µ_h^nullgene,j + ε_h,jk,    y_jk = µ_n^nullgene,j + ε_n,jk    (1b)

for diffgenes i = 1,2,…,N, nullgenes j = 1,2,…,M, and replications k = 1,2,…,r. Subscript h refers to hypoxia and n to normoxia; µ_h^diffgene (or µ_h^nullgene) is the expected "true" intensity under hypoxia and µ_n^diffgene (or µ_n^nullgene) is the expected "true" intensity during normoxia (in the log scale). Recall that µ_h^diffgene ≠ µ_n^diffgene, whereas µ_h^nullgene = µ_n^nullgene. For each gene, the random error terms during hypoxia and normoxia, ε_h and ε_n, were chosen from an independent bivariate normal distribution with mean zero, variances under hypoxia (ω_h) and normoxia (ω_n), and correlation (ρ_h,n).
In order to create realistic data simulations, we chose the parameters for the above model from our set of 15 replicate microarray experiments. We randomly selected 1000 genes from this data set and assigned their mean expression values during hypoxia and normoxia as our 1000 simulated µ_h^diffgene and µ_n^diffgene (1a). In addition, we used the variance of each of these 1000 genes over the 15 replicate experiments as the variance of the error component of each assigned diffgene (the same procedure applied for each nullgene). For the nullgenes, we chose a different set of 1000 genes from the original data set and assigned, by definition, µ_h^nullgene = µ_n^nullgene (1b). The correlation term for the bivariate normal distribution for each diffgene and nullgene was obtained directly from the expression measurements in these experiments. Note that there was a range of differential gene expression values among the diffgenes since each gene had a different µ (during hypoxia and normoxia) and each replicate had a different ε. Similarly, although each nullgene was defined to have equal mean intensity during hypoxia and normoxia, each replication resulted in a different expression value because of the error term ε. Gene expression values were then derived by randomly drawing error terms from the bivariate Gaussian distribution unique to each of the 2000 genes. Simulated gene expression replicates based on this approach should capture both the biological variability and the technical "noise" of the original microarray experiments. Furthermore, since these simulated log-transformed intensities were normally distributed, parametric approaches such as the parametric t-test and GLR test could be applied to identify differentially expressed genes.
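The replicate-drawing step above can be sketched in a few lines of Python. This is an illustrative sketch, not the study's code; the parameter values in the usage line are invented, whereas in the actual study the means, variances and correlations were estimated from the 15 replicate experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_gene(mu_h, mu_n, omega_h, omega_n, rho, r):
    """Draw r replicate (x, y) log-intensity pairs for one gene.

    The error pair (eps_h, eps_n) comes from a bivariate normal with mean
    zero, variances omega_h and omega_n, and correlation rho, as in the
    simulation model (1a)/(1b).
    """
    c = rho * np.sqrt(omega_h * omega_n)          # off-diagonal covariance
    cov = [[omega_h, c], [c, omega_n]]
    eps = rng.multivariate_normal([0.0, 0.0], cov, size=r)
    x = mu_h + eps[:, 0]                          # hypoxia replicates (log scale)
    y = mu_n + eps[:, 1]                          # normoxia replicates (log scale)
    return x, y

# A diffgene has mu_h != mu_n; a nullgene has mu_h == mu_n (values made up here).
x, y = simulate_gene(mu_h=9.0, mu_n=8.0, omega_h=0.25, omega_n=0.25, rho=0.8, r=4)
```

Repeating this draw per gene, with a distinct parameter set for each of the 2000 genes, reproduces the per-gene Gaussian structure described above.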
We denoted data sets with s = 1, 2 or 3 as data containing 'low', 'medium' or 'high' signal-to-noise ratios, respectively. We generated the 'low' signal-to-noise ratio (s = 1) data based on differential gene expression values in our experimental model of hypoxic pulmonary hypertension. As discussed above, we believe this perturbation resulted in only a modest level of differential gene expression. Therefore, we increased the difference between µ_h^diffgene and µ_n^diffgene for the 1000 diffgenes by two-fold to obtain 'medium' (s = 2) and three-fold to obtain 'high' (s = 3) signal-to-noise ratios. These data sets were aimed at modeling data generated during more profound perturbations such as ionizing radiation or yeast sporulation studies. Similarly, replications with r = 2, r = 3 or 4, and r = 6 or 8 denoted data sets containing 'low', 'medium' or 'high' numbers of replications per gene, respectively. In total, 15 sets of simulated data, one at each combination of signal-to-noise ratio (s = 1, 2 and 3) and replication number (r = 2, 3, 4, 6 and 8), were generated. These data sets therefore differed in terms of quality, with r2s1 (2 replications per gene and a signal-to-noise ratio of 1) having the lowest and r8s3 (8 replications per gene and a signal-to-noise ratio of 3) having the highest quality. This pool of 15 data sets constituted this study's original simulated sample. Because we knew a priori which genes were differentially expressed and which were not, altering signal-to-noise ratios and replications per gene allowed us to compare the robustness and performance of various statistical tests in detecting differentially expressed genes.

Statistical data models
We implemented the statistical data model proposed by Ideker and co-authors (Ideker et al. 2000). In addition, motivated by empirical observations, we also devised three additional error models that considered log-transformed expression intensities, termed GLR1, GLR2 and GLR3, for analysis using the GLR test. These three error structures have the following generic form:

log(x_ik) = µ_x,i + f(µ_x,i)ε_H^x + ε_A^x,    log(y_ik) = µ_y,i + f(µ_y,i)ε_H^y + ε_A^y    (2)

where (ε_A^x, ε_A^y) ~ BN(0, 0, σ_A^2, σ_A^2, ρ) (ρ denotes correlation). The correlation term models the observation that the log-transformed expression intensities (under control and treatment conditions) per spot consistently correlate over repeated measurements. ε_H and ε_A are assumed to be independent of each other. Additionally, f(µ) allows one to incorporate into the statistical error structure the common observation that the standard deviation is dependent on the magnitude of intensity (mean) (Ideker et al. 2000; Baldi and Long, 2001; Rocke and Durbin, 2001; Tu et al. 2002).
For the simplest case, we set f(µ_x,i) and f(µ_y,i) equal to 1. Under this model, the measured expression intensity is dependent on a linear combination of normal error components. Hence, GLR1 is:

log(x_ik) = µ_x,i + ε_H^x + ε_A^x,    log(y_ik) = µ_y,i + ε_H^y + ε_A^y    (3)

Based on (2), GLR2 was constructed by equating f(µ_x,i) and f(µ_y,i) to µ_x,i and µ_y,i respectively, so that the variance could be approximately proportional to the mean intensity (Ideker et al. 2000; Tu et al. 2002):

log(x_ik) = µ_x,i + µ_x,i ε_H^x + ε_A^x,    log(y_ik) = µ_y,i + µ_y,i ε_H^y + ε_A^y    (4)
In GLR3, we explicitly modeled the observation that correlation between expression levels under control and treatment conditions decreases at lower intensities by weighting the multiplicative error term using the reciprocals of the mean intensity, as shown in (5):

log(x_ik) = µ_x,i + (1/µ_x,i)ε_H^x + ε_A^x,    log(y_ik) = µ_y,i + (1/µ_y,i)ε_H^y + ε_A^y    (5)

Describing raw intensity data, the error structure proposed by Ideker et al. (2000) is:

x_ik = µ_x,i + µ_x,i ε^x + δ^x,    y_ik = µ_y,i + µ_y,i ε^y + δ^y    (6)

Model (6) was implemented in a public-domain computer program, VERA and SAM (http://db.systemsbiology.net/software/VERAandSAM) (Ideker et al. 2000). Henceforth, we refer to this statistical error model as VS. Although GLR1-3 may appear similar to the data simulation model in (1), there were fundamental differences between them. Crucially, we used a different Gaussian distribution for each of the 2000 genes in our data simulation, while the GLR tests were applied to each gene to determine whether, under a single statistical data model encompassing all 2000 genes, the expression levels under control and experimental conditions were different. For example, although the GLR1 structure (3) may appear very similar to the data simulation model (1), a different Gaussian distribution was sampled for each of the simulated genes, whereas the GLR test was performed for each gene to assess whether applying one distribution to the entire set of 2000 genes can discriminate expression levels between normoxic and hypoxic states. Furthermore, since the parametric t-test was applied to each gene separately without the requirement of a constant variance or a constant coefficient of variation across genes, we expected the t-test to be well suited to detecting differential gene expression.
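The role of the weighting function f(µ) in distinguishing GLR1-3 can be illustrated with a small sketch. This is our schematic reading of the models, not the authors' implementation; σ_H and σ_A are the standard deviations of the two error components, with names chosen here for illustration.

```python
import numpy as np

# Weight functions f(mu) for the intensity-dependent error component.
# A schematic reading of the three log-scale error structures, not the
# authors' code.
F = {
    "GLR1": lambda mu: np.ones_like(mu),  # f(mu) = 1: constant error s.d.
    "GLR2": lambda mu: mu,                # f(mu) = mu: s.d. grows with intensity
    "GLR3": lambda mu: 1.0 / mu,          # f(mu) = 1/mu: term shrinks at high intensity
}

def model_sd(mu, sigma_H, sigma_A, variant):
    """Standard deviation of a log intensity implied by the generic form
    log(x) = mu + f(mu)*eps_H + eps_A, with independent eps_H ~ N(0, sigma_H^2)
    and eps_A ~ N(0, sigma_A^2)."""
    return np.sqrt((F[variant](mu) * sigma_H) ** 2 + sigma_A ** 2)
```

Evaluating `model_sd` over a grid of mean intensities shows how each variant ties measurement spread to signal magnitude.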

GLR method
Details of the GLR test can be found in Ideker et al. (2000). For illustrative purposes, we will briefly discuss the GLR test for VS as defined by (6). This statistical error model is dependent on five gene-independent parameters, β = (σ_ε^x, σ_ε^y, σ_δ^x, σ_δ^y, ρ), as well as a mean pair per gene, µ = [(µ_x,1, µ_y,1), …, (µ_x,N, µ_y,N)]. The GLR method is essentially composed of two consecutive steps: parameter estimation followed by a likelihood ratio test. In the VERA (Variability and ERror Assessment) program, β and µ are first estimated from the expression levels of all genes via MLE. MLE is the procedure of finding the value of one or more parameters for a given statistic that makes the known likelihood distribution a maximum (Kendall and Stuart, 1979). The likelihood functions, for gene i and over all genes, are defined respectively as:

L_i(β, µ_x,i, µ_y,i) = Π_k p(x_ik, y_ik | β, µ_x,i, µ_y,i),    L(β, µ) = Π_i L_i(β, µ_x,i, µ_y,i)    (7)

The MLE parameter values that maximize L, denoted by β̂ and µ̂, represent the estimates for the true parameters of the error model. In this study, we modified the original VERA program to incorporate the new statistical data models, namely GLR1, GLR2 and GLR3. All stages of the optimization were performed in C code (<10 min on a 1.4 GHz Pentium 4 computer with 768 MB DDRAM memory for a data set containing 2000 genes per microarray and 4 replications per gene). All parameter values converged after 100-200 iterations.
To determine whether, under the pre-specified error model, individual genes are significantly differentially expressed (i.e. µ_x,i ≠ µ_y,i) between the two cell populations, a likelihood ratio test is subsequently performed, which assumes a univariate normal distribution for the expression data of a gene that is not differentially expressed (null hypothesis) and a mixture of two univariate normal distributions (with different means) for the expression data of a gene that is differentially expressed under control and treatment conditions (alternative hypothesis). To this end, the GLR test statistic λ_i is computed using SAM (Significance of Array Measurement) (Kendall and Stuart, 1979):

λ_i = -2 ln [ max_{µ_x,i = µ_y,i} L_i / max_{µ_x,i, µ_y,i} L_i ]    (8)

Two maximizations are performed in (8): in the numerator, the constraint µ_x = µ_y = µ is imposed (null hypothesis), whereas in the denominator, the optimization is unconstrained, i.e. µ_x ≠ µ_y is allowed (alternative hypothesis). In the case where µ_x,i = µ_y,i, λ_i follows a χ2 distribution with one degree of freedom. To select differentially expressed genes, a critical cutoff value λ_c must be determined beforehand based on the results of a set of control-to-control microarray experiments. However, since our data simulation provides us a priori information on which genes are differentially expressed and which ones are unchanged, we can circumvent the control-to-control experiments and use ROC curves to compare the performance of variants of the GLR test and the t-test.
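The two maximizations in the likelihood ratio can be sketched for a deliberately simplified case: Gaussian likelihoods with a single known, constant error standard deviation sigma. The actual test instead plugs in the jointly estimated error parameters β̂, so this is a minimal sketch of the statistic's logic, not the VERA/SAM computation.

```python
import math
import numpy as np

def glr_statistic(x, y, sigma):
    """GLR statistic lambda_i = -2 ln(max_{mu_x=mu_y} L_i / max L_i) for one
    gene, sketched with Gaussian likelihoods and a known, constant error
    s.d. sigma (the paper instead estimates all error parameters by MLE)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mu0 = np.concatenate([x, y]).mean()                     # constrained MLE (H0)
    rss0 = ((x - mu0) ** 2).sum() + ((y - mu0) ** 2).sum()  # null residuals
    rss1 = ((x - x.mean()) ** 2).sum() + ((y - y.mean()) ** 2).sum()
    return (rss0 - rss1) / sigma ** 2

def chi2_1df_pvalue(lam):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(lam / 2.0))
```

Under the null hypothesis the statistic is referred to the chi-square distribution with one degree of freedom, matching (8); genes with λ_i above the critical cutoff λ_c would be called differentially expressed.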

Parametric t-test
The t-test statistic and its variants are commonly used to detect differential gene expression because genes deemed as significantly expressed have maximal difference in mean expression values between two groups and minimal variation of expression within each group. In this study, we compared the log ratios of spot intensities between control and treatment groups using the parametric one-sample t-test (Skokal, 1995).
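The per-gene computation just described can be sketched as follows; the helper name is ours, and in practice a library routine would serve equally well.

```python
import math

def one_sample_t(log_ratios):
    """One-sample t statistic testing whether the mean log ratio
    log(x) - log(y) of a gene differs from zero (df = n - 1)."""
    n = len(log_ratios)
    mean = sum(log_ratios) / n
    var = sum((v - mean) ** 2 for v in log_ratios) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

Because the variance is recomputed for every gene, no assumption of a shared variance across genes is needed, which is the property contrasted with the GLR models above.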
Performance of statistical tests on simulated microarray data
ROC curve analysis was used for performance evaluation. The expression status of each gene (i.e. whether a gene was classified as a diffgene or a nullgene) was known and served as the reference or 'gold standard'. The area under the ROC curve (AROC) was used as a unidimensional summary of the ability of a statistical test to discriminate between diffgenes and nullgenes, ranging from 0.5 for a test with no diagnostic capability to 1 for a test with perfect separation of the two groups. The ROC curve was derived by: 1) ranking values of the test metric (i.e. t-statistic or λ) of each gene; 2) identifying the correctly and incorrectly assigned diffgenes and nullgenes by varying a threshold; and 3) computing the full range of true-positive fraction (sensitivity) values and false-positive fraction (1 - specificity) values. ROC curves thus provide a more complete measure of a test's discriminatory ability than the choice of an arbitrary and isolated (sensitivity, specificity) point. Once the ROC curve was obtained, the area underneath it was computed to estimate AROC.
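The three-step ROC construction can be sketched as a short routine (an illustrative implementation using trapezoidal integration; the function name is ours):

```python
import numpy as np

def roc_auc(scores, is_diffgene):
    """Area under the ROC curve: rank the test metric (t statistic or
    lambda), sweep a threshold from high to low, and accumulate the
    true-positive and false-positive fractions."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # highest score first
    labels = np.asarray(is_diffgene, dtype=float)[order]
    tpf = np.concatenate([[0.0], np.cumsum(labels) / labels.sum()])
    fpf = np.concatenate([[0.0], np.cumsum(1.0 - labels) / (1.0 - labels).sum()])
    # trapezoidal integration of sensitivity over 1 - specificity
    return float(np.sum((fpf[1:] - fpf[:-1]) * (tpf[1:] + tpf[:-1]) / 2.0))
```

A perfectly separating metric yields an area of 1, while a metric unrelated to expression status hovers near 0.5, matching the interpretation given above.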
Furthermore, a parametric bootstrap procedure was used to quantify the precision of the AROC estimates and to approximate the probability of rejecting the null hypothesis when comparing the AROC's of two different statistical tests. Bootstrapping provides a way of making probability-based inferences about a population characteristic (e.g. AROC) without strict distributional assumptions (Efron and Gong, 1983; Efron and Tibshirani, 1986). In short, one thousand samples of diffgenes and nullgenes of size 2000 each were generated by sampling with replacement from the original set of 2000 genes in each of the 15 original samples. The ROC curve and its associated AROC for each re-sample were then determined. The 95% confidence interval for AROC under each data condition and statistical test was derived from these bootstrap re-samples by finding the minimum interval [I_1, I_2] for which the probability P(I_1 ≤ AROC ≤ I_2) = 0.95. To compare the performance between two statistical methods, we computed a z-statistic using the bootstrap to generate the variance of the AROC estimates (Margolis et al. 2002):

z = (A_1 - A_2) / sqrt(SE(A_1)^2 + SE(A_2)^2 - 2 Cov(A_1, A_2))    (9)

where A_1 and A_2 refer to the AROC's from the two statistical tests to be compared, SE is the standard error, which was estimated using the bootstrap, and Cov(A_1, A_2) represents the covariance of A_1 and A_2. Cov(A_1, A_2) was derived by taking the product of the correlation between the bootstrapped samples (corresponding ROC curves will be correlated since they are derived from the analysis of the same set of bootstrapped data) and the standard deviations. Under the null hypothesis of no difference between two statistical tests, the z-statistic is approximately normally distributed (Margolis et al. 2002). In the present study, we chose a critical cutoff value of 1.96 to indicate that there was an approximately 5% chance (i.e.
about 50 out of 1000 bootstrap re-samples) that any AROC difference could be attributed to pure chance and that the two ROC curves under comparison were in fact identical. Additionally, to determine if the predictive value of a statistical test was better than random chance, individual AROC was compared to 0.5 (corresponding to a coin toss) using a one-sided test at an α level of 5% (Delong et al. 1988).
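The paired z comparison just described can be sketched as follows, assuming each test's AROC has been recomputed on the same set of bootstrap re-samples (function name and input data are illustrative):

```python
import numpy as np

def bootstrap_z(aroc1, aroc2):
    """z statistic comparing two AROC's estimated on the same bootstrap
    re-samples: z = (A1 - A2) / sqrt(SE1^2 + SE2^2 - 2*Cov(A1, A2)),
    with Cov taken as correlation times the two standard deviations."""
    a1, a2 = np.asarray(aroc1, float), np.asarray(aroc2, float)
    se1, se2 = a1.std(ddof=1), a2.std(ddof=1)
    cov = np.corrcoef(a1, a2)[0, 1] * se1 * se2     # correlation x s.d.'s
    return float((a1.mean() - a2.mean()) / np.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * cov))
```

A |z| exceeding 1.96 corresponds to the approximately 5% chance level used in the comparisons above.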

Overall performance of statistical tests on the original simulated data
Representative ROC curves generated from our analysis are shown in Figure 1A. These curves were based on the GLR3 model using the r2s1, r2s3, r4s1, r4s3, r8s1 and r8s3 original simulated data sets. The overall patterns of the ROC curves with respect to varying signal-to-noise ratio and number of replications per gene were similar among the statistical tests. As expected, the predictive power of the GLR tests and t-test increased with improvements in the sample size and/or signal-to-noise ratio of the data sets. This indicates that the GLR tests and the t-test predicted more diffgenes correctly as the overall quality of microarray data improved. Also note that the curves corresponding to r4s3 (and even r2s3) were above those for r8s1, suggesting that stronger signals could lead to a higher test power than larger sample repeats. Individual AROC values ranged from 0.549 to 0.93, and a one-sided test at an α level of 5% confirmed that the tests' performance was always superior to random classification.

Bootstrap comparison between GLR tests and t-test
The mean (±S.D.) AROC's calculated from the 1000 bootstrap replications for each of the five statistical tests and for each data condition are presented in Figure 2. The corresponding 95% confidence intervals of the AROC's are shown in Table 1. Table 2 shows the number of times out of 1000 that a statistical difference was detected between statistical tests for various data conditions using the bootstrap method and a z-statistic. For two statistical tests to be significantly different at the P < 0.05 level, it is expected that the null hypothesis would be rejected in at least 950 of the comparisons. These tables and figure reveal that, overall, the best discriminator was obtained by the GLR test employing GLR3 as its underlying statistical model, which attained significance with respect to all other tests in virtually every condition studied. There were significant differences between the GLR test employing the VS model and every other statistical test, indicating that the former was the least discriminatory. Of note, the t-test underperformed relative to the GLR tests when data quality was poor (low signal-to-noise or replication), but improved when the number of gene replicates and signal-to-noise were 'high' (r = 8, s = 3). In Figure 1B, using the r2s1 data set, we explored an important aspect of the ROC curve to further emphasize differences among the tests: the trade-off between sensitivity and specificity. Here, we examined the sensitivity of the statistical tests at a high range of specificity since this situation is often of interest to biologists. Using ROC curves determined from the rest of the data sets, which we have not displayed for the sake of clarity, we concluded that the GLR3 error structure consistently achieved the highest level of sensitivity in the high-specificity range.

Effect of signal-to-noise ratio and replication number on test power
The overall effects of signal-to-noise ratio and number of replications per gene on the statistical tests' power using the bootstrap technique are presented in Figures 3-5. Improvement in the signal-to-noise ratio and an increase in the number of replications both led to enhanced power (Fig. 3). However, while increased replication improved test performance, this effect diminished as the signal-to-noise ratio of the data set improved. This observation is highlighted in Figure 4, where differences in AROC's between 'high'-signal-quality data sets (signal-to-noise ratio = 3) with replications ranging from 3 to 8 are negligible. The overall accuracy of the GLR tests and t-test improved by 34.4% when moving from data sets with a signal-to-noise ratio of 1 to those with a ratio of 3 (mean AROC of 0.614 and 0.821, respectively). In contrast, after increasing the number of replications by 4-fold (from 2 to 8 repeats per gene), we observed a lower increase of 11.9% (67% and 76% accuracy for 2 and 8 replicates, respectively) in the average discriminating power (Fig. 5). Cumulatively, these findings indicate that signal-to-noise ratio was a more important contributor than sample replication to the performance of the statistical tests in identifying differentially expressed genes.

Log-transformed GLR methods are superior to the parametric t-test
We compared the power of the t-test with that of the GLR test since the t-statistic has been commonly used to determine differentially expressed genes in microarray studies (Baldi and Long, 2001; Riva et al. 2005). Furthermore, the one-sample t-test performs a gene-by-gene analysis, computing the sample variance for each gene in the analysis. Hence, this approach does not require a constant variance or a constant coefficient of variation across genes. This is consistent with our data simulation model, where the expression values of each gene were drawn from a unique bivariate Gaussian distribution. As a consequence, we expected the t-test to perform optimally on our simulated data sets. The GLR method, on the other hand, hinges on the use of a single error structure to quantify the variability present across all 2000 genes (prior to gene-specific likelihood ratio tests) and therefore dramatically simplifies the complexity of the simulated gene expression data. Yet, our results indicate that the discriminatory ability of the GLR test was still equivalent or superior to that of the t-test for data sets containing 'low' and 'medium' gene replication numbers. It follows that the GLR test's assumption of a single statistical data model to quantitate the error distribution and structure of pooled gene expression data does not necessarily compromise its performance at identifying differentially expressed genes. While there was an improvement in the t-test's performance on data sets containing 'high' gene replications and signal-to-noise ratios, we showed that the GLR test (employing GLR3) could still attain a higher power than the t-test. A fundamental problem with employing tests dependent on Gaussian likelihoods in microarray studies (under the Gaussian model, the t-test is a generalized likelihood test) is that replications are often few due to costs or availability of biological material. Though such shortcomings can be addressed by increasing sample size, the potential gain in the statistical power of the GLR and t-tests may be markedly offset by increases in cost and effort (Baldi and Long, 2001). This paper demonstrates that the GLR test represents a better framework for achieving a higher predictive power at detecting gene expression differences, especially at low numbers of replicates per gene.

[Figure: Mean (±S.D.) AROC's out of 1000 bootstraps, sorted by signal-to-noise ratio of data sets. For the largest number of gene replications considered in this study, the improvement in the tests' performance from data sets with 'medium' signal-to-noise ratio to those with 'high' signal-to-noise ratio is reduced. For clarity, the bootstrapped results from 3 and 6 replications per gene are omitted.]

Error structure affects predictive power of GLR test
Overall, our results suggest that GLR3 was the best-performing error structure compared with the GLR1, GLR2 and VS models. The GLR test using VS as its underlying error model consistently generated the lowest AROC's for all bootstrap re-samples. We attribute this to the fact that the VS model's error structure considers raw, non-log-transformed gene expression levels. We contend that such an assumption can potentially limit this test since logarithm transformation can improve the normality of expression intensities (Baldi and Long, 2001; Quackenbush, 2002). Among the statistical data models that considered log-transformed expression levels, GLR1 was the least discriminatory, implying that weighting the multiplicative error term by intensity enables the GLR test to achieve a higher predictive power (recall that f(µ) = 1 in GLR1 whereas f(µ) = µ and 1/µ in GLR2 and GLR3, respectively). Our results showed that GLR3 was the most powerful statistical test. We believe that GLR3 performed better because its model structure accounts for three common traits associated with microarray expression data: (i) log-transformed microarray expression data are normally distributed (we transformed data to a logarithmic scale of base 2); (ii) correlation between log(x_i) and log(y_i) increases at higher spot intensities (we weighted one of the two error components with the reciprocal of the mean spot intensity of each gene); and (iii) for each gene i, log(x_ij) and log(y_ij) are correlated over repeated measurements j. This suggests the importance of adjusting for a decrease in correlation between expression values of control and treatment genes at low intensities.
Signal-to-noise ratio contributes more to test performance than replication number
A novel finding of this work was the greater influence of signal-to-noise ratio, compared to gene replication, on test performance. This implies that experiments that maximize the signal-to-noise ratio facilitate the identification of significant differences in gene expression to a larger extent than experiments with additional sample repeats. To the best of our knowledge, this study represents the first attempt to directly compare the influence of replication and signal-to-noise ratio on the performance of statistical analysis of microarray data, although several reports have commented on the significance and implications of each factor separately (Lee et al. 2000; Li et al. 2001; Wildsmith et al. 2001; Rosa et al. 2006). The greater influence of signal-to-noise ratio may have important implications in experiment design. Although in practice it may be difficult to modulate signal-to-noise, there are several avenues for improvement. The quality of RNA is crucial for successful microarray hybridization, and this can be confirmed using microfluidics-based platforms such as the Agilent Bioanalyzer. Several techniques can be used to minimize technical noise, and microarrays from the same print batch can be used to reduce microarray-to-microarray variability. Importantly, our results also imply that if a biological system does have a high signal-to-noise ratio, then only a few replications are necessary. This may substantially cut the cost of expression profiling experiments.
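The trade-off between signal-to-noise ratio and replication can be illustrated with a toy simulation: Gaussian log-ratios are scored with a one-sample t-like statistic and the score's discriminatory ability is summarized by the area under the ROC curve. The effect sizes, gene counts and scoring rule below are illustrative assumptions, not our actual simulation design:

```python
import numpy as np

rng = np.random.default_rng(1)

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels.astype(bool)
    n1, n0 = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def simulate_auc(snr, n_rep, n_genes=2000, frac_de=0.1):
    """AROC of a |mean/SE| score on simulated Gaussian log-ratios (toy model)."""
    de = rng.random(n_genes) < frac_de          # truly differential genes
    mu = np.where(de, snr, 0.0)                 # effect size in units of the noise SD
    data = rng.normal(mu[:, None], 1.0, size=(n_genes, n_rep))
    score = np.abs(data.mean(1)) / (data.std(1, ddof=1) / np.sqrt(n_rep))
    return auc(score, de)

for snr in (0.5, 1.0, 3.0):
    print(f"SNR={snr}:", [round(simulate_auc(snr, r), 3) for r in (3, 6, 12)])
```

In this toy setting, raising the signal-to-noise ratio moves the AROC toward 1 far faster than adding replicates at a fixed ratio, in line with the pattern reported above.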

How many replications per gene?
It has been previously reported that performing microarray studies without sufficient replicates can lead to poor sensitivity and reliability (Lee et al. 2000; Wildsmith et al. 2001). However, the question 'how many repeats are enough?' is shaped by many potentially confounding factors such as the type of array equipment, laboratory technique, and, most importantly, the quality of the samples. Our study provides a framework for deciding on the replication number per gene based on the overall signal-to-noise ratio of microarray data. More specifically, we recommend repeating array experiments 6-8 times for data with 'poor' and 'medium' signal-to-noise ratios, since the discriminatory ability of our tests levels off for sample sizes of 6 or greater. In other words, data containing 6-8 replications per gene represent the best compromise between inferior signal quality and false positive rates. On the other hand, for data sets with 'high' signals (signal-to-noise ratio = 3), we show that 3 replications have essentially the same effect as 4 or more. This implies potential savings in future microarray studies, since replications can be significantly reduced for such experiments. Taken together, our observations reinforce the notion that a successful microarray project depends on all steps of the process being accurately and consistently performed to maximize the reliability and significance of results. Here, we show that consideration of steps upstream of data processing, such as deciding on the proper number of microarray replications and minimizing technical/biological noise, may be necessary to ensure experimental and analytic accuracy.
An issue that bears consideration is that, in a practical setting, it may be difficult to assess the signal-to-noise ratio of actual microarray data. Signal-to-noise ratio depends on many factors, including quality of tissues, sample size relative to the number of variables to be classified, experimental variation, and inherently variable original signal intensities (Semenza, 2000; Norris and Kahn, 2006). In this light, our classification of signal-to-noise ratio levels may not be fully generalizable. To address this issue, we suggest independently quantifying the technical and biological sources of variation by, for instance, performing control-to-control experiments. Although this approach will require additional resources, it will provide valuable information on the signal-to-noise ratio in the biological system of interest, allowing the investigator to choose the optimal number of replications to compensate for inadequate signal characteristics based on our present findings. In turn, this can lead to potential savings in experimental cost and resources.
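As an illustration of such a control-to-control estimate, the sketch below derives a technical-noise standard deviation from simulated self-self log-ratios and converts a target fold-change into a rough signal-to-noise figure. All numeric values are assumed for illustration; in practice the log-ratios would come from scanned self-self hybridizations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy control-to-control (self-self) experiment: both channels measure the
# same sample, so any spread in the log2-ratios reflects technical noise alone.
log_ratios = rng.normal(0.0, 0.15, size=5000)   # assumed technical SD of 0.15

sigma_tech = log_ratios.std(ddof=1)             # estimated technical noise
effect = 0.45                                   # smallest log2 fold-change of interest
snr = effect / sigma_tech                       # rough signal-to-noise estimate
print(f"technical SD ~ {sigma_tech:.3f}, SNR for a 0.45 log2 effect ~ {snr:.2f}")
```

With an estimate of this kind in hand, the replication recommendations above (6-8 repeats for 'poor'/'medium' ratios, 3 for 'high') can be matched to the observed noise level of the platform.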
There are several limitations in this study. First, we were constrained by how realistically our data simulation models the entire variability of microarray experiments. Furthermore, our experiments and data simulations were performed using cDNA microarrays and therefore might not be applicable to single-dye oligonucleotide arrays such as the Affymetrix platform. We restricted our analyses to the GLR method and the parametric t-test. Additionally, the four statistical data models that were implemented under the GLR test do not represent the full spectrum of possible error structures. We also did not use partial AROC as a performance indicator to differentiate between statistical tests' power at low false positive rates (or the high-specificity range). Finally, the problem of multiple comparisons was not addressed in this work because we were primarily interested in comparing the inherent statistical power of the t-test and GLR methods.
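For readers who wish to pursue the partial-AROC idea, a minimal sketch of the computation is given below: the trapezoidal area is restricted to the low false-positive-rate region of a toy ROC curve. The curve itself is hypothetical:

```python
import numpy as np

def partial_auc(fpr, tpr, fpr_max=0.1):
    """Area under the ROC curve restricted to FPR in [0, fpr_max] (trapezoid rule)."""
    fpr, tpr = np.asarray(fpr, float), np.asarray(tpr, float)
    keep = fpr <= fpr_max
    # Close the region exactly at fpr_max by linear interpolation.
    x = np.append(fpr[keep], fpr_max)
    y = np.append(tpr[keep], np.interp(fpr_max, fpr, tpr))
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

# Toy ROC curve (FPR values must be sorted in ascending order):
fpr = [0.0, 0.05, 0.1, 0.5, 1.0]
tpr = [0.0, 0.4, 0.6, 0.9, 1.0]
print(partial_auc(fpr, tpr, fpr_max=0.1))   # area over the high-specificity region only
```

Comparing tests on this restricted area would emphasize performance in the regime most relevant to microarray screening, where only a small false positive rate is tolerable.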

Conclusions
In summary, although the present study depended on a specific set of simulated data derived from cDNA experiments and only the GLR test and the t-test statistical approaches were compared, our findings have three important implications for analysis of microarray data:
• The GLR method is more powerful than the parametric t-test for detecting differential gene expression. This underlines the importance of incorporating and modeling the error structure of microarray data during the development of future statistical tests.
• Within the GLR approach, an error structure that contains a multiplicative error term weighted by the reciprocal of mean expression intensity (GLR3) outperforms the other models.
• Designing experiments that maximize signal-to-noise ratio, instead of just raising the number of replicates per gene, can better identify differentially expressed genes in microarray data.