Copyright © 2004 Townsend; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.

Resolution of large and small differences in gene expression using models for the Bayesian analysis of gene expression levels and spotted DNA microarrays

1 Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
Corresponding author. Jeffrey P Townsend: Townsend/at/Nature.Berkeley.edu
Received February 13, 2004; Accepted May 5, 2004.
This article has been cited by other articles in PMC.

Abstract

Background
The detection of small yet statistically significant differences in gene expression in spotted DNA microarray studies is an ongoing challenge. Meeting this challenge requires careful examination of the performance of a range of statistical models, as well as empirical examination of the effect of replication on the power to resolve these differences.

Results
New models are derived and software is developed for the analysis of microarray ratio data. These models incorporate multiplicative small error terms, and error standard deviations that are proportional to expression level. The fastest and most powerful method incorporates additive small error terms and error standard deviations proportional to expression level. Data from four studies are profiled for the degree to which they reveal statistically significant differences in gene expression. The gene expression level at which there is an empirical 50% probability of a significant call is presented as a summary statistic for the power to detect small differences in gene expression.

Conclusions
Understanding the resolution of difference in gene expression that is detectable as significant is a vital component of experimental design and evaluation.
These small differences in gene expression level are readily detected with a Bayesian analysis of gene expression level that has additive error terms and constrains samples to have a common error coefficient of variation. The power to detect small differences in a study may then be determined by logistic regression.

Background

Spotted DNA microarrays can be used to measure genome-wide gene expression levels in cells of different genotypes, in different developmental states, or within different environments. The precision and accuracy of these measurements depend on the technical performance of the microarray, the degree of replication of the experiment, and the suitability of the model used to analyze the data. A number of models have been advanced for the statistical analysis of experimental designs involving two samples [1-4]. Two methods, a classical ANOVA method [5-7] and a Bayesian method [8], have been designed for the analysis of experimental designs involving multiple nodes of expression such as genotypes, environments, and developmental states. These analyses yield quantitative results on the expression level of a gene, evaluating data from direct hybridizations as well as data from hybridizations that are informative through transitive inference [9].

Optimal statistical inference depends upon the choice of model used for analysis. Townsend and Hartl [8] derived a core model that has been widely used for the estimation of gene expression levels and statistical significance in multifactorial experiments (e.g. [9-15]). This model assumed additive small error terms and either estimated error variances for each genotype, environment, or developmental state or estimated a single variance for all genotypes, environments, or developmental states. The ANOVA models of Kerr et al. [5] and Wolfinger et al. [7] have also been widely used, and assume multiplicative small error terms.
Accordingly, Bayesian models are considered here that incorporate multiplicative error terms. A number of other studies have correlated error variance to raw expression level [2,16-18]. To evaluate the potential effect of this correlation on ratio measurements, models are developed here that constrain the relationship between the error variances and the expression levels to a constant coefficient of variation. Nested error models for spotted DNA microarrays are compared using the Bayesian information criterion for model choice [19]. The power to detect differences in gene expression using these models is evaluated, and the relationship between the estimated expression level, the number of replicate hybridizations, and the ability to determine the statistical significance of small differences in gene expression is explored both for simulated and empirical data. A summary statistic for determining the fold-resolution detectable as significant in empirical microarray studies is presented.

Implementation

Models

Model with small additive error effects

The intensity of hybridization of a DNA spot on a microarray is often used as a measure of gene expression, but the raw intensity is subject to a number of confounding error terms, such as DNA concentration in a spot and sequence hybridization efficiency. As foreseen by the pioneers of DNA microarray technology, these confounding effects (regardless of their multiplicative or additive nature) are eliminated by consideration of the ratio of hybridization of two samples [20]. The remaining small error terms may be modeled either additively or multiplicatively. Townsend and Hartl [8] modeled them additively, deriving a density function for the observed ratio of gene expression in the ith and jth condition, z_ij, as

[equation]

where μ_i is the expression level in condition i and σ_i² is the variance in condition i.
If n conditions or genotypes are under study, direct use of this likelihood function requires estimation of 2n - 1 parameters (n - 1 expression levels plus n variances) for each gene. One alternative is to constrain the variance so that it is common and equal among nodes of an experimental design [8], thus reducing the number of parameters requiring estimation to n (i.e. n - 1 expression levels plus one variance). Another alternative is to constrain the variances such that they have a consistent relationship to the expression levels in each node. For example, conditions or genotypes can have a common coefficient of variation (CV) for each gene, ν = σ_i / μ_i for all i. This alternative also requires the estimation of n parameters (n - 1 expression levels and one CV). It is intuitively motivated by the consideration that larger means for a population are accompanied by larger variances across many phenomena in the sciences [21].

In order to implement a model in which all nodes of an experimental design have a common error CV for each gene, equation 1 may be rewritten, substituting ν²μ_i² for σ_i²:

[equation]

Equation 2 can then be used with a prior to construct a Markov chain whose stationary distribution is the posterior distribution of the parameters given the data [22,23], in all other ways following the algorithm of Townsend and Hartl [8]. The formulation in Equation 2 has additional appeal over Equation 1 when used (as it will be below) within a Markov Chain Monte Carlo (MCMC) analysis. Because values of μ and ν tend to scale similarly across the real line compared to μ and σ², less tuning of the MCMC jump size may be necessary to achieve a satisfactorily mixed chain.
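The parameter counts and the common-CV substitution described above can be illustrated with a minimal Python sketch. The function names `parameter_count` and `variances_from_cv` are hypothetical and are not part of the published software; the sketch only restates the arithmetic in the text.

```python
def parameter_count(n_nodes: int, constrained: bool) -> int:
    """Number of parameters estimated per gene.

    n_nodes - 1 expression levels are always estimated. The general model
    adds one variance per node; a constrained model (common variance or
    common CV) adds a single spread parameter instead.
    """
    expression_levels = n_nodes - 1
    spread_terms = 1 if constrained else n_nodes
    return expression_levels + spread_terms


def variances_from_cv(expression_levels, cv):
    """Variances implied by a common CV: sigma_i^2 = nu^2 * mu_i^2."""
    return [(cv * mu) ** 2 for mu in expression_levels]


# Example: four nodes need 7 parameters unconstrained and 4 when constrained.
print(parameter_count(4, constrained=False), parameter_count(4, constrained=True))
```

With a common CV, a node expressed at twice the level of another is assigned four times the error variance, which is the constraint that distinguishes the CV model from the common-variance model.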
Model with small multiplicative error effects

An alternative to additively modeled error is to model error multiplicatively, such that the post-normalization intensity in one fluorescence channel at a reporter spot is

[equation]

where μ is the absolute quantity of mRNA per cell, the c_m are spot-associated terms of arbitrary distribution for any a multiplicatively confounding factors, the c_l are spot-associated terms of arbitrary distribution for any q-t linearly confounding factors, and ε is a term for small random errors not associated with the spot. The observed ratios of intensities after normalization, y_ij, would then be

[equation]

Taking the log of both sides,

[equation]

The formulation in Equation 4 has some evident similarities to formulations in ANOVA models of gene expression measurement error, where the confounding terms c correspond to the array spot effects identified by Kerr et al. [5] and Wolfinger et al. [7], except that the derivation presented here does not assume that these terms are lognormally distributed. Note that these confounding terms are generally not of biological interest and can immediately cancel in equations 3 or 4. Assuming that error terms log ε_i and log ε_j are composed of many small, unbiased effects, and scaling them so that they are distributed with variances specific to each node, σ_i² and σ_j², it follows from equation 3 that the ratio data, z_ij, are drawn from a ratio of two lognormal distributions. The numerator is drawn from a lognormal distribution with parameters μ_i and σ_i², and the denominator is drawn from a lognormal distribution with parameters μ_j and σ_j².
Just as the difference of two Gaussians is itself Gaussian, the ratio of two lognormals is lognormal; thus the probability density function is

[equation]

Following in all other ways the algorithm of Townsend and Hartl [8], Equation 5 can then be used with a prior to construct a Markov chain, the stationary distribution of which is the posterior distribution of the parameters given the data. Furthermore, just as with equation 2, all variances for each node may be constrained to be equal or each may be constrained to be linearly proportional to its respective expression level by a single CV. In the latter case, with ν = σ_i / μ_i = σ_j / μ_j for all i and all j,

[equation]

Model abbreviations and relations

Models used will be abbreviated with a two-letter acronym. The first letter indicates (A)dditive or (M)ultiplicative error, and the second letter indicates a general (U)nconstrained variance model, a constrained (V)ariance model, or a constrained (C)oefficient of variation model. Thus, the AV and AC models are nested within the AU model, while the MV and MC models are nested within the MU model. With n nodes in the experimental design, the AU and MU models both have 2n - 1 parameters (n - 1 expression levels plus n variances), and the AV, AC, MV, and MC models all have n parameters (n - 1 expression levels, plus 1 variance or CV).

Algorithm

The three-dimensional matrix of ratio results from DNA microarray comparisons, Z, may be constructed, with dimensions i denoting the sample labeled with one fluorophore, j denoting the sample labeled with another, and k denoting the replicate ordinate of that particular dye-labeled comparison. Then, for any continuous structure of comparisons among the nodes of interest, the likelihood density for the parameters μ_l and ν_l, 1 ≤ l ≤ n, is, by Bayes' rule,

[equation]

where g(μ_i, ν_i, μ_j, ν_j) is the prior distribution of the parameters, and where the probability f(z_ijk) of empty elements in the data matrix Z is properly evaluated as one.
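As an illustration of how a likelihood over the data matrix Z might be accumulated under the multiplicative model, the sketch below uses the standard result that the log of a ratio of two independent lognormal variables is Normal, with mean equal to the difference of the two log-scale locations and variance equal to the sum of the two variances. It is a sketch under stated assumptions rather than the published implementation: the log-scale parameterization (`loc`, `var`), the function names, and the use of `None` for empty elements are all assumptions made here for illustration.

```python
import math


def log_ratio_density(z, loc_i, loc_j, var_i, var_j):
    """Log density of z = X_i / X_j for independent lognormal X_i and X_j.

    Assumes log X_i ~ Normal(loc_i, var_i), so that
    log z ~ Normal(loc_i - loc_j, var_i + var_j).
    """
    total_var = var_i + var_j
    deviation = math.log(z) - (loc_i - loc_j)
    return (-math.log(z)
            - 0.5 * math.log(2.0 * math.pi * total_var)
            - deviation * deviation / (2.0 * total_var))


def log_likelihood(observations, loc, var):
    """Sum the log densities over (i, j, z) ratio observations.

    An empty element (z is None) contributes a factor of one to the
    likelihood, that is, zero to the log-likelihood.
    """
    total = 0.0
    for i, j, z in observations:
        if z is None:
            continue
        total += log_ratio_density(z, loc[i], loc[j], var[i], var[j])
    return total
```

Working on the log scale keeps the product over many replicate ratios numerically stable, and skipping empty elements mirrors the statement that their probability is evaluated as one.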
Appropriate informative priors for the variance of microarray data are under investigation [2,4,24]. In this paper, a noninformative prior distribution, uniform across positive real numbers, has been used for both the expression levels and for their variances and CVs. The range has been nominally constrained between zero and 100, though that upper constraint makes no difference for the datasets examined here. The uniform prior gives the microarray data itself the greatest impact on the inferred expression levels and variances, and implies that credible intervals around parameter estimates (the Bayesian equivalents of classical confidence intervals) are close to those that would be found by maximum likelihood. Fortunately, we may use the constant denominator of the Bayes' rule formulation (Equation 5) to assert that

[equation]

Equation 8 may be used to construct a Markov Chain whose stationary distribution is the posterior distribution of the parameters given the data. A vector of initial error coefficients of variation is chosen arbitrarily, and a vector of initial expression levels is chosen such that [equation] at step t = 0. Subsequent values in the chain are determined iteratively by choosing successive proposed values according to an acceptance rule.

Our proposed values are constructed in two separate steps. First, two of the n gene expression level parameters from the vector of expression levels are chosen at random. A step size is drawn at random from a triangular distribution centered at zero with range [-Δμ, +Δμ]. The first of the two chosen parameters is incremented by the chosen step size, and the second is decremented by the same quantity, so that the sum of the expression level parameters is maintained (an apostrophe indicates a proposed parameter value). In the next iteration, each of the CV parameters in the vector of CVs is separately incremented by an amount drawn at random from a triangular distribution with range [-Δν, +Δν] to form the proposed vector of CVs.
The conjecture is accepted for the next state of the Markov chain if

[equation]

Otherwise the original state is retained for the next iteration of the Markov Chain. These steps are repeated over many generations in order to "burn in" the chain, so that it converges from the initial parameter settings to a stationary distribution. Subsequently, states are sampled from the chain at regular intervals to build a posterior distribution for each parameter, integrated across the probable states of all other parameters. All analyses in this paper were performed with 20,000 generations of burn-in, followed by 200,000 generations during which the chain was sampled every 20 generations to construct the posterior distribution. Runs using multiple starting vectors of expression levels and CVs were performed and always converged to the same, unimodal posteriors. Results reported here were the outcomes of Markov chains started with the elements of the expression-level vector all equal to one and the elements of the CV vector equal to 0.2. Step sizes, Δμ and Δν, were tuned for each gene so that acceptance ratios for each parameter update were in the efficient and well-mixed range, (0.15, 0.50) [25]. If acceptance ratios for either parameter jump were less than 0.15 or greater than 0.5, the chain was run again with a better-tuned jump size, until acceptable ratios for both parameters were obtained. In this way, the jump size is never altered during a run; only pilot Markov chains are evaluated to optimize the jump size.

Output

This implementation of these models can accommodate complex experimental designs, where a number of genotypes, environments, and developmental time points are examined. Within this framework, missing data (e.g. excluded single spots, or even missing hybridizations) do not require special consideration or a change in methodology; credible intervals and P values reflect accurately the degree to which the data inform each estimate.
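The chain construction described in the Algorithm section can be sketched in Python as follows. This is a minimal illustration, not the published code. Because the acceptance formula is not preserved in this copy, the sketch assumes the standard Metropolis rule (accept when the posterior ratio exceeds a uniform random draw); it also updates the expression levels and CVs in a single combined proposal, which is a simplification. The names `propose`, `run_chain`, and `log_posterior` are hypothetical, and `log_posterior` is expected to return negative infinity outside the uniform prior range.

```python
import math
import random


def propose(mu, nu, d_mu, d_nu, rng):
    """Build a proposal: shift two expression levels in opposite directions
    (preserving their sum) and perturb every CV, using triangular steps."""
    new_mu = list(mu)
    i, j = rng.sample(range(len(mu)), 2)
    step = rng.triangular(-d_mu, d_mu, 0.0)
    new_mu[i] += step
    new_mu[j] -= step
    new_nu = [v + rng.triangular(-d_nu, d_nu, 0.0) for v in nu]
    return new_mu, new_nu


def run_chain(log_posterior, mu, nu, d_mu, d_nu, burn_in, n_steps, thin, rng):
    """Metropolis sampler; returns thinned post-burn-in states and the
    overall acceptance rate. The starting state must have finite posterior."""
    current = log_posterior(mu, nu)
    samples, accepted = [], 0
    total = burn_in + n_steps
    for t in range(total):
        cand_mu, cand_nu = propose(mu, nu, d_mu, d_nu, rng)
        cand = log_posterior(cand_mu, cand_nu)
        # Standard Metropolis rule (an assumption; see the lead-in).
        if rng.random() < math.exp(min(0.0, cand - current)):
            mu, nu, current = cand_mu, cand_nu, cand
            accepted += 1
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append((list(mu), list(nu)))
    return samples, accepted / total
```

In the setting described in the text, `burn_in` would be 20,000, `n_steps` 200,000, and `thin` 20, and a pilot run whose acceptance rate fell outside (0.15, 0.50) would be repeated with adjusted `d_mu` or `d_nu` rather than adapting the step size within a run.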
This software allows the quantitative information on gene expression levels from microarrays to be thoroughly analyzed and carefully considered in assessing the biological effects of genetic or environmental differences of cellular state. Output from the software implementation is in the form of a tab-delimited text file with one header row. Each row thereafter displays the results for a single gene, including columns with: the estimate of expression level for each node (the median of the posterior distribution); the additions and subtractions to make 95% upper and lower bounds on that estimate; the stationary acceptance rates for the Monte Carlo steps for that gene; and the posterior probabilities (P values) for whether the expression level of a gene in each expression node is greater, or lesser, than the expression level of that gene in each other expression node.

Evaluation

Nested model choice

The common variance (AV, MV) and common CV (AC, MC) models are both nested within their respective general unconstrained variance model (AU, MU). The same number of parameters is estimated in both of the nested models. They differ only in how the estimated variances are constrained with relation to the estimated expression levels. Whether the nested models are appropriate compared to the general model may be assessed using the Bayesian Information Criterion [19], which is to choose the model m that maximizes

$\log M_m - \tfrac{1}{2} h_m \log n$

where M_m is the maximum likelihood of model m, h_m is the number of parameters estimated in the model, and n is the number of observations.

Tests of power

Simulated data sets have an advantage over real data sets, in that true gene expression levels for simulated data are known. Data sets were simulated to ensure that methods introduced here yielded appropriate results when data were derived from a number of reasonable and proposed distributions for gene expression data.
For simulated data sets, six ratio measurements were drawn 1400 times from each of five distributions. The simulated distributions were sampled by the following procedures. For the ratio of normal distributions with a single variance term among all nodes of the experimental design, ratios were created by the division of a random variable drawn from a Normal distribution by another random variable drawn from a Normal distribution, then discarded if outside the range [equation]. For the ratio of normal distributions with a single CV term among all nodes, ratios were created by the division of a random variable drawn from a Normal distribution by another random variable drawn from a Normal distribution, then discarded if outside the range [equation]. For the lognormal distribution, ratios were drawn directly from log N(μ, σ²) or log N(μ, μ²ν²). For the simulation of data from the Gamma distribution and the Cauchy distribution, parameters were chosen such that the means of the distributions were the same as the intended true expression level. Ratios drawn from the Cauchy distribution were discarded if they were below zero or above ten.

For each distribution, 1000 measurements of gene expression level were simulated where both samples had the same expression level, and one hundred measurements were simulated for each of the ratios of expression level 1.1, 1.25, 1.5, and 2. Variance and CV parameters for all of the simulated distributions were set at the average values inferred from the dataset of Townsend et al. [10] under additive models. Note that, although parameters of each distribution were generally chosen so that the variances of the ratio output of each distribution would be similar, no attempt was made to make higher moments than the mean identical. Therefore, the relevant comparisons are between analysis methods on a given simulated dataset, and frequencies of significance calling are not directly comparable across simulated datasets.
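The rejection-sampling procedure for the ratio of two normal distributions with a common CV can be sketched as below. This is an illustration only. The acceptance bounds `low` and `high` are hypothetical parameters: the range used for the normal-ratio simulations is not preserved in this copy, so the defaults simply borrow the zero-to-ten bounds that the text states for the Cauchy case. The function name and the example CV of 0.2 are likewise assumptions.

```python
import random


def simulate_cv_ratios(mu_i, mu_j, cv, n_ratios, rng, low=0.0, high=10.0):
    """Draw ratios of two Normal variables that share a coefficient of
    variation, discarding any ratio outside [low, high]."""
    ratios = []
    while len(ratios) < n_ratios:
        numerator = rng.gauss(mu_i, cv * mu_i)
        denominator = rng.gauss(mu_j, cv * mu_j)
        if denominator == 0.0:
            continue  # avoid division by zero; such a draw is simply redrawn
        z = numerator / denominator
        if low <= z <= high:
            ratios.append(z)
    return ratios


# Six replicate ratio measurements for a simulated 1.5-fold difference.
rng = random.Random(0)
replicates = simulate_cv_ratios(1.5, 1.0, 0.2, 6, rng)
```

Repeating the call 1000 times at equal expression and 100 times at each of the ratios 1.1, 1.25, 1.5, and 2 would reproduce the layout of 1400 simulated genes described above.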
Logistic regressions

Power to detect a difference in gene expression depends critically on the true factor of fold-difference between samples. A continuous logistic function,

$p = \dfrac{1}{1 + e^{-(b + mx)}}$

describing the probability of detection of statistical significance, p, of simulated log2 factors of difference in gene expression, x, was parameterized with an intercept, b, and slope, m, by logistic regression. The same regression was performed on real data by substituting estimates of the factor of difference in gene expression level for known factors of difference, thus providing a profile of the power of an experiment to detect differences in gene expression. A useful metric for such an analysis is the factor of difference in gene expression level that has a fifty percent chance of being identified as significant. Herein, this is referred to as the GEL50, for the Gene Expression Level at which there is a 50 percent chance of detection of statistical significance.

Results

The general and nested models were implemented on two independent published data sets large enough to estimate parameters within the general model [10,26]. The Bayesian Information Criterion (BIC) [19] was used for model choice. For both datasets examined, the nested models had considerably higher BIC values than the general models, regardless of the kind of error model (Table 1), indicating that the nested models, with fewer parameters, are preferable.
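The BIC comparison between a general model and a nested model can be sketched as follows, assuming the usual Schwarz form in which the log of the maximized likelihood is penalized by half the number of parameters times the log of the number of observations. The function names and the example numbers are hypothetical and are not values from Table 1.

```python
import math


def bic_score(log_max_likelihood, n_parameters, n_observations):
    """Schwarz criterion in the 'larger is better' form:
    log M_m - (h_m / 2) * log n."""
    return log_max_likelihood - 0.5 * n_parameters * math.log(n_observations)


def choose_model(candidates, n_observations):
    """Return the name of the candidate with the highest score.

    candidates maps a model name to (log maximum likelihood, parameter count).
    """
    return max(
        candidates,
        key=lambda name: bic_score(candidates[name][0], candidates[name][1],
                                   n_observations),
    )


# Illustrative numbers only: four nodes, so 7 parameters (AU) versus 4 (AC).
example = {"AU": (-99.0, 7), "AC": (-100.0, 4)}
best = choose_model(example, n_observations=50)
```

In this made-up example the general model fits slightly better, but the penalty for its three extra parameters outweighs the gain, so the nested model is chosen.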
Computation time for analysis of published data sets varied across models (Table 1). Computation using additive models (AV, AC, AU) was more rapid than computation using multiplicative models. Regardless of whether small error terms were modeled as additive or multiplicative, constrained CV models (AC, MC) were faster than constrained variance (AV, MV) or general unconstrained (AU, MU) models. Furthermore, the relative ranks of these models in terms of speeds of computation, without exception, remained as above in all analyses of simulated datasets. In the analysis of data simulated as a ratio of two normal distributions, model AC exhibited the greatest power to detect true differences in gene expression (Figure 1).
In the analysis of data simulated as a ratio of two lognormal distributions, model AC again exhibited the greatest power to detect true differences in gene expression (Figure 2).
In the analysis of ratio data simulated from Cauchy and Gamma distributions, model AC again was found to exhibit the greatest power to detect true differences in gene expression (Figure 3).
Higher power to detect true differences, although important in practice for the purpose of choice of model in an experimental study, does not indicate a better fit to the data. This is made clear by comparing Figure 1A [...] The power to detect differences in gene expression as a continuous function of the log2 factor of difference in gene expression for the simulated data shown in Figure 1C [...]
In comparison to performing logistic regression of the frequency of positive calls versus true differences in gene expression level, a logistic regression of the frequency of positive calls versus the estimates of gene expression level derived from analysis of the simulated data may be performed (Figure 4B). Fortunately, the regression in Figure 4B [...]
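The logistic regression of significance calls on log2 factors of difference, and the GEL50 derived from it, can be sketched in Python as below. Setting p = 0.5 in a logistic curve with intercept b and slope m gives b + mx = 0, so the GEL50 on the log2 scale is -b/m. The source does not say how the regression was fitted, so the Newton-Raphson fit here is an assumption, as are the function names and the toy data.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(xs, ys, n_iter=50):
    """Fit p = sigmoid(b + m * x) by Newton-Raphson on the log-likelihood.

    xs are log2 factors of difference; ys are 1 (called significant) or 0.
    """
    b = m = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b + m * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break  # information matrix is singular; stop updating
        b += (h11 * g0 - h01 * g1) / det
        m += (h00 * g1 - h01 * g0) / det
    return b, m


def gel50(b, m):
    """Fold difference with a 50% chance of a significant call."""
    return 2.0 ** (-b / m)


# Toy data: 1 of 4 calls at log2 = 0, 2 of 4 at 0.5, and 3 of 4 at 1.
xs = [0.0] * 4 + [0.5] * 4 + [1.0] * 4
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0]
b, m = fit_logistic(xs, ys)
fold = gel50(b, m)
```

For the toy data the fitted curve crosses p = 0.5 at a log2 difference of 0.5, which corresponds to a GEL50 of about 1.41-fold.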
Discussion

Distinguishing the optimal models to use for the analysis of replicated spotted DNA microarray data is important. Optimized models will yield qualitatively more accurate lists of significantly differently expressed genes, and quantitatively more precise resolution of smaller differences in gene expression. The Bayesian Information Criterion for model selection can be used to choose between models that invoke distinct error variances or coefficients of variation for each node as characterized by genotype, environment, and developmental state, and the nested models that invoke a single variance or CV for all nodes. The values of the BIC for the relatively small studies examined here (Table 1) clearly support analysis with the nested models that invoke a single variance or CV.

In addition to direct assessment of the fit of the model to the data, power to detect known differences may guide model choice. Generally, the ranking of the power of models was consistent regardless of the distribution used to simulate the data (Figures 1, [...]).

If variances are generally proportional to their expression levels, then the constrained CV models (AC and MC) pertain. A linear regression of the estimated coefficients of variation to their respective expression levels should have positive slope. Specifically, regressions on the datasets here typically have positive slope (y = ~ 0.4x + c) and are highly statistically significant, although the data exhibit considerable scatter and thus poor correlation (r² ~ 0.04). These data sets are barely large enough to estimate error variances in a gene-by-gene manner using the general model. Future experimental data with greater replication, analyzed by the general model, will yield higher precision estimates of the error variances and thus better resolution of this question.
When the nominal false positive rate α = 0.05, all models have an actual false positive rate that is moderately to considerably less than 0.05, averaging 0.02 (Figures 1, [...]).

Prediction of the number of replicates required for statistical significance testing of microarray data is theoretically possible [29,30], by making specific assumptions about the error variances and the level of gene expression difference of interest. Here, empirical examination of the power to detect significant differences at different gene expression levels in different studies (Figure 5) [...]

From the datasets analyzed here, it is clear that increased replication leads to greater resolution of small differences in gene expression (Figure 5). [...], where n is the number of nodes in the design and r is the total number of hybridizations performed. The studies examined here all contained replicated comparisons, and, in accord with MIAME standards [31], reported ratio results from each hybridization. Future analyses of a range of additional studies that also report results of each hybridization for each gene will have the potential to reveal a more accurate and precise prediction of power using more sources of information about the quality of the microarray hybridizations and about the optimal design of multifactorial experiments [9].

Increased power to detect differences in gene expression, consequent to better analysis, better replication, or better technical performance, identifies more significant differences in gene expression of genes with smaller and smaller true expression differences. These small differences in gene expression are not only present [10,32], they are relevant to the evolution of gene regulation [10] and to organismal function and phenotype [32,33]. Transcription factors, for instance, may have enormous impact on cellular function with minimal changes in expression level [34,35].
The detection of the differential expression of transcription factors is a major goal of many microarray studies. Therefore, understanding the resolution of difference in gene expression that is detectable as significant is a vital component of experimental design and evaluation.

Availability and requirements

Project name: Bayesian Analysis of Gene Expression Level (BAGEL)
Project home page: http://plantbio.berkeley.edu/~taylor/jto.html
Operating system(s): MacOS 9, MacOS X, Windows and Linux.
Other requirements: none

Acknowledgements

Thanks to Takao Kasuga, Alison Galvani, and Betty Gilbert for comments on the manuscript. JPT was supported while performing this work by a Research Fellowship from the Miller Institute for Basic Research in Science.