Copyright © 2009 RNA Society

Impact of normalization on miRNA microarray expression profiling

1Lausanne DNA Array Facility, Center for Integrative Genomics, University of Lausanne, CH-1015 Lausanne, Switzerland
2Bioinformatics Core Facility, Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland
3Department of Biochemistry, University of Lausanne, CH-1066 Epalinges, Switzerland
4These authors contributed equally to this work.
Reprint requests to: Sylvain Pradervand, Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland; e-mail: Sylvain.Pradervand/at/unil.ch; fax: 41-21-692-4065.

Received July 28, 2008; Accepted December 5, 2008.

Abstract

Profiling miRNA levels in cells with miRNA microarrays is becoming a widely used technique. Although normalization methods for mRNA gene expression arrays are well established, miRNA array normalization has so far not been investigated in detail. In this study we investigate the impact of normalization on data generated with the Agilent miRNA array platform. We have developed a method to select nonchanging miRNAs (invariants) and use them to compute linear regression normalization coefficients or variance stabilizing normalization (VSN) parameters. We compared the invariants normalization to normalization by scaling, quantile, and VSN with default parameters as well as to no normalization using samples with strong differential expression of miRNAs (heart–brain comparison) and samples where only a few miRNAs are affected (by p53 overexpression in squamous carcinoma cells versus control). All normalization methods performed better than no normalization. Normalization procedures based on the set of invariants and quantile were the most robust over all experimental conditions tested. Our method of invariant selection and normalization is not limited to Agilent miRNA arrays and can be applied to other data sets including those from one color miRNA microarray platforms, focused gene expression arrays, and gene expression analysis using quantitative PCR.

Keywords: miRNA profiling, microarray, normalization, invariants

INTRODUCTION

Micro-RNAs (miRNAs) are regulators of mRNA translation and stability that play key roles in a variety of processes, including development, cell proliferation, and differentiation (Alvarez-Garcia and Miska 2005; Pasquinelli et al. 2005; Fabbri et al.
2008). They are derived from long RNA precursors (pri-miRNA) that are first processed into hairpin miRNA precursors (pre-miRNA) of ~70 nucleotides (nt), then into mature 19- to 25-nt-long single-stranded RNAs (Bartel 2004; Kim and Nam 2006). Release 11.0 of the miRBase database cataloged 678 human miRNAs (http://microrna.sanger.ac.uk) (Griffiths-Jones 2006). In order to quantify the expression of different mature molecules simultaneously, DNA microarray technology, originally developed for messenger RNA (mRNA) profiling, has been adapted to miRNAs (Krichevsky et al. 2003). In contrast to mRNA profiling, miRNA profiling must distinguish between mature miRNAs and their precursors and must also distinguish between miRNAs that differ in sequence by as little as a single nucleotide (Shingara et al. 2005).

Commercial miRNA microarrays are manufactured with a variety of design strategies. One approach uses locked nucleic acid (LNA)-modified capture probes that increase the stability of the hybrids and allow the discrimination between single nucleotide differences (Castoldi et al. 2006). These arrays are hybridized with two samples labeled with two different fluorescent colors (Cy3 and Cy5). Other approaches use a single-color array format with only one sample hybridized per array. Among those, Agilent Technologies has developed a miRNA profiling assay that is based on a highly efficient labeling method and a novel microarray probe design (Wang et al. 2007). This system's simple direct-labeling method has little sequence bias and normally requires only 100 ng of total RNA. Furthermore, the probe design strategy used with Agilent arrays provides both sequence and size discrimination for mature miRNAs.

An important goal of microarray data analysis is to remove systematic differences between samples that do not represent true biological variation. This is usually done at the data normalization stage of the analysis process.
Different normalization methods have been used on miRNA microarray expression profiling data sets, but there is currently no clear consensus about their relative performances. Some have even chosen to omit normalization (Baskerville and Bartel 2005; Liang et al. 2005; Wang et al. 2007). The first normalization methods to be used with miRNA array data employed centering to median values (Sun et al. 2004; Castoldi et al. 2006; Garzon et al. 2006) or scaling based on total array intensities (Miska et al. 2004; Tian et al. 2008). Recently, quantile normalization, a popular method for large-scale mRNA array expression data, has also been used with miRNA data (Laurent et al. 2008; Sengupta et al. 2008). Another method developed for mRNA array data analysis—variance stabilizing normalization (VSN)—has also been applied to miRNA array data (Davison et al. 2006; Pan et al. 2008). VSN was developed for mRNA arrays and is based on a parameterized arsinh transformation (instead of a logarithmic transformation) that calibrates sample-to-sample variations and renders variance approximately independent of the mean intensity (Huber et al. 2002). VSN assumes that most genes are not differentially expressed (i.e., are invariant). This concept was used by Garzon et al. (2008), who based their normalization on a set of small noncoding “housekeeping” RNAs, and by Perkins et al. (2007), who used rank invariants. Overall, the normalization methods cited above were developed for the analysis of large-scale mRNA profiling data sets, and no assessment of their relative performances exists for miRNA data sets. Hua et al. (2008) investigated the effect of different normalization methods on data from a custom two-color microarray that does not differentiate between precursor and mature miRNAs. They evaluated the effectiveness of the methods by comparing the normalized microarray data to QPCR data. The correlation between the microarray and QPCR data tended to be low. 
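The arsinh transformation mentioned above for VSN can be illustrated numerically. The sketch below is not the VSN implementation: the function name `arsinh_transform` and the default values of `a` and `b` are placeholders chosen for illustration, whereas VSN estimates its calibration parameters from the data. It only shows that arsinh is defined at zero and for negative inputs, and that it approaches a shifted logarithm at high intensity.

```python
import math


def arsinh_transform(y, a=0.0, b=1.0):
    """Apply arsinh(a + b * y); a and b are illustrative placeholders."""
    return math.asinh(a + b * y)


# Unlike a logarithm, arsinh accepts zero and negative intensities.
print(arsinh_transform(0.0))
print(arsinh_transform(-5.0))

# At high intensity, arsinh(y) is close to log(2 * y).
print(arsinh_transform(1000.0), math.log(2 * 1000.0))
```

Because the two curves differ mainly near zero, the choice between them matters most for the low-intensity probes.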
The objective of this study is to apply different normalization methods to miRNA profiling data sets generated using the one-color Agilent platform and to assess the impact on sensitivity, specificity, and fold-change measurement relative to QPCR. There are substantial differences between miRNA and mRNA data sets: the number of measurements is much smaller (a few hundred versus 10,000–50,000), and the majority of the miRNAs are either not expressed or are expressed at very low levels. Therefore, normalization methods used for mRNA expression arrays may not be appropriate for miRNA arrays. Considering the unique characteristics of miRNA profiling data, we have developed a method based on the minimal assumption that there exists a set of miRNAs whose expression is constant across all the arrays in the experiment (i.e., are invariant). The “invariant” probes are those that have medium-high mean intensity and low variance across arrays, and these probes are identified using mixture models of the mean and variance distributions (Fig. 1).
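The idea of an invariant set can be illustrated with a deliberately simplified sketch. It replaces the mixture models with fixed cutoffs on the per-probe mean and standard deviation, so it is a stand-in and not the procedure used in this study. The function name `select_invariants`, both cutoff parameters, and the example values are hypothetical.

```python
import statistics


def select_invariants(matrix, mean_cutoff, sd_cutoff):
    """Return indices of probes with a high mean and a low SD across arrays.

    matrix holds one row per probe and one column per array (log2 scale).
    Fixed cutoffs are an assumption; the study uses mixture models instead.
    """
    invariants = []
    for probe_index, row in enumerate(matrix):
        high_mean = statistics.mean(row) >= mean_cutoff
        low_sd = statistics.stdev(row) <= sd_cutoff
        if high_mean and low_sd:
            invariants.append(probe_index)
    return invariants


# Made-up intensities: a stable expressed probe, a background probe,
# and an expressed probe that varies strongly between arrays.
example = [
    [10.0, 10.1, 9.9],
    [1.0, 1.1, 0.9],
    [10.0, 13.0, 7.0],
]
print(select_invariants(example, mean_cutoff=5.0, sd_cutoff=0.5))
```

Only the first probe passes both conditions; the second fails on the mean and the third on the SD.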
RESULTS

Technical variability

We first evaluated the ability of the different normalization procedures to reduce the variability between technical replicates. Standard deviations and means were measured for the 100% brain and 100% heart samples separately (Fig. 2A,B).
Sensitivity and specificity

To evaluate gains in sensitivity and specificity after normalization, a set of 59 miRNAs differentially expressed between brain and heart (true positives) was monitored in mixed brain–heart samples. True positives were defined as miRNAs in Q3 and Q4 (considering heart and brain samples together) with a minimum threefold change and a P value < 0.01 using all four normalization methods. We compared a 50% brain–50% heart mixture (Mixture 1) with a 75% brain–25% heart (Mixture 2) or a 95% brain–5% heart (Mixture 3) using a t test. The number of true positive miRNAs identified in these mixtures was plotted against the theoretical false discovery rate (FDR) (Benjamini and Hochberg 1995) as performed previously by Naef and Huelsken (2005). We obtained plots similar to standard receiver operating characteristic (ROC) plots where the area under the curves can be used to compare the sensitivity and specificity of the different methods (Fig. 3A,B).
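The counting of true positives at a given theoretical FDR can be sketched as follows. This is a minimal illustration of the Benjamini and Hochberg adjustment, not the analysis code of this study; the function names `benjamini_hochberg` and `true_positives_at_fdr` and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def true_positives_at_fdr(p_values, is_true_positive, fdr):
    """Count known true positives whose adjusted P value is within the FDR."""
    adjusted = benjamini_hochberg(p_values)
    return sum(
        1 for q, tp in zip(adjusted, is_true_positive) if tp and q <= fdr
    )


# Made-up P values for four probes; three are labeled as true positives.
p = [0.01, 0.04, 0.03, 0.5]
labels = [True, True, False, True]
for cutoff in (0.05, 0.1):
    print(cutoff, true_positives_at_fdr(p, labels, cutoff))
```

Repeating the count over a grid of FDR cutoffs gives one curve per normalization method, which is the kind of ROC-like plot described above.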
Correlation with QPCR

When data from mRNA microarray experiments are compared with data generated with TaqMan assays, the microarray results typically show a compression of fold-change measures (Shi et al. 2006). In order to assess if miRNA microarray data have the same bias, we selected 17 miRNAs (16 from Q4 and 1 from Q3) covering the entire fold-change range observed in the microarray experiments and performed QPCR experiments with the same brain and heart RNA samples (Supplemental Table 1). We then compared the fold changes determined by microarray using the different normalization methods with the fold changes determined by QPCR (Fig. 4).
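One simple way to quantify fold-change compression is the slope of a least-squares line relating microarray log2 fold changes to QPCR log2 fold changes, where a slope below 1 indicates compression. Whether this study used such a slope is not stated in the text above, so the sketch is an assumption; the function name `ols_slope` and the example values are hypothetical.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Made-up log2 fold changes: the array reports half the QPCR change.
qpcr_log2fc = [-4.0, -2.0, 0.0, 2.0, 4.0]
array_log2fc = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(ols_slope(qpcr_log2fc, array_log2fc))
```

Computing one slope per normalization method would allow the methods to be ranked by how closely they track the QPCR fold changes.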
Squamous carcinoma cell line data set

Comparing miRNA gene expression in brain and heart may not reflect accurately the type of experiment in which microarrays normally will be used. To address this, we tested the different normalization methods on microarray results from a system where fewer miRNA expression differences are expected. We chose to focus on a system involving p53, as several studies have been published that implicate miRNA targets such as miR-34a in the regulation of pathways affected by this protein (Xi et al. 2006; Chang et al. 2007; He et al. 2007; Tarasov et al. 2007). MiR-34a expression is induced by p53 both in mice and humans, and its overexpression induces cell cycle arrest (Chang et al. 2007; He et al. 2007). In the present study, we compared the miRNA profiles of the human squamous carcinoma cell line SCC13 infected with recombinant adenoviruses expressing either p53 or GFP (Ad-p53, Ad-GFP). As observed for the brain–heart data set, all normalization methods reduced the variability between technical replicates for expressed probes (Q4) with scaling being the least effective (data not shown). To assess the gain in power that normalization provides, we performed t tests on all the 556 miRNAs. The results are displayed in Q–Q plots with the number of significant miRNAs using an FDR cutoff of 5% (Fig. 5A).
DISCUSSION

In the present study, we have developed a novel miRNA profiling data normalization approach based on the selection of unchanged or invariant probes. We have compared this invariant-based normalization method with other normalization methods using two miRNA expression profiling data sets with different characteristics: a comparison of two tissues where a large fraction of the miRNAs are differentially expressed and a data set from a cell line infected with two different constructs where a much smaller number of miRNAs are affected. Similar conclusions can be drawn from the two data sets: (1) all normalization methods improve the data compared to nonnormalized data; (2) scaling normalization does not perform as well as VSN, quantile, and invariant-based normalization; and (3) VSN with default parameters may not perform as well as quantile and invariant-based normalization when a majority of miRNAs are differentially expressed, but VSN transformation parameters can be computed from a set of preidentified invariants to improve its performance.

Variability between samples can be generated from three sources: the true biological difference, the systematic variation that can be corrected through normalization, and the stochastic variation (noise). The normalization methods compared in this study make different assumptions about the true biological difference and the random noise in order to be able to estimate the systematic variation. Scaling assumes that the overall signal intensity does not change. This implies that the down-regulated miRNAs should equal the up-regulated miRNAs in magnitude of signal intensity or that the majority of miRNAs are unchanged. It also implies that the noise and the stochastic variations of miRNAs are proportional to the signal intensity.
Although scaling normalization is a significant improvement compared to nonnormalized data, it does not perform as well as the other methods, particularly invariant-based regression, which also uses a linear approach. Invariant-based regression, by only taking into account the less-variable probes, will be less affected by large stochastic variations and large biological effects. Quantile normalization assumes that the overall distribution of signal intensity does not change. Whereas this assumption likely holds true for the comparison between p53 overexpressing versus control cells where few probes are affected, it may not be true for the brain–heart comparison where the distributions of expression profiles are significantly different. Under these conditions, although quantile normalization reduces the technical variability of the brain samples (Q2–Q4), it increases the technical variability of heart samples in Q2 and Q3 (Fig. 2B).

Invariant-based methods were among the first approaches used to normalize mRNA gene microarray data (Li and Wong 2001; Tseng et al. 2001). In those applications, nondifferentially expressed genes are selected such that they occur in the same rank order on each chip. The intuitive justification for this is that the measured expression signal of a truly differentially expressed probe is more likely to have different rank relative to the other probes. Micro-RNA profiling platforms have far fewer features than mRNA gene expression profiling platforms (~500 versus 10,000–50,000). Therefore, the probe for a truly differentially expressed gene may have a large difference in intensity without appreciably altering its rank order, and, therefore, it could be classified as invariant. Normalization based on predefined housekeeping genes, popular in QPCR, has also been used for miRNA profiling, where noncoding genes such as tRNA, U2, U4, and U6 small noncoding RNA as well as GAPDH mRNA were selected as invariants (Garzon et al. 2008).
However, many housekeeping genes have been reported to exhibit considerable variability under different experimental conditions (Lee et al. 2002), and their expression levels are often relatively high, making them unrepresentative of the entire expression intensity range. Our normalization approach based on invariant genes is data driven and requires no a priori selection of probes. It also has the advantage of avoiding the large proportion of probes near or at the background signal level. The only assumption of our procedure is a distinguishable low-SD/high-mean population as determined by a mixture model. This assumption is satisfied in the examples presented here. In some cases, it might not be possible to fit the data, for instance, if all probes have low mean and low SD. However, this will be obvious from diagnostic plots, such as Figure 1.

With an increasing number of studies addressing the role of miRNAs in various physiological processes, miRNA profiling is becoming a standard bioanalytical technique. However, to our knowledge, no study has yet addressed the impact of normalization on mature miRNA profiling data. Here, we show that normalization is an important step in miRNA microarray data preprocessing. Since assumptions that are valid for messenger RNA profiling normalization may not hold for miRNA profiling, we propose to calculate the normalization parameters from a set of invariant probes. This method of invariant probe selection is not limited to Agilent miRNA profiling data, but can be generalized to other types of one-color arrays and other data types such as QPCR as well as medium-scale mRNA profiling (e.g., focused gene content DNA microarrays), which interrogate a few hundred probes.

MATERIALS AND METHODS

RNA samples and experimental design

Human heart and brain total RNA were from Stratagene (MVP human normal adult tissue RNA; Stratagene).
Micro-RNA profiling of cell cultures was performed on the human keratinocyte-derived squamous cell carcinoma SCC13 cell line infected with either p53 overexpressing adenovirus (Ad-p53) or the control (Ad-GFP) for 24 h as previously described (Lefort et al. 2007). Cells were collected in Tri-Reagent (Sigma), and total RNA was extracted following the manufacturer's instructions (with the exception that three rounds of chloroform extraction were performed instead of one).

To assess technical reproducibility, three technical replicates from brain and heart RNA were hybridized on Agilent human miRNA microarrays (Wang et al. 2007). To determine sensitivity and specificity, heart and brain RNA were mixed in the following ratios: 50% heart and 50% brain, 25% heart and 75% brain, and 5% heart and 95% brain. Each of the dilutions was hybridized in a technical duplicate on Agilent human miRNA arrays (Human miRNA Microarray Kit #G4470A; Agilent Technologies, Inc.). To assess the effect of p53 expression on miRNA levels in the human SCC cell line, RNA from three biological replicates of p53-expressing versus control cells was hybridized in technical duplicates on the microarrays, resulting in a total of 12 hybridizations.

Target preparation and hybridization

Each sample was prepared according to Agilent's miRNA Microarray System protocol. Total RNA (100 ng) was dephosphorylated with calf intestine alkaline phosphatase (GE Healthcare Europe GmbH), denatured with dimethyl sulfoxide, and labeled with pCp-Cy3 using T4 RNA ligase (GE Healthcare Europe GmbH). The labeled RNAs were hybridized to Agilent human miRNA microarrays for 20 h at 55°C with rotation. After hybridization and washing, the arrays were scanned with an Agilent microarray scanner using high dynamic range settings as specified by the manufacturer. Agilent Feature Extraction Software was used to extract the data. Data are accessible through NCBI GEO (Series record GSE12085).
Normalization

All normalization methods were performed on the Total Gene Signal from Agilent “GeneView” data files in R, an open source statistical scripting language (http://www.r-project.org). Except for VSN, data were log2 transformed after adding a small constant (16 for the SCC13 cell line data set, 28 for the brain/heart data sets) such that the smallest value of the data set was 1 before taking the log. Scaling normalization was performed by dividing each array by its mean signal intensity and then by rescaling to the global mean intensity of all arrays. Quantile normalization was performed using the “normalize.quantiles” function from R package “affy” from the Bioconductor project (http://www.bioconductor.org) (Bolstad et al. 2003). VSN uses an arsinh transformation that is tolerant to negative numbers; therefore, it was applied directly to the raw signal data using the “vsn” function with default parameters from the Bioconductor package “vsn” (Huber et al. 2002). Invariant-based normalization started with the selection of invariant probes, as described in the next section.
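The log2 offset, scaling, and quantile steps described above can be sketched in a few lines. The study performed them in R, so this Python version is only an illustration under stated assumptions: the function names are hypothetical, every array is assumed to hold the same probes in the same order, and ties in quantile normalization are not averaged, unlike in some production implementations.

```python
import math


def log2_with_offset(arrays):
    """Shift the data so the global minimum becomes 1, then take log2."""
    offset = 1.0 - min(min(a) for a in arrays)
    return [[math.log2(v + offset) for v in a] for a in arrays]


def scale_normalize(arrays):
    """Divide each array by its mean, then rescale to the global mean."""
    means = [sum(a) / len(a) for a in arrays]
    global_mean = sum(means) / len(means)  # equal-length arrays assumed
    return [[v / m * global_mean for v in a] for a, m in zip(arrays, means)]


def quantile_normalize(arrays):
    """Give every array the same distribution (mean of the sorted arrays)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    target = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = target[rank]  # same rank, shared target value
        result.append(normalized)
    return result


# Made-up signals for two arrays of three probes each.
print(scale_normalize([[2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]))
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

In the first example the second array is a scaled copy of the first, so scaling alone makes the two arrays identical; quantile normalization forces identical distributions even when the arrays differ in shape, which is the stronger assumption.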
Invariant selection

Invariant miRNAs were selected in two steps: (1) removal of SD versus mean trend and (2) identification of invariant probes from the mean and corrected standard deviation (Fig. 1).

Removal of SD versus mean trend is done by fitting a loess curve to the scatter plot of SD-versus-mean (function “loess” in R with default parameters). The fitted curve corresponds to the trend of SD as the function of the mean. Ideally, it should be flat; when a curvature is observed, the expression measures have to be rescaled so that there is no trend. The rescaling divides by the sum of f and λ, where f is the fitted loess curve and λ is a small constant of 0.1 that avoids division by a value close to 0. This transformation removes the trend in the SD versus mean scatter plot.

Invariant probes are those that have high mean expression across arrays and low SD (i.e., constitutive expression across arrays). Normal mixture models are fitted separately to the distribution of mean and corrected SD, using the “mclust” package in R (Fraley and Raftery 2002). First, a mixture model with two components (i.e., expressed and nonexpressed miRNA) was fitted to the distribution of mean. The posterior probability of class membership is used to decide whether a probe is in a high expression or low expression group. Then, a mixture model was fitted to the standard deviations of the probes from the high expression group only. We ran the “Mclust” function of the “mclust” package with default parameters and let it find the model with the optimal number of components. The probes with the highest probability of being in the first component (smallest SD) were selected as invariants. Details of implementation of invariants normalization are described in the R script available at http://www.unil.ch/dafl/page58744.html.

miRNA expression profiling using TaqMan MicroRNA assays

Total RNA was reverse transcribed with looped microRNA-specific RT primers (Applied Biosystems) contained in the TaqMan MicroRNA Assays Human Panel Early Access Kit (Applied Biosystems, PN 4365381) and TaqMan microRNA Human Assays. Briefly, single-stranded cDNA was synthesized from 10 ng total RNA in 15-μL reaction volume with TaqMan MicroRNA Reverse Transcription Kit (Applied Biosystems), according to the manufacturer's protocol. The reaction was incubated at 16°C for 30 min followed by 30 min at 42°C and inactivation at 85°C for 5 min. Each cDNA was amplified with sequence-specific TaqMan microRNA Assays from Applied Biosystems.
PCR reactions were performed on an Applied Biosystems 7900HT Sequence Detection system in 10 μL volumes in a 384-well plate at 95°C for 10 min, followed by 45 cycles of 95°C for 15 sec and 60°C for 1 min. All samples were tested in quadruplicate. The threshold cycle (Ct) values obtained with the SDS software (Applied Biosystems) were exported into qBase version 1.3.5, a Visual Basic Excel-based script for the management and automated analysis of qPCR data (Hellemans et al. 2007). Ct values were transformed to relative quantities (RQ) and analyzed with geNorm 3.4 software (Vandesompele et al. 2002). This application for Microsoft Excel allows determination of the most stable reference gene from a set of candidate normalization genes (RNU24, RNU43, and Z30) in a given panel of cDNA samples. The small nucleolar RNA Z30 (AJ007733) was found to be the most stable and was subsequently used for normalization.

SUPPLEMENTAL MATERIAL

Supplemental material can be found at http://www.rnajournal.org.

ACKNOWLEDGMENTS

We thank Darlene Goldstein and Mauro Delorenzi for their critical comments on this paper.

Footnotes

Article published online ahead of print. Article and publication date are at http://www.rnajournal.org/cgi/doi/10.1261/rna.1295509.