Nucleic Acids Res. Oct 2008; 36(17): e108.
Published online Aug 1, 2008. doi:  10.1093/nar/gkn430
PMCID: PMC2553586

Consolidated strategy for the analysis of microarray spike-in data


As the number of users of microarray technology continues to grow, so does the importance of platform assessments and comparisons. Spike-in experiments have been successfully used for internal technology assessments by microarray manufacturers and for comparisons of competing data analysis approaches. The microarray literature is saturated with statistical assessments based on spike-in experiment data. Unfortunately, these statistical assessments vary widely and are applicable only in specific cases. This has introduced confusion into the debate over which platforms, protocols and data analysis tools are best. Furthermore, cross-platform comparisons have proven difficult because reported concentrations are not comparable. In this article, we introduce two new spike-in experiments, present a novel statistical solution that enables cross-platform comparisons, and propose a comprehensive procedure for assessments based on spike-in experiments. The ideas are implemented in a user-friendly Bioconductor package: spkTools. We demonstrate the utility of our tools by presenting the first spike-in-based comparison of the three major platforms: Affymetrix, Agilent and Illumina.


Assessing sensitivity presents a challenge for microarray technology because one needs experimental designs in which the correct outcome for a given measurement is known a priori. Spike-ins provide a way to do this and are therefore used extensively for assessment purposes (1–8). In fact, spike-in experiments will be integral to government-led projects that will help determine the practicality of microarrays in clinical applications. An example is the External RNA Control Consortium (ERCC) (9), led by the National Institute of Standards and Technology (NIST). Creating well-characterized, tested RNA spike-in controls is the first goal of the ERCC. Proper statistical analysis strategies for the data generated by these experiments are indispensable. Unfortunately, the statistical assessments for spike-in experiments presented in the literature vary widely and are not generally applicable. For reasons explained in detail below, comparisons across platforms are particularly problematic. In this article, we propose a consolidated strategy that builds on a widely used benchmark methodology (10): we assess specificity and sensitivity in a way that can be easily related to practical performance. We propose a solution to the cross-platform problem and demonstrate its use by analyzing data from spike-in experiments performed by Affymetrix, Agilent and Illumina. The data were preprocessed with the most commonly used procedures, as described in the next section. We refer to the processed data (in log2 scale) as expression values.

An important fact that has been overlooked by previous assessments is that microarray performance largely depends on concentration levels (11). Assessments based on experiments for which spike-in concentrations lead to unusually high expression measurements have resulted in misleading conclusions (12). For this reason, it is essential that the distribution of observed expression for the spike-in transcripts reflects the distributions seen in typical experiments. Figure 1 shows the typical distribution of expression values for the background RNA for the three studied data sets. The tick marks on the x-axis represent the average expression at each reported spike-in level. This figure illustrates that the spike-in transcripts resulted in higher expression measurements, on average, than the background RNA transcripts. Furthermore, we see that, relative to their respective background RNA distributions, the Agilent and Illumina spike-ins have higher observed expression than those in the Affymetrix experiment. Previous work (11) suggests that comparing platform performance without correcting for this leaves Affymetrix at a disadvantage.

Figure 1.
Empirical densities. These plots depict the empirical density of the average (across arrays) expression values for the background RNA. The tick marks on the x-axis show the average expression at each nominal concentration. The dotted lines represent the ...

For spike-in experiments to be useful in a cross-study assessment, we need to understand how the reported nominal concentrations relate across data sets. The reported concentrations attempt to quantify the amount of spike-in RNA in a sample relative to the total amount of RNA. However, in our experience, the reported values are impossible to relate between the different manufacturers. One reason is that two approaches have been used: (i) adding prelabeled spike-ins to the target solution just before hybridization and (ii) adding spike-ins to the total RNA at the beginning of amplification. We prefer the second approach because it better imitates a real experiment (13). Specifically, it reproduces the technical variation due to cDNA synthesis, fragmentation, labeling and hybridization. However, the Affymetrix and Illumina data presented in this article are from experiments that followed the first approach, while the Agilent experiment followed the second approach. A molarity calculation is easy to perform in the first approach but difficult in the second because the only true known values in the experiment are the mass of the spike-in and the mass of total RNA. For this reason, Agilent reports nominal values in different units (relative concentration) than Affymetrix and Illumina [picomolar (pM)]. However, even when picomolar concentrations are reported we find that nominal concentrations do not map well across experiments (Table 1). We hope that the ERCC will solve this problem by standardizing protocols. Here, we propose a data-driven solution that permits cross-platform comparison with existing data sets.

Table 1.
Nominal concentration to ALE mapping


Experimental protocols

The platforms used were Affymetrix's HGU133A GeneChip, Agilent's 4x44K Whole Human Genome Oligo Microarray and Illumina's Human-6 v2 BeadChip. The experiments for each platform were performed by the respective manufacturer, and each manufacturer followed different experimental procedures (Table 2). The raw data were preprocessed with the default procedures: the Affymetrix data with RMA (14); the Agilent data with background subtraction and normalization to the 75th percentile; and the Illumina data with local background subtraction and quantile normalization (15).

Table 2.
Description of data sets

Relating nominal concentrations across data sets

Our solution to the problem of mapping was to replace each nominal concentration with the average log expression across arrays (ALE) for genes spiked in at that concentration. This approach assures that performance assessments based on spike-in data are related to expression measurements that are defined consistently across platforms: low, medium and high ALE values correspond to low, medium and high observed expression values, respectively (Table 1).
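The ALE mapping amounts to a simple group average. A minimal sketch in Python (the `expr` and `nominal` names are illustrative; this is not spkTools code):

```python
import numpy as np

def ale_map(expr, nominal):
    """Map each nominal concentration to its average log expression (ALE).

    expr    : (genes x arrays) matrix of log2 expression values
    nominal : per-gene nominal spike-in concentration
    Returns {nominal concentration: ALE}.
    """
    expr = np.asarray(expr, dtype=float)
    nominal = np.asarray(nominal)
    gene_means = expr.mean(axis=1)  # average each gene across arrays
    return {c: float(gene_means[nominal == c].mean())
            for c in np.unique(nominal)}
```

Because ALE values are defined relative to the observed expression scale of each platform, they can be compared across experiments even when the nominal units differ.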

Accuracy assessment

With the ALE values in place, we were ready to adapt some of the existing statistical assessments to cross-platform comparisons. We started with a basic assessment of accuracy: the signal detection slope (10). Microarrays are designed to measure the abundance of sample RNA. In principle, we expect a doubling of nominal concentration to result in a doubling of observed intensity. In other words, on the log2 scale, the slope from the regression of expression on nominal concentration can be interpreted as the expected observed difference when the true difference is a fold change of 2. Thus, an optimal result is a slope of one, and values higher and lower than one are associated with over- and underestimation, respectively (Figure 2).
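The slope can be estimated with an ordinary least-squares fit of expression on log2 nominal concentration. A minimal sketch (illustrative names, not spkTools code):

```python
import numpy as np

def signal_detection_slope(expr_means, nominal):
    """Signal detection slope: regression of average log2 expression
    on log2 nominal concentration.

    A slope of 1 means a true 2-fold change is observed as a 2-fold
    change; slopes below (above) 1 indicate under- (over-) estimation.
    """
    x = np.log2(np.asarray(nominal, dtype=float))
    y = np.asarray(expr_means, dtype=float)
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)
```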

Figure 2.
Observed versus nominal values. For each of the three platforms, expression values are plotted against the log (base 2) of the reported nominal concentration. The regression slope obtained using all the data and the regression slopes obtained within ...

ALE strata

It has been noted that very high and very low concentrations typically yield lower slopes than those seen at medium concentrations (11). To address this, we consider the signal detection slopes separately for genes spiked in at low, medium and high ALE values (Figure 2). We implemented a data-driven approach to selecting the two cut-offs that separate these strata.

We defined f to be the function that maps nominal log concentration x to expected observed expression f(x). Using a cubic spline fitted to the observed data, we obtained a parametric representation of f. We then looked for concentrations at which clear changes in sensitivity occur, i.e. values of x with large slope changes. Note that large changes in slope result in local maxima in the absolute value of the second derivative of f. For each platform, |f′′| showed two clear local maxima (Supplementary Figure 1). We then mapped each concentration x to its corresponding empirical percentile Φ(x) and plotted |f′′(x)| against Φ(x) (Supplementary Figure 2). The percentiles that maximized the slope change were similar across platforms; the modes of the average curve were 0.615 and 0.993. Therefore, for the purpose of this comparison, we assigned as low the ALE values below the 60th percentile of the distribution of background RNA, and as high those above the 99th percentile. The remaining ALE values, between the 60th and 99th percentiles, were denoted medium. Our choice of cut-points was further motivated by observing that, for the Affymetrix data, the 60th percentile provided a good cut-off for distinguishing genes called present from genes called absent (Supplementary Figure 3).
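The cut-off search can be sketched as follows, assuming a cubic spline fit whose second derivative is scanned for local maxima. This is an illustration of the idea only; the function and argument names are hypothetical and the smoothing choice (`s`) would need tuning for real data:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def slope_change_points(log2_nominal, expr_means, s=0.0, n_grid=200):
    """Locate concentrations with the largest slope changes.

    Fits a cubic spline f of expression on log2 nominal concentration
    (s=0 interpolates; s>0 smooths) and returns the grid points at
    interior local maxima of |f''|, i.e. candidate strata boundaries.
    """
    x = np.asarray(log2_nominal, dtype=float)
    y = np.asarray(expr_means, dtype=float)
    f = UnivariateSpline(x, y, k=3, s=s)
    d2 = f.derivative(n=2)
    grid = np.linspace(x.min(), x.max(), n_grid)
    a = np.abs(d2(grid))
    # strict interior local maxima of |f''|
    peaks = (a[1:-1] > a[:-2]) & (a[1:-1] > a[2:])
    return grid[1:-1][peaks]
```

In the article, the detected change points are then expressed as percentiles of the background RNA distribution before being averaged across platforms.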

Precision assessments

To complete our comparison, we needed to assess specificity. Because the majority of microarray studies rely on relative measures (e.g. fold change) rather than absolute ones, we focused on the precision of the basic unit of relative expression: the log-ratio. We adapted the precision assessment of Cope et al. (10), which focused on the variability of log-ratios from comparisons expected to produce log-ratios of 0. Our set of comparisons was created from all possible comparisons between spiked-in transcripts across arrays at the same nominal concentration, and from all possible comparisons within the background RNA. We refer to this group of comparisons as the Null set. The SD of these log-ratios serves as a basic assessment of precision and has a useful interpretation: it is the expected spread of observed log-ratios for genes that are not differentially expressed. Table 3 and Figure 3 show results for the three platforms.
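A simplified sketch of the Null set SD, restricted for illustration to cross-array comparisons of each transcript with itself (which covers both spike-ins held at the same nominal concentration and the constant background RNA); names are illustrative, not spkTools code:

```python
import numpy as np
from itertools import combinations

def null_log_ratio_sd(expr):
    """SD of null log-ratios.

    expr : (genes x arrays) log2 expression matrix restricted to
           transcripts whose nominal concentration is constant across
           these arrays. Every cross-array difference of such a gene
           with itself is a comparison expected to yield a log-ratio
           of 0, so their SD estimates precision.
    """
    expr = np.asarray(expr, dtype=float)
    ratios = [row[i] - row[j]
              for row in expr
              for i, j in combinations(range(expr.shape[1]), 2)]
    return float(np.std(ratios))
```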

Table 3.
Assessment results
Figure 3.
Log-ratio distributions. These plots depict the distribution of observed log ratios for various nominal fold changes. In each case, the log ratios are stratified by the ALE values into which the two nominal concentrations fall. For example, HL means that ...

Because specificity varies with nominal concentration (11), we stratified these comparisons into low, medium and high ALE values. In Figure 3, many outliers are seen on each platform, as expected given the documented problem of cross-hybridization. Because a platform with a larger SD and small outliers might be preferable to one with a smaller SD but large outliers, we included the 99.5th percentile of the null distribution as a second summary assessment of specificity. Note that in a typical experiment close to 0.5% of null genes are expected to exceed this value, which translates to approximately 100 genes on whole-genome arrays. Figure 3 also includes comparisons of spike-ins expected to yield a given fold change. These further demonstrate the variability of relative expression across ALE strata and provide a rough illustration of the accuracy of log-ratios for each ALE stratum on each platform.

Performance assessments

Precision and accuracy assessments on their own may not be of much practical use. However, the summary statistics described earlier (Table 3) can be easily combined to answer any practical question, as long as it can be posed in a statistical context. We focus on two summaries related to the common problem of detecting differentially expressed genes. Note that we purposely developed summaries that do not directly penalize for a lack of accuracy and precision as long as the real differences are detected. However, as expected, detection ability was highly dependent on accuracy and precision.

For the first example, we computed the chance that, when comparing two samples, a gene with true log fold change Δ = 1 will appear in a list of the top 100 genes (highest log-ratios). We refer to this quantity as the probability of being at the top (POT) and recommend computing it separately in each ALE stratum. Specifically, we assume that the log-ratios in each ALE stratum follow a normal distribution with mean and variance estimated from the data (accuracy slope and SD in Table 3) and compute the probability that a random variable from that distribution exceeds the 99.5th percentile of the null distribution.
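Under the stated normality assumption, POT is a single normal tail probability. A minimal sketch (illustrative names, not spkTools code):

```python
import math

def pot(slope, sd, null_q995, delta=1.0):
    """Probability of being at the top (POT).

    Chance that a gene with true log fold change `delta` exceeds the
    99.5th percentile of the null log-ratio distribution, assuming
    observed log-ratios ~ Normal(slope * delta, sd).
    """
    z = (null_q995 - slope * delta) / sd
    # upper-tail probability of a standard normal via erfc
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```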

As a second example, we computed the expected size of the gene list one would have to consider to find n genes with a true log fold change Δ. To perform this calculation, we assumed m1 genes were differentially expressed and m0 were not, so that m1 + m0 is the number of genes on the array. Furthermore, we assumed that the true log-ratios in each ALE stratum followed a normal distribution with mean and variance estimated from the data (accuracy slope and SD in Table 3), while the empirical distribution was used for the null genes. With these assumptions in place, we computed the gene-list size N required to obtain n = 10 true positives, using m1 = 100 and m0 = 10 000 (Table 3). We refer to this quantity as the gene list needed to detect n true positives (GNN). Again, we recommend computing it separately in each ALE stratum.
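One way to compute GNN under these assumptions: choose the cutoff at which the alternative distribution is expected to contribute n of the m1 truly changed genes, then add the null genes expected to exceed that cutoff. A sketch with hypothetical names (not spkTools code):

```python
from statistics import NormalDist

def gnn(slope, sd, null_values, delta, n=10, m1=100, m0=10_000):
    """Expected gene-list size needed to find n true positives (GNN).

    True log-ratios of the m1 changed genes ~ Normal(slope * delta, sd);
    the m0 null genes follow the empirical distribution `null_values`
    (observed null log-ratios).
    """
    alt = NormalDist(mu=slope * delta, sigma=sd)
    # cutoff at which the alternative yields n of the m1 changed genes
    cutoff = alt.inv_cdf(1.0 - n / m1)
    # fraction of null log-ratios exceeding that cutoff
    null_above = sum(v > cutoff for v in null_values) / len(null_values)
    return n + m0 * null_above
```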

Imbalance measure

Those interested in taking advantage of our methodology should know that an important requirement is a spike-in experimental design that does not confound nominal concentrations and genes. A large source of variability in microarray data is the probe effect (16), and probe effects vary across platforms. We fitted an analysis of variance (ANOVA) model to describe the probe effect for each platform (Table 4). Note that if nominal concentrations are confounded with genes, it becomes impossible to separate differences due to signal detection from differences in probe affinities. Many of the previously published spike-in experiments suffer from this confounding. To quantify design imbalance, we used the following measure developed by Wu (17):

\[
I(\lambda) \;=\; \sum_i \lambda_i \sum_{u_i} \sum_t \left[\, n_t(u_i) - \frac{n(u_i)}{T} \,\right]^2,
\]

with T denoting the number of treatment levels,

where i denotes each covariate, λi an optional weight associated with each covariate, ui the possible levels of covariate i, t the treatment levels, nt(ui) the number of units with covariate i at level ui receiving treatment t, and n(ui) the total number of units with covariate i at level ui (17). In our case, the two covariates are probe and array, and the treatment is nominal concentration. Since the imbalance is a weighted sum of the imbalance due to each covariate, we report the probe and array imbalance separately to better identify the source of imbalance in each design. To avoid penalizing large designs, we divided the probe imbalance by the number of probes and the array imbalance by the number of arrays. These results are included in Table 4.
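For a single covariate, the measure reduces to summing, over covariate levels and treatments, the squared deviation of each cell count from the count a perfectly balanced assignment would give. A sketch, assuming the squared-deviation form above (function and argument names are illustrative):

```python
import numpy as np

def imbalance(covariate, treatment, weight=1.0):
    """Imbalance of treatment assignments w.r.t. one covariate.

    For each covariate level u and treatment t, a balanced design has
    n_t(u) = n(u) / T, where T is the number of treatments; we sum the
    squared deviations from that target, scaled by an optional weight.
    """
    covariate = np.asarray(covariate)
    treatment = np.asarray(treatment)
    treatments = np.unique(treatment)
    T = len(treatments)
    total = 0.0
    for u in np.unique(covariate):
        at_u = covariate == u
        n_u = at_u.sum()
        for t in treatments:
            n_tu = (at_u & (treatment == t)).sum()
            total += (n_tu - n_u / T) ** 2
    return weight * total
```

A fully balanced assignment scores 0; a completely confounded one (each covariate level receiving only one treatment) scores the maximum, which is why we report probe and array imbalance separately and normalize by design size.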

Table 4.
ANOVA results

We have developed a software package that permits quick and easy creation of plots and tables such as those presented here. The software is freely available as the spkTools package from the Bioconductor Project (18). This package defines a new S4 class that extends the ExpressionSet class (18) to include a matrix of nominal concentrations; this new class is called SpikeInExpressionSet. The functions implemented in this package take an object of this type as their input and automatically produce the tables and plots presented in this paper. Of particular interest is a function named spkAll, which is a wrapper function for all the functions contained in this package. When run on a SpikeInExpressionSet object, it produces the full complement of tables and plots shown in this article and saves them with easily recognizable file names. Although this package was designed with the intent of producing the full array of results for each experiment, the functions can also be applied separately with a few exceptions where the output of one function is required as the input of another. Further details and examples outlining the use of these functions can be found in the help files accompanying the package.


The ANOVA analysis (Table 4) revealed that all platforms have similarly sized probe effects, underscoring the importance of balancing genes and concentration levels. The Agilent experiment had a small imbalance (Table 4) because Agilent used one fewer concentration mixture than the number of spike-in probes. The Illumina experiment had a very large array imbalance because array and concentration were completely confounded. However, because all data sets were normalized, we expected the array effect for Illumina to be small, as with Affymetrix and Agilent; this type of confounding is therefore less problematic.

Figure 2 demonstrates that Agilent performed best with regard to accuracy in all concentration bins. While Illumina performed better than Affymetrix at medium concentrations, Affymetrix performed better at low and high concentrations and was the most consistent across bins. Had we looked only at the overall slope, Illumina would have appeared to perform best, because fold changes are overestimated at medium concentrations. Figure 2 also shows the changing relationship between expression and nominal concentration: for all three platforms, the slope is small at low nominal concentrations, larger at high concentrations, and largest at medium concentrations. However, the differences between these slopes, and the nominal concentrations at which the shifts between bins occur, vary across platforms. This illustrates why it is crucial to view nominal concentration and expression as platform-dependent measures.

Figure 3 highlights two important findings: (i) precision depends strongly on concentration, with higher variability observed at low concentrations; and (ii) Affymetrix, which had the worst accuracy, had the best precision, especially at low concentrations, where the difference was substantial.

In terms of the POT and GNN assessments, Affymetrix outperformed Agilent, which outperformed Illumina: the gene-list sizes in the low/medium/high strata were 37/34/25 for Affymetrix, 682/38/26 for Agilent and 1489/60/46 for Illumina (Table 3). To provide a graphical version of this summary, we included boxplots of observed log-ratios for comparisons with nonzero nominal log fold changes Δ > 0 (Figure 3). Due to the different designs, using the same expected log fold change Δ for all platforms was not possible; we used the closest available values instead: log2(4) for Affymetrix and log2(3) for Agilent and Illumina. Log-ratio (M) versus average intensity (A) plots also depict both accuracy and precision and are included as Supplementary Figure 4.


We have described a general assessment procedure for microarray data based on spike-in experiments and demonstrated how the procedure can be used to compare across different experiments and microarray platforms. A novel aspect of the approach is that we independently assess performance at low, medium, and high concentrations using ALE values: an empirically constructed mapping between nominal concentration and observed expression. This mapping is important because nominal concentrations cannot be interpreted in the same way in all experiments. In our approach, measurements are interpreted relative to the distribution of background RNA expression.

Our results demonstrate that while Agilent and Illumina had better overall accuracy, Affymetrix has better precision. In the medium and high strata, Affymetrix and Agilent performed similarly, and better than Illumina, according to the POT and GNN measures. In the low strata, Affymetrix greatly outperformed Agilent and Illumina. Affymetrix's advantage was due to the smaller number of outliers (Table 3 and Supplementary Figure 1). Note that to keep the article focused, we considered a basic analysis approach based on fold change. However, the spkTools package can be used for more elaborate platform comparisons. For example, to help reduce outliers, we filtered genes called absent or undetected by the manufacturer's software. Because various noisy comparisons are no longer considered, this approach improves specificity in the low strata. However, sensitivity is made worse because true differences are accidentally filtered away (Supplementary Table 1). Problems with these detection calls have been documented (19).

As we previously described, the 60th and 99th percentiles provided the best cut-points for this analysis; however, that need not be the case in future analyses. For this reason, the spkTools package permits the user to choose any two percentiles as the cut-points. We recommend that the optimal cut-points for a future data set be determined in a manner similar to the one we described.

It is important to note that the microarray products compared here are of different generations, with Affymetrix's the oldest and Agilent's the newest. Also, the spike-in targets and background RNA vary between platforms (Table 2). Finally, different preprocessing algorithms will result in differences in performance (16). To illustrate this, we ran Affymetrix data processed with the manufacturer's default MAS 5.0 through our assessment (Table 3 and Supplementary Figure 6). An interesting finding was that with MAS 5.0, instead of RMA, Affymetrix no longer had an advantage in the low strata. We expect results for Agilent and Illumina to improve with the development of novel preprocessing algorithms for these technologies by the scientific community. Because we expect a large increase in spike-in experiments and preprocessing algorithms, we developed the spkTools package to permit quick and easy creation of plots such as those presented here.

Spike-in experiments have been criticized for producing artificial data with little resemblance to real data produced by a typical experiment. A particular limitation of the data shown here is the use of technical replicates: the data fail to incorporate the biological variation present in most experiments. However, acceptable sensitivity and specificity measures, determined by spike-in experiments such as those presented here, are a minimal requirement for a microarray platform. A technology not performing well in our assessment will not perform well in the more complicated setting of real experiments. A strength of our spike-in data is that they permit a focused assessment based on the most basic attributes of this technology. Furthermore, we expect vastly improved spike-in experiments, e.g. using biological replicates, to emerge in large numbers once the ERCC makes its first formal recommendation. We have therefore developed a general tool that can be readily applied to data from these experiments. Finally, the strategy we described, which we used to provide the first comparison of the three main microarray platforms based on Latin square spike-in experiments, can serve as a blueprint for future methods and analyses.


Supplementary data are available at NAR Online. Raw data and annotation files are available from http://rafalab.org.



We thank Agilent, and especially Anne Bergstrom Lucas, for providing the Agilent spike-in data and insightful discussions on spike-in experiments. We thank Illumina, and especially Shawn Baker, for sharing the Illumina spike-in data. We also thank Affymetrix for making their spike-in experiments public. This work was supported by National Institutes of Health (1R01GM083084-01 to R.I., 1R01RR021967-01A2 to R.I., T32GM074906 to M.M.).

Conflict of interest statement. None declared.


1. Fodor S, Read J, Pirrung M, Stryer L, Lu A, Solas D. Light-directed, spatially addressable parallel chemical synthesis. Science. 1991;251:767–773.
2. Fodor S, Rava R, Huang X, Pease A, Holmes C, Adams C. Multiplexed biochemical assays with biological chips. Nature. 1993;364:555–556.
3. Schena M, Shalon D, Davis R, Brown P. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270:467–470.
4. Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M, Mittmann M, Wang C, Kobayashi M, Horton H, et al. DNA expression monitoring by hybridization of high density oligonucleotide arrays. Nat. Biotechnol. 1996;14:1675–1680.
5. Higuchi R, Fockler C, Dollinger G, Watson R. Kinetic PCR analysis: real-time monitoring of DNA amplification reactions. Biotechnology. 1993;11:1026–1030.
6. Heid C, Stevens J, Livak K, Williams P. Real time quantitative PCR. Genome Res. 1996;6:986–994.
7. Wittwer C, Herrmann MG, Moss AA, Rasmussen RP. Continuous fluorescence monitoring of rapid cycle DNA amplification. Biotechniques. 1997;22:130–139.
8. Cronin M, Ghosh K, Sistare F, Quackenbush J, Vilker V, O'Connell C. Universal RNA reference materials for gene expression. Clin. Chem. 2004;50:1464–1471.
9. Baker S, Bauer S, Beyer R, Brenton J, Bromley B, Burrill J, Causton H, Conley M, Elespuru R, Fero M, et al. The External RNA Controls Consortium: a progress report. Nat. Methods. 2005;2:731–734.
10. Cope L, Irizarry R, Jaffee H, Wu Z, Speed T. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics. 2004;20:323–331.
11. Irizarry R, Wu Z, Jaffee H. Comparison of Affymetrix GeneChip expression measures. Bioinformatics. 2006;22:789–794.
12. Irizarry R, Cope L, Wu Z. Feature-level exploration of a published Affymetrix GeneChip control dataset. Genome Biol. 2006;7:404.
13. Tong W, Lucas A, Shippy R, Fan X, Fang H, Hong H, Orr M, Chu T, Guo X, Collins P, et al. Evaluation of external RNA controls for the assessment of microarray performance. Nat. Biotechnol. 2006;24:1132–1139.
14. Irizarry R, Bolstad B, Collin F, Cope L, Hobbs B, Speed T. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003;31:e15.
15. Kuhn K, Baker S, Chudin E, Lieu M, Oeser S, Bennett H, Rigault P, Barker D, McDaniel T, Chee M. A novel, high-performance random array platform for quantitative gene expression profiling. Genome Res. 2004;14:2347–2356.
16. Irizarry R, Warren D, Spencer F, Kim I, Biswal S, Frank B, Gabrielson E, Garcia J, Geoghegan J, Germino G, et al. Multiple-laboratory comparison of microarray platforms. Nat. Methods. 2005;2:345–350.
17. Wu C. Iterative construction of nearly balanced assignments I: categorical covariates. Technometrics. 1981;23:37–44.
18. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80.
19. Zilliox M, Irizarry R. A gene expression bar code for microarray data. Nat. Methods. 2007;4:911–913.
