Journal URL: http://www.bioinformation.net/

Bioinformation. 2007; 1(10): 423–428.
Published online 2007 April 10.
PMCID: PMC1896057
Exploratory Methods for Checking Quality of Microarray Data
Eun-Kyung Lee and Taesung Park*
Department of Statistics, Seoul National University, Seoul, Korea
*Taesung Park E-mail: tspark/at/stats.snu.ac.kr; Corresponding author
Received January 12, 2007; Accepted February 2, 2007.
In microarray experiments many undesirable systematic variations are commonly observed. Often investigators analyzing microarray data need to make subjective decisions about the quality of the experiment, by examining its chip image and a simple scatter plot. Thus, a more rigorous but simple method is desirable to determine the quality of microarray data. We propose two exploratory methods to investigate the quality of microarray experiments with replicated chips. The first method is based on correlations among chips and the second on the actual intensity values for each gene. The proposed methods are illustrated using a real microarray data set. The methods provide an initial estimation for determining the quality of microarray experiments.
Keywords: microarray, quality, checking, exploratory methods
In microarray experiments, different sources of systematic and random error can arise, and these may significantly affect inference on the measured gene expression patterns. A normalization procedure is regularly employed to remove (or minimize) the artifacts due to such errors. While normalization approaches are useful for adjusting the bias of each individual chip, they do not provide a rigorous statistical criterion for detecting chips of poor quality. At an early stage of analysis, each microarray slide is often examined graphically, using scatter plots between chips, to look for large variability (or low reproducibility) and any unusual patterns. However, such examinations rely on subjective human pattern recognition, and chips of poor quality can frequently enter the subsequent analysis, resulting in unreliable inference for the whole microarray study. Therefore, in this study we are concerned with checking the overall quality of microarray experiments and with identifying outlying chips that have much lower reproducibility than the other chips.
There have been several approaches for checking reproducibility in microarray experiments. For example, Parmigiani et al. [1] defined an integrative correlation between two experiments that are conducted separately to answer the same biological question. This integrative correlation is calculated for each gene and is called the gene's reproducibility score. King et al. [2] used correlations, the rate of two-fold changes, and principal component analysis to check the reproducibility of gene expression measurements. Park et al. [3] proposed diagnostic plots for identifying outlying slides. In this paper, we propose an exploratory method that checks the quality of microarray data using two different approaches.
Methodology
We first describe the approach based on the correlations between chips and then describe the other approach based on the actual intensity values.
Correlation Based Approach
Given in the supplementary material linked below
In this section, the proposed methods are applied to murine B-cell data. To study gene expression profiles in murine B-cell development, total cellular RNA was extracted from five consecutive B-lymphocyte lineage sub-populations (pre-BI cells, large pre-BII cells, small pre-BII cells, immature B-cells, and mature B-cells). Gene expression profiles from these five consecutive stages of mouse B-cell development were then generated with at least five replicates each [8].
The murine B-cell data show lower sensitivity (0.66) and specificity (0.02). For further exploratory analysis, we apply the proposed methods. In the chip-wise correlation plot (Figure 1), most treatments show high chip-wise correlations; the exception is the small pre-BII cells (chips 23 to 27). The chip-wise correlations of the small pre-BII treatment have a highly skewed distribution, and the third replicate has very small correlations compared with the other chips in the same group. Therefore, we conclude that this third replicate is problematic and has to be checked or treated before further analysis. In the summary correlation plot (Figure 2), the murine B-cell data show one outlier, chip 25. All chips except those in the small pre-BII group are located in the upper triangle, and chip 25 is far from the other chips. This supports the result from the chip-wise correlation plot (Figure 1).
Figure 1
Chip-wise correlation plot: Murine B-cell data. The plots are for the five treatments: Immature B (1, 2, 3, 4, 5), Large Pre-BII (6, 7, 8, 9, 10), Mature B (11, 12, 13, 14, 15, 16), Pre-BI (17, 18, 19, 20, 21, 22), and Small Pre-BII (23, 24, 25, 26, 27)
Figure 2
The summary correlation plot. The solid line across the plot is the reference line for specificity. Chips below this line have low specificity and chips above this line have high specificity.
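The idea behind the chip-wise correlation check can be illustrated with a minimal sketch. It computes the Pearson correlation between every pair of replicate chips within one treatment and reports the chip whose mean correlation with the other replicates is lowest. This is only an illustration under stated assumptions: the function names, the "lowest mean correlation" rule, and the toy log-intensity values are hypothetical, and the sketch is neither the procedure described in the supplementary material nor the arrayQCplot interface.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chipwise_mean_correlation(chips):
    """Mean correlation of each chip with the other replicates.

    chips maps a chip label to its log intensities, with genes in the
    same order on every chip (hypothetical input layout).
    """
    totals = {label: 0.0 for label in chips}
    for a, b in combinations(chips, 2):
        r = pearson(chips[a], chips[b])
        totals[a] += r
        totals[b] += r
    others = len(chips) - 1
    return {label: total / others for label, total in totals.items()}


def least_correlated_chip(chips):
    """Label of the chip with the lowest mean within-group correlation."""
    means = chipwise_mean_correlation(chips)
    return min(means, key=means.get)


# Toy values for illustration only; these are not data from the study.
toy_chips = {
    "rep1": [6.1, 7.4, 8.2, 9.2, 10.3, 11.1, 12.3, 13.0],
    "rep2": [5.9, 7.6, 8.0, 9.4, 10.1, 10.9, 12.5, 13.2],
    "rep3": [9.0, 6.5, 12.0, 7.0, 11.5, 8.0, 10.0, 9.5],
    "rep4": [6.0, 7.5, 8.1, 9.3, 10.2, 11.0, 12.4, 13.1],
}
print(least_correlated_chip(toy_chips))
```

A chip flagged this way is only a candidate for closer inspection; whether it is truly an outlier still depends on the distribution of correlations in the other treatment groups.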
In Table 1, the last columns of the PKS and PW matrices show lower p-values than the others. Therefore, we can conclude that the distribution of within-group correlations in the small pre-BII group differs from the distributions in the other groups. In addition, the mean within-group correlation in the small pre-BII group is lower than the means of the other groups.
Table 1
PKS and PW matrices of Murine B-cell data
Next, we apply the test based on intensities within each treatment, setting the FDR at 5%. Table 2 shows the results of the intensity-based tests. The murine B-cell data show quite different patterns across treatments. In particular, the gamma of the small pre-BII treatment is the lowest among the five treatments. Therefore, we can conclude that the murine B-cell data set is less reproducible.
Table 2
Summary table for the within test based on intensities
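The intensity-based test itself is given in the supplementary material and is not reproduced here. The sketch below only illustrates how a 5% FDR threshold might be applied to a list of gene-wise p-values, using the Benjamini-Hochberg step-up procedure. That choice is an assumption: the authors may control the FDR differently (for example, with q-values), and the function name and the p-values shown are hypothetical.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure.

    Finds the largest rank k such that the k-th smallest p-value is at
    most k * fdr / m, then rejects the k hypotheses with the smallest
    p-values. Returns their original indices in ascending order.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical gene-wise p-values, for illustration only.
toy_pvalues = [0.039, 0.001, 0.216, 0.008]
print(benjamini_hochberg(toy_pvalues, fdr=0.05))
```

Because the procedure is step-up, a gene whose p-value misses its own rank threshold can still be rejected when a gene with a larger p-value meets its threshold.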
We can conclude that the murine B-cell data show lower reproducibility, sensitivity and specificity. Therefore, it is not clear whether a further statistical test procedure can successfully detect true differences among the five consecutive stages, especially for the small pre-BII cells. This is mainly due to one outlying chip (chip 25), as shown in Figure 3. The analyst should therefore check the experimental procedure and the tissues used for this chip before any further statistical analysis.
At the initial stage of the microarray data analysis, the exploratory data analysis (EDA) provides the first contact with data. The techniques of EDA consist of a number of informal steps such as checking the quality of the data, calculating simple summary statistics, and constructing appropriate graphs.
The proposed method is a more formal way of checking quality than simple EDA plots. Thus, at an initial stage of the microarray data analysis, the proposed method provides useful information regarding the quality of microarray experiments. The correlation based approaches check the treatment-wise quality, while the test based on the actual intensity values checks the gene-wise quality for each gene.
The proposed method is quite effective in detecting some outlying chips. It is much easier to apply than traditional methods of checking for outlying chips, such as principal component analysis or the quality control plot [3].
There are some statistical issues to be taken into consideration, however. First, the log intensities may not be approximately normally distributed. For simplicity, we have assumed the normal distribution for testing all hypotheses. However, extensions to other distributional assumptions are certainly possible; for example, log-normal and gamma distributions can be easily handled. Second, we did not use a stringent criterion for identifying the concordant/discordant genes. All these genes should be checked using an analysis such as SAM [9] or a t-test [10] at a later stage of analysis. Third, the correlation coefficients derived from all possible pairs of chips may not be independent. We did not account for this dependence in the current analysis. A more sophisticated approach based on bootstrapping, which considers possible correlations among the correlation coefficients, is under development.
We would like to emphasize that the proposed method is an exploratory analysis. We believe the proposed method is practically useful, simple, and easy to implement, and that it provides a more rigorous approach to a preliminary overview of the quality of microarray experiments. Most of the proposed methods are implemented in the software arrayQCplot [11], which can be downloaded from Bioconductor (www.bioconductor.org).
Figure 3
The scatter plot matrix of five replicates for Small Pre BII treatment in Murine B-cell data
Supplementary Material
Data 1
Acknowledgments
The authors would like to thank the anonymous referees and the editor, whose comments were extremely helpful. This study was supported by the National Research Laboratory Program of Korea Science and Engineering Foundation (M10500000126) and the Brain Korea 21 Project of the Ministry of Education.
Footnotes
Citation: Lee & Park, Bioinformation 1(10): 423-428 (2007)
References
1. Parmigiani G, et al. Clin Cancer Res. 2004;10:2922.
2. King C, et al. J Mol Diagn. 2005;7:57.
3. Park T, et al. Biotechniques. 2005;38:463.
4. Pavlidis P, Noble WS. Genome Biol. 2001;2:RESEARCH0042.
5. Jain N, et al. Bioinformatics. 2003;19:1945.
6. Storey JD, Tibshirani R. Proc Natl Acad Sci. 2003;100:9440.
7. Pounds S. Brief Bioinform. 2005;38:463.
8. Hoffmann R, et al. Genome Res. 2002;12:98.
9. Tusher VG, et al. Proc Natl Acad Sci. 2001;98:5116.
10. Choe SE, et al. Genome Biol. 2005;6:R16.
11. Lee EK, et al. Bioinformatics. 2006;22:2305.