Copyright © 2006 Biomedical Informatics Publishing Group

Exploratory Methods for Checking Quality of Microarray Data

Department of Statistics, Seoul National University, Seoul, Korea
*Taesung Park E-mail: tspark/at/stats.snu.ac.kr; Corresponding author

Received January 12, 2007; Accepted February 2, 2007.

This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes, provided the original author and source are credited.

Abstract

In microarray experiments many undesirable systematic variations are commonly observed. Often investigators analyzing microarray data
need to make subjective decisions about the quality of the experiment, by examining its chip image and a simple scatter plot. Thus, a more
rigorous but simple method is desirable to determine the quality of microarray data. We propose two exploratory methods to investigate the
quality of microarray experiments with replicated chips. The first method is based on correlations among chips and the second on the actual
intensity values for each gene. The proposed methods are illustrated using a real microarray data set. The methods provide an initial estimation
for determining the quality of microarray experiments.

Keywords: microarray, quality, checking, exploratory methods

Background

In microarray experiments different sources of systematic and random errors can arise, which may significantly affect the inference on the measured gene expression patterns. A
normalization procedure is regularly employed to remove (or minimize) the artifacts due to such errors. While these normalization approaches are useful for adjusting bias of
each individual chip, they do not provide a rigorous statistical criterion for detecting chips of poor quality. At an earlier stage of analysis, each microarray slide is often
examined graphically, using scatter plots between chips, to look for large variability (or low reproducibility) and any unusual patterns. However, such examinations are
based on subjective human pattern recognition, and chips of poor quality can frequently enter the subsequent analysis, resulting in unreliable inference for the whole microarray
study. Therefore, in this study we are concerned with checking the overall quality of microarray experiments and identifying outlying chips that have much lower
reproducibility than other chips.

There have been several approaches for checking reproducibility in microarray experiments. For example, Parmigiani et al. [1]
defined integrative correlation between two experiments that are conducted separately to answer the same biological question. This integrative
correlation is calculated for each gene and is called the gene's reproducibility score. King et al. [2] used correlations, the
rate of two-fold changes, and principal component analysis to check the reproducibility of gene expression measurements. Park et al. [3]
proposed diagnostic plots for identifying outlying slides. In this paper, we propose an exploratory method to check the quality of microarray data
using two different approaches.

Methodology

We first describe the approach based on the correlations between chips and then describe the other approach based on the actual intensity values.

Correlation Based Approach

Given in the supplementary material linked below.

Example

In this section, the proposed methods are applied to murine B-cell data. To study gene expression profiles in murine B-cell development, total cellular RNA was extracted from
five consecutive B-lymphocyte lineage sub-populations (pre-BI cells, large pre-BII cells, small pre-BII cells, immature B-cells, and mature B-cells), and then, gene
expression profiles from the five consecutive stages of mouse B-cell development were generated with more than five replicates [8]. The murine B-cell data show lower sensitivity (0.66) and specificity (0.02). For further exploratory analysis, we apply the proposed methods. In the chip-wise correlation
plot (Figure 1
In Table 1, the last columns of PKS and PW show lower p-values than the others. Therefore, we can conclude that the distribution of within-group correlation in the small pre-BII group is
greater than the distribution of the other groups. Also, the mean of within-group correlation in the small pre-BII group is less than the mean of the other groups.
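
As a rough illustration of the chip-wise correlation idea, the following Python sketch computes all pairwise Pearson correlations among replicate chips within one treatment group and reports the chip whose average correlation with the others is lowest. It is only a sketch under stated assumptions: the formal tests behind Table 1 are described in the supplementary material and are not reproduced here, and the chip names, intensity values, and function names below are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def within_group_correlations(chips):
    """Map each pair of chip names to the correlation of their log intensities."""
    return {
        (a, b): pearson(chips[a], chips[b])
        for a, b in combinations(sorted(chips), 2)
    }


def mean_correlation_per_chip(chips):
    """Average each chip's correlation with every other chip in its group."""
    pair_corr = within_group_correlations(chips)
    means = {}
    for name in chips:
        values = [r for pair, r in pair_corr.items() if name in pair]
        means[name] = sum(values) / len(values)
    return means


# Hypothetical log intensities for four replicate chips (six genes each).
group = {
    "chip_a": [7.1, 8.3, 9.0, 10.2, 11.5, 12.1],
    "chip_b": [7.0, 8.5, 9.1, 10.0, 11.7, 12.0],
    "chip_c": [7.3, 8.2, 8.9, 10.4, 11.4, 12.3],
    "chip_d": [9.8, 7.2, 11.9, 8.1, 10.3, 9.0],  # deliberately discordant
}

means = mean_correlation_per_chip(group)
suspect = min(means, key=means.get)
print(suspect, round(means[suspect], 3))
```

A chip with a markedly lower average correlation than its replicates is a candidate for closer inspection. With real data the same calculation would be run separately within each treatment group.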
Next, we apply the test based on intensities within treatment. We set the FDR at 5%. Table 2 shows the results of the intensity-based tests. The murine B-cell data show
quite different patterns. In particular, the gamma of the small pre-BII treatment is the lowest among the five treatments. Therefore, we can conclude that the murine B-cell data set is less reproducible.
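
To make the 5% FDR setting concrete, the sketch below applies the Benjamini-Hochberg step-up procedure to a list of gene-wise p-values. This is a generic illustration and not necessarily the procedure used for Table 2, whose details are given in the supplementary material. The p-values and the function name are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical gene-wise p-values from an intensity-based test.
gene_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                 0.060, 0.074, 0.205, 0.212, 0.360]
flags = benjamini_hochberg(gene_p_values, fdr=0.05)
print(sum(flags), "of", len(flags), "genes flagged at FDR 5%")
```

The step-up rule compares the k-th smallest p-value with k/m times the target FDR, so the threshold loosens as more small p-values accumulate.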
We can conclude that the murine B-cell data show lower reproducibility, sensitivity and specificity. Therefore, it is not clear whether a further statistical test procedure
can successfully detect true differences among the five consecutive stages, especially for small pre-BII cells. This is mainly due to one outlying chip (chip 25), as shown in
Figure 3. Therefore, the analyst should check the experimental procedure and tissues used for this chip before any further statistical analysis.

Discussion

At the initial stage of microarray data analysis, exploratory data analysis (EDA) provides the first contact with the data. The techniques of EDA consist of a number of
informal steps such as checking the quality of the data, calculating simple summary statistics, and constructing appropriate graphs. The proposed method is a more formal way of checking quality than simple EDA plots. Thus, at an initial stage of microarray data analysis, the proposed method provides
useful information regarding the quality of microarray experiments. The correlation-based approaches check the treatment-wise quality, while the test based on the actual
intensity values checks the gene-wise quality.

The proposed method is quite effective in detecting outlying chips. It is much easier to apply than traditional methods of checking for outlying chips, such as principal
component analysis or the quality control plot [3].

There are some statistical issues to be taken into consideration, however. First, the log intensities may not have an approximately normal distribution. For simplicity,
we have assumed the normal distribution for testing all hypotheses. However, extensions to other distributional assumptions are certainly possible. For example,
distributions such as the log-normal and gamma can be easily handled. Second, we did not use a stringent criterion for identifying the concordant/discordant genes.
All these genes should be checked using an analysis such as SAM [9] or the t-test [10]
during a later stage of analysis. Third, the correlation coefficients derived from all possible pairs of chips may not be independent. We did not consider
these correlations in the current analysis. A more sophisticated approach based on the bootstrap, which accounts for possible correlations
among the correlation coefficients, is under development.

We would like to emphasize that the proposed method is an exploratory analysis. We believe the proposed method is practically useful, simple, and easy to implement, and that it
provides a more rigorous approach to a preliminary overview of the quality of microarray experiments. Most of the proposed methods are implemented in the software
arrayQCplot [11], which can be downloaded from Bioconductor (www.bioconductor.org).
Data 1 (101K, pdf)

Acknowledgments

The authors would like to thank the anonymous referees and the editor, whose comments were extremely helpful. This study was supported by the National Research Laboratory
Program of Korea Science and Engineering Foundation (M10500000126) and the Brain Korea 21 Project of the Ministry of Education.

Footnotes

Citation: Lee & Park, Bioinformation 1(10): 423-428 (2007)

References

1. Parmigiani G, et al. Clin Cancer Res. 2004;10:2922.
2. King C, et al. J Mol Diagn. 2005;7:57.
3. Park T, et al. Biotechniques. 2005;38:463.
4. Pavlidis P, Noble WS. Genome Biol. 2001;2:RESEARCH0042.
5. Jain N, et al. Bioinformatics. 2003;19:1945.
6. Storey JD, Tibshirani R. Proc Natl Acad Sci. 2003;100:9440.
7. Pounds S. Brief Bioinform. 2005;38:463.
8. Hoffmann R, et al. Genome Res. 2002;12:98.
9. Tusher VG, et al. Proc Natl Acad Sci. 2001;98:5116.
10. Choe SE, et al. Genome Biol. 2005;6:R16.
11. Lee EK, et al. Bioinformatics. 2006;22:2305.