Improving the prediction accuracy in classification using the combined data sets by ranks of gene expressions

1 Oral Cancer Research Institute, Yonsei University College of Dentistry, Seoul, 120-752, South Korea
2 Cancer Metastasis Research Center, Yonsei University College of Medicine, Seoul, 120-752, South Korea
3 National Biochip Research Center, Seoul, 120-752, South Korea
4 Brain Korea 21 Project for Medical Science, Yonsei University College of Medicine, Seoul, 120-752, South Korea
5 Yonsei Cancer Center, Yonsei University College of Medicine, Seoul, 120-752, South Korea
6 Department of Internal Medicine, Yonsei University College of Medicine, Seoul, 120-752, South Korea

Ki-Yeol Kim: kky1004/at/yuhs.ac; Dong Hyuk Ki: kdh1214/at/yuhs.ac; Hei-Cheul Jeung: jeunghc1123/at/yuhs.ac; Hyun Cheol Chung: unchung8/at/yuhs.ac; Sun Young Rha: rha7655/at/yuhs.ac

Received January 14, 2008; Accepted June 16, 2008.

Copyright © 2008 Kim et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Data sets generated under different experimental conditions may yield inconsistent information even when the experiments share the same research objectives. Moreover, even when the data sets are generated on the same platform, their agreement may be affected by technical variation among laboratories. In such cases, the data sets should be combined after adjusting for the differences between them, in order to detect more reliable information.
Results

The proposed method combines data sets after discretizing them based on the ranks of the gene expression ratios, and a statistical method is then applied to the combined data set for predictive gene selection. The efficiency of the proposed method was evaluated using five colon cancer related data sets, which were generated using cDNA microarrays with different RNA sources, except for one that was generated with oligonucleotide arrays. NCI-60 cell line data sets generated on two different platforms, cDNA microarrays and Affymetrix HU6800 oligonucleotide arrays, were also used. The data set combined by the proposed method predicted the test data sets more accurately than the separate data sets did. Biologically significant genes that were missed in the separate data sets were detected in the combined data set.

Conclusion

Because it transforms gene expressions into ranks, the proposed method is not influenced by systematic bias among chips or by the normalization method. The method may be especially useful for finding predictive genes in data sets that have different scales of gene expression.

Background

Data sets that are created for the same purpose in different laboratories have accumulated rapidly. The results are often inconsistent because of different platforms, sample preparations, or other technical variations. In this case, if a combined data set were analyzed after adjusting for the systematic biases that exist among data sets derived from different experimental conditions, the power of statistical tests would be improved by the increase in sample size.

When the results from different data sets are inconsistent, usually only the portions they have in common are adopted, for stability. This type of meta-analysis involves a set of classical statistical techniques [1] and has been applied to microarray data sets [2,3]. As a method to combine data sets, Lee et al.
[4] simply standardized gene expression ratios of human and mouse microarray data sets and combined these two data sets for comparative functional genomics. To analyze a combined data set of two different data sets, a transformation of gene expression was introduced [5]. This method transforms the gene expression ratios of the two data sets into the form of a reference experiment, where the reference experiment is created as the mean vector of all experiments. Consequently, this method does not consider the difference in gene expression patterns that exists between different experimental groups. To account for the variability that results from confounding factors such as different experimental conditions, an ANOVA (Analysis Of Variance) model was introduced [6]. It is a flexible method for considering gene expression ratios and other clinical variables together, although it does not create a large combined data set to which various analytical methods can be applied.

Gene expression ratios may sometimes include outliers as a result of imperfect experimental conditions, and these values can strongly influence the analysis and make the results unreliable. Using categorized values of the gene expression ratios can reduce the influence of outliers in this case and may improve the prediction accuracy in the classification of different experimental classes. Discrete values have the advantage of being more concise to represent and specify, easier to use, and conducive to improved predictive accuracy [7].

Discretization of gene expression levels has been reported [8]. The simplest discretization methods are the Equal Interval Width and Equal Frequency Intervals methods. Kerber [9] suggested the ChiMerge method, which begins by placing each observed value into its own interval and proceeds by using the χ2 test to determine when adjacent intervals should be merged.
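The two simplest schemes named above can be illustrated with a minimal Python sketch. The function names, the number of intervals, and the example ratios are hypothetical and chosen only for illustration; they are not taken from the cited work.

```python
from typing import List


def equal_width_bins(values: List[float], k: int) -> List[int]:
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values: List[float], k: int) -> List[int]:
    """Assign each value to one of k bins holding (nearly) equal counts.

    Tied values may be split across neighbouring bins in this simple version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for position, index in enumerate(order):
        bins[index] = position * k // len(values)
    return bins


# Made-up log ratios for one gene; 6.0 plays the role of an outlier.
ratios = [-2.0, -0.5, -0.4, 0.1, 0.2, 6.0]
print(equal_width_bins(ratios, 3))      # [0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(ratios, 3))  # [0, 0, 1, 1, 2, 2]
```

In this toy example the single outlier stretches the equal-width intervals so that all other values share one bin, whereas the equal-frequency bins depend only on the ordering of the values.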
A number of entropy-based methods have recently come to the forefront of work on discretization [10]. Fayyad and Irani [10] use a recursive entropy minimization heuristic for discretization and couple this method with the Minimum Description Length criterion [11] to control the number of intervals produced over the continuous space. In addition, a nonparametric scoring method has been applied to discretize gene expression ratios [12]; it usually transforms expression ratios based on their ranks within each experiment. In this case, some genes share the same rank, and the score can differ depending on how the tied values are ordered, so scoring requires more time as the number of samples increases.

In this study, gene expression ratios were transformed into their ranks within each data set. Next, the transformed data sets were combined, and a nonparametric statistical method was applied to the combined data set to detect informative genes with high prediction accuracy. The performance of the proposed method was evaluated using data sets derived from different platforms and different RNA sources.

Results

A. The necessity of combining data sets

The relationship between the number of genes and OOB (Out of Bag) error rates was investigated using data A, data B and data AB, which represent the data sets with total RNA, amplified RNA, and the combined data, respectively. For each gene set size, the OOB error rate was calculated for 500 random selections of genes of that size and then averaged. The OOB error rates decreased as the number of informative genes increased.

[Figure 1A]
B. Improvement of prediction accuracy using combined data sets by the proposed method

The prediction accuracies were compared using two original colon cancer data sets, generated with different RNA sources, as training data sets. While the prediction accuracy for data B using data A as the training data set was higher than 95%, the accuracy for data A using data B was lower than 80% (Figure 2A).
C. Comparison of the prediction accuracy with the Minimal Entropy (ME) method

The Tumor 211 and Tumor 86 data sets were predicted more accurately with training data A and data B, respectively. This indicated that the prediction accuracy of the ME method depends on the training data set.

[Figure 3] [Figure 4]
D. Description of significant genes selected from a combined data set by the proposed method

The descriptions of six discriminative genes that were selected from the combined data set, but not from the two separate data sets, are summarized in Table 1.
AA485151 was upregulated by over five-fold in colorectal adenocarcinoma [14]. AA425217 was reported as a significant gene in colorectal cancer [15], and 16q22.1, where AA425217 is located, is a region that includes CDH1, which encodes a cell-cell adhesion protein and is expressed in gastric cancer and lobular breast cancer. AA464731 is known to be a downregulated gene in the SW620 cell line [16], a metastatic colorectal cancer cell line. It is also significantly overexpressed in pancreatic cell lines [17]. AA504130 is located on 13q12.3, similarly to BRCA2, which is known as a marker of breast and ovarian cancer. The mutated gene for retinoblastoma is located on chromosome 13q14 [18], on which AA504130 is also located. AA455925 is known as an E2F-1 regulated gene [19]. Xq26 is one of two common chromosomal deletion regions, Xq25 and Xq26 [20], and is known to contribute to the malignant progression of gastric epithelial progenitor (GEP) endocrine carcinomas. Colorectal cancer is thought to be more common in men than in women. Xq26 is known as one of the regions that contained multiple gains-of-function that were significantly more common in males than in females [21]. Since AW050510 is located at 17q25.3 and BIRC5 is at 17q25 and is known as a 'survivin expression colorectal cancer' [22,23], AW050510 is also expected to have characteristics similar to those of BIRC5.

The descriptions of 4 colorectal cancer related genes and 9 cancer related genes are summarized in Table 2. These genes were selected from a combined data set.
E. Improvement of prediction accuracies by combining data sets performed using different platforms

The prediction accuracies of combined data sets derived from different platforms were investigated. While the prediction accuracies of data A and data B on affy were low with a small number of genes, they increased as the number of genes increased. By combining data A and data B, the prediction accuracy on affy was improved, as shown in Figure 5A.
F. Comparison of prediction accuracies of the proposed method in different platforms

The proposed method was evaluated using an NCI 60 cell line data set, and the prediction accuracies of the proposed method and the ME method were compared. The prediction accuracies of both methods were compared as the number of genes was increased to 300, and they improved as the number of genes increased, regardless of the training and test data sets (Figure 6A).
When the Oligo data set was used for training, the variation in accuracies was relatively small and the prediction accuracy was high. This indicated that the Oligo data set predicted the cDNA data set more stably and accurately than vice versa. There was little improvement in prediction beyond 20 or 30 genes, which showed that a small informative gene set is sufficient for discrimination. It was also confirmed that the prediction accuracies of the proposed method were robust to the choice of training data set, while those of the ME method depended on the training data set, with a significant difference between them (Figure 6B).

Discussion

The designed 25-mer oligochips from Affymetrix provide an absolute value of expression in an RNA sample, while cDNA microarrays perform a two-color competitive hybridization that gives the transcript expression in two samples. Long oligonucleotide platforms (typically 60- to 80-mers) also use hybridization, and on this platform relative measurements resulted in higher precision than absolute measurements [3]. Therefore, some experimental biases can exist as a result of the difference between absolute measurements and ratios. Additionally, some previous studies indicated that data sets from different microarray platforms should not be combined straightforwardly [24-27].

However, even when the data sets are generated on the same platform, the lab effect, especially when compounded with the RNA sample effect, plays a bigger role than the platform effect in data agreement [28]. Inter-study biases also exist among microarray data sets tested with different RNA sources, even when they come from the same laboratory and platform. Previous studies showed that results differed between data sets tested with different RNA sources, and that the sensitivity to detect differential gene expression from a microarray data set using amplified RNA differed from that using total RNA [29,30].
One approach to combining these different types of data sets is to use abstractions of the expression values, such as ranks or discretized values [9-12]. These methods reduce the variability in expression values across different microarray data sets. While discretization may cause a slight loss of information, it is robust against outliers, fast, and simple to understand.

In the colon cancer data sets derived from cDNA microarrays, the data set created with total RNA predicted the data set created with amplified RNA more accurately than vice versa (Figure 2A). In the colon cancer data set derived from oligonucleotide arrays, the prediction accuracies were improved by combination with the cDNA data sets. Although two data sets derived from different experimental conditions have different scales of gene expression, this difference in scale could be compensated for by discretizing the gene expressions. Therefore, no transformation was required to match these two types of data sets other than the ranking of gene expressions.

In the NCI 60 cell line data sets from two different platforms, the two types of data sets were used alternately as training and test data sets. The prediction accuracies in data sets transformed by the ME method differed greatly according to the training set, while those obtained by the proposed method remained stable. The Oligo data set predicted the cDNA data set more stably and accurately than vice versa. While the prediction accuracies of the ME method depended on the training and test data sets, with a significant difference between them, those of the proposed method were more robust to the choice of training data (Figure 6B).

In this study, we transformed microarray data using ranks of gene expressions to combine data sets created in different experimental conditions.
The proposed method may be especially useful for finding discriminative genes in data sets that have different scales of gene expression ratios.

Methods

Data set

The data sets used in this study are summarized in Table 3.
Two cDNA microarray data sets, data A and data B, generated from 154 colorectal tissues (82 tumor and 72 normal), were used as training data sets for evaluation of the proposed method. These two cDNA microarray data sets derive from different RNA sources, namely total RNA and amplified RNA. Previous studies have concluded that results differ between these two types of data sets, and that the sensitivity to detect differential gene expression from microarray data sets using amplified RNA differs from that using total RNA [29,30]. Unsupervised hierarchical cluster analysis also confirmed that systematic biases existed between these two data sets [31].

Two more cDNA data sets, Tumor 86 and Tumor 211, were generated with amplified RNA in different batches and were used as test data sets. They included only colon tumor tissues. These colon cancer data sets, generated with cDNA microarrays, were from the Cancer Metastasis Research Center of Yonsei University, Seoul, Korea.

One more colon cancer data set was used, which was generated with the Human 6800 Gene Chip Set (Affymetrix). It was obtained from the microarray database of Princeton University [32] and included experiments with adenomas and their paired normal tissue [33].

To evaluate the performance of the proposed method on different platforms, NCI 60 cell line data sets derived from different platforms were also used. Gene expression data sets for NCI-60 using 9,706-clone cDNA microarrays and 6,810-gene Affymetrix HU6800 oligonucleotide arrays were obtained separately from the additional files of Lee et al. [25], and the 2,344 UniGene clusters common to both were used for this study. Among the nine tumor types, ovarian and colon cancer cell lines were used for this study; these two groups included six and seven replicates, respectively.
Transformation method of gene expression ratios

Data preprocessing

Gene expression ratios were normalized so that they would have similar distributions across a series of arrays; the normalization was carried out using the 'limma' library in R [34]. The cDNA data in the NCI 60 cell line data sets included missing entries, and these were estimated using the SeqKnn (Sequential k nearest neighbor) imputation method [35] before analysis.

Discretization by proposed method using rank of gene expression

To transform a data set, the gene expression ratios of each gene are arranged in order, and the ranks are matched with the corresponding experimental groups. If the experimental groups are homogeneous, the ranks within the same experimental group will be neighboring. This process is similar to the first step of the nonparametric Mann-Whitney U test. The discretization of gene expressions is summarized in the following steps:

(1) Rank the gene expression ratios within a gene for each data set.
(2) List in order of the ranks and assign the order of gene expressions to the corresponding experimental groups.
(3) Summarize the result of (2) in the form of a contingency table for each gene.
(4) Test the relationship between the gene expression patterns and experimental groups for each gene.

When there are three data sets to be combined, the data sets can be added entry by entry, as shown in Table 4, after each data set has been transformed by rank.
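Steps (1) to (3) and the entry-wise combination can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The layout of the contingency table is an assumption of the sketch: the samples are listed in rank order, the lowest n ranks (n being the size of the first group) form a "low" category and the remaining ranks form a "high" category. The names `rank_table` and `combine` and the example ratios are hypothetical.

```python
from typing import List, Sequence

# rows: rank category (low, high); columns: group (first, second)
Table = List[List[int]]


def rank_table(ratios: Sequence[float], groups: Sequence[str], first: str) -> Table:
    """Steps (1)-(3) for one gene in one data set."""
    # (1) Rank the samples by expression ratio (ties keep input order here).
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    # (2) List the group labels in rank order.
    labels = [groups[i] for i in order]
    # (3) Assumed split: the lowest n_first ranks form the "low" category.
    n_first = sum(1 for g in groups if g == first)
    table = [[0, 0], [0, 0]]
    for position, label in enumerate(labels):
        row = 0 if position < n_first else 1
        col = 0 if label == first else 1
        table[row][col] += 1
    return table


def combine(tables: Sequence[Table]) -> Table:
    """Add the per-data-set tables entry by entry."""
    return [[sum(t[r][c] for t in tables) for c in range(2)] for r in range(2)]


# Made-up ratios for one gene measured in two data sets on different scales.
data_a = rank_table([-1.2, -0.8, -0.9, 0.7, 1.1, 0.9],
                    ["N", "N", "N", "T", "T", "T"], first="N")
data_b = rank_table([10.0, 35.0, 12.0, 40.0, 55.0],
                    ["N", "T", "N", "N", "T"], first="N")
combined = combine([data_a, data_b])
print(data_a, data_b, combined)
```

Because only the ordering within each data set enters the table, the very different scales of the two made-up data sets do not affect the combined counts.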
Discretization of expression ratios using recursive minimal entropy

A method for discretizing continuous attributes based on a minimal entropy (ME) heuristic, presented by Catlett [36] and Fayyad and Irani [10], was compared with the proposed method in the experimental study. The algorithm uses the class information entropy of candidate partitions to select binary boundaries for discretization. Given a set of instances S, a feature A, and a partition boundary T, the class information entropy of the partition induced by T, denoted E(A, T, S), is given by:

$$E(A, T, S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$

where S_1 and S_2 are the two subsets of S that lie on either side of T, and Ent(.) is the class entropy of a subset.

For a given feature A, the boundary Tmin that minimizes the entropy function over all possible partition boundaries is selected as a binary discretization boundary. This method can be applied recursively to both of the partitions induced by Tmin until some stopping condition is reached, thus creating multiple intervals on feature A. The function must be evaluated N-1 times for each attribute, where N is the number of attribute values [37]. The library 'dprep' in R [34] was used for this method.

Nonparametric method for significant gene selection

After the gene expression ratios were summarized in the form of a contingency table for each gene, as shown in Table 5, a nonparametric statistical method was applied to the data sets to test the independence of gene expression patterns and experimental groups. The test statistics are calculated as follows for each gene:
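Assuming that the test is the Pearson chi-square test of independence, as the reference to the Chi-square test in the text suggests, the statistic for one gene has the standard form χ2 = Σ (O − E)2 / E, summed over the cells of the contingency table, where O is the observed count and E is the count expected under independence. The Python sketch below computes this statistic and, for a 2 x 2 table, a two-sided Fisher's exact p-value. The function names and the example table are hypothetical.

```python
from math import comb


def chi_square_statistic(table):
    """Pearson chi-square statistic sum((O - E)^2 / E) for an r x c table.

    Assumes every row and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2 x 2 table."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


gene_table = [[5, 1], [1, 4]]  # made-up counts: rank category x group
stat = chi_square_statistic(gene_table)
p_fisher = fisher_exact_two_sided(gene_table)
print(round(stat, 4), round(p_fisher, 4))  # 4.4122 0.0801
```

For a 2 x 2 table the statistic has one degree of freedom, so it is compared with a critical value of about 3.84 at the 0.05 level. In this small made-up table the chi-square statistic exceeds that value while the exact p-value is above 0.05.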
When the sample size for each experiment is small, generally less than five, Fisher's exact test is recommended rather than the Chi-square test.

Classification method to evaluate the informative gene set selected from the combined data set

To evaluate the predictive accuracy of the selected significant gene set, the Random Forest (RF) method [38], which resamples while allowing repetition, was used. The RF program in R [34] was used, and it works in the following steps:

(1) Generate n data sets of bootstrap samples {B1, B2, ..., Bn} by allowing repetition.
(2) Use a Bk to build a tree classifier Tk, and classify Bms (m≠k) data (out-of-bag (OOB) samples).
(3) Calculate the classification errors of the Bms and average them to obtain the overall classification error (OOB error).
(4) Calculate the prediction accuracy of test data sets using the classifier built in (2).

Abbreviations

ANOVA: Analysis Of Variance; OOB error: Out Of Bag error; ME: Minimal Entropy.

Authors' contributions

KYK participated in the design of algorithms, performed statistical analysis and drafted the manuscript. DHK performed the microarray experiments. HCJ participated in getting the consent from the patients and obtained the clinical data. HCC participated in the study design and data interpretation. SYR conceived of the study, participated in its design and coordination, and finalized the manuscript.

Acknowledgements

This study was supported by a grant of the Korea Health 21 R&D Project, Ministry of Health & Welfare (0405-BC01-0604-0002), and a Korea Research Foundation Grant funded by the Korean Government (KRF-2005-005-J05904). We thank the members of the National Biochip Research Center, Yonsei University, and the Genomic Tree Incorporation, Korea for the current project.