Copyright © 2009 The Author(s)

The tspair package for finding top scoring pair classifiers in R

Department of Oncology, Johns Hopkins School of Medicine, Baltimore, MD 21287, USA

Associate Editor: Joaquin Dopazo

Received January 11, 2009; Revised February 16, 2009; Accepted March 1, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Summary: Top scoring pairs (TSPs) are pairs of genes whose relative rankings can be used to accurately classify individuals into one of two classes. TSPs have two main advantages over many standard classifiers used in gene expression studies: (i) a TSP is based on only two genes, which leads to easily interpretable and inexpensive diagnostic tests and (ii) TSP classifiers are based on gene rankings, so they are more robust to variation in technical factors or normalization than classifiers based on expression levels of individual genes. Here I describe the R package, tspair, which can be used to quickly identify and assess TSP classifiers for gene expression data.

Availability: The R package tspair is freely available from Bioconductor: http://www.bioconductor.org

Contact: jtleek/at/jhu.edu

1 INTRODUCTION

Classification of patients into disease groups or subtypes is the most direct way to translate microarray technology into a clinically useful tool (Quackenbush, 2006). A small number of tests based on microarrays have even been approved for clinical use, for example, for diagnosing breast cancer subtypes (Ma et al., 2004; Marchionni et al., 2008; Paik et al., 2004; van't Veer et al., 2002). But standard microarray classifiers are based on complicated functions of many gene expression measurements.
This type of classifier is both hard to interpret and depends critically on the platform, pre-processing and normalization steps to be effective (Quackenbush, 2006). Identifying biologically interpretable, robust and cheap classifiers based on small subsets of genes would greatly speed progress in the development of clinical tests from microarray experiments. Top scoring pairs (TSPs) are pairs of genes that accurately classify patients into clinically relevant groups based on their ranks (Geman et al., 2004; Tan et al., 2005; Xu et al., 2005). The basic idea is to search among all pairs of genes, and look for genes whose ranking most consistently switches between two groups. To understand how the classification scheme works, consider the simulated gene expression data in Figure 1
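The pair search described above can be sketched in a few lines. The following is a minimal Python illustration, not the tspair implementation: it assumes the usual TSP score from Geman et al. (2004), the absolute difference between the two classes in the frequency with which gene i exceeds gene j. The function names, the 0/1 class coding and the toy data are all hypothetical.

```python
from itertools import combinations

def pair_score(x_i, x_j, labels):
    """Return |P(x_i > x_j | class 0) - P(x_i > x_j | class 1)| and both frequencies."""
    freq = []
    for cls in (0, 1):
        idx = [k for k, g in enumerate(labels) if g == cls]
        freq.append(sum(x_i[k] > x_j[k] for k in idx) / len(idx))
    return abs(freq[0] - freq[1]), freq

def top_scoring_pair(expr, labels):
    """Exhaustive search over all gene pairs; expr[g][k] is gene g in sample k."""
    best = None
    for i, j in combinations(range(len(expr)), 2):
        score, freq = pair_score(expr[i], expr[j], labels)
        if best is None or score > best[0]:
            best = (score, i, j, freq)
    return best

def predict(expr_new, i, j, freq):
    """Assign each sample to the class in which its observed ordering is more common."""
    high = 0 if freq[0] > freq[1] else 1  # class where gene i > gene j more often
    return [high if a > b else 1 - high for a, b in zip(expr_new[i], expr_new[j])]

# Toy data (hypothetical): 3 genes x 6 samples; the first three samples are class 0.
expr = [
    [5.1, 4.8, 5.5, 2.0, 2.2, 1.9],  # gene 0: high in class 0, low in class 1
    [3.0, 3.1, 2.9, 3.2, 3.0, 3.1],  # gene 1: roughly constant
    [1.0, 4.0, 2.0, 3.5, 0.5, 2.5],  # gene 2: noise
]
labels = [0, 0, 0, 1, 1, 1]

score, i, j, freq = top_scoring_pair(expr, labels)
print(score, i, j)
print(predict(expr, i, j, freq))
```

Because only the within-sample ordering of two genes enters the score, any monotone per-sample transformation of the data leaves the result unchanged, which is the robustness property the text attributes to rank-based classifiers.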
The TSP approach has been successfully applied to identify subtypes of sarcoma, resulting in an RT-PCR-based test that correctly classified 20 independent tumors with perfect accuracy (Price et al., 2007). This early success suggests that it may be possible to identify TSP classifiers for other important diseases and quickly develop new inexpensive diagnostic tests.

2 THE TSPAIR PACKAGE

Calculating the TSP for a gene expression dataset is relatively straightforward, but computationally intensive. I have developed an R package tspair that can rapidly calculate the TSP for typical gene expression datasets, with tens of thousands of genes. The TSP can be calculated either in R or with an external C function, which allows both for rapid calculation and flexible development of the tspair package. The tspair package includes functions for calculating the statistical significance of a TSP by permutation test, and is fully compatible with Bioconductor expression sets. The R package is freely available from the Bioconductor web site (www.bioconductor.org).

3 AN EXAMPLE SESSION

Here I present an example session on a simple simulated dataset included in the tspair package. I calculate the TSP, assess the strength of evidence for the classifier with a permutation test, plot the output and show how to predict outcomes for a new dataset.

The main function in the tspair package is tspcalc(). This function accepts either (i) a gene expression matrix or an expression set and a group indicator vector, or (ii) an expression set object and a column number, indicating which column of the annotation data to use as the group indicator. The result is a tsp object which gives the TSP score, indices, gene expression data and group labels for the TSP. If there are multiple pairs that achieve the top score, then the tie-breaking score developed by Tan et al. (2005) is reported.
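The tie-breaking step can be illustrated with a short Python sketch. It assumes the secondary score of Tan et al. (2005) takes the form of the absolute difference between the two classes in the mean within-sample rank difference of the two genes; this is not code from tspair, and the function names, the rank convention (1 = lowest expression, ties broken by input order) and the toy data are hypothetical.

```python
def column_ranks(expr):
    """Rank genes within each sample (1 = lowest expression); ties keep input order."""
    n_genes, n_samples = len(expr), len(expr[0])
    ranks = [[0] * n_samples for _ in range(n_genes)]
    for k in range(n_samples):
        order = sorted(range(n_genes), key=lambda g: expr[g][k])
        for r, g in enumerate(order, start=1):
            ranks[g][k] = r
    return ranks

def tie_break_score(ranks, i, j, labels):
    """|mean rank difference in class 0 - mean rank difference in class 1| for genes i, j."""
    means = []
    for cls in (0, 1):
        idx = [k for k, g in enumerate(labels) if g == cls]
        means.append(sum(ranks[i][k] - ranks[j][k] for k in idx) / len(idx))
    return abs(means[0] - means[1])

# Toy data (hypothetical): pairs (0, 1) and (2, 3) both switch order perfectly
# between the classes, but pair (0, 1) is separated by a much larger rank gap.
expr = [
    [9, 8, 1, 2],
    [1, 2, 9, 8],
    [5, 6, 4, 3],
    [4, 5, 5, 4],
]
labels = [0, 0, 1, 1]

ranks = column_ranks(expr)
print(tie_break_score(ranks, 0, 1, labels))
print(tie_break_score(ranks, 2, 3, labels))
```

Under this assumed form, the pair whose ranks move furthest apart between the classes wins the tie, so the reported classifier favors the more decisive reversal.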
The function tspsig() can be used to calculate the significance of a TSP classifier by permutation as described in Geman et al. (2004). The class labels are permuted, a new TSP is calculated for each permutation, and the null scores are compared with the observed TSP score to calculate a P-value. Since the maximum score is calculated for each null permutation, tspsig() performs a test of the null hypothesis that no TSP classifier is better than random chance.

Once a TSP has been calculated, the tspplot() function can be used to visualize the classifier. The resulting TSP figure (Fig. 2
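The permutation test described for tspsig() follows a simple recipe that can be sketched in Python. This is an illustration of the general technique rather than the package's code: the function names, the simulated data and the add-one P-value convention, (exceedances + 1) / (permutations + 1), are assumptions.

```python
import random
from itertools import combinations

def max_pair_score(expr, labels):
    """Largest |P(x_i > x_j | class 0) - P(x_i > x_j | class 1)| over all gene pairs."""
    groups = [[k for k, g in enumerate(labels) if g == c] for c in (0, 1)]
    best = 0.0
    for x_i, x_j in combinations(expr, 2):
        f0, f1 = (sum(x_i[k] > x_j[k] for k in idx) / len(idx) for idx in groups)
        best = max(best, abs(f0 - f1))
    return best

def permutation_p_value(expr, labels, n_perm=200, seed=0):
    """Compare the observed top score with top scores under shuffled class labels."""
    rng = random.Random(seed)
    observed = max_pair_score(expr, labels)
    null_scores = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # shuffling preserves the class sizes
        null_scores.append(max_pair_score(expr, shuffled))
    exceed = sum(s >= observed for s in null_scores)
    # Add-one convention (an assumption) so the P-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1), observed, null_scores

# Simulated data (hypothetical): one planted pair plus ten pure-noise genes.
rng = random.Random(1)
labels = [0] * 8 + [1] * 8
expr = [[10.0] * 8 + [0.0] * 8, [5.0] * 16]
expr += [[rng.gauss(0, 1) for _ in labels] for _ in range(10)]

p, observed, null_scores = permutation_p_value(expr, labels, n_perm=100, seed=42)
print(p, observed)
```

Recomputing the maximum over all pairs inside every permutation is what makes this a test of the global null: the observed top score is compared with the best score that chance alone can produce, not with the null distribution of a single pre-chosen pair.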
A major advantage of the TSP approach is that predictions are very simple and can be easily calculated either by hand or using the built-in functionality of the tspair package. In this example, the expression value for ‘Gene5’ is greater than the expression value for ‘Gene338’ much more often for the diseased patients. In a new dataset, when the expression for ‘Gene5’ is greater than the expression for ‘Gene338’ I predict that the patient will be diseased.

The tspair package can be used to predict the outcomes of new samples based on new expression data. The new data can take the form of a new expression matrix, or an expression set object. The R function predict() searches for the TSP gene names from the original tspcalc() function call, and based on the row names or featureNames of the new dataset identifies the genes to use for prediction. If multiple TSPs are reported, the default is to predict with the TSP achieving the top tie-breaking score (Tan et al., 2005), but the user may also elect to use a different TSP for prediction.

In this example, the predict() function finds the genes with labels ‘Gene5’ and ‘Gene338’ in the second dataset and calculates the TSP predictions based on the values of these two genes. The new data matrix need not come from a microarray; it could just as easily be the result of RT-PCR or any other expression assay, imported into R as a tab-delimited text file.

ACKNOWLEDGEMENTS

The author acknowledges the useful discussions with Giovanni Parmigiani, Leslie Cope, Dan Naiman and Don Geman.

Funding: National Science Foundation (DMS034211); National Institutes of Health (1UL1RR025005-01).

Conflict of Interest: none declared.

REFERENCES
N Engl J Med. 2006 Jun 8; 354(23):2463-72.
Cancer Cell. 2004 Jun; 5(6):607-16.
N Engl J Med. 2004 Dec 30; 351(27):2817-26.
Nature. 2002 Jan 31; 415(6871):530-6.
Bioinformatics. 2005 Oct 15; 21(20):3896-904.
Bioinformatics. 2005 Oct 15; 21(20):3905-11.
Proc Natl Acad Sci U S A. 2007 Feb 27; 104(9):3414-9.