NIH Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
Cancer Res. Author manuscript; available in PMC Feb 15, 2010.
Published in final edited form as:
PMCID: PMC2763410
NIHMSID: NIHMS127672

Cancer-specific High-throughput Annotation of Somatic Mutations: computational prediction of driver missense mutations

Abstract

Large-scale sequencing of cancer genomes has uncovered thousands of DNA alterations, but the functional relevance of the majority of these mutations to tumorigenesis is unknown. We have developed a computational method, called CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations), to identify and prioritize those missense mutations most likely to generate functional changes that enhance tumor cell proliferation. The method has high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations (area under ROC curve > 0.91, area under Precision-Recall curve > 0.79). CHASM substantially outperformed previously described missense mutation function prediction methods at discriminating known oncogenic mutations in TP53 and the tyrosine kinase EGFR. We applied the method to 607 missense mutations found in a recent glioblastoma multiforme (GBM) sequencing study. Based on a model that assumed the GBM mutations are a mixture of drivers and passengers, we estimate that 8% of these mutations are drivers, causally contributing to tumorigenesis.

Keywords: cancer drivers, CHASM, missense mutations, random forest, somatic mutations

Introduction

Today we face a bottleneck between large-scale acquisition of genomic information discovered through medical resequencing projects and the application of this information to improved understanding of human disease. Projects to systematically resequence tumor genomes have discovered thousands of genes that were not previously linked to tumorigenesis but are somatically mutated in a relatively small fraction of tumors and may be important for tumor initiation or progression (1–6). Many of these somatic changes are likely to be “passengers” (1) that have no functional effects but were already present in the cell that gave rise to the tumor or were acquired during subsequent tumor growth. Only a small fraction of the genetic alterations in a tumor are expected to drive tumor evolution by giving cells a selective advantage over their neighbors.

Determining which mutations are drivers and which are passengers is one of the most pressing challenges in cancer genetics. Though genes that are mutated very frequently (“mountains”) can be confidently classified as driver genes, most genes discovered so far are mutated in a relatively small fraction of tumors (“hills”). The examination of large numbers of tumors can provide helpful information for classification of drivers vs. passengers, but the ability of sequencing alone to provide definitive results is limited by the marked variation in mutation frequency among individual tumors and individual genes. Moreover, it has been clearly shown that genes that are mutated in only a small fraction (<1%) of tumors can still act as drivers (6). Thus, methods that can classify mutations as either drivers or passengers on the basis of data that is independent of mutation frequency are clearly needed. Such methods include functional studies in model organisms or in cultured cells, using gene knock-out, siRNA, or overexpression approaches. These methods are extraordinarily useful for elucidating the function of individual mutated genes but are not well suited to the analysis of the hundreds of gene candidates that arise from every large scale cancer genome project.

Here we describe a novel high-throughput computational prediction method to identify the mutations most likely to be drivers. We chose to focus on missense mutations as they account for the majority of somatic mutations found in the exons of tumor-derived DNA (6), and because their functional significance is more difficult to infer than that of nonsense or frameshift mutations.

Previous work in this area has resulted in several innovative ways to characterize the differences between driver and passenger missense mutations. Driver mutations may have characteristics similar to those causing Mendelian disease when inherited in the germ-line (7) and may be identifiable by constraints on tolerated amino acid residues at the mutated positions (3, 7–9). In contrast, passenger mutations may have characteristics more similar to those of non-synonymous single nucleotide polymorphisms (nsSNPs) with high minor allele frequencies (MAFs) (3, 7). Based on these similarities, supervised machine learning methods have been used to predict which missense mutations are drivers (3, 7). The CanPredict method trains a Random Forest (10) to discriminate between mutations from the COSMIC cancer somatic mutation database (11) and nsSNPs with high MAFs (3). A method specific to protein kinases (7) trains a support vector machine (12) to discriminate between known disease kinase nsSNPs and common kinase nsSNPs. While not specifically designed for this problem, bioinformatics methods such as PolyPhen and SIFT (9, 13) have also been applied to identify pathogenic, tumor-derived mutations in genes of interest (6). These methods attempt to discriminate driver from passenger mutations by considering properties such as evolutionary conservation, compatibility of the mutant amino acid residue with the wild type or with equivalently positioned residues in homologous proteins, the predicted protein local environment (7), and enrichment of the protein structural domain in which mutations occur with respect to biological processes thought to be critical for cancer (3).

We hypothesized that although existing computational methods could detect differences between somatic missense mutations observed in cancers and high MAF nsSNPs in the germline, these differences might be less relevant to the discrimination between driver and passenger mutations that occur somatically in tumors. While high MAF nsSNPs and passenger mutations have properties in common, they also have differences. Passenger mutations may or may not have a functional impact on proteins; by definition, they are neutral with respect to cancer cell fitness. In contrast, high MAF nsSNPs have become fixed in the human genome and must be functionally neutral or have a mild functional impact with respect to normal cell fitness. We reasoned that we could train a classifier with improved specificity by representing passenger missense mutations not by high MAF nsSNPs, as done previously, but rather by in silico simulations using mutation profiles that reflected tumor type as well as mutation context.

Materials and Methods

Feature Selection

We used a Random Forest classifier (10, 14) that was trained on 49 predictive features (Supplementary Table 1). Feature selection was done with a protocol based on mutual information (Supplementary Materials: Feature Selection and Information Theory, Supplementary Figure 1). Mutual information is a generalized version of correlation that does not make assumptions about linear relationships between two variables of interest (15). Missing feature values were estimated with a k-nearest neighbors algorithm (Supplementary Materials: Missing Values).
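
As an illustration of the quantity involved, the following is a minimal sketch of a plug-in estimate of mutual information (in bits) between one discretized feature and the driver/passenger labels. The function name and the assumption that features are already binned are ours; the actual selection protocol is described in the Supplementary Materials and may differ.

```python
import math
from collections import Counter

def mutual_information(feature_bins, labels):
    """Plug-in estimate of I(X;Y) in bits for a discretized feature X and labels Y."""
    n = len(labels)
    joint = Counter(zip(feature_bins, labels))  # counts of (x, y) pairs
    px = Counter(feature_bins)                  # marginal counts of x
    py = Counter(labels)                        # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (px[x] * py[y]))
    return mi

# A feature identical to a balanced binary label carries 1 bit;
# a feature independent of the label carries 0 bits.
informative = mutual_information([0, 0, 1, 1], ["driver", "driver", "passenger", "passenger"])
uninformative = mutual_information([0, 0, 1, 1], ["driver", "passenger", "driver", "passenger"])
```

Features could then be ranked by this value, which is one plausible reading of a mutual-information-based selection step.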

Driver Mutation Dataset

We selected 2488 missense mutations previously identified as playing a functional role in oncogenic transformation from breast, colorectal, and pancreatic tumor resequencing studies (2, 4–6) and the COSMIC database (11).

Synthetically Generated Passenger Mutation Dataset

The synthetic passenger mutations were generated by sampling from eight multinomial distributions that depend on di-nucleotide context and tumor type (Supplementary Materials: Synthetically Generated Mutations, Supplementary Figure 2, Supplementary Table 2).
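
The idea of context-dependent sampling can be sketched as follows. This is only an illustration: the table of probabilities, the context names, and the function name are hypothetical placeholders, and the real distributions are those given in the Supplementary Materials.

```python
import random

# HYPOTHETICAL substitution probabilities keyed by (context, reference base).
# These numbers are placeholders for illustration, not values from the paper.
SUBSTITUTION_PROBS = {
    ("CpG", "C"): {"T": 0.8, "A": 0.1, "G": 0.1},
    ("other", "C"): {"T": 0.5, "A": 0.3, "G": 0.2},
}

def sample_substitution(context, ref_base, rng):
    """Draw one mutant base from the multinomial for this context and reference base."""
    probs = SUBSTITUTION_PROBS[(context, ref_base)]
    bases = list(probs)
    weights = [probs[b] for b in bases]
    return rng.choices(bases, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [sample_substitution("CpG", "C", rng) for _ in range(1000)]
```

Each sampled base change would then be kept only if it produces a missense change in the chosen gene, a filtering step not shown here.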

Classifier Training

The CHASM method is based on a Random Forest classifier (10, 14) trained to discriminate between driver missense mutations and synthetically generated passenger missense mutations. The classifier is implemented using PARF (http://www.irb.hr/en/cir/projects/info/parf/), a Fortran 95 adaptation of Leo Breiman’s original Random Forest software (http://www.math.usu.edu/~adele/forests/cc_home.htm). Prior to training, all features were standardized with the Z-score method using the scale command in R statistical software (16). To avoid overfitting, we divided our known driver mutations and synthetic passenger mutations into two partitions, one for feature selection and one for classifier training.
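
For readers unfamiliar with Z-score standardization, a minimal Python sketch of what R's scale command does to a single feature column is shown below. The function name is ours; like R, it uses the sample standard deviation (n-1 denominator).

```python
import statistics

def zscore(values):
    """Center a feature column on its mean and divide by its sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # n-1 denominator, matching R's scale()
    return [(v - mean) / sd for v in values]

standardized = zscore([1.0, 2.0, 3.0])  # mean 2, sample SD 1
```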

This Random Forest is an ensemble of “decision trees”, specifically classification and regression trees (17), each of which uses a hierarchical set of rules to decide whether a mutation is a driver or a passenger. The rules are based on our input features and the final score yielded for each mutation is the fraction of trees that voted for the passenger class. We used a forest with 500 trees, and default parameters (mtry=7). The Random Forest algorithm is robust to class label contamination and performs well with high dimensional data sets (10, 14).
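
The scoring rule can be illustrated with a toy sketch. The three one-rule "trees" and the feature names below are hypothetical stand-ins; real trees are learned from the training data and are much deeper.

```python
def passenger_vote_fraction(trees, features):
    """Return the fraction of trees that vote 'passenger' for one mutation."""
    votes = [tree(features) for tree in trees]
    return votes.count("passenger") / len(votes)

# HYPOTHETICAL hand-written rules, for illustration only.
toy_trees = [
    lambda f: "driver" if f["conservation"] > 0.8 else "passenger",
    lambda f: "driver" if f["snp_density"] < 0.01 else "passenger",
    lambda f: "driver" if f["cosmic_freq"] > 0.05 else "passenger",
]

mutation = {"conservation": 0.95, "snp_density": 0.002, "cosmic_freq": 0.01}
score = passenger_vote_fraction(toy_trees, mutation)  # two driver votes, one passenger vote
```

Because the score counts passenger votes, low scores are driver-like and high scores are passenger-like.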

Classifier Assessment

We assessed Random Forest classifier performance by two threshold-independent measures – ROC and Precision-recall curves (Supplementary Materials: ROC and PR Curves and Minimum Error Point). We considered both the training set out-of-bag error (10) and the error on two held-out validation sets of known oncogenic mutations in TP53 and EGFR. The out-of-bag error estimate is produced while the Random Forest is being trained and is a viable replacement for error estimates by cross-validation (18). We compared the Random Forest with a support vector machine classifier (12) (assessed with five-fold cross-validation) (Supplementary Materials: Support Vector Machine) and with the performance of several state-of-the-art missense mutation function prediction methods.

Probabilistic interpretation of random forest classification scores in tumor-derived GBM mutations

We used the trained Random Forest to compute a classification score for each of 607 GBM missense mutations reported in (4). However, these scores are not probabilities and the statistical behavior of the algorithm has not been well characterized (10). Therefore, it is not evident where to set a trusted score cutoff for purposes of identifying driver mutations. To do this, we first interpret the scores in the framework of statistical hypothesis testing. For each of the 607 GBM mutants, we test the null hypothesis: the mutant is not functionally related to the growth of the tumor (passenger), versus the alternative hypothesis that it is (driver). We obtain a p-value for a mutation by comparing its score to the null distribution, which consists of the scores of a filtered set of synthetic passengers that were held out from Random Forest training (Supplementary Materials: Filtering of Synthetically Generated Passenger Mutations), using the Benjamini-Hochberg algorithm to correct for multiple testing (19) (Supplementary Materials: Controlling the False Discovery Rate).
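
A minimal sketch of this two-step procedure follows, assuming that low scores are driver-like (the score is the fraction of passenger votes). The function names and the add-one correction in the empirical p-value are our assumptions; the exact procedure is in the Supplementary Materials.

```python
from bisect import bisect_right

def empirical_p_values(scores, null_scores):
    """p-value = share of null (passenger) scores at or below the observed score."""
    null_sorted = sorted(null_scores)
    n = len(null_sorted)
    # Add-one correction (an assumption) keeps p-values strictly above zero.
    return [(bisect_right(null_sorted, s) + 1) / (n + 1) for s in scores]

def benjamini_hochberg(p_values, fdr):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank  # largest rank passing the step-up criterion
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

p_vals = empirical_p_values([0.1, 0.9], null_scores=[0.2, 0.4, 0.6, 0.8])
calls = benjamini_hochberg([0.01, 0.04, 0.03, 0.5], fdr=0.2)
```

Mutations flagged True would be the predicted drivers at the chosen FDR level.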

GBM Mutations

We assessed 607 glioblastoma multiforme (GBM) mutations from 21 patient samples (4). Five of the mutations described in (4) were dropped because they occurred in gene transcripts that are no longer supported by the RefSeq database (20). Three mutations were dropped because they were found in gene transcripts that were larger than 14,000 codons. For gene transcripts of this size, we were unable to generate protein multiple sequence alignments because of their high computational expense. Finally, one of the GBM tumor samples was from a patient with a hypermutator phenotype who had been treated with radiation and temozolomide. Because this sample had 17 times as many alterations as the other GBM samples and a radically different mutation spectrum (4), these mutations were excluded from our analysis.

Estimation of fraction of drivers in GBM

We assumed that the GBM mutations are a mixture of drivers and passengers and wanted to estimate the proportion of drivers in the mixture. The probability distribution of the GBM CHASM scores should then be similar to the CHASM score distribution of a mixture of known driver and synthetic passenger mutations (21). We numerically find the mixing proportion which minimizes the distance between these two score distributions (Supplementary Materials: Estimating the Fraction of Drivers).
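
One way to picture this estimation is the sketch below, which compares binned score distributions and searches a grid of mixing proportions. The histogram binning, the L1 distance, the grid search, and all names are our assumptions for illustration; the distance measure and optimizer actually used are described in the Supplementary Materials.

```python
def histogram(scores, n_bins=10):
    """Normalized histogram of scores in [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        counts[idx] += 1
    return [c / len(scores) for c in counts]

def estimate_driver_fraction(observed, drivers, passengers, n_bins=10, steps=1000):
    """Grid-search the mixing proportion that minimizes the L1 histogram distance."""
    h_obs = histogram(observed, n_bins)
    h_drv = histogram(drivers, n_bins)
    h_pas = histogram(passengers, n_bins)
    best_pi, best_dist = 0.0, float("inf")
    for k in range(steps + 1):
        pi = k / steps
        dist = sum(abs(o - (pi * d + (1 - pi) * p))
                   for o, d, p in zip(h_obs, h_drv, h_pas))
        if dist < best_dist:
            best_pi, best_dist = pi, dist
    return best_pi

# Toy data: 8 driver-like and 92 passenger-like scores in the observed set.
fraction = estimate_driver_fraction(
    observed=[0.05] * 8 + [0.95] * 92,
    drivers=[0.05] * 100,
    passengers=[0.95] * 100,
)
```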

Comparison with other methods

For comparison purposes, we assessed the performance of several published methods that were possibly useful for driver mutation prediction both on our training set and the two held-out validation sets of TP53 and EGFR mutations. The tested methods were: PolyPhen (22), SIFT (9), CanPredict (3) and KinaseSVM (7). We also assessed a consensus prediction, based on agreement between SIFT and PolyPhen (Supplementary Materials: Comparison with Other Missense Mutation Function Prediction Methods).

Wherever possible, we assessed the performance of these methods using a numeric score, rather than a categorical prediction, so that we could construct threshold-independent ROC and PR curves. We computed precision and recall statistics (Eq 4) when only categorical predictions were available (CanPredict and the PolyPhen/SIFT consensus).

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad \text{(Eq 4)}$$

where TP=number of drivers correctly classified, FP=number of synthetic passengers misclassified, FN=number of drivers misclassified. We compared the performance of these methods to CHASM’s performance on its own training set, based on out-of-bag scores, and also to CHASM’s performance when all TP53 and EGFR mutations were held out of its training and feature selection sets. We also compared Random Forest performance with performance of a support vector machine (SVM) (12), another state-of-the-art machine learning classifier, using the same training sets and predictive features. The SVM was trained using the e1071 package in R statistical software and assessed using five-fold cross-validation and constructing ROC and PR curves.
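
As a small worked example of Eq 4, the sketch below computes both statistics from counts. The function name and the counts in the usage line are hypothetical and are not results from this study.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall as defined in Eq 4.

    tp: drivers correctly classified
    fp: synthetic passengers misclassified as drivers
    fn: drivers misclassified as passengers
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# HYPOTHETICAL counts: 80 drivers found, 20 passengers wrongly called, 20 drivers missed.
precision, recall = precision_recall(tp=80, fp=20, fn=20)
```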

Results

Feature selection

To develop a new classifier, we first evaluated a large number of candidate predictive features and found that > 50 features contained at least some information that appeared to be useful for discriminating between driver and passenger mutations. In particular, using a method that estimates mutual information between a predictive feature and class labels, we found that the majority of the candidate predictive features were weakly informative (23) (Supplementary Table 3). In our training set (described in Methods) we calculated that a feature capable of correctly classifying a mutation as a passenger or driver would require 2.05 bits of information (Supplementary Materials: Information Theory). As our top-ranked feature had only 0.06 bits of information, we compensated by using 49 features (Supplementary Table 3, Supplementary Figure 3). This is a much larger number of features than used in previous studies (3, 7). The sum of the information in each individual feature was 0.37 bits. However, the Random Forest works with all features jointly, which may yield much higher information content than the simple sum.

Some of our top-ranked features have not, to our knowledge, been used previously for missense mutant function prediction. These features include the average nucleotide-level conservation of the exon in which a mutation occurs in 17-way vertebrate Multiz alignments (24), estimated by PhastCons (25); SNP density (the number of SNPs in the exon where the mutation occurs, normalized by exon length); and frequency of missense change type in the COSMIC database of somatic variation in cancer (11).

Datasets used for training

As noted in the Introduction, the choice of training sets is critically important to the performance of any classifier. As drivers, we selected 2488 missense mutations previously identified as playing a functional role in cancer, culled from the COSMIC database and recent large scale resequencing studies (see Methods). The passenger dataset was derived by a two-step process. First, we selected genes that were mutated at least once in four large scale sequencing studies of colorectal, breast, brain, or pancreatic tumors (2, 4–6). Second, we generated synthetic passenger missense mutations in these genes in silico, using an algorithm that recapitulated the type of base substitutions found in brain tumors (mutation context). Note that we purposefully chose genes that were mutated as the substrate for the in silico generation of synthetic mutations. This increased the likelihood that the new classifier would detect mutations that were extraordinary rather than detect genes that were extraordinary (e.g., had very different codon compositions than the average). Our classifier would thus be able to detect differences between driver and passenger mutations even if the mutations were in the same gene.

Past classifiers have often employed high MAF nsSNPs as the passenger dataset rather than the synthetic passenger dataset described above. To determine whether there were major differences between our new dataset and high MAF nsSNPs, we compared them using principal components analysis (PCA) applied to the top-ranked 21 predictive features (Supplementary Table 1). As shown in Figure 1, a randomly selected set of 4395 high MAF nsSNPs from the HapMap project was distributed differently than a set of 4500 synthetic passengers. Interestingly, the synthetic passengers formed two distinct clusters in this analysis, along the dimension of principal component four, which is dominated by feature 72. The feature is a binary descriptor of regions in proteins that are functionally interesting, as annotated in the UniProtKB database (26). It appears that while a subset of the synthetically generated passenger mutations were located in annotated regions of functional interest, the high MAF nsSNPs tended not to be located in these regions. This result is consistent with evolutionary selective pressure on high MAF nsSNPs for functional neutrality. Other features with large magnitude coefficients in these PCA components included predicted amino acid residue propensities for secondary structure, solvent accessibility, backbone flexibility, and additional protein-based functional annotations from UniProtKB.

Figure 1
Principal components analysis of nsSNPs vs. synthetic passenger mutations

Classifier Construction

We then attempted to use these features and datasets to design a new classifier using two state-of-the-art machine learning methods, support vector machines and Random Forests. Though both methods were able to define good classifiers, the Random Forest proved superior (Supplementary Figure 4) and was used for the remainder of the analyses. Details of the construction of the Random Forest-based classifier, henceforth termed CHASM, are described in Methods.

To test the performance of CHASM, we first assessed it with respect to its out-of-bag classification error on the training sets (equivalent to a cross-validation test (10)). For this purpose, Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves were employed, as these metrics consider classification errors at all possible score thresholds. Using area under the curve (AUC) as a performance summary statistic, where 1.0 indicates perfect classification, CHASM yielded AUCs of 0.91 and 0.79 for ROC and PR, respectively (Figure 2).
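
The ROC AUC has a simple rank interpretation that can be sketched directly: it equals the probability that a randomly chosen driver receives a more driver-like score than a randomly chosen passenger, with ties counted as one half. The function name is ours, and the sketch assumes low scores are driver-like, as with the passenger-vote fraction.

```python
def roc_auc(driver_scores, passenger_scores):
    """ROC AUC via pairwise comparison (equivalent to the Mann-Whitney U statistic)."""
    wins = 0.0
    for d in driver_scores:
        for p in passenger_scores:
            if d < p:        # driver ranked as more driver-like than passenger
                wins += 1.0
            elif d == p:     # ties contribute one half
                wins += 0.5
    return wins / (len(driver_scores) * len(passenger_scores))

perfect = roc_auc([0.1, 0.2], [0.8, 0.9])   # every driver scores below every passenger
chance = roc_auc([0.1, 0.9], [0.5, 0.5])    # drivers win half of the comparisons
```

The pairwise loop is quadratic in the number of mutations; a rank-based implementation would be preferable for large datasets.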

Figure 2
ROC and PR curves calculated for A) CHASM, B) PolyPhen PSIC, and C) SIFT on the training set mutations

This performance was then compared to that of other methods, including PolyPhen’s PSIC score, SIFT, CanPredict, KinaseSVM (Supplementary Figure 5), and a SIFT-PolyPhen consensus. The fraction of mutations that could be evaluated by these alternative methods (coverage) was considerably lower than that of CHASM (Supplementary Table 4 and Methods). Moreover, even the best-performing of the alternative methods was inferior to CHASM in specificity, sensitivity, and precision (Supplementary Table 4). These differences translated to much lower AUCs for ROC and PR (Figure 2).

As another test of the performance of CHASM, TP53 or EGFR mutations were held out of the mutation dataset used for training and then these known driver mutations were assigned scores by CHASM and the other algorithms. To evaluate both the sensitivity and specificity of each method, we also held out 590 synthetic passenger mutations. If we consider the fraction of misclassified mutations at the minimum error point, the CHASM classifier had high sensitivity and specificity for both the TP53 and EGFR test sets (Supplementary Table 4). The performance of CHASM was considerably better, both in terms of sensitivity and specificity, than previously described classifiers (Supplementary Table 4). These differences are graphically illustrated in the AUCs presented in Figures 3 and 4. (Further detail is provided in Supplementary Table 5.)

Figure 3
ROC and PR curves calculated for A) CHASM, B) PolyPhen PSIC, and C) SIFT on TP53 and synthetic passenger mutations held out of the CHASM training set
Figure 4
ROC and PR curves calculated for A) CHASM, B) PolyPhen PSIC, and C) SIFT on EGFR and synthetic passenger mutations held out of the CHASM training set

For a practical estimate of the CHASM performance, we calculated p-values for each of the held out TP53 and EGFR mutations, then controlled the false discovery rate (FDR) to 0.2 using the Benjamini-Hochberg procedure. We found that 195 of the 196 experimentally observed TP53 mutations and 131 of the 133 experimentally observed EGFR mutations were predicted to be drivers by CHASM. In comparison, a maximum of 188 of the 196 experimentally observed TP53 mutations and 101 of the 133 experimentally observed EGFR mutations were predicted to be drivers by PolyPhen or SIFT.

Analyses of GBM

The CHASM Random Forest classifier was then used to score 607 missense mutations in glioblastoma multiforme (GBM) described by Parsons et al. (4). The driver dataset used to train the Random Forest was the same as that described above except that all of the missense mutations actually observed in GBMs were excluded. The raw CHASM scores of the mutations, representing the fraction of trees in the forest that voted for classifying the mutation as passenger, ranged from 0 to 1 (Figure 5). For each of these missense mutants, we tested the null hypothesis that the mutant was a passenger. A p-value was calculated for each mutant by comparing its CHASM score to the score distribution of a filtered set of synthetic passengers (see Methods for details). The Benjamini-Hochberg procedure was used to control the false discovery rate (FDR) at the desired level of 0.2 (19).

Figure 5
Histograms of CHASM scores for driver mutations and passenger mutations held out from the training set, and 607 mutations experimentally identified in GBM

At this FDR level, CHASM classified 24 of the 607 GBM mutations as drivers (Table 1). Importantly, CHASM successfully identified 11 mutations that were likely to be drivers based on previous experimental data. These 11 mutations included nine in TP53 or PTEN, well-known tumor suppressor genes, one in PIK3CA, a well-known oncogene and one in IDH1, a gene recently discovered to be altered in many brain tumors (27). In addition to these 11, CHASM identified 13 others that otherwise would not have been suspected of playing a major role in GBM tumorigenesis (Table 1). Intriguingly, these mutations included those in genes that are likely to be involved in critical signaling pathways, such as the protein kinases STK39 and RIPK4, the protein phosphatase PTPRM, and the insulin-signaling mediator PHIP.

Table 1
Driver mutations predicted by CHASM at FDR of 0.2, shown with their associated Random Forest scores and p-values

Finally, to estimate the proportion of driver missense mutations in the GBM mutation set, we minimized the difference between the distributions of the CHASM scores of the GBM mutations and the CHASM scores of a mixture of known driver and synthetic passenger mutations (see Methods for details). We thereby estimated that 49 of the 607 missense mutations identified in GBM, or 8%, were drivers.

Discussion

Computational methods to predict the impact of mutations discovered in tumor resequencing are still under development. While initial work focused on identification of driver genes rather than driver mutations (1, 5), it has recently been suggested that some missense mutations occurring in oncogenes or tumor suppressor genes are actually passengers (7), motivating the need for a higher resolution approach that identifies individual mutations as drivers. In light of the large number of mutations that are being discovered in current large-scale cancer gene sequencing efforts, and the impossibility of assessing this large number through experimental functional studies, bioinformatic approaches to classify and prioritize mutations for further analysis are essential for progress.

Confronted with this problem, some researchers have tried to apply methods that were developed to predict the impact of germline missense variants. We found that these methods have good sensitivity in recognizing recurrent driver missense mutations in TP53 and EGFR, but poor specificity (Supplementary Table 4, Figures 3 and 4). This result implies that there may be differences between the distinguishing characteristics of neutral mutations in the cancer genome vs. the germline genome. Application of methods developed for the latter problem to the former problem yielded less than optimal results. In contrast, the CHASM classifier, specifically developed to detect somatic rather than germline driver mutations, had substantially improved sensitivity, specificity, and precision over previously described methods.

Overall, our results highlight the importance of “null model” selection in designing a predictive algorithm to identify driver mutations in cancer resequencing data. Within the context of a prediction method, the “null model” incorporates assumptions about what driver missense mutations do not look like. It is used explicitly in supervised learning methods such as CanPredict, KinaseSVM, and our previous version of CHASM (2, 4). It is also used implicitly in methods like SIFT and PolyPhen, because their utility has been assessed with a validation or benchmark set as a false positive control. SIFT has used experimental results of functional assays in bacterial and viral proteins as a control; PolyPhen has used species divergence data from amino acid substitutions found in equivalent positions in alignments of protein orthologs. We suggest that these null models of functional neutrality do not optimally represent the passenger missense mutations found in tumors.

While existing methods for missense mutant function prediction in cancer have provided tools to prioritize candidate driver mutations, we have developed a quantitative approach to identify candidate drivers by controlling the false discovery rate. To our knowledge, this is the first application of FDR to the classification of missense mutations, providing a statistically meaningful threshold for discovery.

We estimate that the proportion of drivers among all GBM missense mutations in our dataset is approximately 8%, with 5.4% occurring outside of known gene mountains. Note that the actual number of drivers in the mutation dataset of Parsons et al. (4) is likely to be higher, as CHASM only considers missense mutations. Many of the tumor suppressor gene alterations that drive tumorigenesis are nonsense mutations, frameshifts, or large deletions.

Our method is high-throughput and can be easily adapted to any tumor type of interest, given a sufficient sample size to compute context-based DNA mutation rates. It also represents an advance over previous classifiers in that most mutations can be scored (coverage, Supplementary Table 4). Because the method focuses on properties of individual mutations, rather than the frequency at which mutations appear in a gene, it can potentially detect driver mutations that are present at low frequencies. These mutations may dysregulate pathways that are potential new drug targets. A recent example is the isocitrate dehydrogenase (IDH1) R132 mutation, discovered in GBM resequencing (4). In the initial screen in (4), this mutation was found in only a small proportion of GBMs, so its role as a driver was questionable. CHASM, however, shows that the mutation has a high likelihood of being a driver when present in a tumor. Subsequent studies revealed that the mutation was present in a high fraction of an uncommon GBM subtype as well as other brain tumor types (4, 27–30). Functional studies suggest that mutant IDH1 dominantly inhibits production of α-ketoglutarate, which is required by enzymes that degrade HIF-1α, thus hyperactivating the HIF-1 pathway and promoting tumor angiogenesis. Drugs designed to be α-ketoglutarate mimics might thus be useful for GBM patients with the IDH1 mutation (31). We hope CHASM will provide a useful tool to guide follow-up experiments based on the results of the many cancer genome projects now being performed or planned.

Supplementary Material

Acknowledgments

This work was supported by a DOD NDSEG Fellowship to HC; the Virginia and D.K. Ludwig Fund for Cancer Research, NIH grants CA121113, CA43460, CA57345, CA62924, and NCI/SAIC contract 28XS268 to VV, KK, and BV; and NIH/NCI grant CA135877 and Susan G. Komen foundation grant KG080137 to RK. We would also like to thank Giovanni Parmigiani and Mark Diekhans for valuable discussions and Ivan Adzubey, Josh Kaminker, and Ali Torkamani for help scoring the mutations with PolyPhen, CanPredict and KinaseSVM.

References

1. Greenman C, Stephens P, Smith R, et al. Patterns of somatic mutation in human cancer genomes. Nature. 2007;446(7132):153–8. [PMC free article] [PubMed]
2. Jones S, Zhang Z, Parsons DW, et al. Core signaling pathways in human pancreatic cancer revealed by tumor genome analysis. Science. 2008;321(5897):1801–06. [PMC free article] [PubMed]
3. Kaminker JS, Zhang Y, Waugh A, et al. Distinguishing Cancer-Associated Missense Mutations from Common Polymorphisms. Cancer Res. 2007;67(2):465–73. [PubMed]
4. Parsons DW, Jones S, Zhang X, et al. An integrated genomic analysis of glioblastoma multiforme. Science. 2008;321(5897):1807–12. [PMC free article] [PubMed]
5. Sjoblom T, Jones S, Wood LD, et al. The consensus coding sequences of human breast and colorectal cancers. Science. 2006;314(5797):268–74. [PubMed]
6. Wood LD, Parsons DW, Jones S, et al. The genomic landscapes of human breast and colorectal cancers. Science. 2007;318(5853):1108–13. [PubMed]
7. Torkamani A, Schork NJ. Prediction of cancer driver mutations in protein kinases. Cancer Res. 2008;68(6):1675–82. [PubMed]
8. Barnholtz-Sloan J, Sloan AE, Land S, Kupsky W, Monteiro ANA. Somatic alterations in brain tumors. Oncology Reports. 2008;20(1):203–10. [PMC free article] [PubMed]
9. Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. 2001;11(5):863–74. [PMC free article] [PubMed]
10. Breiman L. Random forests. Machine Learning. 2001;45:5–32.
11. Forbes S, Clements J, Dawson E, et al. Cosmic 2005. Br J Cancer. 2006;94(2):318–22.
12. Vapnik V. The Nature of Statistical Learning Theory. New York: Springer-Verlag; 1995.
13. Sunyaev S, Ramensky V, Koch I, Lathe W, Kondrashov AS, Bork P. Prediction of deleterious human alleles. Hum Mol Genet. 2001;10(6):591–7.
14. Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Computation. 1997;9(7):1545–88.
15. Cover T, Thomas J. Elements of information theory. 1. Wiley and Sons; 1991.
16. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; 2008.
17. Breiman L. Classification and regression trees. The Wadsworth statistics/probability series. Wadsworth International Group; 1984.
18. Bylander T. Estimating generalization error on two-class datasets using out-of-bag estimates. Machine Learning. 2002;48(1):287–97.
19. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological). 1995:289–300.
20. Wheeler DL, Barrett T, Benson DA, et al. Database resources of the National Center for Biotechnology Information. Nucl Acids Res. 2008;36(S1):D13–21.
21. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96(456):1151–60.
22. Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002;30(17):3894–900.
23. Karchin R, Kelly L, Sali A. Improving functional annotation of non-synonomous SNPs with information theory. Pac Symp Biocomput. 2005:397–408.
24. Blanchette M, Kent WJ, Riemer C, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14(4):708–15.
25. Siepel A, Bejerano G, Pedersen JS, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15(8):1034–50.
26. Wu CH, Apweiler R, Bairoch A, et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. 2006.
27. Yan H, Parsons DW, Jin G, et al. IDH1 and IDH2 Mutations in Gliomas. The New England Journal of Medicine. 2009;360(8):765.
28. Balss J, Meyer J, Mueller W, Korshunov A, Hartmann C, von Deimling A. Analysis of the IDH1 codon 132 mutation in brain tumors. Acta Neuropathologica. 2008;116(6):597–602.
29. Watanabe T, Nobusawa S, Kleihues P, Ohgaki H. IDH1 Mutations Are Early Events in the Development of Astrocytomas and Oligodendrogliomas. American Journal of Pathology. 2009;174(4):1149.
30. Bleeker FE, Lamba S, Leenstra S, et al. IDH1 mutations at residue p.R132 (IDH1R132) occur frequently in high-grade gliomas but not in other solid tumors. Human Mutation. 2009;30(1).
31. Zhao S, Lin Y, Xu W, et al. Glioma-derived mutations in IDH1 dominantly inhibit IDH1 catalytic activity and induce HIF-1alpha. Science. 2009;324(5924):261–5.
32. Whibley C, Pharoah PDP, Hollstein M. p53 polymorphisms: cancer implications. Nature Reviews Cancer. 2009;9(2):95–107.
33. Qi H, Labrie Y, Grenier J, Fournier A, Fillion C, Labrie C. Androgens induce expression of SPAK, a STE20/SPS1-related kinase, in LNCaP human prostate cancer cells. Molecular and Cellular Endocrinology. 2001;182(2):181–92.
34. Schreiber SC, Giehl K, Kastilan C, et al. Polysialylated NCAM Represses E-Cadherin-Mediated Cell-Cell Adhesion in Pancreatic Tumor Cells. Gastroenterology. 2008;134(5):1555–66.
35. Ruf W, Mueller BM. Thrombin generation and the pathogenesis of cancer. Semin Thromb Hemost. 2006:61.
36. Shimizu A, Mammoto A, Italiano JE, Jr, et al. ABL2/ARG Tyrosine Kinase Mediates SEMA3F-induced RhoA Inactivation and Cytoskeleton Collapse in Human Glioma Cells. J Biol Chem. 2008;283(40):27230–8.
37. Kaburagi Y, Okochi H, Satoh S, et al. Role of IRS and PHIP on insulin-induced tyrosine phosphorylation and distribution of IRS proteins. Cell Structure and Function. 2007;32(1):69–78.
38. Dearth RK, Cui X, Kim HJ, Hadsell DL, Lee AV. Oncogenic transformation by the signaling adaptor proteins insulin receptor substrate (IRS)-1 and IRS-2. Cell Cycle (Georgetown, Tex). 2007;6(6):705.
39. Sinha S, Singh R, Alam N, Roy A, Roychoudhury S, Panda C. Alterations in candidate genes PHF2, FANCC, PTCH1 and XPA at chromosomal 9q22.3 region: Pathological significance in early- and late-onset breast carcinoma. Molecular Cancer. 2008;7(1):84.
40. Hasenpusch-Theil K. PHF2, a novel PHD finger gene located on human chromosome 9q22. Mammalian Genome. 1999;10(3):294–8.
41. Kita D, Yonekawa Y, Weller M, Ohgaki H. PIK3CA alterations in primary (de novo) and secondary glioblastomas. Acta Neuropathologica. 2007;113(3):295–302.
42. Koul D. PTEN signaling pathways in glioblastoma. Cancer Biol Ther. 2008;7(9):1321–5.
43. Muller PJ, Dally H, Klappenecker CN, et al. Polymorphisms in ABCG2, ABCC3 and CNT1 genes and their possible impact on chemotherapy outcome of lung cancer patients. Int J Cancer. 2009;124(7):1669–74.
44. Moran ST, Haider K, Ow Y, Milton P, Chen L, Pillai S. Protein Kinase C-associated Kinase Can Activate NFκB in Both a Kinase-dependent and a Kinase-independent Manner. J Biol Chem. 2003;278(24):21526–33.
45. Basseres DS, Baldwin AS. Nuclear factor-kappa-B and inhibitor of kappa-B kinase pathways in oncogenic initiation and progression. Oncogene. 2006;25(51):6817–30.
46. Doerks T, Huber S, Buchner E, Bork P. BSD: a novel domain in transcription factors and synapse-associated proteins. Trends in Biochemical Sciences. 2002;27(4):168–70.
47. Sim DLC, Yeo WM, Chow VTK. The novel human HUEL (C4orf1) protein shares homology with the DNA-binding domain of the XPA DNA repair protein and displays nuclear translocation in a cell cycle-dependent manner. The International Journal of Biochemistry & Cell Biology. 2002;34(5):487–504.
48. Rodriguez-Antona C, Ingelman-Sundberg M. Cytochrome P450 pharmacogenetics and cancer. Oncogene. 2006;25(11):1679–91.
49. Triantafilou M, Triantafilou K. Lipopolysaccharide recognition: CD14, TLRs and the LPS-activation cluster. Trends in Immunology. 2002;23(6):301–4.
50. Anders L, Mertins P, Lammich S, et al. Furin-, ADAM 10-, and gamma-Secretase-Mediated Cleavage of a Receptor Tyrosine Phosphatase and Regulation of beta-Catenin’s Transcriptional Activity. Mol Cell Biol. 2006;26(10):3917–34.