Nucleic Acids Res. Jul 1, 2010; 38(Web Server issue): W118–W123.
Published online Jun 6, 2010. doi:  10.1093/nar/gkq515
PMCID: PMC2896190

CCancer: a bird’s eye view on gene lists reported in cancer-related studies

Abstract

CCancer is an automatically collected database of gene lists, which were reported mostly by experimental studies in various biological and clinical contexts. At the moment, the database covers 3369 gene lists extracted from 2644 papers published in ~80 peer-reviewed journals. As input, CCancer accepts a gene list. An enrichment analysis is implemented to generate, as output, a highly informative survey of recently published studies that report gene lists significantly intersecting with the query gene list. A report on gene pairs from the input list that were frequently reported together by other biological studies is also provided. CCancer is freely available at http://mips.helmholtz-muenchen.de/proj/ccancer.

INTRODUCTION

At the moment, various high-throughput experimental platforms are employed intensively to provide new insights into the molecular mechanisms underlying a variety of biological phenomena (1,2). An increasing number of biological or clinical studies report differentially expressed genes, epigenetically silenced genes, frequently mutated genes, genes with copy number variations or other gene lists involved in common biological processes. Although publicly available, this type of information is scattered across hundreds of papers. The only practical way to collect this valuable data is to use automatic text mining systems.

Text-mining systems are employed by biomedical researchers to automatically extract relevant information from the literature [see ref. (3) for a review]. For example, PolySearch (4) is a generic text mining system for extracting relationships between genes and diseases. Several other databases, which are based on text mining, focus on specialized research areas: PubMeth (5) and MeInfoText (6) collect information on gene methylation in cancer. DDOC (7) and DDEC (8) collect heterogeneous information about genes differentially expressed in ovarian and esophageal cancer, such as manually curated information about the promoter regions and associated transcription factors, as well as text-mined reports.

Recently, we have developed the PLIPS database, a collection of protein lists extracted from proteomics studies by text-mining (9). PLIPS also provides a statistical framework for the interpretation of a protein list. To generate the PLIPS database, relatively few ‘text mining’ efforts were required, since a majority of proteomics studies are published in a few highly specialized proteomics journals. PLIPS covers only five major proteomics journals (Proteomics, Journal of Proteome Research, Molecular and Cellular Proteomics, Proteomics—Clinical Applications) and ~1000 different protein lists extracted from 800 independent studies.

In contrast to proteomics, high-throughput genomic technologies are used more frequently and their results are published in a much wider spectrum of journals. Gene lists that were characterized as playing key roles in the molecular mechanisms of a variety of biological phenomena are regularly reported in general biological journals, as well as in highly specific medical journals. Thus, automatic extraction of this information requires considerable additional effort.

Here, we present a database, termed CCancer, which provides a collection of 3369 gene lists automatically extracted from tables in 2644 studies covering ~80 peer-reviewed journals. Cancer is a major focus of biomedical research. According to our estimates, more than half of the gene lists stored in CCancer were extracted from cancer-related studies, which motivates the name of the database.

CCancer is not only a database but also a web-based analysis platform, which employs an enrichment analysis framework (10–14) to interpret a given user-defined gene list. As input, CCancer accepts a gene/protein list. As output, it provides a catalogue of previously published studies that report a table of genes/proteins significantly intersecting with the query list. Thus, CCancer supports the interpretation of the functional context of an experimentally derived gene list. To illustrate the valuable and often unprecedented information that the user can obtain from the CCancer database, we present several examples of data analyses.

MATERIALS AND METHODS

Text mining

We collected all articles (~150 000) published in 80 peer-reviewed journals over the last 5–7 years. The articles were screened for tables that report gene identifiers. A search algorithm was implemented to recognize a table with gene/protein identifiers within the paper text. If a table reported at least 10 unique gene identifiers of the same type [i.e. ‘Entrez Gene IDs’, ‘Gene Symbols’, ‘RefSeq’, ‘UNIGENE’, ‘ENSEMBL’, ‘Affymetrix Probes’, ‘IPISYN (International Protein Identifier)’, ‘Uniprot, SwissProt’], then the paper was selected. In total, 3369 gene/protein lists were identified from 2644 papers. All gene lists were mapped to ‘Entrez Gene IDs’. The data in CCancer cover ~20 000 unique ‘Entrez Gene IDs’.
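The selection rule above (at least 10 unique identifiers of one type per table) can be sketched as follows. This is only an illustration, not the authors' search algorithm: the regular expressions are simplified, hypothetical patterns for a few of the identifier types, and the function names are invented for this example.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately crude patterns for a few identifier types.
# A real recognizer would need more types and far more careful matching
# (e.g. the Entrez pattern below accepts any integer-valued table cell).
ID_PATTERNS = {
    "Entrez Gene ID": re.compile(r"^\d{1,9}$"),
    "RefSeq": re.compile(r"^(NM|NP|NR|XM|XP)_\d+(\.\d+)?$"),
    "ENSEMBL": re.compile(r"^ENS[A-Z]*[GTP]\d{11}$"),
    "UNIGENE": re.compile(r"^[A-Z][a-z]\.\d+$"),
    "Affymetrix Probe": re.compile(r"^\d+(_[a-z]+)*_at$"),
}

MIN_UNIQUE_IDS = 10  # threshold stated in the text


def dominant_identifier_type(cells):
    """Return (type, unique ids) for the best-represented identifier type."""
    found = defaultdict(set)
    for cell in cells:
        token = cell.strip()
        for id_type, pattern in ID_PATTERNS.items():
            if pattern.match(token):
                found[id_type].add(token)
    if not found:
        return None, set()
    best = max(found, key=lambda t: len(found[t]))
    return best, found[best]


def table_reports_gene_list(cells):
    """True if the table holds >= MIN_UNIQUE_IDS unique ids of one type."""
    id_type, ids = dominant_identifier_type(cells)
    return id_type is not None and len(ids) >= MIN_UNIQUE_IDS
```

Counting unique identifiers per type, rather than all identifier-like cells together, mirrors the requirement that the 10 identifiers be of the same type.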

The top journals, in terms of the number of extracted gene lists, were highly specific journals in cancer research: ‘Cancer Research’ (327 papers and 411 gene lists), ‘Oncogene’ (214 papers and 278 gene lists), ‘Clinical Cancer Research’ (149 papers and 178 gene lists) and ‘International Journal of Cancer’ (109 papers and 143 gene lists). The full list of journals is accessible on the web server (http://mips.helmholtz-muenchen.de/proj/ccancer/journals).

We would like to point out that the data were collected automatically. A gene list in the database may be incomplete (in comparison to the list originally reported in the paper) and might contain false positive gene identifiers. In our estimate, based on 100 randomly selected records, ~60% of records are of high quality (the CCancer record has <10% false negative and false positive genes relative to the original table in the paper), ~20% of records are of good quality (containing 10–25% false negatives and not more than 10% false positives) and ~15% of records contain ~35–75% of the genes actually reported in the paper table. About 3–5% of the records may represent artefacts, i.e. the result of a wrongly recognized table that does not actually report a gene list.

Automatic annotation of gene lists with human disease terms

A comprehensive hierarchical controlled vocabulary for human disease (http://do-wiki.nubic.northwestern.edu/index.php/Main_Page) was used to link articles to human disease terms. First, we computed the background distribution for each disease term using all available articles (~150 000). For each disease term ‘A’, we selected the subset of articles in which the term was present at least once and counted, for each article in this subset, the number of times the term ‘A’ was present in the paper. We then computed the average number of times the term ‘A’ was mentioned per paper across this subset. If the term ‘A’ was mentioned in a paper at least twice as many times as this average value, the paper was annotated with the term ‘A’.
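The annotation rule can be illustrated with a short sketch. It assumes that term mentions have already been counted per paper; the function name, the input layout and the use of "greater than or equal to" for the two-fold threshold are assumptions made for this example.

```python
from collections import defaultdict


def annotate_with_disease_terms(term_counts, fold=2.0):
    """Annotate papers with over-represented disease terms.

    term_counts: {paper_id: {term: number of mentions in that paper}}
    Returns {paper_id: set of terms mentioned at least `fold` times the
    background average for that term}.
    """
    totals = defaultdict(int)    # total mentions of each term
    n_papers = defaultdict(int)  # papers mentioning the term at least once
    for counts in term_counts.values():
        for term, c in counts.items():
            if c > 0:
                totals[term] += c
                n_papers[term] += 1

    # Background: average mentions per paper among papers that use the term.
    background = {t: totals[t] / n_papers[t] for t in totals}

    annotations = {}
    for paper, counts in term_counts.items():
        annotations[paper] = {
            t for t, c in counts.items()
            if c > 0 and c >= fold * background[t]
        }
    return annotations
```

One consequence of this rule is that a term occurring in only a single paper can never be assigned, because that paper's count equals the background average.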

The gene lists from the CCancer database were annotated with ‘human disease terms’ based on the annotation of the paper from which each list was extracted. We would like to point out that the biological context of the genes reported in a table may not correspond to the context of the terms overrepresented in the paper text. In each case, manual analysis is required.

Statistical analysis: intersection of gene lists

To statistically link a given gene list (the query list) to the lists from the database, we implemented a standard enrichment analysis. For each gene/protein list L in the database, the number of genes I common to the query list and the gene list L is counted. The null hypothesis H0 ‘Genes from the query list (size NQ) and from the list L (size NL) have at least I common genes by chance’ is tested. The hypergeometric test, adjusted for multiple testing by a Monte–Carlo simulation procedure, is employed to assess the significance of the intersection I. The estimated P-value corresponds exactly to the definition of an experiment-wise Westfall and Young P-value (11,12,15–17).
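For illustration, the unadjusted hypergeometric tail probability for an intersection of size I can be computed as in the sketch below. The function and parameter names are chosen for this example, and the size of the gene universe is left as an explicit argument because the text does not state which value the server uses (the ~20 000 Entrez Gene IDs covered by the database would be one plausible choice).

```python
from math import comb


def hypergeom_tail(universe, n_query, n_list, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe, n_list, n_query).

    universe: total number of genes that could be drawn
    n_query:  size of the query list (NQ)
    n_list:   size of the database list L (NL)
    overlap:  observed number of common genes (I)
    """
    upper = min(n_query, n_list)
    total = comb(universe, n_query)
    favourable = sum(
        comb(n_list, i) * comb(universe - n_list, n_query - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total
```

Exact integer arithmetic via `math.comb` avoids the underflow that can occur when multiplying many small floating-point probabilities; this value is the raw P-value, before any adjustment for multiple testing.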

Statistical analysis: significantly associated gene pairs

Based on the data from the CCancer database, we identified gene pairs that were significantly associated (frequently reported together). Let N denote the total number of tables in the CCancer database and Cj the number of tables in which gene j was reported. For each pair of genes (j and k) we compute Cjk, the number of tables in which both gene j and gene k were reported. The number (intersection) Cjk follows a hypergeometric distribution with parameters N, Cj and Ck (‘Cj’ balls are drawn without replacement from an urn containing ‘N’ balls in total, ‘Ck’ of which are white). A Monte–Carlo simulation procedure was employed to adjust the P-value for multiple testing (for each gene we tested K hypotheses, where K is the total number of genes). At the significance level P < 0.05, each gene from the CCancer database was associated on average with ~20 other genes.
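The quantities N, Cj and Cjk can be tallied in a single pass over the tables. The following is a minimal sketch with invented names, assuming each table is available as a collection of gene identifiers already mapped to a common type.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tables):
    """Return (N, C_j, C_jk) for an iterable of gene tables.

    N:    number of tables
    C_j:  Counter mapping gene -> number of tables reporting it
    C_jk: Counter mapping (gene_j, gene_k) -> number of tables with both,
          with each pair stored in sorted order
    """
    gene_counts = Counter()
    pair_counts = Counter()
    n_tables = 0
    for table in tables:
        genes = sorted(set(table))  # de-duplicate within a table
        n_tables += 1
        gene_counts.update(genes)
        pair_counts.update(combinations(genes, 2))
    return n_tables, gene_counts, pair_counts
```

Sorting the genes within each table means that a pair is always keyed the same way, so `("CTSB", "CTSL1")` and `("CTSL1", "CTSB")` are not counted separately. Enumerating all pairs is quadratic in table size, which may matter for very large tables.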

Cross-linking of gene lists from the database

Extracted gene/protein lists were mapped to NCBI Entrez Gene IDs. Each gene/protein list L in the database was treated in turn as the ‘query’ list to identify the other lists in the database that share a significant number of genes with it. Thus, we cross-linked all pairs of gene lists that share a significant number (P < 0.01) of common genes. This information can be browsed online (http://mips.helmholtz-muenchen.de/proj/ccancer/journals/).

Linking gene lists to gene ontology terms

Each gene/protein list from the CCancer database was linked to the gene ontology (GO) terms that were significantly (P < 0.01) enriched in the list (13). In analogy to the calculation of the intersection P-value, a hypergeometric test, adjusted for multiple testing using a Monte–Carlo simulation approach, was employed to estimate the statistical significance of a GO category.

Monte–Carlo simulation to adjust P-values for multiple testing

All three cases considered (the intersection between a gene list and CCancer records, GO enrichment of a gene list and significantly associated gene pairs) can be described by the same statistical model, ‘sampling without replacement’. In this model, k balls are drawn without replacement from an urn containing ‘N’ balls in total, n of which are white (all others are black). The number k1 of white balls drawn from the urn then follows a hypergeometric distribution with parameters N, k and n. However, in our cases the balls are multicolored and we actually test multiple hypotheses at the same time: ‘white balls were drawn randomly’, ‘blue balls were drawn randomly’, ‘red balls were drawn randomly’ and so on. Because we select the most enriched color, whichever it is (let us say red in this case), the P-value estimated from the hypergeometric distribution does not reflect the actual probability of getting k1 balls of any single color; it reflects the probability of getting k1 red balls. To obtain the actual probability of getting k1 balls of any single color, we need to adjust the P-value for multiple testing. One way to do this is to use Monte–Carlo simulation to measure this probability directly, based on, let us say, 1000 simulations.

In this case we simulate a random drawing of k balls 1000 times and each time estimate the P-value, based on the hypergeometric distribution, for the best (most enriched) color. Thus, we obtain a distribution of 1000 best P-values for random draws of k balls and compare it to the P-value for the best color in our original draw of k balls. The adjusted P-value is estimated as the share of random simulations in which the best P-value was equal to or better (smaller) than the P-value for the best color in our original draw of k balls.
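The procedure can be sketched in a few lines. This is an illustrative reading of the description above rather than the server's actual code: ‘colors’ are represented as named sets of balls (a ball may carry several colors, as a gene may belong to several lists), and all function and parameter names are assumptions made for this example.

```python
import random
from math import comb


def hypergeom_tail(N, n, k, k1):
    """P(at least k1 white balls) when k balls are drawn from N, n white."""
    total = comb(N, k)
    favourable = sum(
        comb(n, i) * comb(N - n, k - i) for i in range(k1, min(k, n) + 1)
    )
    return favourable / total


def best_pvalue(drawn, colours, N):
    """Smallest hypergeometric P-value over all colours for one draw."""
    drawn = set(drawn)
    k = len(drawn)
    return min(
        hypergeom_tail(N, len(members), k, len(drawn & members))
        for members in colours.values()
    )


def adjusted_pvalue(drawn, colours, universe, n_sim=1000, seed=0):
    """Share of random draws whose best P-value is <= the observed one.

    drawn:    the original draw (e.g. the query gene list)
    colours:  {name: set of balls with that colour}
    universe: list of all balls in the urn
    """
    rng = random.Random(seed)
    observed = best_pvalue(drawn, colours, len(universe))
    hits = 0
    for _ in range(n_sim):
        sample = rng.sample(universe, len(drawn))
        if best_pvalue(sample, colours, len(universe)) <= observed:
            hits += 1
    return hits / n_sim
```

With 1000 simulations the smallest non-zero adjusted P-value that can be reported is 0.001; an estimate of 0 only means that no simulated draw matched the observed enrichment.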

RESULTS

CCancer: interpretation of gene list

Users can submit a list of gene/protein identifiers to find statistically significant links to previously published studies, as well as to identify gene pairs from the submitted list that were frequently reported together (P < 0.05). As input, CCancer accepts most types of gene and protein identifiers, such as ‘Gene Symbol’, ‘Entrez Gene Id’, ‘UniProt/Swiss-Prot’, ‘UniGene’, ‘Ensembl’, ‘RefSeq Protein ID’, ‘RefSeq Transcript ID’ and ‘Affymetrix probe codes’. As output, it provides a catalogue of previously published studies that report a table of genes significantly intersecting with the query list.

After a list of potential studies is generated, all interesting hits need to be checked manually. First, the list of gene IDs common to the query and the database list needs to be checked (a link, ‘Mapping protocol’, is provided on the results page). Additionally, by looking into the corresponding study (a link is provided on the results page) one can better understand the functional context of the ‘database’ gene list. As already mentioned in the ‘Text mining’ section, the database was collected automatically and, thus, some hits may represent artefacts.

CCancer: browsing for gene lists with common biological properties

CCancer also provides an interface to browse gene lists from the database with a common property. At the moment, the user can select gene lists which are statistically linked to either a particular GO biological process (http://mips.helmholtz-muenchen.de/proj/ccancer/go_bp), molecular function (http://mips.helmholtz-muenchen.de/proj/ccancer/go_mf) or cellular component (http://mips.helmholtz-muenchen.de/proj/ccancer/go_cc). The possibility to browse gene lists based on their local properties is going to be extended in the future.

CCancer: examples of possible applications

Next, we present examples of analyses of experimental data by CCancer to illustrate the potential utility of our database. In principle, the interpretation of a gene list using CCancer is based on the widely accepted guilt-by-association principle: significant similarities between gene/protein lists can be indicative of similarity in the molecular mechanisms of the corresponding phenomena. The next example aims to illustrate this application of CCancer.

Example 1: Relationship between the senescence phenotype and cancer

A study by Young et al. (18) provided evidence that autophagy-related genes mediate the acquisition of the senescence phenotype. The authors studied 53 autophagy- and senescence-related genes, which were up- or down-regulated after Ras induction. A screen in CCancer for studies reporting gene sets that significantly intersect with the genes reported in ref. (18) identified several related papers (P < 0.01, Table 1). For example, CCancer detected a study in which Staber et al. (19) report genes associated with recurrent acute myeloid leukemia after high-dose chemotherapy. The genes that were differentially expressed in patients with acute myeloid leukemia prior to high-dose chemotherapy and after relapse significantly intersect with the senescence-related genes reported in ref. (18).

Table 1.
An example of a CCancer output: a query list of autophagy- and senescence-related genes

Two other studies detected by CCancer covered related topics. One study (21) described genes related to macrophage activation and the TH1 immune response, which were induced by low-dose radiation therapy in follicular lymphoma. A second study (22) identified genes differentially expressed in response to ionizing radiation in lymphoblastoid cells.

A relationship between cancer- and senescence-related genes or pathways would certainly have been expected. Interestingly, some topics that emerged from the comparison of senescence-related genes to the studies in the CCancer database were related to cancer therapy or prognosis. For instance, the studies commonly reported Cathepsin B and D (CTSB/D) and Cathepsin L1 (CTSL1), which participate in protein degradation and turnover (23,24).

Another output from CCancer (Table 2) reports pairs of genes that were frequently reported together. For example, the already mentioned genes Cathepsin L1 (CTSL1) and Cathepsin B and D (CTSB/D) were reported together much more frequently than would be expected by chance. CTSL1 and CTSB were reported together in 12 CCancer records, while the two genes were reported 78 and 49 times, respectively (P < 0.05). Interestingly, 7 out of 12 papers were related to ‘cancer’ and three to ‘LEUKEMIA’.
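As a rough plausibility check of these counts, the expected number of co-occurrences under independence and the corresponding unadjusted hypergeometric tail can be written out as below. The calculation assumes that N equals the 3369 gene lists in the database; the tail probability shown is the raw value, not the Monte–Carlo-adjusted P-value reported above.

```latex
\mathbb{E}[C_{jk}] = \frac{C_j \, C_k}{N} = \frac{78 \times 49}{3369} \approx 1.13,
\qquad
P(C_{jk} \ge 12) = \sum_{i=12}^{49}
  \frac{\binom{49}{i}\,\binom{3369-49}{78-i}}{\binom{3369}{78}}
```

Under this assumption, the 12 observed co-occurrences are roughly ten times the number expected by chance.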

Table 2.
An example of a CCancer output: a query list of autophagy- and senescence-related genes

Example 2: Searching for novel potential clinical applications of drugs

CCancer can further be exploited to identify hints for novel clinical applications of known drugs (25) or drugs under development, provided that the list of molecular targets of the drug is known. For a variety of cell disorders (including most cancer subtypes) CCancer stores lists of genes identified to be in differential states (comparison of normal versus disordered cells). This type of information can be mined to identify new potential therapeutic applications. A significant number of common genes between the drug targets and a gene list from the database related to some cell disorder can indicate that the drug may be usable for the corresponding disease.

For example, Bosutinib is a novel promiscuous kinase inhibitor. We extracted a reported list of direct interactors of Bosutinib identified by chemical proteomics (26). We used CCancer to identify previously reported gene lists that share a significant number of genes/proteins with this list (http://mips.helmholtz-muenchen.de/proj/ccancer/example.html) and, thus, to identify potential physiological conditions in which an application of Bosutinib might be effective. In addition to already known applications, such as different types of ‘leukemia’ (27,28), CCancer suggests other specific cancer types, such as ‘oral squamous cell carcinoma’ (28).

DISCUSSION

Here, we have generated a comprehensive collection of gene lists reported by papers in 80 peer-reviewed journals. Tables in articles usually present, to some extent, pre-processed gene lists, which are selected for significance by experts. To our knowledge, no existing text-mining system provides a similarly accessible and comparable collection of experimentally derived gene lists for analysis.

The CCancer database provides a computational interface that generates a highly informative survey of recently published cancer-related studies reporting similar and significantly intersecting gene lists. As we have demonstrated, by applying this automatic analysis, the user may obtain sometimes unexpected links to previous studies. It would be a tedious, if not impossible, task for experimental researchers to gain these insights by a manual analysis of the literature. Articles that contain significantly intersecting gene lists are not necessarily listed as ‘related articles’ in PubMed. In addition, CCancer implements a robust statistical treatment of the intersection between a query and a database gene list, and provides a valid estimate of the P-values by a Monte–Carlo simulation procedure. The P-values actually reflect the probability of getting an intersection of the same size, in terms of the number of genes, for any random query gene list.

FUNDING

This work was supported by the Helmholtz Alliance on Systems Biology (project ‘CoReNe’). Funding for open access charge: Helmholtz Center Munich – German Research Center for Environmental Health (GmbH).

Conflict of interest statement. None declared.

REFERENCES

1. Fernandes TG, Diogo MM, Clark DS, Dordick JS, Cabral JM. High-throughput cellular microarray platforms: applications in drug discovery, toxicology and stem cell research. Trends Biotechnol. 2009;27:342–349.
2. Powell AK, Zhi ZL, Turnbull JE. Saccharide microarrays for high-throughput interrogation of glycan-protein binding interactions. Methods Mol. Biol. 2009;534:313–329.
3. Krallinger M, Valencia A, Hirschman L. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 2008;9(Suppl. 2):S8.
4. Cheng D, Knox C, Young N, Stothard P, Damaraju S, Wishart DS. PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res. 2008;36:W399–W405.
5. Ongenaert M, Van NL, De MT, Menschaert G, Bekaert S, Van CW. PubMeth: a cancer methylation database combining text-mining and expert annotation. Nucleic Acids Res. 2008;36:D842–D846.
6. Fang YC, Huang HC, Juan HF. MeInfoText: associated gene methylation and cancer information from text mining. BMC Bioinformatics. 2008;9:22.
7. Kaur M, Radovanovic A, Essack M, Schaefer U, Maqungo M, Kibler T, Schmeier S, Christoffels A, Narasimhan K, Choolani M, et al. Database for exploration of functional context of genes implicated in ovarian cancer. Nucleic Acids Res. 2009;37:D820–D823.
8. Essack M, Radovanovic A, Schaefer U, Schmeier S, Seshadri SV, Christoffels A, Kaur M, Bajic VB. DDEC: Dragon database of genes implicated in esophageal cancer 2. BMC Cancer. 2009;9:219.
9. Antonov AV, Dietmann S, Wong P, Igor R, Mewes HW. PLIPS, an automatically collected database of protein lists reported by proteomics studies. J. Proteome. Res. 2009;8:1193–1197.
10. Huang dW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37:1–13.
11. Dietmann S, Georgii E, Antonov A, Tsuda K, Mewes HW. The DICS repository: module-assisted analysis of disease-related gene lists. Bioinformatics. 2009;25:830–831.
12. Berriz GF, King OD, Bryant B, Sander C, Roth FP. Characterizing gene sets with FuncAssociate. Bioinformatics. 2003;19:2502–2504.
13. Antonov AV, Schmidt T, Wang Y, Mewes HW. ProfCom: a web tool for profiling the complex functionality of gene groups identified from high-throughput data. Nucleic Acids Res. 2008;36:W347–W351.
14. Antonov AV, Dietmann S, Wong P, Lutter D, Mewes HW. GeneSet2miRNA: finding the signature of cooperative miRNA activities in the gene lists. Nucleic Acids Res. 2009;37:W323–W328.
15. Westfall PN, Young SS. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. New York: John Wiley & Sons Inc; 1993.
16. Antonov AV, Dietmann S, Mewes HW. KEGG spider: interpretation of genomics data in the context of the global gene metabolic network. Genome Biol. 2008;9:R179.
17. Antonov AV, Dietmann S, Rodchenkov I, Mewes HW. PPI spider: a tool for the interpretation of proteomics data in the context of protein-protein interaction networks. Proteomics. 2009;9:2740–2749.
18. Young AR, Narita M, Ferreira M, Kirschner K, Sadaie M, Darot JF, Tavare S, Arakawa S, Shimizu S, Watt FM, et al. Autophagy mediates the mitotic senescence transition. Genes Dev. 2009;23:798–803.
19. Staber PB, Linkesch W, Zauner D, Beham-Schmid C, Guelly C, Schauer S, Sill H, Hoefler G. Common alterations in gene expression and increased proliferation in recurrent acute myeloid leukemia. Oncogene. 2004;23:894–904.
20. Gronborg M, Bunkenborg J, Kristiansen TZ, Jensen ON, Yeo CJ, Hruban RH, Maitra A, Goggins MG, Pandey A. Comprehensive proteomic analysis of human pancreatic juice. J. Proteome. Res. 2004;3:1042–1055.
21. Knoops L, Haas R, de Kemp S, Majoor D, Broeks A, Eldering E, de Boer JP, Verheij M, van Ostrom C, de Vries A, et al. In vivo p53 response and immune reaction underlie highly effective low-dose radiotherapy in follicular lymphoma 1. Blood. 2007;110:1116–1122.
22. Jen KY, Cheung VG. Identification of novel p53 target genes in ionizing radiation response. Cancer Res. 2005;65:7666–7673.
23. Benes P, Vetvicka V, Fusek M. Cathepsin D–many functions of one aspartic protease. Crit Rev. Oncol. Hematol. 2008;68:12–28.
24. Cordes C, Bartling B, Simm A, Afar D, Lautenschlager C, Hansen G, Silber RE, Burdach S, Hofmann HS. Simultaneous expression of Cathepsins B and K in pulmonary adenocarcinomas and squamous cell carcinomas predicts poor recurrence-free and overall survival. Lung Cancer. 2009;64:79–85.
25. Antonov AV. Mining protein lists from proteomics studies: applications for drug discovery. Expert Opin. Drug Discovery. 2010;5:322–331.
26. Fernbach NV, Planyavsky M, Muller A, Breitwieser FP, Colinge J, Rix U, Bennett KL. Acid elution and one-dimensional shotgun analysis on an Orbitrap mass spectrometer: an application to drug affinity chromatography. J. Proteome. Res. 2009;8:4753–4765.
27. Walters DK, Goss VL, Stoffregen EP, Gu TL, Lee K, Nardone J, McGreevey L, Heinrich MC, Deininger MW, Polakiewicz R, et al. Phosphoproteomic analysis of AML cell lines identifies leukemic oncogenes. Leuk. Res. 2006;30:1097–2004.
28. Sticht C, Freier K, Knopfle K, Flechtenmacher C, Pungs S, Hofele C, Hahn M, Joos S, Lichter P. Activation of MAP kinase signaling through ERK5 but not ERK1 expression is associated with lymph node metastases in oral squamous cell carcinoma (OSCC). Neoplasia. 2008;10:462–470.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press