Neoplasia. May 2007; 9(5): 443–454.
PMCID: PMC1877973

Molecular Concepts Analysis Links Tumors, Pathways, Mechanisms, and Drugs


Global molecular profiling of cancers has shown broad utility in delineating pathways and processes underlying disease, in predicting prognosis and response to therapy, and in suggesting novel treatments. To gain further insights from such data, we have integrated and analyzed a comprehensive collection of “molecular concepts” representing > 2500 cancer-related gene expression signatures from Oncomine and manual curation of the literature, drug treatment signatures from the Connectivity Map, target gene sets from genome-scale regulatory motif analyses, and reference gene sets from several gene and protein annotation databases. We performed pairwise association analyses on all 13,364 molecular concepts and identified > 290,000 significant associations, generating hypotheses that link cancer types and subtypes, pathways, mechanisms, and drugs. To navigate this network of associations, we developed an analysis platform, the Molecular Concepts Map. We demonstrate the utility of the approach by highlighting molecular concepts analyses of Myc pathway activation, breast cancer relapse, and retinoic acid treatment.

Keywords: Cancer, bioinformatics, gene expression signature, network, oncomine


Genome-scale molecular analyses are being widely employed to study human biology and disease. For example, hundreds of studies have examined gene expression profiles of human cancers, identifying molecular subtypes associated with cancer progression, response to therapy, and patient outcome [1–6]. The Oncomine database represents a concerted effort to integrate and analyze such data, now including 270 independent studies comprising nearly 20,000 microarray experiments. In addition to the profiling of tumors and other disease specimens, a number of studies have profiled a wide variety of biologic perturbations in cell lines, including genetic modifications and drug treatments. For example, a series of experiments characterized the transient activation of various known oncogenes in mammary epithelial cells and defined gene expression signatures capable of predicting pathway activation in human tumors and sensitivity to specific inhibitors in cell lines [7]. Another genome-scale analysis, the Connectivity Map, examined hundreds of compound treatment gene expression profiles and showed that such profiles could be used in a screen to identify compounds capable of reversing a gene expression program active in disease [8]. Although gene expression studies are the predominant type of genome-scale molecular analyses to date, other high-throughput experimental modalities include proteomic profiling, transcription factor binding analysis, epigenetic profiling, and sequence-based analyses. In addition, several systematic annotation efforts have provided a variety of valuable genome-scale characterizations.

A limitation of genome-scale analyses to date is that they are often carried out in isolation, with the end product being a list of genes or proteins and a functional commentary. It may very well be that disparate analyses have identified common genes, proteins, and pathways, suggesting unknown and perhaps unexpected functional relationships. For example, one study may activate a transcription factor in cell lines and measure target genes; another study may investigate a protein complex using mass spectrometry; a third study may characterize genes activated in a subtype of disease; and a fourth study may identify genes deregulated by drug treatment. One can imagine a scenario in which all four studies were unknowingly biologically related, leading to the identification of overlapping genes and proteins. However, by today's publishing convention, it would be unlikely that the respective investigators or the research community at large would make these associations. With an integrative analysis platform coupled with a dedicated curation effort, one may be able to make these important functional connections and thus derive unexpected biologic insight; in this hypothetical scenario, that would mean identifying relationships among a transcriptional program, a protein complex, a disease subpopulation, and drug treatment.

Along these lines, several tools are available to compare a query gene list to a reference set of gene lists. For example, Gene Set Enrichment Analysis allows one to compare a query signature to a variety of gene sets based on pathways, Gene Ontology terms, regulatory motifs, chromosomal regions, and perturbation experiments [9]. Another tool, L2L, compares a query gene list with differential expression gene lists published in the literature [10]. In addition, the Connectivity Map permits the analysis of a query signature against a database of drug treatment signatures [8]. These approaches have been applied successfully and demonstrate the utility of gene sets as a common language to compare heterogeneous biologic concepts.

Here, we have applied and extended the notion of comparing diverse biologic concepts represented by molecular signatures (i.e., sets, lists, and so on). We integrated a comprehensive collection of signatures relevant to cancer research and performed all-versus-all association analyses. Unlike approaches that analyze a single target signature across a set of reference signatures, our approach computes a network of associations across all available signatures representing hundreds of cancer types and subtypes, biologic perturbations, drug treatments, manually annotated pathways and protein interaction networks, predicted regulatory networks, gene ontologies, and protein families. This association analysis builds on Oncomine, the Connectivity Map, and the work of profiling and annotation communities to systematically connect cancer types and subtypes, pathways, mechanisms, and drugs. We designated this project as the Molecular Concepts Map (MCM) because the focus of the project is on biologic concepts represented by molecular signatures.

Materials and Methods

Molecular Concepts Data Collection

Sets of biologically related genes were collected or derived from 503 microarray studies and 12 external databases. All identifiers were mapped to Entrez Gene IDs for analysis. For each molecular concept, a null set was defined as the set of all genes measured or considered in defining the concept. For example, null sets for microarray-based concepts were defined as all genes measured on a microarray platform, whereas null sets for Gene Ontology-based concepts were defined as all genes with at least one Gene Ontology annotation.

Oncomine Data

Cancer signatures were derived from differential expression analyses that compared two logical groupings of normal or malignant human tissue or cell lines as defined by the Oncomine Cancer Microarray Database (http://www.oncomine.org) [11]. In total, data from ~18,000 microarrays from 270 independent studies were used in this analysis. From Oncomine, we downloaded gene lists rank-ordered by P value (Student's t test) from 1192 differential expression analyses. We defined gene signatures as the top 1%, 5%, and 10% of overexpressed or underexpressed genes from each analysis. We selected multiple cutoffs to allow for variability in the optimal association cutoff. Only the most significant of the three cutoffs is reported.
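The percentile cutoffs described above can be illustrated with a minimal sketch. It assumes a gene list already ordered from most to least significant; the function name, the integer rounding rule, and the gene identifiers are hypothetical and are not taken from the Oncomine pipeline.

```python
def top_percent_signatures(ranked_genes, percents=(1, 5, 10)):
    """Return {percent: set of genes} for the top slice of a ranked gene list.

    ranked_genes is assumed to be ordered from most to least significant.
    """
    n = len(ranked_genes)
    signatures = {}
    for pct in percents:
        # Assumption: round down, but always keep at least one gene.
        k = max(1, n * pct // 100)
        signatures[pct] = set(ranked_genes[:k])
    return signatures


# 200 hypothetical genes, most significant first.
ranked = [f"GENE{i}" for i in range(1, 201)]
sigs = top_percent_signatures(ranked)
```

Because each cutoff takes a prefix of the same ranking, the three signatures are nested, which is why reporting only the most significant of the three cutoffs loses little information.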

Connectivity Map Data

Drug overexpression and underexpression signatures were derived from the Connectivity Map dataset [8]. The dataset was normalized as described [11], except that normalized expression values of < −0.5 were set to −0.5. Each compound treatment experiment was compared to the appropriate control experiment(s) based on the assigned batch number. When multiple replicates were available, expression values were averaged. Genes that did not have a normalized expression value of > 0.0 in either treatment or control experiments were filtered out. Genes were then rank-ordered by overexpression and underexpression in treatment versus control, and the top 1% and 5% of overexpressed and underexpressed genes were assigned to molecular concepts.
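The processing steps in this paragraph can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: profiles are modeled as dictionaries of already normalized values, flooring is applied before averaging, and genes are ranked by the difference between averaged treatment and control values (the text does not specify the ranking statistic). All names and numbers are hypothetical.

```python
from statistics import mean

FLOOR = -0.5  # normalized values below this are set to it


def floor_and_average(replicates):
    """Average replicate profiles gene by gene after flooring at FLOOR."""
    genes = replicates[0].keys()
    return {g: mean(max(r[g], FLOOR) for r in replicates) for g in genes}


def drug_signatures(treatment_reps, control_reps, percents=(1, 5)):
    treated = floor_and_average(treatment_reps)
    control = floor_and_average(control_reps)
    # Keep genes with a value > 0.0 in treatment or control.
    kept = [g for g in treated if treated[g] > 0.0 or control[g] > 0.0]
    # Assumption: rank by treatment minus control, largest increase first.
    by_change = sorted(kept, key=lambda g: treated[g] - control[g], reverse=True)
    signatures = {}
    for pct in percents:
        k = max(1, len(by_change) * pct // 100)
        signatures[("over", pct)] = set(by_change[:k])
        signatures[("under", pct)] = set(by_change[-k:])
    return signatures


# Toy data: two treatment replicates and one control for four genes.
treatment = [
    {"A": 2.0, "B": -1.0, "C": 0.5, "D": -2.0},
    {"A": 1.0, "B": -0.8, "C": 0.5, "D": -3.0},
]
control = [{"A": 0.2, "B": 1.0, "C": 0.5, "D": -1.0}]
sigs = drug_signatures(treatment, control)
```

In the toy data, gene D is never above 0.0 and is dropped, A is the most induced gene, and B is the most repressed.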

Additional Data Sources

Chromosome arm and cytoband mappings were downloaded from the National Center for Biotechnology Information (NCBI) map viewer (http://www.ncbi.nlm.nih.gov/mapview/). Biologic processes, molecular functions, and cellular component annotations from the Gene Ontology Consortium (http://www.geneontology.org/) [12] were downloaded from Entrez Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene). Metabolic pathways were downloaded from the Kyoto Encyclopedia of Genes and Genomes (KEGG; http://www.genome.jp/kegg/) [13]. Biocarta signaling pathways were downloaded from the Biocarta web site (http://www.biocarta.com/). Protein domains and family assignments were downloaded from InterPro (http://www.ebi.ac.uk/interpro/) [14]. Protein-protein interaction sets were downloaded from the Human Protein Reference Database (HPRD; http://www.hprd.org/) [15]. Literature-defined concepts were collected from 207 peer-reviewed publications that applied Affymetrix (Santa Clara, CA) arrays to study the transcriptional effects of an experimental perturbation such as drug treatment or candidate gene activation.

Transcription Regulation Data

TRANSFAC transcription factor motifs were defined by scanning all human gene promoter sequences for the presence of 361 experimentally defined transcription factor binding sites [16]. One-kilobase promoter sequences from 20,647 RefSeqs were downloaded from the UCSC genome browser (http://hgdownload.cse.ucsc.edu/goldenPath/hg17/bigZips/) in August 2004. Sequences were sequentially submitted to Match, a component of the TRANSFAC Professional Suite that scans a sequence for the presence of transcription factor binding sites, as determined by a database of position weight matrices. The hit list was filtered to contain only the top 2000 hits per matrix sorted by the matrix similarity score. Conserved promoter motifs and conserved 3′ untranslated region motifs were defined by a comparative genomics analysis that identified conserved motifs across four mammalian organisms [17]. Predicted microRNA target genes were downloaded from picTar (http://pictar.bio.nyu.edu/), a resource that applies a comparative genomics algorithm to identify putative miRNA target gene sets [18].
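The per-matrix filtering step can be sketched as below. The tuple layout, function name, and identifiers are hypothetical; the sketch does not reproduce the output format of the TRANSFAC tools, only the idea of keeping the best-scoring hits for each matrix.

```python
from collections import defaultdict


def top_hits_per_matrix(hits, limit=2000):
    """Keep the `limit` best-scoring hits per matrix and return target gene sets.

    hits: iterable of (matrix_id, gene_id, similarity_score) tuples.
    """
    by_matrix = defaultdict(list)
    for matrix_id, gene_id, score in hits:
        by_matrix[matrix_id].append((score, gene_id))
    targets = {}
    for matrix_id, scored in by_matrix.items():
        scored.sort(reverse=True)  # highest similarity score first
        targets[matrix_id] = {gene for _, gene in scored[:limit]}
    return targets


# Toy example with a limit of 2 instead of 2000.
hits = [
    ("M1", "g1", 0.99),
    ("M1", "g2", 0.95),
    ("M1", "g3", 0.90),
    ("M2", "g1", 0.80),
]
targets = top_hits_per_matrix(hits, limit=2)
```

A gene with several hits for the same matrix appears only once in the resulting set, so a target gene set can contain fewer genes than the hit limit.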

Data Analysis

To carry out molecular concepts analysis, each pair of molecular concepts was tested for association using Fisher's exact test. Results were stored if a given test had an odds ratio of > 1.25 and P < .01. P values below 1e−100 were set to 1e−100. All concept associations presented in the manuscript and supplementary materials represent a subset of statistically significant associations (P < 1e−6). A complete set of significant concept associations is available from the MCM (http://www.oncomine.org).
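A minimal sketch of one pairwise test follows. Several choices here are assumptions that the text does not state: the gene universe is taken as the intersection of the two null sets, the test is one-sided (over-representation of the overlap), and the odds ratio is the sample odds ratio of the 2×2 table. The function names are hypothetical, and the exact arithmetic with `math.comb` is chosen for clarity, not speed.

```python
from math import comb


def concept_association(concept_a, concept_b, null_a, null_b):
    """Return (odds_ratio, p_value) for the overlap of two gene sets."""
    universe = null_a & null_b          # assumption: shared measurable genes
    a_set = concept_a & universe
    b_set = concept_b & universe
    n_total = len(universe)
    overlap = len(a_set & b_set)
    only_a = len(a_set) - overlap
    only_b = len(b_set) - overlap
    neither = n_total - overlap - only_a - only_b

    if only_a * only_b == 0:
        odds_ratio = float("inf") if overlap * neither > 0 else float("nan")
    else:
        odds_ratio = overlap * neither / (only_a * only_b)

    # One-sided Fisher's exact test: hypergeometric upper tail P(X >= overlap).
    upper = min(len(a_set), len(b_set))
    tail = sum(
        comb(len(a_set), i) * comb(n_total - len(a_set), len(b_set) - i)
        for i in range(overlap, upper + 1)
    )
    p_value = tail / comb(n_total, len(b_set))
    return odds_ratio, max(p_value, 1e-100)  # floor as described in the text


def is_stored(odds_ratio, p_value):
    """Storage rule from the text: odds ratio > 1.25 and P < .01."""
    return odds_ratio > 1.25 and p_value < 0.01


# Toy example: 20 genes, two 5-gene concepts sharing 4 genes.
universe = set(range(1, 21))
odds_ratio, p_value = concept_association(
    {1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, universe, universe
)
stored = is_stored(odds_ratio, p_value)
```

For the toy sets, the 2×2 table is 4, 1, 1, 14, and the upper tail is (5·15 + 1)/15504, which passes the storage rule.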

Results and Discussion

Data Collection and Primary Analysis

We defined a “molecular concept” as any biologic concept (e.g., disease, drug treatment, pathway, regulatory mechanism, and so on) represented by a molecular signature (i.e., a collection of genes or proteins). For example, Gene Ontology ascribes 241 genes to the apoptosis process; InterPro names 16 proteins to the chemokine receptor family; a comparative genomics study identified 1188 genes with conserved promoter motifs corresponding to Myc binding sites; and an Oncomine analysis identified 410 genes overexpressed in BRCA1 mutant ovarian cancer.

Here we attempted to collect all molecular concepts in the biomedical knowledge space with relevance for cancer research. We began by deriving gene signatures from Oncomine (http://www.oncomine.org) [19], a cancer gene expression database that includes data and differential expression analyses from 270 independent profiling studies, comprising nearly 20,000 microarray experiments that profile normal and malignant human tissue and cell lines. We derived gene expression signatures from > 1000 differential expression analyses as top-ranking overexpressed and underexpressed genes using three percentile cutoffs (1%, 5%, and 10%). Gene signatures were defined for normal tissue types, cancer types relative to normal tissue, and cancer subtypes based on a variety of clinical, pathological, and molecular attributes, such as mutation status, treatment response, and patient outcome, among others. In addition, gene signatures were computed for cell line analyses based on factors such as drug sensitivity and pathway activation.

To supplement Oncomine with additional gene expression-based molecular concepts, we initiated a manual curation effort of published gene signatures related to human biology. This effort amassed an additional 465 concepts, including signatures of various cellular processes, biologic perturbations, and drug treatments. We also included signatures of drug response from the Connectivity Map [8] comprising 379 compound treatment experiments spanning 184 unique compounds, many of which are drugs approved by the Food and Drug Administration. To complement this dataset dominated by gene expression data, we included gene sets derived from several reference databases compiling molecular concepts based on chromosomal locations, protein domains and families [14], molecular functions, cellular localizations, biologic processes [12], signaling and metabolic pathways [13], protein-protein interaction networks [15], and protein complexes [20]. In addition, we collected data on transcriptional regulation in the form of putative transcription factor target gene sets derived by scanning human promoters for known transcription factor motifs [21] and by comparative genomics analyses that identified conserved promoter and 3′ UTR elements [17,18], many of which correspond to known transcription factor and miRNA binding sites, respectively. In total, 13,364 molecular concepts from 12 databases and 503 microarray studies were collected and analyzed. Table 1 summarizes the data collected or derived of each type, and Figure 1 depicts the integration and analysis of such data in the MCM.

Figure 1
The MCM project. (A) Molecular concepts or biologically related gene sets were collected, standardized, and stored in the MCM database. Molecular concepts were derived from 503 independent microarray studies comprising > 20,000 microarray experiments, ...
Table 1
Molecular Concept Types.

Molecular concepts were stored in a relational schema that stratifies concepts by concept type and associates concepts with Entrez Gene identifiers. In addition, for statistical analysis, each concept was associated with a “null set,” which represents the full set of genes from which the concept genes were defined. For each pair of concepts (n = 89,291,566), we counted the number of genes present in both concepts and assessed the significance of overlap with Fisher's exact test, which is simple yet highly scalable. We stored all significant (P < .01) associations but only considered highly significant (P < 1e−6) associations for the analyses presented here. At this conservative threshold, we found 292,139 concept links where, by chance, we would only expect to find 89 links. As depicted in Figure 1C, significant concept links were observed between nearly all pairs of concept types, although the majority of significant concept links were among pairs of Oncomine signatures and pairs of Connectivity Map drug signatures. This is not surprising because both datasets are known to contain large sets of highly related signatures. To navigate the network of concept associations, we developed the MCM web application (http://www.oncomine.org). Users can search concepts in the database by keyword or MCM ID, query significant concept links, and visualize networks of concept relationships. In addition, new concepts can be uploaded for rapid analysis against the MCM database.
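The pair count and the chance expectation quoted in this paragraph can be checked with a few lines of arithmetic. The sketch assumes that, under the null hypothesis, each test falls below the threshold with probability equal to the threshold itself, so the expected number of chance links is simply the number of pairs multiplied by 1e−6; the variable names are illustrative.

```python
from math import comb

n_concepts = 13364
n_pairs = comb(n_concepts, 2)          # all unordered pairs of concepts
threshold = 1e-6                       # P value cutoff used for reported links
expected_by_chance = n_pairs * threshold
observed_links = 292139
fold_over_chance = observed_links / expected_by_chance
```

The number of unordered pairs matches the n reported in the text, and the expectation rounds to the 89 chance links cited, so the observed links exceed the chance expectation by more than three orders of magnitude.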

To demonstrate the utility of the MCM approach and a cross section of results, we undertook a series of case studies, beginning with different types of molecular signatures with relevance to cancer research and exploring networks of concept links identified by our analysis. The purposes of the case studies were to illustrate the breadth of results generated by MCM analysis, to demonstrate the usefulness of the analysis by highlighting known or expected results that were obtained objectively, and to suggest new hypotheses relating cancer types and subtypes, pathways, mechanisms, and drugs. Detailed results from the case studies, including MCM IDs, odds ratios, and P values, are provided in supplementary materials. All associations presented were within the P value threshold of 1e−6.

Oncogenic Pathway Signatures

The strategy of using genomic signatures as a surrogate for oncogenic pathway activity has been previously demonstrated by Bild et al. [7]. Here we sought to apply molecular concepts analysis to extend the analysis of oncogenic pathways to all cancer types and subtypes represented in Oncomine. In addition, we sought to explore the processes and networks deregulated by specific pathways and to investigate small molecules capable of affecting pathway activation. Although several signatures of oncogenic pathway activation were analyzed in the MCM, including Ras, E2F, β-catenin, Src, p53, and NF-κB, among others, we selected the Myc pathway to illustrate the approach and to gain insights into Myc signaling in cancer.

Myc is an oncogenic transcription factor that is capable of regulating a variety of cellular processes, deregulated in a wide range of human cancers, and often associated with aggressive, poorly differentiated tumors [22]. We began our analysis with a signature of 940 genes (top 5%) activated by c-Myc when transiently overexpressed in human mammary epithelial cells [7]. We first sought to validate the signature by examining associations with other molecular concepts related to Myc. Not surprisingly, the Myc signature shared significant overlap with an independent Myc signature generated in MCF-10 breast cancer cells [23] and a Myc signature measured in B cells [24], confirming that a common Myc molecular program is detectable across independent cell types. We also observed a significant overlap between the Myc signature and genes with putative Myc binding sites in their promoters, confirming that the signature consists, at least in part, of direct Myc target genes. Of note, no other molecular concepts of transcription factor putative target genes scored higher.

Confident that the Myc signature is valid, we sought to discover or validate cancer types and subtypes in which the Myc program is most significantly activated. We selected 25 of the top-scoring molecular concepts [odds ratio (OR) > 2; P < 1e−8] for further analysis and discussion (Table W1). Figure 2A displays a concept map seeded with the Myc signature and selected linked concepts. The most significantly linked Oncomine signature, considering > 2000 signatures, was a signature for IgG-Myc mutation lymphoma [25] (OR = 7; P < 1e−100) in which Myc is the known oncogenic mutation. It is noteworthy that a Myc signature generated artificially in human mammary epithelial cells can differentiate in vivo lymphomas by Myc status (Figure 1B). The MCM also included a signature of N-Myc amplification-positive neuroblastoma [26], which was significantly related to the query Myc signature as well (OR = 4; P < 3.7e−9). In addition to these malignancies with obvious Myc involvement, the MCM identified several additional cancer types with coordinate activation of the Myc program, including metastatic prostate cancer (versus localized), grade III breast cancer (versus grades I and II), colon adenocarcinoma (versus normal colon), FLT3 internal tandem duplication acute myeloid leukemia (AML; versus FLT3 wild type), acute lymphoblastic leukemia with mixed lineage leukemia (MLL) AF4 translocation (versus other aberrations), chronic myeloid leukemia undergoing blast crisis (versus chronic phase), and plasma cell leukemia (versus other B-cell malignancies), among others (Figure 2, Table W1). As depicted in Figure 2A, several of these associations were validated by signatures from independent datasets. For example, four independent signatures of grade III breast cancer were highly linked to Myc activation, as well as to one another, confirming the importance of the Myc pathway in this subset of breast tumors.
This observation is consistent with previous reports that Myc activation through either gene amplification or overexpression of a Myc-stabilizing protein is highly associated with grade III breast cancer relative to grade I and grade II diseases [27]. Similarly, Myc has been previously linked to prostate cancer progression and metastatic disease [28], consistent with our observation of Myc pathway activation in several independent metastatic prostate cancer signatures (Figure 2, A and B). Lastly, our observed link between Myc activation and blast crisis in chronic myeloid leukemia is also consistent with previous reports [29]. It is important to note that although previous work has suggested Myc involvement in these malignancies, our analysis provides additional evidence that the Myc program is activated, as evidenced by the coordinate overexpression of target genes. In addition to validating cancer types and subtypes with Myc pathway activation, our analysis generated several new hypotheses linking Myc activity to specific disease subpopulations, such as myeloid leukemias with FLT3 mutations and lymphoblastic leukemias with MLL-AF4 translocations (Table W1).

Figure 2
Molecular concepts analysis of the Myc pathway activation signature, consisting of the top 5% of genes most overexpressed in Myc transfection in human mammary epithelial cells [24]. (A) A molecular concepts map of the Myc signature (red node) and selected ...

Next, we sought to identify small molecules capable of reversing the Myc expression program in vitro, with the goal of suggesting treatment strategies for tumors with Myc pathway activation. Thus, we focused our attention on drug signatures from the Connectivity Map resource. We identified four highly significant links between drug treatments and Myc pathway activation (OR > 3; P < 1e−15). All four consisted of genes repressed by treatment with one of two PI3K inhibitors (LY-294002 and wortmannin) in MCF-7 breast cancer cells or HL-60 leukemia cells, suggesting that blocking PI3K signaling represses the Myc program more so than any other compound represented in the Connectivity Map (Figure 2, A and C). We considered that the repression of the Myc program may be mediated by the downregulation of c-Myc itself, but on the contrary, we found that Myc expression was not affected by PI3K inhibition in these experiments. Thus, we hypothesized that other signaling events downstream of PI3K may be necessary for the activation of the Myc transcriptional program. Supporting this hypothesis, previous work has shown that AKT-mediated phosphorylation of FOXO proteins is required for Myc induction of proliferation and transformation and that PI3K inhibition with LY-294002 leads to repression of Myc target genes [30]. MCM analysis shows that PI3K inhibition represses a large fraction of the Myc transcriptional program, further suggesting PI3K inhibition as a strategy for treating Myc-driven malignancies. Notably, the associations hold up across cellular backgrounds and independent PI3K inhibitors.

Similar results have been computed for additional cancer-related pathways, including Ras, Src, β-catenin, E2F, NF-κB, and p53, and are available online (http://www.molecularconcepts.org). For example, similar to the results for the Myc activation signature, the Ras activation signature was linked to a number of cancer types and subtypes, several of which have known Ras pathway involvement, including N-Ras mutant melanoma (versus B-Raf and wild type), K-Ras mutant lung adenocarcinoma (versus wild type), and K-Ras mutant AML (versus wild type) (Table W2). Additional links suggest that the Ras pathway is activated in cell lines resistant to several common cytotoxic agents, is relatively more active in specific solid tumor types (such as bladder, colon, and cervical tumors), and is less active in tumor types such as prostate and breast (Table W2).

Disease Signatures: Relapse-Positive Breast Cancer

Next, we investigated the ability of the MCM to deconstruct complex disease signatures consisting of hundreds of genes differentially expressed in a given disease state relative to normal tissue or other disease states. To illustrate this analysis, we considered gene expression signatures of relapse in estrogen receptor-positive (ER+) and estrogen receptor-negative (ER−) breast cancer, as metastatic relapse following surgery is the principal cause of mortality. We considered ER+ and ER− tumors as separate disease entities and sought to compare and contrast molecular concepts associated with relapse in each. Both signatures consisted of the top 5% of genes overexpressed in tumors that relapsed within 5 years of surgery relative to tumors that did not relapse [6]. Previous studies have examined genes associated with breast cancer relapse, identifying associations with fibroblast serum response [31] and cell cycle program [32]. We anticipated that our analysis should recapitulate these results and identify additional molecular concepts related to relapse signatures.

First, we examined molecular concepts associated with the ER+ relapse signature. Several hundred concepts were highly linked, of which a subset is presented here (Table W3). Confirming the validity of the signature, we observed that some of the most significant associations were with breast cancer relapse signatures generated from independent datasets. In addition, the ER+ relapse signature was highly associated with several additional high-grade/poorly differentiated cancer signatures, suggesting that the ER+ relapse signature is not unique to breast cancer. With respect to pathway signatures, the ER+ relapse signature showed evidence of E2F and Myc pathway activation, consistent with previous work [23]. We also observed an enrichment of E2F promoter binding sites in the ER+ relapse signature, further confirming the activation of this pathway. Similar to previous reports, we observed an association with fibroblast serum response and cell cycle signatures [33]. In addition, several ontological concepts showed strong associations, including mitosis, DNA repair, and chromosome organization, as well as the histone core and ATPase protein domain concepts—all consistent with increased proliferation in ER+ relapse breast tumors. Interestingly, several specific chromosomal cytobands showed significant associations with the ER+ relapse signature, including 17q25, 8q24, and 20q13, signifying chromosomal aberrations that likely contribute to the expression of the relapse signature. In fact, all three regions have been documented to be amplified in a subset of breast tumors [34]. Lastly, a number of biologic perturbation concepts showed strong associations with the ER+ relapse signature, including hypoxia, radiation toxicity, treatment with a methylase inhibitor, or treatment with selective ER modulators. 
Finally, a number of Connectivity Map signatures linked small molecules with the ER+ relapse signature, identifying compounds that repress the expression of the relapse signature. The most significant compound treatment, resveratrol, is known to inhibit the cell cycle [35], consistent with our observation that the cell cycle program is activated in the ER+ relapse signature and is downregulated by resveratrol. Other agents scoring highly include geldanamycin (a heat shock protein inhibitor) and derivatives, as well as PI3K inhibitors, consistent with our previous observation linking Myc pathway activation and PI3K inhibition. Although preliminary, these results suggest that resveratrol and PI3K inhibitors, perhaps in addition to cytotoxic agents, should be considered as adjuvant therapy for ER+ breast cancers expressing the relapse signature. Figure 3A depicts a molecular concepts map seeded with the ER+ relapse signature, demonstrating the interrelatedness of identified molecular concepts. Interestingly, the Myc and E2F pathway activation concepts were not linked to one another and were linked to separate molecular concepts, suggesting that different components of the relapse signature are controlled by different oncogenic pathways.

Figure 3
Molecular concepts analysis of an ER+ breast cancer relapse signature (MCM: 122567) consisting of the top 5% of genes most overexpressed in ER+ tumors from patients who relapsed within 5 years relative to those who did not. (A) A molecular concepts map ...

Next, we investigated the ER− relapse signature. It should be noted that although the ER+ relapse signature consisted of genes significantly overexpressed (Q < 0.05) in relapse tumors, the genes in the ER− relapse signature did not reach statistical significance (Q < 0.05), even though the collection of genes assigned to the ER− relapse signature did have a number of significant associations in molecular concepts analysis. To our surprise, none of the concepts associated with the ER+ relapse signature was also associated with the ER− relapse signature, suggesting that distinct pathways mediate cancer progression in ER+ and ER− breast cancer (Table W4). Notably, the ER− relapse signature was significantly associated with independent Her2/neu+ breast cancer signatures, suggesting that the Her2/neu pathway or related pathways may drive progression in ER− tumors. Consistent with this observation, Her2/neu was found to be significantly overexpressed in ER− relapse tumors, although not in all ER− relapse tumors, suggesting alternative pathways that activate a Her2/neu-like program. Unlike the ER+ relapse signature that showed evidence for the activation of the Myc and E2F pathways, the ER− relapse signature was associated most significantly with the c-Src pathway, in addition to the Ras pathway. In addition, of note, the ER− relapse signature was associated with overexpression of genes localized to chromosome Xp11, but not to cytobands detected in the ER+ relapse signature. Finally, with respect to drugs that may be able to reverse the ER− relapse signature, only one drug signature was significantly associated: valproic acid, a histone deacetylase inhibitor. We observed that a significant fraction of genes overexpressed in the ER− relapse signature were repressed in PC3 cells treated with valproic acid, suggesting a potential therapeutic strategy for ER− tumors expressing the relapse signature.
In summary, molecular concepts analysis demonstrated that breast cancer progression is mediated by distinct pathways and mechanisms in ER+ and ER− tumors and, importantly, suggested hypotheses regarding therapeutic strategies that may be best suited to reverse the respective relapse signatures.

In a parallel analysis of prostate cancer progression [36], we applied molecular concepts analysis to interpret progression signatures generated from laser capture-microdissected prostate cells. As described, we obtained gene expression profiles spanning the complete spectrum of prostate cancer progression, including benign prostatic epithelial and stromal cells, prostatic intraepithelial neoplasia (PIN), low-grade (Gleason pattern 3) prostate cancer, high-grade (Gleason pattern 4) prostate cancer, and metastatic disease. Molecular concepts analysis of transition states led to a number of insights. For example, the PIN signature was linked to protein biosynthesis and ETS promoter motifs, whereas the transition from low-grade to high-grade prostate cancer was linked to downregulated androgen signaling, and the transition to metastatic disease was linked to cell cycle activation and overexpression of genes on chromosome 8q. This analysis further demonstrates the ability of a diverse collection of molecular concepts to elucidate biologic mechanisms underlying complex disease signatures.

Drug Signatures

Next, we explored the notion of matching gene signatures to identify drugs that are capable of reversing a gene expression program and thus treating a particular condition [7,37]. The approach proposed by Lamb et al. [8] begins with a disease signature and compares it against a reference database of drug signatures to identify drugs that may be able to reverse the disease signature. Here, we investigate the converse experiment, that is, beginning with a drug signature (e.g., genes repressed by drug treatment) and searching for disease types or subtypes that show coordinate activation of the signature, suggesting novel (or expected) indications for the drug.
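The converse search described above can be sketched in a few lines. This is only an illustration, not the implementation behind the Molecular Concepts Map: it assumes that the association between two signatures is scored by a one-sided hypergeometric (Fisher-type) overlap test, which this section does not specify, and the function names, gene identifiers, and signature labels are all hypothetical.

```python
from math import comb


def overlap_p_value(sig_a, sig_b, universe_size):
    """One-sided hypergeometric P(overlap >= observed) for two gene sets.

    Assumption: every gene in both sets belongs to a shared universe of
    `universe_size` measured genes.
    """
    a, b = set(sig_a), set(sig_b)
    k = len(a & b)
    n_a, n_b = len(a), len(b)
    total = comb(universe_size, n_b)
    tail = sum(
        comb(n_a, i) * comb(universe_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return tail / total


def rank_diseases(drug_repressed, disease_overexpressed, universe_size):
    """Rank disease signatures by their overlap with genes the drug represses."""
    scored = [
        (name, overlap_p_value(drug_repressed, genes, universe_size))
        for name, genes in disease_overexpressed.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Toy data: gene names and signatures are invented for illustration only.
drug_repressed = {"G1", "G2", "G3", "G4"}
disease_overexpressed = {
    "subtype_A": {"G1", "G2", "G3", "G9"},
    "subtype_B": {"G7", "G8", "G9", "G10"},
}
ranking = rank_diseases(drug_repressed, disease_overexpressed, universe_size=20)
for name, p in ranking:
    print(name, p)
```

In this toy case the subtype that overexpresses three of the four drug-repressed genes ranks first, which mirrors the logic of prioritizing a disease population from a drug signature. A real analysis would also need a multiple-testing correction across all disease signatures scanned.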

To test this approach, we considered a drug signature consisting of genes repressed in acute promyelocytic leukemia (APL) cells treated with retinoic acid in vitro [38]. Because retinoic acid is a potent differentiating agent and is uniquely effective in treating APL owing to an activating retinoic acid receptor (RAR) translocation, we hypothesized that genes repressed by retinoic acid in vitro should be overexpressed in APL cases but not in other AML subtypes that lack the RAR translocation. As anticipated, several independent APL signatures, derived from analyses of M3 AML clinical specimens (i.e., APL) compared to other AML subtypes, showed a highly significant association with the retinoic acid signature (Table W5), in fact stronger than that of any other Oncomine cancer signature. This suggests that, had retinoic acid not been known to be an effective treatment for APL, our analysis would have correctly prioritized APL as the optimal disease population for retinoic acid treatment, provided that the drug treatment experiment was performed in the appropriate cellular context. Interestingly, the retinoic acid repression signature also showed a strong association with independent MLL signatures and a renal cell carcinoma signature, among others (Table W5). These associations suggest potential additional indications for retinoic acid, both of which are supported by preliminary experimental evidence [39,40]. In addition to these cancer signature associations, the repressed-by-retinoic-acid signature was significantly linked to five drug treatment signatures from the Connectivity Map, including two signatures of genes repressed by tretinoin (i.e., retinoic acid) in MCF-7 and HL-60 cells. These associations demonstrate cell type-independent gene expression changes caused by retinoic acid inhibition. 
As displayed in the molecular concepts map centered on the APL retinoic acid repression signature (Figure 4), the MCF-7 tretinoin signature is also linked to two of three APL signatures from clinical specimens, demonstrating an objective association of a drug signature with a disease subpopulation in which treatment is known to be effective, irrespective of cell type. This analysis further supports the notion of mapping drug signatures with disease signatures and suggests that such an analysis can begin with a drug signature to scan a database of disease signatures to identify optimal disease populations. Analogous associations have been computed and are available online for a number of other drug treatment experiments collected from the literature and from the Connectivity Map resource, suggesting cancer types and subtypes that might be responsive to drug treatment. Table W6 lists a sample of highly significant associations between drugs and cancer types.
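A map centered on one signature can be thought of as the neighborhood of a node in an association network. The sketch below is a hypothetical illustration of that idea rather than the Molecular Concepts Map code: it assumes associations arrive as (concept, concept, Q-value) rows and that a Q < 0.05 cutoff defines a link, and the concept labels and Q-values are invented.

```python
from collections import defaultdict


def build_concept_map(associations, q_cutoff=0.05):
    """Build an undirected adjacency map from (concept_a, concept_b, q) rows."""
    graph = defaultdict(set)
    for a, b, q in associations:
        if q < q_cutoff:  # keep only associations passing the assumed cutoff
            graph[a].add(b)
            graph[b].add(a)
    return graph


def neighborhood(graph, center):
    """Return the concepts linked to `center` and the links among them."""
    neighbors = graph.get(center, set())
    links = {
        tuple(sorted((u, v)))
        for u in neighbors
        for v in graph[u]
        if v in neighbors
    }
    return neighbors, links


# Toy rows: labels and Q-values are invented for illustration only.
associations = [
    ("center", "disease_1", 0.001),
    ("center", "disease_2", 0.01),
    ("center", "drug_other_cells", 0.002),
    ("drug_other_cells", "disease_1", 0.03),
    ("drug_other_cells", "disease_2", 0.2),
    ("center", "unrelated", 0.5),
]
graph = build_concept_map(associations)
neighbors, links = neighborhood(graph, "center")
print(sorted(neighbors))
print(sorted(links))
```

Links among the neighbors are what make such a map informative: here a second drug signature that is independently linked to one of the disease signatures shows up as an edge between two neighbors of the central node.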

Figure 4
Molecular concepts analysis of a retinoic acid signature, consisting of 454 genes repressed in APL cells treated with retinoic acid. (A) A molecular concepts map of the retinoic acid signature (yellow node) and selected significantly linked concepts. ...

In summary, we have assembled a collection of gene signatures representing cancer types and subtypes, pathways, mechanisms, and drugs, and we have performed a large-scale association analysis comparing each pair of signatures. In addition, we have developed the MCM to store and navigate the signatures and their associations. Although this effort represents the beginning of an ambitious initiative to assimilate and integrate all molecular concepts into the biologic knowledge space, the results presented here demonstrate the ability of the approach to objectively validate hypotheses and to generate new hypotheses across diverse datasets and databases. For example, Myc analysis, representing one cross section of the results, validated the Myc signature against independent Myc signatures, confirmed Myc pathway activation in tumor subtypes with known Myc involvement, and suggested additional tumor types and subtypes with Myc activation. In addition, the analysis identified a class of compounds capable of indirectly repressing the Myc pathway in vitro, suggesting a therapeutic strategy for treating Myc-driven tumors. Notably, all of these associations were made from data deposited in the public domain, made possible by a concerted data curation effort coupled with a simple association analysis. Our analysis platform is simple and highly extensible, and we anticipate that it will serve as a powerful framework for integrating disparate molecular concepts and for generating new hypotheses and biologic insight.
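Comparing every pair of signatures produces a very large number of tests, so the per-pair P-values need to be adjusted before associations are called significant. The following sketch shows one way this could be done, assuming a Benjamini-Hochberg adjustment; the Q-values quoted in this article may have been computed with a different procedure, and the concept names and P-values below are invented.

```python
from itertools import combinations
from math import comb


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Number of unordered pairs among the 13,364 concepts cited in the abstract.
n_pairs = comb(13364, 2)

# Toy example: four hypothetical concepts and invented per-pair P-values.
concepts = ["sig_A", "sig_B", "sig_C", "sig_D"]
pairs = list(combinations(concepts, 2))
toy_p = [0.001, 0.2, 0.04, 0.03, 0.6, 0.01]
q = benjamini_hochberg(toy_p)
significant = [pair for pair, qv in zip(pairs, q) if qv < 0.05]
print(n_pairs, significant)
```

With roughly 89 million pairs, a raw P < 0.05 threshold would admit millions of chance associations, which is why an adjusted threshold matters at this scale.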



We thank Douglas Gibbs and Gil Omenn for helpful discussions.


1This work was supported, in part, by the National Institutes of Health (R01 CA97063 to A.M.C., U54 DA021519-01A1 to A.M.C., and Prostate SPORE P50CA69568 to A.M.C.), the Early Detection Research Network (UO1 CA111275-01 to A.M.C.), the Department of Defense (W81XWH-06-1-0224 to A.M.C.), and the Cancer Center Bioinformatics Core (support grant 5P30 CA46592 to A.M.C.). D.R.R. was supported by the Cancer Biology Training Program. S.A.T. was supported by a Rackham Predoctoral Fellowship. A.M.C. was supported by a Clinical Translational Research Award from the Burroughs Wellcome Foundation. S.A.T. and D.R.R. are fellows of the Medical Scientist Training Program.

2Daniel R. Rhodes and Shanker Kalyana-Sundaram contributed equally to this work.

3Financial Disclosure: Commercial rights to the Molecular Concepts technology were licensed to Compendia Bioscience, Inc., a for-profit company founded by D.R.R. and A.M.C.

*This article refers to supplementary material, which is designated by “W” (i.e., Table W1) and is available online at www.bcdecker.com.


1. Chang JC, Wooten EC, Tsimelzon A, Hilsenbeck SG, Gutierrez MC, Elledge R, Mohsin S, Osborne CK, Chamness GC, Allred DC, et al. Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer. Lancet. 2003;362:362–369.
2. Graudens E, Boulanger V, Mollard C, Mariage-Samson R, Barlet X, Gremy G, Couillault C, Lajemi M, Piatier-Tonneau D, Zaborski P, et al. Deciphering cellular states of innate tumor drug responses. Genome Biol. 2006;7:R19.
3. Hess KR, Anderson K, Symmans WF, Valero V, Ibrahim N, Mejia JA, Booser D, Theriault RL, Buzdar AU, Dempsey PJ, et al. Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J Clin Oncol. 2006;24:4236–4244.
4. Potti A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, Cragun J, Cottrill H, Kelley MJ, Petersen R, et al. Genomic signatures to guide the use of chemotherapeutics. Nat Med. 2006;12:1294–1300.
5. Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002;346:1937–1947.
6. van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, et al. A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med. 2002;347:1999–2009.
7. Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A, et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2006;439:353–357.
8. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006;313:1929–1935.
9. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene Set Enrichment Analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102:15545–15550.
10. Newman JC, Weiner AM. L2L: a simple tool for discovering the hidden significance in microarray expression data. Genome Biol. 2005;6:R81.
11. Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA. 2004;101:9309–9314.
12. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32:D258–D261.
13. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004;32:D277–D280.
14. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, et al. InterPro, progress and status in 2005. Nucleic Acids Res. 2005;33:D201–D205.
15. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, et al. Development of Human Protein Reference Database as an initial platform for approaching systems biology in humans. Genome Res. 2003;13:2363–2371.
16. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003;31:374–378.
17. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M. Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature. 2005;434:338–345.
18. Krek A, Grun D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P, da Piedade I, Gunsalus KC, Stoffel M, et al. Combinatorial microRNA target predictions. Nat Genet. 2005;37:495–500.
19. Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM. ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia. 2004;6:1–6.
20. Luc PV, Tempst P. PINdb: a database of nuclear protein complexes from human and yeast. Bioinformatics. 2004;20:1413–1415.
21. Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Barrette TR, Ghosh D, Chinnaiyan AM. Mining for regulatory programs in the cancer transcriptome. Nat Genet. 2005;37:579–583.
22. Vita M, Henriksson M. The Myc oncoprotein as a therapeutic target for human cancer. Semin Cancer Biol. 2006;16:318–330.
23. Adler AS, Lin M, Horlings H, Nuyten DS, van de Vijver MJ, Chang HY. Genetic regulators of large-scale transcriptional signatures in cancer. Nat Genet. 2006;38:421–430.
24. Schlosser I, Holzel M, Murnseer M, Burtscher H, Weidle UH, Eick D. A role for c-Myc in the regulation of ribosomal RNA processing. Nucleic Acids Res. 2003;31:6148–6156.
25. Hummel M, Bentink S, Berger H, Klapper W, Wessendorf S, Barth TF, Bernd HW, Cogliatti SB, Dierlamm J, Feller AC, et al. A biologic definition of Burkitt's lymphoma from transcriptional and genomic profiling. N Engl J Med. 2006;354:2419–2430.
26. Ohira M, Oba S, Nakamura Y, Isogai E, Kaneko S, Nakagawa A, Hirata T, Kubo H, Goto T, Yamada S, et al. Expression profiling using a tumor-specific cDNA microarray predicts the prognosis of intermediate risk neuroblastomas. Cancer Cell. 2005;7:337–350.
27. Ioannidis P, Mahaira L, Papadopoulou A, Teixeira MR, Heim S, Andersen JA, Evangelou E, Dafni U, Pandis N, Trangas T. 8q24 Copy number gains and expression of the c-myc mRNA stabilizing protein CRD-BP in primary breast carcinomas. Int J Cancer. 2003;104:54–59.
28. Nupponen NN, Kakkola L, Koivisto P, Visakorpi T. Genetic alterations in hormone-refractory recurrent prostate carcinomas. Am J Pathol. 1998;153:141–148.
29. Handa H, Hegde UP, Kotelnikov VM, Mundle SD, Dong LM, Burke P, Rose S, Gaskin F, Raza A, Preisler HD. Bcl-2 and c-myc expression, cell cycle kinetics and apoptosis during the progression of chronic myelogenous leukemia from diagnosis to blastic phase. Leuk Res. 1997;21:479–489.
30. Bouchard C, Marquardt J, Bras A, Medema RH, Eilers M. Myc-induced proliferation and transformation require Akt-mediated phosphorylation of FoxO proteins. EMBO J. 2004;23:2830–2840.
31. Chang HY, Nuyten DS, Sneddon JB, Hastie T, Tibshirani R, Sorlie T, Dai H, He YD, van't Veer LJ, Bartelink H, et al. Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc Natl Acad Sci USA. 2005;102:3738–3743.
32. Dai H, van't Veer L, Lamb J, He YD, Mao M, Fine BM, Bernards R, van de Vijver M, Deutsch P, Sachs A, et al. A cell proliferation signature is a marker of extremely poor outcome in a subpopulation of breast cancer patients. Cancer Res. 2005;65:4059–4066.
33. Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, et al. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell. 2002;13:1977–2000.
34. Rennstam K, Ahlstedt-Soini M, Baldetorp B, Bendahl PO, Borg A, Karhu R, Tanner M, Tirkkonen M, Isola J. Patterns of chromosomal imbalances defines subgroups of breast cancer with distinct clinical features and prognosis. A study of 305 tumors by comparative genomic hybridization. Cancer Res. 2003;63:8861–8868.
35. Wolter F, Akoglu B, Clausnitzer A, Stein J. Downregulation of the cyclin D1/Cdk4 complex occurs during resveratrol-induced cell cycle arrest in colon cancer cell lines. J Nutr. 2001;131:2197–2203.
36. Tomlins SA, Mehra R, Rhodes DR, Cao X, Wang L, Dhanasekaran SM, Kalyana-Sundaram S, Wei JT, Rubin MA, Pienta KJ, et al. Integrative molecular concept modeling of prostate cancer progression. Nat Genet. 2007;39:41–51.
37. Stegmaier K, Ross KN, Colavito SA, O'Malley S, Stockwell BR, Golub TR. Gene expression-based high-throughput screening (GE-HTS) and application to leukemia differentiation. Nat Genet. 2004;36:257–263.
38. Meani N, Minardi S, Licciulli S, Gelmetti V, Coco FL, Nervi C, Pelicci PG, Muller H, Alcalay M. Molecular signature of retinoic acid treatment in acute promyelocytic leukemia. Oncogene. 2005;24:3358–3368.
39. Iijima K, Honma Y, Niitsu N. Granulocytic differentiation of leukemic cells with t(9;11)(p22;q23) induced by all-trans-retinoic acid. Leuk Lymphoma. 2004;45:1017–1024.
40. Touma SE, Goldberg JS, Moench P, Guo X, Tickoo SK, Gudas LJ, Nanus DM. Retinoic acid and the histone deacetylase inhibitor trichostatin A inhibit the proliferation of human renal cell carcinoma in a xenograft tumor model. Clin Cancer Res. 2005;11:3558–3566.

Articles from Neoplasia (New York, N.Y.) are provided here courtesy of Neoplasia Press

