# Discovery of multi-dimensional modules by integrative analysis of cancer genomic data

…^{1,2}, Chun-Chi Liu^{3}, Wenyuan Li^{1}, Hui Shen^{4}, Peter W. Laird^{4} and Xianghong Jasmine Zhou^{1,*}

^{1}Program in Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA,

^{2}National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China,

^{3}Institute of Genomics and Bioinformatics, National Chung Hsing University, Taiwan 40227, Republic of China and

^{4}USC Epigenome Center, University of Southern California, Los Angeles, CA 90033, USA

## Abstract

Recent technology has made it possible to simultaneously perform multi-platform genomic profiling (e.g. DNA methylation (DM) and gene expression (GE)) of biological samples, resulting in so-called ‘multi-dimensional genomic data’. Such data provide unique opportunities to study the coordination between regulatory mechanisms on multiple levels. However, integrative analysis of multi-dimensional genomics data for the discovery of combinatorial patterns is currently lacking. Here, we adopt a joint matrix factorization technique to address this challenge. This method projects multiple types of genomic data onto a common coordinate system, in which heterogeneous variables weighted highly in the same projected direction form a multi-dimensional module (md-module). Genomic variables in such modules are characterized by significant correlations and likely functional associations. We applied this method to the DM, GE, and microRNA expression data of 385 ovarian cancer samples from The Cancer Genome Atlas project. These md-modules revealed perturbed pathways that would have been overlooked with only a single type of data, uncovered associations between different layers of cellular activities and allowed the identification of clinically distinct patient subgroups. Our study provides a useful protocol for uncovering hidden patterns and their biological implications in multi-dimensional ‘omic’ data.

## INTRODUCTION

Cells are complex systems with multiple levels of organization that interact and influence each other. The precise coordination among epigenetic status, transcription, translation, transport and metabolic reactions is essential in maintaining the function and robustness of cellular systems. However, study of the coordination among such multilevel cellular activities has been hindered by a lack of appropriate data resources; most genomic research has focused on global profiling at only one level (e.g. profiling of gene expression (GE) or protein abundance).

The recent development of high-throughput genomics technologies, especially sequencing technology, has significantly facilitated the characterization of biological systems at multiple levels. For example, The Cancer Genome Atlas (TCGA) project is generating multi-dimensional maps of the key genomic changes (e.g. SNP, DNA methylation (DM), GE and microRNA [miRNA] expression (ME)) for the same set of tumour samples (1). The NCI60 project has profiled 60 human cancer cell lines in terms of drug responses (2–4), GE (5), protein expression (6) and ME. With the expected drop in sequencing cost, multi-dimensional genomics characterizations on the same set of samples will soon become a standard practice.

Emerging multi-dimensional genomics data pose new challenges for data analysis. In particular, because different types of genomics data have different scales and units, we cannot simply aggregate multiple datasets for analysis. For a specific type of two-dimensional genomics dataset consisting of single-nucleotide polymorphism (SNP) and expression data, various eQTL approaches have been developed to identify regulatory SNPs (7). However, eQTL approaches cannot be applied to datasets with more than two dimensions, nor can they be used for datasets with a moderate sample size, which include most future multi-dimensional datasets generated by individual laboratories (rather than consortiums). Multivariate regression is another analytical method applicable to two-dimensional genomics datasets to infer correlative relationships (e.g. between GE and transcription factor binding data (8) or between GE and proteomic data (9)). More recently, Kutalik *et al.* (10) proposed a powerful modular analysis approach, called the Ping-Pong algorithm, to uncover the ‘co-modules’ across GE and drug response data. Undoubtedly, these studies have identified important relationships between pair-wise genomics variables. We believe that the time has come to simultaneously explore the coordination patterns across more than two types of genomics variables.

In this article, we apply a powerful matrix factorization framework to identify correlative modules in multi-dimensional genomics data (Figure 1). As the testing system, we used data from the TCGA project, including DM, ME and GE profiles of 385 ovarian cancer samples. These three types of genomics variables are known to be highly dependent on each other. Our goal was to identify subsets of mRNAs, miRNAs and methylation markers for which all or a subset of the samples exhibit correlated profiles across different types of measurements (Figure 1A). These subsets are termed ‘**m**ulti-**d**imensional **modules** (md-modules)’.

**Figure 1.** (**A**) An example of md-modules. In the three data matrices, rows correspond to the samples and columns correspond to different measurements. An md-module consists of *r* rows and *n*_{I} (*I* = 1,2,3) columns for GE, ME and DM data, respectively. **...**

By identifying md-modules, we can break down the massive sets of data into smaller building blocks that exhibit similar patterns across certain rows and columns (Figure 1). This procedure provides two major advantages. First, representing coherent features across multiple datasets reduces the complexity of the data and facilitates a global overview of the inherent structure of the data. More importantly, this modular approach captures the associations among sets of different types of variables (mRNA, miRNA and methylation). The md-modules can identify vertical associations between multiple regulatory levels and can reveal significantly disrupted pathways that would be overlooked if only a single data dimension were used. In addition, the md-modules can stratify patients (samples) into clinically distinct groups, which facilitates the identification of the complex molecular mechanisms that underlie different clinical phenotypes.

## MATERIALS AND METHODS

### Data preparation and preprocessing

The TCGA data were downloaded from the TCGA Data Portal on 27 April 2009. We used three types of data, as follows: GE data (Agilent G4502A), DM data (Illumina 27K) and ME data (Agilent H-miRNA_8x15K v2). In total, 385 samples are shared by the three datasets. We normalized the columns of the expression matrices, and then we scaled all the matrices so that the sum of squares of each matrix is the same.

To make the input data fit the constraints of non-negativity, we used the method suggested by Kim and Tidor (11). We doubled the columns of each matrix, so that each variable (gene, miRNA) was represented with two columns in the respective matrix. If the original value of the variable was positive, then it was stored in the first column; otherwise, its absolute value was stored in the second column. The remaining entries of the matrices were filled with zeros.
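The two preprocessing steps described above (equalizing the sum of squares across matrices and doubling the columns to remove negative values) can be sketched in Python with NumPy. This is only an illustration: the function names `split_signs` and `equalize_sum_of_squares` are hypothetical, the interleaved column layout is an assumption (the text does not say where the second column of each pair is placed), and the column normalization step is omitted because its exact form is not specified.

```python
import numpy as np

def split_signs(X):
    """Double the columns of X so that every entry is non-negative.

    Variable j is stored in columns 2j (positive part) and 2j + 1
    (absolute value of the negative part); all other entries are zero.
    The interleaved layout is an assumption made for this sketch.
    """
    out = np.zeros((X.shape[0], 2 * X.shape[1]))
    out[:, 0::2] = np.maximum(X, 0.0)
    out[:, 1::2] = np.maximum(-X, 0.0)
    return out

def equalize_sum_of_squares(matrices, target=1.0):
    """Rescale each matrix so that its sum of squared entries equals `target`."""
    return [M * np.sqrt(target / np.sum(M ** 2)) for M in matrices]

# Toy data: 2 samples x 2 variables, with mixed signs.
X = np.array([[1.5, -2.0],
              [-0.5, 0.0]])
X_nonneg = split_signs(X)
scaled = equalize_sum_of_squares([X_nonneg, 3.0 * X_nonneg])
```

Because exactly one column of each pair is non-zero for any given sample, the sum of squares of a matrix is unchanged by the doubling, so the two steps can be applied in either order.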

### Brief overview

Non-negative matrix factorization (NMF) is increasingly being used to analyse high-dimensional genomics data (11,12). NMF approximates a non-negative matrix *X*_{M×N} by the product of two non-negative matrices, *X* ≈ *WH*, where *W* is an *M* by *K* matrix containing the basis vectors, and *H* is a *K* by *N* matrix containing the coefficient vectors. Each element in *W* and *H* must be ≥ 0. Thus, a key feature of NMF is the ability to identify nonsubtractive patterns that together explain the data as a linear combination of its basis vectors. The *K* basis vectors in *W* can be regarded as the ‘building blocks’ of the data, and the *K*-dimensional coefficient vectors in *H* describe how strongly each ‘building block’ is present in the data.

Recently, an NMF-type method has been proposed to analyse pair-wise genomics data (13,14), including GE and transcription factor-binding data (13). We have developed a semi-supervised framework for combining miRNA/gene expression profiles and networked data to extract miRNA–gene regulatory programs (15). Here, we adopt the powerful NMF-type method for the discovery of md-modules by integrative analysis of cancer genomic data, all profiled on the same samples. We introduce the idea using a three-dimensional dataset, but it is applicable to higher dimensional datasets.

### The NMF problem

Given a dataset consisting of *N* measurements of *M* non-negative scalar variables, we let the *M*-dimensional measurement vectors *x*_{.j} (*j* = 1, … , *N*) form the data matrix *X*_{M×N}. For each column *x*_{.j}, a linear, non-negative approximation of the data is given by

$$x_{\cdot j} \approx \sum_{k=1}^{K} w_{\cdot k}\, h_{kj} = W h_{\cdot j},$$

where *W* is an *M* × *K* matrix containing the basis vectors *w*_{.k} as its columns and *H* is a *K* × *N* matrix containing the coefficient vector *h*_{.j} corresponding to the measurement vector *x*_{.j}. Note that each measurement vector is written in terms of the same basis vectors. The *K* basis vectors *w*_{.k} can be thought of as the ‘building blocks’ of the data, and the *K*-dimensional coefficient vector *h*_{.j} describes how strongly each building block is present in the measurement vector *x*_{.j}.

Given a non-negative data matrix *X*, the optimal choices of matrices *W* and *H* are defined to be those non-negative matrices that minimize the reconstruction error between *X* and *WH*. Although several error functions have been proposed (16–18), the most widely used is the squared Euclidean error function:

$$\min_{W \ge 0,\, H \ge 0} \; \| X - WH \|_F^2 = \sum_{i=1}^{M} \sum_{j=1}^{N} \left( x_{ij} - (WH)_{ij} \right)^2.$$

The resulting *WH* is called the non-negative matrix factorization of *X*. The choice of *K* is often problem-dependent. In most cases, *K* is chosen such that *K* < min(*M, N*) and *WH* represents a compressed form of the data in *X*. By not allowing negative entries in *W* and *H*, NMF enables a non-subtractive combination of parts to form a whole (17).

### The joint NMF framework for integrative analysis

Let *X*_{1}, *X*_{2} and *X*_{3} be *M* × *N*_{1}, *M* × *N*_{2} and *M* × *N*_{3} matrices representing three types of genomic profiling of the same samples, e.g. the methylation profiles of *N*_{1} DNA markers and the expressions of *N*_{2} genes and *N*_{3} miRNAs of *M* samples. To extract md-modules across the three data matrices, the following joint factorization framework was used to decompose the three data matrices into a common basis matrix *W* and different coefficient matrices *H*_{I} (*I* = 1, 2, 3):

$$X_I \approx W H_I, \quad I = 1, 2, 3,$$

with the non-negativity constraints:

$$W \ge 0, \quad H_I \ge 0, \quad I = 1, 2, 3,$$

where *W* is an *M* × *K* matrix, and each column of *W* represents a basis vector of the reduced system. *H*_{I} is a matrix of size *K* × *N*_{I}, and each row of *H*_{I} represents a coefficient vector. Then, the objective function of joint NMF is formulated as

$$\min_{W,\, H_1,\, H_2,\, H_3 \ge 0} \; \sum_{I=1}^{3} \| X_I - W H_I \|_F^2.$$

Several algorithms have been developed to optimize the NMF problem (19). Lee and Seung (18) devised a multiplicative algorithm that is simple to implement and performs well. As in standard NMF, we used the ‘multiplicative update’ equations to minimize the Euclidean error function. Given a desired rank *K*, the algorithm iteratively computes the approximations of *X*_{1}, *X*_{2} and *X*_{3} in the same manner. The method starts by randomly initializing matrices *W* and *H*_{1}, *H*_{2} and *H*_{3}, which are iteratively updated to minimize the Euclidean distance function. Specifically, *W*, *H*_{1}, *H*_{2} and *H*_{3} are updated at each step by using the generalized multiplicative update rules as follows:

$$W_{ik} \leftarrow W_{ik}\, \frac{\left( \sum_{I=1}^{3} X_I H_I^{T} \right)_{ik}}{\left( W \sum_{I=1}^{3} H_I H_I^{T} \right)_{ik}}, \qquad (H_I)_{kj} \leftarrow (H_I)_{kj}\, \frac{\left( W^{T} X_I \right)_{kj}}{\left( W^{T} W H_I \right)_{kj}}, \quad I = 1, 2, 3.$$

The above algorithm is a local optimization procedure and thus finds only a local minimum. To address this limitation, we repeated the procedure 50 times with different initial solution matrices. The factorization that led to the lowest objective function value was used as the final solution for further analysis. The solutions found were reproducible, since the solutions from different runs of the algorithm were strongly correlated. The time complexity of the joint NMF decomposition is *O*(*tK*(*M* + *N*_{1} + *N*_{2} + *N*_{3})^{2}), where *t* is the number of iterations; this is similar to that of the original NMF model. The main practical constraint on this procedure is computer memory. Generally, given enough memory, it should be applicable even to millions of features. If memory is insufficient, the dimension of the input data can be reduced by data-reduction techniques such as the PCA-select tool used for decreasing the feature number in population structure studies (20).
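The procedure described above (random initialization, multiplicative updates of a shared *W* and separate *H*_{I}, and selection of the best of several restarts) can be sketched as follows in Python with NumPy. This is a minimal illustration and not the authors' Matlab package: the function name `joint_nmf`, the fixed iteration count, the small constant `eps` added to the denominators to avoid division by zero, and the default number of restarts are all assumptions made for the sketch.

```python
import numpy as np

def joint_nmf(Xs, K, n_iter=200, n_restarts=5, seed=0, eps=1e-9):
    """Approximate each non-negative X_I (M x N_I) by W @ H_I with a shared W.

    Returns (error, W, [H_1, H_2, ...]) for the restart with the lowest
    summed squared Euclidean error.
    """
    rng = np.random.default_rng(seed)
    M = Xs[0].shape[0]
    best = None
    for _ in range(n_restarts):
        # Random non-negative starting point.
        W = rng.random((M, K))
        Hs = [rng.random((K, X.shape[1])) for X in Xs]
        for _ in range(n_iter):
            # Shared basis: numerator and denominator are summed over data types.
            numerator = sum(X @ H.T for X, H in zip(Xs, Hs))
            denominator = W @ sum(H @ H.T for H in Hs)
            W *= numerator / (denominator + eps)
            # Coefficients: each H_I is updated against its own X_I.
            WtW = W.T @ W
            Hs = [H * (W.T @ X) / (WtW @ H + eps) for X, H in zip(Xs, Hs)]
        error = sum(np.sum((X - W @ H) ** 2) for X, H in zip(Xs, Hs))
        if best is None or error < best[0]:
            best = (error, W, Hs)
    return best

# Toy usage: three matrices that share the same 20 samples and a rank-2 structure.
rng = np.random.default_rng(1)
W_true = rng.random((20, 2))
Xs = [W_true @ rng.random((2, n)) for n in (6, 5, 4)]
error, W, Hs = joint_nmf(Xs, K=2, n_iter=300, n_restarts=2)
```

Because the three matrices share their rows, these updates coincide with the standard Lee and Seung updates applied to the column-wise concatenation of *X*_{1}, *X*_{2} and *X*_{3}; writing them per data type simply makes the shared role of *W* explicit.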

In this way, the three data matrices are projected into a common coordinate system to explore the correlative relationships among the three types of variables (Figure 1B and C). Using this procedure, we obtained coefficient matrices *H*_{1}, *H*_{2} and *H*_{3} that can be used to identify memberships of DM markers, miRNAs and genes in md-modules, respectively. In the general application of NMF (11,12), researchers have used the maximum of each column of *H* (or row of *W*) to determine membership. In this way, each gene (or other object) belongs to one and only one module. However, some markers/miRNAs/genes may not be active in any module or may be active in multiple modules with multiple functions. Considering these facts, we calculated the *z*-score for each element in each row of *H*_{I} (*I* = 1,2,3) by

$$z_{ij} = \frac{h_{ij} - \mu_i}{\sigma_i},$$

where *h*_{ij} is the entry of *H*_{I} in row *i* (module) and column *j* (feature: DM marker/miRNA/gene), μ_{i} is the average value of row *i* of *H*_{I} and σ_{i} is the corresponding standard deviation. We assigned feature *j* as a member of module *i* if *z*_{ij} was greater than a given threshold *T*. Each DM marker/miRNA/gene may therefore be assigned to multiple md-modules, which allows the identification of multiple functional activities of DM markers/miRNAs/genes. We have implemented the method as a Matlab software package, which is available from the Supplementary Data. Mathematically, the multi-dimensional data of the same samples are modeled using multiple matrices that share the same rows. Therefore, the technique cannot be applied to different types of data from different samples.
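The membership rule can be sketched as below, reading the *z*-score as a row-wise standardization of a coefficient matrix, so that each module (row) selects the features (columns) that are weighted unusually highly in it. This reading, the function name `module_members` and the threshold value used in the toy example are assumptions; the text does not state the value of *T* here.

```python
import numpy as np

def module_members(H, T):
    """For each row (module) of H, return the column indices whose
    row-wise z-score exceeds the threshold T."""
    mu = H.mean(axis=1, keepdims=True)
    sigma = H.std(axis=1, keepdims=True)
    Z = (H - mu) / sigma
    return [np.flatnonzero(Z[module] > T) for module in range(H.shape[0])]

# Toy coefficient matrix: 2 modules x 5 features.
H = np.array([[0.1, 0.1, 0.1, 0.1, 5.0],
              [3.0, 0.2, 0.2, 3.0, 0.2]])
members = module_members(H, T=1.0)  # T = 1.0 is an arbitrary illustrative value
```

Unlike the maximum-based assignment, this rule lets a feature appear in several modules or in none, which is the behaviour the paragraph above asks for.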

### Statistical significance of vertical correlations in md-modules

We expect that, within an md-module, the profiles of genes, DM markers and miRNAs are highly (anti-)correlated. To determine whether such relationships are statistically significant, we performed the following assessment. We calculated the ‘between-correlation’ between two matrices with the same row dimensions as the sum of the absolute values of Pearson's correlations between any two columns (one column from each matrix). We derived the statistical significance (*P*-value) of the correlation between two matrices by comparing it with the distribution of between-correlations between 1000 random matrix pairs. Each pair is composed of two matrices with dimensions identical to the original ones, whose elements are extracted from randomly permuted matrices based on the original ones. *P*-values of < 0.05/200 were considered significant. For an md-module, if all three *P*-values for the pair-wise submatrices are significant, then the vertical correlation of this module is considered to be statistically significant.
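The between-correlation and its permutation *P*-value can be sketched as follows. The construction of the random matrix pairs is not fully specified in the text, so this sketch makes an explicit assumption: the null distribution is generated by shuffling the sample order of one of the two submatrices, which breaks the sample pairing while keeping the dimensions and the values of both matrices. The function names are hypothetical, and columns with zero variance would need to be removed beforehand because their Pearson correlation is undefined.

```python
import numpy as np

def between_correlation(A, B):
    """Sum of |Pearson r| over all pairs (one column of A, one column of B).

    A and B must have the same number of rows (samples).
    """
    Az = (A - A.mean(axis=0)) / A.std(axis=0)
    Bz = (B - B.mean(axis=0)) / B.std(axis=0)
    R = Az.T @ Bz / A.shape[0]  # matrix of Pearson correlations
    return np.abs(R).sum()

def permutation_pvalue(A, B, n_perm=1000, seed=0):
    """Empirical right-tailed P-value of the between-correlation of A and B.

    Assumption of this sketch: the null is built by permuting the rows of B.
    """
    rng = np.random.default_rng(seed)
    observed = between_correlation(A, B)
    null = np.array([
        between_correlation(A, B[rng.permutation(B.shape[0])])
        for _ in range(n_perm)
    ])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

# Toy usage: B is a noisy copy of three columns of A, so the two are correlated.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 4))
B = A[:, :3] + 0.1 * rng.normal(size=(50, 3))
p = permutation_pvalue(A, B, n_perm=200, seed=1)
```

An empirical estimate of this form cannot fall below 1/(`n_perm` + 1), so comparing it against a strict threshold such as 0.05/200 requires either more permutations or a parametric fit to the null distribution; which of these was used in the original analysis is not stated in the text.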

### Functional analysis of identified md-modules

For each md-module, we identified three gene sets, as follows: (i) genes from the GE dimension; (ii) genes in the 20-kb region around the methylation markers in the DM dimension and (iii) genes targeted by miRNAs in the ME dimension (based on the miRNA targets from the Microcosm database). For each gene set, we performed two types of enrichment analyses: gene ontology (GO) biological process and KEGG pathway analyses.

### Cancer gene and protein interactions enrichment analysis

The protein–protein interaction network data were downloaded from BioGRID (release 2.0.54). The final network has 7682 proteins and 33 165 interactions. The cancer gene list was obtained from the Cancer Gene Census (CGC) web site (21). All cancer genes that are not included in our input gene list were excluded. The final list contains 290 cancer genes. We also collected an epigenetically regulated gene list of ovarian cancer, which includes 40 genes (22). All of the enrichment analyses for a gene set are assessed by the right-tailed Fisher's exact test.
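A right-tailed Fisher's exact test for gene-set enrichment is equivalent to the upper tail of a hypergeometric distribution, and can be sketched with the Python standard library alone. The function and argument names are hypothetical, and the numbers in the usage line (other than the 290 cancer genes mentioned above) are made up for illustration.

```python
from math import comb

def fisher_right_tail(n_universe, n_annotated, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without
    replacement from n_universe genes, of which n_annotated carry the
    annotation of interest (e.g. membership in a cancer gene list)."""
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_selected)

# Hypothetical example: a 400-gene set containing 20 of the 290 cancer genes,
# drawn from an assumed universe of 12000 input genes.
p_value = fisher_right_tail(12000, 290, 400, 20)
```

Exact integer arithmetic is used for the binomial coefficients, so the result stays accurate for large gene universes, at the cost of speed compared with a log-space implementation.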

### ‘Vertical’ implications of identified regulatory md-modules

For each md-module, we investigated the vertical associations between different dimensions by the following ‘overlapping analysis:’ we first identified overlapping genes between those from the GE dimension and those adjacent to methylation markers in the DM dimension, or between those from the GE dimension and those targeted by miRNAs in the ME dimension, and then performed the enrichment significance assessment.

### Clinical characterization

Based on the signals for all samples in each column of the common basis matrix *W*, we can characterize their level of association with the discovered md-modules. For each md-module, we divided the set of samples into two groups: module-specific and not module-specific, by using the *z*-score for each column of *W* with a threshold of 1. The clinical data were downloaded from the TCGA portal. Kaplan–Meier curves were computed using R. Survival distributions of the two groups were compared using the log-rank test. Age differences between groups were compared with the Wilcoxon signed-rank test.
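For readers who want to see what the survival comparison involves, the standard two-group log-rank test can be sketched in plain Python. This is not the R code used in the study; it is an illustrative re-implementation of the textbook statistic, with hypothetical function and argument names. The two groups would correspond to the module-specific and not module-specific samples obtained by thresholding the column-wise *z*-scores of *W* at 1.

```python
from math import erfc, sqrt

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    times_*  : follow-up times of the samples in each group
    events_* : 1 if the event (death) was observed, 0 if censored
    Returns (chi-square statistic, P-value with 1 degree of freedom).
    """
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    statistic = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return statistic, erfc(sqrt(statistic / 2.0))

# Toy usage: group A fails early, group B fails late, no censoring.
stat, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
```

The loop recomputes the risk sets at every event time, which is quadratic in the number of samples; that is adequate for a few hundred patients but would be replaced by a sorted sweep for larger cohorts.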

## RESULTS

Figure 2 illustrates an example using simulated data (see the Supplementary Data). In a matrix representation, an md-module consists of *r* rows and *n*_{I} (*I* = 1,2,3) columns for mRNA, miRNA and methylation markers, respectively. Within these *r* rows (samples) in each matrix, the *n*_{I} (*I* = 1,2,3) columns exhibit correlated measurements (Figure 2). In biological applications, permutation tests are performed to evaluate the statistical significance of each md-module according to the ‘between’ correlations of different types of variables. Details and parameter selections are described in the ‘Materials and Methods’ section and in the Supplementary Data.

Before describing the application of this method, we briefly show how md-module discovery is related to, but different from, several typical data mining tasks. Most existing techniques for module identification are applicable only to one or two matrices at a time. For example, the goal of clustering methods is to identify a group of relevant rows or columns in a data matrix. A more closely related task, ‘biclustering’ (co-clustering), refers to a class of clustering techniques that perform simultaneous clustering of rows and columns in a data matrix (23). More recently, Kutalik *et al.* (10) extended the traditional modular analysis approach from one to two data matrices that share one common dimension, and applied their method to identifying drug–gene co-modules. We should note that this method is not directly applicable to three matrices. Shen *et al.* (24) have proposed a joint clustering model for multiple genomic datasets, but it was designed for sample clustering and subtype discovery and cannot identify modules comprising correlated variables. We have previously proposed an NMF-type method to analyze paired matrices subjected to network constraints (15). However, it has not been applied to more than two data matrices or tested for md-module analysis.

### Identification of md-modules involved in ovarian cancer

The TCGA ovarian cancer dataset consisting of GE, DM and ME profiles across 385 samples (patients) was used as a testing system to show the discovery of md-modules. After parameter optimizations (details in Materials and Methods), the three large matrices were broken down into *K* = 200 basic building blocks, from which 200 md-modules were derived. The dimension reduction captures the major information embedded in the original data; the average sample-wise correlations between the data reconstructed from these building blocks (based on *W* and *H*_{I}) and the original data were 0.90, 0.92 and 0.91 in the methylation, miRNA and GE dimensions, respectively. The small variances of those correlations further demonstrate the robustness of the method (Figure 3A). The correlated profiles for three samples are plotted in Figure 3B.

**Figure 3.** (**A**) Box-plot of sample-wise correlations of original and reconstructed methylation, miRNA and GE profiles across 385 samples. (**B**) Original data are plotted against the reconstructed methylation, miRNA and GE profiles for three samples.
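The sample-wise correlation used to judge reconstruction quality can be sketched as a row-by-row Pearson correlation between a data matrix and its reconstruction. The function name `samplewise_correlation` is hypothetical, and the reconstruction is assumed to be the product of *W* and the corresponding *H*_{I}.

```python
import numpy as np

def samplewise_correlation(X, X_hat):
    """Pearson correlation between each row (sample) of X and the matching
    row of its reconstruction X_hat (e.g. W @ H_I)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = X_hat - X_hat.mean(axis=1, keepdims=True)
    numerator = (Xc * Yc).sum(axis=1)
    denominator = np.sqrt((Xc ** 2).sum(axis=1) * (Yc ** 2).sum(axis=1))
    return numerator / denominator

# Toy usage: a reconstruction that is a noisy version of the data.
rng = np.random.default_rng(0)
X = rng.random((5, 30))
X_hat = X + 0.05 * rng.normal(size=X.shape)
r = samplewise_correlation(X, X_hat)  # one correlation per sample
```

The mean of the returned vector corresponds to the average sample-wise correlation quoted for each data type, and its spread is what the box-plot in Figure 3A summarizes.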

Each of the 200 md-modules comprises a set of genes, methylation markers and miRNAs. In total, the 200 md-modules cover 2985 genes, 2008 DM markers and 270 miRNAs. The average module sizes in the gene, methylation markers and miRNA dimensions are 239.6, 162.3 and 13.8, respectively. Size distribution and other characteristics of these modules are described in the Supplementary Data.

#### Md-modules reveal multilevel vertical associations and cooperative functional effects

To assess the biological relevance of the identified multi-dimensional modules, we first tested the functional homogeneity of members within individual dimensions. A set of genes is defined to be functionally homogenous if it is enriched in at least one GO biological process category (25), with a *q*-value of < 0.05 (the *q*-value is the *P*-value after a false-discovery rate multiple testing correction). Among the 200 md-modules, 80, 62.7 and 12.5% were functionally homogenous in the GE dimension with respect to member genes, in the DM dimension with respect to genes directly adjacent to the member DNA methylation markers and in the ME dimension with respect to member miRNAs, respectively. The functions of the miRNAs were predicted based on the functions of the target genes. These values are significantly higher than those obtained after randomization (5, 13.1 and 3.9% for GE, DM and ME, respectively) (Figure 4A).

**Figure 4.** (**A**) Enrichment ratio of md-modules in each dimension (GE, DM and ME), with respect to the GO biological process terms. For comparison, the mean ratio of functional enrichment for 100 corresponding random runs is also plotted. (**B**) and (**C**) Examples of protein **...**

Although all three dimensions showed significant enrichment in developmental processes that are known to be tightly associated with cancer pathogenesis, this preference is most obvious in the DM dimension, with additional strong participation in embryonic development. This result is consistent with the previous report that polycomb complex targets in the embryonic stem cell are predisposed to cancer-specific hypermethylation (26). The most frequently activated biological processes in the GE dimension are responses to external stimuli (e.g. chemotaxis, locomotor behavior and inflammatory responses). This observation points to the flexibility of GE programs upon external perturbations. The ME dimension shows a distinct preference for participation in transcriptional regulation (as expected) and cell differentiation.

Although the individual dimensions of these modules exhibit a significant level of functional homogeneity, combining all dimensions reveals an even stronger functional synergy. When the GE dimension genes, methylation adjacent genes and miRNAs of a module were combined, 93% of the md-modules were functionally homogenous, compared with only 7.9% after randomization (Figure 4A). This result shows the power of the current integrative analysis of multi-dimensional data in identifying genomic variables of different natures that are involved in the same functional pathways.

The ability of the modules to capture multilevel synchronicity was also observed relative to perturbed KEGG pathways. For example, simply by combining multiple dimensions, we observed that nine modules showed significant perturbations in at least one KEGG pathway (*P*-value < 0.05) that were not shown otherwise. These pathways include TGF-β signaling, Hedgehog signaling, bladder cancer and cytokine–cytokine receptor interaction pathways, all of which have been confirmed to be closely associated with ovarian cancer (27–32). For 11.5% of the md-modules, the pathway enrichment for combined members from all three dimensions is more significant than that for any individual dimension.

According to the model principle, an md-module should capture vertical associations, i.e. associations between variables of different dimensions (e.g. GE and DM) within it. Indeed, compared with randomly permuted modules, the Pearson's correlation coefficients between variables of any two of the GE, DM and ME dimensions are significantly high (*P*-value < 0.05/200) in 65.5% of the modules (for details see ‘Materials and Methods’ section). This result indicates that these modules are very unlikely to have been identified by chance. The strong statistical correlations across different dimensions imply the coordinated activities of genes, methylations and miRNAs.

To explore further the biological implications of these vertical correlations, we tested whether genes in an md-module were likely to be located close to the methylation markers in the same module and/or targeted by miRNAs in the same module. At a significance level of 0.1, we found that 75 of the 200 md-modules showed significant overlap between genes adjacent to methylation markers and genes within the same module. This result confirms the strong influences of DNA methylation on the expression of adjacent genes. Likewise, 146 modules with *P*-value < 0.1 show significant overlap between genes targeted by miRNAs and genes within the same md-module. As the known targeting relationships between miRNAs and genes are far from complete, our assessment likely underestimates the true overlap. These data show that the md-modules can elucidate the vertical association mechanisms between different layers of gene regulation. Table 1 showcases 12 of the md-modules, including the overlap between different dimensions within the same modules, and the over-represented functions and pathways of the modules.

Interestingly, among the 3733 genes overlapping at least two dimensions from all md-modules, genes related to ovarian cancer are significantly enriched (*P*-value = 0.000087). Note that the overlapping genes are those on which regulatory perturbations were observed at multiple levels. It is not surprising that those genes are especially concentrated in the biological processes of ‘positive regulation of developmental processes,’ ‘positive regulation of cell differentiation,’ ‘inflammatory response’ and ‘regulation of cell development.’ Md-module 173 contained six, nine and nine genes overlapping the GE and DM, GE and ME and DM and ME dimensions, respectively. Among these genes, *NID2* (Nidogen-2) was overlapped by all three dimensions. *NID2* recently was defined as a new biomarker for ovarian cancer by comparing its concentration in the serum of healthy women with that in women with ovarian carcinoma (33). More interestingly, *NID2* gene promoters are aberrantly methylated in human gastrointestinal cancer (34), and methylated *NID2* has been defined as a marker for primary bladder cancer (35).

In 44 modules, the genes from the GE and DM dimensions are enriched in protein–protein interactions (Figure 4B) (*P*-value < 0.05; we skipped the ME dimension, due to the large number of potential miRNA targets). Among these 44 modules, 18 are enriched in protein–protein interactions bridging the GE and DM dimensions (i.e. one protein belongs to the GE dimension and another belongs to the DM dimension) (*P*-value < 0.05 with right-tailed Fisher's exact test). This finding highlights the different regulatory effects on closely adjacent molecules of the same pathway.

Finally, we hypothesized that the identified md-modules might play a role in cancer. Indeed, 22 combined sets of genes (from the GE and DM dimensions) are enriched with the cancer gene reference set (*P*-value < 0.05 with right-tailed Fisher's exact test) (Figure 4C) (i.e. the CGC list (21)). The results of the large-scale enrichment analysis support the biological relevance of the regulatory programs detected by our method.

#### Md-modules capture multilevel synchronized disruptions on pathways: two case studies

This section provides in-depth descriptions of two case studies (modules 119 and 5) to demonstrate how multilevel regulatory changes cooperatively perturb pathways.

##### Md-module 119

The individual dimensions of module 119 do not show significant enrichment in any KEGG pathway. However, when all three dimensions were considered, the bladder cancer pathway emerged as a significantly disrupted pathway. This pathway, which is frequently altered in bladder cancer, shares a set of known oncogenes and tumor suppressors with many other cancers (e.g. prostate, ovarian and lung cancers). Module 119 overlaps with the bladder cancer pathway in three genes in the GE dimension (*MMP1, MYC* and *RB1*), three genes adjacent to markers in the DM dimension (*CDKN2A, RASSF1* and *TYMP*), and four miRNAs in the ME dimension (*mir-130b, mir-149, mir-196b* and *mir-218*). Figure 5A provides a snapshot of perturbation positions for some of these molecules along the pathway. Promoter hypermethylations of two identified tumor suppressor genes, *CDKN2A* and *RASSF1*, are thought to be involved in the development and progression of ovarian cancer (22). The DNA methylation markers adjacent to *RASSF1* and *CDKN2A* were negatively correlated with the expression of these two genes (*P*-value < 0.0001). Figure 5A shows that *RASSF1* could be additionally inhibited by the increased expression of its predicted post-transcriptional regulator *mir-130* in ovarian tumors, compared with normal tissues. *RASSF1* encodes a protein similar to the effector proteins of the oncogene *HRAS*. Thus, promoter hypermethylation will silence *RASSF1*, thereby upregulating the activity of *HRAS*. The effector of *HRAS* on the pathway, *ARAF*, is also a potential target of *mir-218* and, thus, would also be activated.

**Figure 5.** (**A**) Bladder cancer pathway and (**B**) TGF-β signaling pathway, which are enriched in the combination of molecules in all three dimensions, but not in each dimension. In both subfigures, molecules **...**

The next two neighbors on the pathway, *MAPK1* and *RPS6KA5*, are potentially targeted by another onco-miRNA *mir-130b*, whose elevation is known to be associated with a variety of cancers (36–38). *Mir-130b* is predicted to target several downstream molecules in this pathway, including *CDKN1A* and *E2F2/3*, both of which are reported to be critically involved in the pathogenesis of ovarian cancer (39,40). In fact, the multiple potential targets of *mir-130b* in the bladder cancer pathway suggest that *mir-130b* could be a key regulatory factor of this dysfunctional pathway in ovarian cancer. The GE dimension of our module includes two important genes on this pathway, the oncogene *MYC* and the tumor suppressor *RB1*. An interesting gene, *CCND1*, connects *MYC, RB1* and another tumor suppressor gene, *CDKN1A*, in this pathway. Mutations, amplification or overexpression of *CCND1*, which alter the cell cycle progression, are observed frequently in a variety of tumors (41,42). Thus, *CCND1* may be an important contributor to tumorigenesis. This example clearly shows that the md-modules capture the associations among epigenetic regulation, gene expression and post-transcriptional regulation on various parts of the pathway. Such synchronized effects from multiple regulatory levels are otherwise difficult to identify.

##### Md-module 5

As another example, md-module 5 captures the significant dysfunction of the TGF-β signaling pathway in ovarian cancer, which, again, only becomes obvious when perturbations in all three dimensions are combined. Genes in this module that participate in the TGF-β pathway include *INHBA, INHBB, COMP* and *MYC* in the GE dimension, *PPP2R2C, INHBE* and *GDF5* adjacent to markers in the DM dimension, and *mir-363, mir-768-5p* and *mir-451* in the ME dimension (Figure 5B shows a snapshot of perturbation positions for some of these molecules along the pathway). The TGF-β signaling pathway normally exerts anticancer activities by arresting the G1–S transition. However, its abnormal function reverts to promote tumorigenesis, especially in terms of metastatic progression, a functional switch known as the ‘TGF-β paradox’ (43). In fact, in this module, 60% of tumors with characterized recurrence sites have metastasized.

The ‘core’ metastasis-associated gene expression signature is manifested in this module, mainly through the increased expression of *COMP* and *INHBA* (44). This finding further confirms the strong metastasis characteristics of samples in the module. Interestingly, *mir-363, mir-768-5p* and *mir-451* all potentially target *EP300*, a metastasis suppressor whose decreased expression and protein abundance have been detected in many highly metastatic cancer tissues (45). Another tumor suppressor, *PPP2R2C*, not only appears in the methylation dimension of the module, but may also be a potential target of *mir-363*. In addition, *mir-363* targets a set of *SMAD* molecules, which play important roles in the metastasis transition contributed by TGF-β (46–48). Furthermore, *mir-768-5p* is predicted to inhibit *E2F5* and *BMPR1A*, both of which support the original anticancer activities of the TGF-β pathway (49,50).

The TGF-β signaling pathway has been regarded as a potential therapeutic target in ovarian cancer metastases (27). More interestingly, a recent study suggests that the accumulation of epigenetic modifications, including DNA methylation, leads to the suppression of TGF-β signaling and contributes to ovarian carcinogenesis (51). Our md-module facilitates the discovery of the abnormal functions of this pathway at multiple regulatory levels. Thus, this method can aid a holistic approach to drug interventions that can simultaneously correct the effects of various types of dysfunctions.

#### Clinical associations of the md-modules

In the NMF framework, the decomposed component vector (i.e. column of the *W* matrix) can provide information on the association of each sample/patient with an individual module. This information, combined with the available clinical characterizations of each patient, can aid in the discovery of phenotype-specific md-modules. An md-module that stratifies patients into clinically distinct groups can shed light on the molecular mechanisms of the respective clinical phenotypes.
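As an illustration of how a column of *W* might be turned into a list of module-associated samples, the following minimal sketch flags samples whose weight in that column is unusually high. The z-score rule, the cutoff of 1.5, the function name `module_members` and the toy weights are all assumptions made for this example; this section does not specify the selection rule.

```python
import statistics

def module_members(w_column, z_threshold=1.5):
    """Return indices of samples whose weight in one W column has a z-score above the cutoff.

    The z-score rule and the default cutoff are illustrative assumptions.
    """
    mean = statistics.mean(w_column)
    sd = statistics.stdev(w_column)
    if sd == 0:
        return []  # a constant column singles out no sample
    return [i for i, w in enumerate(w_column) if (w - mean) / sd > z_threshold]

# Hypothetical weights of ten samples for a single module (one column of W).
w_col = [0.10, 0.20, 0.15, 0.12, 2.50, 0.18, 0.11, 3.00, 0.14, 0.16]
members = module_members(w_col)
print(members)  # -> [4, 7]
```

The resulting index list can then be joined with the clinical annotation of each patient to compare module members with all remaining samples.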

Based on the information from the *W* matrix, we compared the survival time of ovarian cancer patients that are strongly associated with a specific md-module versus those that are not. In several md-modules, the associated patients showed significantly shorter or longer median survival time than the remaining patients (log-rank test *P* < 0.05; Supplementary Data). For example, 13 patients are strongly associated with md-module 166. They show a significantly worse outcome, with a median survival of 26.4 months compared with 34.1 months for other patients (*P* = 0.0006, log-rank test) (Figure 6A). In fact, in all three dimensions of this md-module, these 13 patients show distinct characteristics compared with the rest of the patients. For example, genes/miRNAs in this module are over/under-expressed in these 13 samples compared with other samples, as are the methylation levels of the markers. The module contains numerous cell cycle check-point genes (e.g. *BUB1B*, *CENPF, MAD2L1, CCNB1, BUB1, CCNA2, CHEK1* and *TTK*) and is significantly enriched in genes from the ‘nuclear division’ functional category (*P*-value < 10^{−8}). In another case, the patients in md-module 3 are associated with improved survival, with a median survival of 38.2 months versus 33.8 months in the remaining patients (*P* < 0.02, log-rank test). This module reveals the significant perturbation of the endometrial cancer pathway with several key genes related to tumorigenesis, e.g. *EGFR*, *CTNNA2* and *ARAF*.
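For readers who want to reproduce this kind of comparison, a two-group log-rank test can be sketched with the standard library alone. The follow-up times and event flags below are hypothetical toy data, not TCGA values, and the function name `logrank_test` is an assumption of this example. The P-value uses the fact that a chi-square variable with one degree of freedom has survival function erfc(sqrt(x/2)).

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, two-sided P-value).

    times_*  : follow-up times
    events_* : 1 if death was observed, 0 if the patient was censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for s, _, _ in data if s >= t)                # at risk, both groups
        n_a = sum(1 for s, _, g in data if s >= t and g == 0)   # at risk, group A
        d = sum(1 for s, e, _ in data if s == t and e)          # deaths at t, both groups
        d_a = sum(1 for s, e, g in data if s == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical survival times in months (toy data only).
module_times = [5, 8, 12, 14, 20]
module_events = [1, 1, 1, 1, 1]
other_times = [25, 30, 34, 40, 48, 60]
other_events = [1, 1, 0, 1, 1, 0]

chi2, p_value = logrank_test(module_times, module_events, other_times, other_events)
print(f"chi2 = {chi2:.2f}, P = {p_value:.4f}")
```

In practice, a dedicated survival analysis package would also provide the Kaplan–Meier curves; the sketch only shows how the test statistic accumulates observed minus expected deaths over the event times.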

Figure 6. (**A**) and (**B**) Kaplan–Meier survival analysis for patients associated with module 166 (A) or module 3 (B) compared with other patients. The *P*-values of the log-rank test were *P* = 0.0006 and *P* = 0.019, respectively. ...

We identified 20 md-modules, each of which contains patients whose age distribution differs significantly from that of patients outside the module. For example, patients in module 28 had an older median age compared with other patients (66.3 years versus 58.7 years; *P* = 0.009, rank-sum test) (Figure 6C), and md-module 78 was associated with significantly younger patients (median age of 54.1 years versus 60.2 years for the rest of the patients; *P* = 0.002, rank-sum test) (Figure 6D).
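The age comparison can be sketched in the same spirit with a Wilcoxon rank-sum test. The version below uses the normal approximation without a tie correction, which is an assumption of this example (the text does not say whether an exact or approximate test was used). The ages are hypothetical toy values and the function name `rank_sum_test` is illustrative.

```python
import math

def rank_sum_test(x, y):
    """Wilcoxon rank-sum test: returns (z, two-sided P) by normal approximation.

    No tie correction is applied to the variance.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # tied values share their average rank
        i = j + 1

    n_x, n_y = len(x), len(y)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_x - n_x * (n_x + 1) / 2
    mean_u = n_x * n_y / 2
    sd_u = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    z = (u - mean_u) / sd_u
    return z, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical ages in years (toy data only).
module_ages = [64, 66, 67, 69, 71, 72, 75]
other_ages = [48, 52, 55, 57, 58, 59, 60, 61, 62, 63, 65, 68]

z_score, p_value = rank_sum_test(module_ages, other_ages)
print(f"z = {z_score:.2f}, P = {p_value:.4f}")
```

A positive z indicates that the first group (here the module members) tends to be older than the second.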

Finally, in addition to tumor samples, our study samples include eight normal fallopian tube samples. Md-module 120 contains six samples, all of which are normal fallopian tube samples (enrichment *P* = 6.4 × 10^{−12} based on Fisher's exact test). This is an extreme example demonstrating that our modules can distinguish phenotypically distinct patient groups. A number of miRNAs, e.g. *mir-143, mir-145, mir-224* and *mir-424*, are reported to be down-regulated in ovarian carcinoma cells (53,54). Not surprisingly, all of them show high expression values in this module containing only normal samples.
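The enrichment of normal samples in md-module 120 can be checked with a short calculation. The one-sided Fisher's exact test for a 2 × 2 table is the upper tail of a hypergeometric distribution. The sketch below assumes that the 385 samples include the eight normal samples; this section does not state that explicitly. The function name `hypergeom_upper_tail` is illustrative.

```python
from math import comb

def hypergeom_upper_tail(total, labelled, drawn, hits):
    """P(X >= hits) when `drawn` samples are picked without replacement
    from `total` samples, of which `labelled` carry the label of interest."""
    upper = min(labelled, drawn)
    tail = sum(
        comb(labelled, k) * comb(total - labelled, drawn - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(total, drawn)

# Assumption: 385 samples in total, 8 of them normal fallopian tube samples.
# Md-module 120 contains 6 samples, all of them normal.
p_value = hypergeom_upper_tail(total=385, labelled=8, drawn=6, hits=6)
print(f"{p_value:.1e}")  # -> 6.4e-12
```

Under this assumption the calculation agrees with the reported enrichment *P* = 6.4 × 10^{−12}.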

## DISCUSSION

Recent technology has enabled the simultaneous multi-platform genomic profiling of biological samples, resulting in so-called multi-dimensional genomic data. With the rapid decline of sequencing costs, such data will soon accumulate rapidly. However, systematic analysis of such multi-dimensional data for discovering biologically relevant combinatorial patterns is currently lacking. A great number of tools have been developed for one- or, at most, two-dimensional data, many of which have been applied to genomic data analysis in the past. In this article, we adopted a powerful data analysis technique to address the sophisticated modular structures embedded in multi-dimensional genomics data, and we proposed the novel concept of md-modules.

Using the TCGA ovarian cancer dataset comprising gene expression, DNA methylation and miRNA expression in 385 samples, we showed that md-modules provide several unique insights. (i) By considering several different aspects of genomic modulation, md-modules can reveal perturbed pathways that would be overlooked with only a single type of data. (ii) An md-module identifies associations between different layers of cellular activity (e.g. DNA methylation, gene or miRNA expression), even if these associations exist only in a subgroup of samples. (iii) An md-module can identify clinically distinct patient (sample) subgroups that share subsets of multi-dimensional genomic features (methylation, gene expression, etc.). Cancer in particular is characterized by the existence of many subtypes with heterogeneous genetic origins, and one type of genomic feature is often not sufficient to characterize a clinical subgroup. We should note that the md-modules were constructed based on variable correlations/associations, which do not necessarily imply causal relationships among the variables. However, since many identified md-modules are of significant biological relevance, we believe that such modules can be a good starting point for uncovering the underlying causal mechanisms of gene regulation.

Identifying coordinated patterns across multiple regulatory layers is a vital step toward revealing the high-order organization of complex gene regulatory systems. In this study, we attempted to reveal the coordinated subspace patterns comprising the epigenetic, transcription and post-transcription levels, yet the real picture can be much more complex, given the many other levels of regulatory controls (e.g. copy number changes, SNPs, protein transport and localization). For example, gene copy number losses of *miR-210* have been found in ovarian carcinomas (55), and mutations in *p53* are the most common gene mutations in human cancer, including ovarian cancers (56). In future studies, it will be worthwhile to apply the proposed method to more data sources simultaneously, to uncover more sophisticated ‘factories’ that comprise many layers of regulatory factors.

## SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Figures 1–3, Supplementary Package and Supplementary Dataset.

## FUNDING

National Institutes of Health [R01GM074163]; the National Science Foundation [0747475] and the Alfred P. Sloan Fellowship (to X.J.Z.); the National Natural Science Foundation of China [11001256]; Innovation Project of Chinese Academy of Sciences (CAS), kjcx-yw-s7; the ‘Special Presidential Prize’—Scientific Research Foundation of the CAS; the Special Foundation of President of AMSS at CAS for ‘Chen Jing-Run' Future Star Program and the Foundation for Members of Youth Innovation Promotion Association, CAS (to S.Z.). Funding for open access charge: NSF07474750.

*Conflict of interest statement*. None declared.

## REFERENCES

**Oxford University Press**

Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Research. 2012 Oct; 40(19):9379.