Copyright 2006, The Feinstein Institute for Medical Research

Peeling Off the Hidden Genetic Heterogeneities of Cancers Based on Disease-Relevant Functional Modules

1 Department of Bioinformatics, Harbin Medical University, Harbin, China
2 School of Biological Science and Technology, Tongji University, Shanghai, China
3 Department of Molecular Cardiology and Department of Cardiovascular Medicine, the Cleveland Clinic Foundation, Cleveland, OH, USA

Address correspondence and reprint requests to Zheng Guo, Department of Bioinformatics, Harbin Medical University, Harbin, China 150086. Phone: +86-451-8661-5933; fax: +86-451-8666-9617; e-mail: guoz/at/ems.hrbmu.edu.cn; or Shaoqi Rao, Department of Molecular Cardiology/Office NB5-28, Lerner Research Institute/NB50, Cleveland Clinic Foundation, 9500 Euclid Ave., Cleveland, OH 44195. Phone: 216-444-0056; fax: 216-444-2682; e-mail: raos/at/ccf.org

*J.X. and Z.G. contributed equally to this study.

Received November 11, 2005; Accepted March 22, 2006.

Abstract

Discovering molecular heterogeneities in phenotypically defined disease
is of critical importance both for understanding pathogenic mechanisms
of complex diseases and for finding efficient treatments. Recently, it
has been recognized that cellular phenotypes are determined by the
concerted actions of many functionally related genes in modular fashions. The
underlying modular mechanisms should help the understanding of
hidden genetic heterogeneities of complex diseases. We defined a putative
disease module to be the functional gene groups in terms of both
biological process and cellular localization, which are significantly
enriched with genes highly variably expressed across the disease samples. As
a validation, we used two large cancer datasets to evaluate the
ability of the modules for correctly partitioning samples. Then, we sought
the subtypes of complex diffuse large B-cell lymphoma (DLBCL) using
a public dataset. Finally, the clinical significance of the identified
subtypes was verified by survival analysis. In two validation datasets, we
achieved highly accurate partitions that best fit the clinical
cancer phenotypes. Then, for the notoriously heterogeneous DLBCL, we
demonstrated that two partitioned subtypes using an identified module (“cellular
response to stress”) had very different 5-year
overall survival rates (65% vs. 14%) and that the partition was highly significantly (P < 0.007) correlated with the clinical survival rate. Finally, we built
a multivariate Cox proportional-hazard prediction model that included 4 genes
as risk predictors for survival over DLBCL. The proposed modular
approach is a promising computational strategy for peeling off
genetic heterogeneities and understanding the modular mechanisms of human
diseases such as cancers.

INTRODUCTION

Genetic heterogeneity describes the biological complexities whereby apparently
similar inheritable characters result from different genes or
different genetic mechanisms. In clinical settings, genetic heterogeneity
refers to the presence of a variety of genetic defects that cause
the same disease as defined in the current disease classifications
(1), a finding common to a list of complex human diseases such as cardiovascular
disease, cancer, diabetes, autoimmunity, psychiatric illness, and
many others, and even Mendelian disorders
(2). Genetic heterogeneity has profound influences on modern clinical practice
and biomedical research of common human disease. In the basic genomic
sciences, it is a thorny issue for genetic linkage analysis
(3,4), high-density admixture mapping of disease genes
(5), and microarray data analysis
(6). More accurate phenotyping of genetically heterogeneous samples, either by
explicitly modeling stratified population structure [for example, due
to racial difference
(7)] or by peeling off hidden genetic heterogeneity
(4), has been demonstrated to result in increased power to map disease genes. In
clinical practice, it is increasingly recognized that our current
categorization of human diseases still lumps together molecularly
distinct diseases (for example, cancers) with the same clinical phenotypes
(8). Because the clinical behaviors of some complex diseases such as cancers
cannot be accounted for completely by morphological or pretreatment
clinical characteristics, patients with the same phenotype, which might
be caused by different underlying molecular mechanisms, often show
different responses to drug treatment and have different prognoses. Thus, a
central challenge in studying complex diseases and improving the efficacy of their treatment
is to resolve the mechanisms of their molecular heterogeneity
(9). Genomic-scale molecular data rapidly accumulating from biomedicine domains
offer opportunities to peel off genetic heterogeneities at the molecular
level. The promise for molecular classification and discovery of
hidden disease subtypes has been realized in successful stratification
of diffuse large B-cell lymphoma (DLBCL)
(8), based on the expression profiles of thousands of genes measured by microarrays
and using computational clustering algorithms, an approach
aimed at defining genetically homogeneous novel cancer subtypes among
cancer patients. Although traditional clustering analysis of the expression
profiles of individual genes is a successful approach to discovering
disease subtypes, several significant shortcomings in this analysis
strategy remain. First, a traditional clustering analysis often groups
patients with overall similar gene expression profiles by the complete
set of thousands of genes represented on the arrays, and thus has
low ability to reflect the influence of the most disease-relevant genes. When
a large number of irrelevant or weakly relevant genes greatly
influence the clustering results, spurious structure of disease expression
patterns may appear out of the high dimensionality of data. Second, and
more important, because a traditional clustering analysis rarely
applies current knowledge of gene function to the groupings, the
biological relevance and interpretation of the patient groupings by the
traditional clustering analysis are often unclear. It is obviously of
great value if we can discover and elaborate disease subtypes at the
functional module level by explicitly yielding functionally compact gene
sets with coherent expression across cancer samples.

A module describes a biologically coherent set of genes that tend to express
and perform their highly integrated cellular functions in somewhat
isolated and interactive modular fashions
(10,11), a phenomenon that has inspired studies for elucidating the high-order
pathogenic mechanisms of complex diseases
(12,13). For example, Mootha et al.
(12) showed that the modest but coordinate disease-associated changes of a
set of functionally related genes could be identified even in the cases
where the expression of individual genes was not significantly different. Segal
et al.
(14) defined “modules” as biologically meaningful gene sets
that are conditionally activated or repressed across a wide variety
of cancer types, and identified some modules deregulated in cancer. Our
recent study demonstrated that cancer types can be precisely and robustly
classified based on functional modules enriched with differentially
expressed genes
(15). Nevertheless, no published study has fully exploited
the power and value of modular approaches for systematically dissecting
the molecular heterogeneities of human diseases.

Here, we propose a module-based clustering approach for dissection
of cancer heterogeneity by using the disease-relevant functional
modules. First, we selected differentially expressed genes under the disease
conditions. Note that tests such as the t test or F test are not appropriate for selecting genes when the disease is heterogeneous (contains subtypes), because
the validity of these tests relies on the current clinical
disease categorizations accurately describing the disease population structure,
i.e., on a lack of phenotypic heterogeneity. Hence, we
took a robust metric, the overall variability of gene expressions, to
guide gene selection. Genes with top-ranked expression variations
across samples, which explain most of the total variances potentially
contributed by known or unknown factors (for example, the hidden disease
subtypes), were selected as “feature genes.” This
metric has been adopted by several researchers for initial gene selection
(16,17). Then, we identified cellular-localized biological processes enriched
with feature genes as “putative signature modules.” Finally, we
partitioned samples to seek hidden disease subtypes using
the expression profiles of the genes annotated to these well-characterized
modules. As subcellular localization of genes and proteins is
a key functional characteristic determining their ability to interact
with other proteins and small metabolites in their local environment, we
characterized the modules in terms of biological processes and cellular
localizations based on Gene Ontology (GO)
(18). GO is a comprehensive ontological system describing gene functions in
three directed acyclic graphs: biological process (BP), molecular function (MF), and
cellular component (CC). In numerical analyses, we first
validated the proposed modular approach for accurately partitioning
cancer phenotypes using two publicly available large cancer datasets. Then, we
used the approach to explore the hidden subtypes of a notoriously
heterogeneous phenotype, DLBCL
(8). The results demonstrated that two partitioned subtypes using an identified
functional module had very different 5-year overall rates, and
the partition was highly significantly correlated with the clinical survival
rate.

MATERIALS AND METHODS

Description of Datasets

We used two large datasets to evaluate the goodness-of-fit performance
of the proposed modular approach. The liver cancer dataset
(19) consists of 23,075 cDNAs measured in 105 primary hepatocellular carcinoma (HCC) samples
and 76 normal liver tissues, a typical large disease-control
example. Because the HCC phenotype is specifically defined, it
can be reasonably assumed that the original tissue phenotypes were
well characterized. We further explored the reliability of the proposed
modular approach for partitioning between various types of cancers by
analyzing a classical multiple-class dataset, NCI60
(20), which consists of 9,703 cDNAs measured in 60 cell lines of 9 cancer
types. The data for non–small cell lung carcinoma and breast tumors
were not used in this study because of the possible existence of
heterogeneous hidden subtypes
(20) or misassigned labels
(21) for their samples. The data for prostate cancer were also excluded because
they consisted of only 2 samples. Thus a subset of the NCI60 data (41 samples
of 6 cancer types) was used in this study, including 8 samples
of renal cancer (RE), 7 of colon cancer (CO), 6 of leukemia (LE), 8 of
melanoma (ME), 6 of ovarian cancer (OV), and 6 of central nervous
system cancer (CNS). After evaluating its ability for accurately partitioning
the two diverse data structures, we used the proposed modular
approach to peel off the hidden subtypes of DLBCL, which has been
demonstrated to be notoriously heterogeneous
(20,22,23). The third dataset consists of 4,026 cDNAs measured in 42 DLBCL samples
(8). We verified the identified hidden partitions (DLBCL subtypes) by survival
analysis of the clinical profiles of patients in each molecular-based
partition.

For each of the above cDNA microarray datasets, we screened out clones
with missing data in more than 5% of arrays and applied a base-2 logarithmic
transformation. As in Alizadeh et al.
(8), we imputed remaining missing data with zeros. Each experiment was standardized
to zero median across the genes. The datasets of HCC, NCI60, and
DLBCL finally comprised 10,516, 6,748, and 2,751 genes, respectively.

Selecting Putative Signature Modules from Gene Ontology

Most current approaches to defining modules use only BP categorization
of GO. However, a BP category may actually encompass the genes involved
in distinct processes occurring in different cellular compartments, and
genes even within the same BP may show a clear expression distinction
with respect to their subcellular localizations
(24,25). Therefore, to identify modules containing the consistently coexpressed
genes potentially induced by the disease conditions, we sorted genes
of a BP category into CCs to form combined categories. For example, genes
whose protein products function in cell adhesion (BP) on membrane (CC) are
accommodated in a combined GO category. We referred to all
the measured genes annotated to at least one of the combined categories
as “annotated genes.”

For each dataset, the top x percent of genes with the largest expression variances were selected as
feature genes. Then, we used a hypergeometric distribution
(26,27) to calculate the probability P of a combined GO category containing the observed number of annotated feature genes
by chance; a smaller P value indicates stronger enrichment of feature genes
in the category. We selected categories with P ≤ 0.001 and kept the categories containing at least 30 feature
genes to retain enough data for clustering. Owing to the hierarchical
nature of the GO-structured categories, there are some redundancies in
the selected categories; for example, two combined categories may share the
same BP description while their CC descriptions stand in a general-specific
(for example, parent-child) relationship. In such a case, only the combined
category with the child category in the cellular component ontology was
retained, because its functional description is more specific. In
the following text, we refer to each such combined GO category as a “module” for
short. The identified modules should be statistically
robust to differences in the criteria for selecting feature
genes because the analysis results are determined by the joint statistical
behaviors of sets of genes
(27). We demonstrated the robustness of the modules by comparing the modules
identified at different top percentage levels (x = 10, 15, 20) of feature genes with the largest variances.

Clustering Samples Based on Individual Modules

For each identified module, we extracted the expression profiles of the
measured genes that were annotated to it. By agglomerative hierarchical
clustering
(28), each sample was initially assigned to its own cluster, then the distances
between all clusters were computed and the two clusters with the smallest
distance value were merged. Distance computation and merging were
repeated until there was only one cluster left. In this work, Pearson
correlation was used for the distance metric, and the centered average
linkage method was used for merging. For the purpose of evaluating
the modular clustering approach, we adopted the predefined cluster number
in the original data source for cutting the hierarchical tree
and allocating the samples into clusters. Because the expected value of
the Rand index is not constant for random partitions
(29), we used the adjusted Rand index (ARI)
(30) to measure the agreement between the identified clusters and the original
partitions (for example, the clinical sample labels). The expected
value of the ARI is 0 when the partitions are drawn randomly, and the
ARI is 1 when two partitions agree perfectly. A larger ARI indicates
a higher correspondence between the two partitions.

One general approach to assess the significance of the observed ARI for
a module might be to compare the ARI value with those of the same-sized
gene subsets randomly selected from the whole microarrays that contained
the genes (and their coexpressed ones) in the current modules. However, we
are more interested in finding whether the profiles of the
genes in the current modules were significantly better at clustering than
the gene groups randomly selected from a null (or contrast) population, in which
the genes had little or no functional relationship with the current
modules. It is well known that similarly expressed (coexpressed) genes
tend to share the same or similar functions
(31,32), and in fact the gene coexpression information is often used for predicting
gene functions
(33,34). Thus, we constructed the null gene population using the silence genes
among all the annotated genes from the original expression profiles, after
excluding (i) the genes annotated to the identified modules and (ii) the
genes significantly coexpressed with at least one gene in the
identified modules. Here, two genes were defined as coexpressed when
the absolute value of Pearson correlation coefficient (γ) of
their expressions was larger than a threshold corresponding to the significance
level P ≤ 0.005, determined by using 10,000 gene pairs randomly sampled
from the original expression profiles.

Then, for each identified module, 1000 gene subsets of the same size as
the module were randomly sampled from the null population. Applying the
same clustering procedure to the 1000 random gene subsets, we set the P value of the ARI of the module as the fraction of 1000 random subsets
having ARIs larger than that of the module. The P value based on such randomizations was used to assess whether the observed
ARI for a module was achieved by chance or, in a more specific sense, whether
the module was better at clustering (that is, more likely
relevant to the phenotypic partitions) than gene subsets that were less
likely to be of close functional relationship with the identified modules.

Clustering Based on Multiple Modules

Some samples may be misallocated when only one or a few modules are used. One
robust way to improve the partition is to decide the samples’ labels
by collective voting, fusing the results
from the individual modules. Here, for each sample, based on its membership
labels obtained from the different modules, we applied a simple majority
rule to determine the sample’s membership. If several classes tied for the
highest number of votes, we randomly assigned one
of the tied class labels to the sample.

Survival Analysis

To verify the clinical significance of the identified hidden DLBCL subtypes, we
estimated survival curves by the Kaplan-Meier product-limit method
and assessed the differences between the survival curves of the subtypes
of DLBCL patients by the log-rank test
(35). To construct a model for predicting the overall survival time, a univariate
Cox proportional-hazards model
(36) was used to determine the significance (at P ≤ 0.05) of the effects of the genes annotated to the identified
module or modules on the patients’ survival months. Subsequently, genes
passing the above threshold were reanalyzed using a multivariate
Cox proportional-hazards regression model, with the overall survival
months as the dependent variable. The Wald χ² test was used to determine the significance of each predictor’s
effect on survival time.

RESULTS

Validation of the Proposed Modular Approach Using Two Large Microarray Datasets

In the liver cancer dataset, we identified 41 combined categories significantly (P ≤ 0.001) enriched with the 10% top-ranked genes with the largest
expression variances. When two combined categories had the same
BP function description and their CC descriptions were of a general-specific
relationship, only the combined category with the more specific
CC description was retained. For example, the combined category, “BP: development” and “CC: cellular component,” was
removed because a more specifically defined module (“BP: development” and “CC: extracellular”) could
be identified. Note that the purpose of removing
redundant modules was to obtain a more compact
and more specific set of functional modules that would still provide sufficient information
for characterizing cancer samples. The redundant modules that
overlapped in one or two dimensions with the selected ones were often
highly enriched with feature genes, too, suggesting that they were also
good candidates for separating disease samples. For example, based
on the gene expression profiles in the module of “BP: development” and “CC: cellular component,” an ARI of 0.732 was
obtained.

After the redundancy treatment, six modules were left for the following
analysis. We used the expression profiles of the measured genes annotated
in each of the six modules to partition the samples. The clustering
results based on each of the six modules agreed well with the original
clinical labels, and the observed ARIs were 0.830, 0.871, 0.892, 0.790, 0.713, and 0.850. The average ARI for the six modules was 0.824 (± 0.065), and
the module “BP: cell growth and/or maintenance” occurring
at “CC: extracellular” achieved
the best results, with ARI 0.892. Then, for each module, we randomly
selected 1000 gene subsets of the same size as the module from the
null population as described previously. We found that no random subset
achieved an ARI larger than that of the corresponding module, so
all six modules were significantly (P < 0.001) better at clustering than randomly selected gene subsets. The
sample memberships assigned by the six individual modules show that
some samples were misallocated by one or more modules (Supplement 1). By
using majority rule clustering, which assigns the majority membership
labels to samples, we obtained an ARI of 0.934, where only 3 tumor
samples were misallocated (Table 1).
Accumulated biological experiments provided rich evidence to support the
roles of some key proteins annotated to the six modules. For example, it
has been reported that nucleoside transporters and glutamine transporters
are abnormally expressed in hepatoma cells
(37,38), supporting the relevance of the module “BP: transport” occurring
at “CC: integral to plasma membrane”
to HCC. The significant correlations of serum IL-8 levels with tumor
size and tumor stage
(39) suggest that two modules (“BP: G-protein coupled receptor protein
signaling pathway” and “BP: immune response,” both
occurring at “CC: extracellular region”) may
be directly or indirectly involved in the progression of HCC. Genes such
as vascular endothelial growth factor (VEGF), annotated to the modules “BP: cell development” and “BP: signal
pathway” occurring at “CC: extracellular
region,” have been suggested as diagnostic markers or prognostic
factors of HCC
(40). In addition, glypican 3 (GPC3) (in module “BP: G-protein coupled
receptor protein signaling pathway” occurring at “CC: cell”) has
been found to be both a marker for HCC and a target
for HCC therapy
(41).

Based on the NCI60 dataset, we identified 38 combined categories significantly (P ≤ 0.001) enriched with the 10% top-ranked genes with the
largest expression variances. After the redundancy treatment, seven
modules remained. All the ARIs of the seven modules were significantly
larger than those achieved by chance (P < 0.001 for six modules and P < 0.010 for one), and the average ARI of the seven modules was 0.607 (± 0.073). The
majority rule clustering approach achieved an
ARI value of 0.697. Detailed results are listed in Table 1 and Supplement 2. Numerous reports
(42,43) have documented the relationships between cancers and the seven selected
modules: cell communication (modules “signal transduction” and “cell-surface receptor linked signal transduction” and “cell adhesion”), immune response, cell
development (modules “development” and “morphogenesis”), and
so on.

In each dataset, based on the feature genes selected as the top 10%, 15%, and 20% ranked genes with the largest variances, the
identified modules largely overlapped, suggesting the robustness
of such modules to the differences of the thresholds for selecting
feature genes. In fact, for liver cancer, compared with the results
found when x = 10, two additional modules (“defense response” and “cell-surface receptor linked signal transduction”) were
identified when x = 15, and only one more module (“response to wounding”) was
identified when x = 20. Similar trends were found in the NCI60 dataset.

Among the top 150 genes (about the average size of the modules across this
study) with the largest variances in the liver cancer or NCI60 datasets, there
were 120 and 130 genes, respectively, co-expressed with at
least 1 gene in the identified modules at the significance level P ≤ 0.005, determined using 10,000 gene pairs randomly sampled from
the original expression profiles. Thus, we expect that the set of
the top-ranked genes could achieve good clustering results. In fact, the
ARIs for the set of the top-ranked 150 genes were estimated to be 0.871 and 0.728 for
the two datasets (liver cancer and NCI60), respectively, which
were comparable to those obtained using the majority rule
modular approach (Table 1).

Peeling Off the Hidden Genetic Heterogeneities of DLBCL

Molecular heterogeneity in DLBCL patients was extensively investigated
previously [for example,
(8,22,23)]. Inspired by the successful results for partitioning HCC and
NCI60 datasets, we then applied the proposed modular approach to uncover
the underlying molecular subtypes of DLBCL. Based on the DLBCL dataset, six
modules were identified, as shown in Table 2 and Supplement 3. The most significant module (annotated with 173 genes, and P ≤ 9.25E–07) was “GO:0006950: cellular response
to stress” occurring at “GO:0005623: cell.” Using
this module, two distinct DLBCL subtypes were discovered via unsupervised
clustering of the DLBCL patients based on the expression profiles
of the annotated genes. To elucidate the clinical implications of the
identified molecular module, we studied survival profiles for the two
subtypes (Figure 1A).
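
The survival profiles compared here rest on the Kaplan-Meier product-limit estimator named in the Methods. The sketch below is a minimal, hypothetical illustration of that estimator only (it does not perform the log-rank test); the follow-up times and event flags are invented toy values, not data from the DLBCL study, and the function names are assumptions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate.

    times  : follow-up time of each patient (e.g., months)
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve


def survival_at(curve, t):
    """Estimated probability of surviving beyond time t (step function)."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s


# Toy follow-up data (invented): months and event indicators for five patients.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
print(curve)
```

A 5-year overall survival rate for a subtype would correspond to reading such a curve at 60 months, for example with `survival_at(curve, 60)` when times are recorded in months.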
To explore a compact model for clinical use, we selected a gene subset
of high prediction power. A multivariate Cox proportional-hazards model
was used to analyze the genes in the module labeled “response
to stress.” To reduce the number of variables to be modeled, first, we
applied a univariate Cox proportional-hazards model to identify
the genes whose marginal effects on the overall survival time were
significant. Fourteen genes (BCL2, CHES1, ERCC5, HMGB2, IRF4, LY64, SMAD7, OGG1, RPA3, TNF, CD83, PDIR, TLK2, and FLJ10858) were found at the significance level of 0.05. Then, using the stepwise
variable selection option (with the same inclusion and exclusion P value of 0.05) for the multivariate Cox proportional-hazards regression
model
(36), we obtained 4 predictors (genes) (Table 3). Two of the 4 genes, B-cell CLL/lymphoma 2 (BCL2) and tumor necrosis factor (TNF), were previously reported as prognostic factors for lymphoma
(44,45). It is interesting to note that high-mobility group box 2 (HMGB2), a member of the nonhistone chromosomal high-mobility group protein family, conferred
a high hazard ratio (20.04, with 95% CI 3.53–113.88) (see Table 3). A previous study
(46) demonstrated that HMGB2 has the potential to control cell- and promoter-specific down- or upregulation
of the in vivo transcriptional activity of different members of the
p53 tumor suppressor gene family. The CD83 antigen, encoded by another predictor gene, was found at
elevated levels in 20% of chronic lymphocytic leukemia (CLL) patients and in 5 of 7 mantle
cell lymphoma (MCL) patients
(47), suggesting its functional and/or prognostic significance in hematologic
malignancies, particularly CLL and MCL.
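
To make the link between a reported hazard ratio and the Wald χ² test concrete, the following sketch backs out the log-hazard coefficient, its standard error, and an approximate Wald statistic from a hazard ratio and its 95% confidence interval. It is a consistency check only and rests on an assumption not stated in the text: that the interval is a symmetric Wald interval on the log scale, exp(β ± 1.96·SE). The function name is hypothetical, and the derived values may differ slightly from those in Table 3 because of rounding.

```python
import math


def wald_from_hazard_ratio(hr, ci_low, ci_high, z_crit=1.959964):
    """Recover beta, SE, Wald chi-square (1 df), and P value from HR and 95% CI.

    Assumes the CI was computed as exp(beta +/- z_crit * SE).
    """
    beta = math.log(hr)  # Cox coefficient: HR = exp(beta)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    wald_chi2 = (beta / se) ** 2
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(wald_chi2 / 2))
    return beta, se, wald_chi2, p_value


# Hazard ratio and interval reported in the text for HMGB2.
beta, se, wald_chi2, p_value = wald_from_hazard_ratio(20.04, 3.53, 113.88)
print(f"beta={beta:.3f}, SE={se:.3f}, Wald chi2={wald_chi2:.2f}, P={p_value:.4f}")
```

The wide interval reflects the large standard error that is typical of a model fitted to a small cohort (42 samples here), so the point estimate of the hazard ratio should be read with that uncertainty in mind.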
Please note that supplementary information is available on the Molecular
Medicine website (www.molmed.org).

DISCUSSION

In this article, we proposed a module-based clustering approach to find
disease subtypes based on modules defined by cellular-localized biological
processes. As evaluated by the liver cancer and NCI60 datasets, based
on a few measured genes in an individual module, the sample partitions
agreed well with the original clinical labels. We therefore suggest that
each disease-relevant module may depict one of the multiple functional
facets of the molecular pathogenic mechanisms. Further study of
the functional descriptions of the identified modules suggests that
these modules are explicitly relevant to the current understanding
of disease mechanisms and thus are appealing for dissecting the underlying
genetic heterogeneity of cancers at the modular level. It should
be noted that the proposed approach is also an efficient unsupervised
feature selection method that yields multiple feature gene sets (i.e., genes
annotated to the modules) of functional compactness. The genes
with top-ranked expression variations across samples are selected as
the initial feature genes
(16,17), and then are further filtered or organized by functional modules. In
general, because the selected feature genes by modular approach contain
both the gene expression signatures and the functional module signatures
of disease subtypes, they may provide functional guidance in experimental
investigation of the pathogenesis of the studied diseases.

Our results show that multiple 2-dimensionally characterized modules, used
individually or jointly, achieve comparably good partitioning
results, indicating that multiple molecular pathways may be involved
in the complex disease mechanisms. In addition, our previous study
(15) for classifying cancers using 1-dimensional (BP) characterization of
modules demonstrated that the modular approach to using the derived modular
functional expression profiles is a powerful and robust alternative
approach to analyzing high-dimensional gene profiles of cancers. Although
both 1- and 2-dimensional modular categorization can perform equally
well, we recommend using 2-dimensional (BP and CC) characterization
of modules to achieve more compact and detailed knowledge in both
functionality and cellular location, data that are more useful and revealing
for further experimental investigation (for example, by molecular
trafficking techniques).

In supervised classification, the choice of the best module or modules
for disease prediction should be relatively easy; because the sample labels
in the training set are given, the high accuracy rates of the classifiers
trained on the modules might be used to filter more specific and
critical modules highly relevant to disease pathogenesis. In unsupervised
clustering analysis, however, the ARI for evaluating a clustering
algorithm cannot be applied directly to choose the best module, because
no cross-validation can be done. Nevertheless, according to the results
in this study, some general guidelines can be given for choosing
one or more modules for clustering analysis of diseases. One way is to
focus on one or more biologically highly related modules to explore a
specific functional facet that may correspond to a unique genetic pathway. Although
this simple strategy may not get the highest ARI, it has
the advantage of focusing on specific disease mechanisms. Alternatively, disease
samples may be best partitioned based on “collectively
voting” from the identified modules, but with the loss of
the detailed functional characterization that each module provides. Another
way to use the information from the identified modules in a collective
manner is to put together all the measured genes contained in
the modules for clustering analysis. For example, the ARI values achieved
by this approach for the liver cancer and NCI60 datasets were 0.871 and 0.622, respectively (Supplement 4).

Some early studies attempted to find cancer subtypes based on expression
profiles of the genes grouped by a clustering algorithm
(8). The underlying assumption is that genes with similar expression patterns
are more likely to have similar biological functions, but a clustering
algorithm itself does not provide proof of the best grouping of
genes in terms of biological functions
(48). Thus, the biological interpretation of the disease clustering results
relies heavily on expert knowledge, which is often subjective
(49). Here we directly used an external annotation database such as Gene Ontology
to extract multiple functionally compact and coherent gene sets (modules). The
application of the proposed modular approach to peel off the
heterogeneity of DLBCL identified two hidden subtypes. In terms of the well-characterized
modular functionality and based on the significantly different survival
results for the patients defined by the two hidden subtypes, the
proposed computational approach is a feasible and promising toolbox
for peeling off molecular heterogeneities of complex human diseases.

In this study, we took the known cluster number suggested by preassigned
labels as the basis to assess the validity of the proposed approach. Although
the clustering results provided good fits to the known phenotypic
partitions, the assumption that the two
studied datasets lack hidden heterogeneity might not hold. Likewise, the problem of estimating
the correct number of clusters for peeling off hidden disease subtypes
is largely unsolved. Recently, some methods for obtaining the best
number of sample partitions by optimizing some validity indices have
been published
(50,51), which could provide additional insight into improving the proposed modular
approach. By its nature, an extension of the proposed modular approach
could also further refine the functional modules by integrating
multiple sources of functional information at different molecular levels.

ACKNOWLEDGMENTS

This work was supported in part by the National High Tech Development Project
of China (grant nos. 2003AA2Z2051 and 2002AA2Z2052), the National
Natural Science Foundation of China (grant nos. 30170515, 30370388, 30370798, 30570424, and 30571034), the 211 Project, the Tenth “Five-year” Plan, Harbin Medical University, and the Heilongjiang
Province Department of Education Outstanding Overseas Scientist grant (grant
no. 1055HG009).

Footnotes

Online address: http://molmed.org

REFERENCES

1. Rieger R, Michaelis A, Green MM. (1991) Glossary of Genetics: Classical and Molecular. Springer Verlag, Berlin, New York.
2. Krakow D, et al. Mutations in the gene encoding filamin B disrupt vertebral segmentation, joint formation and skeletogenesis. Nat Genet. 2004;36:405–10.
3. Altmuller J, et al. Phenotypic and genetic heterogeneity in a genome-wide linkage study of asthma families. BMC Pulm Med. 2005;5:1.
4. Shannon WD, Province MA, Rao DC. Tree-based recursive partitioning methods for subdividing sibpairs into relatively more homogeneous subgroups. Genet Epidemiol. 2001;20:293–306.
5. Hoggart CJ, Shriver MD, Kittles RA, Clayton DG, McKeigue PM. Design and analysis of admixture mapping studies. Am J Hum Genet. 2004;74:965–78.
6. Li X, Rao S, Wang Y, Gong B. Gene mining: a novel and powerful ensemble decision approach to hunting for disease genes using microarray expression profiling. Nucleic Acids Res. 2004;32:2685–94.
7. Patterson N, et al. Methods for high-density admixture mapping of disease genes. Am J Hum Genet. 2004;74:979–1000.
8. Alizadeh AA, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–11.
9. Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–7.
10. Hartwell LH, Hopfield JJ, Leibler S, Murray AW. From molecular to modular cell biology. Nature. 1999;402:C47–52.
11. Segal E, Friedman N, Kaminski N, Regev A, Koller D. From signatures to models: understanding cancer using microarrays. Nat Genet. 2005;37(Suppl):S38–45.
12. Mootha VK, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34:267–73.
13. Huang E, et al. Gene expression phenotypic models that predict the activity of oncogenic pathways. Nat Genet. 2003;34:226–30.
14. Segal E, Friedman N, Koller D, Regev A. A module map showing conditional activity of expression modules in cancer. Nat Genet. 2004;36:1090–8.
15. Guo Z, et al. Toward precise classification of cancers based on robust gene functional expression profiles. BMC Bioinformatics. 2005;6:58. doi:10.1186/1471-2105-6-58.
16. Ding CH. Unsupervised feature selection via two-way ordering in gene expression analysis. Bioinformatics. 2003;19:1259–66.
17. Dudoit S, Fridlyand J. Bagging to improve the accuracy of a clustering procedure. Bioinformatics. 2003;19:1090–9.
18. Harris MA, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32:D258–61.
19. Chen X, et al. Gene expression patterns in human liver cancers. Mol Biol Cell. 2002;13:1929–39.
20. Ross DT, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet. 2000;24:227–35.
21. Ellison G, Klinowska T, Westwood RF, Docter E, French T, Fox JC. Further evidence to support the melanocytic origin of MDA-MB-435. Mol Pathol. 2002;55:294–9.
22. Monti S, et al. Molecular profiling of diffuse large B-cell lymphoma identifies robust subtypes including one characterized by host inflammatory response. Blood. 2005;105:1851–61.
23. Rosenwald A, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002;346:1937–47.
24. Jimenez JL, Mitchell MP, Sgouros JG. Microarray analysis of orthologous genes: conservation of the translational machinery across species at the sequence and expression level. Genome Biol. 2003;4:R4.
25. Zhou X, Kao MC, Wong WH. Transitive functional annotation by shortest-path analysis of gene expression data. Proc Natl Acad Sci U S A. 2002;99:12783–8.
26. Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA. Global functional profiling of gene expression. Genomics. 2003;81:98–104.
27. Hosack DA, Dennis G, Jr, Sherman BT, Lane HC, Lempicki RA. Identifying biological themes within lists of genes with EASE. Genome Biol. 2003;4:R70.
28. Jain A, Dubes R. (1988) Algorithms for Clustering Data. Prentice Hall, New York.
29. Milligan GW, Cooper MC. A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behav Res. 1986;21:441–58.
30. Hubert L, Arabie P. Comparing partitions. J Classification. 1985;2:193–218.
31. Pavlidis P, Lewis DP, Noble WS. Exploring gene expression data with class scores. Pac Symp Biocomput. 2002:474–85.
32. Azuaje F, Bodenreider O. (2004) Incorporating ontology-driven similarity knowledge into functional genomics: An exploratory study (Abstract). IEEE Fourth Symp. Bioinformatics Bioengineering Taichung, Taiwan, p. 317.
33. Chen Y, Xu D. Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae. Nucleic Acids Res. 2004;32:6414–24.
34. Yu H, Gao L, Tu K, Guo Z. Broadly predicting specific gene functions with expression similarity and taxonomy similarity. Gene. 2005;352:75–81.
35. Altman DG. (1991) Practical Statistics for Medical Research. Chapman & Hall, London.
36. Cox DR. Regression models and life-tables. J R Stat Soc [B]. 1972;34:187–220.
37. Pastor-Anglada M, Felipe A, Casado FJ, del Santo B, Mata JF, Valdes R. Nucleoside transporters and liver cell growth. Biochem Cell Biol. 1998;76:771–7.
38. Bode BP, Souba WW. Glutamine transport and human hepatocellular transformation. JPEN. 1999;23:S33–7.
39. Ren Y, et al. Interleukin-8 serum levels in patients with hepatocellular carcinoma: correlations with clinicopathological features and prognosis. Clin Cancer Res. 2003;9:5996–6001.
40. Poon RT, Ho JW, Tong CS, Lau C, Ng IO, Fan ST. Prognostic significance of serum vascular endothelial growth factor and endostatin in patients with hepatocellular carcinoma. Br J Surg. 2004;91:1354–60.
41. Yamauchi N, et al. The glypican 3 oncofetal protein is a promising diagnostic marker for hepatocellular carcinoma. Mod Pathol. 2005;18:1591–8.
42. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. (2002) Molecular Biology of the Cell. Garland Publishing, New York.
43. Kufe DW, et al. (2003) Cancer Medicine. BC Decker, Hamilton, ON, Canada.
44. Gascoyne RD, et al. Prognostic significance of Bcl-2 protein expression and Bcl-2 gene rearrangement in diffuse aggressive non-Hodgkin’s lymphoma. Blood. 1997;90:244–51.
45. Pedersen LM, Jurgensen GW, Johnsen HE. Serum levels of inflammatory cytokines at diagnosis correlate to the bcl-6 and CD10 defined germinal center (GC) phenotype and bcl-2 expression in patients with diffuse large B-cell lymphoma. Br J Haematol. 2005;128:813–9.
46. Stros M, Ozaki T, Bacikova A, Kageyama H, Nakagawara A. HMGB1 and HMGB2 cell-specifically down-regulate the p53- and p73-dependent sequence-specific transactivation from the human Bax gene promoter. J Biol Chem. 2002;277:7157–64.
47. Hock BD, Haring LF, Steinkasserer A, Taylor KG, Patton WN, McKenzie JL. The soluble form of CD83 is present at elevated levels in a number of hematological malignancies. Leuk Res. 2004;28:237–41.
48. Gibbons FD, Roth FP. Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res. 2002;12:1574–81.
49. Rhodes DR, Chinnaiyan AM. Integrative analysis of the cancer transcriptome. Nat Genet. 2005;37(Suppl):S31–7.
50. Ben-Hur A, Guyon I. Detecting stable clusters using principal component analysis. Methods Mol Biol. 2003;224:159–82.
51. Bolshakova N, Azuaje F, Cunningham P. An integrated tool for microarray data clustering and cluster validity assessment. Bioinformatics. 2005;21:451–5.