![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||||||||||||
Copyright © 2008 Yi et al; licensee BioMed Central Ltd. Response projected clustering for direct association with physiological and clinical response data 1Department of Statistics, Seoul National University, Silim-dong, Kwanak-gu, Seoul, 151-747, Korea 2Division of Biostatistics and Epidemiology, University of Virginia, Charlottesville, VA 22908, USA Corresponding author.Sung-Gon Yi: skon/at/bibs.snu.ac.kr; Taesung Park: tspark/at/snu.ac.kr; Jae K Lee: jaeklee/at/virginia.edu Received July 9, 2007; Accepted January 31, 2008. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background Microarray gene expression data are often analyzed together with corresponding physiological response and clinical metadata of biological subjects, e.g. patients' residual tumor sizes after chemotherapy or glucose levels at various stages of diabetic patients. Current clustering analysis cannot directly incorporate such quantitative metadata into the clustering heatmap of gene expression. It will be quite useful if these clinical response data can be effectively summarized in the high-dimensional clustering display so that important groups of genes can be intuitively discovered with different degrees of relevance to target disease phenotypes. Results We introduced a novel clustering analysis approach, response projected clustering (RPC), which uses a high-dimensional geometrical projection of response data to the gene expression space. The projected response vector, which becomes the origin in the projected space, is then clustered together with the projected gene vectors based on their different degrees of association with the response vector. A bootstrap-counting based RPC analysis is also performed to evaluate statistical tightness of identified gene clusters. Our RPC analysis was applied to the in vitro growth-inhibition and microarray profiling data on the NCI-60 cancer cell lines and the microarray gene expression study of macrophage differentiation in atherogenesis. These RPC applications enabled us to identify many known and novel gene factors and their potential pathway associations which are highly relevant to the drug's chemosensitivity activities and atherogenesis. Conclusion We have shown that RPC can effectively discover gene networks with different degrees of association with clinical metadata. Performed on each gene's response projected vector based on its degree of association with the response data, RPC effectively summarizes individual genes' association with metadata as well as their own expression patterns. Thus, RPC greatly enhances the utility of clustering analysis on investigating high-dimensional microarray gene expression data with quantitative metadata. Background Microarray expression profiling has been widely applied to biological studies because of its ability to simultaneously examine tens of thousands of gene expression patterns. Microarray experiments have also proven to be quite useful for investigating associations between genes and physiological and clinical response measurements of many human diseases [1-3]. In particular, unsupervised learning techniques such as hierarchical clustering analysis have become one of the most commonly-used techniques for analyzing microarray data since these techniques can effectively summarize high-dimensional gene expression data in a two-dimensional color-coded cluster heatmap based on many genes' expression associations [4]. Several other clustering techniques such as k-means clustering, self-organization maps, and gene shaving have been used for microarray data analysis [5-7]. The main objective of these clustering analyses, however, has been to summarize the expression pattern associations among genes, but not the direct association with interesting physiological response data on study subjects. Several supervised learning and statistical modeling approaches have also been used to analyze the gene expression data along with other response variables such as treatment group variables [8], clinical response data such as survival times [9], and Bayesian regression modeling [10]. However, these approaches are often based on the dichotomization of quantitative response data, resulting in significant loss of information. Furthermore, these methods are mainly used for the discovery of gene factors and prediction models between different response groups and cannot provide high-dimensional association information between genes and response variables. In this study, we propose a novel clustering analysis approach, the so-called response projected clustering (RPC), which accounts for both the relationships among gene expression patterns themselves and their association with response data. This RPC approach is motivated by a relatively simple geometrical observation that a relevant response vector can be projected to each gene vector in their high-dimensional space to reflect each gene's association with the response data prior to the clustering analysis. For RPC analysis, all gene or response vectors are first standardized (so that the mean and variance are 0 and 1). The response vector is then projected into each gene so that its resulting projection resides in each gene's subspace proportional to the association strength with the response variable, not changing each gene vector's direction (so expression pattern) in the high-dimensional gene expression space (grey arrows in Fig. Fig.1a).1a
Avoiding unstable clustering patterns due to small changes of input data orders and heuristic clustering algorithms, we further refined RPC by using bootstrap-based counting measures in order to obtain robust clustering patterns with statistical significance. Our RPC method is applied to the drug activity data of in vitro growth inhibition by docetaxel and microarray data on the NCI-60 cancer cell lines [11] and the microarray study for atherogenic macrophage differentiation to foam cells [12]. Results Docetaxel chemosensitivity and microarray gene expression data on the NCI-60 cancer cell lines Docetaxel is one of the most widely-used anti-neoplastic chemotherapeutic compounds to treat various tumors such as breast, non-small cell lung, gastrointestinal (stomach), and prostate cancers [13]. Major target genes of docetaxel are known to be BCL2 and TUBB1. However, because this compound was originally derived from a natural extract (bark of the Pacific yew tree), its complete molecular chemosensitivity mechanisms and pathways are not completely understood [2]. In our current application, we use in vitro drug activity data of docetaxel on the NCI-60 cancer cell line panel, so-called GI50 (50% growth inhibition dose concentration in two-day assays) [11], together with publicly-available NCI-60 genome-wide expression profiling data of Affymetrix HG-U133A [14]. The NCI-60 cell line panel consists of nine cancer subtypes: lung, colon, breast, ovarian, leukemia, renal, melanoma, prostate, and central nervous system cancers. All microarrays are normalized by IQR-normalization which is a method that Q1 and Q3 of all microarrays have the same value [15]. Fig. Fig.22
We first identified genes that were strongly correlated with the GI50 values of docetaxel on NCI-60. Fig. Fig.33
RPC Analysis on NCI-60 Data After we standardized gene expressions and GI50 as described in the Methods section, we derived the response projected shrinkage factors between GI50 and gene expression data with 1 -
To examine the relationships among these selected genes themselves and with the drug sensitivity GI50 values, we performed the standard hierarchical clustering analysis with correlation distance (Fig. (Fig.5a),5a
For the above selected genes, we also obtained their Gene Ontology (GO) information and further analyzed them using GOstat for evaluating statistical significance of overrepresented functional and molecular mechanisms [17]. The majority were found to belong to the molecular component of intracellular membrane-bound organelle (GO:0043231): CAPG, CCNB1, CDKN1C, FEN1, GATA4, SNRPN, DGKZ, SMC2L1, NUP160, POLE3, RAB5B, PTPN1, and PRPF4. We found that one of the known target genes of docetaxel, TUBB1, belongs to the same intracellular organelle category even though this target itself was not significant on the NCI-60 data. Many of these genes have also been found to be quite relevant to carcinogenic mechanisms. For example, L1CAM, GATA4, CCNB1, CDKN1C, and FEN1 have been reported for their association with breast cancer: L1CAM was shown to inhibit the growth of breast carcinoma cells [18]. GATA4 was reported to regulate aromatase PII promoter activity in breast cancer cells [3]. cAMP-responsive gonad-type PII promoter ultimately leads to increased intratumoral estrogen production and tumor growth. CCNB1 was reported to be upregulated in MCF-IR20 breast cancer cells by microarray experiment and to significantly reduce the clonogenic survival of MCF-IR20 cells [19]. CDKN1C showed a loss of heterozygosity for 11p15.5. 11p15.5 is an important tumor-suppressor gene region showing loss of heterozygosity in Wilms tumor, rhabdomyosarcoma, adrenocortical carcinoma, and lung, ovarian, and breast cancer [20]. FEN1 was shown to be repressed in E2 in ER-positive breast cancer cells [21]. ABCB1 is especially found to be highly relevant to the docetaxel response. It was reported that resistance arose by the overexpression of drug efflux pumps including MDR1 (P-glycoprotein/ABCB1) when docetaxel is medicated [22]. PPM1A, SNRPN, RAB5B, and CAPG were also reported to be related to cancer [23-25]. We performed the bootstrapping-based RPC analysis as described in the Methods section in order to obtain more statistically consistent subclusters in Fig. Fig.5d.5d RPC analysis for PPARγ during macrophage differentiation in atherogenesis A microarray gene expression study was performed to identify novel atherogenic mechanisms involved in macrophage (MΦ) differentiation to foam cells at the University of Virginia [12]. In this experiment, human monocyte-derived macrophages (MDM) were incubated with different types of low density lipoproteins (LDL) conditions such as naïve LDL, oxidized LDL (OxLDL), and minimally modified LDL (mmLDL), which provide quite different microenvironments in atherogenesis. In this microarray experiment, peroxisome proliferator-activated receptor type γ (PPARγ), which plays important roles in atherogenesis and is a molecular target for pharmaceutical products such as Avandia® for treating cardiovascular complication among type 2 diabetic patients (GlaxoSmith Kline, Inc.), was found to be highly upregulated by OxLDL and naïve LDL during the macrophage differentiation to foam cells. This selective regulation again demonstrates that PPARγ is highly relevant to atherogenesis, necessitating more targeted investigation on this gene under its respective cellular environments. However, PPARγ, as a transcription factor has been found to interact with many different genes, and its complete pathway mechanisms in atherogenesis still need to be carefully investigated associated with this gene's expression patterns on different microenvironments. Thus, we applied our RPC approach to the macrophage differentiation microarray data as if the gene expression values of PPARγ were response data in order to find the gene networks closely associated with this gene factor (Fig. (Fig.6).6
PPARγ has also been reported to regulate many fatty acid factors during the form-cell formation including a group of fatty acid regulation genes such as CD36, ABCA1, apoE, and LPL [26]. In our RPC analysis, we could effectively identify their tight associations with PPARγ and discover novel gene factors such as CCL1 and IL1B which are also known to play a role in atherogenesis but have not been reported for any direct association with PPARγ. For example, the two transcripts of CD36 (correlation with PPARγ r = 0.65, 0.58) are tightly clustered with CCL1 (r = -0.84) which implies their close functional association in the opposite direction, or potentially inhibition. Note that these clustering results could not be observed by a standard clustering analysis. Overall, applying RPC to PPARγ expression patterns on the MΦ-differentiation microarray data, genes relevant to atherogenic PPARγ functions, e.g., LDL (blue) and OxLDL (red) groups move closer to the PPARγ gene whereas the groups of genes less relevant to PPARγ, e.g., mmLDL (turquoise) groups move away from it. Discussion We introduced a novel clustering analysis approach here – response projected clustering (RPC) that can simultaneously summarize associations both with important physiological response data and with gene expression patterns themselves. The RPC method effectively performs such an integrated analysis by directly projecting response data into the high-dimensional gene expression vectors. We believe that since clustering analysis plays a significant role in exploring coexpression patterns of a large number of genes in microarray profiling data, the RPC approach will be quite useful by examining such high-dimensional data simultaneously with those genes' association with the response data. Using a bootstrapping-based clustering measure, we also performed RPC analysis based on statistical significance of tightness of subclusters. RPC can be utilized in many different clustering analyses to investigate high-throughput biological profiling data together with relevant physiological response data if molecular signatures exist in the profiling data highly associated with the physiological response. It can also be used for a particular gene in microarray data to investigate the gene's associated groups of other genes. We, however, suspect that the degrees of molecular association with some response data such as patient long-term survival and outcome data may be weak and noisy, and careful understanding on such an association may improve the utility of the RPC technique. In our current study, we first applied RPC to the docetaxel drug activity and the microarray expression profiling data on the NCI-60 cancer cell lines. In this application, the selected genes (many of which were known for their carcinogenic mechanisms) were found to be quite tightly associated with DNA replication and cell cycle pathways. The above findings may suggest that this compound interferes with the DNA replication process in order to inhibit tumor cell growth; it will be interesting if the roles and functions of these genes are further investigated for their involvement in this drug activity to administer this chemotherapeutic compound more effectively in treating patients. We showed a different utilization of the RPC approach in our second application by using the expression values of a targeted gene factor, PPARγ, as response data in order to find other genes' expression patterns and networks closely associated with this gene. In this analysis, we were able to confirm many known genes as well as novel gene factors relevant to this target gene's functions and pathways in atherogenesis. In order to apply RPC to a subset of biologically relevant genes, we preselected genes that were differentially regulated between the experimental conditions of atherogenesis with FDR < 0.05; bigger FDR cutoff values resulted in much larger numbers of genes of which clustering results were less clear and difficult to interpret. This application demonstrates well that one can utilize RPC analysis in searching gene association networks on various contexts of genome-wide expression studies associated with a particular gene factor. In this study, we used 1 -
when xg1 = {xg11,...,xg1n} and xg2 = {xg21,...,xg2n} are the g1 and g2 gene vectors, respectively. The r1 is the correlation between the g1 gene vector and response vector and the r2 is between the g2 gene vector and response vector. We also note that several different clustering algorithms have been explored in our preliminary studies such as single, complete, average linkages (data not shown). While they show slightly different tree structures, the tightly clustered genes were found to be consistent. Thus, the clustering results presented here use the average linkage algorithm. Other forms of modification are certainly possible which may deserve a full comparison study both by simulation and practical application in a future study. More generally, RPC can be applied with different measures of association beyond correlation evaluation if the association between the biological profiling data and response data can be identified with a different measure, e.g. SNP data with linkage association scores. These different functions and algorithms need to be further investigated in the future. Also note that we introduced our RPC algorithm using hierarchical clustering but our RPC projection can be applied to other clustering algorithms such as k-means, SOM, and others. Finally, we note that RPC application will be more difficult if the degrees of molecular association are weak and noisy with some response data such as patient long-term survival and outcome data. In these cases, careful understanding on such association may improve the utility of the RPC technique. Conclusion We introduced a novel clustering analysis approach here – response projected clustering (RPC) – that can simultaneously summarize associations both with important physiological and clinical response data and with gene expression patterns themselves. RPC can be considered as an enhanced integration of the unsupervised learning with supervised learning techniques, effectively performing such an integrated analysis by directly projecting response data into the high-dimensional gene expression vectors. Using its simple projection transformation, the RPC approach allows one to effectively examine high-dimensional gene expression data simultaneously with relevant response data or with a specific gene target which would be extremely useful in many biomedical gene expression studies. Methods RPC shrinkage distance and analysis We assume all microarray data are IQR normalized (among different chips) prior to our analysis. Suppose there are n subjects and p genes on microarray profiling together with n subjects' response data y = {y1,...,yn}. Let xi = {xi1,...,xin} be an n-dimensional vector of the ith gene's expression, i = 1,...,p. We first standardize each of these response and expression vectors (so that the mean and variance are 0 and 1) to have the same scale (on a unit sphere). Denote the new standardized variables as: i = 1,...,p, j = 1,...,n. Note that the same notations are used for these standardized vectors as the original vectors because there is no loss of information after this standardization if pairwise distances are evaluated based on their co-expression (or association) patterns by, e.g., Pearson correlation for clustering analysis. For the projection of response data into gene variables, we then calculate the inner product between the standardized response vector and each standardized gene expression vector: The resulting inner product is the cosine value of the internal angle (in the n-dimensional space) between the response vector and each gene vector. Note that this value is thus the projected magnitude of the response vector to each gene vector; it is also the correlation between the two vectors. For example, if the two vectors have an inner angle close to 0° (or 180°) or a strong correlation, this value will be close to 1 or -1. Without changing its direction, each gene vector is then resized with RPC shrinkage factor si:
(The dark fractions of arrows in Fig. Fig.1a).1a If one wants to group both negatively and positively-correlated genes together as long as they are highly correlated to each other, the shrinkage factor can be obtained with a general monotone function f on the absolute magnitude of ri as:
For example, xi* = (1 - The clustering distance between two gene expression vectors, say xg1 and xg2, is thus calculated based on their Euclidean distance (Fig. (Fig.1b1b
where r1 and r2 are the inner products (or correlations) of xg1 and xg2 with the response vector y. Note that the original Euclidian distance was d(xg1, xg2) = ||xg1 - xg2|| prior to the RPC transformation. More discussions on this distance will be found later. Therefore, in this RPC analysis, the gene vectors highly associated with the response vector will have very short distances from the origin and consequently short clustering distances would be obtained between them. All other genes will be gradually clustered away from the response vector as their degrees of association with the latter weakens. The utility of this RPC approach can be illustrated in a simple example as below. For example, drug response data and seven other gene expression vectors are synthetically generated with correlation coefficients 0.87, 0.67, 0.87, 0.67, 0.31, 0, and -0.81 between drug response "Drug" vector and each of the seven gene vectors g1-g7 (also refer to Fig. Fig.11
Significance and consistency of RPC sub-clusters In order to evaluate the statistical significance of each RPC gene's association with the response data, we can generate random data directly using the original microarray data by repeatedly permutating sample identities in the response data. From such a permutation-based null distribution, we can evaluate the statistical significance of each observed di for the ith gene, i = 1,...,p, compared to dim, m = 1,...,M from M permutated samples: The false discovery rates (FDR) from these (empirical) p-values are then derived for the multiple test adjustment [27]. We use these FDR values for selecting genes for our final clustering analysis, e.g. FDR < 0.2. Due to the nature of its heuristic allocation algorithms, clustering analysis can often provide different groups of clustered genes with slightly different input data or even with different orders of genes. Statistical confidence evaluation on clustered gene groups has thus been suggested using resampling techniques such as bootstrap [28,29]. We also use a bootstrapping technique to assess the stability of our RPC clustering results among RPC selected, say, s genes. We obtain B bootstrapped samples of size n {z1b, ..., znb}, b = 1,...,B from the original n subjects (column vectors) {z1*, ..., zn*} with replacement where zj* = {x1i*, ..., xpi*} is the s-dimensional vector of the j-th subject. The consistency of sub-clusters of the s genes can be examined from these bootstrapped samples. For example, the probability that two genes belong to a common subcluster can be assessed by counting the frequencies of their co-clustering occurrences at a particular node, e.g. 75-percentile node of each cluster dendrogram (Fig. (Fig.88
At the q-percentile node of cluster dendrogram, an s by s counting table Cq can be constructed, each cell with the fraction of bootstrapped dendrograms in which each pair of genes were clustered together (Table 2). The higher the fraction, the more likely its corresponding pair of genes would cluster together. Therefore, using 1-Cq as a pseudo pairwise distance matrix, we can perform the cluster analysis for s genes. In general, we found that a 50-percentile node well reflects the consistent co-clustering patterns of genes. This bootstrap-counting clustering algorithm can thus be summarized in the following three steps:
(1) Generate bootstrapped samples of size n {z1b, ..., znb}, b = 1,...B. (2) Apply hierarchical clustering to each of the bootstrap samples. At the q-percentile node, construct a counting table Cq across B bootstrapped dendrograms. (3) Perform hierarchical clustering using a pseudo distance matrix 1-Cq. Thus, this bootstrap-based clustering can effectively summarize the statistical confidence on the tightness of gene clusters. Note that the height of a clustering dendrogram node then represents how strongly the members of the cluster are clustered; the closer to the bottom of the dendrogram tree, the tighter the elements in a cluster. Authors' contributions SGY performed the statistical analysis and drafted the manuscript; TP participated in data interpretation and helped to draft the manuscript; and JKL conceived the statistical method for analysis and finalized the manuscript. All authors read and approved the final manuscript. Acknowledgements This research was supported by the US NIH grant 5R01HL081690 of JKL and by the National Research Laboratory Program of Korea Science and Engineering Foundation (M10500000126) of TP. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||||||||||||
Endocrinology. 2005 Nov; 146(11):4905-16.
[Endocrinology. 2005]Proc Natl Acad Sci U S A. 1998 Dec 8; 95(25):14863-8.
[Proc Natl Acad Sci U S A. 1998]Proc Natl Acad Sci U S A. 1999 Mar 16; 96(6):2907-12.
[Proc Natl Acad Sci U S A. 1999]Proc Natl Acad Sci U S A. 2000 Aug 29; 97(18):10101-6.
[Proc Natl Acad Sci U S A. 2000]Bioinformatics. 2003 Apr 12; 19(6):694-703.
[Bioinformatics. 2003]N Engl J Med. 2002 Dec 19; 347(25):1999-2009.
[N Engl J Med. 2002]Proc Natl Acad Sci U S A. 2001 Sep 25; 98(20):11462-7.
[Proc Natl Acad Sci U S A. 2001]Physiol Genomics. 2007 Apr 24; 29(2):149-60.
[Physiol Genomics. 2007]Bioinformatics. 2003 Oct 12; 19(15):1945-51.
[Bioinformatics. 2003]Bioinformatics. 2004 Jun 12; 20(9):1464-5.
[Bioinformatics. 2004]Cancer Cell. 2003 Jul; 4(1):41-53.
[Cancer Cell. 2003]Radiat Res. 2001 Apr; 155(4):543-53.
[Radiat Res. 2001]Endocrinology. 2005 Nov; 146(11):4905-16.
[Endocrinology. 2005]Genomics. 1997 Nov 15; 46(1):9-17.
[Genomics. 1997]J Mol Endocrinol. 2005 Apr; 34(2):535-51.
[J Mol Endocrinol. 2005]Physiol Genomics. 2007 Apr 24; 29(2):149-60.
[Physiol Genomics. 2007]Physiol Genomics. 2007 Apr 24; 29(2):149-60.
[Physiol Genomics. 2007]Arterioscler Thromb Vasc Biol. 2004 Feb; 24(2):230-9.
[Arterioscler Thromb Vasc Biol. 2004]Proc Natl Acad Sci U S A. 2001 Jul 31; 98(16):8961-5.
[Proc Natl Acad Sci U S A. 2001]Biometrics. 2005 Mar; 61(1):10-6.
[Biometrics. 2005]