Copyright © 2008 The Author(s)

Unraveling transcriptional regulatory programs by integrative analysis of microarray and transcription factor binding data

Bioinformatics Unit, Branch of Research Resources, National Institute on Aging, NIH, Baltimore, MD 21224, USA

*To whom correspondence should be addressed.

Associate Editor: Joaquin Dopazo

Received March 12, 2008; Revised June 3, 2008; Accepted June 25, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Motivation: Unraveling the transcriptional regulatory program mediated by transcription factors (TFs) is a fundamental objective of computational biology, yet it remains a challenge.

Method: Here, we present a new methodology that integrates microarray and TF binding data for unraveling transcriptional regulatory networks. The algorithm is based on a two-stage constrained matrix decomposition model. The model takes into account the non-linear structure in gene expression data, particularly in the TF-target gene interactions, and the combinatorial nature of gene regulation by TFs. The gene expression profile is modeled as a linear weighted combination of the activity profiles of a set of TFs. The TF activity profiles are deduced from the expression levels of TF target genes, instead of directly from the TFs themselves. The TF-target gene relationships are derived from ChIP-chip and other TF binding data. The proposed algorithm can not only identify transcriptional modules, but also reveal regulatory programs, that is, which TFs control which target genes and in which way (either activating or inhibiting).

Results: In comparison with other methods, our algorithm identifies biologically more meaningful transcriptional modules relating to specific TFs. We applied the new algorithm to yeast cell cycle and stress response data. While known transcriptional regulations were confirmed, novel TF-gene interactions were predicted and provide new insights into the regulatory mechanisms of the cell.

Contact: zhanmi/at/mail.nih.gov

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Genes are coordinately expressed under tight regulation by transcription factors to carry out complex and condition-specific biological functions in living cells. It is critical to develop computational approaches for deciphering transcriptional regulatory programs, in order to elucidate the molecular mechanisms of development or disease or to identify biomarkers (Brunet et al., 2004; Hughes et al., 2000; Li and Zhan, 2006; Segal et al., 2004; Zhan, 2007).

Microarray gene expression data have been extensively used for identifying transcriptional regulatory modules. Various computational methodologies have been introduced for those studies, including projection (e.g. principal component analysis, singular value decomposition, independent component analysis) (Alter et al., 2000; Lee and Batzoglou, 2003; Liebermeister, 2002), model-based approaches (e.g. network component analysis, probabilistic sparse matrix factorization) (Dueck et al., 2005; Liao et al., 2003) and conventional clustering (e.g. hierarchical clustering, self-organizing maps, K-means) (Eisen et al., 1998; Tamayo et al., 1999; Tavazoie et al., 1999). The projection methods decompose the original data into components that are constrained to be either mutually uncorrelated or statistically independent, and cluster genes into mutually non-exclusive modules based on their loading in the components.
Since these methods do not cluster genes according to pairwise similarity, they can identify sets of coexpressed genes that are potentially co-regulated. Model-based approaches model microarray data as a linear mixture of latent variables that may correspond to specific biological sources. These methods usually incorporate prior knowledge of gene regulatory mechanisms as constraints for precisely estimating the model's parameters. For example, the probabilistic sparse matrix factorization approach uses the 'sparse' constraint in the matrix decomposition to provide a combinatorial account of gene expression in terms of a small set of factors. One challenge of such model-based approaches is the lack of sufficient data to estimate the parameters. Recent simulation studies suggest that transcriptional networks inferred from gene expression data alone can be considerably obscured by spurious interactions when the number of observations is small or the quality of the data is poor (Husmeier, 2003).

Several approaches, including GRAM (Bar-Joseph et al., 2003), COGRIM (Chen et al., 2007) and ReMoDiscovery (Lemmens et al., 2006), have been developed to infer transcriptional regulatory networks by integrating gene expression data with transcription factor (TF) binding information. These approaches allow identification of more functionally coherent regulatory modules than analyses utilizing microarray data alone (Bernard and Hartemink, 2005; Joung et al., 2006; Kim et al., 2006; Yu and Li, 2005; Zhou et al., 2005).

We recently developed a two-stage matrix decomposition method that combines the characteristics of projection and model-based approaches for the discovery of transcriptional modules (Li et al., 2007). In the present study, we extend the two-stage decomposition method to incorporate TF binding data for unraveling TF-mediated regulatory programs.
The new approach provides information not only on transcriptional modules, but also on which TFs control which target genes and in which way (either activating or inhibiting) in a regulatory program. Considering the highly non-linear interactions between TFs and their target genes, we first adopt a non-linear independent component analysis (NICA) method to reduce the non-linear distortion in the data and decompose the data into independent latent components. Next, we develop a constrained probabilistic sparse matrix factorization (cPSMF) approach that models the expression of each gene across the independent latent components as a linear weighted combination of the activity profiles of a small number of TFs. The model takes into account the combinatorial and sparse nature of gene regulation by TFs. By incorporating TF-target gene relationships derived from ChIP-chip data into the probabilistic sparse matrix factorization, the cPSMF approach infers the network structure in a more accurate and robust manner. Finally, we fine-tune the transcriptional network by selecting target genes whose promoter regions contain a sequence that matches the binding site of the corresponding TF.

In comparison with other methods, our algorithm shows better performance in identifying functionally coherent transcriptional modules relating to specific TFs. We demonstrate the usefulness of the new method in a case study on yeast cell cycle and stress response data. While known transcriptional regulatory interactions were confirmed, novel TF-gene links were also predicted, providing new insights into the regulatory network of the cell.

2 METHODS

2.1 The two-stage constrained matrix decomposition model

In the proposed model, TF-mediated transcriptional regulatory programs are inferred from the integrated results of microarray, ChIP-chip and TF binding motif data.
Suppose there is a microarray gene expression data matrix $\mathbf{X} \in \mathbb{R}^{N \times M}$ with $N$ genes and $M$ samples, and a configuration matrix $\mathbf{C} \in \{0,1\}^{N \times L}$ obtained from ChIP-chip data with $L$ TFs, where element $C_{il} = 1$ indicates that gene $i$ is regulated by TF $l$. We first take the NICA step to de-nonlinearize the microarray data into independent latent components. The model can be written as (Jutten and Karhunen, 2004; Lappalainen and Honkela, 2000; Li et al., 2007):

$$\mathbf{X} = F(\mathbf{S}\mathbf{A}) + \mathbf{N} \qquad (1)$$

where $\mathbf{S} \in \mathbb{R}^{N \times M'}$ denotes the independent latent source matrix and $\mathbf{A} \in \mathbb{R}^{M' \times M}$ is the mixing matrix. $M'$ is the number of latent sources. $\mathbf{N}$ is a white Gaussian noise matrix. $F(\cdot)$ is a non-linear mixing function, which is modeled using a multilayer perceptron (MLP) network with one non-linear hidden layer (Haykin, 1999). $\mathbf{S}$ can be obtained from Equation (1) using variational Bayesian learning (Lappalainen and Honkela, 2000) and the FastICA algorithm (Hyvarinen and Oja, 2000).

Next, we take the cPSMF stage to model the expression profiles of genes across the independent latent components as linear weighted combinations of $L$ TF activity profiles:

$$\mathbf{S} = \mathbf{Y}\mathbf{Z} \qquad (2)$$

where $\mathbf{Y} \in \mathbb{R}^{N \times L}$ is the weighting matrix, and $\mathbf{Z} \in \mathbb{R}^{L \times M'}$ is the matrix that contains the activity profiles of the $L$ TFs across the independent latent components. For each TF, we obtain an activity profile from the centroid of the expression profiles of its target genes across the independent latent components. The target genes of each TF are chosen from the configuration matrix $\mathbf{C}$, constructed from the ChIP-chip data. The $\mathbf{Y}$ matrix is inferred from Equation (2) by variational Bayesian learning (Dueck et al., 2005; Jordan et al., 1999) with constraints that
The proposed algorithm is summarized as follows: Input
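As an illustration of the second stage, the following minimal sketch computes each TF activity profile as the centroid of its target genes' profiles, as the text describes, and then forms the weighted combination of those profiles for each gene. The function names, the toy matrices and the flat profile assigned to a TF without targets are assumptions made for this example; the variational Bayesian inference of the weights is not shown.

```python
from statistics import fmean


def tf_activity_profiles(S, C):
    """Return Z (L x M'): row l is the centroid of the rows of S
    belonging to the genes i with C[i][l] == 1."""
    n_components = len(S[0])
    n_tfs = len(C[0])
    Z = []
    for l in range(n_tfs):
        targets = [S[i] for i in range(len(S)) if C[i][l] == 1]
        if not targets:
            # Assumption for this sketch: a TF with no targets gets a flat profile.
            Z.append([0.0] * n_components)
            continue
        Z.append([fmean([row[m] for row in targets]) for m in range(n_components)])
    return Z


def weighted_combination(Y, Z):
    """Return the matrix product Y Z: each gene's profile expressed as a
    weighted sum of TF activity profiles."""
    return [
        [sum(Y[i][l] * Z[l][m] for l in range(len(Z))) for m in range(len(Z[0]))]
        for i in range(len(Y))
    ]


# Toy data: 3 genes, 2 latent components, 2 TFs (all values hypothetical).
S = [[1.0, 2.0], [3.0, 4.0], [-1.0, 0.0]]
C = [[1, 0], [1, 1], [0, 1]]
Z = tf_activity_profiles(S, C)

# Hypothetical weights: a positive entry stands for activation,
# a negative entry for inhibition.
Y = [[1.0, 0.0], [0.5, -0.5], [0.0, 1.0]]
S_hat = weighted_combination(Y, Z)
```

In this toy example the weight of the second TF on the second gene is negative, which in the model's terms would be read as an inhibitory relationship.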

2.2 Biological assessment and visualization of inferred regulatory networks

To assess the biological relevance of the inferred regulatory network, we examined whether the identified TF-regulated transcriptional modules accumulate in certain Gene Ontology (GO) categories by conducting two different analyses: over-representation analysis and gene set enrichment analysis. The over-representation analysis was used to detect whether a GO term is enriched in a transcriptional network. Specifically, the hypergeometric probability p that a GO term is significantly enriched in a network is calculated as:
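As a minimal sketch, assuming the standard upper-tail hypergeometric test, the probability can be computed as shown below. The symbols N (annotated genes in the genome), K (genes carrying the GO term), n (genes in the module) and k (module genes carrying the term) are illustrative names chosen for this example and are not taken from the paper.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """Probability of observing at least k term-annotated genes in a module
    of n genes drawn without replacement from N genes, K of which carry
    the term: P(X >= k) for X ~ Hypergeometric(N, K, n)."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy numbers: 10 genes in total, 4 annotated with the term,
# a module of 3 genes of which 2 carry the term.
p = hypergeom_enrichment_p(N=10, K=4, n=3, k=2)
```

Integer arithmetic with `math.comb` avoids the rounding problems of factorial-based formulas, at the cost of speed for genome-sized inputs.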

The gene set enrichment analysis is based on a non-parametric test. For an arbitrary GO category, it evaluates whether the genes in the identified transcriptional modules that belong to the category are uniformly distributed or accumulated in a list sorted by some specific criterion (Backes et al., 2007). Since both the over-representation analysis and the gene set enrichment analysis were applied to many GO categories, we further conducted the false discovery rate (FDR) correction, which provides strong control to have fewer false negatives at the cost of a few more false positives. We took advantage of Cytoscape's versatile visualization environment (Shannon et al., 2003) to produce graphic representations of the resulting regulatory networks.

2.3 Experimental datasets

Two publicly available yeast microarray datasets were used for algorithm evaluation and case studies. The first was a cell cycle dataset, which was determined under the normal growth condition (Spellman et al., 1998). The second was a stress response dataset, determined under different experimental conditions such as temperature shocks, amino acid starvation, nitrogen source depletion and progression into stationary phase (Gasch et al., 2000). The datasets were normalized by zero transformation (Gollub et al., 2006), and missing data were filled in using the KNNimpute approach (Troyanskaya et al., 2001).

The yeast ChIP-chip dataset used for the studies was obtained from a previous study (Harbison et al., 2004). The dataset covers 203 TFs, all profiled in rich medium and 84 also profiled under multiple stress conditions. We used the TF-gene pairs that have P-values <0.05 to construct the configuration matrix C in our analysis.
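The thresholding step that turns ChIP-chip binding P-values into the binary configuration matrix C can be sketched as follows. The function name, the nested-list layout (genes in rows, TFs in columns) and the treatment of missing measurements as "not bound" are assumptions made for this illustration.

```python
def build_configuration_matrix(pvalues, threshold=0.05):
    """Return C with C[i][l] = 1 when the binding P-value of TF l at gene i
    is strictly below the threshold, and 0 otherwise.
    Missing values (None) are treated as 0, which is an assumption."""
    return [
        [1 if p is not None and p < threshold else 0 for p in row]
        for row in pvalues
    ]


# Toy P-values for 3 genes (rows) and 2 TFs (columns); values are hypothetical.
binding_pvalues = [
    [0.001, 0.20],
    [0.05, 0.04],
    [None, 0.90],
]
C = build_configuration_matrix(binding_pvalues)

# Number of target genes assigned to each TF.
targets_per_tf = [sum(column) for column in zip(*C)]
```

Because the comparison is strict, a pair with a P-value of exactly 0.05 is excluded, matching the "<0.05" criterion stated above.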

3 RESULTS AND DISCUSSION

3.1 Feature of the algorithm

The motivation of this work was to provide a mathematical framework for identifying condition-specific transcriptional regulatory networks by integrated analysis of microarray, ChIP-chip and TF binding motif data. An important feature of our method is the use of a two-stage data decomposition (NICA+cPSMF). The NICA transformation captures the non-linear structure in the data and represents the data with independent latent components. Inspired by the fact that gene expression is regulated by a small set of TFs that act combinatorially, the cPSMF models the expression profile of each gene, represented by the independent latent components, as a linear combination of the activity profiles of a small number of TFs. A configuration matrix (C matrix) is incorporated into the modeling as a constraint for precisely estimating the influence of TFs. The configuration matrix, derived from ChIP-chip data, is sparse. That is, there are only a few non-zero elements in the matrix, as each target gene can only be regulated by a small number of TFs. Through learning the parameters of the cPSMF model, the algorithm can infer activating and inhibitory regulatory relationships between TFs and their gene targets. The strength and direction of the transcriptional regulation that a TF exerts on its target genes are reflected by the weight of the TF in the Y matrix.

Most methods for inferring TF-regulated transcriptional modules are based on the assumption that there is a correlation at the mRNA expression level between TFs and their target genes (Kim et al., 2006; Zhu et al., 2002).
This assumption, however, is not always true, since the activation or inhibition of a target gene by a TF can be influenced not only by the mRNA expression of the TF and its targets, but also by post-transcriptional modification of the genes, as well as the concentration, post-translational modification and cellular localization of their protein products. For these reasons, our algorithm uses the expression patterns of TF target genes, instead of the TFs themselves, to deduce the activity profiles of TFs.

When applying our algorithm, we set the number of independent latent components equal to the number of experimental conditions for simplicity. For more accurate non-linear mapping, we set the number of hidden neurons in the MLP network to twice the number of independent latent components. We also set K, the maximum number of effective TFs bound on target genes, equal to two in our algorithm. The choice of the parameters α (weight cutoff) and β (PWM matching score cutoff) is important for the structure of the inferred network. Since ground-truth data are generally hardly available for condition-specific situations, we take a conservative approach in setting the parameters in our case studies. We set the weight cutoff α=0.05 and the PWM matching score cutoff β=0.94. Similarly conservative parameters are also adopted in other similar studies (Kim et al., 2006).

3.2 Comparison with other methods

To evaluate our algorithm, we compared its performance with that of other similar methods, including GRAM (Bar-Joseph et al., 2003), COGRIM (Chen et al., 2007) and ReMoDiscovery (Lemmens et al., 2006). These three methods can predict transcriptional modules that are coregulated by TFs through integrated analysis of microarray, ChIP-chip and TF motif data. GRAM is based on an iterative search, in which genes with common TF binding sites on their promoters are first identified using ChIP-chip data and the clustered gene sets are further refined by shared expression profiles.
COGRIM is derived from a Bayesian hierarchical model, while ReMoDiscovery is a non-iterative approach. While our method and COGRIM can infer activating or inhibiting relationships between TFs and their target genes, GRAM and ReMoDiscovery cannot predict such relationships.

We identified TF-mediated transcriptional modules using our method as well as the three other methods, based on the same set of yeast microarray (from both cell cycle and stress response), ChIP-chip and TF binding motif data (see Section 2.3). We chose 19 TFs that are involved in the cell cycle and stress response, and identified their target genes. We then examined the functional relevance of the target gene clusters based on the GO using over-representation analysis and gene set enrichment analysis. In the gene set enrichment analysis, the input set was sorted by the variance of the expression profiles of the target genes. Table 1 shows the statistical enrichment of functional GO terms in the target gene clusters based on these two analysis approaches. The enrichment level was calculated by transforming the enrichment P-values after FDR correction to negative log values and averaging over all functional modules with corrected P<0.05. If no functional modules are found with corrected P<0.05, the smallest value of corrected P is taken for calculating the enrichment level. As illustrated, our algorithm outperforms the other methods on the functional enrichment in the target gene clusters. The averaged enrichment level over all 19 clusters in the over-representation analysis was the highest for our algorithm (5.81), followed by ReMoDiscovery (5.40), COGRIM (5.22) and GRAM (4.96). The averaged enrichment level over all 19 clusters in the gene set enrichment analysis was also the highest for our algorithm (3.65), followed by COGRIM (3.42), GRAM (2.90) and ReMoDiscovery (2.64).
This implies that our algorithm can identify more functionally coherent gene clusters relating to specific TFs. In terms of the number of target genes identified, the average gene number per cluster was the highest for our algorithm (91), followed by COGRIM (85), ReMoDiscovery (74) and GRAM (32). Interestingly, our algorithm identified more target genes that are annotated with known functions, indicating that our approach provides more functional information about the transcriptional regulatory program. The better performance of our algorithm in comparison with the others indicates that the mathematical framework we chose for modeling the transcriptional regulatory program is appropriate.
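The enrichment level reported in Table 1 can be sketched as below. Two details are assumptions, because the text does not state them: the FDR correction is implemented here as the Benjamini-Hochberg step-up procedure, and the "negative log" is taken to base 10. Function names and the example P-values are hypothetical.

```python
from math import log10


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order.
    (Assumed FDR procedure; the paper only says 'FDR correction'.)"""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrichment_level(corrected_p, cutoff=0.05):
    """Average of -log10(P) over modules with corrected P < cutoff.
    If no module passes, fall back to the smallest corrected P."""
    significant = [p for p in corrected_p if p < cutoff]
    if not significant:
        significant = [min(corrected_p)]
    return sum(-log10(p) for p in significant) / len(significant)


# Hypothetical raw enrichment P-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.5]
corrected = benjamini_hochberg(raw)
level = enrichment_level(corrected)
```

With a different log base the absolute values change by a constant factor, but the ranking of methods by enrichment level would stay the same.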

Nevertheless, challenges still remain for identifying TF-regulated transcriptional programs. There are limited ChIP-chip data available for such analysis. The identification of TF binding sites on promoter sequences is often associated with high false positive or false negative rates. Moreover, ChIP-chip data and gene expression profile data are often generated under different experimental conditions. It is not clear how this difference affects the identification of condition-specific regulatory programs by integrated analysis of these data.

3.3 Case study: regulatory networks of the yeast

3.3.1 Cell cycle

We applied our algorithm to infer the transcriptional regulatory program of the cell cycle in yeast under the rich medium growth condition. The analysis was based on the microarray data and TF binding information determined by ChIP-chip and promoter sequence analysis (see Section 2.3). From the total of 203 TFs, we selected 32 TFs for the analysis by ranking the TFs according to the variance of their activity profiles. These TFs should be the top cell cycle regulated TFs. Cheng and Li recently proposed a two-step method to identify cell cycle regulated TFs by integrating microarray data with ChIP-chip data (Cheng and Li, 2008). We compared these 32 TFs with the putative cell cycle TFs identified by Cheng and Li's method, as well as with the TFs identified by Tsai et al. (2005). Among them, 15 are known cell cycle related factors from previous experiments and 8 are also suggested as putative cell cycle TFs by these studies (Cheng and Li, 2008; Tsai et al., 2005).

Figure 1
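
The TF selection step used in the case studies, ranking TFs by the variance of their activity profiles and keeping the top 32, can be sketched as follows. The function name, the TF labels and the use of the population variance are assumptions made for this example.

```python
from statistics import pvariance


def top_tfs_by_activity_variance(activity_profiles, top_k=32):
    """Rank TFs by the variance of their activity profile (largest first)
    and return the names of the top_k TFs.
    `activity_profiles` maps a TF name to its row of the Z matrix."""
    ranked = sorted(
        activity_profiles.items(),
        key=lambda item: pvariance(item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical activity profiles for three TFs across three latent components.
profiles = {
    "TF_A": [0.0, 0.0, 0.0],
    "TF_B": [1.0, -1.0, 1.0],
    "TF_C": [0.1, 0.0, 0.1],
}
selected = top_tfs_by_activity_variance(profiles, top_k=2)
```

A TF whose activity barely changes across components, such as the flat profile in the toy data, ranks last and is dropped first.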

3.3.2 Stress response

The inference of this regulatory network using our algorithm was based on the stress response microarray data under heat shock from 25°C to 37°C, along with 203 TFs and their genomic binding sites determined by ChIP and promoter sequence analysis (see Section 2.3). First, we chose the top 32 active TFs according to their ranked activity profiles. These 32 TFs include all stress response related factors that have been experimentally confirmed.

Figure 1

4 SUMMARY

In this study, we present a novel methodology for unraveling transcriptional regulatory networks by integrated analysis of microarray data, ChIP-chip data and TF motif information. The method is based on a two-stage constrained matrix decomposition model. The new method offers several advantages over previously published algorithms: (1) it takes into account the non-linear structure present in the data, particularly in the TF-target gene interactions; (2) the model considers the combinatorial nature of gene regulation by TFs; (3) it predicts not only TF-target interactions, but also whether each relationship is activating or inhibitory; (4) the model does not assume a correlation between TFs and their target genes at the mRNA expression level. We demonstrated the usefulness of the new method for the discovery of condition-specific regulatory networks in yeast. While known transcriptional regulations were confirmed, novel TF-target interactions were predicted and provide new insights into the regulatory mechanisms of the cell.

ACKNOWLEDGEMENTS

This study is supported by the Intramural Research Program, National Institute on Aging, NIH.

Conflict of Interest: none declared.