Deciphering modular and dynamic behaviors of transcriptional networks

Ming Zhan (corresponding author)
Bioinformatics Unit, Research Resources Branch, National Institute on Aging, NIH, 333 Cassell Drive, Baltimore, MD 21224 USA
Phone: +1-410-5588373, Fax: +1-410-5588674, Email: zhanmi/at/mail.nih.gov

Received February 9, 2007; Accepted April 13, 2007.
Copyright © Springer Science+Business Media B.V. 2007

Abstract

The coordinated and dynamic modulation or interaction of genes or proteins is an important mechanism by which a cell regulates its functions. Recent studies have shown that many transcriptional networks exhibit a scale-free topology and hierarchical modular architecture. It has also been shown that transcriptional networks or pathways are dynamic and behave only in certain, controlled ways in response to disease development, changing cellular conditions, and different environmental factors. Moreover, evolutionarily conserved and divergent transcriptional modules underlie fundamental and species-specific molecular mechanisms controlling disease development or cellular phenotypes. Various computational algorithms have been developed to explore transcriptional networks and modules from gene expression data. In silico studies have also been conducted to mimic the dynamic behavior of regulatory networks, analyzing how disease or cellular phenotypes arise from the connectivity or networks of genes and their products. Here, we review recent developments in computational biology research on deciphering modular and dynamic behaviors of transcriptional networks, highlighting important findings. We also demonstrate how these computational algorithms can be applied in systems biology studies of disease, stem cells, and drug discovery.
Keywords: Systems biology, Coexpression, Transcriptional module, Pathway dynamics, Transcriptional intervention, ModulePro, PathwayPro

Introduction

The coordinated and dynamic modulation or interaction of genes or proteins is an important mechanism by which a cell regulates its functions (Bar-Joseph et al. 2003; Hartwell et al. 1999; Ideker et al. 2001; Segal et al. 2004). Many transcriptional networks have been shown to exhibit a scale-free topology and hierarchical modular architecture (Barabasi and Bonabeau 2003; Ihmels et al. 2002; Jeong et al. 2000; Ravasz et al. 2002; Resendis-Antonio et al. 2005; Stuart et al. 2003; Tanay et al. 2004; van Noort et al. 2004). This implies that the networks are dominated by a few highly connected nodes (i.e., genes or proteins), which link the remaining, less connected nodes to the system. It also implies that genes often interact closely with each other to form transcriptional modules, some of which interact further to form larger modules, and that this process may continue on several different scales. Such a hierarchical modular structure is exemplified by the yeast transcriptional network, as shown in Fig. 1.
In this article, we review recent developments in computational biology research on deciphering modular and dynamic behaviors of transcriptional networks from microarray data, highlighting important findings. We also demonstrate how these computational algorithms can be applied in systems biology studies of disease, stem cells, and drug discovery.

Identification of transcriptional modules

Computational identification of transcriptional modules from microarray data has conventionally been conducted using clustering-based methods, such as hierarchical clustering, self-organizing maps, and k-means. Recently, different algorithms have been proposed to uncover biologically more meaningful transcriptional modules, which may be characterized by regulatory programs or by hierarchical and contextual modularity.

Segal et al. proposed a class of probabilistic graphical models for inferring regulatory modules from gene expression data (Segal et al. 2003). In this framework, a regulatory module is a set of genes that are regulated in concert by a shared regulatory program. The regulatory program specifies the behavior of the genes in the module as a function of the expression levels of regulators. The method identifies specific regulators for each module, their effects, and the experimental conditions under which the regulation occurs. This approach relies on the assumption that the expression levels of regulated genes depend on the expression levels of regulators. The method was shown to generate detailed, testable hypotheses about both regulatory modules and their control programs. Experimental results supported the computationally generated predictions and suggested regulatory roles for previously uncharacterized proteins.

Similarly, Bar-Joseph et al. described an algorithm that uses gene expression data and transcription factor binding data to discover transcriptional modules (Bar-Joseph et al. 2003).
The algorithm performs an exhaustive search over all possible combinations of transcription factors implied by the transcription factor binding data. Once a set of genes bound by a common set of transcription factors is found, the algorithm proceeds to find a smaller subset of genes that are coexpressed. The algorithm then seeks to add to the module further genes that are similarly expressed and presumably bound by the same set of transcription factors. The algorithm was applied to yeast expression data from over 500 experiments and to 106 yeast transcription factors profiled in rich medium conditions, and was shown to cluster genes and regulators efficiently and accurately.

Zhou et al. introduced an approach, termed second-order expression analysis, for the identification of transcriptional modules (Zhou et al. 2005). They defined first-order expression analysis as the extraction of expression patterns from one microarray data set. They then proposed second-order expression analysis as a study of the correlated occurrence of expression patterns across multiple data sets measured under different conditions. By analyzing yeast microarray data, they demonstrated that the second-order analysis could identify modules of genes with the same function yet without clear coexpression patterns. The approach could further reveal network relationships among different transcriptional modules.

Barkai’s group presented a method to assign genes to context-dependent and potentially overlapping regulatory units (Ihmels et al. 2004). They defined the transcriptional module as a self-consistent regulatory unit consisting of a set of co-regulated genes together with the experimental conditions that induce their co-regulation, and proposed an efficient iterative signature algorithm to identify such modules.
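The self-consistency idea behind the iterative signature algorithm can be illustrated with a much simplified sketch. It is not the published implementation: the fixed score thresholds, the plain averaging, the function name `isa_module`, and the toy data are all assumptions made for illustration. Starting from a seed gene set, the sketch alternately selects the conditions in which those genes respond and the genes that respond in those conditions, until the gene set stops changing.

```python
def isa_module(expr, seed_genes, cond_threshold=1.0, gene_threshold=1.0, max_iter=50):
    """Refine a seed gene set into a (genes, conditions) module.

    expr maps gene name -> list of normalized expression values, one per condition.
    Thresholds are illustrative assumptions, not values from the original method.
    """
    n_cond = len(next(iter(expr.values())))
    genes = set(seed_genes)
    conds = set()
    for _ in range(max_iter):
        # Condition signature: mean expression of the current gene set per condition.
        cond_score = [sum(expr[g][c] for g in genes) / len(genes) for c in range(n_cond)]
        conds = {c for c in range(n_cond) if abs(cond_score[c]) >= cond_threshold}
        if not conds:
            return set(), set()
        # Gene signature: mean expression over the selected conditions, with each
        # condition signed by the direction in which the gene set responds to it.
        sign = {c: 1.0 if cond_score[c] > 0 else -1.0 for c in conds}
        gene_score = {
            g: sum(values[c] * sign[c] for c in conds) / len(conds)
            for g, values in expr.items()
        }
        new_genes = {g for g, score in gene_score.items() if score >= gene_threshold}
        if not new_genes:
            return set(), set()
        if new_genes == genes:
            break  # self-consistent: genes and conditions select each other
        genes = new_genes
    return genes, conds


# Toy data (hypothetical): g1-g3 respond together in conditions 0 and 1.
toy_expr = {
    "g1": [2.0, 2.1, 0.0, 0.1, -0.1],
    "g2": [1.9, 2.2, 0.1, 0.0, 0.0],
    "g3": [2.1, 1.8, -0.1, 0.1, 0.1],
    "g4": [0.0, 0.1, 0.0, 2.0, 2.0],
    "g5": [0.1, -0.1, 0.1, 0.0, 0.0],
}
module_genes, module_conds = isa_module(toy_expr, {"g1"})
print(sorted(module_genes), sorted(module_conds))
```

Running the sketch from different seeds can return different, possibly overlapping, gene and condition sets, which is the property that lets this family of methods capture context-dependent modules.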
The proposed method is capable of revealing the hierarchical organization of transcriptional modules and of capturing overlapping modules in the presence of combinatorial regulation. The transcriptional modules identified by this method show higher biological coherence than those identified by conventional methods, as measured by the conservation of putative cis-regulatory motifs among four related yeast species.

A variety of matrix decomposition methods have been introduced for uncovering transcriptional modules from microarray data, including singular value decomposition (Alter et al. 2000; Holter et al. 2001), independent component analysis (Frigyesi et al. 2006; Lee and Batzoglou 2003; Liebermeister 2002), non-negative matrix factorization (Brunet et al. 2004; Gao and Church 2005; Kim and Tidor 2003; Lee and Seung 1999; Wang et al. 2006), network component analysis (Liao et al. 2003), and probabilistic sparse matrix factorization (Dueck et al. 2005). Recently, we presented a new matrix decomposition method, ModulePro, for transcriptional module discovery (Li et al. 2007b). The rationales behind our algorithm are: a) there may be nonlinear structure in transcriptional profiles, particularly between transcription factors and their target genes; and b) while many genes are involved in gene regulation, only a small set of genes (e.g., transcription factors or network hub genes) has a predominant impact on the expression patterns of most genes. The new method is based on a two-stage matrix decomposition of microarray data, as illustrated in Fig. 2.
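Of the matrix decomposition methods listed above, singular value decomposition is the simplest to illustrate. The sketch below is not ModulePro and not any of the cited implementations. It extracts only the leading singular triplet of a row-centered genes-by-conditions matrix by power iteration, so that genes with large loadings of the same sign can be read as one candidate module. The function names, the starting vector, and the toy matrix are assumptions.

```python
import math


def center_rows(matrix):
    """Subtract each gene's mean expression from its row."""
    return [[x - sum(row) / len(row) for x in row] for row in matrix]


def top_singular_triplet(matrix, n_iter=100):
    """Leading (gene loadings u, singular value sigma, condition pattern v)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Non-uniform start: a constant vector lies in the null space of row-centered data.
    v = [float(j + 1) for j in range(n_cols)]
    for _ in range(n_iter):
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the data")
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return [x / sigma for x in u], sigma, v


# Toy matrix (hypothetical): rows are genes, columns are conditions.
toy = center_rows([
    [1.0, 2.0, 3.0, 4.0],   # gene 0 follows the shared pattern
    [2.0, 4.0, 6.0, 8.0],   # gene 1 follows it with twice the amplitude
    [5.0, 5.0, 5.0, 5.0],   # gene 2 is flat
    [4.0, 3.0, 2.0, 1.0],   # gene 3 follows the opposite pattern
])
loadings, sigma, pattern = top_singular_triplet(toy)
print([round(x, 3) for x in loadings], round(sigma, 3))
```

In this toy case genes 0 and 1 receive loadings of one sign, gene 3 the opposite sign, and the flat gene a loading of zero. A linear decomposition of this kind cannot capture the nonlinear structure that motivates rationale (a) above.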
Cross-species analysis is important for identifying evolutionarily conserved and divergent transcriptional modules (Bergmann et al. 2004; Ihmels et al. 2005; Stuart et al. 2003; Zhou and Gibson 2004). We implemented an R-based program for the comparative analysis of transcriptional modules from two microarray data sets of different species (Zhan et al. unpublished). First, the genes in each microarray data set are clustered using a method such as self-organizing maps, k-means, or another clustering analysis. The clustering results from the two species are then compared, and conserved or divergent transcriptional modules are identified from the overlaps or non-overlaps between them. Using the program, we examined transcriptional profiles of embryonic stem cells (ESCs) and their earliest differentiated cells, embryoid bodies (EBs), from human and mouse (Sun et al. unpublished; see Fig. 3).
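The comparison step can be sketched as follows. This is not the R program mentioned above, whose details are not given here. It is a minimal Python illustration that assumes genes have already been mapped to shared ortholog identifiers, and that scores cluster overlap with the Jaccard index against an arbitrary cutoff. The function name, the cutoff, and the toy clusters are all hypothetical.

```python
def compare_modules(clusters_a, clusters_b, min_jaccard=0.5):
    """Pair clusters from two species by the overlap of their ortholog sets.

    clusters_a / clusters_b map cluster name -> set of ortholog identifiers.
    Returns (conserved pairs with their Jaccard index, unmatched clusters of species A).
    """
    conserved = []
    matched_a = set()
    for name_a, genes_a in clusters_a.items():
        for name_b, genes_b in clusters_b.items():
            union = genes_a | genes_b
            jaccard = len(genes_a & genes_b) / len(union) if union else 0.0
            if jaccard >= min_jaccard:
                conserved.append((name_a, name_b, jaccard))
                matched_a.add(name_a)
    divergent_a = sorted(set(clusters_a) - matched_a)
    return conserved, divergent_a


# Toy clusters (hypothetical ortholog identifiers, not real results).
human = {"H1": {"a", "b", "c", "d"}, "H2": {"e", "f", "g"}}
mouse = {"M1": {"a", "b", "c"}, "M2": {"e", "x", "y"}}
conserved, divergent_human = compare_modules(human, mouse)
print(conserved, divergent_human)
```

A cluster pair that passes the cutoff is a candidate conserved module. A cluster with no partner in the other species is a candidate divergent module, subject to the quality of the ortholog mapping and of the clustering itself.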
Analysis of gene coexpression

Transcriptional modules are made up of coexpressed or co-regulated genes. With the recent interest in genetic networks and modules, the study of gene coexpression has emerged as a novel holistic approach to microarray data analysis (Butte and Kohane 2000; Carter et al. 2004; Graeber and Eisenberg 2001; Lee et al. 2004; Stuart et al. 2003; van Noort et al. 2004). The coexpression of genes has conventionally been measured using Pearson’s correlation coefficient (Graeber and Eisenberg 2001; Lee et al. 2004; Stuart et al. 2003). This linear model-based coefficient provides a good first approximation of coexpression, but it has certain pitfalls: it cannot provide evidence of a directional relationship in which one gene is upstream of another, and it underestimates the degree of coexpression if the relationship between genes is nonlinear (Herrgard et al. 2003; Imoto et al. 2002). Mutual information is also used to measure gene coexpression (Basso et al. 2005; Butte and Kohane 2000; Margolin et al. 2006; Zhou et al. 2003), but it is not suitable for modeling directional relationships either. The coefficient of determination (CoD), on the other hand, measures how much better a combination of given genes (predictors) predicts the behavior of a target gene than a prediction made in the absence of the predictors. It is therefore capable of uncovering nonlinear coexpression relationships and suggesting their directionality (Dougherty et al. 2000; Hashimoto et al. 2004; Kim et al. 2002; Shmulevich et al. 2002a). Recently, we proposed a new algorithm, CoexPro, which is based on B-spline approximation followed by CoD estimation (Li et al. 2007a). The new algorithm requires no quantization of microarray data, thus avoiding the significant loss or misrepresentation of biological information that would otherwise occur in the conventional application of CoD (Dougherty et al. 2000; Hashimoto et al. 2004).
In comparison with the correlation coefficient and the conventional CoD, the new algorithm reveals gene coexpression with higher biological relevance. By uncovering both linear and nonlinear coexpression relationships and suggesting their directionality, the new algorithm provides a more biologically meaningful model of gene coexpression, which is particularly useful for determining connectivity and inferring topology in transcriptional network studies. We used CoexPro to analyze the coexpression of ligands and their corresponding receptors in lung cancer, prostate cancer, leukemia, and their normal tissue counterparts (Li et al. 2007a). As seen in Table 1, the analysis revealed many ligand-receptor pairs that showed different patterns of coexpression in cancer and normal tissues. Between the ligand BMP7 and its receptor ACVR2B, for example, CoD-B (the coexpression estimated by CoexPro) was 0.76 (P-value <0.028) in lung cancer and 0.00 (P-value <0.58) in normal samples, while R² (correlation coefficient) was 0.042 in cancer and 0.0012 in normal samples. This pattern suggests nonlinear coexpression in lung cancer but no coexpression in normal samples, and the possibility of negative feedback regulation in BMP7 and ACVR2B expression. Between the ligand CCL23 and its receptor CCR1, on the other hand, CoD-B was 0.85 in the normal tissue and 0.00 in lung cancer, and R² was 0.91 in the normal tissue and 0.054 in lung cancer. This pattern suggests strong linear coexpression in the normal lung tissue but no coexpression in cancerous lung samples. Similarly, CCL23 and CCR1 were highly coexpressed in normal prostate samples (CoD-B = 0.85) but not coexpressed in cancerous prostate samples (CoD-B = 0.0). However, CCL23 and CCR1 were coexpressed neither in leukemia samples (CoD-B = 0.0) nor in their normal tissue counterparts (CoD-B = 0.0). Thus, CCL23 and CCR1 show differential coexpression not only between cancerous and normal tissues, but also among different cancers.
The coexpression analysis using CoexPro sheds new light on the understanding of cancer development.
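The contrast between the correlation coefficient and the CoD can be made concrete with a small sketch. It is not CoexPro: in place of a fitted B-spline curve it uses a piecewise-constant predictor (bin means, which corresponds to the simplest, degree-0 spline basis), and it computes an in-sample CoD as the relative reduction in squared error over predicting the target by its mean. The function names, the number of bins, and the toy data are assumptions.

```python
import math


def pearson_r2(x, y):
    """Squared Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy) if vx > 0 and vy > 0 else 0.0


def cod_binned(predictor, target, n_bins=6):
    """In-sample CoD of target given predictor, using bin means as the predictor."""
    lo, hi = min(predictor), max(predictor)
    width = (hi - lo) / n_bins or 1.0
    groups = {}
    for p, t in zip(predictor, target):
        groups.setdefault(min(int((p - lo) / width), n_bins - 1), []).append(t)
    mean_t = sum(target) / len(target)
    err_without = sum((t - mean_t) ** 2 for t in target)   # error with no predictor
    if err_without == 0:
        return 0.0
    err_with = sum(
        sum((t - sum(g) / len(g)) ** 2 for t in g) for g in groups.values()
    )
    return (err_without - err_with) / err_without


# Toy data (hypothetical): the target depends on the predictor through y = x^2.
x = [-3.0 + 0.5 * i for i in range(13)]
y = [v * v for v in x]
r2 = pearson_r2(x, y)
cod_xy = cod_binned(x, y)   # how well x predicts y
cod_yx = cod_binned(y, x)   # how well y predicts x
print(round(r2, 3), round(cod_xy, 3), round(cod_yx, 3))
```

In this toy case the squared correlation is zero although y is fully determined by x, while the CoD is high from x to y and zero from y to x. The sketch thus reproduces both properties discussed above, sensitivity to nonlinear relationships and a hint of directionality. Because the estimate is computed in-sample it is optimistic, and a real analysis would also need a significance assessment such as the P-values reported in Table 1.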
Coexpression networks or relevance networks can be constructed by computing gene–gene associations across all genes in a microarray data set, using indices such as the correlation coefficient or mutual information (Basso et al. 2005; Butte and Kohane 2000; Carter et al. 2004; Davidson 2001; Stuart et al. 2003). Basso et al. described a statistical algorithm, ARACNE, for inferring pair-wise interactions among genes and constructing coexpression networks (Basso et al. 2005). ARACNE identifies statistically significant gene–gene interactions by mutual information and builds networks from the relationships that show a high probability of representing either direct regulatory interactions or interactions mediated by post-transcriptional modifiers. Using ARACNE, a regulatory network of human B-cells was recovered from expression profile data, showing a typical scale-free and hierarchical architecture.

Zhang and Horvath presented a general framework for constructing and analyzing gene coexpression networks (Zhang and Horvath 2005). They proposed using soft thresholding techniques to convert the gene coexpression similarity measure into a network connection strength and thereby construct a weighted network. The soft thresholding is based on the scale-free topology criterion, which yields networks with high biological significance. They also distinguished intra-modular connectivity from whole-network connectivity and showed that intra-modular connectivity was more strongly correlated with functional significance than whole-network connectivity. Using the method, coexpression networks of human and chimpanzee brains were constructed, from which transcriptional modules were identified that corresponded to the neuroanatomical structure of the brain (Oldham et al. 2006). Genes with the highest intra-modular connectivity were shown to be conserved between human and chimpanzee brains, underscoring the shared molecular bases of primate brain organization.
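As an illustration of soft thresholding, the sketch below raises the absolute correlation between each pair of genes to a power β to obtain a weighted adjacency, then sums the weights to obtain whole-network or intra-modular connectivity. It is a minimal sketch rather than the published software. The value β = 6 is an arbitrary assumption here, since in the framework described above β is chosen with the scale-free topology criterion, and the function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (0.0 if either profile is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def soft_threshold_adjacency(expr, beta=6):
    """Weighted adjacency a_ij = |cor(i, j)| ** beta for every gene pair."""
    genes = list(expr)
    return {
        g: {h: abs(pearson(expr[g], expr[h])) ** beta for h in genes if h != g}
        for g in genes
    }


def connectivity(adjacency, gene, members=None):
    """Whole-network connectivity, or intra-modular if module members are given."""
    weights = adjacency[gene]
    if members is not None:
        weights = {h: w for h, w in weights.items() if h in members}
    return sum(weights.values())


# Toy expression profiles (hypothetical).
toy_expr = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 10.0],   # correlation with g1 is 1
    "g3": [5.0, 3.0, 4.0, 1.0, 2.0],    # correlation with g1 is -0.8
    "g4": [1.0, 3.0, 2.0, 3.0, 1.0],    # correlation with g1 is 0
}
adj = soft_threshold_adjacency(toy_expr)
k_total = connectivity(adj, "g1")
k_intra = connectivity(adj, "g1", members={"g1", "g2"})
print(round(adj["g1"]["g3"], 4), round(k_total, 4), round(k_intra, 4))
```

Unlike a hard cutoff, the power function keeps every pair in the network but shrinks weak correlations sharply, so a correlation of 0.8 contributes a weight of about 0.26 at β = 6.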
Important differences in the cerebral cortex between the human and chimpanzee coexpression networks highlight the rapid expansion of this brain region in the human lineage. The results provide insights into the molecular bases of primate brain organization and demonstrate the general utility of gene coexpression network analysis.

Exploring dynamics of transcriptional networks

It is important to explore the dynamics of transcriptional coexpression or networks in response to disease development or changing cellular phenotypes. Various algorithms have been employed to explore these dynamics, including the conditional Markov chain model (Kim et al. 2002; Li and Zhan 2006), the probabilistic Boolean network (Shmulevich et al. 2002b), the liquid association model (Li et al. 2004), and a genomic-scale approach for network dynamics analysis (Luscombe et al. 2004).

Li et al. proposed a liquid association model for the systematic analysis of coexpression dynamics (Li et al. 2004). The model detects the association of a transcriptional increase or decrease of a gene Z with an increase or decrease in the transcriptional correlation between genes X and Y. The model was used to reveal how the enzymes associated with the urea cycle were expressed to ensure a proper mass flow of the involved metabolites in yeast, showing that the correlation between ARG2 and CAR2 changed from positive to negative as the expression level of CPA2 increased (Li et al. 2004).

Luscombe et al. developed an approach for a genomic-scale analysis of network dynamics (Luscombe et al. 2004). The approach combines well-known global topological measures, local motifs, and newly derived statistics, uncovering significant changes in the network architecture that are unexpected from random simulation.
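The liquid association idea described above can be sketched with a simplified estimator: after standardizing the three expression profiles, the score is the average of the product X·Y·Z, which is negative when the X–Y correlation falls as Z rises. The sketch below is an approximation rather than the published procedure. In particular it omits any normal-score transformation of the profiles, and the function names and toy data are assumptions. It also reports the X–Y correlation separately for samples with low and high Z as a more direct check.

```python
import math
import statistics


def zscores(values):
    """Standardize a profile to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    return sum(a * b for a, b in zip(zscores(x), zscores(y))) / len(x)


def liquid_association(x, y, z):
    """Simplified score: mean of the product of the three standardized profiles."""
    zx, zy, zz = zscores(x), zscores(y), zscores(z)
    return sum(a * b * c for a, b, c in zip(zx, zy, zz)) / len(x)


def split_correlations(x, y, z):
    """X-Y correlation among samples with low Z and among samples with high Z."""
    cut = statistics.median(z)
    low = [i for i, v in enumerate(z) if v <= cut]
    high = [i for i, v in enumerate(z) if v > cut]
    return (
        pearson([x[i] for i in low], [y[i] for i in low]),
        pearson([x[i] for i in high], [y[i] for i in high]),
    )


# Toy profiles (hypothetical): X and Y agree when Z is low and oppose when Z is high.
x = [-1, 1, -1, 1, -1, 1, -1, 1]
y = [-1, 1, -1, 1, 1, -1, 1, -1]
z = [-2, -1, -1, -2, 1, 2, 2, 1]
la = liquid_association(x, y, z)
r_low, r_high = split_correlations(x, y, z)
print(round(la, 3), round(r_low, 3), round(r_high, 3))
```

Over all samples the X–Y correlation in this toy data is zero, so an ordinary coexpression analysis would miss the relationship entirely, whereas the strongly negative score flags a correlation that switches from positive to negative as Z increases.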
An analysis of yeast gene expression data using this approach produced some interesting findings: a few transcription factors served as permanent hubs of the transcriptional network, whereas most factors acted transiently, only under certain conditions (Luscombe et al. 2004); and environmental responses facilitated fast signal propagation, whereas the cell cycle and sporulation directed temporal progression through multiple stages.

Recently, we developed an algorithm, PathwayPro, to mimic the dynamic behavior of transcriptional networks through a series of interventions made in silico on each gene or gene combination (Li and Zhan 2006). The inputs to the algorithm are experiment-specific regulatory network information and gene expression data. The outputs are the estimated probabilities of the behavior transition of a network in instances such as disease development, the aging process, or cell differentiation. The algorithm can answer two questions: 1) whether, and how much, a gene or external perturbation contributes to the behavior transition of a network across different conditions; and 2) in what specific ways this contribution is manifested. The PathwayPro analysis is particularly valuable for its ability to simulate in silico network behavior that may not be easy to recreate in vitro, and to generate hypotheses for further in vitro investigation. The potential clinical impact of such analysis is considerable, as it can not only open a window on the dynamic behavior of a pathway or on disease progression, but also translate into accurate diagnosis, drug discovery, and effective preventive and therapeutic intervention in disease. We used PathwayPro to examine the dynamic behavior of the BCR-ABL pathway in response to leukemia development, and to identify possible disease and drug targets for leukemia (Li and Zhan 2006).
In this case study, in silico transcriptional intervention was conducted on each gene (referred to as single-gene intervention), each combination of two genes (double-gene intervention), and each combination of three genes (triple-gene intervention) in this pathway. In each intervention, the observed expression of a gene was either altered to the opposite direction or left unchanged. The probability of the network behavior transition between the normal condition and the leukemia state was calculated under each transcriptional intervention. The probability of the network transition from the normal to the leukemia state suggests the disease susceptibility of the genes involved: the higher the probability, the more likely the gene or gene combination under a given intervention is responsible for disease development. The probability of the network transition from the leukemia to the normal state, on the other hand, suggests the potential usefulness of a drug or therapeutic intervention. Table 2 lists part of the analysis results. As shown, more genes and gene combinations had higher probabilities in the normal-to-leukemia network transition than in the leukemia-to-normal transition. This result suggests that humans are more likely to develop leukemia than to recover from it. It was also shown that transcriptional interventions involving the genes BCR and ABL yielded high probabilities for both the normal-to-leukemia and the leukemia-to-normal transition, whether in single-, double-, or triple-gene interventions (Table 2). The result suggests that BCR and ABL contribute the most to the network behavior transition between the normal condition and the leukemia state, and are therefore the genes most critical both to the development of leukemia and to the recovery from the disease to a normal condition. The two genes can thus serve as good drug targets for the treatment of leukemia.
This result, reached independently by the computational analysis, is in agreement with the conclusions of previous laboratory-based studies (Zou and Calame 1999). It has been shown that chronic myeloid leukemia (CML) is associated in most cases with the fusion of the genes ABL and BCR, and that the activation of BCR-ABL represses apoptosis and allows transformed cells to divide, resulting in the development of CML. The drug Gleevec is a selective BCR-ABL inhibitor that is effective in the treatment of CML (Druker et al. 2001). In addition, the PathwayPro analysis revealed that BAD and MYC played critical roles in leukemia development, while AKT appeared important in the recovery from leukemia to a normal condition, shedding new light on the understanding of leukemia.
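The logic of in silico intervention can be illustrated on a toy model. The sketch below is not PathwayPro and does not represent the BCR-ABL pathway. It assumes a hypothetical three-gene Boolean network with made-up update rules, a small per-gene random flip probability at each step, and arbitrarily chosen "normal" and "disease" states. For each single-gene intervention (flipping one gene in the normal state) it computes exactly the probability that the network reaches the disease state within a few steps.

```python
from itertools import product

GENES = ("A", "B", "C")   # hypothetical genes
NORMAL = (0, 0, 1)        # assumed stable "normal" expression state
DISEASE = (1, 1, 0)       # assumed stable "disease" expression state


def update(state):
    """Made-up regulatory rules: A sustains itself, A activates B, B represses C."""
    a, b, _c = state
    return (a, a, 1 - b)


def transition(state, noise):
    """Distribution over next states: apply the rules, then flip each gene with prob. noise."""
    expected = update(state)
    dist = {}
    for nxt in product((0, 1), repeat=len(state)):
        prob = 1.0
        for got, want in zip(nxt, expected):
            prob *= noise if got != want else 1.0 - noise
        dist[nxt] = prob
    return dist


def hit_probability(start, goal, steps=5, noise=0.01):
    """Exact probability of reaching goal from start within the given number of steps."""
    if start == goal:
        return 1.0
    dist, hit = {start: 1.0}, 0.0
    for _ in range(steps):
        nxt_dist = {}
        for state, mass in dist.items():
            for nxt, prob in transition(state, noise).items():
                if nxt == goal:
                    hit += mass * prob          # goal is treated as absorbing
                else:
                    nxt_dist[nxt] = nxt_dist.get(nxt, 0.0) + mass * prob
        dist = nxt_dist
    return hit


def intervene(state, gene):
    """Single-gene intervention: flip the expression state of one gene."""
    i = GENES.index(gene)
    return state[:i] + (1 - state[i],) + state[i + 1:]


scores = {g: hit_probability(intervene(NORMAL, g), DISEASE) for g in GENES}
baseline = hit_probability(NORMAL, DISEASE)
print({g: round(p, 3) for g, p in scores.items()}, round(baseline, 3))
```

In this toy network only the intervention on gene A drives the system to the disease state with high probability, which would mark A as the candidate susceptibility gene in the sense used above. Swapping the start and goal states gives the analogous score for the disease-to-normal transition.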
Closing remarks

Systems biology aims to elucidate how genes interact with each other to perform specific biological processes or functions, and how disease or cellular phenotypes arise from the connectivity or networks of genes and their products. The use of high-throughput data generated by microarray or other technologies provides scientists with a first step towards systems-level analyses of transcriptional networks, in particular their modular and dynamic behaviors. However, the current quality and coverage of high-throughput data sets impose various limitations on network studies. Recent studies suggest that regulatory networks learned from gene expression data alone can be considerably obscured by spurious interactions when the number of observations is small (Husmeier 2003). Integrating findings from multiple data sources (e.g., DNA sequences, gene and protein expression profiles, protein–protein interactions, protein structural information, and protein-DNA binding data) can overcome this drawback. Several research groups have demonstrated that the recovery of transcriptional networks from multiple types of data is more accurate than that from each data type alone (Bar-Joseph et al. 2003; Bernard and Hartemink 2005; Li et al. 2006). With continuing multidisciplinary efforts and further technological innovations in both data generation and computational methodology, we expect more effective exploration of transcriptional networks and systems biology studies of disease, cell development, and other biological phenomena.

Acknowledgements

This study was supported by the Intramural Research Program, National Institute on Aging, NIH.
Footnotes

Concise Summary: This article reviews recent developments in computational biology research on deciphering modular and dynamic behaviors of transcriptional networks, discussing important findings and demonstrating applications in systems biology studies.