Copyright © 2006, EMBO and Nature Publishing Group
Deciphering principles of transcription regulation in eukaryotic genomes
1 Department of Genetics, Harvard Medical School, Boston, MA, USA
2 Biosciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, USA
a Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, NBR 238, Boston, MA 02115, USA. Tel.: +1 617 335 8439; Fax: +1 617 432 6513; E-mail: dnguyen/at/genetics.med.harvard.edu
Received April 8, 2005; Accepted February 8, 2006.
Abstract
Transcription regulation has been responsible for organismal complexity and diversity in the course of biological evolution and adaptation, and it is determined largely by the context-dependent behavior of cis-regulatory elements (CREs). Therefore, understanding principles underlying CRE behavior in regulating transcription constitutes a fundamental objective of quantitative biology, yet these remain poorly understood. Here we present a deterministic mathematical strategy, the motif expression decomposition (MED) method, for deriving principles of transcription regulation at the single-gene resolution level. MED operates on all genes in a genome without requiring any a priori knowledge of gene cluster membership, or manual tuning of parameters. Applying MED to Saccharomyces cerevisiae transcriptional networks, we identified four functions describing four different ways that CREs can quantitatively affect gene expression levels. Three of these functions have extrema at different positions in the gene promoter (short-, mid-, and long-range), and the fourth depends on the motif orientation; all four are validated by expression data. We illustrate how nature could use these principles as an additional dimension to amplify the combinatorial power of a small set of CREs in regulating transcription.
Keywords: computational method, matrix factorization, MED, principles of transcription regulation, transcriptional regulatory networks, yeast
Introduction
Transcription is the first step in the universal pipeline of the biological information flow from genome to proteome. Accordingly, the regulation of transcription is critical for the development, complexity, and homeostasis of all living organisms (Davidson, 2001; Levine and Tjian, 2003). Although transcription can be regulated at different levels (e.g., chromatin structure level), one fundamental level, first discovered by Jacob and Monod (Jacob and Monod, 1961), is that the production of transcripts of a given gene is governed by a complex combinatorial interplay of cis-regulatory elements (CREs) (henceforth referred to as motifs) present in the gene's promoter region, and associated transcription factors (henceforth referred to as regulators) present in the cellular environment. Because regulators are gene products, their production is in principle also controlled by motifs. Therefore, transcription of a gene is fundamentally regulated by the motif set present in that gene's promoter, acting as the gene's condition-independent signal receivers. The set of functions describing the dependency of motif strength (the quantitative level of a motif's influence on gene expression) on promoter context constitutes the set of principles of transcription regulation. Major efforts have been made in identifying motifs in different species using a variety of approaches (McGuire and Church, 2000; McGuire et al, 2000; Guhathakurta et al, 2002a, 2002b; Siggia, 2005; Tompa et al, 2005; Xie et al, 2005). Of those organisms, the yeast Saccharomyces cerevisiae has gained the most attention owing to the availability of multiple yeast genomes and high-quality mRNA.
In fact, many methods developed for finding motifs and determining condition-dependent motif (or associated transcription factor) activity have used yeast as the model organism (Roth et al, 1998; Tavazoie et al, 1999; Bussemaker et al, 2000, 2001; Hughes et al, 2000; Wang et al, 2002; Conlon et al, 2003; Liao et al, 2003; Segal et al, 2003; Gao et al, 2004; Pritsker et al, 2004; Tompa et al, 2005) (also see Siggia, 2005 for a more complete list of references). However, less attention has been paid to the effects of motifs on gene expression as a function of their promoter context, and such effects remain poorly understood. Works by Pilpel et al (2001) and Sudarsanam et al (2002) studied the effect of motif cooccurrence on gene expression by measuring the degree of coexpression within the set of genes containing motif combinations of interest. Although their work could infer the combinatorial effects of motif–motif interactions on gene expression, it did not address how such effects are influenced by other factors that determine the properties of the promoter context, such as geometric constraints. A recent study by Beer and Tavazoie (2004) began to take geometric features into account by way of a Bayesian network model of yeast expression profiles in order to learn the effect of motif position and orientation on gene expression. Although this latter approach works quite well, it does not consider the individual expression pattern of each single gene, but instead analyzes the expression profiles of gene clusters, a process that can potentially cause loss of information and may not be suitable for modeling genes in the genome that do not belong to any well-defined cluster.
Because the common assumption underlying these works is that coexpression implies coregulation, these approaches are limited by the need to detect motif influence from statistically aggregated expression data rather than from individual genes. This typically restricts their application to subsets of genes with large gene expression signals, or those in predefined clusters, or with specific promoter properties. Furthermore, although the metrics employed in these works for measuring the degree of gene coexpression, namely expression coherence (Pilpel et al, 2001; Sudarsanam et al, 2002) or the average of pairwise correlation (Beer and Tavazoie, 2004), can infer the effects of motifs on gene expression well, such metrics do not provide a direct quantitative measure of motif influence on gene expression. In this article, we present a deterministic mathematical strategy, the motif expression decomposition (MED) formalism, whose framework provides just such a quantitative measure: motif strength. MED operates on all genes in the genome of a particular organism under consideration, and assigns a strength to each motif in the promoter of each individual gene, without depending on averaging or clustering of gene expression profiles. Motif strength as a function of promoter context can then be derived using the concept of gene ensemble and gene ensemble instance illustrated in Figure 1.
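The gene ensemble idea can be sketched in a few lines, assuming per-gene motif strengths have already been computed. The sketch groups the genes that contain a motif into instances by the motif's distance from the start codon, then reports the mean strength and standard error for each instance. The function name `ensemble_profile`, the 100 bp bin width, and the `(distance, strength)` input format are illustrative assumptions and are not taken from the article.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def ensemble_profile(occurrences, bin_width=100):
    """Average motif strength per promoter-position bin.

    occurrences: iterable of (distance_upstream_bp, motif_strength) pairs,
    one per gene in the ensemble (hypothetical input format).
    Returns {(bin_start, bin_end): (mean_strength, standard_error)}.
    """
    bins = defaultdict(list)
    for distance, strength in occurrences:
        # Each bin is one "gene ensemble instance" in this sketch.
        bins[distance // bin_width].append(strength)

    profile = {}
    for index, values in sorted(bins.items()):
        # The standard error needs at least two genes in the instance.
        if len(values) > 1:
            error = stdev(values) / sqrt(len(values))
        else:
            error = float("nan")
        start = index * bin_width
        profile[(start, start + bin_width)] = (mean(values), error)
    return profile


# Toy ensemble: three genes carrying the same motif at different positions.
toy = [(50, 1.0), (80, 3.0), (150, 2.0)]
for interval, (avg, err) in ensemble_profile(toy).items():
    print(interval, avg, err)
```

Other promoter properties, such as motif orientation, could be used as the grouping key in the same way.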
Results and discussion
The MED computational framework for deriving principles of transcription regulation
From the physical standpoint, the effect of a given motif on gene expression (its motif strength) must depend on its context, such as its exact sequence, geometry (i.e., location or orientation), and cooccurrence with other motifs, simply because these parameters underlie the physical nature of the complex combinatorial interactions between motifs and regulators at the atomistic level for regulating transcription. Similar to the concept of the potential of mean force in statistical mechanics (McCammon and Harvey, 1987), each of these attributes of the motif context can be considered as a reaction coordinate along which the observed motif strength, a multivariable function, can be projected. To this end, we propose the concept of gene ensemble and gene ensemble instance (Figure 1
Transcriptional regulatory principles derived from S. cerevisiae transcriptional networks
We applied MED to yeast S. cerevisiae transcriptional networks with a combined gene expression data set covering 255 conditions involving different environmental stresses (Gasch et al, 2000) and multiple stages of the cell cycle (Spellman et al, 1998). We used crossvalidation (see Materials and methods) as an unbiased way to measure MED's ability to fit the biological data contained in the data set. We obtained an average correlation coefficient of 0.52 (Figure 2
As a proof of concept that MED is capable of deriving principles of transcription regulation, we chose in this study to focus primarily on motif position and orientation with respect to the start codon, two geometric constraints known to play a role in gene coexpression (Beer and Tavazoie, 2004). We found that motifs do not always have the same level of influence on gene expression simply owing to their presence in the gene promoter, nor do they exert the largest influence on gene expression when they are near the start codon in yeast; rather, their influence follows a function of a complex shape. Here we illustrate the existence of four functions of motif strength (Figure 3
The PAC and RRPE motifs (Tavazoie et al, 1999; Hughes et al, 2000), which are found in promoters of genes encoding ribosomal proteins, are examples of the short-range motifs (Figure 3A 0.01), and significantly higher than at positions further upstream ($P_{\text{Wilcoxon}} < 10^{-16}$ for PAC, $< 1.83 \times 10^{-5}$ for RRPE) (see Materials and methods section for definitions of $P_{\text{shuffling}}$ and $P_{\text{Wilcoxon}}$). To validate this form of regulatory principle, we computed the corresponding function describing the dependency of the degree of gene coexpression, as measured by the average pairwise expression correlation (Beer and Tavazoie, 2004) (i.e., the average of the expression correlation coefficients of all gene pairs in a given gene set), on motif position for the same set of gene ensemble instances (Figure 3B
The MCB motif (Koch et al, 1993), which plays a role in DNA synthesis and replication during the S1 phase of the cell cycle, is an example of the mid-range motif (Figure 3C 0.01). Furthermore, the MCB motif also exhibits a small degree of orientational effect around the position of its maximum strength; hence, it may also weakly belong to the orientation-dependent motif type ($P_{\text{Wilcoxon}} < 0.1$). These predicted forms of regulatory principles are validated by expression data (Figure 3D
Finally, the RAP1 motif (Lascaris et al, 1999), which controls the production of ribosomal proteins, is an example of the long-range and orientation-dependent motif (Figure 3E 0.01). In addition, it has a clear preferential orientation for regulating gene expression over almost the entire promoter length ($P_{\text{Wilcoxon}} < 4.84 \times 10^{-8}$). As with the PAC, RRPE, and MCB motifs, these predicted forms of regulatory principles for RAP1 are also validated by expression data (Figure 3F
Biological relevance
To cope with both a myriad of environmental conditions and the internal complexity of cellular functions, eukaryotes are known to employ combinatorial strategies to generate a variety of expression patterns from a relatively small set of regulatory motifs (Kellis et al, 2003; Levine and Tjian, 2003). The combinatorial potential has been understood primarily in terms of motif cooccurrence and synergy. However, the transcriptional regulatory principles described here suggest several avenues of research into how nature may also exploit motif geometry as another dimension of combinatorial power for regulating transcription. For instance, given the observation that PAC motif strength varies along the length of the promoter, we foresee an experiment that explores the effect of PAC motif location on a reporter gene in relation to the hypothesis that reporter expression level should vary as indicated in Figure 3A
To further illustrate the potential use of motif geometry by nature for regulating transcription, we use MED to dissect the appearance of synergism between the PAC and RRPE motifs (Pilpel et al, 2001; Sudarsanam et al, 2002; Beer and Tavazoie, 2004). This notion of synergism was based on the higher coherence of gene expression in the gene ensemble containing both motifs, compared to the PAC-only and RRPE-only ensembles (Supplementary information 2). However, MED analysis, as shown in Figure 4A
Conclusion
We have demonstrated a novel mathematical strategy for deciphering principles of transcription regulation using S. cerevisiae as a model system. We identify four regulatory principles that motifs obey in order to regulate transcription. These principles reveal the complexity of how a motif can exert its influence on gene expression beyond its mere presence, absence, or closeness to the start codon. In addition, we have illustrated an example showing how nature could exploit geometry as another means for regulating transcription, hence increasing the combinatorial power of a relatively small set of motifs. With the emergence of new research paradigms in modern biology, where the process of biological research begins with a system-level theoretical prediction followed by experimental validation (Gilbert, 1991), we believe that MED can play an important role in fostering the development of the biological theory necessary for explaining how regulatory motifs can control transcription. Furthermore, with the technology recently available to allow the high-throughput synthesis of oligomers (Tian et al, 2004), we foresee a new research direction aiming at engineering new and improved biological systems with desired properties. To this end, we believe that MED can be a valuable tool for such bioengineering processes by providing the necessary knowledge and parameters regarding motif behavior.
Materials and methods
The MED method
MED is composed of two main steps. In the first step, each individual gene is analyzed for the strength of each motif in its promoter without taking into account any information about the motif's context. The way in which such motif strength is derived in MED is based on Jacob and Monod's model of transcription (Jacob and Monod, 1961), whereby the log ratio expression level of a gene is a function of the motif set present in its promoter and the regulators' activities in the cellular environment (see equation (1)).
The outcome of this step consists of two matrices: a matrix of motif strength, where each element represents the condition-independent strength of each motif in each gene promoter; and a matrix of regulator activity, where each element represents the global proxy activity of each regulator under a particular environmental condition. In the second step (the regulatory rule deduction step), regulatory principles are derived from the matrix of motif strength using the gene ensemble concept as illustrated in Figure 1
Step 1: Derivation of motif strength
For a given gene $g$, let $\Omega_g$ be the set of motifs occurring in its promoter; then its log ratio expression level $E_{gc}$ in a specific environmental condition $c$ can be approximated as follows:

$$E_{gc} \approx \sum_{j \in \Omega_g} M_{gj} A_{jc} \qquad (1)$$

where $M_{gj}$ represents the strength of the $j$th motif on the expression level of gene $g$ and $A_{jc}$ represents a global proxy for the regulator activity associated with the $j$th motif under condition $c$. Unlike previous works (Bussemaker et al, 2001; Gao et al, 2004), where the matrix element $M_{gj}$ is a known constant and equal to the number of instances that the $j$th motif occurs in the promoter of gene $g$ or the ChIP log ratio for the $j$th transcription factor binding to the promoter of gene $g$, MED optimizes both $M_{gj}$ and $A_{jc}$ to best fit the expression data. Therefore, for all genes and conditions, equation (1) becomes

$$E \approx MA \qquad (2)$$

where $E$ is an $m$ genes by $n$ conditions expression matrix, $M$ is an $m$ genes by $k$ motifs matrix of condition-independent motif strengths, and $A$ is a $k$ regulators by $n$ conditions matrix of condition-dependent global proxy activity of regulators for $k$ motifs. Note that if a particular $j$th motif does not exist in the promoter of the $i$th gene, then the matrix element $M_{ij}$ of the above matrix $M$ is zero and remains so. The problem posed in equation (2) becomes a matrix decomposition problem. This portion of the MED algorithm consists of the procedure for decomposing the data matrix $E$ into a product of matrices $M$ and $A$ uniquely using the motif–gene relationship as constraints (see proof in Supplementary information 4). The procedure we employed here is based on factor analysis (Anderson, 1984; Gifi, 1990; Paatero et al, 2002; Liao et al, 2003) with Tikhonov regularization (Tikhonov and Arsenin, 1977) imposed on the matrix $M$ to ensure uniqueness. To compute matrices M and A:
In the above algorithm, steps (b) to (d) are sufficient to ensure a unique solution M and A from the expression matrix E (see proof in Supplementary information 4). In equations (4) and (5), the second term is critical for producing a unique matrix M regardless of the linear dependency or near linear dependency of the rows of matrix A (see proof in Supplementary information 4). It can also be used to constrain the strength of the $j$th motif in the promoter of gene $g$ to a predefined value $M_{gj}^{*}$ if such a value is known a priori. Although the parameter λ can be chosen using more sophisticated methods (Shock, 1984; Engl and Neubauer, 1985; Guacaneme, 1988; Wahba, 1990), in this work it is chosen in such a way that it does not noticeably affect the test error computed from crossvalidation (Supplementary information 7). We used equation (4) to compute the motif strength, and λ was set to a scalar value of $10^{-4}$, although in general it can be a vector quantity for weighting motifs in different gene promoters differently. The convergence criterion used in this work is the total variance of the residual matrix defined in Supplementary information 7. Note that, as each motif has its own binding strength to regulators and hence its own scale of influence on gene expression, only relative strengths of the same motif across different instances of a gene ensemble are meaningful for comparison purposes. Finally, equation (1) can be extended to include a nonlinear term accounting for motif–motif interactions (Supplementary information 8), and the MED formalism shown above can still be applied transparently.
Step 2: Deduction of regulatory principles
We construct the gene ensemble containing a specific motif set of interest, partition this ensemble into instances based on the specific promoter properties of interest, and calculate the average motif strength and standard error across these instances (Figure 1
Data
We used a combined gene expression data set obtained from environmental stresses (Gasch et al, 2000) and cell cycle (Spellman et al, 1998) with a total of 255 conditions. Ideally, we want to use motifs that are derived directly from the ChIP-chip data without depending on the clustering in the gene expression space (Harbison et al, 2004); however, at the time of this work, such data were not available.
Therefore, we used 62 DNA regulatory motifs, represented as position-specific weight matrices, that were generated using literature (37 motifs) and the multiple sequence alignment program AlignACE (Roth et al, 1998) (25 motifs) as described previously (Roth et al, 1998; Hughes et al, 2000; Pilpel et al, 2001). We used ScanACE (Roth et al, 1998; Hughes et al, 2000; Pilpel et al, 2001) to find motif occurrences in promoter regions up to 1000 bp upstream. The expression data matrix E was centered to remove column and row means.
Crossvalidation
To analyze the performance of MED, we used crossvalidation, in which we partitioned the expression data matrix into 100 blocks, each of which consists of a random 20% of the genes and a random 5% of the conditions (for these genes). For each run, we left out one of these blocks and trained the model on the remaining data. This allowed us to use gene expression data on all 255 conditions (but only across 80% of the genes) in order to compute matrix A in step (b) of the MED algorithm, and likewise, to use information on all the genes (but only across 95% of the conditions) to compute the motif matrix M in step (d) of the MED algorithm. Upon convergence, we then used the resulting matrices M and A to predict gene expression for the block of 20% of genes and 5% of conditions that the model had not been trained on. This process was repeated for each of the 100 blocks, each time predicting expression on the block of data that was left out, to obtain a complete expression matrix in which each element was predicted by this crossvalidation scheme. The result presented in Figure 2
Statistical tests
To further ensure that the type of each motif presented in this work is statistically significant in addition to the degree of gene coexpression, we performed two additional statistical tests: one is the Wilcoxon rank sum test (Wilcoxon, 1945; Lehmann, 1975) and the other is based on 100 random shufflings of complete gene expression profiles.
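As a rough illustration of the shuffling-based test, the sketch below computes an empirical P-value as the fraction of shuffled-data motif strengths that exceed the observed one, together with the Chebyshev upper bound $1/k^2$ for an observation lying $k$ standard deviations from the mean of the shuffled distribution. The function names, the floor of 1/N on the empirical P-value, the use of the sample variance, and the synthetic Gaussian stand-in for strengths from shuffled data are all assumptions made for this example; the article's own procedure is in its Supplementary information 10.

```python
import random


def shuffling_p_value(actual, shuffled):
    """Fraction of shuffled-data strengths larger than the actual strength.

    Assumption: a count of zero is reported as 1/len(shuffled), the
    smallest value this test can resolve (0.01 for 100 shuffles).
    """
    larger = sum(1 for s in shuffled if s > actual)
    return max(larger, 1) / len(shuffled)


def chebyshev_bound(actual, shuffled):
    """Chebyshev upper bound 1/k^2 on P(|X - mean| >= k s.d.).

    k is the distance of `actual` from the mean of `shuffled`,
    measured in sample standard deviations.
    """
    n = len(shuffled)
    mean = sum(shuffled) / n
    variance = sum((s - mean) ** 2 for s in shuffled) / (n - 1)
    squared_gap = (actual - mean) ** 2
    if squared_gap == 0:
        return 1.0  # the bound is uninformative at the mean itself
    return min(1.0, variance / squared_gap)  # equals 1 / k**2


# Synthetic stand-in for motif strengths from 100 shuffled data sets.
random.seed(0)
null_strengths = [random.gauss(0.0, 1.0) for _ in range(100)]
observed = 28.0  # hypothetical observed strength, far from the null

print(shuffling_p_value(observed, null_strengths))  # 0.01
print(chebyshev_bound(observed, null_strengths))
```

For an observation exactly 28 standard deviations from the mean, the bound is $1/28^2 = 1/784 \approx 0.0013$, which is below the 0.01 resolution of the 100-shuffle empirical test.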
In the Wilcoxon test, we determined if the motif strength at the position of extremum is statistically different from the motif strengths elsewhere. For the MCB and RAP1 motifs, we also determined if the strength of a motif oriented along one direction is statistically different from that of the reversed direction. The level of statistical significance in the Wilcoxon rank sum test is measured by the Wilcoxon P-value ($P_{\text{Wilcoxon}}$). In the random shuffling test, we permuted all elements of the expression matrix $E$ 100 times, generating 100 expression matrices $E_i$, $i = 1, \ldots, 100$, for computing the strengths of each motif presented in this work. In this test, we determined if the strength of a motif at a particular promoter position obtained from the actual expression data is statistically more significant than the corresponding one derived from the random shuffling of expression data. The level of statistical significance in this test is measured by the P-value ($P_{\text{shuffling}}$): the fraction of motif strengths obtained from the random shuffling of expression data that are larger than the corresponding one obtained from the actual expression data. Note that, as there are 100 random shuffling runs, the smallest P-value, $P_{\text{shuffling}}$, attainable in this test is 0.01 if no assumption is made about the distribution of motif strengths derived from the random shuffling of expression data. However, as shown in Supplementary information 10 and Supplementary Figure SF6a–c, the P-values for these observed motif strengths can be much smaller than 0.01 owing to Chebyshev's inequality ($P_{\text{Chebyshev}}$) (Abramowitz and Stegun, 1972), as the computed motif strengths of the PAC, RRPE, MCB, and RAP1 motifs at the promoter location of their extremum derived from the actual data are far away from the mean of the distribution of the corresponding ones derived from the random shuffling of expression data (by at least 28 s.d.). We also performed two-way ANOVA on PAC and RRPE motif strengths derived from the PAC/RRPE-containing gene ensemble using MatLab's anovan command (MatLab) with model='full' and the default ss-type parameters. Two factors were specified, PAC distance from start codon, whose P-value is denoted as $P_{\text{ANOVA-PAC}}$, and RRPE distance from start codon, whose P-value is denoted as $P_{\text{ANOVA-RRPE}}$, where each consisted of three levels corresponding to the distance bins in Figure 4
Competing interest statement
The authors declare that they have no competing financial interests.
Supplementary Material
Supplementary Material (885K, pdf)
Supplementary Material (2.9M, zip)
Acknowledgments
We thank Nikos Reppas, Zhou Zhu, Xiaoxia Lin, Dana Pe'er, Saeed Tavazoie, Eric Siggia, and Joel Bader for critical reading of the manuscript. We thank John Aach for critical reading of the manuscript and useful suggestions on statistical tests. We are indebted to George M Church for his guidance and support of this work. Dat H Nguyen acknowledges support from the Alfred P Sloan and US Department of Energy Postdoctoral Fellowship in Computational Molecular Biology and Bioinformatics, and travel fellowships provided by the National Science Foundation Institute for Pure and Applied Mathematics at UCLA. George M Church was supported by US Department of Energy GTL Grant No. DE-FG02 02ER63461. PD was supported by PhRMA/Harvard CEIGI grant, and is currently supported by an LDRD grant at Lawrence Livermore National Laboratory.