Copyright © 2006, Cold Spring Harbor Laboratory Press
Establishing glucose- and ABA-regulated transcription networks in Arabidopsis by microarray analysis and promoter classification using a Relevance Vector Machine
1 Department of Cell and Developmental Biology, John Innes Centre, Norwich NR4 7UH, United Kingdom
2 Computational Biology Department, John Innes Centre, Norwich NR4 7UH, United Kingdom
3 The School of Computing Sciences, University of East Anglia, Norwich NR4 7TJ, United Kingdom
4 Corresponding author. E-mail michael.bevan/at/bbsrc.ac.uk; fax 01603 450025.
Received June 6, 2005; Accepted November 14, 2005.
Abstract
Establishing transcriptional regulatory networks by analysis of gene expression data and promoter sequences shows great promise. We developed a novel promoter classification method using a Relevance Vector Machine (RVM) and Bayesian statistical principles to identify discriminatory features in the promoter sequences of genes that can correctly classify transcriptional responses. The method was applied to microarray data obtained from Arabidopsis seedlings treated with glucose or abscisic acid (ABA). Of those genes showing >2.5-fold changes in expression level, ~70% were correctly predicted as being up- or down-regulated (under 10-fold cross-validation), based on the presence or absence of a small set of discriminative promoter motifs. Many of these motifs have known regulatory functions in sugar- and ABA-mediated gene expression. One promoter motif that was not known to be involved in glucose-responsive gene expression was identified as the strongest classifier of glucose-up-regulated gene expression. We show it confers glucose-responsive gene expression in conjunction with another promoter motif, thus validating the classification method.
We were able to establish a detailed model of glucose and ABA transcriptional regulatory networks and their interactions, which will help us to understand the mechanisms linking metabolism with growth in Arabidopsis. This study shows that machine learning strategies coupled to Bayesian statistical methods hold significant promise for identifying functionally significant promoter sequences. The identification and understanding of transcriptional regulatory networks and their interactions are a major challenge in biology, as transcriptional mechanisms contribute to the regulation of nearly all cellular processes. The time, location, and levels of gene transcripts are known to be specified by combinations of protein interactions with noncoding sequences surrounding genes, and significant progress is being made in defining protein interactions with regulatory motifs on a whole-genome scale. For example, experiments that localize transcription factor binding sites using chromatin immunoprecipitation to the yeast genome sequence have established pathways of gene regulation involving >100 of the 141 known yeast transcription factors (Lee et al. 2002). However, the multitude of transcription factors and the larger genomes of multicellular organisms make direct experimental approaches such as this daunting with current technology. Computational methods that define relationships between gene expression levels and putative regulatory sequences in upstream regions of genes are increasingly used to establish genome-scale transcriptional regulatory networks (Smith et al. 2005). By correlating the frequency of occurrence of known promoter motifs in coregulated genes, it has been possible to relate promoter motifs with known functions to transcriptional pathways in yeast (Bussemaker et al. 2001). 
The clustering of genes that are coregulated during the yeast cell cycle according to their functions and alignment of promoter sequences of clustered genes identified promoter motifs with known regulatory functions and novel motifs with predicted functions (Tavazoie et al. 1999). This strategy was extended into a systematic approach analyzing a wide range of gene expression patterns in yeast and Caenorhabditis elegans with frequentist statistical methods for identifying promoter DNA elements and combinations of elements that optimally predict gene expression patterns. From this, the expression of a significant proportion of genes was accurately predicted according to promoter sequences (Beer and Tavazoie 2004). Regulatory modules have been defined in yeast based on coregulated gene expression patterns, and promoters in a significant number of these modules contained a promoter motif that was a known binding site for a coregulated transcription factor (Segal et al. 2003). Subsequent testing of these predictions defined the functions of several regulatory proteins and established the power of these approaches. We are interested in elucidating the transcriptional regulatory mechanisms integrating carbohydrate availability and hormone action in the plant Arabidopsis thaliana (Arabidopsis). Widespread changes in cell function in response to carbohydrate status, such as reduced protein synthesis and the mobilization of alternative substrates for energy supply in response to carbohydrate starvation, have been predicted based on microarray analysis (Price et al. 2004; Thimm et al. 2004). These experiments also show that the expression of a wide range of genes is regulated by carbohydrates in Arabidopsis and ~25% of the genes represented on the 8K Affymetrix chip also responded to both light and sugar treatments (Thum et al. 2004). 
Many of these genes encode enzymes of primary, secondary, and lipid metabolism, and a codependent interaction between light- and sugar-responsive gene expression was identified. These transcriptional responses were also interconnected with ABA- and ethylene-mediated gene expression and growth responses. Interactions between glucose- and ABA-response pathways have been established by the isolation of the ABA biosynthetic mutant aba2 and the ABA response mutant abi4 in screens for reduced responses of seedlings to high levels of glucose or sucrose (Arenas-Huertero et al. 2000; Huijser et al. 2000; Laby et al. 2000; Rook et al. 2001; Cheng et al. 2002). Learning techniques are used in an increasingly wide variety of biological applications such as microarray analysis (Lavine et al. 2004), protein homology detection (Jaakkola et al. 1999), function prediction based on annotated sequence (Vinayagam et al. 2004), and functional predictions based on transcriptional coexpression (Zhang et al. 2004). Supervised learning methods construct a decision rule from a training set of known positive and negative examples and algorithms such as Support Vector Machines (SVM) (Boser et al. 1992) learn to discriminate between training examples from each category. SVMs have demonstrated both excellent performance in dealing with sparse and noisy data typically generated by biological experimentation and an ability to deal with high-dimensional data in a computationally efficient way (Scholkopf et al. 2004). Recently SVM applications have also been used to discriminate between promoter and nonpromoter regions of human DNA (Gangal and Sharma 2005), and to resolve promoter sequences and the positions of transcription initiation sites in plant DNA (Shahmuradov et al. 2005). Here we describe the use of a Relevance Vector Machine (RVM) (Tipping 2000) to classify gene expression according to the composition of promoter sequences. 
The RVM was used with a Bayesian Automatic Relevance Determination (ARD) (MacKay 1994; Neal 1994) prior to select a small subset of promoter motifs for its discriminatory rule to optimally distinguish between regulated genes. Unlike correlation-based approaches, which consider the significance of individual features, the RVM considers the significance of a feature in the context of the features already selected, which may be useful in considering the effects of combinations of features on gene expression. This approach has been successfully used to find a small number of genes whose expression is diagnostic for certain cancer types (Li et al. 2002). The discriminatory features selected by the RVM classifier included promoter motifs that had known functions in both glucose- and ABA-activated gene expression and revealed that light-responsive promoter motifs were powerful features for classifying promoters controlling glucose down-regulated gene expression. One motif with no established function in glucose-responsive transcriptional responses that was the strongest classifier of glucose up-regulated gene expression was shown experimentally to confer glucose-activated gene expression in stable transgenic lines. The successful application of machine learning algorithms for promoter sequence analysis using Bayesian statistical principles established models of transcriptional pathways regulating glucose- and ABA-mediated gene expression and demonstrated that these methods hold promise for establishing transcriptional regulatory networks in Arabidopsis and other organisms.
Results
Transcript profiling reveals that glucose regulates genes with diverse functions
Affymetrix ATH1 Gene Chips were used to identify glucose- and ABA-regulated genes. Seedlings were grown in liquid culture for 7 d on low sugar concentrations (0.5% glucose) and constant light to abrogate diurnal responses.
Treatments were designed to reveal transitions in gene expression from a sugar-restricted condition to a sugar-replete state. After 7 d of growth, the medium was replaced with glucose-free medium for 24 h, and then glucose or mannitol was added to 3% (w/v). Mannitol, a nontoxic nonmetabolized sugar, was used as an osmotic control in ABA experiments to define the interactions between ABA and 3% glucose. Seedlings that had developed the first pair of true leaves (stage 1.02) (Boyes et al. 2001) were sampled at 0, 2, 4, or 6 h after addition of glucose, mannitol, glucose + ABA, or mannitol + ABA. The time course was selected to detect proximal events, to minimize transcriptional changes due to accelerated growth and development in response to sugars, and to establish the dynamics of glucose- and ABA-mediated gene expression. Scatterplots (Supplemental Fig. 1) show that >99% of the significantly expressed genes (Present) exhibit <2.5-fold variation in signal intensity between two independent chip hybridizations. Up- or down-regulated genes were defined independently for each time point as those with a statistically significant change in treatment/control pairs (Wilcoxon signed-rank test, P < 0.005) (Hubbell et al. 2002; Liu et al. 2002). Genes whose expression ratios of glucose/mannitol and glucose/0 h increased or decreased by >2.5-fold at one or more points in the time course were defined as glucose-inducible genes and glucose-repressible genes, respectively. Genes whose expression ratios of ABA + mannitol/mannitol, ABA + mannitol/0 h, ABA + glucose/glucose, and ABA + glucose/0 h increased or decreased by >2.5-fold at one or more points in the time course were defined as ABA-inducible and ABA-repressible genes, respectively. The 0-h time point was common to all treatments, and time points for the glucose and mannitol treatments were replicated three times and hybridized independently to ATH1 arrays to measure changes over time.
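The fold-change rule described above can be illustrated with a short sketch. This is not the authors' pipeline: the function name `classify_gene`, the input layout, and the requirement that every comparison pass the cutoff at the same time point are assumptions made for illustration, and the Wilcoxon significance filter is assumed to have been applied beforehand.

```python
# Illustrative sketch only; names and data layout are hypothetical.
# Assumes each gene has already passed the significance test described in the text.

def classify_gene(ratios, threshold=2.5):
    """Label a gene from its treatment/control expression ratios.

    ratios: dict mapping a comparison name (e.g. "glucose/mannitol")
            to a list of ratios, one per time point.
    A gene is called inducible if, at one or more time points, every
    comparison exceeds `threshold`; repressible if every comparison is
    below 1/threshold (a >2.5-fold decrease) at one or more time points.
    """
    comparisons = list(ratios.values())
    n_points = len(comparisons[0])
    up = any(all(c[t] > threshold for c in comparisons) for t in range(n_points))
    down = any(all(c[t] < 1.0 / threshold for c in comparisons) for t in range(n_points))
    if up and down:
        return "mixed"
    if up:
        return "inducible"
    if down:
        return "repressible"
    return "unchanged"


example = {"glucose/mannitol": [1.2, 3.0, 2.0], "glucose/0h": [1.1, 2.8, 1.9]}
print(classify_gene(example))
```

Requiring all comparisons to agree at the same time point is one possible reading of the definition; a looser reading would let each comparison pass at a different time point.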
This scheme provided three experimental replicates of glucose treatment at each time point and nine experimental replicates for defining glucose-regulated genes. The ABA treatments provided a minimum of two experimental replicates for defining ABA-regulated genes. Accordingly, 983 genes were expressed by >2.5-fold in response to 3% glucose, 769 genes were expressed at 2.5-fold lower levels in response to 3% glucose (Supplemental Table 1), and 692 and 173 genes were identified as ABA inducible and ABA repressible with >2.5-fold change, respectively (Supplemental Table 2). To confirm the microarray expression profile analyses, semiquantitative RT-PCR analysis was performed on the RNA samples used for array analysis. Fifty genes exhibiting expression changes in response to glucose were selected and tested two times. Results from 15 of the selected genes are shown in Supplemental Figure 2. Gene expression patterns revealed by RT-PCR exhibited similar dynamics to those seen in array analysis, establishing the reliability of the microarray data. We categorized glucose- and ABA-regulated genes according to their putative functions based on Arabidopsis Gene Ontology (GO) annotations in GeneSpring lists, the classification of the Munich Information Centre for Protein Sequencing (MIPS) database, pathway analysis defined by AraCyc (Mueller et al. 2003) and KEGG (Kanehisa 2002), and the literature. Table 1 shows the significance of finding glucose- and ABA-responsive genes in different functional categories calculated using the hypergeometric P-value (Tavazoie et al. 1999). ABA down-regulated genes were not considered because of the small number of genes in this category. The functional clusters enriched for glucose up-regulated genes include metabolic pathways and cellular processes associated with enhanced growth, such as amino acid and nucleotide synthesis, sulfur assimilation, and secondary metabolism. 
Genes involved in protein synthesis were significantly enriched in the glucose up-regulated set, as were protein targeting genes and abiotic stress proteins including chaperonins and heat-shock proteins, demonstrating that glucose-mediated transcriptional regulation mediates a coordinated increase in protein synthesis and processing. Glucose down-regulated genes were enriched in functional categories involved in metabolic responses such as amino acid degradation, gluconeogenesis, and glutaredoxins. The regulation of genes involved in trehalose metabolism was highly significant, consistent with the proposed role of trehalose 6 P levels in regulating carbon assimilation (Schluepmann et al. 2003). Many genes regulating light responses, such as transcription factors, light receptors and signaling proteins were also down-regulated in response to glucose, although in general this diverse functional group was not significantly down-regulated as a whole. The most significant categories of genes regulated by ABA in our conditions included abscisic acid metabolism, secondary metabolism, and carbohydrate degradation pathways. Our quantitative analysis is consistent with recent qualitative microarray analysis showing that glucose treatment regulates a broad range of gene functions (Price et al. 2004; Thimm et al. 2004).
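As a rough illustration of the hypergeometric enrichment calculation used for Table 1, the sketch below computes the probability of seeing at least the observed number of responsive genes in a functional category by chance. The function name and argument names are hypothetical, and the numbers in the usage line are invented for illustration, not taken from the paper.

```python
from math import comb

def hypergeom_enrichment_p(population, in_category, responsive, overlap):
    """Upper-tail hypergeometric P-value.

    population:  total number of genes considered (e.g. genes on the array)
    in_category: genes annotated to the functional category
    responsive:  genes in the regulated set (e.g. glucose up-regulated)
    overlap:     regulated genes that fall in the category
    Returns P(X >= overlap) when `responsive` genes are drawn without
    replacement from the population.
    """
    upper = min(in_category, responsive)
    total = comb(population, responsive)
    tail = sum(
        comb(in_category, i) * comb(population - in_category, responsive - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Invented numbers, for illustration only.
print(hypergeom_enrichment_p(population=1000, in_category=50, responsive=100, overlap=15))
```

Exact integer binomial coefficients are used so that the result does not depend on an external statistics library; for very large gene sets a log-space implementation may be preferable.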
Dynamics of glucose-responsive gene expression
Analysis of gene expression profiles during the first 6 h after addition of glucose or mannitol showed rapid and transient changes in the expression of many genes. A total of 469 genes were maximally expressed at the 2-h time point, and 719 and 628 genes were maximally expressed at the 4-h and 6-h time points, respectively (Fig. 1A).
Quality Threshold (QT) clustering was used to divide glucose up-regulated genes into 10 clusters of 20 or more genes that shared similar expression dynamics (Fig. 1C). Clusters 4 and 9 (Fig. 1C) ... Glucose treatment led to a rapid and progressive increase in the expression of genes involved in protein synthesis, including 32 ribosomal proteins, a putative ribosome recycling factor, and translation initiation and elongation factors, which were predominantly found in clusters 1, 2, and 3 (Fig. 1C).
Glucose- and ABA-responsive gene expression
Previous genetic analyses have shown that sugar- and ABA-mediated growth responses are closely interconnected in plants (Zhou et al. 1998; Rook et al. 2001). Array analysis revealed >14% of the ABA-inducible genes were also induced by glucose, indicating a substantial overlap between glucose- and ABA-regulated gene expression (Supplemental Table 4). Several transcriptional regulators of ABA responses were regulated by glucose. The homeodomain leucine zipper (HD-Zip) proteins in Arabidopsis are involved in ABA regulation (Himmelbach et al. 2002), and expression of ATHB6 is up-regulated by glucose (Supplemental Table 1), suggesting that sugars may participate in ABA signaling by regulating the expression of ABA-response regulators. Ninety-five genes were up-regulated by both glucose and ABA. These genes are involved in stress, defense, and senescence responses, secondary metabolism and cell wall biosynthesis, amino acid metabolism, carbohydrate metabolism, fatty acid and lipid metabolism and transport, transcript regulation, and signal transduction (Fig. 2A).
Thirty-seven genes were identified as glucose- and ABA-corepressed genes, including protein kinases, transcription factors, transporters, and enzymes. Two genes (AT4G36670 and AT1G08930) encoding putative sugar transporters are down-regulated by both glucose and ABA. Two genes encoding 1-aminocyclopropane-1-carboxylate oxidase (AT1G77330) involved in ethylene biosynthesis and putative ethylene-responsive element binding factor (AT5G61590) are repressed by both glucose and ABA, revealing that aspects of ethylene biosynthesis and responses are modulated by both glucose and ABA. Genes regulated by glucose and ABA in opposed ways were also analyzed. Genes involved in ammonium assimilation, such as a putative ammonium transporter (AT1G64780), were glucose inducible and ABA repressible, and lysine-ketoglutarate reductase (AT4G33150) exhibited a decrease of expression level in glucose treatment and an increase of expression level in ABA treatment (Supplemental Table 4), suggesting that nitrogen metabolism may provide different compounds for stress and growth responses. Finally, the phosphate transporter gene ATPT2 (AT2G38940) is up-regulated by glucose and down-regulated by ABA (Supplemental Table 4), suggesting that the sugar-replete state may promote uptake and utilization of the phosphate required for carbon metabolism and ABA may repress this process. Several examples of the synergistic effects of sugar and ABA on gene expression have been reported. For example, expression of the rice myo-inositol-1-phosphate synthase gene RINO1 was induced by both sucrose and ABA treatments, and the combination of both sucrose and ABA resulted in much higher expression levels (Yoshida et al. 2002). We defined synergistic interactions as those genes expressed at greater than twofold higher levels in response to glucose + ABA treatment compared to the sum of expression levels observed for glucose and ABA + mannitol treatments at two or more points in the time course. 
A set of 12 genes was in this class (Fig. 2B).
Regulatory gene expression
Glucose treatment led to rapid transient increases in the expression of diverse transcription factors including members of the MYB, bZIP, AP2, homeodomain, NAM-like, and heat-shock transcription factor protein families. Expression of MYB75/PAP1/AN2 (Borevitz et al. 2000; Stracke et al. 2001) and the flower pigmentation gene ATAN11 was rapidly induced by glucose (Supplemental Table 1). AN2 and AN11 have been well characterized and encode a MYB-domain transcriptional activator and a WD-repeat protein, respectively (de Vetten et al. 1997; Quattrocchio et al. 1999). In petunia flowers, AN2 and AN11 control flower pigmentation by stimulating the transcription of anthocyanin biosynthetic genes. Overexpression of PAP1 also leads to elevated expression of anthocyanin biosynthetic genes (Borevitz et al. 2000), suggesting that glucose may promote expression of phenylpropanoid biosynthetic genes by elevating expression of these MYB transcription factors and ATAN11. The MYB transcription factor gene ATR1, which activates tryptophan gene expression in Arabidopsis (Bender and Fink 1998), was also up-regulated by glucose, suggesting that glucose may increase expression of tryptophan biosynthetic genes by activating expression of ATR1. Expression of several MADS-box and WRKY-like family members was down-regulated by glucose. The expression of a WRKY class transcription factor (AT5g07100) encoding a protein related to sweet potato SPF1 (Kim et al. 1997) was reduced in response to glucose. SPF1 binds SP8a and SP8b promoter sequences of sporamin and beta-amylase genes expressed in storage roots of sweet potato, and reduced SPF1 mRNA levels induced sporamin and beta-amylase expression (Ishiguro and Nakamura 1994). Our analysis suggests that AT5g07100 may modulate sugar-regulated gene expression in Arabidopsis by a similar mechanism.
Identification and analysis of promoter motifs
Promoter sequences comprising ~1000 bp upstream of the predicted ATG initiation codon of all Arabidopsis genes predicted in the TIGR version 5 annotation (Haas et al. 2005) were assembled. Responsive genes were defined as those showing >2.5-fold changes at the 2-h, 4-h, and 6-h time points in response to glucose or ABA compared to control treatments. The set of 983 glucose up-regulated promoters was compared with 769 glucose down-regulated promoters, and a set of 692 ABA up-regulated promoters was compared to a set of 647 promoters showing no responses to ABA. Matrices of (983 + 769) promoters regulated by glucose, and 381 experimentally defined plant transcriptional regulatory sequences established in the PLACE database (Higo et al. 1999) were assembled for feature extraction. Matrices of (692 + 647) ABA-regulated and nonregulated promoters and PLACE elements were also assembled, and features were extracted from both strands of the promoters. Similar matrices were also made with a set of all 1024 (4^5) possible 5-mers in an unbiased search for promoter motifs. 5-mers were chosen because 4-mers occurred too frequently to provide discriminatory power, while 6-mers may be too selective. These features served as input into a feature space by the RVM to construct classifiers of gene expression based on either PLACE elements or k-mer sequences. These classifiers were tested in a 10-fold cross-validation procedure that partitioned the data into 10 disjoint subsets of approximately equal size. A model was then trained using nine segments as the training data and tested on the unused segment. This procedure was repeated 10 times, each time using a different combination of nine segments to form the training data, such that all 10 segments were used as test data for a different model.
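The feature extraction and cross-validation partitioning described above might be sketched as follows. This is an illustration, not the published implementation: all function names are hypothetical, the use of occurrence counts (rather than presence or absence) is an assumption, and the RVM training step itself is omitted.

```python
from itertools import product
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def all_kmers(k=5):
    """Enumerate every possible k-mer over A, C, G, T (4**k sequences)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]

def kmer_counts(promoter, k=5):
    """Count k-mer occurrences on both strands of one promoter.

    Windows containing characters other than A, C, G, T are skipped.
    Counts are an assumption; a presence/absence encoding is equally plausible.
    """
    index = {kmer: i for i, kmer in enumerate(all_kmers(k))}
    counts = [0] * len(index)
    forward = promoter.upper()
    for strand in (forward, reverse_complement(forward)):
        for start in range(len(strand) - k + 1):
            position = index.get(strand[start:start + k])
            if position is not None:
                counts[position] += 1
    return counts

def ten_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train, test) index lists for disjoint folds of near-equal size."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


print(len(all_kmers()))  # number of 5-mer features
for train, test in ten_fold_splits(983 + 769):
    pass  # a classifier would be trained on `train` and scored on `test` here
```

Each promoter contributes one row of 1024 features, and every promoter appears in exactly one held-out fold, so each is used as test data for exactly one model.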
The average test set performance was reasonably stable after 10 trials; therefore, a 10-fold cross-validation provided a good estimate of model performance. Classification accuracy was displayed in the Receiver-Operator Characteristic (ROC) curves shown in Figure 3, A and B.
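For readers unfamiliar with ROC curves, the sketch below shows how such a curve and its area can be computed from classifier scores on held-out promoters. The function names and the toy scores are hypothetical; labels are assumed to be 1 for up-regulated and 0 for down-regulated (or nonresponsive) promoters.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points obtained by
    sweeping a decision threshold from the highest score to the lowest.
    Tied scores are processed together so they yield a single point.
    """
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores for four held-out promoters.
curve = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
print(auc(curve))
```

An area of 0.5 corresponds to chance-level classification, and values closer to 1 indicate better separation of the two classes.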
The hypergeometric probability distribution function was used to assess the enrichment of these motifs in the promoters of genes in various functional categories. Supplemental Table 6 shows that many of the motifs were significantly enriched in the promoters of genes found in functional classes involved in glucose and ABA responses. These relationships were also consistent with the known functions of these promoter motifs in regulating different cellular functions. The TELO motif, the top-ranked classifier of glucose-induced genes, was originally identified in promoters of genes encoding components of the translational machinery (Tremousaygue et al. 1999). Consistent with this, our analysis shows it is significantly enriched in the promoters of protein and nucleotide synthesis genes (Supplemental Table 6). The BS1EGCCR and MYB26PS motifs have been implicated in the regulation of phenylpropanoid biosynthesis genes (Uimari and Strommer 1997; Lacombe et al. 2000), and these were enriched in glucose-regulated carbohydrate metabolism and sulfate-uptake genes. The DRECRTCOREAT motif mediates stress responses (Dubouzet et al. 2003), and this motif was enriched in the promoters of abiotic stress-related genes. The PLACE elements that were top-ranking classifiers of glucose down-regulated gene expression, such as the I-box, the EVENINGAT, MYBST1, and the G-box-related motif, all have established functions in regulating light- and sugar-related gene expression. For example, the G-box-related element LRENPCABE was previously shown to repress gene expression by sugars (Hwang et al. 1998; Lu et al. 1998). The MYBST1 motif, TATCC, is very similar to the known sugar-repression motifs (TATCCA) and OSRAMY3D (TATCCAY) (Hwang et al. 1998; Lu et al. 1998, 2002), suggesting that TATCC is a core of motifs conferring sugar repression. 
Supplemental Table 6 shows these motifs are significantly enriched in the promoters of genes involved in catabolic responses, abiotic stress, and trehalose and jasmonate metabolism. PLACE elements that were strong classifiers of ABA up-regulated promoters (Table 4) were also significantly enriched in classes of genes known to be regulated by ABA, such as stress responses, ABA biosynthesis, carbohydrate breakdown, and phenylpropanoid synthesis (Supplemental Table 6). Many of these PLACE elements have been shown to confer ABA- and stress-responsive gene expression, such as ABRELATERD1, ABREATRD22, MYB1AT, and DRE2COREZMRAB17 (Busk and Pages 1998). Recently these ABRE motifs and the DRE element were also identified as overrepresented sequences in ABA-up-regulated genes (Leonhardt et al. 2004). Ten k-mer motifs were top-ranking classifiers of ABA up-regulated promoters (Table 5). ACGTG, the most significant motif, forms the core of ABRELATERD1, ABREATRD22, and ACGTABREMOTIFA2OSEM; CGTGT is the core of ABREMOTIFAOSOSEM; CGTGG is the core of ABREATRD22; and CGTAC is the core of ABRE3HVA22.
The TELO motif was the best classifier of glucose up-regulated expression. It is required, together with other elements such as the TEF, trap40, and IIa/IIb elements, for high-level expression in actively dividing cells in root meristems (Tremousaygue et al. 1999, 2003; Manevski et al. 2000). Figure 4A
Discussion
Dynamic transcriptional responses to glucose
Glucose and ABA treatments lead to rapid dynamic changes in gene expression in Arabidopsis seedlings. Quantitative analysis of gene function and clustering of gene expression dynamics identified patterns of coregulation of classes of genes that revealed large-scale changes in cell function in response to glucose and ABA. Among the most rapid transient transcriptional responses to glucose was the up-regulation of genes encoding heat-shock and DNAJ-like chaperonin proteins. Genes encoding components of protein synthesis were also rapidly induced, but their expression persisted, suggesting a temporal control of the cellular machinery for protein synthesis that involves rapid initial synthesis of chaperonins for stabilizing newly synthesized proteins and longer-term expression of components involved in protein synthesis. Transcription factors and protein kinase genes were among the most rapidly modulated by glucose. Rapidly up-regulated genes in these classes included those encoding transcription factors regulating biosynthetic pathways such as MYB75/PAP1, ATR1, MYB28, and JAF13. This is consistent with these transcription factors mediating subsequent more persistent expression of many genes encoding enzymes, transporters, and other proteins involved in the reprogramming of biosynthetic and catabolic pathways. This is supported by the identification of cognate transcription-factor-binding sites as strong classifiers of glucose up-regulated expression of these classes of genes (see below). Among the rapidly induced and persistently expressed genes were those functioning in the cell cycle, cell division, DNA replication and recombination, and in growth. These rapid responses, which occur before any significant growth or development, suggest that glucose-mediated transcriptional responses directly orchestrate cell division and growth.
One of the most striking responses to glucose was the rapid and persistent down-regulation of transcription factors regulating light responses and regulators of the circadian clock. Longer-term cellular responses to high sugar include suppression of photogene expression (Jang et al. 1997), and our analysis suggests a mechanism involving the rapid down-regulation of transcription factors conferring light-responsive expression of photogenes. This proposed mechanism is supported by the identification of cognate promoter elements that are strong classifiers of glucose down-regulated expression (see below). How these major changes in gene expression are regulated remains to be elucidated. A large number of genes were coregulated by glucose and ABA, including key regulators of ABA action such as ATHB6 (Himmelbach et al. 2002) and a diverse set of genes involved in signal transduction and transcription, stress responses, and metabolism. Furthermore, several genes involved in ethylene-mediated gene expression were also coregulated by ABA and glucose, identifying regulatory points for three-way interactions between these growth regulators (Yanagisawa et al. 2003; Price et al. 2004).
Regulatory mechanisms
Our application of machine learning methods for promoter classification linked known transcription factors and their cognate binding sites into a model of glucose- and ABA-mediated gene expression and revealed new glucose-mediated transcriptional control mechanisms. The TELO promoter motif was identified by the RVM as the strongest classifier of glucose up-regulated gene expression. It was found in >200 of the 983 glucose-up-regulated genes and was significantly enriched in the promoters of genes encoding components of protein and nucleotide synthesis pathways (Supplemental Table 6).
The TELO motif and the associated TEF motif conferred increased gene expression in response to glucose, thus establishing a new role for this element and validating the feature extraction and classification strategy. The TELO motif, together with the adjacent TEF sequence in the eEF1A promoter, was previously shown to direct high-level expression in rapidly cycling primordia (Tremousaygue et al. 1999). Recently, the TELO motif was shown to be overrepresented in the promoters of genes up-regulated during axillary bud outgrowth in Arabidopsis, such as ribosomal protein and cell cycle genes (Tatematsu et al. 2005). Together these data demonstrate a key role for the TELO motif in regulating the expression of genes in response to growth stimuli such as glucose and decapitation. The MYB26PS and BS1EGCCR motifs, which are enriched in genes involved in carbohydrate metabolism and sulfur uptake (Supplemental Table 6), were previously shown to regulate genes in the phenylpropanoid pathway (Uimari and Strommer 1997; Lacombe et al. 2000). The E2FBNTRNR motif is enriched in protein synthesis genes, consistent with experimental evidence (Chaboute et al. 2000), and the AMMORESIIUDCRNIA1 motif involved in the transcriptional control of the nitrate reductase gene (Loppes and Radoux 2001) was enriched in nucleotide metabolism genes. This model proposes that glucose may either regulate the transcription of genes encoding transcription factors that then activate these classes of genes or promote the activity of transcription factors by post-transcriptional mechanisms. The cycloheximide dependence of glucose up-regulated expression (Price et al. 2004) is consistent with the former mechanism. Several examples of possible regulatory chains (Yu et al. 2003) involved in glucose-down-regulated gene expression were evident from the promoter features described in Tables 2 and 3. Four motifs involved in conferring light regulation (Puente et al.
1996), the I-box core motif, the GATA motif, light regulatory motifs related to the evening element, and a G-box-related element were all top-weighted classifiers of glucose-down-regulated gene expression (Table 2). GBF1 binds the G-box and confers light regulation, and the down-regulation of GBF1 in response to glucose suggests that glucose down-regulates light-responsive gene expression by reducing expression of GBF1 (Supplemental Table 1). Glucose down-regulates the expression of GATA4 (Supplemental Table 1), which encodes a GATA transcription factor. This binds the sequences GGATA and GATAA (Puente et al. 1996), the top-weighted k-mer motifs for classifying glucose-down-regulated expression, and establishes another putative regulatory chain. Glucose also down-regulates the expression of AT1G19000 (Supplemental Table 1) encoding a single-repeat MYB protein related to MYBST1. This transcription factor binds to the GGATA motif and I-box-related sequences (Lu et al. 2002), which are also top-weighted classifiers of glucose down-regulated expression. This suggests another transcriptional regulatory chain contributing to glucose-mediated transcriptional repression of light-regulated genes. Expression of genes encoding the trihelix proteins GT1 and GT2, which confer light activation (Lam 1995), was also reduced by glucose treatment (Supplemental Table 1), but their cognate GT promoter elements were not selected as classifiers by the RVM. This analysis provides potential mechanisms linking glucose- and light-mediated gene expression suggested by earlier analyses (Thum et al. 2004). The promoter of the Amy3D α-amylase gene contains a TATCCA- and a G-box-related motif required for repression by sugars or induction by sugar starvation (Hwang et al. 1998; Lu et al. 1998, 2002; Toyofuku et al. 1998). Three rice MYB proteins (OsMybS1, OsMybS2, and OsMybS3) bind to the TATCCA element and mediate these sugar responses.
The expression of two Arabidopsis genes (AT1G19000 and AT5G47390) encoding MYB proteins with high overall similarity to OsMybS2 and OsMybS3 is glucose repressible (Supplemental Table 1), and the TATCCA-related motif (TATCC) is a strong classifier of glucose down-regulated gene expression. This suggests a third regulatory chain in which these Arabidopsis MYB proteins mediate glucose down-regulated transcription through the TATCC element. Several cis-acting promoter elements confer ABA-responsive gene expression. These include the ABA-responsive element (ABRE) (Marcotte Jr. et al. 1989), coupling elements (Shen et al. 1996), and recognition sites for MYB and MYC classes of transcription factors (Iwasaki et al. 1995; Abe et al. 1997). Our RVM analyses of ABA-responsive promoters identified ABRE-like motifs, recognition sequences for the ATMYB2 transcription factor, a G-box-related motif and DRE-related motifs as top-weighted classifiers of ABA-induced genes. These motifs were enriched in the promoters of genes encoding proteins involved in stress responses, secondary metabolism, and hormone metabolism (Table 4; Supplemental Table 6). Our RVM classification is consistent with recently reported analysis of motif frequencies in ABA-regulated genes, which identified ABRE and DRE motifs as overrepresented (Leonhardt et al. 2004). The expression of genes encoding ABF3, DREB1A, DREB1B, DREB1C, and DREB2A transcription factors, which mediate ABA-responsive gene expression through ABRE- and DRE-related motifs, respectively, was induced by ABA, suggesting a regulator chain model in which these transcription factors mediate ABA responsiveness through the motifs identified as strong classifiers of ABA-regulated expression. Similarly, expression of ATMYB2 is up-regulated by ABA (Supplemental Table 2). It has been shown to function as a transcriptional activator in ABA-inducible gene expression under drought stress in plants (Abe et al. 
2003) and its recognition motif (WAACCA) was a strong classifier of ABA-up-regulated promoters (Table 4). The DRE-related motif (ACCGAC) conferred glucose-, ABA-, drought-, high salt-, and cold-responsive gene expression (Busk et al. 1997; Kizis and Pages 2002; Dubouzet et al. 2003). Its cognate transcription factor DREB1A/CBF3 was also transcriptionally up-regulated by both glucose and ABA, suggesting a regulator chain model for glucose and ABA regulation of stress-responsive and other target genes.

Promoter analysis

A variety of approaches have been taken to establish regulatory networks based on whole-genome analysis of gene expression levels. Many of these use frequentist probabilistic methods to identify overrepresented sequence motifs associated with expression profiles (Beer and Tavazoie 2004), which can then be used to infer relationships between motifs and gene expression patterns. Our analysis of promoter sequences uses an RVM classifier to give an estimate of the probability that a gene is up- or down-regulated based on promoter sequence features. The advantage of the RVM (Tipping 2001) with a Bayesian Automatic Relevance Determination (MacKay 1994; Neal 1994) prior is that it selects a small subset of promoter motifs for its discriminatory rule that optimally distinguish between regulated genes. The RVM also has the useful property that no parameters, such as the threshold of significance of a feature, need to be set, since the entire model is generated automatically from the data. It also considers the significance of a feature in the context of the features already selected. This makes the method especially suitable for biological problems with many variables of unknown significance that may influence each other.

The RVM correctly predicted the up- or down-regulation of ~70% of the 1752 promoters in the glucose regulon and 692 promoters in the ABA-up regulon.
This success is similar to that achieved in a recent study (Beer and Tavazoie 2004), which correctly predicted the expression patterns of 73% of 2587 yeast genes in 255 conditions using probabilistic methods. Our analysis also shows that there are other features affecting gene expression that are not captured by PLACE elements or 5-mer sequences within 1 kb of the initiation codon of Arabidopsis genes. These “missing” features probably include combinatorial effects and protein–protein interactions.

The promoter sequences selected by the RVM strategy were validated by demonstrating that the TELO motif, which was the top-weighted classifier of glucose-up-regulated gene expression, conferred glucose-mediated expression in conjunction with the TEF motif. Furthermore, other promoter motifs selected as top-weighted classifiers had established functions in glucose- and ABA-mediated gene regulation. The transcriptional coregulation of transcription factors and promoters containing cognate promoter elements selected by the RVM provides further validation of the classification strategy and permitted regulatory networks to be established.

The sparse feature selection of our RVM provides a computationally efficient way of dealing with the wide range of variables commonly encountered in biology and is suitable for biologists to apply, as the classification rule is built automatically without any statistical assumptions. Bayesian statistical methods such as those we have used also provide more realistic probability models based on these large data sets (Eddy 2004). Our work reveals that these approaches have significant promise in classifying promoter functions according to their sequence and establishing transcriptional regulatory networks.

Methods

Plant material, growth condition, and time course

Arabidopsis thaliana seedlings (ecotype Columbia-0) were grown in liquid culture for 7 d on MS medium containing 0.5% glucose in constant light.
After 7 d of growth, the medium was replaced with glucose-free medium for 24 h, and then seedlings were treated with 3% glucose, 3% mannitol, 3% glucose + 10 μM ABA or 3% mannitol + 10 μM ABA, and sampled at 0, 2, 4, or 6 h after treatment. Three independent sets of cultures grown in 3% glucose and 3% mannitol were sampled for RNA isolation.

RNA preparation, cRNA synthesis, and microarray hybridization

Total RNA was extracted from the treated Arabidopsis seedlings using an RNeasy Plant Mini Kit (Qiagen) according to the kit manual. Affymetrix Gene Chip array expression profiling was carried out at the John Innes Genome Lab (http://www.jicgenomelab.co.uk) according to Affymetrix Expression Analysis Technical Manual II (Affymetrix Manual II; http://www.affymetrix.com/support/technical/manuals.affx). Further information on processing microarray data and clustering is provided in the Supplemental material.

Machine learning methods

The Relevance Vector Machine (RVM) (Tipping 2001) was selected as the most appropriate technique for learning to distinguish between up- and down-regulated genes according to the sequence composition of their promoter regions. A MATLAB implementation of the RVM is available from http://www.relevancevector.com. Assume that our data set, D, is comprised of coregulated genes

Calculating enrichment in functional categories

To ascribe functions to genes represented on the ATH1 chip, Gene Ontology (GO) annotations were integrated within GeneSpring 6.1 (Silicon Genetics, Redwood City, CA) as “GeneLists.” This was achieved by converting the Gene Ontology graph structures as exported from DAG-Edit (GO flat-file format, http://www.geneontology.org/) into a file-system-based data structure, where vertices are represented by directories.
A list of Arabidopsis genes annotated to each GO term was prepared from the TIGR version 5 XML files (ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/PSEUDOCHROMOSOMES/), and each list was stored in GeneSpring XML format within the appropriate directory. We classified sugar-regulated genes according to their putative functions based on Arabidopsis Gene Ontology (GO) annotations in GeneSpring lists, the classification of the Munich Information Center for Protein Sequences (MIPS) database, pathway analysis defined by AraCyc (Mueller et al. 2003) and KEGG (Kanehisa 2002), and the literature.

We calculated the P-value of the enrichment of regulated genes and promoter elements in functional categories using the hypergeometric cumulative distribution function (Tavazoie et al. 1999). Values were expressed as -log10 of P, where P is the probability that at least x genes in a category of size k were regulated. k was determined from gene annotations as described above. The total number of genes on the array (M) was 21,000, and the numbers of regulated genes were N = 983 (glucose up-regulated), N = 769 (glucose down-regulated), and N = 692 (ABA up-regulated). The Bonferroni correction was used to establish the significance of multiple comparisons of functional categories. Functional categories containing fewer than five genes were not considered for statistical reasons, and larger and heterogeneous functional groups were also not included in the analysis.

Construction of synthetic promoter motifs, Arabidopsis transformation, and β-glucuronidase (GUS) assays

Promoter motifs were synthesized, annealed into double-stranded DNA oligomers, cloned into a minimal promoter-reporter cassette, and transformed into Arabidopsis as described in the Supplemental material. Transformants were selected and assayed as described in the Supplemental material.
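The enrichment calculation described under "Calculating enrichment in functional categories" can be illustrated with a minimal Python sketch. This is not the authors' implementation. It computes the upper-tail hypergeometric probability that at least x of the N regulated genes fall in a category of size k on an array of M genes, reports it as -log10 of P, and applies a Bonferroni threshold. The function names, the example counts (k = 120, x = 20), the number of categories tested, and the 0.05 family-wise level are assumptions made for illustration; the text does not state them.

```python
from math import comb, log10


def _tail_count(M, N, k, x):
    """Number of ways to draw N genes from M so that >= x lie in the category."""
    hi = min(k, N)  # the overlap cannot exceed either set size
    return sum(comb(k, i) * comb(M - k, N - i) for i in range(max(x, 0), hi + 1))


def enrichment_p(M, N, k, x):
    """Upper-tail hypergeometric probability P(X >= x)."""
    return _tail_count(M, N, k, x) / comb(M, N)


def neg_log10_p(M, N, k, x):
    """-log10 of P, computed from exact integers to avoid float underflow."""
    tail = _tail_count(M, N, k, x)
    if tail == 0:
        return float("inf")
    return log10(comb(M, N)) - log10(tail)


def bonferroni_significant(p, n_categories, alpha=0.05):
    """Bonferroni test; alpha = 0.05 is an assumed family-wise level."""
    return p < alpha / n_categories


if __name__ == "__main__":
    M, N = 21000, 983   # array size and glucose up-regulated genes (from the text)
    k, x = 120, 20      # hypothetical category size and observed overlap
    p = enrichment_p(M, N, k, x)
    print("-log10(P) =", round(neg_log10_p(M, N, k, x), 2))
    print("significant over 200 hypothetical categories:",
          bonferroni_significant(p, 200))
```

Exact integer binomial coefficients are used here so that the sketch needs only the standard library; for routine use a statistics library's hypergeometric survival function would be the more usual choice.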
Acknowledgments

We thank Georg Harberer and Klaus Mayer (MIPS, GSF, Munich) for an initial version of the promoter database, James Hadfield of the John Innes Genome Laboratory for advice on RNA isolation and Affymetrix array processing, and members of the Bevan group for advice. This work was supported by BBSRC Exploiting Genomics Grants EGM16126 and EGM16128 to M.W.B. and G.C., respectively, and EC grant QLRT-1999-00351 (PlaNET) to M.W.B.

Notes

[Supplemental material is available online at www.genome.org. The microarray data from this study have been submitted to ArrayExpress under accession no. E-MEXP-475.] Article published online ahead of print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4237406.

References