Copyright © 2007, Cold Spring Harbor Laboratory Press

Identification of muscle-specific regulatory modules in Caenorhabditis elegans

Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA

1Corresponding author. E-mail stormo/at/genetics.wustl.edu; fax (314) 362-7855.

Received September 25, 2006; Accepted December 12, 2006.

Abstract

Transcriptional regulation is the major regulatory mechanism that controls the spatial and temporal expression of genes during development. This is carried out by transcription factors (TFs), which recognize and bind to their cognate binding sites. Recent studies suggest a modular organization of TF-binding sites, in which clusters of transcription-factor binding sites cooperate in the regulation of downstream gene expression. In this study, we report our computational identification and experimental verification of muscle-specific cis-regulatory modules in Caenorhabditis elegans. We first identified a set of motifs that are correlated with muscle-specific gene expression. We then predicted muscle-specific regulatory modules based on clusters of those motifs with characteristics similar to a collection of well-studied modules in other species. The method correctly identifies 88% of the experimentally characterized modules with a positive predictive value of at least 65%. The prediction accuracy of muscle-specific expression on an independent test set is highly significant (P < 0.0001). We performed in vivo experimental tests of 12 predicted modules, and 10 of those drive muscle-specific gene expression. These results suggest that our method is highly accurate in identifying functional sequences important for muscle-specific gene expression and is a valuable tool for guiding experimental designs.

In metazoans, the gene-regulatory information that directs development is encoded in their genomic DNA sequence. The temporal and spatial expression pattern of genes is controlled by short cis-regulatory elements that act as binding sites for transcription factors. Through interactions with the basal transcription apparatus and other regulatory proteins, transcription factors determine either activation or repression of the target gene at a particular developmental time or within a particular cell or tissue. Therefore, identification of cis-regulatory elements and their binding proteins constitutes an important part of deciphering the role of noncoding sequences. However, the individual binding of a transcription factor to a regulatory element is rarely sufficient to confer context-specific expression. Mounting evidence suggests that complex, cooperative protein–protein interactions between transcription factors are required to determine gene expression patterns (Arnone and Davidson 1997; Kamachi et al. 2000; Li et al. 2000; Remenyi et al. 2004). Therefore, identification of all of the component regulatory elements and understanding how they interact with each other are crucial to fully understanding the transcriptional regulatory network.

Although the number of sequenced genomes is increasing rapidly, our ability to decipher the encoded information lags far behind. For example, Caenorhabditis elegans was the first metazoan organism whose genome was sequenced, yet our understanding of the sequences that control tissue-specific gene expression is still limited. This limited understanding comes mainly from experimental investigation of the regulatory sequences of individual genes, which began almost 20 yr ago (Spieth et al. 1988). In C. elegans, cis-regulation of tissue-specific gene expression is known only for a few genes in some tissues, such as hypodermal cell, excretory cell, vulva, muscle, and neurons (Okkema et al. 1993; Gilleard et al. 1999; Gower et al. 2001; Hwang and Lee 2003; Landmann et al. 2004; Teng et al. 2004; Wang and Chamberlin 2004; Zhao et al. 2005).
Progress is limited because of the complexity of the analysis. It involves dissection of all of the sequences around the gene of interest, which could be >10 kb long, to search for functional sequences. To facilitate the study of tissue-specific gene regulation in C. elegans, we use C. elegans muscle-specific gene expression as an example to explore the feasibility of identifying tissue-specific regulatory sequences through a computational approach. In C. elegans, muscle development has been an extensive area of research for a long time. Transcription factors of the basic helix-loop-helix class (hlh-1, Ce-Twist), the NK-2 class (ceh-22), and the T-box family (tbx-2, mls-1) have been shown to be critical for muscle specification and development (Okkema et al. 1993; Chen et al. 1994; Okkema and Fire 1994; Harfe and Fire 1998; Kostas and Fire 2002; Smith and Mango 2007). The promoter regions of several muscle-specific genes (myo-1, myo-2, myo-3, unc-54, hlh-1, and ace-1) have been studied in detail to identify important DNA regulatory sequences using sequence deletions or mutations (Okkema et al. 1993; Chen et al. 1994; Culetto et al. 1999). However, no general rules about the transcriptional regulatory mechanisms that control gene expression in muscle tissue have been identified. Studies from various organisms have revealed a common theme that transcription factor binding sites tend to be interconnected and function together to confer a particular context-specific expression on the target gene. Those clusters of transcription factor binding sites form a regulatory module that can be located in the upstream, downstream, or intronic sequences and can be moved from their native context and still recapitulate a portion of the native expression pattern independent of their position and orientation to the basal promoter (Arnone and Davidson 1997). Modules have been shown to be very useful in studying temporal and spatial gene expression regulation. 
Modular structure of regulatory elements is widely present in higher eukaryotes (Kirchhamer et al. 1996; Arnone and Davidson 1997) and has been noted in C. elegans (Jantsch-Plunger and Fire 1994). Due to the time-consuming and labor-intensive nature of experimental approaches, many computational tools have been developed recently to facilitate the identification of regulatory modules. However, the predictive value of most of the methods is either unknown or less than satisfactory.

Here we describe a de novo computational method for accurate identification of regulatory sequences that confer muscle-specific gene expression, as well as experimental tests of the predicted modules. Comparisons of the predicted modules with experimentally characterized modules show high sensitivity and positive predictive value (PPV, defined as True Positives/All Predictions). A total of 88% (22/25) of experimentally characterized modules are predicted, and 65% (30/46) of our predicted modules are located within experimentally defined regions. The rest of the predicted modules have not been tested for function, so the PPV could be much higher; it is already much higher than that of currently available algorithms. We developed a scoring system to predict the muscle specificity for any segment of DNA sequence. When applied to the whole genome, this method can help discriminate muscle genes from non-muscle genes.

Because no information about known modules was used for the predictions, we expected the new predictions to have the same sensitivity and PPV. To examine this, we experimentally tested the functionality of 12 predicted modules. Of these 12 modules, three are located within known muscle gene promoters and nine are located in the promoters of genes with unknown expression patterns and unknown functions. Ten of the 12 tested modules drive gene expression in muscle tissue, demonstrating that our method is a valuable tool for guiding experimental design.
Although we focus on muscle-specific gene expression in this work, we expect the method to be generally applicable to many other context-specific module identification tasks, because our method requires no prior knowledge other than a set of likely coexpressed orthologous genes. The C. elegans muscle-specific module prediction tool can be accessed at http://ural.wustl.edu/software.html.

Results

Identification of regulatory motifs

Promoters are commonly defined as the DNA regions located upstream of the transcription start sites that contain the necessary binding elements for proper transcriptional regulation. In C. elegans, 60% of predicted intergenic regions are fully included within a 2-kb upstream segment (Dupuy et al. 2004). The level of similarity between C. elegans and its relative Caenorhabditis briggsae decreases dramatically 1500 bp upstream of the predicted ATG for most genes with a long intergenic region (Dupuy et al. 2004). Even though some regulatory elements can be located in introns and/or 3′UTRs of genes (Okkema et al. 1993; Jantsch-Plunger and Fire 1994), including those regions in our study could make computational identification of DNA motifs more difficult, because noise increases with increasing sequence length (Buhler and Tompa 2002; Wang and Stormo 2003). Therefore, we chose to focus on the upstream −2000 to −1 regions. We used the translation start site (the 0 position) to select the candidate promoter regions because transcriptional start sites have not been determined for most C. elegans genes.

We used the program PhyloCon (Wang and Stormo 2003) for motif identification because comparisons suggest that it outperforms several previous motif-finding programs (Wang and Stormo 2003; MacIsaac et al. 2006). PhyloCon uses position weight matrix-based models (Stormo 2000) to represent ungapped DNA sequence motifs, and conserved motifs identified by this program represent potential regulatory elements. We collected a total of 122 C. elegans genes that are preferentially expressed in muscle tissue (Supplemental Table 1; details given in Methods section), 78 of which have defined C. briggsae orthologs (Supplemental Table 2). PhyloCon was run on the 2-kb upstream sequences of the 78 pairs of muscle genes to predict regulatory motifs, and a total of 18 unique motifs were identified (Table 1).
Muscle specificity of identified motifs

To identify motifs that are enriched in muscle gene promoters, we calculated the Over Representation Index (ORI) (Bajic et al. 2004) for each motif (see Methods) using the rest of the genome as a background gene set. ORI takes into account not only the number of patterns found in sequences, but also the proportion of sequences in which the pattern is found. It reflects how much more probable it is to find a particular motif in the muscle-specific promoter set than in the background set. We define motifs that have an ORI >1.2 as muscle-specific motifs, and they are used later in module score calculations. From our catalog of 18 motifs, eight are designated as muscle-specific. The top four motifs, ranked by ORI (Table 1), are similar to previously identified muscle-specific regulatory motifs (GuhaThakurta et al. 2002, 2004; Ao et al. 2004).

Motif 1 (CTCTCTCTCTC) has almost the same consensus sequence as the binding site of transcription factor TFII-I (currently known as GTF2I) in vertebrates, which binds to 5′-CTCACTCTCT-3′ (Clark et al. 1998). TFII-I family proteins play an important role in regulating muscle gene expression in humans (Polly et al. 2003). However, no C. elegans homolog was identified by BLAST. Motif 3 (CGCCRCCGCCKCC) is similar to the binding site of Drosophila melanogaster transcription factor Adf-1 (CCGCYGCYGYNGCCGV) in the TRANSFAC database (Matys et al. 2003). A homology search identified three genes in C. elegans that have significant similarity to Adf-1 and belong to the same conserved orthologous group (COG). All of them have a MADF domain that directs sequence-specific DNA binding. Motif 6 (WCTTTGM) matches several similar matrices for TCF/LEF family transcription factors, a subfamily of HMG domain proteins that bind to WWCAAWG consensus sequences. It occurs at a similar level in the muscle gene promoters as in the background gene promoters. Therefore, our motif identification step recovered both known muscle-specific motifs and binding sites for common transcription factors.

Identification of muscle-specific regulatory modules in C. elegans promoter sequences

Currently, we do not have a good understanding of how motifs are organized to form modules. Modules may vary in the types of motifs they contain and in the total number and order of binding sites for each type of motif. However, modules usually contain clusters of motifs, and this property has been used in various algorithms to identify regulatory modules (Wagner 1999; Berman et al. 2002; Markstein et al. 2002). In this study, we developed and tested a simple algorithm that is based on motif clustering and takes into account the general properties of well-studied regulatory modules in higher organisms. First, well-studied regulatory modules in various organisms usually consist of two to eight different regulatory motifs (Arnone and Davidson 1997). Therefore, we require that a regulatory module have at least two different motifs. Second, Wasserman and Fickett (1998) collected 18 well-characterized regulatory modules from human muscle genes. Most of the modules have at least two muscle-specific motif sites, which can be the sites of the same motif or of different motifs. Based on this information, we require that a regulatory module have at least two muscle-specific motif sites in order to be a muscle-specific regulatory module. Third, we require the distance between any two adjacent sites within a cluster to be ≤40 bp. Although this choice is somewhat arbitrary, the results are fairly insensitive to several reasonable choices of spacing between motifs (see Discussion).
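The three criteria can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Site` record, the function names, and the choice to measure the gap from the end of the cluster to the start of the next site are all hypothetical, since the text only says that adjacent sites must be ≤40 bp apart.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Site:
    """One predicted motif site in a promoter (hypothetical record layout)."""
    start: int              # position of the first base, counted 5' to 3'
    end: int                # position just past the last base
    motif: str              # motif identifier, e.g. "motif1"
    muscle_specific: bool   # True if the motif's ORI is above the 1.2 cutoff


def is_muscle_module(cluster: List[Site]) -> bool:
    """At least two different motifs and at least two muscle-specific sites."""
    distinct_motifs = {site.motif for site in cluster}
    muscle_sites = sum(1 for site in cluster if site.muscle_specific)
    return len(distinct_motifs) >= 2 and muscle_sites >= 2


def find_modules(sites: List[Site], max_gap: int = 40) -> List[List[Site]]:
    """Chain sites whose spacing is <= max_gap, then filter by the criteria."""
    clusters: List[List[Site]] = []
    current: List[Site] = []
    cluster_end = 0
    for site in sorted(sites, key=lambda s: s.start):
        # Assumption: spacing is measured from the end of the cluster so far.
        if current and site.start - cluster_end > max_gap:
            clusters.append(current)
            current = []
        current.append(site)
        cluster_end = site.end if len(current) == 1 else max(cluster_end, site.end)
    if current:
        clusters.append(current)
    return [cluster for cluster in clusters if is_muscle_module(cluster)]
```

A cluster made only of repeated copies of one motif is rejected by the first criterion even when every site is muscle-specific, which is why the two tests are kept separate in `is_muscle_module`.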
In summary, our definition of a muscle-specific regulatory module is a fragment of sequence that consists of clusters of motifs with intersite spaces ≤40 bp, and in which there are at least two different motifs and at least two muscle-specific binding sites (for details of the algorithm, see Methods). Because some genes have alternative promoters, there are 138 different muscle gene promoters for the 122 muscle-specific genes. We applied this method to the 138 muscle gene promoters and identified 373 modules, an average of 2.7 modules per gene. The size of the modules ranges from 28 to 516 bp with a mean of 144 bp. Kirchhamer et al. (1996) collected 68 experimentally defined modules from Drosophila and mouse. Their size ranges from 40 bp to 8 kb, but they noted that the listed size was the length of DNA fragments used in gene transfer experiments and the actual size of the modules could be much smaller. The number of motifs in our predicted modules ranges from two to 12 with a mean of six. Well-studied modules have two to eight motifs with a mean of five (Arnone and Davidson 1997). Thus, our predicted modules share some general features with those well-studied modules.

Verification of regulatory modules

To evaluate the accuracy of the predicted modules, we identified a total of 27 experimentally characterized modules in 16 gene promoters (Table 2). Of those 27 modules, one is located >2 kb upstream of the translation start site, outside the range of our predictions. Two of the modules overlap by >70% of their length (−370 to −686, −458 to −764 in gene T18D3.4) and it has not been tested whether the minimal overlapping region is sufficient for functionality, so they are treated as one module (−370 to −764) when calculating sensitivity and PPV. Therefore, there are a total of 25 experimentally characterized modules located in the regions we studied.
A comparison of our predicted modules to those experimentally characterized modules shows that they match closely. For example, T18D3.4 encodes Myo-2, a pharyngeal-specific myosin heavy chain. The −17 to −239 region is defined as the minimal promoter that can drive reporter gene expression in pharyngeal muscles, while two overlapping 0.3-kb fragments (−370 to −686 and −458 to −764) are sufficient for pharyngeal muscle-specific enhancer activity (Okkema et al. 1993). We predicted three modules in the T18D3.4 2-kb upstream sequences that are located at −60 to −263, −430 to −515, and −562 to −733 upstream of the ATG start codon. Therefore, all three predicted modules are located within the experimentally defined regions (Fig. 1).
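The comparison for a single promoter can be sketched as follows, using the T18D3.4 coordinates quoted above. The scoring rule is an assumption: the text does not state how a predicted module is judged to match an experimental one, so this sketch counts any overlap as a match, and the function names are illustrative only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (more upstream coordinate, less upstream coordinate)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def sensitivity_and_ppv(predicted: List[Interval],
                        known: List[Interval]) -> Tuple[float, float]:
    """Sensitivity = known modules hit / all known modules;
    PPV = predictions hitting a known module / all predictions."""
    known_hit = sum(1 for k in known if any(overlaps(k, p) for p in predicted))
    pred_hit = sum(1 for p in predicted if any(overlaps(p, k) for k in known))
    return known_hit / len(known), pred_hit / len(predicted)


# T18D3.4 coordinates from the text, relative to the ATG.
# The two overlapping enhancer fragments are merged into -764..-370.
known_t18d3_4 = [(-239, -17), (-764, -370)]
predicted_t18d3_4 = [(-263, -60), (-515, -430), (-733, -562)]

sens, ppv = sensitivity_and_ppv(predicted_t18d3_4, known_t18d3_4)
```

Pooled over all 16 promoters, the same two ratios give the figures reported in the text: 22/25 = 88% sensitivity and 30/46 ≈ 65% PPV.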
We performed simulations to estimate the statistical significance of obtaining the same sensitivity and PPV, given the promoter sequences and the known regulatory modules. We simulate the distribution of predicted modules in the promoters by randomly picking a start position for each module. The length and number of modules in each gene are kept the same as for the predicted modules in that gene. The simulation is repeated 100,000 times and the sensitivity and PPV are calculated for each one. The average sensitivity is 48.8% with standard deviation of 7.8. The average PPV is 35.5% with standard deviation of 5.5. Therefore, the P-values of getting 88% sensitivity and 65% PPV are both much less than 0.001.

Detection of muscle genes on a genome scale

Another test of the accuracy of our module definitions is to use them to predict additional muscle genes. We developed a scoring system to measure the muscle specificity for each module using only the muscle-specific motif sites (see Methods). We expect that the higher the score, the more likely it is to be a muscle-specific module. By ranking all promoters by their scores we should be able to enrich for muscle genes. One difficulty of this assessment is that the expression pattern for most C. elegans genes is unknown. WormBase contains information about the tissue-expression pattern of 2576 genes. There are undoubtedly some omissions in these annotations, where some genes are expressed in tissues besides those listed, but it is likely to be largely correct and is the best data available for this assessment. Of these 2576 genes, 1562 are either ubiquitously expressed or expressed in tissues other than muscle. We use these 1562 genes as the negative set. The set of well-characterized muscle genes that were not included in the training set, because we could not identify orthologs in C. briggsae, was used as a test set. We present the results using a Receiver Operating Characteristic (ROC) curve (Fig. 2).
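For readers unfamiliar with ROC analysis, the following sketch shows how such a curve can be built from promoter scores for a positive (test) set and a negative set. It is an illustration only: the function names and the example scores are hypothetical, and ties between scores are not handled specially (tied positives are simply counted before tied negatives).

```python
from typing import List, Tuple


def roc_points(pos_scores: List[float],
               neg_scores: List[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs obtained by
    walking down the list of genes ranked from highest to lowest score."""
    labeled = [(s, True) for s in pos_scores] + [(s, False) for s in neg_scores]
    labeled.sort(key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in labeled:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / len(neg_scores), tp / len(pos_scores)))
    return points


def area_under_curve(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC curve; 0.5 corresponds to random ranking."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up scores for three muscle genes and four non-muscle genes.
example_points = roc_points([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
example_auc = area_under_curve(example_points)
```

A curve that rises steeply near the origin means that the highest-scoring promoters are enriched for muscle genes, which is the property the ranking is meant to show.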
Will prior information help?

Our module predictions did not rely on any knowledge about experimentally defined modules, such as which genes contained them, where they were located, or which motifs they contained. We next examined whether the use of prior information about experimentally defined modules can identify a reduced set of motifs that is indispensable for module identification and can improve predictive performance. First, we tested the performance of module prediction using only muscle-specific motifs. The sensitivity is greatly reduced compared with the prediction made with the full set of motifs. Varying the distance parameter from 20 to 100 bp, the sensitivity ranges from 52% to 72%, while using the full set of motifs gives a sensitivity range from 80% to 96%. The PPV (from 61.8% to 74.3%), in contrast, is comparable to that of the prediction made with the full set of motifs (from 60.5% to 77.4%). Using this motif set to perform genomic predictions does not improve the performance, as determined by the ROC curve of the 44 test set muscle-specific genes (Supplemental Fig. 3). This suggests that some of the non-muscle-specific motifs are important components of muscle-specific modules.

We next searched for a subset of motifs that regains the prediction sensitivity with the same or higher level of PPV. By adding back combinations of one, two, or three non-muscle-specific motifs and using various distance parameters ranging from 20 to 100 bp, we find that there are six cases in which we can obtain both higher sensitivity and higher PPV (Supplemental Table 3). In all cases, motif 6 (WCTTTGM) is included in the motif set. We used the three motif sets that give the highest sensitivity and PPV to perform genomic prediction, and plotted the ROC curve of the 44 test set muscle genes. The results suggest that the predictive performances are all comparable to, or worse than, that of the original set of motifs (Supplemental Fig. 3). Therefore, training on known modules can improve the performance on the training set, but this must be due to overfitting, because it does not improve the genomic predictions in any significant way. These results demonstrate that (1) our method for module identification does not need prior information in order to make high-quality predictions; (2) our method is robust; (3) the initial step of motif prediction and redundant motif elimination effectively identifies motifs that are important for regulating muscle-specific gene expression.

Experimental verification of predicted modules

All of the statistical analyses suggest that our method generated high-quality predictions. To test the predictive value of the method on unknown modules and the usefulness in guiding experimental designs, we performed four different types of experiments.

First, we tested our predictive powers by locating the regulatory regions of three genes that are known to be muscle-specific genes, but whose promoters have not been subjected to comprehensive functional analyses. Our results confirmed that our predictions are correct in all three cases. C02D4.2 (ser-2) has at least three alternative promoters that drive C02D4.2 expression in a set of neurons, as well as pharyngeal cells and head muscles (Tsalik et al. 2003). We predicted three modules in the C02D4.2a 2-kb upstream region (−91 to −382, −1557 to −1716, and −1769 to −1882). We verified the function of the first predicted module by determining that the first 512 bp upstream of the ATG is sufficient to drive gfp expression only in the head muscle cells (data not shown). Similarly, DNA sequences encompassing the first predicted modules of C33G3.1a (dyc-1) and F08B6.2 (gpc-2) both drive reporter gene expression in the corresponding muscle cells (Table 3; data not shown).
Second, we tested whether our predictions help to identify genes expressed in muscle in the genome. We randomly picked eight genes of unknown function and unknown expression pattern from the top-ranking predicted muscle genes (ranked from 1 to 198 in the genomic ranking, Table 3). For each gene we assayed whether the minimal upstream sequences encompassing the first predicted modules could drive gene expression in the muscle tissue. Table 3 shows the list of genes tested, as well as the genomic rank of the genes, the location of the predicted modules, and the observed expression patterns. C01B7.3 and C01B7.1 share a 2.6-kb intergenic sequence. C01B7.3 is a predicted gene with no RNAi phenotype and no hit in a BLASTP search against the genomes of C. briggsae, Caenorhabditis remanei, Anopheles gambiae, D. melanogaster, Rattus norvegicus, Homo sapiens, C. elegans, and Saccharomyces cerevisiae (WormBase, http://www.wormbase.org/). In our experiment, the 553-bp C01B7.3 promoter did not give any expression pattern. Therefore, C01B7.3 is likely to be a falsely annotated gene. Of the remaining seven genes, six are muscle genes, while the minimal promoter region of C10G11.7 drives reporter gene expression exclusively in the neurons (Fig. 3A–L).
Third, we tested the functionality of modules located further upstream by deletion analysis. The first two predicted modules in K10G6.3 are clustered at −378 to −847. A DNA fragment containing this region drives gfp expression mainly in neurons and occasionally in the pharyngeal muscles (Fig. 3N).

Fourth, we tested the enhancer activity of a predicted module. W06H8.6 is a gene with unknown function and unknown expression pattern that has an upstream sequence >7 kb. In the W06H8.6 2-kb promoter sequence, six modules were predicted. The first one is located at −256 to −591 and the first 675 bp upstream of ATG drives reporter gene expression in body wall muscle (Fig. 3J).

In summary, we tested the functionality of 12 predicted modules. Ten of them drive gene expression in muscle tissues and one of them is involved in gene expression in neuronal cells. The remaining one showed no expression and may not even correspond to a true gene. This gives a positive predictive value of 83% (10/12), or 92% (11/12) if we count neuronal regulatory modules as positive. Generally, it takes many similar experiments to dissect the long promoter sequences to identify the functional sequences of a single gene. Several of the genes we tested have very long upstream sequences. For example, the upstream sequences of F45D3.2, W06H8.6, and F27D4.2 are 9, 7, and 11 kb, respectively. These results demonstrate that our method is able both to predict unknown genes that are expressed in muscle cells and to narrow the important functional domains, which contain the essential modules, to much smaller regions.

Discussion

The accurate identification of regulatory modules within a genomic sequence would be very useful for the study of gene regulation. However, identifying modules experimentally is a time-consuming and labor-intensive process. We developed a computational approach to predict muscle-specific cis-regulatory modules in C. elegans and performed experimental evaluations of their accuracy.
Analysis of the in vivo activity of 12 predicted modules, of which 10 showed the predicted activity, demonstrates the utility of our approach. We chose muscle genes for this study because muscle has been a fertile ground for molecular genetics studies with C. elegans for three decades. Most of the work focused on the organization, structure, and function of muscle fibers and muscle cells (Moerman and Fire 1997; Moerman and Williams 2006). Recent work identified two genes that are involved in muscle cell fate specification (Kostas and Fire 2002; Smith and Mango 2007). However, the molecular mechanisms that control muscle cell fate specification and differentiation remain unclear. Here we demonstrate a computational approach that can identify motifs and their combinations into regulatory modules, which is very useful in identifying muscle-expressing genes. We tested eight genes of unknown expression pattern and unknown function, which we predicted to have modules for muscle expression. Six of those modules did, indeed, cause expression in muscle cells, while one drove expression in neurons and another showed no expression pattern. In total, we tested 12 predicted modules, with 10 showing activity in regulating muscle gene expression, which gives a PPV of 83%. Many of those were in segments directly upstream of the gene, consistent with C. elegans regulatory regions being compact. But in two cases, K10G6.3 and F27D4.2, we showed that the immediate upstream region was not sufficient for muscle expression, but that inclusion of a predicted module further upstream was. In another case, W06H8.6, we showed that two predicted modules, one immediately upstream and another more distant one, were each sufficient to drive muscle expression, but with different expression patterns. 
Although this study focused on modules for muscle expression, we did not use any muscle-specific characteristics, and we expect that our method would work equally well for other tissue-specific expression patterns. The approach is quite simple and requires very little prior information, including no initial information about motifs. The input is merely a set of C. elegans genes known to share a particular expression pattern and their orthologs in another Caenorhabditis genome, so that the program PhyloCon could identify significant motifs. We then used the promoters of non-muscle genes to identify which motifs were muscle specific and which were general. The set of motifs was then combined into predicted modules based on characteristics of a few well-characterized modules found in human, mouse, rat, fly, and sea urchin (Arnone and Davidson 1997; Wasserman and Fickett 1998), namely, that there should be at least two different motifs within the module and at least two occurrences of muscle-specific motifs. The one parameter we explored was the spacing between motifs within a module, but we found the results to be quite consistent over ranges from 20 to 75 bp (Supplemental Table 4); longer spacing often predicted entire upstream regions to be a single module, which is not very useful. We do not specify a particular window size for a module, and they can vary considerably in length. We also do not specify a minimum score, although the score, which is based only on the content of the muscle-specific motifs, is useful for ranking the predicted modules, and the results show that the highest-scoring promoters are the most enriched in muscle-specific genes (Fig. 2).

While these results demonstrate the utility of our approach, we are still far from having a precise and completely accurate predictor of muscle expression patterns. Two of the 12 predicted modules we tested were not correct. From the ROC curve (Fig. 2) ...

Methods

Identification of C. elegans muscle genes and orthologs in C. briggsae

In this study, we define muscle-specific genes as those that are only expressed in the muscle tissue or expressed in at most two other tissues. We identified a total of 122 C. elegans muscle-specific genes by searching the WormBase (Chen et al. 2005) expression pattern database (http://www.wormbase.org/) and from previous work (GuhaThakurta et al. 2002). C. briggsae orthologs for 78 of the 122 genes were obtained from WormBase. The C. elegans and C. briggsae chromosomal sequences and the gene structures were downloaded from the WormBase ftp-site (ftp://ftp.wormbase.org/pub/wormbase/genomes/, WS123). These were then used to obtain the −2000 to −1 upstream regions of muscle-specific genes, as well as the upstream regions of all C. elegans genes (22,247).

Identification of putative regulatory motifs and elimination of redundant motifs

The PhyloCon program (Wang and Stormo 2003) was run on the upstream sequences (−2000 to −1) of the 78 pairs of C. elegans and C. briggsae orthologous muscle genes. We took the best matrix from each run of PhyloCon, masked all occurrences of the identified motif in the input file, and repeated until no additional significant motifs were identified. The experiments were performed using various parameters (Wang and Stormo 2003), and motifs identified in all experiments were pooled together. To determine whether any two position weight matrices were similar, we tested whether two motifs overlap significantly in promoter sequences, as determined by a χ2 test on simulated data. If two motifs overlapped significantly, they were considered redundant motifs, and the one with lower information content was removed.

Calculation of over-representation index

Given a weight matrix, the Patser program calculates the probability of observing a sequence with a particular score or greater (Staden 1989; Hertz and Stormo 1999) and determines the default cutoff score based on that P-value.
Therefore, a “site” corresponding to a particular motif (weight matrix) is a subsequence that is identified by the Patser program using the cutoff appropriate for each motif. We adopted the concept of over-representation of a particular pattern in one group of sequences with regard to another group of sequences from Bajic et al. (2004). They define it as:
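The sketch below is a hedged illustration rather than the published formula. It assumes that the index is the product of two ratios, which matches the description given in Results (the index accounts for both the number of motif occurrences and the proportion of sequences that contain the motif): the ratio of site densities and the ratio of the proportions of sequences with at least one site, each taken between the muscle promoter set and the background set. The function name, parameter names, and the example counts are hypothetical.

```python
def over_representation_index(fg_sites: int, fg_total_bp: int,
                              fg_seqs_with_site: int, fg_num_seqs: int,
                              bg_sites: int, bg_total_bp: int,
                              bg_seqs_with_site: int, bg_num_seqs: int) -> float:
    """Assumed form: (density ratio) x (proportion ratio).

    fg_* describe the muscle promoter set, bg_* the background set.
    Raises ZeroDivisionError if the background contains no site at all.
    """
    density_ratio = (fg_sites / fg_total_bp) / (bg_sites / bg_total_bp)
    proportion_ratio = ((fg_seqs_with_site / fg_num_seqs)
                        / (bg_seqs_with_site / bg_num_seqs))
    return density_ratio * proportion_ratio


# Made-up counts, for illustration only.
ori = over_representation_index(
    fg_sites=30, fg_total_bp=10_000, fg_seqs_with_site=20, fg_num_seqs=40,
    bg_sites=100, bg_total_bp=100_000, bg_seqs_with_site=250, bg_num_seqs=1_000,
)
is_muscle_specific = ori > 1.2  # the cutoff stated in Results
```

Under this reading, a motif scores above 1 only when it is both denser and more widely distributed in the muscle promoters than in the background, so a motif concentrated in a few promoters is penalized relative to a plain density ratio.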
Searching for cis-regulatory modules

To search for clusters of motifs, we first identify all of the sites for all of the motifs using Patser. Then, we scan the sequence from the 5′ to the 3′ end, starting from the first site in the sequence. If the next site is less than the cutoff distance away, it is considered to be in the same cluster as the first site. Then, the third site is considered and the distance between it and the second site is calculated. This process continues until a site is encountered that is too far away from the previous site (exceeds the distance cutoff). This cluster of motifs is a putative regulatory module. Then, we check whether this cluster fits the criteria for a muscle-specific module (having at least two types of motifs and two muscle-specific sites). If it fits, it is kept as a muscle-specific regulatory module.

Calculation of module score and promoter score

For a given DNA sequence, the combined probability–proportionality value of multiple motifs is calculated as described (GuhaThakurta et al. 2004). It measures the likelihood that each TF binds at least one of its binding sites in the given sequence. We apply this calculation on each predicted module rather than the whole sequence to calculate the combined probability–proportionality value for each module:
is the probability–proportionality value for motif m in a given module, calculated as described (GuhaThakurta et al. 2004). This treatment is likely oversimplified given the known cooperative binding of transcription factors to promoter elements. However, this does not affect module prediction; it only affects the ranking of genes when we try to discriminate muscle genes from non-muscle genes, and this simplified approach has produced meaningful results. The score for a regulatory module is calculated as the logarithm of the combined probability.
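The scan described under "Searching for cis-regulatory modules" can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Site` record, the function names, and the `max_gap` parameter are hypothetical, and the sites are assumed to have already been located by Patser. Two further assumptions are made here: the distance between sites is measured between their start coordinates, and the module criteria are read as "at least two distinct motifs and at least two sites".

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    position: int  # start coordinate of the site in the sequence (assumed convention)
    motif: str     # identifier of the motif (weight matrix) that matched


def cluster_sites(sites: List[Site], max_gap: int) -> List[List[Site]]:
    """Group sites 5' to 3'; a site joins the current cluster only if it is
    less than max_gap away from the previous site."""
    clusters: List[List[Site]] = []
    current: List[Site] = []
    for site in sorted(sites, key=lambda s: s.position):
        # Close the current cluster once the gap reaches the cutoff.
        if current and site.position - current[-1].position >= max_gap:
            clusters.append(current)
            current = []
        current.append(site)
    if current:
        clusters.append(current)
    return clusters


def is_candidate_module(cluster: List[Site],
                        min_motif_types: int = 2,
                        min_sites: int = 2) -> bool:
    """Assumed reading of the criteria: enough distinct motifs and enough sites."""
    distinct_motifs = {s.motif for s in cluster}
    return len(distinct_motifs) >= min_motif_types and len(cluster) >= min_sites


def find_modules(sites: List[Site], max_gap: int) -> List[List[Site]]:
    """Return the clusters that satisfy the module criteria."""
    return [c for c in cluster_sites(sites, max_gap) if is_candidate_module(c)]
```

With a hypothetical cutoff of 100 bp, sites at positions 10, 50, and 90 from two different motifs would form one candidate module, whereas two nearby sites of the same motif would form a cluster that is discarded by the criteria check.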
Genome-wide searches

We retrieved 2 kb of upstream sequence from each of the 22,247 genes in the C. elegans genome. A muscle-specificity score was calculated for each gene promoter as described above. The promoters were then ranked by score. If a gene has multiple promoters, we take the highest score and ranking for that gene.

Construction of plasmids and GFP expression analysis

To test the predicted modules close to translational start codons, gene-specific primers were used to amplify the corresponding sequences from fosmid DNAs (Geneservice Ltd). PCR products were cloned into a promoterless GFP vector, pLS43 (GuhaThakurta et al. 2004), with nuclear localization signals. Transgenic C. elegans were made as described (Mello et al. 1991) using the collagen gene rol-6 as a coinjection marker. Rolling GFP-expressing progeny were isolated and studied for in vivo GFP expression. To test the enhancer activity of more distant predicted modules, PCR products were cloned into pPD107.94 (Δpes-10 minimal promoter, a gift from Andrew Fire, Stanford University School of Medicine) (Fire et al. 1990). The construct was used to make transgenic animals for the GFP expression study.

Acknowledgments

We thank Ting Wang for assistance with the PhyloCon program and helpful discussions. We also thank Michael L. Nonet, Andrew Fire, and Susan E. Mango for providing reagents used in this work, and Dr. Frank E. Harrell Jr. for helping with statistical analysis of the predictions. This work was supported by NIH grant HG00249, and G.Z. was supported by NIH institutional training grant 5 T32 HG000045-08 and National Institute of General Medical Sciences NRSA service award 1 F32 GM73444-01.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5989907

References