![]() | ![]() |
Formats:
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright : © 2006 Philippakis et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Expression-Guided In Silico Evaluation of Candidate Cis Regulatory Codes for Drosophila Muscle Founder Cells 1 Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, United States of America 2 Harvard/MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, Massachusetts, United States of America 3 Harvard University Graduate Biophysics Program, Harvard Medical School, Boston, Massachusetts, United States of America 4 Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, United States of America Michael B Eisen, Editor University of California Berkeley, United States of America #Contributed equally. * To whom correspondence should be addressed. E-mail: michelson/at/receptor.med.harvard.edu (AMM), Email: mlbulyk/at/receptor.med.harvard.edu (MLB) Received June 23, 2005; Accepted April 5, 2006. This article has been cited by other articles in PMC.Abstract While combinatorial models of transcriptional regulation can be inferred for metazoan systems from a priori biological knowledge, validation requires extensive and time-consuming experimental work. Thus, there is a need for computational methods that can evaluate hypothesized cis regulatory codes before the difficult task of experimental verification is undertaken. We have developed a novel computational framework (termed “CodeFinder”) that integrates transcription factor binding site and gene expression information to evaluate whether a hypothesized transcriptional regulatory model (TRM; i.e., a set of co-regulating transcription factors) is likely to target a given set of co-expressed genes. Our basic approach is to simultaneously predict cis regulatory modules (CRMs) associated with a given gene set and quantify the enrichment for combinatorial subsets of transcription factor binding site motifs comprising the hypothesized TRM within these predicted CRMs. As a model system, we have examined a TRM experimentally demonstrated to drive the expression of two genes in a sub-population of cells in the developing Drosophila mesoderm, the somatic muscle founder cells. This TRM was previously hypothesized to be a general mode of regulation for genes expressed in this cell population. In contrast, the present analyses suggest that a modified form of this cis regulatory code applies to only a subset of founder cell genes, those whose gene expression responds to specific genetic perturbations in a similar manner to the gene on which the original model was based. We have confirmed this hypothesis by experimentally discovering six (out of 12 tested) new CRMs driving expression in the embryonic mesoderm, four of which drive expression in founder cells. Synopsis Although genome sequences and much gene expression data are readily available, the determination of sets of transcription factors regulating particular gene expression patterns remains a problem of fundamental importance. Tissue-specific gene expression in developing animals is regulated through the combinatorial interactions of transcription factors with DNA regulatory elements termed cis regulatory modules (CRMs). Although genetic and biochemical experiments allow the identification of transcription factors and CRMs, those experiments are laborious and time-consuming. Philippakis et al. introduce a new approach (termed “CodeFinder”) for quantifying the enrichment for particular combinations of transcription factor binding site motifs within predicted CRMs associated with a given gene set of interest, identified from gene expression data. The authors' analyses allowed them to discover a specific combination of transcription factor binding site motifs that constitute a core cis regulatory code for expression of a particular subset of genes in muscle founder cells, an embryonic cell population in the developing fruit fly (Drosophila melanogaster) mesoderm, and also led them to the discovery and subsequent experimental validation of novel, tissue-specific CRMs. Importantly, the CodeFinder approach is generally applicable, and thus could be used to support, refute, or refine a known or hypothesized cis regulatory code for any biological system or genome of interest. Introduction A central challenge to determining the structure of genetic regulatory networks is the development of systematic methods for assessing whether a set of transcription factors (TFs) co-regulates a given set of co-expressed genes. Although classical genetics approaches allow the identification of key regulating TFs and the determination of their approximate ordering within the genetic hierarchy, demonstrating that a collection of TFs forms a combinatorial code acting to directly drive gene expression has required laborious experimental identification and perturbation of numerous individual cis regulatory modules (CRMs; [1]). To speed this process, several groups have recently demonstrated that computational approaches can rapidly identify CRMs with considerable accuracy [2–17], especially when performing computational searches with a collection of TFs known a priori to co-regulate. This is perhaps best exemplified by the dramatic progress made by several groups in discovering CRMs for genes expressed during segmentation of the Drosophila melanogaster embryo [2,3,6,14], a system where years of genetic screens have identified the regulating TFs [18]. In most biological systems, however, such a set of co-regulating TFs is either merely hypothesized or entirely unknown. Therefore, in order for these in silico approaches to effectively identify the cis component of regulation in novel biological systems (i.e., discover CRMs), additional computational methods are needed that can identify the trans component of regulation (i.e., the set of co-regulating TFs). To address this question in metazoan systems, we have developed an initial statistical framework for evaluating hypothesized transcriptional regulatory models (TRMs; i.e., sets of TFs that together co-regulate a target gene set through their combinatorial interactions at CRMs). As a model system, we have examined the regulation of a class of Drosophila myoblast genes for which a regulatory model has been previously hypothesized [19,20] and for which extensive transcriptional profiling datasets have been generated [21]. Muscle founder cells (FCs) are a sub-population of mononucleate myoblasts that are specified by the Wingless (Wg), Decapentaplegic (Dpp), and Ras signal transduction cascades acting in combination within the somatic mesoderm [22,23] (these pathways and some of their key regulators are schematized in Figure 1
Guided by this genetic analysis of eve expression, a series of gene expression profiles has been determined for purified embryonic myoblasts by Estrada et al. [21]. In addition to profiling wild-type cells, these investigators performed expression array analyses of myoblasts in which the Wg, Dpp, Ras, and Notch pathways were variably perturbed by 12 informative gain-of-function (gof) and loss-of-function (lof) genetic manipulations (we note that the Notch pathway also functions in the mesoderm to distinguish the cell fates of FCs from those of fusion competent myoblasts [26]). Each of these 12 genetic perturbations was predicted, based on the example of eve, to increase or decrease expression of those genes with localized expression in FCs (henceforth referred to as “FC genes”). These 12 expression arrays were then combined into a single weighted ranking (henceforth referred to as the “composite FC ordering”), which was used to predict additional FC genes. Estrada et al. [21] performed over 200 in situ hybridizations on predicted FC genes from the top of this composite FC ordering, and their experiments (as well as a review of the literature) yielded a list of 159 validated FC genes. In the present work, we utilize the expression data of Estrada et al. [21] to evaluate the roles of dTCF/Mad/Pnt/Twi/Tin as generalized regulators of FC gene expression. A previous computational scan for windows of sequence containing these five TFs successfully identified an additional enhancer for the gene heartbroken (hbr) that drove expression in dorsal FCs and contained matches to these five transcription factor binding site (TFBS) motifs, demonstrating that the example of eve was not unique [20]. However, the generality of the model could not be established by those two examples alone, and we therefore developed a method of quantifying enrichment for these five TFBS motifs in localized windows of non-coding sequences flanking or intronic to FC genes. Importantly, this approach, which we term “CodeFinder,” quantifies the relevance of not only each TF individually, but also of all combinations of the given set of TFs. From this analysis, we hypothesized that the eve TRM is unlikely to apply to all FC genes. Rather, we found that three TFs—Pnt, Twi, and Tin—are likely to regulate a specific subset of FC genes that share characteristic changes in their gene expression profiles in response to the genetic perturbations used by Estrada et al. [21]. Thus, by combining TFBS and gene expression data, our analysis allows a refinement of the initial model such that a subset of the original TFs appears to regulate a subset of FC genes. As a test of this hypothesis, we have empirically validated four candidate FC enhancers that conform to our modified TRM (as well as two additional enhancers driving expression in other domains of the embryonic mesoderm). Results FC Genes Are Enriched for Clusters of dTCF/Mad/Pnt/Twi/Tin Motifs in Their Flanking and Intronic Non-Coding Sequences We first compiled from the literature, experimentally verified binding sites for each of the five TFs dTCF/Mad/Pnt/Twist/Tin (see Protocol S1). Additionally, we obtained a collection of 159 genes validated by in situ hybridization to be FC genes [21]; see Protocol S1). A common approach for determining whether a set of genes is targeted by a collection of TFs is to look for instances of the corresponding TFBS motifs immediately upstream of transcriptional start [27–29]. In preliminary analyses, we determined that the proximal 1–2 kb of flanking sequences upstream of these 159 FC genes were not significantly enriched for the dTCF/Mad/Pnt/Twist/Tin motifs relative to the corresponding regions of randomly selected genes taken as a background set (p > 0.05 after Bonferroni correction for multiple hypothesis testing, see Materials and Methods). Because we did not uncover any clues to the mechanisms underlying transcriptional regulation of FC genes from an analysis of their proximal promoter regions, we sought to develop a framework that could evaluate the over-representation of the dTCF/Mad/Pnt/Twist/Tin motifs in the extended flanking and intronic sequences of these genes. In approaching this problem, we were influenced by the strategy of Mootha et al. [30], who looked at the aggregate behavior of entire gene sets, rather than individual genes, in analyzing gene expression microarray data. In their method, genes were ranked by expression change, and independently defined gene sets were then inspected to see if their positions within this ranking were non-randomly distributed. From that analysis, they were able to observe trends in the aggregate behavior of the gene set that were not significant when looking on a gene-by-gene basis. Our approach borrows from this method, but utilizes a sequence-based (rather than expression-based) method of ranking genes. Here, genes are ordered according to their enrichment for various combinations of TFBS motifs in localized windows of sequence, and a given foreground gene set is then inspected to see if its distribution of ranks within this list is non-random. The goal of this analysis is to uncover effects that are otherwise small in size, but that can be statistically quantified. For each of the 159 validated FC genes, we searched the entire non-coding upstream, downstream and intronic regions with a CRM identification tool named ModuleFinder that was previously developed by our group [31]. This program is one of a number of approaches that scores windows of genomic sequence according to the degree of TFBS clustering and/or evolutionary conservation [3,7,9,13,16,31–33] (our approach most resembles that of Lifanov et al. [13], but extends it by incorporating a measure of evolutionary conservation in addition to binding site clustering). Next, we assigned to each gene the ModuleFinder score of the most significant window adjacent to it (i.e., a ModuleFinder “hit”; see Materials and Methods). Under the hypothesis that dTCF/Mad/Pnt/Twi/Tin are widely acting regulators of FC genes, we anticipated that many of these 159 genes would be enriched for significant ModuleFinder scores in their surrounding non-coding sequence, as compared with a suitable background set (Figure 2
In an initial application, we used the remaining D. melanogaster genes as a background set. As this was being implemented, however, another group observed that regulatory genes such as TFs and kinases typically had more flanking sequence [34]. Our FC gene list was substantially enriched for TFs and signaling proteins [21]; consistent with this, we observed that the median amount of non-coding, non-repetitive sequence flanking these 159 FC genes was approximately 15.2 kb, whereas the remaining D. melanogaster genes had a median of only approximately 3.4 kb of non-coding, non-repetitive flanking sequence (p < 1 × 10−20 by Wilcoxon-Mann-Whitney [WMW] statistic, Figure 2 In order to evaluate the enrichment of the TFBS motifs under consideration in the foreground gene set, we sought a metric that could quantify the degree to which the foreground genes ranked higher than the background genes on the basis of their ModuleFinder scores. The WMW statistic specifically tests this null hypothesis [35]; however, the p-value attached to it reflects both the degree of foreground enrichment and the sample sizes of the foreground and background. Therefore, we used an additional measure of enrichment that is less dependent on the number of foreground and background genes, so that effects between gene sets of different sizes could be compared. For this we utilized the area between the detection rate curves shown in Figure 2 Using criteria in which any combination of the five TFs of interest can contribute to the ModuleFinder score for each gene, we observed enrichment for high-scoring ModuleFinder hits adjacent to FC genes as compared to the length-matched background sequences (p < 0.02 by WMW statistic, area = 0.051; see Figure 2 Differential Response of FC Genes to Gof of Pnt, a Ras-Dependent TF In constructing the composite ordering of newly identified FC genes, Estrada et al. [21] weighted each of the 12 mutant expression profiles according to the degree to which a training set of 33 previously known and validated FC genes responded in the expected fashion (i.e, were up- or down-regulated) within that genetic background. A somewhat surprising result of their analysis was that, although arrays performed on gof genetic backgrounds corresponding to upstream regulators of the Ras pathway (i.e., constitutively activated forms of EGFR, FGFR, Ras, and armadillo+Ras) caused the training set of FC genes to be up-regulated, gof of Pnt—a TF acting downstream of these factors—had little aggregate effect on the same FC gene training set [21] (see Figure 3
Since the model enhancer eve responded strongly to Pnt gof and was known to contain functionally validated TFBSs for Pnt [19], we re-inspected the Pnt expression profile to see how all FC genes responded to Pnt gof. We first ranked all genes according to their up- and down-regulation in the Pnt gof background, and we then looked at the positions of the 159 FC genes within this ranking (see Figure 3 Utilizing the “leading edge” analysis of Subramanian et al. [36], we took as a foreground gene set those genes ranking higher than the point at which the foreground and background detection curves maximally diverged (Figure 3 When we inspected the PLE (Figure 4
CodeFinder Provides a Systematic Examination of TFBS Combinatorics in Reference to a Set of Co-Expressed Genes Given foreground and background gene sets F and B, and a set of transcription factor binding site motifs M, it is desired to provide confirming or refuting evidence for the over-representation of M in F relative to B. Three concerns must be addressed in order to effectively evaluate combinatorial interactions between the TFs considered. First, any given motif set M is unlikely to be necessary or sufficient for regulation of F (i.e., not all genes in F will actually be targeted by M, and there may be motifs other than those of M that contribute to the regulation of F). Thus, the metric must be able to quantify even only slight degrees of foreground enrichment. Second, because the score responsible for ranking the foreground and background genes is a linear sum of scores for the input motifs, if one of the input motifs is not enriched in F, then omitting it from the search should result in a greater degree of left-shifting for F, as it is acting only to increase score variability (noise) in F and B. Hence, it is necessary to inspect subsets of TFBS motifs comprising the TRM. Finally, there is the possibility that a combination of TFBS motifs shows increased enrichment relative to its subsets not because the combination is truly co-regulating, but because the genes in F are being targeted by overlapping subsets of that combination. Hence, a mechanism is needed to distinguish between these possibilities. To address these concerns, we systematically inspected combinations of the five TFs dTCF/Mad/Pnt/Twi/Tin. First, we looked at each of the five TFs individually and observed that only Pnt was significantly enriched in the foreground set (areas reported in Figure 4 Throughout this analysis, the foreground and background sequence sets remained invariant, but their overall ordering changed with each combination of motifs. Thus, it is important to note that combinations involving different numbers of motifs can be compared on equal footing, since a smaller number of relevant motifs can cause the foreground sequences to rank more highly than a larger set involving irrelevant motifs. From this analysis, we conclude that: 1) the TFBS motif Pnt is likely to directly target a substantial fraction of genes in the PLE, and 2) that although the other four factors may each be working with Pnt to participate in the regulation of some genes in this set, it is unlikely that there is a single combination responsible for targeting all of these genes. Finally, one of the genes in the PLE is Yan, an Ets-domain transcriptional repressor [37] (recall that Pnt is also an Ets domain protein, but is known to act as a transcriptional activator [38]). We inspected the possibility that Yan might be responsible for the down-regulation of the Pnt trailing edge (PTE) gene set, possibly in combination with some of the other TFs under consideration. Here, when we looked at the OR combination involving all five TFBS motifs, we saw no statistically significant enrichment (Figure 4 An Expression Cluster of FC Genes Enriched for Pnt AND Twi AND Tin Having utilized the Pnt gof expression profile to identify a gene set likely to be directly targeted by Pnt, we wanted to see if we could utilize the entire collection of expression profiles from Estrada et al. to identify one or more additional gene sets enriched for the TFs comprising the hypothesized TRM (Figure 5
The FC gene set indicated as Cluster 1 (C1) contained eve; moreover, genes in C1 responded to the genetic perturbations of Estrada et al. [21] in a manner most similar to what had been observed in the original analysis of eve expression [19] (Figure 5 A Second Application of CodeFinder to Genes Expressed in Subsets of Cells in the Developing Fly Wing In order to test the generality of our approach, we used our CodeFinder framework to examine a second developmental system of comparable complexity and for which similar data were available (Figure 6
From the supplementary data of Reeves and Posakony, we obtained a list of 44 genes validated by in situ hybridization to be expressed in PNCs [17]. As a computational negative control, we checked that this gene set was not enriched for targets of dTCF OR Mad OR Pnt OR Twi OR Tin (Figure 6 When we looked for enrichment of Ac/Sc and Su(H) motifs in the sequences surrounding PNC genes, we observed strong enrichment for these motifs, especially Ac/Sc (Figure 6 Validation of Novel FC CRMs Enriched for the Pnt, Twi, and Tin Motifs The preceding analyses suggested a TRM comprising Pnt AND Twi AND Tin targets many of the 37 FC genes found in C1. To test this hypothesis, we evaluated the in vivo functions of 12 candidate CRMs selected from the class of all ModuleFinder hits containing matches to the Pnt, Twi, and Tin TFBS motifs and located in the flanking or intronic sequences of the 159 FC genes (these 12 candidate CRMs are graphically depicted in Figure 7
The enhancer for Ndg (Figure 8 Of the eight remaining tested FC CRM candidates, three (the regions labeled as CG31151 in Figure 7 Discussion We have described an easily implemented, controlled approach (termed “CodeFinder”) for evaluating the degree to which a hypothesized transcriptional regulatory code acts to drive the expression of an independently derived gene set. CodeFinder integrates TFBS and expression profiling information by examining the statistical overrepresentation of particular TFBSs (or combinations thereof) in the non-coding sequences of co-expressed genes. This approach provides the beginnings of a general framework that can be applied to higher metazoan genomes, as it considers not only the proximal promoter regions of genes, but also their extended upstream, downstream and intronic regions, while controlling for the difficulty that genes may not have homogeneous amounts of flanking sequence [34]. In addition, our approach does not require that the collection of TFs under consideration be genetically necessary or sufficient, because we allow for the possibilities that some of the hypothesized TFs are not directly contributory to the cis-regulatory code, or that there are unknown, additional TFs that are contributory. This is accomplished by developing a quantitative metric for the degree of enrichment for TFBS motifs among a foreground gene set. Importantly, this metric allows both different combinations of TFs and different gene sets to be compared on equal footing, so that an initially hypothesized TRM or foreground gene set can be refined. We have applied CodeFinder to evaluate the roles of five TFs, dTCF/Mad/Pnt/Twi/Tin, as regulators of gene expression in muscle FCs. When inspecting an aggregate collection of 159 FC genes, we observed significant yet faint enrichment for these TFs. However, restricting to a subset of FC genes defined by up-regulation in a Pnt gof background amplified this enrichment; moreover, we observed that the bulk of this enrichment was due to the single TF Pnt (as might be expected in this genetic perturbation). Similarly, by restricting to a subset of genes defined by having common expression profiles across the genetic perturbations of Estrada et al. [21], we again observed that this foreground enrichment could be amplified. Here, however, we observed that most of the signal was due to the TFBS motif combination Pnt AND Twi AND Tin. Interestingly, C1 and the PLE overlap at 18 genes (Figure 9
The enhancer discovered for an FC gene not in C1 (lbl), as well as the two enhancers discovered driving expression in other mesodermal cells (cardiac and adepithelial), highlight that there are likely to be additional gene sets targeted by some combination of dTCF/Mad/Pnt/Twi/Tin, most likely in addition to other, currently unknown, motifs. For the example of lbl, although it was not observed to be responsive to Pnt gof as determined by microarray t-statistics and fold-changes, we argue that the Ets-domain motifs in its FC enhancer may yet be meaningful as lof of Yan (an Ets domain transcriptional repressor) increases the number of lbl-positive FCs, suggesting that Yan normally represses the lbl FC enhancer in the absence of Ras/MAPK signaling [46]. For FC genes such as lbl that are not in C1 or the PLE, however, we do not observe enrichment for these TFs beyond the genomic background rate. This suggests that the appropriate foreground gene sets and their corresponding TRMs have yet to be defined. What modes of regulation are responsible for targeting these non-C1/PLE FC genes? Certainly, the Ras signal transduction cascade is likely to play a central role, as activation of the Ras/MAPK signaling cascade stimulates generalized FC gene expression [21]. However, only a small fraction of FC genes are up-regulated in a Pnt gof background. We can envision three possibilities for Ras-dependent, but not Pnt-dependent, gene regulation. First, Ras may activate an Ets-domain TF other than Pnt. As the Drosophila genome encodes eight total Ets domain proteins, it is possible that one of these Ets TFs is acting downstream of Ras signaling to regulate FC gene expression. However, we observed negligible enrichment for the Pnt (Ets) motif when examining the class of all FC genes. Thus, it seems unlikely that another Ets-domain TF is regulating non-C1 or non-PLE genes. Second, it is possible that the genes of C1 are directly targeted by Pnt, and that the remaining FC genes are genetically downstream of C1. This model places Pnt at the top of the FC gene regulatory hierarchy. However, Pnt lof mutants are lacking only a subset of embryonic muscles, a phenotype inconsistent with this key role of Pnt in all FCs [38,47]. Third, Ras signaling is known to affect multiple transcriptional pathways and either directly or indirectly may activate a non-Ets domain TF to regulate FC gene expression. In fact, there are 20 TFs in the list of 159 FC genes. One or more of these TFs could be target(s) of Ras/MAPK signaling and regulate FC gene expression in the non-Pnt-responsive FCs. We are now working to expand upon the initial analysis presented here both computationally and experimentally in order to refine our determination of cis regulatory codes specific for FC gene expression. First, the genome sequences of many additional Drosophila species will be available in the near future [48], and our computational tool for evaluating binding site clustering and evolutionary conservation (ModuleFinder) will need to be extended to incorporate these additional genomes. This can be done most carefully through the use of tree-based methods as suggested by Moses et al. [49], and we are utilizing aspects of their approach in the development of a newer, improved computational tool for quantifying binding site clustering and conservation (Warner et al., manuscript in preparation). ModuleFinder was important for the current work only as a method of quantifying binding site enrichment, and an approach that better utilizes evolutionary conservation can be expected to sharpen the results presented here. Second, in order to resolve the cis-regulatory codes driving sub-domains of expression within FCs, two additional data sets are required—higher resolution expression maps for single FC genes, and the DNA binding specificities of additional TFs known to be expressed in FCs. Of the 20 known TFs among the list of 159 confirmed FC genes, most have unknown DNA binding specificities which could be determined using protein binding microarrays [50]. This represents an ideal opportunity to expand upon the analyses presented here and even perhaps find cis-regulatory codes for non-C1/PLE FC genes. Also, it may allow us to determine more subtle effects regarding which TFs target which subsets of FC genes. For example, the enrichment observed in C1 for the Tin motif might actually not be for Tin itself, but rather for one of the other NK homeodomain family members, slouch, ladybird early, and lbl [51,52], that are known FC genes. Thus, the work presented here provides a first step toward determining the mechanisms underlying the regulation of gene expression in FCs. Since the formation of the somatic mesoderm is a complex developmental process requiring input from many signal transduction cascades and tissue-specific TFs, it is an ideal model system for developing an integrated experimental and computational framework that can be applied more generally to identifying cis-regulatory codes in animal genomes. Materials and Methods Promoter analysis. We obtained 1-kb regions flanking the transcriptional start sites of each of the 159 FC genes from the University of California Santa Cruz (UCSC) Genome Browser dm2 assembly (http://genome.ucsc.edu), extending from 800 bp upstream of transcriptional start to 200 bp downstream of transcriptional start (in a second application, we repeated this analysis with 2-kb regions extending from 1800 bp upstream of transcriptional start to 200 bp downstream of transcriptional start). In cases where more than one transcriptional start site was listed, we used the one closest to translational start. We also extracted a corresponding set of 1,590 non-overlapping regions from promoters of non-FC genes. Both foreground and background proximal promoter regions were repeat masked using the repeat masking provided by UCSC genome browser (http://genome.ucsc.edu). Enrichment in the promoters was measured for each motif using the group specificity score of Hughes et al. [27]. For each of the five motifs dTCF, Mad, Pnt, Twi, Tin, we inspected four different versions of the motif: the collection of matches to known binding sites (see Protocol S1), as well as all words matching within 0.5, 1.0, and 1.5 standard deviations of the motif position weight matrix average [27]. No version of any of these motifs was statistically significant using a confidence level of p < 0.05, after applying a Bonferroni correction for multiple hypothesis testing. Genome pre-processing for all ModuleFinder scans. As in the promoter analysis, the D. melanogaster genome was obtained from the UCSC Genome Browser dm2 assembly (http://genome.ucsc.edu). All repetitive regions were masked using the repeat masking provided by UCSC; all exons (as determined by the UCSC refGene annotation) were also masked. For all ModuleFinder scans, we utilized the alignments to D. pseudoobscura (dp2) and D. virilis (droVir1), as provided by the UCSC Genome Browser Multiz alignment of 8 genomes (dm2, droYak1, droAna1, dp2, droMoj1, droVir1, apiMel1, anoGam1). D. melanogaster translational Start and Stop sites were obtained from the UCSC refGene flat files. Because these files contain redundant references to the same gene, all overlapping reading frames were clustered together; the translational Start and Stop of each such clustered gene was defined to be the most distal (i.e., inclusive) of all translational Starts/Stops in that gene cluster. After clustering genes, we defined the “intergenic regions” to be those sequences contained between adjacent gene clusters, and “intronic regions” to be those sequences contained between the translational Start and Stop of gene clusters. We utilize the terms “gene cluster,” “intergenic region,” and “intronic region” throughout this section. For all ModuleFinder scans, we utilized windows ranging between 700 and 300 bp (increment size of 50 bp), and the dp2 and droVir1 alignments. We used a “wiggle room” of 5 bp for considering binding sites as conserved. For the input dTCF and Pnt binding site motifs, we utilized the set of words matching within one standard deviation of the average position weight matrix score of the known binding sites [27]; for the Mad, Twi and Tin motifs we utilized only the collection of known binding sites (this was done because the known binding sites for Twi and Tin do not have much variability, and so it was not necessary to extrapolate them. Mad is an extremely degenerate motif, and we found that using a cutoff similar to that of dTCF and Ets caused too large a fraction of sequence space to be considered a motif). Analysis of flanking gene length and generation of a length-matched background sequence set. For each FC gene we matched it to its corresponding gene cluster. Here, although the original list of FC genes from Estrada et al. contained 160 genes, our list contains only 159 since two genes (CG6682 and CG13789) mapped to the same gene cluster. For each gene cluster we then computed the amount of non-coding, non-repetitive sequence in the two intergenic and intronic regions associated with it; thus, each intergenic region is assigned to the two gene clusters that flank it. Detection rates for the 159 FC genes, as well as all other genes are shown in Figure 2 Because we observed that the 159 FC genes in general had more non-coding, non-repetitive sequence associated with them than other D. melanogaster genes, we generated a length-matched set of background sequences. Here, we wanted to make sure that the distribution of sequence lengths between the foreground and background sets were nearly identical. Since in all following analyses we utilize statistics based on rank-orderings of the foreground and the background gene sets, we sought to make the foreground gene set as uniformly distributed as possible with respect to the background gene set when ranking genes by the length of their associated non-coding, non-repetitive sequences. For this, we first partitioned the D. melanogaster genome into intergenic and intronic regions. We then ordered the regions in each set by length. For the intergenic set of regions, we defined the “foreground regions” to be those regions upstream or downstream of one of the 159 FC genes, and we defined the “non-foreground regions” to be the collection of all other regions (i.e., intergenic regions not upstream or downstream of an FC gene). For each foreground region, we took the seven non-foreground regions occurring directly above and below it in the length-based ranking as background regions (we found that seven was the largest number that could be used and still produce a well-matched background set). In the event that two or more foreground regions did not have 14 background regions ranked between them, we continued to extend above and below them so that the center of this local collection of background regions was the same as the center of their associated foreground regions. Hence, for each foreground region, we were able to associate to it exactly 14 length-matched background regions. We then repeated this matching for the intronic regions to obtain a collection of 14 length-matched background intronic regions for each foreground intronic region. Finally, we concatenated the two intergenic and intronic regions of each foreground gene, as well as the two background intergenic and intronic regions associated with it (note that the background regions that are concatenated need not be adjacent to each other in the D. melanogaster genome, but for simplicity we shall still refer to them as “background genes”). In Figure 2 Detection rate curves and relation to WMW. In Figures 2 [1, |F|+|B|], let δF(i) be the indicator function that takes the value “1” if the i'th-ranked gene is in F and “0” otherwise; similarly, let δB(i) be the corresponding indicator function for B. The foreground and background detection rate curves are then defined by
After scaling the X-axis to have a width of 1, the area between the detection rate curves is given by
Recall that the WMW test statistic is computed according to the formula [35]
Thus, the area between the detection rate curves is simply a graphical representation of the WMW test statistic, scaled so that only effect sizes (but not sample sizes) are evaluated. Also, note that this measurement is closely related to the area under a receiver-operater curve [53]. Finally, we note that in displaying the detection rate curves, if there are multiple genes with the same score (this frequently occurs, for example, in considering AND combinations where many genes get a score of “0”), then these genes are randomly ordered when plotted. When computing the WMW p-value for this enrichment (or in the corresponding area calculation), however, the correction factor for ties is utilized [35]. Clustering of gene expression microarray data. All clustering was performed using Cluster [40], and visualized with Java Treeview [54]. We first performed self-organizing map clustering using (Xdim = 1, Ydim = 13, iterations = 100,000) for genes, and (Xdim = 1, Ydim = 4, iterations = 20,000) for arrays. We median-centered the columns of this output and hierarchically clustered genes and arrays using a similarity metric of “correlation (uncentered),” and “average linkage.” Proneural cluster analysis. We obtained all genes from the Supplementary Data of Reeves and Posakony [17]. We note that their original list of non-SOP contains 22 (rather than 18) genes, eight of which are pairs of genes occurring adjacent to one another in the genome. These pairs are (CG3396:CG3096), (CG8328:CG8333), (CG8361:CG8365) and (CG3796:CG3827). In order to avoid the problem that a ModuleFinder hit occurring between two of these genes could be counted twice, we dropped one element of each pair (CG3096, CG8333, CG8365, CG3827). For the Ac/Sc motif, we used RCAGSTGN, as stated by Reeves and Posakony [17] (note that the final degenerate position was added at the end in order to prevent the motif from being able to overlap itself). Su(H) binding sites are listed in Protocol S1; for all scans with it, we utilized the set of words matching within 1 standard deviation of the motif average [27]. We note that there are additional known Su(H) binding sites from the enhancer of split complex, but these were not utilized to avoid circularity, as many genes from this complex are PNC genes. Validation of predicted enhancers. CRM regions were PCR-amplified in duplicate, sequence verified, subcloned into the pETW-nuclacZ reporter vector and then injected into y w embryos as previously described [19]. At least four independent insertion lines were assessed for each reporter construct. Immunohistochemistry, in situ hybridization, and fluorescent in situ hybridization followed standard protocols [55]. Figure S1: Detection Rate Curves Using as a Foreground Gene Set 152 Genes That Are FC but Not PNC Genes Foreground and background regions were searched using the motif combination Su(H) OR Ac/Sc. (1.0 MB DOC) Click here for additional data file.(1013K, doc) Protocol S1: Supplementary Methods Binding sites and sources for dTCF, Mad, Ets, Twist and Tin, SuH (235 KB DOC) Click here for additional data file.(236K, doc) Table S1: Areas for All Boolean Combinations of the Motifs dTCF/Mad/Pnt/Twi/Tin Using as Foreground Gene Sets C1, PLE, and PTE (25 KB XLS) Click here for additional data file.(26K, xls) Table S2: Information on All Tested Candidate CRMs (51 KB XLS) Click here for additional data file.(52K, xls) Acknowledgments The authors thank S. Choe and M. Halfon for helpful discussions, and M. Berger, M. Halfon, E. Kontopoulos, and R. Maas, for critical reading of the manuscript. We thank Y. Li for assistance with graphically representing ModuleFinder hits. Abbreviations
Footnotes Author contributions. AAP, BWB, SSG, AMM, and MLB conceived and designed the experiments. AAP, BWB, and SSG performed the experiments. AAP, BWB, and SSG analyzed the data. FSH and BE contributed reagents/materials/analysis tools. AAP, BWB, SSG, AMM, and MLB wrote the paper. Competing interests. The authors have declared that no competing interests exist. Funding. This work was funded in part by a PhRMA Foundation Informatics Research Starter Grant (MLB), a William F. Milton Fund Award (MLB), NIH R01 HG02966 (MLB), NIH/NIGMS R01 GM076603 (AMM), and the Howard Hughes Medical Institute (AMM). AAP was supported in part by a National Defense Science and Engineering Graduate Fellowship from the Department of Defense, and an Athinoula Martinos Fellowship from HST. BWB was supported in part by an institutional training grant number (T32-7572) awarded to the Department of Cardiology, Children's Hospital, Boston. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Proc Natl Acad Sci U S A. 2002 Jan 22; 99(2):757-62.
[Proc Natl Acad Sci U S A. 2002]Dev Cell. 2005 Mar; 8(3):413-25.
[Dev Cell. 2005]Genome Biol. 2004; 5(9):R61.
[Genome Biol. 2004]Proc Natl Acad Sci U S A. 2005 Apr 5; 102(14):4960-5.
[Proc Natl Acad Sci U S A. 2005]Cell. 2000 Sep 29; 103(1):63-74.
[Cell. 2000]Genome Res. 2002 Jul; 12(7):1019-28.
[Genome Res. 2002]Curr Opin Genet Dev. 2001 Aug; 11(4):431-9.
[Curr Opin Genet Dev. 2001]Genes Dev. 1998 Dec 15; 12(24):3910-22.
[Genes Dev. 1998]Dev Biol. 2001 Oct 1; 238(1):13-26.
[Dev Biol. 2001]Dev Biol. 2002 Apr 15; 244(2):226-42.
[Dev Biol. 2002]Genome Res. 2002 Jul; 12(7):1019-28.
[Genome Res. 2002]J Mol Biol. 2000 Mar 10; 296(5):1205-14.
[J Mol Biol. 2000]Nat Genet. 2001 Oct; 29(2):153-9.
[Nat Genet. 2001]Nat Genet. 2003 Jul; 34(3):267-73.
[Nat Genet. 2003]Genome Biol. 2004; 5(9):R61.
[Genome Biol. 2004]Nat Genet. 2000 Oct; 26(2):225-8.
[Nat Genet. 2000]Genome Res. 2003 Apr; 13(4):579-88.
[Genome Res. 2003]Proc Natl Acad Sci U S A. 2002 Jul 23; 99(15):9888-93.
[Proc Natl Acad Sci U S A. 2002]Genome Biol. 2004; 5(4):R25.
[Genome Biol. 2004]Cell. 2000 Sep 29; 103(1):63-74.
[Cell. 2000]Proc Natl Acad Sci U S A. 2005 Oct 25; 102(43):15545-50.
[Proc Natl Acad Sci U S A. 2005]Cell. 1995 Jun 16; 81(6):857-66.
[Cell. 1995]Nature. 1994 Aug 4; 370(6488):386-9.
[Nature. 1994]Proc Natl Acad Sci U S A. 1999 Mar 16; 96(6):2907-12.
[Proc Natl Acad Sci U S A. 1999]Proc Natl Acad Sci U S A. 1998 Dec 8; 95(25):14863-8.
[Proc Natl Acad Sci U S A. 1998]Cell. 2000 Sep 29; 103(1):63-74.
[Cell. 2000]Dev Cell. 2005 Mar; 8(3):413-25.
[Dev Cell. 2005]Dev Cell. 2005 Mar; 8(3):413-25.
[Dev Cell. 2005]Genes Dev. 1995 Oct 1; 9(19):2373-83.
[Genes Dev. 1995]Genes Dev. 1998 Feb 1; 12(3):304-15.
[Genes Dev. 1998]Dev Cell. 2005 Mar; 8(3):413-25.
[Dev Cell. 2005]Development. 2005 Aug; 132(15):3333-44.
[Development. 2005]Nature. 2003 Oct 2; 425(6957):512-6.
[Nature. 2003]Nature. 2003 Oct 2; 425(6957):507-12.
[Nature. 2003]Genome Biol. 2004; 5(4):R25.
[Genome Biol. 2004]Cell. 2000 Sep 29; 103(1):63-74.
[Cell. 2000]Mech Dev. 2003 Sep; 120(9):991-1007.
[Mech Dev. 2003]Nature. 1994 Aug 4; 370(6488):386-9.
[Nature. 1994]Development. 1993 Jan; 117(1):163-76.
[Development. 1993]Genome Biol. 2004; 5(12):R98.
[Genome Biol. 2004]Nat Genet. 2004 Dec; 36(12):1331-9.
[Nat Genet. 2004]Development. 1999 Oct; 126(20):4525-35.
[Development. 1999]Development. 1998 Sep; 125(18):3699-708.
[Development. 1998]J Mol Biol. 2000 Mar 10; 296(5):1205-14.
[J Mol Biol. 2000]J Mol Biol. 2000 Mar 10; 296(5):1205-14.
[J Mol Biol. 2000]Proc Natl Acad Sci U S A. 1998 Dec 8; 95(25):14863-8.
[Proc Natl Acad Sci U S A. 1998]Bioinformatics. 2004 Nov 22; 20(17):3246-8.
[Bioinformatics. 2004]Dev Cell. 2005 Mar; 8(3):413-25.
[Dev Cell. 2005]Dev Cell. 2005 Mar; 8(3):413-25.
[Dev Cell. 2005]J Mol Biol. 2000 Mar 10; 296(5):1205-14.
[J Mol Biol. 2000]Cell. 2000 Sep 29; 103(1):63-74.
[Cell. 2000]Science. 2004 Aug 6; 305(5685):846.
[Science. 2004]