Fine-Tuning Enhancer Models to Predict Transcriptional Targets across Multiple Genomes

1 Laboratory of Neurogenetics, Department of Molecular and Developmental Genetics, Vlaams Instituut voor Biotechnologie (VIB), Leuven, Belgium
2 Department of Human Genetics, K.U. Leuven School of Medicine, Leuven, Belgium
3 Service de Conformation des Macromolécules Biologiques et de Bioinformatique, Département de Biologie Moléculaire, Université Libre de Bruxelles, Bruxelles, Belgium

Guillaume Bourque, Academic Editor, Genome Institute of Singapore, Singapore

* To whom correspondence should be addressed. E-mail: stein.aerts/at/med.kuleuven.be (SA); bassem.hassan/at/med.kuleuven.be (BH)

Conceived and designed the experiments: SA. Performed the experiments: Jv SA OS. Analyzed the data: SA. Contributed reagents/materials/analysis tools: Jv OS. Wrote the paper: BH SA.

Received July 19, 2007; Accepted September 2, 2007.

Copyright Aerts et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

Networks of regulatory relations between transcription factors (TF) and their target genes (TG), implemented through TF binding sites (TFBS), are key features of biology. An idealized approach to solving such networks consists of starting from a consensus TFBS or a position weight matrix (PWM) to generate a high accuracy list of candidate TGs for biological validation. Developing and evaluating such approaches remains a formidable challenge in regulatory bioinformatics. We perform a benchmark study on 34 Drosophila TFs to assess existing TFBS and cis-regulatory module (CRM) detection methods, with a strong focus on the use of multiple genomes.
In particular, for CRM modelling we investigate the addition of orthologous sites to a known PWM to construct phyloPWMs, and we assess the added value of phylogenetic footprinting to predict contextual motifs around known TFBSs. For CRM prediction, we compare motif conservation with network-level conservation approaches across multiple genomes. Choosing the optimal training and scoring strategies strongly enhances the performance of TG prediction for more than half of the tested TFs. Finally, we analyse a 35th TF, namely Eyeless, and find a significant overlap between predicted TGs and candidate TGs identified by microarray expression studies. In summary, we identify several ways to optimize TF-specific TG predictions, some of which can be applied to all TFs, and others that can be applied only to particular TFs. The ability to model known TF-TG relations, together with the use of multiple genomes, results in a significant step forward in solving the architecture of gene regulatory networks.

Introduction

The characterization and understanding of gene regulatory interaction networks that rigorously control the execution of genetic programs that make functional cells, tissues, and organisms is a key challenge for post-genome biology. Such regulatory interactions are formed by transcription factors (TFs) and their target genes (TGs) and are implemented via TF DNA-binding sites (TFBS) located in cis-regulatory modules (CRM) of TGs. A CRM is a promoter or enhancer sequence that contains TFBSs for one or more TFs and that controls a specific aspect of the expression pattern of the TG [1]. A consequence of genetic pleiotropy (one gene, multiple functions) is that genes often have several distinct expression patterns regulated by several distinct CRMs per gene. For example, the expression of the atonal TF in Drosophila melanogaster (Dmel) in different tissues is regulated by discrete CRMs, some of which are also autoregulatory [2]–[4].
It is therefore not surprising that comparative genomics and computational CRM predictions [5] suggest large numbers of CRMs per genome, implying very large numbers of regulatory interactions. The vast majority of these interactions remain to be discovered. This complexity means that it will be practically impossible to understand the logic and organization of gene regulatory networks without the application of genome-wide, TF-specific computational TG discovery methods. Although genetic interaction, expression profiling and chromatin binding approaches can provide lists of candidate TGs, they each suffer from disadvantages such as high cost, technical limitations, inability to detect direct TGs, and prohibitive numbers of conditions to test [6]. Therefore, experimental approaches would benefit greatly from being complemented by in silico TG discovery methods. Ideally, computational approaches to mine entire genomes for TFBSs and CRMs would generate highly accurate lists of candidate TGs that can be taken directly to in vivo biological validation. Bioinformatics approaches have been developed to predict TFBSs for TFs that have a known consensus TFBS or position weight matrix (PWM). Unfortunately, this approach results in a 1000-fold excess of false predictions when applied on a genomic scale [7]. In a handful of cases, however, genome-scale scanning for TF-target interactions has been successful, particularly if the binding sites are evolutionarily conserved and if they occur in clusters. These can be either homotypic clusters with multiple TFBSs for one TF as is the case for Drosophila Dorsal [8], Bicoid [9] and Suppressor of Hairless [10], or heterotypic TFBS clusters for several very well characterized TFs [11]–[13]. The methods applied in these studies are often based on Hidden Markov Models (HMM) [14]–[16]. 
They take one motif or a set of motifs - in the form of PWMs - as input and identify confined regions in large DNA sequences that harbor significantly more motif instances than expected by chance or than given by a background model. Alternatively, predicted TFBSs can be filtered by their occurrence in conserved non-coding DNA stretches [17]. There are a few reasons why such methods cannot be generalized. First, PWMs for most TFs are built from few instances and the minimal number of instances for a useful PWM is not known. Second, not all TFs regulate their TGs via homotypic TFBS clusters. Third, for most CRMs that contain heterotypic clusters the cooperating TFs, let alone their PWMs, are unknown. Fourth, several independent studies have found that sequence conservation per se is not sufficient to identify enhancers because many enhancers are functionally conserved without sharing high levels of sequence identity across a long DNA stretch [11], [13], [18]. Based on this latter point, methods have been developed to search for conserved motif clusters across two genomes [13], [19]. However, these methods have not been assessed for their ability to score more than two genomes, nor for their performance on a wide range of TFs. Taken together, these limitations mean that it is currently unclear under what conditions, using which parameters and for which TFs genome-scale TG prediction is feasible. To investigate these issues we perform a benchmark study on genome-scale TG prediction for individual TFs. The benchmark consists of in silico validations on known TGs with identified TFBSs for 34 Drosophila TFs from the FlyReg database [20]. The availability of the full genome sequences of twelve Drosophila species at a range of evolutionary distances from D. melanogaster provides the opportunity to study the evolution of genes [21] and the discovery and annotation of functional elements like protein-coding genes, miRNA genes, and regulatory motifs [22]. 
In our benchmark study, we take advantage of the multiple genomes in several ways. First, we compare the use of Dmel PWMs built from the known binding sites versus phyloPWMs built from orthologous binding sites from 10 other Drosophila species. Next, we exploit all Drosophila genomes to improve the prediction of homotypic TFBS clusters, either by applying motif conservation or network-level conservation filters. For TFs that show a low performance in this approach, we investigate the use of heterotypic enhancer models consisting of de novo discovered motifs by phylogenetic footprinting, also using all Drosophila genomes. Finally, we model the known TGs of a 35th TF not included in the initial assessment, namely the eye determination TF Eyeless (Ey). We find a significant overlap between predicted TGs and a list of candidate targets obtained from a recently published and biologically validated microarray experiment. An important conclusion of this study is that no general rule exists that applies to all, or even most, TFs. However, most TFs benefit greatly from the use of multiple genomes in enhancer scoring. Also, we find that by performing cross-validations, the optimal strategy and parameters can be determined for each TF. By training these parameters and through the extensive use of multiple genomes, combined with Gene Ontology filters or microarray data, we estimate that genome-wide discovery of TGs is feasible for about 50% of the TFs tested. Results Detecting homotypic enhancers using known PWMs while eliminating validation bias The first, and most straightforward, strategy for motif-based TG prediction, is to use an existing consensus site or PWM for the TF under study. The genome-wide discovery of TGs through homotypic cluster prediction [23] was already shown to be feasible for a number of TFs, particularly Dorsal [8], Bicoid [9] and Suppressor of Hairless [10]. Here, we test this strategy for all 34 TFs in our dataset (see Methods and Table S1). 
We have chosen the Hidden Markov Model implementation of Cluster-Buster [15], although other methods that take a PWM as input are available [16], [24], [25]. Through leave-one-out cross-validation (LOOCV), we test whether a 1000 bp ‘positive’ sequence flanking one or more known binding sites can be discriminated from ‘negative sequences’ by the motif cluster score. As negative sequences we use 500 randomly selected proximal promoter sequences. In each run, the 501 sequences are ordered by descending score and the rank of the positive region is recorded (Figure 1
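The ranking step of a single cross-validation run can be sketched as follows. This is a minimal illustration that assumes the motif cluster scores have already been computed by the scoring program; the function names and the tie-handling rule (ties count against the positive region) are hypothetical choices and are not taken from the original analysis.

```python
def rank_of_positive(positive_score, negative_scores):
    """Return the rank (1 = best) of the positive region among all sequences.

    Sequences are ordered by descending score. Ties are resolved against
    the positive region, which is a conservative assumption of this sketch.
    """
    better_or_equal = sum(1 for s in negative_scores if s >= positive_score)
    return better_or_equal + 1


def in_top_fraction(rank, n_sequences, fraction):
    """True if the rank falls within the given top fraction of the ranking."""
    return rank <= int(n_sequences * fraction)


# With 1 positive and 500 negatives (501 sequences), the top 2% is the top 10.
print(in_top_fraction(10, 501, 0.02))  # True
print(in_top_fraction(11, 501, 0.02))  # False
```

Recording one such rank per left-out region gives the cumulative rank distributions from which ROC curves can be drawn.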
Next we examined the performance of unbiased homotypic cluster detection in detail. We find that 37 out of 166 test regions rank in the top 2% (10/501 scored sequences) representing 19 of 34 TFs. To investigate the performances of each individual TF in more detail, we plot ROC curves per TF and calculate AUC scores for each TF separately (Figure 3C = 0.99) and Bicoid (AUC = 0.89) that were previously known to use homotypic clusters. In addition, the homotypic cluster approach to TG prediction works very well for Adh transcription factor 1 (0.95), Tinman (0.93), Trithorax-like (0.92), and Zeste (0.90). We could not find any obvious correlation between the performance and the size of the data set (Figure 3C
Adding orthologous sites can improve the quality of sparse PWMs

Given the availability of multiple Drosophila genomes, an interesting question is whether better PWMs can be built using sequences that are orthologous to the known TFBSs. The issue of phylogenetic conservation of TFBSs is paradoxical because, on the one hand, TFBSs are known to have a high evolutionary turnover rate [32], while on the other hand many have been discovered by virtue of their evolutionary conservation. We asked whether adding conserved sites to the PWM improves the overall performance (see Methods). Conserved sites are defined as aligned sites sharing more than 40%, 70%, 80%, or 90% identity between a given Drosophila species and Dmel. A priori we expected the 70% identity cut-off to perform best because we calculated from all PWMs in the TRANSFAC library that the sites that make up a PWM have 72% identity on average. This cut-off allows one substitution in a 6 bp motif, two in an 8 bp motif, and three in a 10 bp motif. We expected that a less stringent cut-off (e.g., 40%) would allow too many negative sequences in the phyloPWM, and that a more stringent cut-off (e.g., 90%) would not introduce enough variability. To our surprise, we do not observe an increased average performance across all 34 TFs with any of these phyloPWMs (Figure 4A
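The substitution counts quoted for the 70% cut-off follow from simple arithmetic, sketched below. The function names are hypothetical, and the cut-off is treated as inclusive (identity of at least 70%), an assumption made here because three substitutions in a 10 bp motif give exactly 70% identity.

```python
def max_substitutions(motif_length, identity_percent):
    """Largest number of mismatches that keeps identity >= identity_percent.

    Integer arithmetic avoids floating-point rounding at exact boundaries.
    """
    return motif_length * (100 - identity_percent) // 100


def percent_identity(site_a, site_b):
    """Percent identity between two aligned, gap-free sites of equal length."""
    if len(site_a) != len(site_b):
        raise ValueError("sites must have the same length")
    matches = sum(a == b for a, b in zip(site_a, site_b))
    return 100.0 * matches / len(site_a)


for length in (6, 8, 10):
    print(length, max_substitutions(length, 70))
# 6 1
# 8 2
# 10 3
```

Under the same inclusive reading, a 90% cut-off tolerates no substitution in motifs shorter than 10 bp, which is consistent with the expectation that it introduces little variability.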
Different approaches for enhancer scoring using multiple genomes: network-level conservation versus motif conservation

Using multiple genomes in the training step to construct PWMs only sometimes improves detection of test TG enhancers. However, using sequence conservation between Dmel and D. pseudoobscura (Dpse) during enhancer scoring has been suggested as a useful filter [13], [19]. The program STUBBMS [19] also implements a Hidden Markov Model, like the Cluster-Buster program used above, but can score two genomes simultaneously given one or more PWMs as input. When we apply STUBBMS to the combined scoring of Dmel and Dpse regions we find a better performance than using Cluster-Buster on Dmel alone (Figure 3A

We find that using order statistics to integrate information from all 11 species during enhancer scoring for homotypic clusters results in the best performance, with a significant improvement over Dmel alone, or any pairwise combination with Dmel (data shown for Dmel-Dpse). To ensure that the order statistics themselves do not bias the results, we combined the rankings obtained from scrambled PWMs. This did not result in an increased performance compared to scrambled PWMs from Dmel alone (Figure 3A

In a second, more classical approach, we filtered the HMM-based motif cluster predictions based on sequence conservation of the predicted motifs themselves. Using the phastCons scores [37] from the UCSC browser we masked all nucleotides with a conservation score below 0.90 before applying the HMM scoring (thresholds of 0.5 and 0.7 gave similar results; data not shown). The resulting overall performance is better than the STUBBMS performance on two genomes only (purple curve in Figure 3B

Importantly, the ROC curves depict averages across 34 transcription factors.
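To illustrate how per-species rankings might be merged, the sketch below computes a joint order statistic from rank ratios (rank divided by the number of ranked sequences). The recursion used is the standard joint cumulative distribution of uniform order statistics; that the analysis used exactly this form is an assumption here, since the procedure is only referenced ([30]) and not spelled out in the text. The function name is hypothetical.

```python
import math


def q_statistic(rank_ratios):
    """Probability of observing rank ratios at least this good by chance.

    rank_ratios: one value in (0, 1] per species (rank / number of sequences).
    Implements Q = N! * V_N with V_0 = 1 and
    V_k = sum_{i=1..k} (-1)^(i-1) * V_(k-i) / i! * r_(N-k+1)^i,
    where r_1 <= ... <= r_N are the sorted rank ratios.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0
    for k in range(1, n + 1):
        total = 0.0
        for i in range(1, k + 1):
            sign = (-1) ** (i - 1)
            total += sign * v[k - i] / math.factorial(i) * r[n - k] ** i
        v.append(total)
    return math.factorial(n) * v[n]


# A region ranked in the top 10% in one species and last in another:
print(round(q_statistic([0.1, 1.0]), 6))  # 0.19
```

Smaller values indicate regions that rank consistently high across species; sorting regions by this value yields a single combined ranking.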
When looking at the performances of the individual TFs, it became clear that some factors benefit more from network-level conservation (e.g., tin, kr, twi, zen) and others are better suited for motif conservation applications (e.g., abd-A, Ubx, HLHm5, cad, gt). Again, we could not find any obvious correlation between the methods and the size of the starting data sets, neither in terms of target genes nor in terms of binding sites per TF. The TF-wise performances for each method, both for the Dmel PWM and the phyloPWM are presented in Figure 3C Learning contextual motifs to construct heterotypic enhancer models Using multiple genomes resulted in improved performance in homotypic cluster prediction for many TFs. However, it is known that many TFs operate in cooperation with other TFs that bind to different binding sites in the neighborhood, which is often a relatively small sequence window (e.g., 1 kb). To take the existence of such heterotypic motif clusters into account in our dataset of 166 enhancers, we added a pattern discovery step to identify a shared context between all regions targeted by the same TF, always excluding the left-out test region (Figure 1 First, we apply two traditional motif discovery methods, namely MotifSampler [38] and oligo-analysis [39] to each Dmel training set. The number of motifs found by MotifSampler is a parameter of the method, and was set to 5. Oligo-analysis has the advantage that only significantly over-represented k-mers (where k is 6,7 or 8) are reported, yielding between zero and 18 motifs per regulon, with an average of 4.3 and a standard deviation of 4.17. Oligo-analysis gives slightly better results than MotifSampler in the high-specificity range in which we are interested. The reason why the performance for oligo-analysis decreases in the low-specificity range is our conservative approach that instructs Cluster-Buster to rank the left-out region last when oligo-analysis does not find any common motifs in the training set. 
Note that oligo-analysis yields over-represented DNA words, not PWMs, while Cluster-Buster requires PWMs as input for the scoring step. To solve this problem we transform each over-represented word into a pseudo-PWM (see Methods and Supplementary Note 1). In agreement with published results [40], we observe that single species motif discovery results in unsatisfactory enhancer models that are not able to generalize (Figure 5A
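The word-to-matrix conversion can be sketched as follows, using the values given in the Methods (10 for the letters forming the word, 0 for the other letters). The function name and the row layout (one row per position, columns ordered A, C, G, T) are assumptions of this sketch; the input format actually expected by the scoring program is not shown here.

```python
BASES = "ACGT"


def word_to_pseudo_pwm(word, hit=10, miss=0):
    """Turn an over-represented DNA word into a pseudo-PWM.

    Returns one row per position, with counts ordered as A, C, G, T.
    """
    word = word.upper()
    if any(base not in BASES for base in word):
        raise ValueError("word must contain only A, C, G, T")
    return [[hit if base == b else miss for b in BASES] for base in word]


print(word_to_pseudo_pwm("ACG"))
# [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0]]
```

Because every position carries a single non-zero entry, such a matrix only rewards exact or near-exact matches to the word, unlike a PWM built from multiple aligned sites.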
We examined individual TF performances (Figure 5B Genome-wide target gene prediction The ultimate goal of computational target gene prediction is to obtain a high quality set of candidate targets by scanning one or more genomes. The cross-validations described above show promising results for a number of transcription factors. However, in practice several obstacles arise. The first obstacle in genome scanning is the definition of the search space and the association of a predicted regulatory region with a target gene. For the Drosophila genomes we chose to attribute a CRM to a gene when it either lies in an intron of that gene or within 5 kb upstream of that gene. Each sequence region of a gene (i.e., each intron and the upstream sequence) receives the maximal CRM score found within that region. For the network-level conservation approach, each region in the Dmel genome is associated with its orthologous region in another genome using UCSC's genome alignments. The top 100 scoring genes for each of the 34 TFs (using the optimal enhancer model for each TF), together with the locations of the predicted motif clusters, are presented on our website (http://med.kuleuven.be/cme-mg/lng/cisTarget). The second, more problematic obstacle is the size of the test set that now consists of 93330 regions (all regions for all genes). Sensitivities of 50% will no longer be feasible because the top 10% regions represent more than 1000 genes, which is too large for biological validation and even for further filtering by functional annotations. On the other hand, we know from the LOOCV tests that even the top 1% (136 genes) could contain a few bona fide target genes, because about 10% of the targets (18 out of 166 for the phyloPWM with motif conservation) were found in the top 1% (i.e., top 5 out of 501 regions) in the LOOCV. We attempted to enrich for true targets within the top 100 genes by comparing over-represented Gene Ontology (GO) terms with the GO annotation of the TF itself. 
When the optimal model (based on the LOOCV results) is chosen for each TF, 15 TFs show a top 100 TG prediction with a significant enrichment of at least one biological process of the TF (see Table 1; the complete GO results for all TFs are available from our website). To test whether the selection of the optimal model is important for GO enrichment, we compared the optimal model to one reference model, namely the homotypic model using the Dmel PWM and network-level conservation. Indeed, by using the reference model instead of the optimal model for each TF, 8 factors lose the GO enrichment that was relevant for the TF, another three TFs have a p-value that is still significant but less significant than with the optimal model, one TF has the same p-value and only one TF has a better p-value (data not shown).
Filtering based on functional annotation is however limited to the detection of genes with GO IDs. For genes without GO IDs computational predictions can be complemented with data from other sources, such as phenotypic data, protein-protein interaction or expression data. As an example we analyzed a regulon of the eye determination transcription factor Ey. Ey was not included in our earlier dataset because the FlyReg database only contains one target gene of Ey, namely sine oculis (so). However, recent studies have identified four more targets of Ey: eya, shf, Optix, and atonal [42]. The first three were found by expression studies comparing wild type and ectopic Ey over-expression [43]. All five genes have identified Ey binding sites that have been validated through in vitro binding studies and in vivo reporter assays. Ostrin et al. conducted a microarray experiment and identified 188 potential TGs of Ey whose expression is independent of the function of the retinal differentiation TF atonal. The total of five target regions can now be subjected to cross-validation. The assessment suggests that the homotypic approach with multiple genomes performs best for Ey, although the differences between methods are small in the high specificity range in which we are interested (data not shown). Although Ostrin et al. used a phyloPWM in their study [43], the cross-validation performances we observed for our phyloPWM were similar to the Dmel PWM. The homotypic model using the Dmel PWM and network-level conservation was applied to the full genome, ranking all 5 kb upstream regions and all introns. The top 100 scoring genes have “photoreceptor cell differentiation” over-represented with a p-value of 6.45e-06. The 9 photoreceptor differentiation genes in the top 100 are bun, eya, fz, hth, lilli, mbl, Optix, pnt, and sdk. 
It is important to note that neither GO:0048748, nor any other eye development GO term, is over-represented in the top 100 candidate genes when the scoring is performed on Dmel alone. Nine of the top 100 genes are also present in the set of 188 genes of Ostrin et al. To determine the significance of this result, the same procedure was repeated for 100 randomizations of the Ey PWM. The top 100 scoring genes from randomized PWMs contain 3.70±2.84 genes (95% confidence interval) in common with the Ostrin set of 188 genes. The maximal difference between the true ranking and the randomizations is obtained at a threshold of 171 genes (Figure S3). At this threshold, the randomized PWMs yield an overlap of 6.05±3.65 while the true PWM yields an overlap of 14 genes (p-value of 2.3288e-05). These 14 genes are mspo, SK, so, toy, ey, CG17816, Optix, CG30492, CG32521, osp, Fas2, CG5888, Tie, and eya.

Discussion

Gene regulatory networks of TFs and their TGs play key roles in development, homeostasis and behavior. The relationship between a TF and its TGs is achieved through “DNA words”, usually 4–12 bp long, acting as binding sites for the TF in question. A collection of such motifs regulating a specific aspect of the expression of a gene defines a CRM. Therefore, to understand network organization, we need to understand the organizational logic of CRMs. Since most networks mediating a specific biological function consist of multiple TFs and their (sometimes overlapping) TGs, the capacity to detect TGs genome-wide and in silico with high accuracy would bring major advantages. In this respect, regulatory bioinformatics faces a few challenges. First, the organizational logic(s) of CRMs is not clear. Do most TFs regulate their TGs via multiple or single TFBSs? How conserved do such TFBSs need to be at the sequence level to be detectable? Can the contextual information around a given TFBS in a CRM be harnessed to improve TG detection?
Does the availability of multiple related genomes foster or complicate CRM detection? And finally, do different TFs follow different logics? Developing and assessing methods that can address these issues is clearly a major goal of regulatory bioinformatics. To begin to address these issues we took advantage of the availability of 12 Drosophila genomes and the Drosophila database of TF-TG relations (FlyReg) to perform a benchmark study on genome-wide in silico TG prediction. Drosophila is ideally suited for such an approach not only because multiple genomes are now available, but also because a few regulatory networks, such as the segmentation network, have been well-established in vivo resulting in several deeply understood TF-TG relations. One challenging problem of target gene prediction is the mapping of an in silico predicted regulatory locus to the gene it potentially regulates. Indeed, several known enhancers are located one or several genes upstream of the actual target gene, for example the cluster of dorsal binding sites that regulates zen is located directly upstream of CG1162. To circumvent this problem in the benchmark analysis and to make direct comparisons at the sequence level possible, we added isolated regulatory regions to sets of negative sequences. This setup has the additional advantage that it requires less computational resources so that more parameters can be varied. We chose the same region size for each enhancer and each transcription factor to exclude confounding effects of the size when comparing across factors (Cluster-Buster can generate higher scores for larger regions). For the cross-validations we arbitrarily set this size to 1 kb. Such a choice is justified because the cross-validations are intended to compare relative performances rather than absolute performances. When a cross-validation procedure is applied on a single factor under study, the region size could be considered as an additional parameter. 
For the genome-wide scoring, the size of the motif clusters is determined by Cluster-Buster. Cross-validation tests and subsequent genome-wide TG predictions result in both higher average performances across a data set of more than 30 TFs (of a total of approximately 700 TFs in the Dmel genome [44]) as well as determination of the optimal parameters for each of the individual TFs. This is of particular value for molecular geneticists who are likely to be interested in one or a few TFs within a network and for whom average performance across a large dataset is not particularly useful. The difference between the two is highlighted by our finding that two different performance parameters result in highly similar average performance, but radically different single TF performance profiles. As a result of these advantages the number of factors for which TG detection becomes highly performant (AUC values above 0.90) is increased from 5/34 to 19/34. We attempted to address some of the questions facing regulatory bioinformatics. The major conclusions from our work are as follows. First, there is unlikely to be a single unifying CRM logic, at least at current levels of genome annotation resolution. We find that whereas some TFs perform optimally with single TFBS parameters (data not shown), others use clusters of homotypic TFBSs and still others use heterotypic TFBS clusters. Thus, a methodology that can determine the optimal approach per TF a priori is necessary for successful single TF based TG predictions. Second, the availability of multiple genomes is, in general, extremely useful for genome-wide TG prediction. One exception to this rule is that, contrary to conventional wisdom and several previous reports, additional genomes do not always result in better PWM building. The reason for this could be that PWMs that are built from sufficiently distinct binding sites (e.g., more than 20) already possess enough variation and do not benefit from the addition of more sites. 
A small number of PWMs with very few, but highly conserved, sites (e.g., HLHm5) do benefit from a phylogenetic extension. The availability of multiple genomes becomes especially useful in two other ways. First, it improves the training of a CRM model from a set of known target regions, to discover sites that are both conserved and shared across this set. We have assessed whether such de novo discovered motifs could contribute to a better enhancer model for the TF. Second, it improves the scoring of a CRM model by taking advantage of functional enhancer conservation or TFBS conservation. We found no obvious explanation (e.g., in terms of correlations with TF families, with the conservation of known sites, or with the size of the TG set) why some TFs perform better with network-level conservation and others with motif conservation. When more cis-regulatory data become available as validation sets, for example through open community-based annotation [45], a deeper investigation of this issue may become feasible. Although several of the tested variables, most importantly the integration of multiple genomes, can result in significantly enhanced TG prediction accuracy, more work is needed to improve on this performance because only a portion of the true target enhancers could be detected. Again, it can be expected that performance will increase further when more knowledge about regulatory regions emerges. For example, King et al. recently found that in vertebrates some regulatory regions correlate well with phastCons conservation scores (used for our motif conservation), while other regulatory regions correlate better with alignment-based scores that are corrected for background neutral substitution rates [46]. However, even with more advanced interspecies comparisons, on a genomic scale the true positive TF-TG interactions are spread out across many other high-scoring interactions.
At present it is difficult to determine for a certain TF whether the other high-scoring genes are also bona fide TGs, false positive predictions, or (most likely) a mix of both. Several ways can be envisioned to further improve the performance. For example, enhancer predictions can be combined with other data types to filter the ranked enhancers, such as GO terms, as we have shown for the TFs in Table 1, and/or large-scale expression data, as we tested for the eye determination TF Ey. The benchmark dataset can be used in the future to evaluate novel methods for de novo motif discovery, enhancer prediction, or target gene prioritization. In summary, we have tested several strategies and parameters for the computational prediction of TF-TG relations through TFBS and CRM detection. The selection of the best strategy for each individual TF, combined with the extensive use of multiple genomes during both the training and scoring of enhancer models, results in a significant step forward for bioinformatics in solving the architecture of gene regulatory networks.

Materials and Methods

Data

Our dataset consists of 166 TF-target relations for 34 transcription factors, generated by selecting all known TFs from FlyReg [20] that have at least 3 distinct target genes (Table S1). Each TF-TG relation is represented by a test sequence, defined by selecting 1000 bp of flanking sequence around only one of the TF-specific footprints around the TG. The ‘experimental’ PWMs are constructed by taking the best hit within each footprint after scoring with the corresponding matrices that were constructed by Daniel Pollard using the MEME algorithm (http://rana.lbl.gov/dan/matrices.html). The scoring was done using Patser [47]. PWM scrambling is done by permuting the columns of a matrix, thereby conserving the A/T and C/G composition. 500 negative sequences are selected at random from all 1 kb proximal sequences from the UCSC Table Browser [48].
Other negative sets that we tested are all the REDfly [31] enhancers with a maximal size of 1 kb (308 in total), extended in the genome to 1 kb, and 250 sequences of 1 kb surrounding a test sequence (125 on each side). Sets of orthologous sequences for positive and negative Dmel sequences are assembled using the liftOver utility of the UCSC Genome Browser [49]. Multiple output regions, due to homology to discontinuous regions, are all retained for training and scoring. Sequences aligned to Dmel TFBSs, used to build phyloPWMs, are also obtained through the UCSC liftOver utility. PhastCons conservation scores were downloaded from the UCSC download pages. Species and UCSC assemblies used throughout the analyses are D. melanogaster (dm2), D. simulans (DroSim1), D. sechellia (DroSec1), D. yakuba (DroYak1), D. erecta (DroEre2), D. ananassae (DroAna2), D. pseudoobscura (dp3), D. persimilis (DroPer1), D. virilis (DroVir2), D. mojavensis (DroMoj2), and D. grimshawi (DroGri1).

Cross-validation

Methods for motif discovery, for CRM prediction, and for the use of multiple genomes are compared through leave-one-out cross-validation (LOOCV) (see Figure 1

Sequence scoring

The program Cluster-Buster [15] is used to score a sequence with a set of PWMs, using 1000 bp (the length of the test sequences) as range (-r option) for counting local nucleotide abundances. Order statistics are applied to integrate Cluster-Buster-based rankings on multiple genomes as described in [30] and in Figure 3B

Pattern discovery

For each LOOCV run, the Gibbs sampling program MotifSampler [38] was run 50 times for motif widths 5, 6, 8, 10, 12, and 14, with a prior of 0.2 and a 0th order background model from the whole 1 kb dataset. Resulting motifs were clustered and ranked according to information content as done in [40].
The program oligo-analysis [39] detects motifs in sequences by counting the occurrences of all oligonucleotides and calculating their significance according to a background model, which is estimated by counting the occurrences of all oligonucleotides in all species-specific upstream sequences. Prior to their analysis, training sequences were purged with the program mkvtree [51] to discard internally repeated fragments. The genome-wide repetitive fragments were also masked for the pattern discovery step using the UCSC Genome Browser RepeatMasker annotation (http://www.repeatmasker.org). Oligonucleotide occurrences of all sizes between 6 and 8 were counted on both strands, considering only the renewing occurrences (self-overlapping occurrences of the same word were discarded). The threshold of significance was set to 1, corresponding to an E-value of 1 false positive oligonucleotide per 10 training sets. Each run of oligo-analysis returns a set of over-represented oligonucleotides, each of which was used to construct a pseudo-PWM with a value of 10 for the letters forming the word and 0 for the other letters (see Note S1). To use oligo-analysis on multiple species, we detect over-represented words in each species separately, and use those to score the test regions of the respective species.

The PhyloGibbs program detects motifs that are both conserved and shared across co-regulated sequences [41]. PhyloGibbs was run with parameters -D1 -m8 -z5 (5 motifs of width 8), -L“((DroGri1:0.7,(DroVir2:0.75,DroMoj2:0.75):0.7):0.5,((DroPer1:0.95,Dp3:0.95):0.7,(DroAna2:0.8,(DroEre1:0.82,(DroYak1:0.85,(Dm2:0.9,(DroSec1:0.95,DroSim1:0.95):0.9):0.85):0.82):0.8):0.7):0.5)” and whole chr2L as background sequence. Resulting matrices were compared with FlyReg matrices using the Kullback-Leibler distance [17].

Figure S1 Leave-one-out cross-validation performance for different negative sets.
The rank of the positive “test” region (1 kb) within a set of negative sequences (all 1 kb) is plotted cumulatively. The negative sequences were 500 randomly selected proximal promoter sequences upstream of the annotated transcription start site (black curve); 308 REDfly enhancers of maximally 1 kb length, all genomically extended to 1 kb (blue curve); 250 flanking sequences around the positive region (green curve); or 500 randomly generated sequences of 1 kb using a 5th order Markov model trained on all Dmel upstream sequences. (0.15 MB PDF)

Figure S2 Comparison of PhyloGibbs PWMs and real PWMs. All 166×5 motifs resulting from PhyloGibbs were compared to all 34 real PWMs, using the program MotifComparison, which implements the Kullback-Leibler distance between matrices [1]. The left column shows the real PWMs, the middle column the matching PhyloGibbs motifs, and the right column the distance between both. Only eight real PWMs could be matched below distance threshold 1.0. (0.70 MB PDF)

Figure S3 Overlap between Ey candidate targets obtained from microarray data and from genome-wide binding site prediction. At different cut-offs N (x-axis), the N top-scoring Ey candidate targets, based on motif detection, are compared with the 188 Ey candidate targets obtained from gene expression studies [1]. The same is done for each of 100 randomized rankings (obtained by using a randomized Ey PWM). (A) For each cut-off value, the number of genes in common between the two sets is plotted on the y-axis. The values for the true Ey PWM are in blue, while the mean values of the randomized PWMs are in black. A 95% confidence interval is plotted in red dashed lines. (B) A p-value for each cut-off value is plotted on the y-axis. The p-values are calculated using a normal distribution based on the mean and standard deviation from the randomized rankings.
The optimal p-value is obtained by using a cut-off value of 171. (0.08 MB PDF)

Table S1 Dataset used in the study. 166 TF-target relations extracted from the FlyReg database [1]. For all factors with at least three different target genes, one footprint was chosen. 1000 bp flanking this footprint is used as training or test sequence in the cross-validation. (0.11 MB PDF)

Note S1 Linking oligo-analysis output with Cluster-Buster input. (0.07 MB PDF)

Acknowledgments

SA acknowledges Denis Thieffry, Bernard Jacq, and Carl Herrman from IBDML for helpful discussions and Carl Herrman for mapping FlyReg matrices to footprints. We thank Yves Moreau for help with order statistics; Rachel Harte, Hiram Clawson and colleagues at UCSC for genome sequence data, tools, and helpful suggestions on their usage; Daniel Pollard for FlyReg weight matrices; and Wouter Bossuyt for suggestions on the manuscript. Part of this work was performed during a research visit at the Developmental Biology Institute of Marseille Luminy (IBDML), Marseille, France.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This research was funded by an FWO (Fund for Scientific Research, Flanders) postdoctoral fellowship (to SA) as well as VIB, a Concerted Research Action (GOA) grant and an Impuls grant from University of Leuven (to BAH). Additional funding came from the BioSapiens Network of Excellence, funded under the sixth Framework program of the European Communities (LSHG-CT-2003-503265 to JvH). Linux clusters from KUL and SMBCC were used, the latter funded by the Belgian Fonds de la Recherche Fondamentale Collective and the Fonds d'Encouragement à la Recherche.

References

1. Davidson EH. Genomic Regulatory Systems. San Diego, USA: Academic Press; 2001. p. 261. 2. zur Lage PI, Powell LM, Prentice DR, McLaughlin P, Jarman AP.
EGF receptor signaling triggers recruitment of Drosophila sense organ precursors by stimulating proneural gene autoregulation. Dev Cell. 2004;7:687–696. [PubMed] 3. Sun Y, Jan LY, Jan YN. Transcriptional regulation of atonal during development of the Drosophila peripheral nervous system. Development. 1998;125:3731–3740. [PubMed] 4. Hassan BA, Bermingham NA, He Y, Sun Y, Jan YN, et al. Atonal regulates neurite arborization but does not act as a proneural gene in the Drosophila brain. Neuron. 2000;25:549–561. [PubMed] 5. Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, et al. Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res. 2006;16:656–668. [PubMed] 6. Taverner NV, Smith JC, Wardle FC. Identifying transcriptional targets. Genome Biol. 2004;5:210. [PubMed] 7. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004;5:276–287. [PubMed] 8. Markstein M, Markstein P, Markstein V, Levine SM. Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc Natl Acad Sci U S A. 2002;99:763–768. [PubMed] 9. Ochoa-Espinosa A, Yucel G, Kaplan L, Pare A, Pura N, et al. The role of binding site cluster strength in Bicoid-dependent patterning in Drosophila. Proc Natl Acad Sci U S A. 2005;102:4960–4965. [PubMed] 10. Rebeiz M, Reeves LN, Posakony WJ. SCORE: a computational approach to the identification of cis-regulatory modules and target genes in whole-genome sequence data. Site clustering. Proc Natl Acad Sci U S A. 2002;99:9888–9893. [PubMed] 11. Schroeder DM, Pearce M, Fak J, Fan H, Unnerstall U, et al. Transcriptional control in the segmentation gene network of Drosophila. PLoS Biol. 2004;2:E271. [PubMed] 12. Halfon MS, Grad Y, Church GM, Michelson AM. 
Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome Res. 2002;12:1019–1028. [PubMed] 13. Berman PB, Pfeiffer DB, Laverty RT, Salzberg LS, Rubin MG, et al. Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila. Genome Biol. 2004;5:R61. [PubMed] 14. Eddy SR. What is a hidden Markov model? Nat Biotechnol. 2004;22:1315–1316. [PubMed] 15. Frith CM, Li CM, Weng Z. Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Res. 2003;31:3666–3668. [PubMed] 16. Rajewsky N, Vergassola M, Gaul U, Siggia ED. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics. 2002;3:30. [PubMed] 17. Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, et al. Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res. 2003;31:1753–1764. [PubMed] 18. Emberly E, Rajewsky N, Siggia DE. Conservation of regulatory elements between two species of Drosophila. BMC Bioinformatics. 2003;4:57. [PubMed] 19. Sinha S, Schroeder MD, Unnerstall U, Gaul U, Siggia ED. Cross-species comparison significantly improves genome-wide prediction of cis-regulatory modules in Drosophila. BMC Bioinformatics. 2004;5:129. [PubMed] 20. Bergman MC, Carlson WJ, Celniker ES. Drosophila DNase I footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila. Bioinformatics. 2005;21:1747–1749. [PubMed] 21. Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007 doi:10.1038/nature06341. 22. Stark A, Lin MF, Kheradpour P, Pederson J, Parts L, et al. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature. 2007 doi:10.1038/nature06340. 23. 
Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA. Homotypic regulatory clusters in Drosophila. Genome Res. 2003;13:579–588. [PubMed] 24. Johansson O, Alkema W, Wasserman WW, Lagergren J. Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics. 2003;19(Suppl 1):i169–76. [PubMed] 25. Bailey TL, Noble WS. Searching for statistically significant regulatory modules. Bioinformatics. 2003;19(Suppl 2):II16–II25. [PubMed] 26. Segal E, Sharan R. A discriminative model for identifying spatial cis-regulatory modules. J Comput Biol. 2005;12:822–834. [PubMed] 27. Philippakis AA, Busser BW, Gisselbrecht SS, He FS, Estrada B, et al. Expression-guided in silico evaluation of candidate cis regulatory codes for Drosophila muscle founder cells. PLoS Comput Biol. 2006;2:e53. [PubMed] 28. Chang LW, Nagarajan R, Magee JA, Milbrandt J, Stormo GD. A systematic model to predict transcriptional regulatory mechanisms based on overrepresentation of transcription factor binding profiles. Genome Res. 2006;16:405–413. [PubMed] 29. Chan BY, Kibler D. Using hexamers to predict cis-regulatory motifs in Drosophila. BMC Bioinformatics. 2005;6:262. [PubMed] 30. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, et al. Gene prioritization through genomic data fusion. Nat Biotechnol. 2006;24:537–544. [PubMed] 31. Gallo SM, Li L, Hu Z, Halfon MS. REDfly: a Regulatory Element Database for Drosophila. Bioinformatics. 2006;22:381–383. [PubMed] 32. Moses AM, Pollard DA, Nix DA, Iyer VN, Li XY, et al. Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS Comput Biol. 2006;2:e130. [PubMed] 33. Stuart JM, Segal E, Koller D, Kim SK. A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003;302:249–255. [PubMed] 34. Pritsker M, Liu YC, Beer MA, Tavazoie S. Whole-genome discovery of transcription factor binding sites by network-level conservation. Genome Res. 
2004;14:99–108. [PubMed] 35. Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, et al. The UCSC Genome Browser Database. Nucleic Acids Res. 2003;31:51–54. [PubMed] 36. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13:721–731. [PubMed] 37. Siepel A, Haussler D. Phylogenetic hidden Markov models. In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: Springer; 2005. 38. Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, et al. A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J Comput Biol. 2002;9:447–464. [PubMed] 39. van Helden J, Andre B, Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol. 1998;281:827–842. [PubMed] 40. Tompa M, Li N, Bailey TL, Church GM, De Moor B, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23:137–144. [PubMed] 41. Siddharthan R, Siggia DE, Nimwegen Ev. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol. 2005;1:e67. [PubMed] 42. Zhang T, Ranade S, Cai CQ, Clouser C, Pignoni F. Direct control of neurogenesis by selector factors in the fly eye: regulation of atonal by Ey and So. Development. 2006;133:4881–4889. [PubMed] 43. Ostrin JE, Li Y, Hoffman K, Liu J, Wang K, et al. Genome-wide identification of direct targets of the Drosophila retinal determination protein Eyeless. Genome Res. 2006;16:466–476. [PubMed] 44. Adryan B, Teichmann SA. FlyTF: a systematic review of site-specific transcription factors in the fruit fly Drosophila melanogaster. Bioinformatics. 2006;22:1532–1533. [PubMed] 45. Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, et al. 
ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics. 2006;22:637–640. [PubMed] 46. King DC, Taylor J, Zhang Y, Cheng Y, Lawson HA, et al. Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. Genome Res. 2007;17:775–786. [PubMed] 47. Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999;15:563–577. [PubMed] 48. Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004;32:D493–6. [PubMed] 49. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. [PubMed] 50. Boyle IE, Weng S, Gollub J, Jin H, Botstein D, et al. GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms. Bioinformatics. 2004;20:3710–3715. [PubMed] 51. Kurtz S, Choudhuri JV, Ohlebusch E, Schleiermacher C, Stoye J, et al. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 2001;29:4633–4642. [PubMed]