Nucleic Acids Res. 2011 Jul 1; 39(Web Server issue): W79–W85.
Published online 2011 May 18. doi:  10.1093/nar/gkr291
PMCID: PMC3125742

Genome Surveyor 2.0: cis-regulatory analysis in Drosophila


Genome Surveyor 2.0 is a web-based tool for discovery and analysis of cis-regulatory elements in Drosophila, built on top of the GBrowse genome browser for convenient visualization. Genome Surveyor was developed as a tool for predicting transcription factor (TF) binding targets and cis-regulatory modules (CRMs/enhancers), based on motifs representing experimentally determined DNA binding specificities. Since its first publication, we have added substantial new functionality (e.g. phylogenetic averaging of motif scores from multiple species, and a novel CRM discovery technique), increased the number of supported motifs about 4-fold (from ∼100 to ∼400), added provisions for evolutionary comparison across many more Drosophila species (from 2 to 12), and improved the user interface. The server is free and open to all users, and there is no login requirement. Address: http://veda.cs.uiuc.edu/gs.


Cis-regulatory analysis is a key step in understanding and decoding transcriptional regulatory networks. The researcher is interested in determining which transcription factors (TFs) regulate a gene (or genes) of interest, the locations of binding sites for those TFs, and, if the analysis has an evolutionary component, how those binding sites and regulatory influences evolve across species. For Drosophila researchers, these tasks have been greatly facilitated by the availability of 12 Drosophila genomes (1,2) and vast amounts of other genetic and genomic data (3–5). In addition, a variety of computational tools can complement high-throughput experimental approaches to the above tasks, and help the biologist efficiently design and conduct hypothesis-driven experiments. For instance, available computational methods can summarize known binding specificities of TFs as ‘motifs’ and search the genome (or genomic regions near a specific gene) for matches to these motifs, thus identifying putative TF binding sites. Other more sophisticated methods can produce estimates of TF binding strength in a DNA segment, by integrating all putative binding sites, both weak and strong, present in that segment. Application of these methods to multiple Drosophila genomes, coupled with whole-genome alignments, can help describe the evolution of TF binding events. Cross-species comparison can also improve the accuracy of predicting TF binding targets (6–8).

Computational methods have also been used to search for clusters of binding sites of multiple TFs, with the goal of identifying cis-regulatory modules (CRMs, also called enhancers). CRMs are ∼500–1000-bp long regulatory elements that harbor multiple binding sites that together mediate a specific expression pattern of a neighboring gene (9). The identification of CRMs can provide a meaningful context in which the role of individual TF binding sites can be interpreted; they may also help reduce false positives in predicting individual binding sites. More recently, statistical methods have been demonstrated to recover functional CRMs without the prior knowledge of relevant TFs and/or their motifs. Such motif-blind approaches adopt the alternative paradigm of ‘supervised CRM discovery’, where a set of known CRMs with similar functionality (expression patterns) are used as ‘training data’ to locate other similar CRMs in the genome (10,11).

Genome Surveyor 2.0 presents an easy-to-use, web-based graphical interface to many of the cis-regulatory analysis tools mentioned above. It allows the user to perform TF target prediction and CRM discovery using any motif(s) from the FlyFactorSurvey database (12), the most comprehensive resource for Drosophila motifs today. It displays genome browser ‘tracks’ that profile matches to individual motifs or user-selected combinations of motifs, based on sequence information from a single genome or a combination of genomes. It also provides tracks for ‘supervised CRM prediction’ (10), driven by a user-selected subset of known CRMs from the REDfly database (13). Additional tracks are available to visualize related information such as chromatin immunoprecipitation (ChIP)-based profiles of TF occupancy, and previously characterized CRMs from the literature. In addition to providing locus-centric visualization of cis-regulatory elements, Genome Surveyor 2.0 provides an interface to search for motif/ChIP-based binding site clusters genome-wide.


Genome Surveyor 2.0 provides users with the following components to perform cis-regulatory analysis in Drosophila melanogaster (Figure 1A).

  1. Single/multi-species motif profiles. A motif profile displays the estimated binding site presence for a user-selected TF motif as a function of genomic coordinates. We obtain the single species profiles by running the program Stubb (14) and multi-species profiles by averaging the profiles of orthologous regions from selected species.
  2. Supervised CRM discovery profiles. This component allows the user to specify a set of known CRMs and search for novel CRMs that have a similar k-mer composition to the specified set. Supervised CRM discovery methods do not require pre-selection of motifs, and provide a viable alternative to predicting functional CRMs, as explained in (10).
  3. Profiles of other cis-regulatory information. ChIP-based-binding profiles (from BDTNP) and experimentally validated CRMs (from REDfly) can be displayed along with other profiles. In addition to Stubb-based motif profiles, the user may visualize binding site predictions by a more traditional method (individual matches above a threshold).
  4. Search for Motif/ChIP clusters of binding sites. This component provides the user with an ability to search the entire genome (or list of loci) for the most significant clusters of motif matches and/or ChIP sites.

Figure 1.
Genome Surveyor 2.0 input/output. (A) General Scheme of Genome Surveyor 2.0. Users may investigate the region of interest (defined through GBrowse) for potential cis-regulatory activity using a variety of tracks. To find the targets of specific motifs, ...

The first three components are implemented as plugins for GBrowse (15), and their outputs are ‘tracks’ that may be added to the current view of GBrowse. Note that all of these tracks/profiles can be displayed simultaneously, as illustrated in Figure 1B.

Single/multi-species motif profiles

We have pre-computed the motif profiles of a large collection of experimentally validated TFs for D. melanogaster (12) using the Hidden Markov Model-based program Stubb (14). (Stubb examines each 500-bp window and computes a score for the presence of one or more strong or weak binding sites in that window, without imposing arbitrary thresholds on what constitute a motif match.) We have also generated motif profiles for 11 other Drosophila species and mapped them to the D.mel coordinates. All profiles are normalized using their genome-wide mean and standard deviation. Users may select from the following options related to motif profiles:

  • Individual species, individual motif: This option displays the profiles of the selected motif(s) in the selected species. Given this option, users might easily check, for example, whether a specific potential binding event is conserved between D.mel and D.pse by turning on the tracks of the corresponding motif for both species. Also, they may easily assess the similarity between the targets of two or more TFs. All tracks are directly linked to (and just a click away from) the FlyFactorSurvey database (12) that provides detailed information about the binding site's specificity and the method used to characterize it.
  • Individual species, multi-motif: This option averages the profiles of selected motifs for each selected species. This provides a convenient way to look for clusters of binding sites of several TFs, as a means to discover novel CRMs. For example, a user searching for enhancers regulating dorsal/ventral (D/V) patterning may choose to select the motifs involved in this process (e.g. those for the TFs Dl, Twi, Sna) and examine their average profile. The user may repeat this process for other species as well, to examine if the predicted CRM in D. melanogaster is independently supported by predictions at orthologous locations in those species.
  • Multi-species, individual motif: This option combines the profiles of a selected motif from different species, using simple averaging or a phylogenetic tree-based averaging (7). The peaks in this profile represent the TF targets that are conserved across species.
  • Multi-species, multi-motif: This option averages all the profiles from selected motifs and species to create a single track. The peaks in this profile represent the strong clusters of binding sites that are conserved across species, and may thus correspond to functional CRMs.
  • User-defined motif: This option allows users to input their own Position Weight Matrices (PWMs), rather than selecting from a pre-defined list of motifs. Although there has been an intense effort to characterize the binding specificities of all TFs in D. melanogaster (16), there remain many TFs with unknown binding specificity. The user-defined motif option allows motifs that are not part of the publicly available database to be used.
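
The normalization and averaging options described above can be illustrated with a minimal Python sketch. It is not the Genome Surveyor implementation: the profile values are invented toy numbers, a four-position list stands in for a genome-wide profile, the names `zscore` and `average_profiles` are hypothetical, and only simple averaging is shown (the phylogenetic tree-based averaging of (7) is not reproduced).

```python
from statistics import mean, pstdev


def zscore(profile):
    """Normalize a profile by its own mean and standard deviation.

    In this toy sketch the short list plays the role of a genome-wide profile.
    """
    mu = mean(profile)
    sigma = pstdev(profile)
    if sigma == 0:
        return [0.0 for _ in profile]
    return [(x - mu) / sigma for x in profile]


def average_profiles(profiles):
    """Position-wise simple average of profiles on a shared coordinate system."""
    if len({len(p) for p in profiles}) != 1:
        raise ValueError("profiles must have the same length")
    return [mean(column) for column in zip(*profiles)]


# Toy window scores keyed by (species, motif); the values are made up and
# assumed to be already mapped to D. melanogaster coordinates.
raw = {
    ("dmel", "bcd"): [1.0, 2.0, 9.0, 2.0],
    ("dmel", "hb"): [0.0, 1.0, 5.0, 2.0],
    ("dpse", "bcd"): [2.0, 2.0, 8.0, 4.0],
}
normalized = {key: zscore(profile) for key, profile in raw.items()}

# "Individual species, multi-motif": average two motifs within one species.
multi_motif = average_profiles(
    [normalized[("dmel", "bcd")], normalized[("dmel", "hb")]]
)

# "Multi-species, individual motif": average one motif across two species.
multi_species = average_profiles(
    [normalized[("dmel", "bcd")], normalized[("dpse", "bcd")]]
)

# Index of the highest-scoring window in the multi-motif track.
peak = max(range(len(multi_motif)), key=multi_motif.__getitem__)
```

Normalizing each profile before averaging keeps a motif with a large raw score range from dominating the combined track, which is presumably why the profiles are standardized first.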

Supervised CRM discovery profiles

The REDfly database catalogs over 800 experimentally characterized CRMs in D. melanogaster, along with their spatial/temporal expression patterns (13). This extensive resource can be used as ‘training data’ to computationally predict novel CRMs genome-wide, through ‘supervised CRM discovery’ methods. These methods score a genomic segment for sequence similarity to any given set of known CRMs. The similarity score is based on frequencies of short words in the sequences, and can detect the presence of shared binding sites without relying on prior knowledge of motifs. As such, this is a pragmatic approach to CRM discovery when the likely transcriptional regulators of a gene are not known in advance, or their binding specificities have not been characterized. Genome Surveyor 2.0 allows the user to profile any genomic region with two different scores [HexMCD and IMM (M. Kazemian, Q. Zhu, M. S. Halfon, S. Sinha, manuscript under preparation)] (10). The training set of CRMs may be selected as one of over 30 different subsets of REDfly CRMs, defined by the tissue/stage of development that they help regulate (11). The user may also upload a Fasta file of CRM sequences.
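
As a rough illustration of scoring by short-word frequencies, the Python sketch below compares the k-mer composition of sliding windows with that of a training set. It is deliberately simplistic: it uses cosine similarity of k-mer frequency vectors, which is an assumption chosen for brevity and is not the HexMCD or IMM score. The sequences, the parameter values and the function names are all hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(sequences, k):
    """Relative frequencies of all k-mers (A/C/G/T only) in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_windows(sequence, training, k=3, window=12, step=4):
    """Score each window by similarity of its k-mer profile to the training set."""
    model = kmer_frequencies(training, k)
    scores = []
    for start in range(0, len(sequence) - window + 1, step):
        freqs = kmer_frequencies([sequence[start:start + window]], k)
        scores.append((start, cosine(freqs, model)))
    return scores


# Toy "known CRMs" and a toy region with one similar stretch in the middle.
training_crms = ["GGATTAGGATTA", "TTGGATTAGG"]
region = "ACACACACACAC" + "GGATTAGGATTA" + "CACACACACACA"

window_scores = score_windows(region, training_crms)
best_start, best_score = max(window_scores, key=lambda item: item[1])
```

Real CRMs are hundreds of base pairs long, so a practical version would use longer words and windows than the toy values here, and a background model to discount words that are common everywhere in the genome.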

Binding sites, ChIP and REDfly profiles

Users may select from the following three tracks for additional information to aid their analysis:

  • Binding sites above a threshold. This functionality is taken from MotifFinder.pm (http://gmod.org/wiki/MotifFinder.pm). It displays individual binding sites predicted based on how well they match the selected/provided motifs.
  • ChIP profiles. This track displays ChIP-based measurements of TF occupancy (17). At this time, these profiles are available for a limited number of TFs.
  • REDfly CRMs. This track shows experimentally verified D. melanogaster CRMs from REDfly (13). It helps users check whether any known enhancer lies in their region of interest. Each CRM is linked back to the REDfly database for detailed information (e.g. CRM expression pattern, the evidence for the element, the source, binding sites).
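
The ‘binding sites above a threshold’ idea can be sketched in Python as a log-odds scan with a position weight matrix. This is a generic textbook-style scan, not the code behind the track: the count matrix, the pseudocount, the uniform background and the threshold are hypothetical choices, and only the forward strand is scanned.

```python
from math import log2

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log-odds scores against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.get(base, 0) for base in BASES) + 4 * pseudocount
        matrix.append({
            base: log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in BASES
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, site, score) for every forward-strand match scoring at or above threshold."""
    hits = []
    width = len(matrix)
    sequence = sequence.upper()
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        if any(base not in BASES for base in site):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(column[base] for column, base in zip(matrix, site))
        if score >= threshold:
            hits.append((start, site, score))
    return hits


# Hypothetical count matrix with consensus TAATCC (10 observed sites per column).
toy_counts = [{base: 10} for base in "TAATCC"]
pwm = log_odds_matrix(toy_counts)

sequence = "GGTAATCCAGTAATCGA"
strict_hits = scan(sequence, pwm, threshold=8.0)   # only the perfect match
relaxed_hits = scan(sequence, pwm, threshold=6.0)  # also the one-mismatch site
```

Lowering the threshold admits the weaker one-mismatch site, which shows the arbitrariness that the Stubb-based profiles avoid by integrating strong and weak sites without a cutoff.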

Search interface for Motif/ChIP clusters of binding sites

CRMs are known to harbor binding sites for several TFs, which act together to achieve specific regulatory functions. As such, computational tools for genome-wide CRM discovery typically search for clusters of binding sites with suitably chosen collections of TF motifs. Genome Surveyor 2.0 provides an interface for users to search for the most significant clusters of binding sites in the D. melanogaster genome for any user-specified combination of TFs (Figure 2A).

Figure 2.
Search tool for motif/ChIP-based binding site clusters. (A) An interface to search for motif/ChIP-based binding site clusters. There are three main panels: Search options, Motif collection and ChIP data collection. Search options: here, the user may select ...

The search interface may be accessed from the main page of Genome Surveyor 2.0. Users first select the type of binding site profiles that will be used for search (Single/multi species motif profiles or ChIP profiles). Next, they may choose to scan the entire genome, or provide a list of genomic loci where the search will be performed. Advanced options (e.g. the number of top hits or the minimum number of different TFs in a predicted cluster) are available, but default settings are provided and help pages provide guidance for changing them. Finally, the user selects the motif or ChIP profiles of interest and begins the search. The output of the search tool is a table of predicted regulatory sequences (500-bp segments with clusters of binding sites) in the D. melanogaster genome, with links to appropriate GBrowse views (Figure 2B). The results are sorted based on the average value of the selected profiles in the segments. Single as well as multi-species scores are reported for each segment. Moreover, a score representing each motif's presence in the segment is shown separately, to help the user determine which motifs contribute significantly to the cluster. The output also includes information about the nearest neighboring genes and their distances from the binding site cluster.
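
The ranking step of such a search can be sketched in Python as follows. This is a sketch under stated assumptions rather than the server's code: the per-segment scores are invented, and the rule for deciding that a TF is ‘present’ in a segment (a score at or above `presence_cutoff`) is a hypothetical stand-in, since the text does not say how the minimum number of different TFs is evaluated.

```python
from statistics import mean


def rank_segments(profiles, top_hits=3, min_tfs=2, presence_cutoff=1.0):
    """Rank segments by the average of the selected profiles.

    profiles maps a motif name to a list of scores, one per segment; all
    lists refer to the same segments. A segment is kept only if at least
    min_tfs motifs score at or above presence_cutoff (an assumed rule).
    """
    n_segments = len(next(iter(profiles.values())))
    rows = []
    for index in range(n_segments):
        per_motif = {motif: scores[index] for motif, scores in profiles.items()}
        n_present = sum(1 for s in per_motif.values() if s >= presence_cutoff)
        if n_present < min_tfs:
            continue
        rows.append({
            "segment": index,
            "average": mean(list(per_motif.values())),
            "per_motif": per_motif,  # reported separately for each motif
        })
    rows.sort(key=lambda row: row["average"], reverse=True)
    return rows[:top_hits]


# Toy normalized scores for four consecutive segments and three motifs.
toy_profiles = {
    "bcd": [0.1, 2.5, 3.0, 0.2],
    "hb": [0.0, 2.0, 0.5, 1.5],
    "kr": [0.3, 1.8, 2.9, 1.2],
}

top_two = rank_segments(toy_profiles, top_hits=2, min_tfs=2)
all_three_tfs = rank_segments(toy_profiles, top_hits=2, min_tfs=3)
```

Keeping the per-motif scores next to the average mirrors the output table described above, where a separate score per motif helps the user see which motifs contribute to a cluster.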

Methods validation

Stubb is a popular CRM discovery tool that has been tested by multiple groups in different species (18–20). We have shown previously that regions with high Stubb scores are highly enriched for experimentally observed TF binding (ChIP), and that the enrichment improves significantly upon incorporating multi-species information (7). Stubb score profiles can be utilized to investigate the binding site composition of any genomic region. Figure 3 shows an example of motif regulatory analysis for two known CRMs. The strategy of combining the Stubb profiles of multiple TFs and identifying the segments with the highest average scores (Figure 3) has been demonstrated to recover known CRMs (16). Genome-wide predictions of the ‘supervised CRM prediction’ methods included in Genome Surveyor 2.0 have been assessed statistically and validated experimentally (10).

Figure 3.
Example of motif composition analysis in known CRMs. Shown is the 21-kb region surrounding the hairy gene. Eight different tracks are shown: the ‘Stubb individual species’ profiles for seven TFs (CAD, KR, BCD, GT, HB, KNI, TLL) involved ...


Funding for open access charge: This work was supported in part by grants from the National Institutes of Health (grant R01HG004744-01 to M.H.B., grant R01GM085233-01 to S.S.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of interest statement. None declared.


1. Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, Iyer VN, et al. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007;450:203–218. [PubMed]
2. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195. [PubMed]
3. Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, et al. FlyBase: enhancing Drosophila gene ontology annotations. Nucleic Acids Res. 2009;37:D555–D559. [PMC free article] [PubMed]
4. Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Kellis M, Lai EC, Lieb JD, MacAlpine DM, et al. Unlocking the secrets of the genome. Nature. 2009;459:927–930. [PMC free article] [PubMed]
5. The modENCODE Consortium. Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010;330:1787–1797. [PMC free article] [PubMed]
6. Berman BP, Pfeiffer BD, Laverty TR, Salzberg SL, Rubin GM, Eisen MB, Celniker SE. Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol. 2004;5:R61. [PMC free article] [PubMed]
7. Kazemian M, Blatti C, Richards A, McCutchan M, Wakabayashi-Ito N, Hammonds AS, Celniker SE, Kumar S, Wolfe SA, Brodsky MH, et al. Quantitative analysis of the Drosophila segmentation regulatory network using pattern generating potentials. PLoS Biol. 2010;8:e1000456. [PMC free article] [PubMed]
8. Sinha S, Schroeder MD, Unnerstall U, Gaul U, Siggia ED. Cross-species comparison significantly improves genome-wide prediction of cis-regulatory modules in Drosophila. BMC Bioinformatics. 2004;5:129. [PMC free article] [PubMed]
9. Davidson EH. The Regulatory Genome: Gene Regulatory Networks in Development and Evolution. 1st edn. Burlington, MA: Academic Press; 2006.
10. Kantorovitz MR, Kazemian M, Kinston S, Miranda-Saavedra D, Zhu Q, Robinson GE, Gottgens B, Halfon MS, Sinha S. Motif-blind, genome-wide discovery of cis-regulatory modules in Drosophila and mouse. Dev. Cell. 2009;17:568–579. [PMC free article] [PubMed]
11. Ivan A, Halfon MS, Sinha S. Computational discovery of cis-regulatory modules in Drosophila without prior knowledge of motifs. Genome Biol. 2008;9:R22. [PMC free article] [PubMed]
12. Zhu LJ, Christensen RG, Kazemian M, Hull CJ, Enuameh MS, Basciotta MD, Brasefield JA, Zhu C, Asriyan Y, Lapointe DS, et al. FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Res. 2011;39:D111–D117. [PMC free article] [PubMed]
13. Gallo SM, Gerrard DT, Miner D, Simich M, Des Soye B, Bergman CM, Halfon MS. REDfly v3.0: toward a comprehensive database of transcriptional regulatory elements in Drosophila. Nucleic Acids Res. 2010;39:D118–D123. [PMC free article] [PubMed]
14. Sinha S, van Nimwegen E, Siggia ED. A probabilistic method to detect regulatory modules. Bioinformatics. 2003;19(Suppl. 1):i292–i301. [PubMed]
15. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, et al. The generic genome browser: a building block for a model organism system database. Genome Res. 2002;12:1599–1610. [PMC free article] [PubMed]
16. Noyes MB, Meng X, Wakabayashi A, Sinha S, Brodsky MH, Wolfe SA. A systematic characterization of factors that regulate Drosophila segmentation via a bacterial one-hybrid system. Nucleic Acids Res. 2008;36:2547–2560. [PMC free article] [PubMed]
17. MacArthur S, Li XY, Li J, Brown JB, Chu HC, Zeng L, Grondona BP, Hechmer A, Simirenko L, Keranen SV, et al. Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions. Genome Biol. 2009;10:R80. [PMC free article] [PubMed]
18. Won KJ, Agarwal S, Shen L, Shoemaker R, Ren B, Wang W. An integrated approach to identifying cis-regulatory modules in the human genome. PLoS ONE. 2009;4:e5501. [PMC free article] [PubMed]
19. Su J, Teichmann SA, Down TA. Assessing computational methods of cis-regulatory module prediction. PLoS Comput. Biol. 2010;6:e1001020. [PMC free article] [PubMed]
20. Siddharthan R. PhyloGibbs-MP: module prediction and discriminative motif-finding by Gibbs sampling. PLoS Comput. Biol. 2008;4:e1000156. [PMC free article] [PubMed]
21. Ochoa-Espinosa A, Yucel G, Kaplan L, Pare A, Pura N, Oberstein A, Papatsenko D, Small S. The role of binding site cluster strength in Bicoid-dependent patterning in Drosophila. Proc. Natl Acad. Sci. USA. 2005;102:4960–4965. [PMC free article] [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press