Nucleic Acids Res. Jan 2010; 38(Database issue): D443–D447.
Published online Nov 1, 2009. doi:  10.1093/nar/gkp910
PMCID: PMC2808907

FlyTF: improved annotation and enhanced functionality of the Drosophila transcription factor database

Abstract

FlyTF (http://www.flytf.org) is a database of computationally predicted and/or experimentally verified site-specific transcription factors (TFs) in the fruit fly Drosophila melanogaster. The manual classification of TFs in the initial version of FlyTF that concentrated primarily on the DNA-binding characteristics of the proteins has now been extended to a more fine-grained annotation of both DNA binding and regulatory properties in the new release. Furthermore, experimental evidence from the literature was classified into a defined vocabulary, and in collaboration with FlyBase, translated into Gene Ontology (GO) annotation. While our GO annotations will also be available through FlyBase as they will be incorporated into the genes’ official GO annotation in the future, the entire evidence used for classification including computational predictions and quotes from the literature can be accessed through FlyTF. The FlyTF website now builds upon the InterMine framework, which provides experimental and computational biologists with powerful search and filter functionality, list management tools and access to genomic information associated with the TFs.

INTRODUCTION

Site-specific transcription factors (TFs) are proteins that bind to specific DNA sequences or DNA conformations, and that confer regulatory information to the basal transcription machinery. While they play a key role in gene regulation in general, TFs are of special interest to developmental biologists as their presence at cis-regulatory elements in the genome determines important developmental decisions in processes such as axis formation and morphogenesis (1). It may therefore seem surprising that almost a decade after the availability of the Drosophila melanogaster genome (2) there is still no definitive answer as to the number of site-specific TFs, let alone a comprehensive list of TFs from an authoritative community resource like FlyBase (3).

FlyTF (http://www.flytf.org) has stepped in to fill this gap by integrating both computationally predicted and experimentally verified TFs. The first version of FlyTF (4) provided information about the curation of 1052 candidate TFs [selected for the presence of a canonical DNA-binding domain using the pipeline of the DBD transcription factor database (5) or a set of suitable Gene Ontology terms (6)], and yielded a repertoire of 753 site-specific fly TFs, about two-thirds of which were called with a high degree of confidence. The website has had ~4000 visitors since publication, with the majority of users bulk-downloading our annotations.

IMPROVED ANNOTATIONS

The initial release of FlyTF was based on D. melanogaster release 3.1 gene annotations, and manual curation was based on GO annotations and literature published by December 2005. The candidate proteins were primarily assessed for their capability to bind to DNA (yes/maybe/no) and confer a regulatory function. While a strict set of rules was applied for the DNA-binding property, all regulatory proteins ranging from canonical site-specific TFs to insulators and those involved in chromatin-mediated maintenance of transcription were treated alike. This was identified as a major limitation in computational studies that focussed on classical TFs. Furthermore, gene annotations in D. melanogaster are currently in their 5.19 release, meaning that many novel or modified gene models were not present in the initial dataset.

We have addressed these shortcomings in the current release of FlyTF. First, we generated a novel candidate gene list by incorporating the initial FlyTF gene set, DBD searches on the FlyBase release 5.8 gene annotations (all translations), and GO searches with a set of TF-related GO terms. This yielded a non-redundant set of 1162 candidate TFs. Two human curators (one general curator at FlyTF, one GO curator at FlyBase) assessed this list, taking all experimental evidence published by December 2008 into account.
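The merging of the three sources into a non-redundant candidate list can be pictured as a set union over gene identifiers. The following is a minimal sketch under that assumption; the function name and the identifiers are hypothetical placeholders, not the actual FlyTF inputs or pipeline.

```python
# Hypothetical sketch: merge several candidate gene lists into one
# non-redundant list. All identifiers are made-up placeholders.

def merge_candidates(*sources):
    """Return a sorted, non-redundant list of gene identifiers."""
    merged = set()
    for source in sources:
        # Strip stray whitespace and skip empty entries before merging.
        merged.update(g.strip() for g in source if g.strip())
    return sorted(merged)


initial_flytf = ["GENE_A", "GENE_B", "GENE_C"]   # placeholder: initial gene set
dbd_hits = ["GENE_B", "GENE_D"]                  # placeholder: DBD search hits
go_hits = ["GENE_C", "GENE_D", "GENE_E "]        # placeholder: GO search hits

candidates = merge_candidates(initial_flytf, dbd_hits, go_hits)
print(len(candidates), candidates)
```

Genes found by more than one route appear only once in the merged list, which is the property the curators need before assessing each candidate.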

Each candidate TF was characterised for both its DNA-binding and its regulatory characteristics. A verdict for DNA-binding can now be

  • ‘yes’ (clear evidence for sequence-specific binding),
  • ‘yes’ (homolog) (property experimentally shown for a homolog),
  • ‘yes’ (DNA binding, no sequence-specificity determined),
  • ‘yes’ (heterodimer) (if the factor alone is not capable of binding DNA),
  • ‘maybe’ (no evidence, or no convincing evidence, found) and
  • ‘no’ (experimental evidence against DNA-binding).

As in the previous version, where available, quotations from the literature were extracted along with an associated PubMed ID. To allow users a more fine-grained selection of evidence, experiments regarding the DNA-binding characteristics of the proteins were categorised into eight different groups of varying quality, each of which can now be queried or filtered for at FlyTF (Table 1). While a DNA-binding protein in the original version automatically became a bona fide TF if the DBD pipeline identified a domain frequently found in TFs, we now provide a more detailed categorisation of the regulatory property of the candidate protein. A verdict for this can be

  • ‘yes’ (a true site-specific TF),
  • ‘yes’ (heterodimer) (as before, but only as a heterodimer),
  • ‘maybe’ (if a canonical DNA-binding domain was found, but no experimental evidence) or
  • ‘no’ (not a site-specific TF).

Table 1.
Experimental procedures accepted to confirm DNA-binding property of candidate proteins in FlyTF, and GO terms assigned on their basis (as IDA)

The ‘maybe’ and ‘no’ categories are frequently associated with free text, describing further characteristics where the information was easily accessible. Useful information in this context could be, for example, ‘chromatin-remodelling’, ‘TBP-associated factor (TAF)’, ‘inhibitor’ or ‘insulator’. This verdict is supported by quotations from the literature as well as a discrete categorisation of the experimental evidence, which can be used for user-defined queries (Table 2).
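For readers who process the annotations programmatically, the two verdict vocabularies and the free-text notes can be represented as a small data model. The sketch below is an assumption about how one might do this; the class and field names are hypothetical, the gene names are placeholders, and none of it reflects the internal FlyTF schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class DnaBinding(Enum):
    # Verdict values as listed in the text for the DNA-binding property.
    YES = "yes"
    YES_HOMOLOG = "yes (homolog)"
    YES_NO_SPECIFICITY = "yes (DNA binding, no sequence-specificity determined)"
    YES_HETERODIMER = "yes (heterodimer)"
    MAYBE = "maybe"
    NO = "no"


class Regulatory(Enum):
    # Verdict values as listed in the text for the regulatory property.
    YES = "yes"
    YES_HETERODIMER = "yes (heterodimer)"
    MAYBE = "maybe"
    NO = "no"


@dataclass
class Verdict:
    gene: str                       # placeholder identifier
    dna_binding: DnaBinding
    regulatory: Regulatory
    notes: List[str] = field(default_factory=list)  # free text, e.g. 'insulator'


def is_site_specific_tf(verdict):
    """True for the two 'yes' regulatory verdicts (alone or as a heterodimer)."""
    return verdict.regulatory in (Regulatory.YES, Regulatory.YES_HETERODIMER)


records = [
    Verdict("GENE_A", DnaBinding.YES, Regulatory.YES),
    Verdict("GENE_B", DnaBinding.YES_HETERODIMER, Regulatory.YES_HETERODIMER),
    Verdict("GENE_C", DnaBinding.MAYBE, Regulatory.NO, ["insulator"]),
]
accepted = [v.gene for v in records if is_site_specific_tf(v)]
print(accepted)
```

Keeping the two properties as separate fields mirrors the separation of DNA-binding and regulatory verdicts described above, so a protein such as an insulator can be recorded without being counted as a site-specific TF.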

Table 2.
Experimental procedures accepted to confirm transcriptional regulatory property of candidate proteins, and evidence codes in support of GO terms dealing with regulatory function

Ultimately, in collaboration with FlyBase, all supporting experimental data were translated into GO annotation, combining the expertise of the FlyTF and FlyBase curators (the rules for the translation of experimental evidence into GO terms can be found in Tables 1 and 2). At the same time, each candidate TF received a final score based on its DNA-binding domain, and the experimental evidence found for both DNA-binding and transcriptional regulatory function (Table 3).

Table 3.
FlyTF score based on computational predictions (DBD) and novel GO annotation (based on experimental data)

ENHANCED FUNCTIONALITY AND ACCESSIBILITY

The initial FlyTF website was a collection of static HTML pages and a few dynamically generated lists. A search tool to find individual genes or all TFs with a certain DNA-binding domain was the only means of user interaction. However, most visitors chose to download our annotations in bulk. We suspect this is because traditional Drosophila geneticists often prefer to retrieve information about ‘their favourite gene’ directly from FlyBase, the authoritative community resource. Also, researchers utilising genomics or computational methods are likely to query large batches of identifiers, and their analysis is often based on further list operations, neither of which was catered for by FlyTF.

We assessed a variety of options to allow non-specialist users easy access to our annotations and at the same time provide computational biologists with some basic analysis tools. The FlyTF database is now based on the InterMine framework (http://www.intermine.org), the backbone of biological data warehouses such as FlyMine (7) or modMine (8). This now enables different usage scenarios, which we will illustrate below.

The simplest scenario is the search for a single gene of interest. The query form accepts any identifier for a given TF (gene name, symbol, unique ID or even rarely used synonyms) and displays general gene information as well as our transcription factor annotations (Figure 1A).

Figure 1.
Screenshots from the FlyTF web site. (A) Transcription factor summary information for gene hunchback. The left panel provides basic gene information and serves as a starting point for the retrieval of DNA or protein sequences. The right panel focuses ...

A novel feature is of special interest for users with a genome-wide perspective: it is possible to upload extensive gene lists, from which the genes encoding TFs will be recognised and marked, and can be saved for further analysis on the website. This enables, for example, the one-step identification and characterisation of TFs contained in candidate gene lists from genomics experiments. Analysis tools available at the FlyTF website comprise ‘widgets’ to report GO term enrichment or over-representation of certain structural domains (Figure 1B). It is noteworthy that some of these statistics are calculated in a transcription factor background, which may be helpful in the determination of differences between sets of TFs (rather than comparing TFs against the entire genome). Users can also choose to register at the FlyTF website, and store and compare their TF lists at a later stage.
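To illustrate why the choice of background matters, the sketch below computes an over-representation p-value with a one-sided hypergeometric test. This is an assumption made for illustration: the text does not state which statistical test the widgets use, and all counts are toy numbers rather than FlyTF data.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): probability of seeing at least k annotated genes in a list
    of n genes drawn from a background of N genes, K of which are annotated."""
    upper = min(n, K)
    total = comb(N, n)
    # comb() returns 0 for impossible draws, so no extra bounds checks are needed.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy numbers: a user list of 4 genes, 3 of which carry some annotation.
k, n, K = 3, 4, 5

p_tf_background = hypergeom_pvalue(k, n, K, N=20)       # small TF-only background
p_genome_background = hypergeom_pvalue(k, n, K, N=100)  # larger genome-wide background

print(p_tf_background, p_genome_background)
```

With the same list and the same annotated genes, the p-value is larger against the smaller TF-only background than against the larger background. An annotation that looks strongly enriched relative to the whole genome may therefore be unremarkable among TFs, which is the distinction the TF background is meant to expose.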

A third usage scenario addresses the needs of the computational biologists. Lists of TFs fulfilling specific criteria can easily be created using the FlyTF QueryBuilder (Figure 1C), and customisable output formats allow the swift integration of FlyTF in many bioinformatics workflows. For example, it is possible to search for all TFs (i) that contain zinc finger domains, (ii) for which a position weight matrix is known and (iii) whose transcriptional regulatory function was shown in a reporter assay in the fly. In this case, only one gene (hunchback) fulfils these criteria. The gene’s genomic coordinates can be exported in GFF3 format and the translations are available in FASTA format. It should be mentioned that through the customisable output generator, it is possible to export the entire FlyTF dataset as one tab-delimited file.
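Once the dataset has been exported as a tab-delimited file, the same kind of three-part query can be reproduced offline. The sketch below assumes a hypothetical column layout (`symbol`, `domains`, `has_pwm`, `regulatory_evidence`) and uses made-up rows; the real export columns depend on the output options chosen on the website.

```python
import csv
import io

# Made-up sample standing in for a tab-delimited export; the column names
# and values are hypothetical and do not reflect the actual FlyTF file.
SAMPLE_EXPORT = (
    "symbol\tdomains\thas_pwm\tregulatory_evidence\n"
    "GENE_A\tzinc finger\tyes\treporter assay\n"
    "GENE_B\thomeobox\tyes\treporter assay\n"
    "GENE_C\tzinc finger\tno\treporter assay\n"
    "GENE_D\tzinc finger\tyes\tmutant phenotype\n"
)


def select_tfs(handle):
    """Return symbols of rows meeting all three criteria from the text:
    zinc finger domain, known position weight matrix, reporter assay."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["symbol"]
        for row in reader
        if "zinc finger" in row["domains"].lower()
        and row["has_pwm"] == "yes"
        and "reporter assay" in row["regulatory_evidence"].lower()
    ]


hits = select_tfs(io.StringIO(SAMPLE_EXPORT))
print(hits)
```

To run this against a downloaded file, pass an open file handle instead of the `io.StringIO` wrapper and adjust the column names to match the chosen export.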

FUTURE DIRECTIONS

The comparative sequencing and genome annotation of closely related Drosophila species (9) has provided the community with the gene repertoires of a dozen flies. Experimental data for individual genes of these non-D. melanogaster flies is still sparse, yet researchers interested in their TFs can use FlyTF as a starting point to identify homologous proteins using the built-in orthology mapping.

The next generation of InterMine-based databases will enable researchers to share gene lists and analysis tools across species and data mines, and we are looking forward to assisting TF researchers in other model organisms with our dataset.

COLLABORATION BETWEEN TWO COMMUNITY RESOURCES

FlyTF and FlyBase both deal with the functional annotation of fly genes, and have pooled resources for this work. While FlyTF focuses on manual curation and only on TF genes, FlyBase is the community resource for all things Drosophila. Although the information content of each database is distinct, both use GO terms for functional annotation and a key aim of this project was to improve GO annotation consistency between these databases, based on both computational predictions and experimental evidence using the combined expertise of the TF specialists at FlyTF and the FlyBase GO curator. We believe our collaboration can be a model for many ‘niche’ databases that are maintained on a sporadic basis, which can benefit from both the experience and the resources of an established community portal.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

FlyBase grant (National Human Genome Research Institute at the US NIH P41 HG000739 to FlyBase); EPSRC MPhil studentship (to D.P.J.); Medical Research Council (to S.T. and D.W.); Royal Society University Research Fellowship (to B.A.). Funding for open access charge: The Royal Society.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

The authors thank Nick Brown and Steven Marygold at FlyBase for enabling UP to participate in this collaboration and for comments on the manuscript. They also wish to thank Goran Nenadic and Casey Bergman for the provision of computationally marked-up TF literature (10,11), and Richard Smith, Julie Sullivan and Gos Micklem for helpful comments and technical assistance in the customization of the InterMine system.

REFERENCES

1. Levine M, Davidson E. Gene regulatory networks for development. Proc. Natl Acad. Sci. USA. 2005;102:4936–4942.
2. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195.
3. Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res. 2009;37:D555–D559.
4. Adryan B, Teichmann SA. FlyTF: a systematic review of site-specific transcription factors in the fruit fly Drosophila melanogaster. Bioinformatics. 2006;22:1532–1533.
5. Wilson D, Charoensawan V, Kummerfeld SK, Teichmann SA. DBD–taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res. 2008;36:D88–D92.
6. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29.
7. Lyne R, Smith R, Rutherford K, Wakeling M, Varley A, Guillier F, Janssens H, Ji W, Mclaren P, North P, et al. FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biol. 2007;8:R129.
8. Celniker S, Dillon L, Gerstein M, Gunsalus K, Henikoff S, Karpen G, Kellis M, Lai E, Lieb J, MacAlpine D, et al. Unlocking the secrets of the genome. Nature. 2009;459:927–930.
9. Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007;450:203–218.
10. Yang H, Nenadic G, Keane JA. Identification of transcription factor contexts in literature using machine learning approaches. BMC Bioinformatics. 2008;9(Suppl. 3):S11.
11. Yang H, Keane J, Bergman CM, Nenadic G. Assigning roles to protein mentions: the case of transcription factors. J. Biomed. Inform. 2009;42:887–894.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press