• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of bioinformLink to Publisher's site
Bioinformation. 2010; 5(2): 49–51.
Published online Jul 6, 2010.
PMCID: PMC3039987

xFITOM: a generic GUI tool to search for transcription factor binding sites

Abstract

Locating transcription factor binding sites in genomic sequences is a key step in deciphering transcription networks. Currently available software for site search is mostly server-based, limiting the range and flexibility of this type of analysis. xFITOM is a fully customizable program for locating binding sites in genomic sequences written in C++. Through an easy-to-use interface, xFITOM that allows users an unprecedented degree of flexibility in site search. Among other features,it enables users to define motifs by mixing real sites and IUPAC consensus sequences,to search the annotated sequences of unfinished genomes and to choose among 11 different search algorithms.

Availability

xFITOM is available for download at: http://research.umbc.edu/˜erill

Keywords: Transcription factor, binding site, site search, motif prediction, genome analysis

Background

The discovery and analysis of transcriptional regulatory networks is a key step in elucidating the complex regulatory apparatus of living beings [1]. Transcriptional regulation is mediated mainly by transcription factors (TF) that bind DNA and can either hinder (repressors) or promote (activators) the formation of an open complex by the RNA-polymerase holoenzyme [2]. The semi-specific recognition of binding sites by their cognate transcription factors allows implementing computational tools for the discovery and detection of transcription binding motifs and sites [1]. Motif discovery methods focus on the identification of overrepresented patterns in groups of sequences [3]. Conversely, site search algorithms take in a motif description and use pattern matching techniques to search for sites on DNA sequences [4]. Scanning genomic sequences leads to a noisy but informative reconstruction of transcriptional regulatory networks, which can be later validated by in vitro and in vivo methods [5].

Site search techniques can be roughly divided into two categories. Methods based on based on regular expression syntaxes (regex) use IUPAC codes for degenerate bases and allow for complex motif descriptions including gaps, multiple repeats and variable spacers [3]. Their flexibility in motif description comes at the cost of reduced base frequency information and a combinatorial increase in complexity for elaborate patterns [3]. Position- Specific Scoring Matrices (PSSM) approaches implement rigid or semirigid feature locators based on the observed frequency of occurrence of each base at each motif position, as derived from a collection of aligned sites [6]. This matrix representation is commonly referred to as the binding motif or profile. Most PSSM approaches are based on the application of information theory to molecular biology [4, 7], which leads to the widespread representation of binding motifs as sequence logos [8].

Information theory can be applied to the information process that takes place when a transcription factor binds a site. Protein binding leads to a reduction in uncertainty that is formally defined as the difference in entropy at each motif position [4] ( supplementary material for equation). Following this approach, several methods have been devised to assess the fit of a candidate site to a particular binding motif. Nonweighted methods use only the frequency of observed bases at each position in their computation, while weighted methods apply also with information on positional conservation. Weighted methods show improved correlation with experimental binding affinities, but are more prone to false positives in genomic searches [4]. Alternative methods using relative entropy, instead of Shannon entropy, have been devised to take into account biases in genomic base composition [4].

Most software solutions developed to implement genome-wide searches of TF-binding sites using PSSM, like Virtual Footprint, are server-based applications linked to server-based databases [9] and, in some cases, integrated in web-based motif discovery suites [3]. A common problem with most web-based applications in bioinformatics is the limitation of sequence sizes in order to contain server load and/or traffic. This is normally addressed by relying on a server database of sequences, but this often limits severely the number and repertoire of target sequences available for analysis. Some solutions rely also on a limited set of precompiled matrices, while others restrict entry of motif data to either site collections or IUPAC codes. In addition, each solution relies on a single specific method for site search. This prevents users from choosing the method that they believe most appropriate and precludes the integration of results from different methods. xFITOM is a standalone, easy to use and extremely flexible GUI-based application for site search that integrates several PSSM-based methods in a single tool.

Methodology

xFITOM requires a collection of aligned sites and a set of target files to be searched. Using the site collection, xFITOM will compute the information content of the motif. It will also pre-process the target sequence and compute its a priori entropy. The program will then start a sequential search of the genome using a sliding window and one of the 11 available scoring methods. xFITOM uses a threshold, an arbitrary cut-off value or a value relative to the distribution of scores in the collection, to define putative binding sites. These are then sorted by genomic position or score. If gene annotation data is available, sites are classified as “operator”, “intragenic” or “intergenic” based on their location with respect to nearby genes. xFITOM can also apply local complexity factors to detect locally enriched motifs [10]. In this mode, xFITOM computes several mobile score averages and can rescore sites based on three different ratios between local and global score averages.

Implementation

xFITOM has been developed integrally in C++ as a standalone application for Microsoft Windows operating systems using the Microsoft Foundation Classes. The program operates on two main input files: a collection of aligned sites and a collection of target sequences. Aligned sites can be entered in FASTA and raw text format, with degenerate IUPAC characters allowing for an arbitrary degree of flexibility in the definition of the binding motif. Target sequences can be entered in FASTA and GenBank formats, with gene annotation data extracted automatically from the latter. Multiple sequences can be processed directly in FASTA format, while GenBank files can be merged into a compound GenBank file for analysis. This allows analysis of annotated unfinished genomes, giving immediate access to newly released genomic data. All xFITOM parameters can be specified by the user through the GUI (Figure 1). Basic parameters, like the search method, detection threshold and site-to-gene distances are defined in the left panel, while local complexity options are located on the right panel. The program generates comma-separated value output files to allow easy post-processing of results with spreadsheet software. Due to its standalone nature and configurability, xFITOM allows users to carry out useful non-standard analyses, such comparisons among different methods (Figure 1), affinity ranking of collection sites or annotated searches of unfinished genome assemblies (Figure 1).

Figure 1
[Left panel] GUI interface of xFITOM showing the input section and the advanced and local complexity options sections. [Top-right panel] Condensed results from an annotated search of the Pasteurella dagmatis ATCC_43325 unfinished genome sequence using ...

Supplementary material

Data 1:

Acknowledgments

This research was supported partly by the Summer Faculty Fellowship program at UMBC.

Footnotes

Citation:Erill et al, Bioinformation 5(2): 49-51 (2010)

References

1. Babu MM. Biochem Soc Trans. 2008;36(4):758. [PubMed]
2. Ptashne M. Trends Biochem Sci. 2005;30(6):275. [PubMed]
3. Mrazek J. Brief Bioinform. 2009;10(5):525. [PubMed]
4. Erill I, O'Neill MC. BMC Bioinformatics. 2009;10(1):57. [PMC free article] [PubMed]
5. Erill I, et al. Nucleic Acids Res. 2004;32(22):6617. [PMC free article] [PubMed]
6. Hannenhalli S. Bioinformatics. 2008;24(11):1325. [PubMed]
7. Schneider TD, et al. J Mol Biol. 1986;188(3):415. [PubMed]
8. Schneider TD, Schneider RM. Nucleic Acids Res. 1990;18(20):6097. [PMC free article] [PubMed]
9. Munch R, et al. Bioinformatics. 2005;21(22):4187. [PubMed]
10. Halford SE, Marko JF. Nucleic Acids Res. 2004;32(10):3040. [PMC free article] [PubMed]
11. Erill I, et al. Bioinformatics. 2003;19(17):2225. [PubMed]

Articles from Bioinformation are provided here courtesy of Biomedical Informatics Publishing Group

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...