Bioinformatics. Feb 1, 2010; 26(3): 405–407.
Published online Dec 10, 2009. doi:  10.1093/bioinformatics/btp681
PMCID: PMC2815662

Tmod: toolbox of motif discovery

Abstract

Summary: Motif discovery is an important topic in computational transcriptional regulation studies. In the past decade, many researchers have contributed to the field and many de novo motif-finding tools have been developed, each with its own strengths. However, most of these tools do not have a user-friendly interface and their results are not easily comparable. We present a software package called Toolbox of Motif Discovery (Tmod) for Windows operating systems. The current version of Tmod integrates 12 widely used motif discovery programs: MDscan, BioProspector, AlignACE, Gibbs Motif Sampler, MEME, CONSENSUS, MotifRegressor, GLAM, MotifSampler, SeSiMCMC, Weeder and YMF. Tmod provides a unified interface to ease the use of these programs and to help users understand the tuning parameters. It allows plug-in motif-finding programs to run either separately or in a batch mode with predetermined parameters, and provides a summary comprising outputs from multiple programs. Tmod is developed in C++ with the support of Microsoft Foundation Classes and Cygwin. Tmod can also be easily expanded to include future algorithms.

Availability: Tmod is available for download at http://www.fas.harvard.edu/~junliu/Tmod/

Contact: xhwei65@nudt.edu.cn; jliu@stat.harvard.edu

1 INTRODUCTION

Conserved sequence patterns shared by multiple protein or nucleic acid sequences often provide important clues about the function of these linear biopolymers (Lawrence et al., 1993; Stormo and Hartzell, 1989). For DNA sequences, conserved sequence patterns, which are often located upstream of co-expressed genes, are likely to be transcription factor binding sites (Conlon et al., 2003). For proteins, short sequence patterns conserved across evolutionarily distant proteins may correspond to important functional domains (Liu et al., 1995). Identification of these overrepresented sequence patterns, often called motifs, can be an important first step toward understanding gene regulatory networks and protein functions. Because direct experimental determination of these motifs is neither cost-effective nor practical in many biological systems, computational motif discovery tools have become indispensable for many biological studies involving gene regulation.

Many algorithms and software tools have been proposed to tackle the motif-finding problem. AlignACE (Roth et al., 1998), BioProspector (Liu et al., 2001), MotifSampler (Thijs et al., 2001), SeSiMCMC (Favorov et al., 2005) and Gibbs Motif Sampler (Liu et al., 1995) use Gibbs sampling strategies for motif discovery based on Bayesian statistical models. GLAM (Frith et al., 2004) uses a simulated annealing approach to optimize the alignment of multiple sequences, and performs more efficiently for mode finding than the standard Gibbs sampler. MEME (Bailey and Elkan, 1994) applies the EM algorithm, instead of Gibbs sampling, to find the maximum likelihood motif estimate based on a model similar to that used by the Gibbs Motif Sampler. CONSENSUS (Hertz and Stormo, 1999) employs a greedy algorithm for optimizing the motif information content, which is asymptotically equivalent to finding the maximum a posteriori motif alignment.
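To make the notion of motif information content concrete, the following minimal Python sketch computes the relative entropy (in bits) of a set of aligned sites against a background distribution. It is an illustration only and not the scoring code of CONSENSUS or any other program named above: the function name `information_content`, the uniform default background and the absence of pseudocounts or small-sample corrections are all assumptions made for this example.

```python
import math
from collections import Counter

def information_content(sites, background=None):
    """Total information content (bits) of equal-length aligned sites.

    Sums, over alignment columns, f * log2(f / q) for each observed base,
    where f is the column frequency and q the background frequency.
    Assumption: no pseudocounts or small-sample correction are applied.
    """
    if background is None:
        # Assumed default: uniform DNA background.
        background = {base: 0.25 for base in "ACGT"}
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = 0.0
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        for base, n in counts.items():
            f = n / len(sites)
            total += f * math.log2(f / background[base])
    return total
```

A perfectly conserved column contributes 2 bits under a uniform background, whereas a column that matches the background contributes none, so an alignment with higher total information content is more sharply distinguished from background sequence.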

Although the aforementioned model-based motif-finding methods have met with great success, none of these algorithms is guaranteed to find the optimal solution (such as the posterior mode). Thus, several alternative algorithms based on word enumeration and other heuristics have been developed. YMF (Sinha and Tompa, 2003) uses a simpler motif model and a heuristic enumeration algorithm guaranteed to find the motifs with the highest Z-scores. Weeder (Pavesi et al., 2001) uses an extended exhaustive enumeration for both short sequence signals and longer patterns. MDscan (Liu et al., 2002) finds motif candidates by combining a word enumeration technique applied to high-confidence sequences with a Gibbs sampling refinement. MotifRegressor (Conlon et al., 2003) starts from motif candidates predicted by MDscan, and uses gene expression and/or ChIP-on-chip enrichment information to help select functionally important motifs.
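The idea behind enumeration-based scoring can be sketched in a few lines of Python. The example below counts every k-mer in a set of sequences and ranks it by a Z-score against a deliberately simple background in which bases are independent and drawn at their observed frequencies. This is a simplified stand-in rather than the statistic used by YMF or Weeder: the function name `kmer_zscores`, the independent-base background and the binomial variance (which ignores overlapping occurrences of a word) are assumptions made for illustration.

```python
import math
from collections import Counter

def kmer_zscores(sequences, k):
    """Z-score of each observed k-mer under an independent-base background.

    Assumptions (illustrative only): base frequencies are estimated from
    the input itself, and the count of a word is treated as binomial,
    so correlations between overlapping occurrences are ignored.
    """
    base_counts = Counter(base for seq in sequences for base in seq)
    total_bases = sum(base_counts.values())
    freq = {base: n / total_bases for base, n in base_counts.items()}

    # Number of positions at which a k-mer can start.
    positions = sum(max(len(seq) - k + 1, 0) for seq in sequences)
    observed = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )

    scores = {}
    for word, n in observed.items():
        p = math.prod(freq[base] for base in word)
        expected = positions * p
        variance = positions * p * (1 - p)
        scores[word] = (n - expected) / math.sqrt(variance) if variance > 0 else 0.0
    return scores
```

Sorting the returned dictionary by value in descending order gives the most overrepresented words first. Because every word is scored, the top-ranked word is optimal with respect to this particular score, which is the sense in which enumeration methods can offer a guarantee that sampling methods cannot.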

When applied to a particular sequence dataset, the algorithms will most likely produce different motif results because of their different underlying mechanisms or tuning parameters, and each of these programs can perform better than the others in some particular cases. Furthermore, motifs found by different algorithms may have distinct characteristics. Presenting all of these different results in a single software platform is therefore very helpful for further comparison, assessment and selection. Jensen and Liu (2004) proposed BioOptimizer to combine and compare the motif-finding results using a score function based on a Bayesian model. BioOptimizer can handle situations in which binding site abundance and motif width are unknown. It also deals with two-block motifs with variable-length gaps. We thus include BioOptimizer in Tmod to aid the user in analyzing the motif-finding results.

The majority of the motif-finding programs were written to run on Linux operating systems. Because many researchers in related fields use Windows operating systems, we developed Tmod, a Windows-based integrated software platform, to make these motif-finding programs easier for biologists to use.

2 THE Tmod SYSTEM

Our software tool Tmod employs a two-level software system architecture (Fig. 1). The first level is a human–machine interface, which contains the main software interface and several parameter input interfaces for the motif discovery programs at the next level. The second level comprises the 12 motif-finding programs. These command-line-based programs are supported by Cygwin (http://cygwin.com/), which is a Linux-like environment for Windows. MDscan, BioProspector, SeSiMCMC, MotifSampler, Weeder and Gibbs Motif Sampler provide Cygwin versions of their programs. We downloaded the source code of CONSENSUS, AlignACE, MEME, GLAM and YMF, and compiled them in the Cygwin environment. All these Cygwin versions run well on the Windows platform with the Cygwin dynamic link library cygwin1.dll. For MotifRegressor, we converted its main Perl script into an executable program, rewrote the stepwise regression part in Matlab and compiled the M file into an executable file. After these modifications, MotifRegressor runs well on the Windows platform without S-PLUS.

Fig. 1.
Software architecture of the Tmod system. Names displayed at the second level, such as AlignACE, BioProspector, etc., correspond to different motif discovery programs. Each program receives input sequences and parameters through the human–machine ...

After users enter parameters for any of the 12 programs and click the ‘Run’ menu, Tmod records these parameters in a batch file and then starts a new thread to run the saved batch file. In practice, users may want to run a certain program with different parameter combinations. Entering the parameters one by one is tedious and time-consuming. Tmod therefore provides a batch interface so that users can easily write commands with several sets of parameters in a batch file. Tmod then runs the users’ batch file and reports the multiple results automatically.
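The batch workflow described above can be sketched as follows. This Python example is not Tmod's C++ implementation; it only illustrates the pattern of expanding several parameter sets into command lines, saving them to a batch file and running that file in a separate thread. The program name `motif_finder`, the flags `-i`, `-w` and `-n`, and all function names are hypothetical placeholders, not the options of any program integrated in Tmod. The sketch also assumes POSIX-style command lines (as in a Cygwin shell) and Python 3.8 or later for `shlex.join`.

```python
import itertools
import shlex
import subprocess
import threading

def expand_commands(program, fixed_args, grid):
    """Return one command line per combination of the values in grid.

    grid maps a (hypothetical) flag name to the list of values to try.
    """
    names = list(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        args = [program, *fixed_args]
        for name, value in zip(names, values):
            args += [name, str(value)]
        commands.append(shlex.join(args))
    return commands

def write_batch(path, commands):
    """Save the command lines to a batch file, one command per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(commands) + "\n")

def run_batch(path, results):
    """Run each line of the batch file in turn, recording its exit code."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                done = subprocess.run(shlex.split(line), capture_output=True, text=True)
                results.append((line, done.returncode))

def run_batch_in_thread(path):
    """Start a worker thread for the batch file so the caller is not blocked."""
    results = []
    worker = threading.Thread(target=run_batch, args=(path, results))
    worker.start()
    return worker, results
```

For example, `expand_commands("motif_finder", ["-i", "seqs.fa"], {"-w": [6, 8], "-n": [1, 2]})` yields four command lines, one for each width and count combination. After `write_batch` has saved them, `run_batch_in_thread` returns immediately, and calling `join()` on the returned worker waits for all runs to finish.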

As all the motif discovery programs have their own format of output files, it is very hard for users to interpret and compare the results from multiple programs. Tmod provides a unified summary file to list the consensus sequence of all the motifs predicted by different programs. This summary can be further used for motif comparison and optimization. BioOptimizer is also included in Tmod for further motif comparison and optimization.
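A unified summary of this kind can be illustrated with a short Python sketch. It assumes that the aligned binding sites of each predicted motif have already been parsed from every program's output, a step that is specific to each program and is not shown. The majority-base consensus rule, the alphabetical tie-breaking, the tab-separated layout and the function names are assumptions made for this example and do not describe the actual format of the Tmod summary file.

```python
from collections import Counter

def consensus(sites):
    """Majority-base consensus of equal-length aligned sites.

    Assumption: ties are broken alphabetically; degenerate (IUPAC)
    symbols are not used.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    letters = []
    for col in range(width):
        counts = Counter(site[col].upper() for site in sites)
        best = max(counts.values())
        letters.append(min(base for base, n in counts.items() if n == best))
    return "".join(letters)

def summary_table(results):
    """Build a tab-separated summary of motifs from several programs.

    results maps a program name to a list of motifs, each motif being
    a list of aligned sites.
    """
    lines = ["program\tmotif\tsites\tconsensus"]
    for program in sorted(results):
        for index, sites in enumerate(results[program], start=1):
            lines.append(f"{program}\t{index}\t{len(sites)}\t{consensus(sites)}")
    return "\n".join(lines)
```

Placing the consensus sequences from all programs in one table makes it easy to see at a glance which programs agree on a motif and which report something different.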

All the motif discovery programs included in Tmod handle their own exceptions. We ported these programs to the Cygwin platform while retaining their exception-handling abilities. Tmod itself does not provide any additional exception handling. All the parameters passed to the motif-finding programs are in the form of strings, so Tmod does not generate new exceptions of its own.

Tmod adopts an open software architecture, so that it can be easily maintained and expanded. No plug-in motif-finding program is affected by the other programs, because each motif-finding program runs in its own independent thread. Besides the 12 motif-finding programs integrated in Tmod, many other motif-finding programs exist. If they are written in C/C++ for the Linux platform, they can be ported to the Windows platform and added to Tmod easily.

3 IMPLEMENTATION

Tmod was implemented with Microsoft Foundation Classes. The Visual C++ 6.0 development environment was used to build the Windows application, and Cygwin Release 1.5.21-1 was used to compile the source code of the motif discovery programs. The motif discovery programs integrated in the current version of Tmod are as follows: MDscan, BioProspector, MEME 3.5.4, Gibbs Motif Sampler, AlignACE 4.0, CONSENSUS V6C, GLAM2, MotifSampler 3.1.1, Weeder 1.3, YMF 3.0, MotifRegressor (rewritten) and SeSiMCMC 4.35. Tmod also provides a platform to compare and assess the results of the motif discovery programs. The main interface and a typical summary output of Tmod are shown in Fig. 2 to give users a first impression of the software.

Fig. 2.
(A) The main interface of Tmod. (B) The summary file of consensus sequences parsed from the output files of each program. (C) The summary file of motif information parsed from BioOptimizer sum files.

As a demonstration, we applied Tmod to the cyclic AMP receptor protein binding sequence dataset (Stormo and Hartzell, 1989). We set the motif width to a range of 6–10 bp, and left all other parameters of the programs at their defaults. The results are available at the Tmod homepage.

ACKNOWLEDGEMENTS

We thank the developers of each program for allowing us to integrate their current versions in Tmod. We also thank Roee Gutman for helpful suggestions and his test of Tmod.

Funding: NIH (GM-078990 to J.S.L.).

Conflict of Interest: none declared.

REFERENCES

  • Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1994;2:28–36.
  • Conlon EM, et al. Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. 2003;100:3339–3344.
  • Favorov AV, et al. A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length. Bioinformatics. 2005;21:2240–2245.
  • Frith MC, et al. Finding functional sequence elements by multiple local alignment. Nucleic Acids Res. 2004;32:189–200.
  • Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999;15:563–577.
  • Jensen ST, Liu JS. BioOptimizer: a Bayesian scoring function approach to motif discovery. Bioinformatics. 2004;20:1557–1564.
  • Lawrence CE, et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993;262:208–214.
  • Liu JS, et al. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc. 1995;90:1156–1170.
  • Liu XS, et al. Bioprospector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Proc. Pac. Symp. Bioinfor. 2001;6:127–138.
  • Liu XS, et al. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 2002;20:835–839.
  • Pavesi G, et al. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics. 2001;17:S207–S214.
  • Roth FP, et al. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 1998;16:939–945.
  • Sinha S, Tompa M. YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2003;31:3586–3588.
  • Stormo GD, Hartzell GW. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. 1989;86:1183–1187.
  • Thijs G, et al. A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics. 2001;17:1113–1122.

Articles from Bioinformatics are provided here courtesy of Oxford University Press