Copyright © 2009 The Author(s)

MEME Suite: tools for motif discovery and searching

1Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia, 2Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Tokyo, Japan, 3Department of Genome Sciences, University of Washington, Seattle, Washington, 4National Biomedical Computation Resource, University of California, San Diego and 5Department of Computer Science and Engineering, University of Washington, Seattle, Washington, USA

*To whom correspondence should be addressed. Tel: 61 7 3346 2614; Fax: 61 7 3346 2103; Email: t.bailey/at/imb.uq.edu.au

Correspondence may also be addressed to William S. Noble. Tel: +1 206 543 8930; Fax: +1 206 685 7301; Email: william-noble/at/u.washington.edu

Received February 10, 2009; Revised April 10, 2009; Accepted April 21, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The MEME Suite web server provides a unified portal for online discovery and analysis of sequence motifs representing features such as DNA binding sites and protein interaction domains. The popular MEME motif discovery algorithm is now complemented by the GLAM2 algorithm, which allows discovery of motifs containing gaps. Three sequence scanning algorithms (MAST, FIMO and GLAM2SCAN) allow scanning numerous DNA and protein sequence databases for motifs discovered by MEME and GLAM2. Transcription factor motifs (including those discovered using MEME) can be compared with motifs in many popular motif databases using the motif database scanning algorithm Tomtom. Transcription factor motifs can be further analyzed for putative function by association with Gene Ontology (GO) terms using the motif-GO term association tool GOMO. MEME output now contains sequence LOGOS for each discovered motif, as well as buttons to allow motifs to be conveniently submitted to the sequence and motif database scanning algorithms (MAST, FIMO and Tomtom), or to GOMO, for further analysis. GLAM2 output similarly contains buttons for further analysis using GLAM2SCAN and for rerunning GLAM2 with different parameters. All of the motif-based tools are now implemented as web services via Opal. Source code, binaries and a web server are freely available for noncommercial use at http://meme.nbcr.net.

INTRODUCTION

The MEME Suite is a software toolkit with a unified web server interface that enables users to perform four types of motif analysis: motif discovery, motif–motif database searching, motif-sequence database searching and assignment of function. It offers a significantly expanded set of programs for these tasks compared with the earlier web server (1).

Figure 1
MOTIF DISCOVERY

The MEME algorithm (2) has been widely used for the discovery of DNA and protein sequence motifs, and MEME continues to be the starting point for most analyses using the MEME Suite. Detailed protocols describing how to use MEME are available (8).

Some biosequence motifs exhibit insertions and deletions, but MEME cannot discover such motifs, because it does not allow gaps. To overcome this limitation, we have incorporated a recent algorithm for gapped motif discovery, GLAM2 (3), into the MEME Suite. Discovering gapped motifs is intrinsically more difficult than discovering ungapped motifs, because there are vastly more possible gapped motifs than ungapped motifs. Therefore, when trying to discover gapped motifs, we recommend performing a simpler gapless motif analysis as well. GLAM2 uses a particular ‘model’ of gapped motifs, which is illustrated in Figure 2.
GLAM2 reports a score for each motif that it discovers, with higher scores indicating stronger motifs. GLAM2 also reports a score for each site, with higher scores indicating better matches to the overall motif.

Using GLAM2 is similar to using MEME, with only a few differences. Unlike MEME, GLAM2 does not search for multiple distinct motifs. Instead, it performs replicates: it attempts to discover the strongest possible motif 10 times, and displays the results in order of score. If the top few results are similar, this may be regarded as successful replication. If not, GLAM2 can be rerun more thoroughly (but more slowly) by increasing the ‘number of iterations’ parameter.

The gappiness of GLAM2 motifs can be controlled by four pseudocount options. Their relative values control GLAM2's aversion to gaps: increasing the no-deletion pseudocount relative to the deletion pseudocount makes it more averse to deletions, and likewise for the no-insertion and insertion pseudocounts. The absolute pseudocount values control GLAM2's preference for putting gaps together in the same positions: decreasing the deletion and no-deletion pseudocounts makes it more prone to gather deletions into a few columns, and likewise for the (no-)insertion pseudocounts. Note that the pseudocounts affect the score calculation, so scores are not comparable between motifs discovered with different pseudocount settings.

GLAM2 has options to set the maximum and minimum number of aligned columns, similar to MEME's maximum and minimum width options. It also has an option for the initial number of aligned columns: setting this can help it find an appropriate motif. GLAM2 has difficulty adjusting the motif width when there are many sequences, especially if they are short. Note that both protein and DNA motifs are often shorter than the default value (50) used by GLAM2 and MEME for the ‘maximum number of aligned columns’ and ‘maximum width’, respectively.
It is often advisable to reduce those parameters to much smaller values (e.g. in the range 10–20) by entering a new value in the appropriate input box on the web form.

Finally, GLAM2 lets you specify the minimum number of input sequences that must contribute a motif occurrence. This is a generalization of MEME's OOPS (one occurrence per sequence) and ZOOPS (zero or one occurrence per sequence) options. GLAM2 cannot consider more than one occurrence per sequence.

When interpreting GLAM2 output, note that it will always report the best motif it can find, even if you give it random sequences. Thus, it may be wise to rerun GLAM2 on negative control (e.g. shuffled) sequences and compare the resulting scores with the original scores. The GLAM2 input form contains a checkbox (on the lower right-hand side) that will cause the characters in the input sequences to be shuffled before being input to GLAM2.

USING AND ANALYZING MOTIFS

Once you have discovered a collection of motifs, you may wish to perform additional analyses to better characterize those motifs. The MEME Suite provides three types of tools for carrying out such analyses. First, the MEME Suite can compare your DNA motifs to known compendia of motifs (such as JASPAR, Flyreg and DPINTERACT) to see if your motif is similar to a known regulatory motif. This type of analysis is done using Tomtom. Second, the MEME Suite can attempt to determine what types of regulatory functions your motif might be involved in. This assignment is done using the GOMO tool, which determines whether your motif matches the upstream regions of many sequences with similar Gene Ontology (GO) annotations. Third, the MEME Suite can search a sequence database for additional occurrences of your motif.
Comparing DNA motifs with known regulatory motifs

Often, your first question after finding a DNA motif will be, ‘Is this a novel motif?’ Thus, it may be useful to learn whether a motif found by MEME is similar to other motifs, particularly motifs with known biological functions. Tomtom (4) quantifies the similarity between two motifs, and can be used to search a database of known motifs for matches to motifs found by MEME. Tomtom not only provides a numeric score for the match between two motifs, but also provides an estimate of the statistical significance of the score. Currently, Tomtom only supports DNA motifs.

The MEME output for each reported motif contains a button for submitting that motif directly to Tomtom. The Tomtom web application also allows the user to submit a motif by pasting in columns of base counts for each position of the motif. The user then selects the motif similarity measure to use and chooses which online motif database to search. The output of Tomtom includes LOGOS representing the alignment of two motifs, the p-value and q-value [a measure of false discovery rate (10)] of the match, and links back to the parent motif database for more detailed information about the target motif. Sample Tomtom output is shown in Figure 3.
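To give a feel for how a q-value relates to a list of match p-values, the following minimal sketch converts p-values into false discovery rate estimates. It uses the standard Benjamini-Hochberg step-up adjustment as a stand-in; this is an assumption made for illustration, and the estimator actually used by Tomtom (10) may differ. The function name `bh_qvalues` and the example p-values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (an illustrative q-value estimate).

    This is NOT necessarily the estimator used by Tomtom; it is a common,
    simple false discovery rate adjustment used here only as an example.
    """
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical p-values for four candidate motif matches.
example_q = bh_qvalues([0.01, 0.04, 0.03, 0.5])
```

A match would then be reported as significant if its q-value falls below the false discovery rate that you are willing to tolerate.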
GO term analysis for DNA motifs

A second question you may ask is, ‘What is the functional role of this motif?’ The tool GOMO (6) is used to search a species-specific GO annotation database for GO terms that are associated with genes that a given DNA motif regulates. GOMO uses the motif models in the format generated by MEME. GOMO ranks genes by the average binding affinity of the transcription factor to the gene's upstream region and assesses GO terms associated with these genes. Gene sequences and GO annotations are linked via the sequence identifier. This linkage requires a curated dataset; a selection of such datasets is currently available, covering the species best annotated with respect to GO: Escherichia coli, Drosophila, chicken, mouse, Saccharomyces cerevisiae and Schizosaccharomyces pombe. For each motif, GOMO reports the list of GO terms considered significant, in descending order of significance, down to a previously specified threshold. When interpreting GOMO output, note that the GO terms reported always relate to the gene that the transcription factor regulates.

Sequence database search

With a set of interesting motifs in hand, an obvious next step is to look for other occurrences of these motifs. The tools FIMO and MAST are used to search sequence databases for matches to motifs discovered using MEME. The GLAM2SCAN tool is specifically designed for searching with gapped motifs of the type discovered by GLAM2. The MEME server provides web forms for performing analyses with each of these tools. As a convenience, the HTML output of MEME contains buttons for starting FIMO and MAST searches. The MEME web site provides online versions of a number of sequence databases, or users may upload their own sequence data in FASTA format.

‘FIMO’ stands for ‘find individual motif occurrences’. FIMO uses the output of MEME, which may contain multiple, ungapped motifs. FIMO scores the match to each motif at each position in the sequence database.
As the name of the tool suggests, each match is treated independently. The p-value for the match is computed using a dynamic programming procedure (11), and motif-specific q-values with respect to the complete set of matches are computed using a bootstrap procedure (12). The output from FIMO is a list of the matches for which the q-value is less than a user-specified threshold. Sample output from FIMO is shown in Figure 4.
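The dynamic programming idea behind a match p-value can be sketched as follows. The sketch assumes a motif represented as integer log-odds scores per column and a background model with independent letters; it builds the exact distribution of the total score one column at a time and sums the tail. The function name `score_pvalue`, the toy motif and the uniform background are hypothetical, and the procedure used by FIMO (11) may differ in its details.

```python
from collections import defaultdict


def score_pvalue(pwm, background, threshold):
    """Probability that a random site scores at least `threshold`.

    pwm: list of dicts (one per motif column) mapping letter -> integer score.
    background: dict mapping letter -> probability (assumed independent letters).
    """
    # Distribution of the partial score after zero columns: score 0, probability 1.
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        # Extend every reachable partial score by every possible letter.
        for partial, prob in dist.items():
            for letter, bg_prob in background.items():
                nxt[partial + column[letter]] += prob * bg_prob
        dist = nxt
    # Sum the probability mass at or above the observed score.
    return sum(prob for score, prob in dist.items() if score >= threshold)


# Hypothetical two-column motif that prefers "AC".
toy_pwm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
]
uniform = {base: 0.25 for base in "ACGT"}
p_best = score_pvalue(toy_pwm, uniform, 4)  # only "AC" reaches a score of 4
```

Because the number of distinct integer scores grows far more slowly than the number of possible sites, this computation stays cheap even for wide motifs, whereas enumerating every site would not.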
GLAM2SCAN uses the output of GLAM2, which always consists of a single motif, possibly containing gaps. GLAM2SCAN scores the match to this motif at each position in the sequence database. As with FIMO, each match is treated independently, and the output is a list of the best scoring matches. The user can adjust the number of matches reported, up to a limit of 200. Sample output from GLAM2SCAN is shown in Figure 5.
MAST (13) also uses the output of MEME. For each sequence, MAST determines the best match in the sequence to each motif. The scores for these best sequence motif matches are combined into a score for the overall match between the complete motif set and the sequence, resulting in an E-value for each sequence. The output from MAST is a list of the sequences for which the E-value is less than a user-specified threshold. In addition to the list of sequences, the output contains a block diagram showing the relative positions of the best motif matches in the high scoring sequences, and annotated alignments of the best motif matches. The three sections of MAST output are shown in Figure 6.
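One way to picture how several best-match scores become a single E-value is sketched below. It assumes that each motif's best match has already been converted to a p-value, that these p-values are independent, and that they are combined through their product, whose significance under the null hypothesis has a closed form. This is a simplification made for illustration; the exact calculation used by MAST is described in (5) and (13) and may differ. The function names `product_pvalue` and `sequence_evalue` are hypothetical.

```python
import math


def product_pvalue(pvalues):
    """P-value of the product of n independent uniform(0, 1) p-values.

    Uses the closed form  P(product <= x) = x * sum_{i=0}^{n-1} (-ln x)^i / i!
    """
    n = len(pvalues)
    product = math.prod(pvalues)
    if product <= 0.0:
        return 0.0
    log_term = -math.log(product)
    total, term = 0.0, 1.0
    for i in range(n):
        if i > 0:
            term *= log_term / i  # term now holds (-ln x)^i / i!
        total += term
    return min(1.0, product * total)


def sequence_evalue(pvalues, n_sequences):
    """Expected number of database sequences matching at least this well."""
    return product_pvalue(pvalues) * n_sequences


# Hypothetical best-match p-values for two motifs in one sequence,
# searched against a hypothetical database of 1000 sequences.
combined = product_pvalue([0.1, 0.1])
evalue = sequence_evalue([0.1, 0.1], 1000)
```

Note that the combined p-value for two moderately good matches (0.1 each) is about 0.056, not 0.01: the correction accounts for the many ways in which a product of random p-values can be small.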
The choice of motif search tool will depend on the goal of the analysis. MAST is ‘sequence oriented’, computing a single score for each sequence in the database. This makes MAST more suited for analyzing proteins or fixed-length sequences like upstream regions of genes. FIMO and GLAM2SCAN only provide individual motif matches, and can be used to scan genomic databases. Both FIMO and MAST require ungapped motifs, whereas searching with gapped motifs requires the use of GLAM2SCAN.

WEB SERVER AND USER SUPPORT

The MEME Suite web services are hosted by the National Biomedical Computation Resource (NBCR, http://nbcr.net). Since late 2007, we have adopted the Opal web service toolkit (7) to handle the computational and data management aspects of the MEME web server (Figure 7).
Customized user interfaces have been developed for the MEME Suite for enhanced user experience. All clients access the Opal services through the Opal web service application programming interface. When Opal receives a request from a client, it creates a unique working directory, transfers all the input files and dispatches the job to a local batch job scheduler, which schedules the job on an available compute node in a cluster. The adoption of Opal hides the complexity of resource management from scientific programmers, and allows the MEME Suite to take advantage of the distributed grid and the emerging cloud computing environment. The scalable resources made available by Opal allow applications such as MEME to meet growing demand from users. The sequence databases (more than 120 GB to date) are updated automatically every week. The server handles more than 200 user requests per day, and the Opal dashboard provides a real-time usage status update on individual applications (Figure 7).

FUNDING

The authors acknowledge the NBCR award from NCRR, NIH P41 RR08605, for support of the MEME and MAST web site. T.L.B., C.E.G. and W.S.N. acknowledge NIH/NCRR award R01 RR021692 for support of continuing development of the MEME and related sequence analysis tools. Funding for open access charge: National Institutes of Health.

Conflict of interest statement. None declared.

REFERENCES

1. Bailey TL, Williams N, Misleh C, Li WW. MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 2006;34:W369–W373.
2. Bailey TL, Elkan CP. Fitting a mixture model by expectation-maximization to discover motifs in biopolymers. In: Altman R, Brutlag D, Karp P, Lathrop R, Searls D, editors. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press; 1994. pp. 28–36.
3. Frith MC, Saunders NFW, Kobe B, Bailey TL. Discovering sequence motifs with arbitrary insertions and deletions. PLoS Comput. Biol. 2008;4:e1000071.
4. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8:R24.
5. Bailey TL, Gribskov M. Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998;14:48–54.
6. Bodén M, Bailey TL. Associating transcription factor-binding site motifs with target GO terms and target genes. Nucleic Acids Res. 2008;36:4108–4117.
7. Krishnan S, Stearn B, Bhatia K, Baldridge KK, Li WW, Arzberger PA. Opal: simple web services wrappers for scientific applications. IEEE International Conference on Web Services. 2006 Chicago, Ill.
8. Bailey TL. Discovering sequence motifs. Methods Mol. Biol. 2007;395:271–292.
9. MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics. 2006;7:113.
10. Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW. Significance analysis of time course microarray experiments. Proc. Natl Acad. Sci. USA. 2005;102:12837–12842.
11. Staden R. Searching for motifs in nucleic acid sequences. Methods Mol. Biol. 1994;25:93–102.
12. Storey JD. A direct approach to false discovery rates. J. R. Stat. Soc. 2002;64:479–498.
13. Bailey TL, Gribskov M. Score distributions for simultaneous matching to multiple motifs. J. Comput. Biol. 1997;4:45–59.
14. Sanner MF. A component-based software environment for visualizing large macromolecular assemblies. Structure. 2005;13:447–462.
15. Ludaescher B, Altintas I, Berkley C, Higgins D, Jaeger E, Jones M, Lee EA, Tao J, Zhao Y. Scientific workflow management and the Kepler system. Concurrency Comput. Pract. Exp. 2005;18:1039–1065.
16. Hassan M, Brown RD, Varma-O'Brien S, Rogers D. Cheminformatics analysis and learning in a data pipelining environment. Mol. Div. 2006;10:283–299.