![]() | ![]() |
Formats:
|
||||||||||||
Copyright © 2007 Kato and Tsunoda; licensee BioMed Central Ltd. MotifCombinator: a web-based tool to search for combinations of cis-regulatory motifs 1Laboratory for Medical Informatics, SNP Research Center, RIKEN, Yokohama, Japan Corresponding author.Mamoru Kato: kato/at/src.riken.jp; Tatsuhiko Tsunoda: tsunoda/at/src.riken.jp Received November 1, 2006; Accepted March 22, 2007. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. This article has been cited by other articles in PMC.Abstract Background A combination of multiple types of transcription factors and cis-regulatory elements is often required for gene expression in eukaryotes, and the combinatorial regulation confers specific gene expression to tissues or environments. To reveal the combinatorial regulation, computational methods are developed that efficiently infer combinations of cis-regulatory motifs that are important for gene expression as measured by DNA microarrays. One promising type of computational method is to utilize regression analysis between expression levels and scores of motifs in input sequences. This type takes full advantage of information on expression levels because it does not require that the expression level of each gene be dichotomized according to whether or not it reaches a certain threshold level. However, there is no web-based tool that employs regression methods to systematically search for motif combinations and that practically handles combinations of more than two or three motifs. Results We here introduced MotifCombinator, an online tool with a user-friendly interface, to systematically search for combinations composed of any number of motifs based on regression methods. The tool utilizes well-known regression methods (the multivariate linear regression, the multivariate adaptive regression spline or MARS, and the multivariate logistic regression method) for this purpose, and uses the genetic algorithm to search for combinations composed of any desired number of motifs. The visualization systems in this tool help users to intuitively grasp the process of the combination search, and the backup system allows users to easily stop and restart calculations that are expected to require large computational time. This tool also provides preparatory steps needed for systematic combination search – i.e., selecting single motifs to constitute combinations and cutting out redundant similar motifs based on clustering analysis. Conclusion MotifCombinator helps users to systematically search for motif combinations that play an important role in gene expression as measured by microarrays. Background Gene expression in eukaryotes is controlled by combinatorial regulation of transcription factors and cis-regulatory elements. Many types of transcription factors are bound to their respective regulatory DNA elements, and the interactions between the factors and elements control the gene expression. Molecular experiments can identify several binding sites for selected transcription factors, but they are too laborious and time-consuming to be applied to large-scale studies. Computational methods are thus required for processing genomic data to reveal the combinatorial regulation on a genomic scale. Recently, some computational methods have been developed to detect combinations of patterns (motifs) of cis-regulatory elements. They process data of upstream sequences selected from genomic sequences, as well as data of either expression levels measured by DNA microarrays or binding information given by ChIP-on-chip arrays [1-3]. One widely used type of computational method [4-6] employs genes that are co-expressed at a certain threshold level. Methods of this type first enumerate possible combinations of single motifs and then select significant combinations that specifically appear in upstream sequences of co-expressed genes [1,4,7]. Alternatively, such methods search directly for a pattern of several closely spaced motifs in the upstream sequences [5,6,8,9]. Another type of computational method is based on regression analysis between expression levels and motif scores (occurrence frequencies or weight matrix scores) in input sequences [10-12]. This type calculates a matching score of a motif (or motif combination) along each of the upstream sequences, for which the expression levels of the corresponding genes are measured by microarrays. It then takes regression between the motif scores in the upstream sequences and the expression levels of the corresponding genes to calculate the goodness-of-fit of the regression. This goodness-of-fit is obtained for each of the possible motifs (or motif combinations), and then motifs with the best fit are selected. These procedures are interpretable under the simple assumption that the occurrence frequencies or the weight matrix scores, which approximately correlate with the binding energy of the transcription factors to DNA elements [13], in upstream sequences influence the levels of gene expression [14], and that the scores of genuine motifs or motif combinations must explain much of the variation of expression levels. Well-known regression methods used for this purpose are the (multivariate) linear regression method [10,11] and the multivariate adaptive regression spline (MARS) method [12]. The former method assumes a linear function between the motif scores and expression levels, and the latter method has hockey-stick functions as basis functions, though it does not explicitly assume a particular function because it is a non-parametric method. This type of computational method can take full advantage of the information about expression levels, since it does not compulsively dichotomize expression levels by a threshold into a binary code to indicate whether or not a gene is expressed. An integrated web tool (RgS-Miner [15]) to search for motif combinations has already been developed based on the widely used type of method above (searching for motif combinations specifically appearing in upstream sequences of co-expressed genes); however, there is no integrated web tool that employs regression methods to systematically search for motif combinations. Moreover, RgS-Miner and the previous regression methods practically handle combinations composed of only two or three motifs; however, more than two or three motifs can be involved in combinatorial regulation in higher organisms [16,17]. Hence, this case needs to be addressed. We therefore developed MotifCombinator, a web-based tool that uses regression methods to systematically search for combinations of regulatory motifs. This tool is equipped with two kinds of regression methods, the multivariate linear regression and MARS methods, and it also includes logistic regression, which is a regression method for regulatory motifs that uses co-expressed genes [16]. It also employs the genetic algorithm to search for combinations composed of any desired number of motifs. For systematic combination search, MotifCombinator includes a series of procedures such as determining single motifs that will constitute motif combinations, filtering out redundant single motifs that resemble each other, and calculating the goodness-of-fit for possible combinations. As with the integrated tool RgS-Miner [15], these systematic procedures are realized in interactive multistage pipelines with a simple screen layout that can be easily used by experimental biologists. MotifCombinator will thus help users to find combinations of regulatory motifs that are important for combinatorial regulation. Implementation Multi-step pipelines For the systematic search for motif combinations, MotifCombinator uses a four-step pipeline structure. The four steps are 1) uploading a data set of upstream sequences and gene expressions; 2) determining single motifs that will be used to constitute possible motif-combinations; 3) cutting out redundant or irrelevant single motifs; and 4) searching possible combinations composed of the filtered motifs. We will introduce the framework here; supplemental details on how to use the tool are also provided online. Preparatory steps In the first step, users upload a data set consisting of both upstream sequences and gene expressions. Users can upload such a data set through files or by using a copy-and-paste function. Data of upstream sequences should be formatted in the FASTA format, and IDs of the upstream sequences should be indicated following a greater-than (">") symbol in the format. The gene expression data should consist of two columns, a column of IDs of gene expressions, and a column of expression levels or expression binaries. The expression levels are typically log ratios of expression levels for genes that are measured by microarrays, and will be subsequently processed by the linear regression or MARS. The expression binaries are 1 or 0, which indicate whether or not a gene is expressed, and will be subsequently processed by the logistic regression. After uploading both types of data, MotifCombinator matches up IDs in the sequence data with IDs in the expression data to manage together the both types of data. Users can also use pre-calculated non-homologous upstream sequences in the human and mouse genomes as sequence data. The second step is to determine single motifs that will be used later to constitute candidate motif-combinations. Motifs are represented by position frequency matrices. These single motifs can be determined through motif-finding tools, the JASPAR database [18], and users' stored motifs. Three motif-finding tools based on different strategies are available to find de novo single motifs from input sequences: MEME [19], which mainly utilizes the EM algorithm, AnnSpec [20], which mainly utilizes Gibbs sampling, and MDscan [21], which mainly utilizes the word enumeration strategy. Users can adjust several parameters for these tools on the web screen. The TRANSFAC (and its module TRANSCompel) [22] and JASPAR [18] databases are among well-established databases of transcription factor binding motifs, but currently, only freely-downloadable JASPAR is pre-installed and its motifs can be uploaded with a simple mouse-click. Users can also upload their own motifs. After the upload, users can confirm the matrices of uploaded motifs and can delete them if necessary. The third step (Figure (Figure1A)1A
Combination search At the fourth step (Figure (Figure1B),1B The genetic algorithm (Figure (Figure2)2
The first procedure in the genetic algorithm is initialization. Initial individuals (motif combinations) are randomly generated in the number that a user specifies. More specifically, motifs to constitute a motif combination are uniformly selected from all uploaded motifs at the probability of 0.5 (i.e., each of the uploaded motifs at the probability of 0.5 divided by the number of the uploaded motifs) and are selected from the null motif (NULL) at the probability of 0.5. The second procedure is selection. For each of the motif combinations coded as described above, the matching scores of the motif combination are calculated from input sequences as follows: where i and j denote the module and the sequence, m denotes any of the motifs belonging to the module i, Mm is the probability matrix of the motif m, M0 is the third-order Markov model estimated from all input sequences, and w denotes a string of any of the sliding windows in the sequence j. When a user specifies the option of the number of motifs that constitute modules as one (i.e., no summation over m), this scoring is exactly the same as that used in the multivariate linear regression method [11]. Then, for the motif combination, the scores and input expression levels are regressed, and the goodness-of-fit of the regression is calculated. A user can select one of three regression methods: the multivariate linear regression [10,11], the multivariate adaptive regression spline (MARS) [12], or the multivariate logistic regression method [16]. The goodness-of-fit is Akaike's Information Criterion (AIC) for the linear/logistic regression, or the generalized cross-validation score (GCV) for MARS. By the goodness-of-fit, each motif combination (individual) is evaluated and only motif combinations with the best fit are selected in the number that a user specifies. The third and fourth procedures are crossover and mutation. In the crossover, two different individuals are randomly chosen from the selected individuals with the best fit, and one-point crossover is performed between the codes of the two individuals to reproduce codes of two new individuals at the next generation (e.g., |M1|||M2|M3| × |M4|||M5|M6| → |M1|M5|M6| and |M4|M2|M3|). This operation is repeated until the number of new individuals at the next generation reaches that of the old individuals at the previous generation. At the fourth procedure, mutation is performed on codes of the new individuals. Here, mutation is random replacement of one motif with another at a site on the code of an individual (e.g., |M1|M2|M3| → |M4|M2|M3|). First, sites of motifs to be replaced are randomly selected across all individuals in the number of (the integer of) L*M, where L is the number of motifs in all individuals and M is the mutation rate specified by a user. Then old motifs at the selected sites are replaced with new motifs that are randomly selected as in the initialization step. By this procedure, one iteration loop is closed. The next iteration loop starts with the second procedure, selection, which in turn evaluates the new codes of motif combinations that are updated through crossover and mutation at the previous iteration. These iterations repeat for the number of times specified by the user. On the web screen, users can obtain results about motif combinations with the best fit. The motif combinations listed there are the best ones throughout all iterations. By clicking the FORMULA button, users can see an estimated regression formula, together with the constituent motifs, regression coefficients, evaluation score (AIC or GCV), and contribution rate (the proportion of the variance of input expression levels explained by those expression levels that are predicted by the regression with the scores of a motif combination). By clicking the UMOTIF button, users can see motifs that constitute a motif combination. By referring to the two figures on the screen, users can intuitively grasp how the combination search has proceeded and thereby can get hints to evaluate the results. One figure on the screen shows the history of the best goodness-of-fit values during the combination search, i.e., the best goodness-of-fit value at the selection step versus each of the iterations during the search. If users observe, for example, that the values are steadily falling, in other words, getting better according to the iterations, this suggests that users should not stop the calculation but continue. The other figure on the screen shows a brief landscape of the goodness-of-fit values during the search, i.e., the values of all individuals in all iterations versus a metric whose different values are intended to represent different motif combinations. The metric is the sum of the distances of constituent motifs from the reference motif, of which the length is 10 and the probability matrix is composed of 0.25 for each nucleotide base. The distance is measured by the average Kullback-Leibler divergence between a constituent motif and the reference motif, and is calculated by MatCompare [23] (with the same options as above). If users observe, for example, that a point of the goodness-of-fit value is alone positioned extremely low in the y-axis (meaning good) and isolated far from all other points that are scattered thoroughly across the x-axis, this may suggest that the value is worthwhile and the motif combination with this value may be close to the optimal solution. Finally, our tool has a backup system for re-calculation later, since the combination search takes a long time when the sizes of the parameters are large. Users can dump an archive file needed for re-calculation by simply clicking the BACKUP button. Then, they can shut down the computer. Users can easily restart the calculation by uploading the archive file on the web screen. In summary, the characteristic points at the combination search are as follows. • Users can search for combinations composed of any number of motifs, using the genetic algorithm. • Users can use the three types of regression methods (the multivariate linear regression, MARS, and the multivariate logistic regression) to find motif combinations that well explain the variations of expression levels. • Users can intuitively understand how the search has proceeded by the two visualization systems, which show a history and a brief landscape of the goodness-of-fit scores during the search. • Users can easily store all records by the backup system, and even after shutting down the computer, they can easily restart the search by just uploading the backup file on the web screen. Results and discussion Tests by simulated data Using simulated data sets, we tested whether the tool can correctly recover motif combinations composed of given motifs from many irrelevant motifs. The simulated data sets consisted of simulated upstream sequences and simulated expression levels. We generated the simulated upstream sequences in 2000 bps by the third-order Markov model estimated from the human genome. Into the upstream sequences, we randomly planted modules with motifs neighboring within 100 bps of each other, following the Poisson distribution (with a mean of one and a maximum number of five) for the number of modules in each upstream sequence, and following the probability matrices of motifs to plant strings of the motifs into each upstream sequence. We generated ten different sets of such simulated upstream sequences, into which we respectively planted ten different modules. We generated two types of simulated expression levels. One was generated by the linear combination of the scores of modules as follows: where i and g are the indices of a module and a gene, respectively, x is the score (in eq. 1) of a module, C and k are constants, and ε is a N (0,1) Gaussian noise. C was adjusted for the mean of expression levels across genes to be zero, and k was adjusted for the standard deviation of noises to be a certain percent of the standard deviation of expression levels across genes. We used this type of simulated expression level for the test of the linear regression method. The other simulated expression levels were generated by the two-order interaction terms between different modules in addition to the linear combination [12] as follows: We used this type of simulated expression level for the test of MARS. We added the Gaussian noises at 0%, 20%, 40%, 60%, and 80% of the standard deviation of expression levels across genes. For these data sets (10 sets of modules × 2 types of expression levels × 5 degrees of noises), we tested whether MotifCombinator can correctly recover motif combinations of planted modules from all JASPAR [18] motifs (111 motifs). For all JASPAR motifs, we first performed the step of motif cutting based on the clustering analysis at a cut height of 1. Then at the step of combination search, we set the options as follows: number of individuals = 50, top individuals in selection = 20, mutation = 10%, and iterations = 500. We set module × motif according to the modules used (e.g., when three modules with two motifs were used, we set 3 × 2) and set the regression type to be the linear regression and MARS for the two types of expression levels in eq. 2 and eq. 3, respectively. Using these settings, we executed the search and obtained the results. For each data set, we took up the top ten motif combinations (by the UMOTIF button) that were evaluated and recovered by the tool. We calculated how well the tool recovered a motif combination, based on the hypergeometric test: where K is the number of all motifs, k is the number of motifs (say, true motifs) that constitute a motif combination we planted, T is the number of all motifs in a motif combination recovered by the tool, and t is the number of true motifs included in the motif combination recovered by the tool. We calculated the P value for each of the top ten motif combinations in each data set, and kept two types of P values: the best P value among the top ten and the (log) averaged P value across the top ten in each data set. Then we averaged both P values across all data sets. Table 1 shows that, in the case of the averaged P values across the top ten ("Average"), the values were always under 0.01 (1% significance level) in all noise and regression types, though the values got worse according to the increase of noise. The best P values ("Best") were always under roughly 10-4 (< 2.0 × 10-4). The tool significantly recovered motif combinations that were exactly or approximately the same as those we planted.
Tests by muscle-specific transcripts We tested the tool using real data sets, which consisted of upstream sequences and expression levels of muscle-specific transcripts. We obtained skeletal muscle-specific transcripts and their expression levels from a data file (Additional data file 5) included in a study of an extensive microarray survey of gene expression in normal human tissues [24]. That study measured the expression levels in two tissues related to skeletal muscle: abdominal muscle and right calf muscle. For the upstream sequences, we used genome sequences in the human NCBI build 35 and obtained sequences 2000-bps upstream (without overlapping other transcripts) from the muscle-specific transcripts. A set of 51 muscle-specific transcripts was uploaded into the tool. We also prepared position weight matrices of motifs (MEF2, MYF, SP1, SRF, and TEF) known to be involved in muscle-specific expression [16] using the JASPAR database [18], and uploaded them. We tested whether the tool could correctly select motif combinations composed of these motifs from among all JASPAR motifs (111 motifs). We first performed the step of motif cutting based on the clustering analysis at a cut height of 1.5. Then, in the step of combination search, we set the options as follows: number of individuals = 50, top individuals in selection = 20, mutation = 10%, and iterations = 500. We set module × motif to 1 × 5 and regression type to MARS. Among a dozen of the top motif combinations that were evaluated and selected by the tool, three (MEF2, MYF, and SRF) were found out of the five muscle-related motifs in the abdominal-muscle data set and three (MEF2, SP1, and SRF) were found in the right calf-muscle data set. For each motif combination, we evaluated the results by the hypergeometric test described above. Table 2 shows that, in the abdominal-muscle data set, the best P value was under 0.01 and the combinations with the best P value included MEF2 and SRF, and MYF and SRF. In the right calf-muscle data set, the best P value was roughly 10-4 (1.7 × 10-4) and the combination with the best P value included MEF2, SP1, and SRF.
Conclusion Sequencing of genomic DNA deepens our understanding of DNA codes of protein products on genomic sequences in organisms; however, the codes that control gene expression or protein products are not well understood because of their complexity. Such complexity arises partly from the combinatorial regulation of cis-regulatory elements – many short DNA segments are combined to confer the ability to express genes specifically to tissues or environments. To understand the combinatorial regulation, it is necessary to develop a computational tool that efficiently searches possible combinations of patterns (motifs) of cis-regulatory elements to find combinations important for gene expression measured by, for example, microarrays. We developed MotifCombinator, a web-based tool that searches for combinations composed of any desired number of motifs using the genetic algorithm and that employs well-known regression methods to find motif combinations that well explain the variations of expression levels directly – without dichotomizing expression levels into the "expressed" or "non-expressed" categories. This tool also has two visualization systems for intuitive evaluation of the search and has a backup system for continuing a long calculation. Convenient preparatory steps (selecting single motifs as "seeds" for combinations, cutting out redundant motifs) are also implemented for systematic search of the combinations. Using simulated data sets and muscle-specific transcripts, we tested the tool and found that the tool indeed recovered appropriate combinations of motifs for these data. Recently, information on known motif combinations has been increasing, as in TRANSCompel [22], which is a database of two (or more) closely-located binding sites on DNA sequences. This information can in principle be integrated into a system that handles combinations of any number of position frequency matrices, such as our tool, and the integration would increase the accuracy and convenience of the system. In conclusion, MotifCombinator will help users to efficiently search for motif combinations that are important for gene expression measured in genome-wide experiments. Availability and requirements Project name: MotifCombinator Project home page: http://emu.src.riken.jp/combinator Operating system(s): Platform independent (web-based, tested on Mozilla Firefox 1.5 and Internet Explorer 6.0) Any restrictions to use by non-academics: None Abbreviations - MARS: the multivariate adaptive regression spline - AIC: Akaike's Information Criterion - GCV: the generalized cross-validation score - eq.: equation Authors' contributions MK planned the research, performed the analyses, and wrote the manuscript. TT reviewed the manuscript. All authors read and approved the final manuscript. Acknowledgements We thank Isao Kurosaka for his help with writing the codes, and Takashi Morizono, Hiroto Kawakami, and Emi Okutsu for their help in testing the tool. This work was supported by JSPS.KAKENHI (17710167). References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||
Genome Biol. 2004; 5(8):R56.
[Genome Biol. 2004]Proc Natl Acad Sci U S A. 2005 Feb 8; 102(6):1998-2003.
[Proc Natl Acad Sci U S A. 2005]Bioinformatics. 2004 Aug 12; 20(12):1974-6.
[Bioinformatics. 2004]Proc Natl Acad Sci U S A. 2004 Aug 17; 101(33):12114-9.
[Proc Natl Acad Sci U S A. 2004]Genome Res. 2002 Nov; 12(11):1723-31.
[Genome Res. 2002]Nucleic Acids Res. 2003 Jul 1; 31(13):3487-90.
[Nucleic Acids Res. 2003]Proc Natl Acad Sci U S A. 2004 Nov 16; 101(46):16234-9.
[Proc Natl Acad Sci U S A. 2004]Bioinformatics. 2000 Jan; 16(1):16-23.
[Bioinformatics. 2000]Nat Genet. 2001 Feb; 27(2):167-71.
[Nat Genet. 2001]Proc Natl Acad Sci U S A. 2003 Mar 18; 100(6):3339-44.
[Proc Natl Acad Sci U S A. 2003]Nucleic Acids Res. 2004; 32(6):1948-56.
[Nucleic Acids Res. 2004]J Mol Biol. 1998 Apr 24; 278(1):167-81.
[J Mol Biol. 1998]Proc Natl Acad Sci U S A. 2002 Jan 22; 99(2):757-62.
[Proc Natl Acad Sci U S A. 2002]J Mol Biol. 1998 Apr 24; 278(1):167-81.
[J Mol Biol. 1998]Nucleic Acids Res. 2004; 32(6):1948-56.
[Nucleic Acids Res. 2004]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D91-4.
[Nucleic Acids Res. 2004]Proc Int Conf Intell Syst Mol Biol. 1994; 2():28-36.
[Proc Int Conf Intell Syst Mol Biol. 1994]Pac Symp Biocomput. 2000; ():467-78.
[Pac Symp Biocomput. 2000]Nat Biotechnol. 2002 Aug; 20(8):835-9.
[Nat Biotechnol. 2002]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D108-10.
[Nucleic Acids Res. 2006]Bioinformatics. 2005 Feb 1; 21(3):307-13.
[Bioinformatics. 2005]Proc Natl Acad Sci U S A. 2003 Mar 18; 100(6):3339-44.
[Proc Natl Acad Sci U S A. 2003]Nucleic Acids Res. 2003 Jul 1; 31(13):3487-90.
[Nucleic Acids Res. 2003]Proc Natl Acad Sci U S A. 2004 Nov 16; 101(46):16234-9.
[Proc Natl Acad Sci U S A. 2004]J Mol Biol. 1998 Apr 24; 278(1):167-81.
[J Mol Biol. 1998]Bioinformatics. 2005 Feb 1; 21(3):307-13.
[Bioinformatics. 2005]Proc Natl Acad Sci U S A. 2004 Nov 16; 101(46):16234-9.
[Proc Natl Acad Sci U S A. 2004]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D91-4.
[Nucleic Acids Res. 2004]Genome Biol. 2005; 6(3):R22.
[Genome Biol. 2005]J Mol Biol. 1998 Apr 24; 278(1):167-81.
[J Mol Biol. 1998]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D91-4.
[Nucleic Acids Res. 2004]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D108-10.
[Nucleic Acids Res. 2006]