![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2006 Zhang and Zaki; licensee BioMed Central Ltd. SMOTIF: efficient structured pattern and profile motif search 1Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York 12180, USA Corresponding author.Yongqiang Zhang: zhangy0/at/cs.rpi.edu; Mohammed J Zaki: zaki/at/cs.rpi.edu Received May 21, 2006; Accepted November 21, 2006. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background A structured motif allows variable length gaps between several components, where each component is a simple motif, which allows either no gaps or only fixed length gaps. The motif can either be represented as a pattern or a profile (also called positional weight matrix). We propose an efficient algorithm, called SMOTIF, to solve the structured motif search problem, i.e., given one or more sequences and a structured motif, SMOTIF searches the sequences for all occurrences of the motif. Potential applications include searching for long terminal repeat (LTR) retrotransposons and composite regulatory binding sites in DNA sequences. Results SMOTIF can search for both pattern and profile motifs, and it is efficient in terms of both time and space; it outperforms SMARTFINDER, a state-of-the-art algorithm for structured motif search. Experimental results show that SMOTIF is about 7 times faster and consumes 100 times less memory than SMARTFINDER. It can effectively search for LTR retrotransposons and is well suited to searching for motifs with long range gaps. It is also successful in finding potential composite transcription factor binding sites. Conclusion SMOTIF is a useful and efficient tool in searching for structured pattern and profile motifs. The algorithm is available as open-source at: http://www.cs.rpi.edu/~zaki/software/sMotif/. Background Searching biological sequence(s) for motifs is a fundamental task in bioinformatics. Motifs can be represented as either patterns over a specific alphabet, or profiles (also called positional weight matrix (PWM)), which give the probability of observing each symbol in each position. Motifs can be classified into two main types. If no variable gaps are allowed in the motif, it is called a simple motif. For example, in the genome of Saccharomyces cerevisiae, the binding sites of transcription factor, GAL4 [1], can be characterized by the simple motif shown in Table 1, which illustrates the pattern over the IUPAC alphabet (ΣIUPAC; see Table 2), as well as its profile (which gives the frequency of each DNA base at each position). The motif in Table 1 only consists of one component and thus is a simple motif. Since the symbols in the first 3 positions (CGS) and in the last 3 positions (SCG) are well conserved, we can also represent this motif as CGS[11,11]SCG, where [11,11] means that there is a fixed "gap" of length 11 between the two components. If variable gaps are allowed in a motif, it is called a structured motif. A structured motif can be regarded as an ordered collection of simple motifs with gap constraints between each pair of adjacent simple motifs. For example, the LTR retrotransposons from the Copia group, corresponding to genes encoding reverse transcriptase, in A. thaliana can be characterized by the structured motif M1 [2,5] M2 [6,7] M3, as shown in Table 3[2]. Here M1, M2 and M3 are three simple motifs; [2,5] and [6,7] are variable gap constraints ([minimum gap, maximum gap]) allowed between the adjacent simple motifs. Note that each simple motif Mi (with 1 ≤ i ≤ 3) can either be a pattern over ΣIUPAC or a profile over ΣDNA. Searching for structured motifs is more complicated than searching for simple motifs, and is an ongoing research area [3-7]. The sequence to be searched can be very long, e.g., chromosome 1 of Homo Sapiens contains 245 million (245M) base pairs. The structured motif can also be as long as several kilobases. All these factors need to be considered when designing an efficient structured motif search algorithm.
More formally, a structured motif [1, k - 1], the length of the gap between Mi and Mi+1 is given as gi = ei+1 - si - 1, and we require that gi [li, ui]. We use |Mi| to denote the number of symbols/positions in component Mi, also called the length of the component, and we use ΣDNA at position j, where j [1, |Given a collection of sequences,
Depending on the application, the structured motif search problem can have several variations: • Missing Components: The matching motifs can consist of some, instead of all, the simple motifs in • Approximate Matches: The matching motifs may consist of similar motifs (as measured by Hamming or Levenshtein distance [8]), instead of exact matches, to the simple motifs in • Overlapping Components: The variable gap constraints (li and ui) can take on a limited range of negative values, allowing search for overlapping simple motifs. We allow two adjacent components Mi and Mi+1 to overlap, but we require that Mi+1 does not precede Mi. This condition can be satisfied by the following constraints on the gap range [li, ui]: - |Mi| ≤ li ≤ ui, for i [1, k). For example the search for motif ACG[-2,2]CGA, can discover the overlapped occurrence ACGA, as well as the non-overlapped occurrence ACG- -CGA, at the two extremes of the gap range.• Profile Search: The components of the motif In this paper, we focus on the problem of searching for a given structured motif in one or more sequences. We propose SMOTIF, an efficient algorithm for structured motif searches. It uses an inverted index of symbol positions, and it finds all occurrences by positional joins over this index. For structured pattern search problem, we propose two main variants of our approach: i) a direct search for simple motifs and the structured motif via positional joins, and ii) a two-step approach, where we use a suffix tree to search for simple motifs and then use positional joins for the structured motif. For structured profile search problem, we first search each simple motif by aligning its profile with the sequences, and then search structured motifs with positional joins. SMOTIF allows missing components, overlapping motifs, and also approximate matches (when using the two-step approach). SMOTIF also allows flexible matches using IUPAC symbols. We apply SMOTIF for searching long DNA sequences for LTR retrotransposons, which constitute a substantial fraction of most Eukaryotic genomes and are believed to have a significant impact on genome structure and function [9,10]. We show that SMOTIF is effective in searching for composite regulatory patterns, and it can also suggest potentially new binding sites. We experimentally demonstrate the superiority of SMOTIF over SMARTFINDER, a state-of-the-art method for structured motif search, both in terms of time and space; SMOTIF can be up to 7 times faster and can consume 100 times less space. Related work Many existing pattern matching algorithms [8,11-18] can be used to solve the simple pattern search problem. Given the sequence length, n, and the pattern length, m, exact matching algorithm can run in O (n + m) [11]; approximate matching algorithm can run in O (rn) [16,17] or O (nm/w) [18], where r is the error threshold and w is the size of a computer word. The space complexity is O (n) for both exact matching and approximate matching. Several previous efforts have focused on the structured pattern search problem. Anrep [3,4] provides a unified biosequence pattern representation by using network expressions with spacers, where a network expression is a regular expression without Kleene closure. With network expressions, one can specify the scoring scheme and the threshold of approximate matching for each simple motif separately, the positional weights which express the relative importance of different parts of a motif, and whether a simple motif is optional. Anrep introduces a two-step approach: first it searches for simple motifs by a threshold-sensitive motif matching algorithm and then it finds the structured motif by an optimized backtracking matching algorithm. However, as compared by [6], Anrep is much slower than SMARTFINDER. In [5], the structured motif is called a Classes of Characters and Bounded Gaps (CBG) expression and is represented by a non-deterministic ε-automaton with bit parallelism. Two algorithms are proposed for CBG expression search: forward search and backward search. Bit parallelism speeds up the search, but is adequate only for a pattern whose maximum span is smaller than the length of the computer word. Also the implementation of CBG can only handle such pattern. This limits the application of CBG to searching for patterns with small number of symbols and gaps. SMARTFINDER [6,7], which is currently the most efficient method for structured motif search, is also a two-step approach. In the first step each simple motif is searched separately by building a suffix tree for the sequence. This step outputs the ordered occurrence lists of all simple motifs. The second step solves a constraint satisfaction problem by considering constraints individually in three sub-steps. First it considers the gap constraints and builds a constraint graph whose nodes are the simple motif occurrences and edges connect all possible pairs of nodes that locally satisfy the gap constraints. It then considers the constraint for the maximum number of missing components and shrinks the graph to contain only the nodes that can be in the structured motif occurrences. Finally it enumerates all the valid occurrences by a depth first search (DFS). Notable differences in SMOTIF and SMARTFINDER are as follows: we search patterns directly by positional joins over an inverted index, we consider variable gap constraints during the positional joins as opposed to building a constraint graph, and we handle missing components more efficiently by considering them over patterns instead of over each occurrence as in SMARTFINDER. Note also that like Anrep and SMARTFINDER, SMOTIF can also mine approximate patterns, when using the two-step approach, which we describe later. For profile search, MATCH [19], P-Match [20], and MatInspector [21,22] search DNA sequences against a position weight matrix library (such as TRANSFAC database [23]) and report the occurrences that satisfy given score thresholds. They compute the matrix score by multiplying the base frequency with the information content value at each position, in order to emphasize the fact that mismatches at less conserved positions are more easily tolerated than mismatches at highly conserved positions. Besides the matrix score, they define a core region, which is usually the first 4–5 most conserved consecutive positions of the matrix, and perform the core score threshold check. Then they align the matrix to each position of the sequence and calculate the core score and matrix score. However, these algorithm don't consider the prior probability of each base when calculating the matrix (or core) score, and the core region is required to be consecutive. They need to check all positions of each subsequence (at least all the core positions) in order to calculate the matrix (core) score. Moreover, these algorithms only work on simple profile with one single matrix component. For structured profile search, only Anrep [3,4] provides the capability to model structured profiles, with its general network expressions. However, Anrep doesn't give a solution on score calculation and fast search for structured profiles. Moreover, its implementation doesn't support structured profile search. To our knowledge, SMOTIF is the only implemented method that can handle structured profile search. Methods We first introduce our basic approach for structured pattern search, and successively optimize it for various practical scenarios. We then present our approach for structured profile search. Structured pattern search: basic approach Let us assume that we are searching for a structured motif ΣIUPAC, its pos-list is given as [1, |S|]}, where S [i] is the symbol at position i in S, and |S| denotes the length of S. When S is obvious, we drop it, and denote the pos-list as ΣDNA are given in Table 5.
Depending on whether we compute the pos-lists for IUPAC symbols or not, SMOTIF uses two approaches: (a) DNA pos-lists: Here we keep (in memory) the pos-lists only for the four DNA symbols. For the other IUPAC symbols, we obtain their pos-lists by taking a union over the pos-lists of their constituting DNA symbols, e.g., Positional joins We first extend the notion of pos-lists to cover structured motifs. The pos-list of ΣIUPAC be any two symbols, and let
• d <l: Advance y to the next element in • d > u: Advance x to the next element in • l ≤ d ≤ u: Save this occurrence in The pos-list for X [l, u] Y can be computed in time linear in the lengths of [0, 1]. So we store x = 8 in Given a longer motif SMOTIF handles both simple and structured motifs uniformly, by adding the gap range [0, 0] between adjacent symbols within each simple motif Mi. For our example in Table 4, the structured motif Figure Figure22
Full position recovery In our positional join approach, to save time and space we retain only the motif start positions, however, in some applications, we may need to know the full position of each occurrence, i.e., the set of matching positions for each symbol in the motif. We describe two approaches to recover the full positions: recomputed or indexed full-position recovery. Recomputed full position recovery For recomputing the full positions SMOTIF needs access to only the sequence S and the post-list [s + 1, s + L - 1] in sequence S, we consider each candidate prefix F [l, u]. If yes, S [i] is a valid occurrence of
As an example, let's assume we want to recover the full position for the motif
Indexed full position recovery Rather than recomputing the positions, we can "index" some information during the positional joins in order to facilitate full position recovery. For each suffix of [li, ui]. Note that [l1, u1] we set F [2] =
Consider the example shown in Figure Figure66 [0,0]. To recover the full position for our example motif, from start position 5, we follow index 1 to get position 6 in the next pos-list, to obtain
Sequence segmentation The SMOTIF approach as described above works well for searching a motif in a relatively short sequence. For a very long sequence S (e.g., searching for (LTR) retrotransposons in an entire chromosome) the pos-lists can get very long in the initial stages, consuming a lot of memory. SMOTIF handles a long sequence by splitting it into several segments and searches each segment separately for the structured motif. That is, the sequence S is split into p equal partitions (except for the last one). Handling each smaller segment Si (i [l, p]) instead of the original S can save a lot of space and also reduces the total search time. After segmentation, to avoid missing any occurrence, we require that each partition Si, with i [l, p - 1], include the first L - 1 symbols from partition Si+1. Finally, to avoid duplicate occurrences, we discard all occurrences with a start position in the overlap region, since it would be reported when we process segment Si+1. For example, let S be the sequence in Table 6, and let the structured motif be
So far we have assumed that we are searching for the structured motif in a single sequence. SMOTIF can easily handle a collection Missing components In some applications a partial match of the structured motif might still be of interest. SMOTIF allows up to q simple motif components to be missing during the search. Let [k - q, k]. Next, the gap ranges are adjusted in each sub-motif to account for skipping over the missing components. The new gap range, [li,j, ui,j], between components Mi and Mj (with 1 ≤ i <j ≤ k) in a sub-motif, is calculated as follows: For example, if we allow one (q = 1) missing component for our structured motif in Table 4, the set of sub-motifs that need to be searched for are: GC[0,1]TTA[1,4]CAT, GC[1,8]CAT, GC[0,1]TTA and TTA[1,4]CAT. Note that it is straightforward to incorporate other approaches to compute new ranges into SMOTIF since it would only change the gap constraints. For example, li,j = minn [i,j-1] {ln} and ui,j = maxn [i,j-1] {un} is another possible way to compute the adjusted gap ranges.Instead of searching each sub-motif separately, we do an optimized search. We reuse the partial pos-lists created when using a depth first search to enumerate and search the sub-motifs. The idea is to re-use the pos-lists created for common suffixes when enumerating their sub-motif extensions. Two-step approach for structured pattern search So far we have described the direct method used by SMOTIF to search for the structured motif by positional joins over the symbols. In fact, SMOTIF, can also follow a two-step approach like in Anrep [4] and SMARTFINDER [6]. In the first step, given Mk as tail. Since Exact matching Many algorithms [11-14] exist for exact pattern matching. Like in SMARTFINDER we use a lazy suffix tree [11] to extract the pos-lists for all simple motifs. The matching occurrences are sorted after extracting them from the suffix tree to obtain the pos-list in sorted order. For an intermediate pattern Mi [li, ui] [0, 1].
Approximate matching Several algorithms [8, 12, 15–18] exist for approximate pattern matching. For consistency, we used Sellers' dynamic programming algorithm [26], as implemented in SMARTFINDER, to extract the pos-lists for all simple motifs with approximate matches. This algorithm is not optimal and it can be replaced by more efficient ones [16-18]. Since we allow a specific Levenshtein distance [8] (i.e., insertions, deletions and substitutions) between the occurrences and the motif, the length of the occurrences can be different from the component length |Mi|. Thus we augment the pos-list to explicitly store the end position, in addition to the start position, for each occurrence. Figure 7(b) [0,1] and d = 8 - 6 - 1 = 1 [0, 1].Structured profile search Having outlined our approach for structured pattern search, here we tackle the problem of structured profile search. The profile (also called a position weight matrix) for a structured motif ΣDNA occurs at position j. Note that, as in the case of pattern motifs, for a structured profile motif Given a profile structured motif Weighted profile creation Given the initial "raw" frequency profile for the structured motif where ΣDNA at position j in the structured motif ΣDNA; Whereas the weights computed above give the likelihood of observing a given symbol in a given position they do not account for the degree to which some symbols are conserved at some positions. We can adjust the weights by considering the information content at each position. The information content for a profile is given as: where Given the initial profile motif Figure Figure88 ΣDNA in each position j (
Profile scoring Given the weighted profile [0,1]. To ensure that the profile score lies between 0 and 1, we have to normalize it. Let Note that whereas Equations 4(a) and 4(b) are strictly in the range [0, 1], for 4(c) When applying the score threshold for an occurrence S', we require that its normalized score is above the threshold λ. For example, for Equation 4(b), In other words, instead of normalizing the score for each match, we take λn as the new normalized threshold for scoring the potential matches. Likewise we can get the new thresholds for Equations 4(a) and 4(c). For example, for 4(a) the normalized threshold would be λn = λ · Partial scores For profile matching problem, we are only given the score threshold λ for the whole structured motif. We here develop a method to compute a partial score threshold for any sub-profile, which can lead to great pruning efficiency. For a sub-profile In another word, the minimum threshold for any sub-profile Core scores In many biologically relevant motifs, some positions are more conserved than others. We call them core positions, and these are precisely those positions with high information content. We choose the top h (usually 4 to 6) positions in the profile with highest information content as the core positions. For any potential match S', we can compute its core score Motif enumeration Simple motif scoring Given a profile motif To prune matches S' that will eventually not meet the score threshold, we check the score threshold as each position in S' is being considered. If the score for any prefix of S' falls below the score threshold, we can discard S'. In fact, before applying scores over all positions, we first consider the scores for the prefixes of the core positions within the component. This continuous check for the core positions and regular positions leads to very effective pruning. Note that as opposed to previous methods [19-22], our approach does not require the core positions be consecutive so as to find the most conserved parts in the profile. Figure Figure99
Structured motif scoring and positional joins After obtaining the pos-lists of simple motif components, we can enumerate structured motifs by doing positional joins on these pos-lists, as already outlined for structured pattern search. We compute the positional joins with Mi as the head and Mi+1 Mk as the tail, as we start from Mk as the head and end at M1 as the head. During the positional joins we also check the partial structured score thresholds λn (Mi Mk). If the check fails at any stage we prune the match candidate. We keep a pattern (along with its score) only if it satisfies the full structured score threshold λn.Figure 9(a) Full position recovery Once the pos-list for the profile The complete SMOTIF algorithm: complexity analysis The pseudo-code for the complete SMOTIF algorithm is shown in Figure Figure10.10 [k - q, k] simple motifs, which in the worst case is T = 2k - 1 (when q = k - 1). Since the positional joins take linear time in the length of the pos-lists (the length is O (n) in the worst case), the time for pos-list joins is O (|
Results and discussion Performance comparison We have implemented the SMOTIF algorithm in standard C++. We compare our results with SMARTFINDER, the best previous algorithm for structured motif search. For fair comparison, the suffix tree used in SMOTIF-2 for finding the pos-list of simple motifs is the same as the one used in SMARTFINDER, namely the lazy mode suffix tree [11], and we apply Sellers' dynamic programming algorithm [26] for approximate matching. The programs were compiled with g++ v3.2.2 at the optimization level 3 (-O3). Unless indicated otherwise, we did the experiments on an Apple G5 with dual 2.7Ghz processors and 4GB memory running Mac OS X; the timings reported are total times for all steps of the algorithms; all our experiments use exact matching for the simple motifs, and we report the full position for the occurrences. SMOTIF: parameter settings In the first set of experiments we used the 5 chromosomes of Arabidopsis thaliana as our sequence set. These five chromosomes have lengths 29M, 19M, 22.7M, 16.9M and 25.7M base pairs, respectively. The structured motif
Figure Figure1111
Figure 11(c) Figure 11(e) Figure 11(f) SMOTIF and SMARTFINDER: comparison Comparison on A. thaliana Figure 12(a)
To directly evaluate SMARTFINDER's constraint graph approach with our positional join approach, in Figure 12(c) We also multiply aligned 36 A. thaliana LTR retrotransposons from Repbase Update [2] database, which belong to Copia repeat class, and have reverse transcriptase as the keyword. We then extracted four conserved features from the alignment results, shown as [1, 4]) in Table 7. We searched chromosome 1 of A. thaliana for the four extracted motifs using SMARTFINDER, SMOTIF-1 and SMOTIF-2, respectively. Table 8 shows the number of occurrences |
Searching motifs with long gaps One application of SMOTIF is to find structured motifs with long gaps. For example, we extracted the motif DNNNNDRYW [2578, 4202]RNNGVHVY, from the same 36 LTRs of A. thaliana mentioned above. Here we retain only the first and last conserved components in the resulting alignment of the 36 sequences. Note the relatively long gap ranges, with minimum gap l = 2578 and maximum gap u = 4202. SMOTIF was able to extract the approximately 83 million full positions, corresponding to 3 million start positions, in just 35.75s (SMOTIF-1) and 15.72s (SMOTIF-2). Comparison on random motifs Here we used chromosome 20 of Homo Sapiens as the sequence; it has length 61M base pairs. We generate 100 random structured motifs in the ΣIUPAC alphabet, with k [3,8] simple motifs of length l [5,10] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random sub-interval of [-5, 100]. Note that the negative minimum gap shows that SMOTIF can mine overlapping simple motifs. Here we also compare the structured profile search approach SMOTIF-P, as follows: for each random motif, we form a profile by first expanding the IUPAC symbols into their corresponding DNA symbols and assign them a random probability of occurrence which accounts for 90% of the share, whereas the other DNA symbols randomly share in the remaining 10%. We use SMOTIF-P to search for the profiles with 2 as the number of core positions in each simple motif, λc = 0.1 as the core score threshold and λ = 0.6 as the total score threshold. We use random motifs mainly to demonstrate the effects of various parameters on SMOTIF and SMARTFINDER.Figure 13(a)–(d)
Table 9 shows the mean and variance of the search times over all the 100 structured motifs. It also shows the time for finding only the start positions or the full positions for SMOTIF. Overall, for these random motifs, we find that on average SMOTIF-1 is the fastest, SMOTIF-2 and SMOTIF-P are comparable, and all three outperform SMARTFINDER by a factor of 4 to 6.
It is interesting to note that SMOTIF is more stable than SMARTFINDER: SMOTIF-P has around 8 times less variance than SMARTFINDER, SMOTIF-2 has around 5 times less variance than SMARTFINDER, whereas SMOTIF-1 has around 17 times less variance than SMARTFINDER. Note also that the overhead in recovering the full positions from the start positions is negligible. Application: composite regulatory patterns The complex transcriptional regulatory network in Eukaryotic organisms usually requires interactions of multiple transcription factors. A potential application of SMOTIF is to search for such composite regulatory binding sites in DNA sequences. We took two such transcription factors, URS1H and UASH, that are known to cooperatively regulate 11 yeast genes [28]. These 11 genes are also listed in SCPD [1], the promoter database of Saccharomyces cerevisiae. In 10 of those genes the URS1H binding site appears downstream from UASH; in the remaining one (HOP1) the binding sites are reversed. We took the binding sites for the 10 genes, and after their multiple alignment, we obtained the composite motif NNDTBNGDWGDNNDH[5,179]WBRGCSGCYVW, where we represent each column in the alignment with the IUPAC symbol corresponding to the bases that appear at that position. We also extracted the profile for these 10 binding sites. Table 10 shows the binding sites for the 10 genes, their alignment, and the start positions and the distances between the sites (the difference of start positions). The smallest distance is 20 and the largest is 194. Since these are start positions, the variable gap range is obtained by subtracting the length (15) of UASH to obtain l = 20 – 15 = 5 and u = 194 – 15 = 179. Notice also how the alignment preserves the highly conserved GCSGC region in URS1H.
We then searched for the structured motif in the upstream regions of all 5873 genes in the yeast genome. We used the -800 to -1 upstream regions, and truncated the segment if it overlaps with an upstream open reading frame (ORF). As a result, 5794 sequences with average length of 497 bases are left. By searching for the IUPAC pattern, we found 65 occurrences, including the 10 originally known sites, within 1 second. By searching for the profile with 5 as the number of core positions in each simple motif, λc = 0.6 as the core score threshold and λ = 0.8 as the total score threshold, we found 56 occurrences in 0.18 seconds. For each occurrence, we then extracted its actual sequence segment in the matching upstream regions. Since the structured motif represented by IUPAC symbols may be too general, for each matching segment, we calculated its hamming distance to one of the 10 known binding sites. We then selected an occurrence as a possibly new binding site if the minimum hamming distance to any of the 10 known sites is within a given maximum threshold value. Table 11 lists the 12 newly found occurrences in the entire yeast genome (upstream regions) using a hamming distance threshold of 5. The sites discovered using both pattern and profile search are listed. These new occurrences could be putative binding sites for the two transcription factors UASH and URS1H.
Upon further analysis, we found that in fact, the new occurrence in MEK1 (at positions -233,-136) that we found is also listed in the SCPD database as a binding site. SCPD lists one site for UASH at position -233, and two sites for URS1H at positions -136 and -150. To construct the motif, we had used -150 as the site for URS1H, without knowledge of the other site. SMOTIF was thus able to automatically find the other site based on the extracted motif! For REC114, we also found another occurrence at positions -158 (UASH) and -94 (URS1H). However, this is not reported in SCPD. To further analyze the remaining new occurrences, we consulted the SGD (Saccharomyces Genome Database) Gene Ontology Term Finder [29] to find the inter-relationships between the genes. The first three rows of Table 12 show the significant GO terms (biological process or molecular function) that are common to the genes corresponding to a new occurrence and its closest (known) gene. The rest of the table shows the significant terms among the 18 genes. These results indicate that at least some of the new occurrences (such as SPO1, HSP60, MES1, and GNT1) have a potential to be binding sites since they share some significant processes with the known sites' genes. Out of these SPO1 has the highest potential to be a new binding site, since it is known that UASH and URS1H are involved in early meiotic expression, during sporulation [28]. Table 12 shows that SPO1 shares meiosis and M phase of meiotic cell cycle with the rest of the genes. After searching for SPO1 in SGD database, we found that SPO1 is a transcriptional regulator involved in sporulation, and required for middle and late meiotic expression. This increases our confidence that SPO1 has high potential to be a previously unknown binding site.
Finally, since we knew that in gene HOP1, the URS1H binding site appears upstream from UASH, we wanted to see if we could extract the "reversed" binding site. We search for the original and the reversed motifs using a hamming threshold of 6. We found 34 new binding sites where UASH can appear either up-or down-stream from URS1H. Among these we found two possible potential binding sites for the gene HOP1, with UASH at position -201 and URS1H at positions -534 and -175. The former pair (-201,-534) is in fact a known binding site as reported in the SCPD database [1]. This once again showcases the ability of SMOTIF to find potential new binding sites. Conclusion We introduced SMOTIF, a fast and efficient algorithm to search structured pattern and profile motifs in biological sequences. We showed its applications in searching for composite regulatory patterns, long terminal repeat retrotransposons, and for searching long range motifs. SMOTIF is also computationally more efficient than previous state-of-the-art methods like SMARTFINDER [6]. In biosequence analysis there are four related structured motif problems depending on whether the simple motifs and gap ranges in the structured motif are known or not: (i) Structured motif search [4-6]: all simple motifs and gap ranges are known; (ii) Structured motif extraction [30-32]: all simple motifs are unknown and all gap ranges are known; (iii) Extended structured motif search: all simple motifs are known and all gap ranges are unknown; and (iv) Extended structured motif extraction: all simple motifs and gap ranges are unknown. In this paper we tackled problem (i). A companion paper [33] tackles problem (ii). In the future, we plan to develop efficient algorithms for the other two motif problems as well. Authors' contributions All authors contributed equally to this work. Acknowledgements This work was supported in part by NSF CAREER Award IIS-0092978, DOE Career Award DE-FG02-02ER25538, and NSF grants EIA-0103708 & EMT-0432098. We thank the anonymous reviewers for their many helpful suggestions. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics. 1999 Jul-Aug; 15(7-8):607-11.
[Bioinformatics. 1999]Cytogenet Genome Res. 2005; 110(1-4):462-7.
[Cytogenet Genome Res. 2005]Comput Appl Biosci. 1993 Jun; 9(3):299-314.
[Comput Appl Biosci. 1993]Bioinformatics. 2003 Feb 12; 19(3):362-7.
[Bioinformatics. 2003]Nat Rev Genet. 2002 May; 3(5):329-41.
[Nat Rev Genet. 2002]Comput Appl Biosci. 1993 Jun; 9(3):299-314.
[Comput Appl Biosci. 1993]J Comput Biol. 1996 Spring; 3(1):33-51.
[J Comput Biol. 1996]J Comput Biol. 2003; 10(6):903-23.
[J Comput Biol. 2003]Nucleic Acids Res. 2003 Jul 1; 31(13):3576-9.
[Nucleic Acids Res. 2003]Nucleic Acids Res. 2005 Jul 1; 33(Web Server issue):W432-7.
[Nucleic Acids Res. 2005]Nucleic Acids Res. 1995 Dec 11; 23(23):4878-84.
[Nucleic Acids Res. 1995]Bioinformatics. 2005 Jul 1; 21(13):2933-42.
[Bioinformatics. 2005]Nucleic Acids Res. 2003 Jan 1; 31(1):374-8.
[Nucleic Acids Res. 2003]J Comput Biol. 1996 Spring; 3(1):33-51.
[J Comput Biol. 1996]Nucleic Acids Res. 2003 Jul 1; 31(13):3576-9.
[Nucleic Acids Res. 2003]Bioinformatics. 2005 Jul 1; 21(13):2933-42.
[Bioinformatics. 2005]Bioinformatics. 2000 Mar; 16(3):233-44.
[Bioinformatics. 2000]Nucleic Acids Res. 2003 Jul 1; 31(13):3576-9.
[Nucleic Acids Res. 2003]Bioinformatics. 2005 Jul 1; 21(13):2933-42.
[Bioinformatics. 2005]Cytogenet Genome Res. 2005; 110(1-4):462-7.
[Cytogenet Genome Res. 2005]Bioinformatics. 2001 Jul; 17(7):608-21.
[Bioinformatics. 2001]Bioinformatics. 1999 Jul-Aug; 15(7-8):607-11.
[Bioinformatics. 1999]Bioinformatics. 2001 Jul; 17(7):608-21.
[Bioinformatics. 2001]Bioinformatics. 1999 Jul-Aug; 15(7-8):607-11.
[Bioinformatics. 1999]J Comput Biol. 1996 Spring; 3(1):33-51.
[J Comput Biol. 1996]J Comput Biol. 2000; 7(3-4):345-62.
[J Comput Biol. 2000]Algorithms Mol Biol. 2006 Nov 16; 1():21.
[Algorithms Mol Biol. 2006]