Copyright © 2007 The Author(s)

RE-MuSiC: a tool for multiple sequence alignment with regular expression constraints

1Department of Computer Science, National Tsing Hua University, Hsinchu 300, Taiwan, 2Institute of Bioinformatics, National Chiao Tung University, Hsinchu 300, Taiwan and 3Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan

*To whom correspondence should be addressed. Phone: +886-3-5712121, Fax: +886-3-5729288, Email: cllu/at/mail.nctu.edu.tw

Received January 31, 2007; Revised April 6, 2007; Accepted April 11, 2007.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

RE-MuSiC is a web-based multiple sequence alignment tool that can incorporate biological knowledge about structure, function, or conserved patterns regarding the sequences of interest. It accepts amino acid or nucleic acid sequences and a set of constraints as inputs. The constraints are pattern descriptions, rather than exact positions of fragments to be aligned together. The output is an alignment in which, for each pattern (constraint), an occurrence on each sequence is aligned together with those on the other sequences, such that the overall alignment is optimized. Its predecessor, MuSiC, has been found useful by researchers since its release in 2004. However, applications have shown that the pattern formulation adopted in MuSiC, namely plain strings allowing mismatches, is not sufficiently expressive or flexible.
The constraint formulation adopted in RE-MuSiC is therefore enhanced to regular expressions, which are convenient for expressing many biologically significant patterns, such as those collected in the PROSITE database, or structural consensuses that often involve variable ranges between conserved parts. Experiments demonstrate that RE-MuSiC can be used to help predict important residues and locate phylogenetically conserved structural elements. RE-MuSiC is available on-line at http://140.113.239.131/RE-MUSIC.

INTRODUCTION

Background and motivation

Sequence alignment tools are essential to biological research [see, e.g. (1), for a survey of multiple sequence alignment methods]. In addition to the residues/nucleotides themselves, biologists often possess further knowledge regarding the function, structure or conserved patterns of the sequences to be analyzed. It is generally desirable to incorporate such information into an alignment procedure, so that the alignment result can be more biologically meaningful. For example, functionally important sites are generally expected to be aligned together, but a typical alignment tool often fails to achieve this if the sequence similarity is low. Imposing constraints representing such information turns out to be an effective way to incorporate biological knowledge into an alignment tool.

Motivated by this demand, Tang et al. (2) formulated the constrained multiple sequence alignment problem, where each constraint is a single residue/nucleotide. They considered alignment of RNase sequences, which are known to have a sequence of conserved residues His (H), Lys (K) and His. With H, K, H as constraints, each of these three residues is aligned together in a column of the resulting constrained alignment, and the columns appear in the specified order. Chin et al. (3) then proposed an improved algorithm for pairwise alignment and an approximation algorithm for multiple alignment.
It is also noted that other formulations of alignment with constraints have been proposed from different perspectives with various approaches (4–14).

Conserved sites of a protein/RNA/DNA family are often several residues/nucleotides long. For these patterns, the original formulation in (2) is not expressive enough. In addition, such patterns generally may not appear in exact form. Consequently, Tsai et al. (15) proposed a generalized formulation and algorithm, where each constraint is a (usually short) string pattern allowing mismatches. Lu and Huang (16) then proposed a space-efficient algorithm for this formulation. Web-based systems, MuSiC (15) (available at http://genome.life.nctu.edu.tw/MUSIC) and MuSiC-ME (16) (available at http://genome.life.nctu.edu.tw/MUSICME), were also developed; from now on these two systems will be referred to jointly as MuSiC. With the aid of MuSiC, Tsai et al. (15) and Lu and Huang (16) successfully identified a fragment in the 3′ untranslated region (3′-UTR) of a SARS (severe acute respiratory syndrome) coronavirus sequence that can fold into a pseudoknot, which is potentially responsible for self-replication of the virus. Indeed, since its release, MuSiC has been found useful in, e.g. detection of functionally and/or structurally important residues/motifs in sequences (17,18), prediction of RNA pseudoknotted structures (15,19,20), prediction of protein structures (21) and so on.

There are, however, many biologically significant patterns whose formulations are beyond the capability of MuSiC. For example, many function-related protein sites, such as those collected in the PROSITE database (22), are expressed as regular expressions, which cannot be modeled using the substring-with-mismatch formulation of constraints implemented in MuSiC.
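To make the substring-with-mismatch formulation concrete, the following minimal sketch lists every window of a sequence that matches a plain string pattern within a given number of mismatches. It only illustrates the constraint formulation described above; the function name, the parameters and the toy nucleotide sequence are assumptions made for this example and are not taken from MuSiC.

```python
def occurrences_with_mismatches(sequence: str, pattern: str, max_mismatches: int):
    """Return (start, mismatches) for each window of len(pattern) in sequence
    that differs from pattern in at most max_mismatches positions."""
    hits = []
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Hamming distance between the window and the pattern.
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Hypothetical toy data: one exact hit at 0 and one single-mismatch hit at 4.
print(occurrences_with_mismatches("ACGTACCT", "ACGT", 1))
```

Every occurrence found this way has the same length as the pattern, which is why variable-length gaps such as those in PROSITE patterns cannot be expressed in this formulation.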
An example of a regular expression pattern is the EGF-like domain signature 2 (EGF_2, PS01186 in PROSITE): C-x-C-x(2)-[GP]-[FYW]-x(4,8)-C, which is related to the initiation of a signal transduction that results in DNA synthesis and cell proliferation. This pattern means that the first residue is Cys, followed by one residue of any kind, then a Cys, followed by two residues of any kind, then a Gly or Pro, etc. Regular expressions are also convenient for describing variable ranges between patterns or between blocks within a pattern, which is necessary for some single patterns themselves, and useful in applications where different patterns are expected to occur close to one another. In the above example of EGF_2, the ‘x(4,8)’ symbol preceding the last Cys indicates a gap of 4 to 8 residues between a residue of [F, Y or W] (Phe, Tyr or Trp) and that last Cys.

Because regular expressions are so useful for describing biological patterns, an enhanced web server, RE-MuSiC (Multiple Sequence Alignment with Regular Expression Constraints), capable of handling regular expression constraints, has been developed.

DIALIGN (8,9,12,13) (http://dialign.gobics.de/) is a well-known web server that can accept user-defined constraints as anchor points. The constraint formulations of DIALIGN and RE-MuSiC are significantly different. In DIALIGN, a constraint consists of the exact positions of a pair of equal-length segments on two of the sequences, where these two segments are expected to be aligned together. Conflicts between constraints, if any, are resolved according to a weight function defined on the segment pairs. This formulation is more similar to that of Myers et al. (6). In RE-MuSiC, on the other hand, a constraint is a regular expression pattern. Each pattern may occur many times in a sequence, and the occurrences need not have the same length.
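The PROSITE notation and the remark about multiple occurrences of differing lengths can be illustrated with a short sketch. It translates a PROSITE-style pattern into a Python regular expression and lists every substring of a sequence that matches it. The function names and the toy sequence are assumptions made for this example; only the common syntax elements (`x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<`, `>`) are handled, and this is not the RE-MuSiC implementation.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")"):          # repetition: (n) or (n,m)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":                 # any residue
            core = "."
        elif element.startswith("{"):      # excluded residues
            core = "[^" + element[1:-1] + "]"
        else:                              # single residue or [..] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def all_occurrences(sequence: str, regex: str):
    """Return every (start, end) such that sequence[start:end] matches regex."""
    compiled = re.compile(regex)
    return [
        (i, j)
        for i in range(len(sequence))
        for j in range(i + 1, len(sequence) + 1)
        if compiled.fullmatch(sequence, i, j)
    ]


egf_2 = prosite_to_regex("C-x-C-x(2)-[GP]-[FYW]-x(4,8)-C")
print(egf_2)
# Hypothetical toy sequence, chosen so that the pattern occurs twice
# from the same start position with two different lengths.
print(all_occurrences("CACAAGFAAAACAAC", egf_2))
```

Testing every substring is quadratic in the sequence length and is used here only for clarity; `re.finditer` would be faster but reports only non-overlapping matches, so it would miss the second, longer occurrence.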
The occurrences aligned together to satisfy the constraints are those that optimize the overall alignment.

Using RE-MuSiC

RE-MuSiC provides an intuitive user interface (Figure 1
METHODS

The regular expression constrained sequence alignment problem was originally formulated by Arslan (23). The algorithm proposed in (23) is for pairwise alignment with a single constraint. In (24) Arslan extended the algorithm in (23) to support multiple alignment with multiple constraints. The algorithm proposed in (24) could be implemented directly, as it computes mathematically optimal constrained alignments. Unfortunately, its time complexity is extremely high, involving an exponential multiplicative factor in addition to the exponential time complexity of optimal (unconstrained) MSA computations. Even for pairwise alignment with multiple constraints, its worst-case time and space requirements are demanding. In addition, the algorithms in (23,24) cannot identify the regions of the resulting alignment that satisfy the constraints; only the alignment score, without the alignment itself, is reported. Being able to report alignments, however, is important for a web server. It is therefore necessary to propose a solution more suitable for practical applications.

For pairwise alignment with one regular expression constraint, we proposed in a previous study (25) an algorithm that is more efficient in both time and space than the one in (23). Furthermore, the alignment, in addition to the score, can be reconstructed without worsening the time and space complexity. In this work we extend the algorithm in (25) to support multiple constraints and multiple sequences, as required in RE-MuSiC. The resulting algorithm is more efficient than the one in (24) for pairwise alignment with multiple constraints. To deal with multiple sequences, a progressive method is implemented, using our improved pairwise algorithm as the kernel. For details of the algorithm the reader is referred to the supplementary material (available at http://140.113.239.131/RE-MUSIC/RE_MuSiC_method.pdf).
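As a rough illustration of the problem itself, and not of the algorithms in (23–25) or of the RE-MuSiC kernel, the brute-force sketch below handles pairwise alignment with a single regular expression constraint. It assumes the constraint is read as: one occurrence of the pattern in each sequence must occupy the same block of alignment columns. It also assumes a toy scoring scheme (match +1, mismatch −1, linear gap −2); all names and scores are hypothetical. Because the score is additive over columns, the best constrained score for a given pair of occurrences is the sum of three independent global alignment scores (prefixes, occurrences, suffixes).

```python
import re

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed toy scoring, not from the paper


def global_score(a: str, b: str) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    previous = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * GAP]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            current.append(max(diagonal, previous[j] + GAP, current[j - 1] + GAP))
        previous = current
    return previous[-1]


def occurrences(sequence: str, compiled):
    """All (start, end) with sequence[start:end] fully matching the regex."""
    return [
        (i, j)
        for i in range(len(sequence))
        for j in range(i + 1, len(sequence) + 1)
        if compiled.fullmatch(sequence, i, j)
    ]


def constrained_alignment(s: str, t: str, regex: str):
    """Return (score, occurrence_in_s, occurrence_in_t), or None if the
    pattern does not occur in both sequences."""
    compiled = re.compile(regex)
    best = None
    for si, sj in occurrences(s, compiled):
        for ti, tj in occurrences(t, compiled):
            score = (
                global_score(s[:si], t[:ti])        # columns before the block
                + global_score(s[si:sj], t[ti:tj])  # the two occurrences
                + global_score(s[sj:], t[tj:])      # columns after the block
            )
            if best is None or score > best[0]:
                best = (score, (si, sj), (ti, tj))
    return best


print(constrained_alignment("TTACGT", "ACGTTT", "ACGT"))
```

Enumerating all pairs of occurrences is far too slow for real sequences, which is the kind of cost that motivates more efficient dynamic programming formulations. The sketch does, however, report which occurrences satisfy the constraint, the capability the text identifies as important for a web server.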
EXPERIMENTS

Protein sequences with active site residues

The glutathione binding site (G-site) on glutathione S-transferase (GST) has been found to have conserved architectures across species (26). The chemical natures of the residues acting as G-site ligands and their interactions with glutathione are also analogous (26). In a reasonable alignment of GST protein sequences, therefore, the residues for the G-site are expected to be aligned together. A structural superposition of the crystal structures of GST proteins from different species also suggests that most of these G-site residues should be aligned together (26). The sequence identity of GST proteins from different species, however, is quite low; for example, it is reported in (26) that the pairwise sequence identity between the A. thaliana GST and each of six other non-plant GSTs is no more than 20.2%. In such a case, with interference from the low-similarity regions, it would be difficult for a typical alignment tool to align the important residues well. An experiment is therefore undertaken to examine the performance of a typical alignment tool in this case, as well as to demonstrate how RE-MuSiC can be used to produce a more reasonable alignment.

In this experiment we analyze three GST proteins: (i) AtGST: a phi class GST from plant A. thaliana (PDBID: 1GNW); (ii) SjGST: an alpha class GST from non-mammalian S. japonicum (flatworm) (PDBID: 1M99); (iii) SsGST: a pi class GST from mammalian S. scrofa (pig) (PDBID: 2GSR). These sequences are first aligned using ClustalW (27). The result is shown in Figure 2

RNA sequences with phylogenetically conserved pseudoknots

There is considerable evidence suggesting that phylogenetically conserved pseudoknots found in the 3′-UTRs of various coronaviruses are involved in RNA replication of these viruses (28). In an alignment of the 3′-UTR sequences of coronaviruses, therefore, it is desirable for these pseudoknots to be aligned together.
However, it is often the case that the sequence identity among the coronaviruses from different groups is low. It is not an easy task for a typical alignment tool to align together the conserved pseudoknots. In this experiment, we demonstrate that RE-MuSiC can be helpful in this situation. Four coronaviruses are considered in this experiment (GenBank accession numbers in parentheses): (i) HCoV-229E: human 229E coronavirus (af304460), (ii) PEDV: porcine epidemic diarrhea virus (af353511), (iii) BCoV: bovine coronavirus (af220295) and (iv) MHV: murine hepatitis virus (af201929). The first two are group 1 coronaviruses, while the others belong to group 2. First, ClustalW is applied to align these coronavirus sequences. The result is shown in Figure 3
SUMMARY

Imposing constraints is an effective manner to incorporate biological knowledge into an alignment tool. Previous versions of MuSiC do not support many biologically significant patterns. RE-MuSiC adopts regular expressions as its constraint formulation, which is useful in expressing PROSITE patterns or structural elements that often involve variable ranges between conserved parts. The algorithm underlying RE-MuSiC represents an improvement over the previously proposed algorithm, and is more appropriate for implementation in a web-server. Experiments on GST proteins and on coronaviruses with phylogenetically conserved pseudoknots demonstrate that, with additional knowledge incorporated, RE-MuSiC is able to produce meaningful alignments in which important residues or structural elements can be aligned properly, even if the similarity among input sequences is low. Such ability is also useful for prediction purposes.

ACKNOWLEDGEMENTS

The authors thank the anonymous reviewers for their useful comments, which improved the usability of the web interface and the clarity of this article. This work was supported in part by National Science Council of Republic of China under grant NSC95-2221-E-009-231 (C.L. Lu) and grants NSC95-2221-E-007-031 and NSC95-2627-B-007-002 (C.Y. Tang), and in part by the ATU plan of MOE. Funding to pay the Open Access publication charges for this article was provided by the ATU plan of MOE.

Conflict of interest statement. None declared.

REFERENCES

1. Notredame C. Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics. 2002;3:131–144.
2. Tang CY, Lu CL, Chang MDT, Tsai YT, Sun YJ, Chao KM, Chang JM, Chiou YH, Wu CM, et al. Constrained multiple sequence alignment tool development and its application to RNase family alignment. J. Bioinform. Comput. Biol. 2003;1:267–287.
3. Chin FYL, Ho NL, Lam TW, Wong PWH, Chan MY. Efficient constrained multiple sequence alignment with performance guarantee. J. Bioinform. Comput. Biol. 2005;3:1–18.
4. Schuler GD, Altschul SF, Lipman DJ. A workbench for multiple alignment construction and analysis. Proteins: Struct. Funct. Genet. 1991;9:180–190.
5. Depiereux E, Feytmans E. MATCH-BOX: a fundamentally new algorithm for the simultaneous alignment of several protein sequences. Comput. Appl. Biosci. 1992;8:501–509.
6. Myers G, Selznick S, Zhang Z, Miller W. Progressive multiple alignment with constraints. J. Comput. Biol. 1996;3:563–572.
7. Morgenstern B, Dress A, Werner T. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA. 1996;93:12098–12103.
8. Morgenstern B, Frech K, Dress A, Werner T. DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics. 1998;14:290–294.
9. Morgenstern B. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics. 1999;15:211–218.
10. Thompson JD, Plewniak F, Thierry J-C, Poch O. DbClustal: rapid and reliable multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 2000;28:2919–2926.
11. Sammeth M, Morgenstern B, Stoye J. Divide-and-conquer multiple alignment with segment-based constraints. Bioinformatics. 2003;19(Suppl. 2):ii189–ii195.
12. Morgenstern B. DIALIGN: Multiple DNA and protein sequence alignment at BiBi-Serv. Nucleic Acids Res. 2004;32:W33–W36.
13. Morgenstern B, Werner N, Prohaska SJ, Schneider RSI, Subramanian AR, Stadler PF, Weyer-Menkhoff J. Multiple sequence alignment with user-defined constraints at GOBICS. Bioinformatics. 2005;21:1271–1273.
14. Morgenstern B, Prohaska SJ, Pohler D, Stadler PF. Multiple sequence alignment with user-defined anchor points. Algorithms Mol. Biol. 2006;1:6.
15. Tsai Y-T, Huang YP, Yu CT, Lu CL. MuSiC: a tool for multiple sequence alignment with constraints. Bioinformatics. 2004;20:2309–2311.
16. Lu CL, Huang YP. A memory-efficient algorithm for multiple sequence alignment with constraints. Bioinformatics. 2005;21:20–30.
17. Song B, Choi JH, Chen GY, Szymanski J, Zhang GQ, Tung AKH, Kang J, Kim S, Yang J. ARCS: an aggregated related column scoring scheme for aligned sequences. Bioinformatics. 2006;22:2326–2332.
18. Cheng CY, Chang CH, Wu YJ, Li YK. Exploration of glycosyl hydrolase family 75, a chitosanase from Aspergillus fumigatus. J. Biol. Chem. 2006;281:3137–3144.
19. Huang CH, Lu CL, Chiu HT. A heuristic approach for detecting RNA H-type pseudoknots. Bioinformatics. 2005;21:3501–3508.
20. Reeder J, Hochsmann M, Rehmsmeier M, Voss B, Giegerich R. Beyond Mfold: Recent advances in RNA bioinformatics. J. Biotechnol. 2006;124:41–55.
21. Dunbrack RL. Sequence comparison and protein structure prediction. Curr. Opin. Struct. Biol. 2006;16:374–384.
22. Hulo N, Sigrist CJA, Le Saux V, Langendijk-Genevaux PS, Bordoli L, Gattiker A, De Castro E, Bucher P, Bairoch A. Recent improvements to the PROSITE database. Nucleic Acids Res. 2004;32:134–137.
23. Arslan AN. Regular expression constrained sequence alignment. In: Proceedings of 16th Annual Symposium on Combinatorial Pattern Matching (CPM05), Vol. 3537 of Lecture Notes in Computer Science. Berlin: Springer; 2005. pp. 322–333.
24. Arslan AN. Multiple sequence alignment containing a sequence of regular expressions. In: Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB05). San Diego, USA; 2005. pp. 1–7.
25. Chung Y-S, Lu CL, Tang CY. Efficient algorithms for regular expression constrained sequence alignment. In: Proceedings of 17th Annual Symposium on Combinatorial Pattern Matching (CPM06), Vol. 4009 of Lecture Notes in Computer Science. Berlin: Springer; 2006. pp. 389–400.
26. Reinemer P, Prade L, Hof P, Neuefeind T, Huber R, Zettl R, Palme K, Schell J, Koelln I, et al. Three-dimensional structure of glutathione S-transferase from Arabidopsis thaliana at 2.2 Å resolution: structural characterization of herbicide-conjugating plant glutathione S-transferases and a novel active site architecture. J. Mol. Biol. 1996;255:289–309.
27. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680.
28. Williams GD, Chang RY, Brian DA. A phylogenetically conserved hairpin-type 3′ untranslated region pseudoknot functions in coronavirus RNA replication. J. Virol. 1999;73:8349–8355.