Nucleic Acids Res. 2009 July 1; 37(Web Server issue): W545–W551.
Published online 2009 July 1. doi: 10.1093/nar/gkp291.
PMCID: PMC2703971
iSARST: an integrated SARST web server for rapid protein structural similarity searches
Wei-Cheng Lo,1 Che-Yu Lee,1 Chi-Ching Lee,1,2 and Ping-Chiang Lyu1,3*
1Institute of Bioinformatics and Structural Biology, 2Institute of Information System Application and 3Department of Life Sciences, National Tsing Hua University, Hsinchu, Taiwan
*To whom correspondence should be addressed. Tel: +886 3 5742762; Fax: +886 3 5715934; Email: lslpc/at/life.nthu.edu.tw
Received February 22, 2009; Revised April 14, 2009; Accepted April 14, 2009.
iSARST is a web server for efficient protein structural similarity searches. It is a multi-processor, batch-processing, integrated implementation of several structural comparison tools and two database search methods: SARST for common structural homologs and CPSARST for homologs with circular permutations. iSARST allows users to submit multiple PDB/SCOP entry IDs or an archive file containing many structures. After the target database is scanned using SARST/CPSARST, the ordering of hits is refined with conventional structure alignment tools such as FAST, TM-align and SAMO, which are run on a PC cluster. In this way, iSARST achieves a high running speed while preserving the high precision of the refinement engines. The final outputs include tables listing co-linear or circularly permuted homologs of the query proteins and a functional summary of the best hits. Superimposed structures can be examined through an interactive and informative visualization tool. iSARST provides the first batch-mode structural comparison web service for both co-linear homologs and circular permutants. It can serve as a rapid annotation system for functionally unknown or hypothetical proteins, the number of which is increasing rapidly in this post-genomics era. The server can be accessed at http://sarst.life.nthu.edu.tw/iSARST/.
Protein structural data are increasing exponentially. This growth has made structural comparison indispensable for protein functional and evolutionary studies, the basic approach of which is to relate proteins according to their structural similarities. To meet the requirements of high-throughput data analyses, which are especially common in structural genomics research, fast and accurate tools for structural similarity searches are in high demand. Search methods that work on amino acid sequence data, such as BLAST (1) and FASTA (2), are extremely rapid, but they have long been known to be insensitive in detecting structural relationships among proteins sharing low sequence homology (3). Alignment algorithms that directly solve the geometric problems of superimposing three-dimensional (3D) protein structures can be very accurate, but most of them are not fast enough to serve as the basis of instant protein similarity search web services (4).
To combine the speed advantages of sequence-based methods and the accuracy of using structural data, many linear encoding algorithms have been proposed, such as those by Levine et al. (5) and Lesk (6) and those of TOPSCAN (7), YAKUSA (8), 3D-BLAST (4) and SARST (9). By transforming 3D protein structural data into one-dimensional (1D) text strings or numerical series, these algorithms convert the complicated geometric problems of structural superimposition into much easier sequence comparison problems, which can be solved rapidly with traditional sequence alignment techniques. Among recently proposed linear encoding methods, Ramachandran Sequential Transformation (RST) (9) has been shown to be suitable for developing efficient protein structural similarity search tools. For instance, SARST (Structural similarity search Aided by RST) can run over 240 000 times faster than Combinatorial Extension (CE) (10) with comparable precision in database searching (9). In addition, RST has been shown to be applicable to detecting circular permutations (CPs) in proteins (11). CP is an evolutionary event that causes the amino- and carboxyl-termini of the resulting protein variants to be located at positions different from those in the original protein (12–14), while the overall 3D structures and biological functions remain preserved (15,16), sometimes with increased stability, activity or functional diversity (17–19). CP has been applied in folding research (20–22) and many bioengineering fields (17,23–26). In detecting CP, CPSARST (CP Search Aided by RST) achieved a speed around 9000 times higher than SAMO (protein Structure Alignment tool based on Multiple Objective optimization) (27) with similar alignment quality. It was also proposed to be capable of serving as a functional assignment system for hypothetical proteins when co-linear similarity search methods failed to annotate them properly (11).
Although the average precision of SARST is close to that of CE, it is basically a search tool. We therefore proposed that it could be combined with a highly accurate structural comparison tool, e.g. FAST (Fast Alignment and Search Tool) (28), into a good web service, in which SARST rapidly screens the target database and the structural comparison tool then refines (re-orders) the hit list (9). The advantage of this combination is that, because most dissimilar structures are eliminated in the screening stage, there is no need to perform one-against-all structural alignments, which may cost the user more than a day (4,9), to obtain a precisely ordered hit list. However, re-ordering a hit list of 500 proteins, for instance, takes from minutes to over an hour when common alignment methods are applied (27–29), which is still too long for an efficient and convenient web-based tool. The situation of CPSARST is similar; even though the ‘double filter-and-refine’ strategy greatly enhances its performance, this 2 × 2 step strategy still takes >2 min to search the current PDB (11).
To develop a rapid, accurate and multi-functional protein structural similarity search service, we have integrated SARST and CPSARST along with several structural alignment methods, i.e. FAST (28), TM-align (29), SAMO (27) and SE (Seed Extension) (30), into a multi-processor and batch-processing system named iSARST (the integrated service of SARST). In this service, (i) the RST algorithm forms the basis of rapid database searching, (ii) the refinement engines FAST and TM-align provide high accuracy in the ordering of hits, (iii) CPSARST and SAMO make it versatile, since they can perform circularly permuted and order-independent structural alignment, respectively, and (iv) the SE algorithm equips it with a state-of-the-art method for producing accurate structure-based sequence alignments. The design principles of iSARST are (i) giving the user responses as quickly as possible, (ii) providing a batch-processing environment and (iii) offering user-friendly interfaces. When assessed with the datasets in Refs (9,31), iSARST preserved the high precision of the refinement engines while greatly reducing the calculation time. Retrieving and superimposing 500 homologs from the current PDB takes only 7.8 s. If the input proteins have been queried previously, the cached results can be retrieved within a second. The result pages of iSARST are designed so that structural examinations, functional assignments and successive database searches can be carried out conveniently. Server-side programs are modularized, so new search methods and refinement tools can be integrated easily. In addition, its multi-processor implementation is quite flexible: any computer equipped with the Linux operating system, conventional C libraries and PHP can join iSARST as a node upon request. We hope that this efficient, versatile and convenient web server can be a good assistant and collaboration platform for structural biologists in this post-genomics era.
The flowchart of iSARST is shown in Figure 1. After receiving the query structure, the master node linearly encodes it and performs the database search. In the refinement stage, proteins in the hit list are scattered to all slave nodes and then superimposed on the query protein using an accurate structural comparison tool specified by the user. The RMSD (root mean square distance) values, alignment sizes and structural similarity scores are gathered by the master node to re-order the hit list, which is output with superimpositions and functional information. Finally, the refined data are cached in several forms to ensure a quick response if the same proteins are queried again.
Figure 1.
Flowchart of iSARST. The query structure is first transformed into a structurally meaningful Ramachandran string and then used to screen target database by SARST or CPSARST. In refinement stage, the raw hit list is re-ordered according to the structural (more ...)
Linear encoding of protein structures
The RST algorithm (9) is implemented in iSARST to linearly encode protein structures. The traditional Ramachandran plot was partitioned with a nearest-neighbor clustering approach into 22 regions, each represented by a different symbol. In this way, a protein structure can be transformed residue by residue into a structurally meaningful string according to the φ and ψ angles along its backbone. These 1D structural strings are called Ramachandran (RM) strings.
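The encoding step can be illustrated with a minimal sketch. The region centers and symbols below are hypothetical placeholders chosen only for illustration; the actual RST method uses 22 regions whose positions are not given in this text, and the helper names (`angular_diff`, `encode_residue`, `encode_structure`) are likewise assumptions, not part of iSARST.

```python
import math

# HYPOTHETICAL region centers (phi, psi in degrees) and symbols.
# The real RST map has 22 regions from nearest-neighbor clustering;
# these four entries are placeholders for illustration only.
REGION_CENTERS = {
    "A": (-60.0, -45.0),   # roughly alpha-helical
    "B": (-120.0, 130.0),  # roughly beta-strand
    "L": (60.0, 45.0),     # roughly left-handed helix
    "P": (-75.0, 150.0),   # roughly polyproline II
}


def angular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def encode_residue(phi, psi, centers=REGION_CENTERS):
    """Return the symbol of the region center nearest to (phi, psi)."""
    def dist(item):
        cphi, cpsi = item[1]
        # Distance on the torus: both angles wrap around at +/-180 degrees.
        return math.hypot(angular_diff(phi, cphi), angular_diff(psi, cpsi))
    return min(centers.items(), key=dist)[0]


def encode_structure(torsions, centers=REGION_CENTERS):
    """Encode a list of (phi, psi) pairs into a 1D string, one symbol per residue."""
    return "".join(encode_residue(phi, psi, centers) for phi, psi in torsions)
```

For example, `encode_structure([(-57, -47), (-119, 113), (57, 47)])` yields one symbol per residue, which can then be handled by ordinary sequence alignment tools.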
Structural similarity searches
To perform rapid database searches, all proteins in the PDB (32) and SCOP (33) have been pre-transformed into several RM string databases with various identity cutoffs. SARST and CPSARST both use the blastall program (1) as the search engine. SARST was developed for common (co-linear) structural homologs; its database search is a straightforward execution of blastall. CPSARST specifically finds circular permutants. In the screening stage, it performs two rounds of similarity searches, with the normal length (nl) and the duplicated length (dl) of the query structure, respectively. After the results of these two rounds are compared, the hits showing improved alignment quality in the dl alignment are chosen as CP candidates. The criteria are as follows:
[Equation 1: image gkp291m1.jpg, not recoverable from this copy] (1)
[Equation 2: image gkp291m2.jpg, not recoverable from this copy] (2)
where score is the bit score calculated by blastall using the standard SARST scoring matrix (9) to measure the similarity between two RM strings. E-value (expectation value) is an assessment of the significance of score. Given that a hit has a score S, E-value is the expected number of different alignments occurring by chance with scores ≥S in this particular database search (1,11).
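The reason for searching with a duplicated query can be sketched as follows: every circular permutation of a string occurs as a contiguous substring of that string concatenated with itself, so a co-linear aligner can align a permuted subject against the doubled query without a break. The sketch below uses exact string matching purely for illustration; CPSARST itself relies on blastall alignments of RM strings and on the score and E-value criteria above, and the function names here are hypothetical.

```python
def duplicate_query(rm_string):
    """Build the duplicated-length (dl) query by concatenating the string with itself."""
    return rm_string + rm_string


def is_exact_circular_permutation(query, subject):
    """True if subject is an exact rotation of query (idealized, error-free case)."""
    return len(query) == len(subject) and subject in duplicate_query(query)


def permutation_site(query, subject):
    """Return the 0-based index in query at which subject starts, or None.

    An index of 0 means the two strings are co-linear (no permutation).
    """
    if len(query) != len(subject):
        return None
    idx = duplicate_query(query).find(subject)
    return idx if idx >= 0 else None
```

With the query `"ABCDEFG"` and the subject `"DEFGABC"`, the subject is found intact inside the doubled query `"ABCDEFGABCDEFG"`, whereas against the normal-length query it would split into two separate partial matches.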
Refinement of searching results
After database searches, the ordering of retrieved structural homologs is refined by an accurate structural comparison tool. Currently, we use FAST (28), TM-align (29) and SAMO (27) as refinement engines. FAST and TM-align have been shown to produce high-quality structural alignments (28,29), in many cases even outperforming DALI (34). Among published structural comparison methods, they are also outstandingly fast, e.g. superimposing a pair of proteins in 0.2–0.5 s on average with a 1.2-GHz processor (28,29). The speed of SAMO is similar to that of DALI, which requires ~10 s for a pair-wise alignment (11,27); it is implemented in iSARST because of its excellent ability to perform order-independent structural alignment (27). Structurally similar proteins with different topologies can be identified by SAMO, which may help to reveal the evolutionary mechanisms of protein structure and function. The values of RMSD and alignment size calculated by the refinement engines are integrated into a single measure called structural diversity, defined by Lu (35):
[Equation 3: image gkp291m3.jpg, not recoverable from this copy] (3)
where avg (Lq, Ls) is the average length of the query and subject proteins. A lower structural diversity stands for a higher structural similarity. This measure is used to re-order the raw hit list.
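The re-ordering step can be sketched as below. Because Equation 3 is not reproduced in this text, the functional form used here, RMSD divided by the alignment coverage raised to the power 1.5, is an assumption made for illustration and should be checked against Lu (35); the `Hit` record and function names are hypothetical as well.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    rmsd: float        # RMSD reported by the refinement engine
    aligned: int       # alignment size (number of aligned residues)
    subject_len: int   # length of the subject protein (Ls)


def structural_diversity(rmsd, aligned, query_len, subject_len, exponent=1.5):
    """ASSUMED form: rmsd / (aligned / avg(Lq, Ls)) ** exponent.

    Lower values mean higher structural similarity.
    """
    avg_len = (query_len + subject_len) / 2.0
    coverage = aligned / avg_len
    if coverage <= 0:
        return float("inf")  # nothing aligned: treat as maximally diverse
    return rmsd / coverage ** exponent


def reorder_hits(hits, query_len):
    """Sort hits from most to least similar (ascending structural diversity)."""
    return sorted(
        hits,
        key=lambda h: structural_diversity(h.rmsd, h.aligned, query_len, h.subject_len),
    )
```

Under this assumed form, a hit with a low RMSD but only half of its residues aligned is ranked below a hit with a slightly higher RMSD and full coverage, which is the purpose of combining the two indices into one measure.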
When running CPSARST, the refinement process is more complicated, since two rounds of alignment must be performed, with and without circularly permuting the PDB structure (11). Only those hits whose structural similarity to the query protein improves after the PDB file is circularly permuted are output as final CP candidates.
Indices such as RMSD and alignment size may show the structural relationships between proteins; however, to understand their functional relationship properly, one may still need to examine the structure-based sequence alignment. We have implemented the SE algorithm (30) to improve the quality of the structure-based sequence alignments made by the refinement engines. Sequence identity and similarity values are also provided by iSARST. Amino acids are considered similar if they have positive pairing scores in the BLOSUM62 matrix (36).
Multiprocessor implementations
iSARST currently runs on an IBM BladeCenter system plus several Linux machines (Supplementary Table S1). The cluster environment was established with the Rocks operating system. Programs, structure source files and cached data stored on the master node are shared with slave nodes through the Network File System (NFS). The user interface and most server-side programs are written in PHP in a modular way. The search engine, blastall v.2.2.13, is an intra-machine parallel program. We found that it ran fastest when the number of parallel threads was set to twice the number of processors in a machine. We do not use mpiBLAST (37) because the time cost of distributing calculation work to other nodes is relatively high, i.e. several seconds in our preliminary tests. In the refinement stage, aligning one subject protein to the query structure is treated as an individual task. To handle as many tasks in parallel as possible, each node is set to run a number of threads according to the number of processors it possesses. Tasks are distributed to slave nodes by programs written in MPI C and PHP. To ensure a quick response to the user, the assignment principles are as follows. (i) Nodes that respond faster are assigned more tasks. (ii) Tasks arriving at similar times have the same priority. (iii) At least one thread in each node handles tasks in a random order, so that even users who submit queries much later than others still get quick responses from iSARST.
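Assignment principle (i) can be illustrated with a small deterministic simulation, given here as a sketch only: it is not the authors' MPI C/PHP code, and the node names and per-task times are hypothetical. Each node takes the next pending task as soon as it becomes free, so nodes that respond faster end up with more tasks.

```python
import heapq


def simulate_pull_scheduling(num_tasks, node_task_times):
    """Simulate nodes pulling tasks from a shared queue as they become free.

    node_task_times maps a node name to its (hypothetical) seconds per task.
    Returns (tasks handled per node, time at which the last task finishes).
    """
    # Min-heap of (time at which the node is next free, node name).
    free_at = [(0.0, name) for name in sorted(node_task_times)]
    heapq.heapify(free_at)
    counts = {name: 0 for name in node_task_times}
    finish = 0.0
    for _ in range(num_tasks):
        start, name = heapq.heappop(free_at)      # earliest available node
        done = start + node_task_times[name]
        counts[name] += 1
        finish = max(finish, done)
        heapq.heappush(free_at, (done, name))     # node becomes free again
    return counts, finish
```

With six tasks and two nodes taking 1 s and 2 s per task, the faster node handles four tasks and the slower one two, and all work finishes after 4 s instead of the 6 s an even split would need.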
As a search service, iSARST has been evaluated with information retrieval experiments using the same dataset as Aung and Tan (31) and Lo et al. (9). We first found that iSARST exactly preserves the high average precision of its refinement engines at any recall level. For instance, at an 85.0% average recall, when FAST is used as the refinement engine, the average precision of iSARST is 85.2%, the same as that of FAST evaluated in (9). As shown in Table 1, to reach this level of average recall, iSARST only has to retrieve 500 hits from this database of 34 055 polypeptides, and superimposing these 500 protein pairs using FAST takes only 7.8 s when 80 processors are used.
Table 1.
Average recall and running time of iSARST over various sizes of hit list
To assess the performance of iSARST when the number of concurrent users is large, we used a number of client programs to execute it simultaneously. The results (Supplementary Figure S2) indicate that the time cost of database searching and the response time of the refinement engine rise only linearly as the number of simultaneous submissions (n) increases. Thus, iSARST has a time complexity of O(n).
Input and the searching page
The query interface of iSARST accepts several types of input, including (i) one or more PDB/SCOP entry IDs, (ii) a single PDB file or (iii) an archive file consisting of many protein structures in PDB format. After users submit the query data, a temporary search page appears to show the session ID and raw hit list. As the refinement process goes on, users can follow the progress and view structural superimpositions; alternatively, they may close the browser and later retrieve the results by (iv) specifying session IDs in the query interface. iSARST also automatically lists previous sessions when users return, provided that cookies are enabled in their browsers.
Output: hit list
The primary outputs of iSARST are tables listing co-linear or circularly permuted structural homologs of the query proteins (Figure 2a). On the hit list page, two selection menus help users switch to other previous queries. The list can be re-ordered according to RMSD, alignment sizes, structural diversities, sequence identities, functions, etc. Functions of the five hits with the highest structural similarity scores are summarized and highlighted to assist those who want to make a quick functional assignment. Any protein in the list can be re-submitted as a new query with a single click, which makes successive database searches very easy. If the search engine is CPSARST, some extra filtering parameters appear here. Users can adjust them according to their requirements or the properties of the query proteins. Definitions of these parameters and suggestions for their use can be found in (11).
Figure 2.
Final output of iSARST. (a) Hit list. This list can be re-ordered according to various indexes and protein functions by clicking column titles. Functions of the top 5 hits are summarized and highlighted in red. Any protein listed here can be re-submitted (more ...)
Output: structure inspection page
Structure superimpositions can be downloaded from the hit list page or examined in an interactive inspection tool (Figure 2b and c). The structure inspection page provides a graphical display of the superimposition, which can be rotated, re-sized and shown in several modes such as cartoon, space-filled or ball-and-stick. When a CP relationship is detected, C-α atoms of terminal residues are drawn as balls so that their different locations can be easily recognized. In addition, the two proteins are colored very differently; boundaries between the lighter and darker colors mark the locations of the CP site. The structure-based sequence alignment is shown as (i) plain text representing unaligned regions as gaps and (ii) a graph of circularized text in which unaligned regions are drawn as budding loops. Fewer or smaller loops indicate that more residues can be well aligned. This circularized alignment is helpful for identifying CP relationships, especially when the difference between co-linear and circularly permuted alignments is obvious. If some kind of structural rearrangement, including CP, has occurred between the aligned proteins, more than one colored segment can be seen in the dot matrix plot embedded here. The SE algorithm (30) is implemented on this page to provide an improved structure-based sequence alignment, in which corresponding functional residues can be better aligned (30); this may help users derive the functional relatedness between proteins more correctly.
APPLICATIONS AND FUTURE WORKS
As a rapid, accurate and versatile protein structural similarity search web server, iSARST provides user-friendly interfaces and informative outputs for scientists to examine protein structures and perform functional annotations. Its modular design permits follow-up integration of new search and refinement methods, and thus iSARST should be a good platform for bioinformatics researchers to test new algorithms. In the near future, we will broaden the capabilities of iSARST by adding new modules that can specifically detect other interesting protein structural relationships such as 3D domain swapping (38) and non-CPs (39).
FUNDING
National Science Council, Taiwan, R.O.C. (grant numbers 96-3112-B-007-006, 97-2752-B-007-003-PAE and 97-3112-B-007-007). Funding for open access charge: National Science Council, Taiwan, R.O.C. (grant number 97-3112-B-007-007).
Conflict of interest statement. None declared.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
We would like to thank our colleague Chun-Ting Yeh, who set up the IBM BladeCenter system; Ling-Yun Wu, Yong Wang and Prof. Luonan Chen (Shanghai University, China) for the application of SAMO; Chin-Hsien Tai and Dr Byungkook Lee (National Cancer Institute, National Institutes of Health, USA) for providing helpful instructions on the implementation of SE; and the authors of BLAST, FAST and TM-align, which are extensively used in this work.
1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
2. Pearson WR. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 2000;132:185–219.
3. Sauder JM, Arthur JW, Dunbrack RL Jr. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins. 2000;40:6–22.
4. Yang JM, Tung CH. Protein structure database search and evolutionary classification. Nucleic Acids Res. 2006;34:3646–3659.
5. Levine M, Stuart D, Williams J. A method for the systematic comparison of the three-dimensional structures of proteins and some results. Acta Crystallogr. 1984;A40:600–610.
6. Lesk AM. Proceedings of Prague Stringology Club Workshop '98. Prague; 1998. pp. 95–100.
7. Martin AC. The ups and downs of protein topology; rapid comparison of protein structure. Protein Eng. 2000;13:829–837.
8. Carpentier M, Brouillet S, Pothier J. YAKUSA: a fast structural database scanning method. Proteins. 2005;61:137–151.
9. Lo WC, Huang PJ, Chang CH, Lyu PC. Protein structural similarity search by Ramachandran codes. BMC Bioinformatics. 2007;8:307.
10. Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998;11:739–747.
11. Lo WC, Lyu PC. CPSARST: an efficient circular permutation search tool applied to the detection of novel protein structural relationships. Genome Biol. 2008;9:R11.
12. Jeltsch A. Circular permutations in the molecular evolution of DNA methyltransferases. J. Mol. Evol. 1999;49:161–164.
13. Weiner J, 3rd, Thomas G, Bornberg-Bauer E. Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics. 2005;21:932–937.
14. Tsai LC, Shyur LF, Lee SH, Lin SS, Yuan HS. Crystal structure of a natural circularly permuted jellyroll protein: 1,3-1,4-beta-D-glucanase from Fibrobacter succinogenes. J. Mol. Biol. 2003;330:607–620.
15. Lindqvist Y, Schneider G. Circular permutations of natural protein sequences: structural evidence. Curr. Opin. Struct. Biol. 1997;7:422–427.
16. Vogel C, Morea V. Duplication, divergence and formation of novel protein topologies. Bioessays. 2006;28:973–978.
17. Qian Z, Lutz S. Improving the catalytic activity of Candida antarctica lipase B by circular permutation. J. Am. Chem. Soc. 2005;127:13466–13467.
18. Anantharaman V, Koonin EV, Aravind L. Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domains. J. Mol. Biol. 2001;307:1271–1292.
19. Todd AE, Orengo CA, Thornton JM. Plasticity of enzyme active sites. Trends Biochem. Sci. 2002;27:419–426.
20. Li L, Shakhnovich EI. Different circular permutations produced different folding nuclei in proteins: a computational study. J. Mol. Biol. 2001;306:121–132.
21. Chen J, Wang J, Wang W. Transition states for folding of circular-permuted proteins. Proteins. 2004;57:153–171.
22. Bulaj G, Koehn RE, Goldenberg DP. Alteration of the disulfide-coupled folding pathway of BPTI by circular permutation. Protein Sci. 2004;13:1182–1196.
23. Kojima M, Ayabe K, Ueda H. Importance of terminal residues on circularly permutated Escherichia coli alkaline phosphatase with high specific activity. J. Biosci. Bioeng. 2005;100:197–202.
24. Ostermeier M. Engineering allosteric protein switches by domain insertion. Protein Eng. Des. Sel. 2005;18:359–364.
25. Galarneau A, Primeau M, Trudeau LE, Michnick SW. Beta-lactamase protein fragment complementation assays as in vivo and in vitro sensors of protein protein interactions. Nat. Biotechnol. 2002;20:619–622.
26. Baird GS, Zacharias DA, Tsien RY. Circular permutation and receptor insertion within green fluorescent proteins. Proc. Natl Acad. Sci. USA. 1999;96:11241–11246.
27. Chen L, Wu LY, Wang Y, Zhang S, Zhang XS. Revealing divergent evolution, identifying circular permutations and detecting active-sites by protein structure comparison. BMC Struct. Biol. 2006;6:18.
28. Zhu J, Weng Z. FAST: a novel protein structure alignment algorithm. Proteins. 2005;58:618–627.
29. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33:2302–2309.
30. Tai CH, Vincent JJ, Kim C, Lee B. SE: an algorithm for deriving sequence alignment from a pair of superimposed structures. BMC Bioinformatics. 2009;10 (Suppl. 1):S4.
31. Aung Z, Tan KL. Rapid 3D protein structure database searching using information retrieval techniques. Bioinformatics. 2004;20:1045–1052.
32. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242.
33. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. The ASTRAL compendium in 2004. Nucleic Acids Res. 2004;32:D189–D192.
34. Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 1993;233:123–138.
35. Lu G. Top: a new method for protein structure comparisons and similarity searches. J. Appl. Cryst. 2000;33:176–183.
36. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA. 1992;89:10915–10919.
37. Lin H, Ma X, Chandramohan P, Geist A, Samatova N. IEEE International Parallel & Distributed Processing Symposium. CO: Denver; 2005.
38. Liu Y, Eisenberg D. 3D domain swapping: as domains continue to swap. Protein Sci. 2002;11:1285–1299.
39. Bujnicki JM. Sequence permutations in the molecular evolution of DNA methyltransferases. BMC Evol. Biol. 2002;2:3.
40. Rozwarski DA, Swami BM, Brewer CF, Sacchettini JC. Crystal structure of the lectin from Dioclea grandiflora complexed with core trimannoside of asparagine-linked carbohydrates. J. Biol. Chem. 1998;273:32818–32825.
41. Velloso LM, Svensson K, Schneider G, Pettersson RF, Lindqvist Y. Crystal structure of the carbohydrate recognition domain of p58/ERGIC-53, a protein involved in glycoprotein export from the endoplasmic reticulum. J. Biol. Chem. 2002;277:15979–15984.
