Nucleic Acids Res. Jul 1, 2011; 39(Web Server issue): W190–W196.
Published online Jun 6, 2011. doi:  10.1093/nar/gkr411
PMCID: PMC3125791

CSpritz: accurate prediction of protein disorder segments with annotation for homology, secondary structure and linear motifs

Abstract

CSpritz is a web server for the prediction of intrinsic protein disorder. It combines the previous Spritz with two novel orthogonal systems developed by our group (Punch and ESpritz). Punch is based on sequence and structural templates trained with support vector machines. ESpritz is an efficient single sequence method based on bidirectional recursive neural networks. Spritz was extended to filter predictions based on structural homologues. After extensive testing, predictions are combined by averaging their probabilities. The CSpritz website can process single or multiple sequences for either short or long disorder. The server provides a global output page for download and simultaneous statistics of all predictions. Links are provided to each individual protein, where the amino acid sequence and disorder prediction are displayed along with statistics for the individual protein. As a novel feature, CSpritz provides information about structural homologues as well as secondary structure and short functional linear motifs in each disordered segment. Benchmarking was performed on the very recent CASP9 data, where CSpritz would have ranked consistently well with a Sw measure of 49.27 and AUC of 0.828. The server, together with help and methods pages including examples, is freely available at URL: http://protein.bio.unipd.it/cspritz/.

INTRODUCTION

The 3D native structure of proteins has been considered the major determinant of function for many years. Over the last decade there has been a growing realization of an alternative mechanism whereby non-folding regions are both widespread and also carry functional significance (1,2). These non-folding regions within a protein, coming in various guises ranging from fully extended to molten globule-like and partially folded structures (3), are collectively known as intrinsically disordered regions (4). Such regions often become structured upon binding to a target molecule and have been shown to be involved in various biological processes such as cell signaling or regulation (5), DNA binding (6) and molecular recognition in general (3,7). An interesting observation is that the amount of disorder within a proteome seems to correlate with complexity of the organism, with an apparent increase in disorder for eukaryotic organisms (8,9). The conservation of disorder (10,11) and specific amino acid patterns (12,13) (e.g. PxPxP) have also been studied. Indeed, there is a growing realization that intrinsically disordered regions are widely used as hubs for protein–protein interactions (14), for which structural data can be accessed in the ComSin database (15). Functional linear motifs (16,17), which are mostly hidden in disordered regions (18), have been characterized in resources such as ELM (19), an online repository of linear motifs.

The experimental determination of native disorder, once considered an anomaly, can be time consuming, difficult and expensive. As a result, computational approaches have largely driven our understanding of disorder over the last decade (14). The biennial Critical Assessment of Techniques for protein Structure Prediction (CASP) experiment has included a disorder category since CASP5 in 2002 (20). Previously published methods can be roughly divided into biophysical and machine learning approaches. The former rely on the unique amino acid distribution associated with protein disorder (21–23). Machine learning methods use either neural networks (24–26) or support vector machines (9,27) and are commonly based on sequence profiles, predicted secondary structure and, more recently, template structures (28). Meta servers combining several biophysical and machine learning methods have also been published (29–31). All these methods have shown promising results, possibly for two reasons: (i) as the amino acid sequence contains all the information to determine structure, it is reasonable to assume that unstructured regions have specific amino acid propensities and (ii) disorder is important in many biological functions and therefore unstructured protein segments should be conserved by evolution. Given that disordered segments have a biased sequence composition, machine learning techniques should excel. In this paper we describe and benchmark CSpritz, an extension of our previous Spritz server (27) based on three distinct modules for the prediction of intrinsically disordered regions in proteins. The performance of the method is benchmarked on the latest available data for short and long disordered segments. A novel addition to the CSpritz server is information about homologous structures found from PSI-BLAST searches, secondary structure and linear motifs, contributing to the functional annotation of disordered segments.

MATERIALS AND METHODS

CSpritz predicts intrinsic disorder from protein sequences through a combination of three machine learning systems, which will be described in the following sections. Most methods consider short and long disorder separately, as they have different characteristics. Short disorder can be derived from residues missing backbone atoms in X-ray crystallographic structures deposited in the Protein Data Bank (PDB) (32). Long disorder is taken from the Disprot database (33) because it is largely missing from the PDB. All data sets used throughout training are appropriately redundancy reduced using UniqueProt (34) and in all cases contain only sequences available before May 2008 (i.e. the start of CASP8).

Spritz

The original Spritz (27) is based on PSI-BLAST (35) multiple sequence profiles and predicted secondary structure. Support Vector Machines (SVMs) were used on a local sequence window to train two specialized binary classifiers, for long and short regions of disorder. A description of the data sets can be found in the previous publication (27). In addition to the original ab initio version of Spritz, a filter removing PDB structural homologues from predicted disorder is implemented. This works by performing a PSI-BLAST search against a redundancy reduced sequence database. The generated sequence profile is then used in a final PSI-BLAST round against a filtered PDB. Residues matching a structural template are assigned a Spritz score below the disorder threshold.
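The last step of the homologue filter can be illustrated with a minimal Python sketch. The text does not say how the score is lowered, so capping it just below the threshold is an assumption, as are the function name, the threshold of 0.5 and the margin; the boolean mask of template-covered residues is assumed to come from the final PSI-BLAST round.

```python
def apply_template_filter(scores, template_covered, threshold=0.5, margin=0.01):
    """Cap the score of template-covered residues just below the threshold.

    scores           -- per-residue disorder scores (floats)
    template_covered -- per-residue booleans, True where a PDB template aligns
    threshold/margin -- placeholder values, not taken from the paper
    """
    capped = threshold - margin
    return [
        min(score, capped) if covered else score
        for score, covered in zip(scores, template_covered)
    ]
```

Residues without a matching template keep their ab initio score, so the filter can only remove predicted disorder and never adds it.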

Punch

Punch is an SVM-based predictor extending Spritz. Sequence and structural homologues are detected as in Spritz. In addition, Porter secondary structure (36) and PaleAle relative solvent accessibility (37) are included. Unlike Spritz, information about structural templates is encoded and fed directly to the SVM together with the other inputs. The two data sets used for learning (see Supplementary Data) are a large set of disordered X-ray chains derived from the PDB (December 2007) and a publicly available data set (24) based on disordered X-ray segments from the PDB (May 2004). The assignment of disorder differs between the two data sets and does not necessarily intersect.

ESpritz

ESpritz is a fast predictor using bidirectional recursive neural networks (BRNNs) (38). BRNNs do not require contextual windows because they extract this information dynamically from the sequence. The ESpritz input layer consists of 20 units, one for each of the 20 amino acids. Although the method is very simple, the BRNN is useful for extracting relevant patterns required for disorder without the use of PSI-BLAST sequence alignments (results not shown). As for Spritz, two data sets are designed, one for long and one for short disorder (see Supplementary Data). The short disorder set is built from X-ray PDB structures (May 2008). Long disorder segments are extracted from Disprot (version 3.7) with identical sequences removed.
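A single-sequence input of this kind can be sketched as a one-hot encoding with 20 units per residue. This is only an illustration: the ordering of the amino acids and the all-zero vector for non-standard residues are assumptions, and the BRNN itself is not shown.

```python
# The 20 standard amino acids; this ordering is an assumption for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-element vector per residue.

    Residues outside the 20 standard amino acids (e.g. 'X') are encoded
    as all zeros, which is an assumption and not taken from the paper.
    """
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        if residue in INDEX:
            vector[INDEX[residue]] = 1
        encoded.append(vector)
    return encoded
```

Because the encoding needs no alignment, it can be computed directly from the query sequence, which is consistent with ESpritz running without a PSI-BLAST search.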

Linear motifs and secondary structure

It can be useful to unify the following information for disordered segments: (i) amino acids involved; (ii) secondary structure; and (iii) important linear motifs. CSpritz offers this predicted information in various forms (see output section). Secondary structure propensities are predicted from Porter (36). Linear motifs (LMs) are selected from ELM (19) as the ligand binding subset (names starting with LIG). ELM is a resource for predicting functional sites in eukaryotic proteins where functional sites are identified by patterns. These motifs are supposed to be representative of the more studied LM–protein binding examples. The selected LMs are returned when sub-sequences are matched by their regular expressions in ELM.
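Matching linear motifs by regular expression inside a disordered segment could look like the following Python sketch. The motif name and pattern in the table are hypothetical placeholders and are not copied from ELM; segment coordinates are assumed to be 0-based with an exclusive end.

```python
import re

# Hypothetical motif table: name and pattern are illustrative, not ELM entries.
HYPOTHETICAL_MOTIFS = {
    "LIG_EXAMPLE_PxxP": r"P..P",
}


def find_motifs_in_segment(sequence, start, end, motifs=HYPOTHETICAL_MOTIFS):
    """Return (name, start, end, matched_text) for motifs inside sequence[start:end].

    A lookahead is used so that overlapping occurrences are also reported.
    Returned coordinates refer to the full sequence (0-based, end exclusive).
    """
    segment = sequence[start:end]
    hits = []
    for name, pattern in motifs.items():
        for match in re.finditer(r"(?=(" + pattern + r"))", segment):
            hit_start = start + match.start(1)
            hits.append((name, hit_start, hit_start + len(match.group(1)), match.group(1)))
    return hits
```

Restricting the search to the disordered segment rather than the whole sequence keeps the reported motifs tied to the disorder prediction, as in the server output.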

PERFORMANCE EVALUATION

Combination

Experiments were carried out to find the best procedure to combine Punch, Spritz and ESpritz. After trying majority voting, unanimous votes and combination with neural networks, the simplest method of averaging the probabilities produced by each system was found to be the best (data not shown). The optimal decision threshold was determined on data independent from the benchmarking set by maximizing the Sw measure (39). CASP8 data (39) was used for short and Disprot (version 3.7) for long disorder. Regular expressions are incorporated to fill disordered regions separated by less than three residues. The Pearson correlation of the probabilities produced on CASP9 disorder targets was calculated to test how different the three predictors are. Table 1 shows this correlation and indicates that the three systems are indeed sufficiently different. This is important for combining the three systems, since it is well known that ensembling predictions which are different or uncorrelated improves generalization performance considerably (40). In particular, combination is especially beneficial when the wrongly predicted residues for each predictor do not correlate (i.e. their probabilities do not correlate) (41,42).
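The combination steps described in this section (averaging, thresholding, gap filling and the Pearson correlation between predictors) can be sketched in Python as follows. The function names, the 'D'/'O' state encoding and the default threshold of 0.5 are assumptions for illustration; the actual thresholds were tuned on CASP8 and Disprot data and are not given in the text.

```python
import re
from math import sqrt


def average_probabilities(*predictions):
    """Average per-residue probabilities from predictors of equal length."""
    return [sum(values) / len(values) for values in zip(*predictions)]


def classify(probabilities, threshold=0.5):
    """Label residues 'D' (disorder) or 'O' (order); 0.5 is a placeholder."""
    return "".join("D" if p >= threshold else "O" for p in probabilities)


def fill_short_gaps(states, max_gap=2):
    """Relabel ordered runs of at most max_gap residues flanked by disorder.

    max_gap=2 corresponds to regions separated by less than three residues.
    """
    pattern = re.compile(r"(?<=D)O{1,%d}(?=D)" % max_gap)
    return pattern.sub(lambda m: "D" * len(m.group(0)), states)


def pearson(x, y):
    """Pearson correlation of two equal-length lists (undefined if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)
```

A typical call chain would be `fill_short_gaps(classify(average_probabilities(punch, spritz, espritz)))`, where the three arguments are hypothetical per-residue probability lists from the individual predictors.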

Table 1.
Pearson correlation of the three systems on CASP9 targets

Benchmarking sets

Validation of short disorder segments is performed on the 117 CASP9 targets (URL: http://www.predictioncenter.org/casp9/), comparing with other groups taking part in the disorder category experiment according to their official CASP results. In order to validate the long disorder segments we choose DisProt entries enriched with PDB annotation from the SL data set defined in (43). Unfortunately, selecting sequences with <40% sequence identity to our training set leaves only 29 proteins. We also define a set of 569 X-ray sequences (Xray569) deposited in the PDB (resolution at most 2.5 Å and R-free <0.25) between May 2008 and September 2010 reduced by sequence identity using UniqueProt (34) to an HSSP value of 0 to our training data and among each other. Supplementary Table S1 shows the size and composition of the validation data sets. Note that to ensure a fair comparison to other methods on our benchmarking sets, CSpritz was in all cases run with sequence and PDB databases frozen prior to May 2008.

CASP short disorder

To assess the performance of our server for the short disorder option, we rank all groups participating in the CASP9 experiment. Table 2 shows the top 5 (out of 32) groups plus CSpritz and Spritz ranked by Sw, a commonly used measure at CASP. For Sw, as in the CASP8 assessment (39), the statistical significance of the evaluation scores was determined by bootstrapping: 80% of the targets were randomly selected 1000 times, and the standard error of the scores was calculated (i.e. 1.96 × standard error gives a 95% confidence interval around the mean for normally distributed scores). For a full list of rankings see the online methods page. Our results suggest a consistently good performance of our server, especially when taking into account that some of the top five are meta-servers and some are not publicly available.
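The bootstrapping procedure can be sketched in Python as below. The scoring function is passed in as a parameter because the Sw formula is not defined in this text. Sampling without replacement, the fixed seed and the function name are assumptions made for the illustration.

```python
import random
from statistics import mean, stdev


def bootstrap_score(targets, score_fn, fraction=0.8, rounds=1000, seed=0):
    """Estimate the mean score, its standard error and a 95% interval.

    targets  -- list of per-target data in whatever form score_fn expects
    score_fn -- callable mapping a list of targets to a single score
    Each round draws `fraction` of the targets without replacement
    (an assumption; the paper only says they were randomly selected).
    """
    rng = random.Random(seed)
    subset_size = max(1, round(fraction * len(targets)))
    scores = [score_fn(rng.sample(targets, subset_size)) for _ in range(rounds)]
    centre = mean(scores)
    standard_error = stdev(scores)
    interval = (centre - 1.96 * standard_error, centre + 1.96 * standard_error)
    return centre, standard_error, interval
```

Two methods whose 95% intervals do not overlap can then be regarded as significantly different under the normality assumption stated above.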

Table 2.
Results for the top five performing groups at the CASP9 experiment, CSpritz and the original Spritz

DisProt long disorder

The long disorder type performance of CSpritz was benchmarked by comparing Sw, accuracy and AUC with the original Spritz and state-of-the-art predictors PONDR-FIT (30), Disopred (9) and IUPred (23). Table 3 shows CSpritz performing significantly better than the other predictors for this type of disorder. In addition CSpritz improves over the long disorder predictions made by our previous server Spritz.

Table 3.
Comparison for DisProt disordered regions

Large-scale performance

To estimate the run time of CSpritz compared to others and validate the predictions on a larger set of PDB structures we use the Xray569 set. The results (Supplementary Table S2) are similar to the DisProt set and confirm the performance of CSpritz compared to the other methods. As can be expected, all methods are better at predicting disorder at the N- and C-termini than in the central part of the protein sequences. The execution time for CSpritz is largely determined by the PSI-BLAST search and comparable to the original Spritz and Disopred2, with ca. 15 min for an average protein. When executing multiple predictions, the CSpritz web server will run up to five proteins in parallel, reducing the overall time significantly.

SERVER DESCRIPTION

The CSpritz input page is designed with simplicity in mind. One or more sequences in FASTA format are the only required input and can be either pasted or uploaded as a file. Pasting is limited to 32 000 characters but uploading has no restrictions. User email address and a query title are optional. Either short (default) or long disorder options can be selected, with the appropriate decision thresholds determined on data not involved in the benchmarking. To facilitate navigation, help and methods pages are available at the top of the interface.
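For readers preparing batch input, the following Python sketch splits multi-sequence FASTA text into records and optionally enforces the 32 000 character limit that applies to pasted input. It is a client-side illustration only; the function name and error message are assumptions and do not describe the server's own parser.

```python
def parse_fasta(text, max_chars=None):
    """Split FASTA text into (header, sequence) tuples.

    max_chars -- optional limit on the raw text length, e.g. 32000 to
                 mimic the paste limit mentioned above (None = no limit).
    """
    if max_chars is not None and len(text) > max_chars:
        raise ValueError("input exceeds %d characters" % max_chars)
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines before any header are ignored
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```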

The CSpritz output is presented in two main pages. The first page, displaying statistics, links to individual pages and a downloadable archive for all user supplied proteins, is present only if more than one sequence was submitted. A histogram of disordered segments and an archive for download containing all generated data are also available. Figure 1 shows a sample global page for the 117 CASP9 targets.

Figure 1.
Global output page for multiple sequences. Summary statistics are displayed for some interesting values about the disorder segments of all query sequences. An archive is offered for download containing all disorder predictions, linear motifs and statistics ...

The second output displays predicted disorder and annotation for individual proteins. In addition to showing the sequence with predicted secondary structure and disorder, several statistics regarding the distribution of disorder are presented. An extensive description of the output is available as part of the online help page. Two graphs plot the probability of disorder and the number of available structural templates versus disordered regions in homologous PDB structures. The last part of the output concerns the presence of putative linear motifs and secondary structure propensity for disordered segments. This can be a useful source of functional annotation, as shown in Figure 2 for Drosophila melanogaster Cryptochrome (dCRY). Following computational analysis, functional linear motifs were experimentally confirmed in the disordered C-terminus of dCRY (44). CSpritz aims to speed up this type of analysis by providing additional clues. In dCRY the putative linear motifs (Figure 2) match the disordered residues having a favorable alpha helical propensity. It is known that many such interactions involve disorder to secondary structure transitions upon binding (45).

Figure 2.
Individual output page for D. melanogaster Cryptochrome. The main figure shows the list of available files and actual disorder prediction. The latter is composed of the amino acid sequence, its predicted secondary structure and the CSpritz disorder classification, ...

CONCLUSIONS

We have described CSpritz, a novel web server for the prediction of intrinsically disordered protein segments from sequence. It allows the batch prediction of many sequences simultaneously, providing overview statistics. The single protein sequence is annotated with disorder and useful information regarding local secondary structure and possible interaction motifs, providing a first step towards the functional interpretation of disorder. Future work will concentrate on improving the functional description of disordered regions by including other types of related information such as repeats (46) and aggregation (47).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

University of Padova (CPDA098382, CPDR097328 to S.T.); FIRB Futuro in Ricerca (RBFR08ZSXY to S.T.). Funding for open access charge: FIRB Futuro in Ricerca grant from the Italian Ministry of Education, University and Research (MIUR).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

The authors are grateful to members of the BioComputing UP lab for insightful discussions.

REFERENCES

1. Wright PE, Dyson HJ. Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol. 1999;293:321–331. [PubMed]
2. Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol. 2005;6:197–208. [PubMed]
3. Tompa P, Fuxreiter M. Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trends Biochem. Sci. 2008;33:2–8. [PubMed]
4. Tompa P. Intrinsically unstructured proteins. Trends Biochem. Sci. 2002;27:527–533. [PubMed]
5. Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z. Intrinsic disorder and protein function. Biochemistry. 2002;41:6573–6582. [PubMed]
6. Weiss MA, Ellenberger T, Wobbe CR, Lee JP, Harrison SC, Struhl K. Folding transition in the DNA-binding domain of GCN4 on specific binding to DNA. Nature. 1990;347:575–578. [PubMed]
7. Tompa P, Fuxreiter M, Oldfield CJ, Simon I, Dunker AK, Uversky VN. Close encounters of the third kind: disordered domains and the interactions of proteins. Bioessays. 2009;31:328–335. [PubMed]
8. Dunker AK, Obradovic Z, Romero P, Garner EC, Brown CJ. Intrinsic protein disorder in complete genomes. Genome Inform. Ser. Workshop Genome Inform. 2000;11:161–171. [PubMed]
9. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol. 2004;337:635–645. [PubMed]
10. Schaefer C, Schlessinger A, Rost B. Protein secondary structure appears to be robust under in silico evolution while protein disorder appears not to be. Bioinformatics. 2010;26:625–631. [PMC free article] [PubMed]
11. Siltberg-Liberles J. Evolution of structurally disordered proteins promotes neostructuralization. Mol. Biol. Evol. 2010;28:59–62. [PMC free article] [PubMed]
12. Lise S, Jones DT. Sequence patterns associated with disordered regions in proteins. Proteins. 2005;58:144–150. [PubMed]
13. Lobanov MY, Furletova EI, Bogatyreva NS, Roytberg MA, Galzitskaya OV. Library of disordered patterns in 3D protein structures. PLoS Comput. Biol. 2010;6:e1000958. [PMC free article] [PubMed]
14. Russell RB, Gibson TJ. A careful disorderliness in the proteome: sites for interaction and targets for future therapies. FEBS Lett. 2008;582:1271–1275. [PubMed]
15. Lobanov MY, Shoemaker BA, Garbuzynskiy SO, Fong JH, Panchenko AR, Galzitskaya OV. ComSin: database of protein structures in bound (complex) and unbound (single) states in relation to their intrinsic disorder. Nucleic Acids Res. 2010;38:D283–D287. [PMC free article] [PubMed]
16. Gibson TJ. Cell regulation: determined to signal discrete cooperation. Trends Biochem. Sci. 2009;34:471–482. [PubMed]
17. Diella F, Haslam N, Chica C, Budd A, Michael S, Brown NP, Trave G, Gibson TJ. Understanding eukaryotic linear motifs and their role in cell signaling and regulation. Front Biosci. 2008;13:6580–6603. [PubMed]
18. Fuxreiter M, Tompa P, Simon I. Local structural disorder imparts plasticity on linear motifs. Bioinformatics. 2007;23:950–956. [PubMed]
19. Gould CM, Diella F, Via A, Puntervoll P, Gemund C, Chabanis-Davidson S, Michael S, Sayadi A, Bryne JC, Chica C, et al. ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res. 2010;38:D167–D180. [PMC free article] [PubMed]
20. Melamud E, Moult J. Evaluation of disorder predictions in CASP5. Proteins. 2003;53(Suppl. 6):561–565. [PubMed]
21. Uversky VN. What does it mean to be natively unfolded? Eur. J. Biochem. 2002;269:2–12. [PubMed]
22. Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins. 2005;61(Suppl. 7):176–182. [PubMed]
23. Dosztanyi Z, Csizmok V, Tompa P, Simon I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J. Mol. Biol. 2005;347:827–839. [PubMed]
24. Cheng J, Sweredoski MJ, Baldi P. Accurate prediction of protein disordered regions by mining protein structure data. Data Min Knowl Disc. 2005;11:213–222.
25. Jones DT, Ward JJ. Prediction of disordered regions in proteins from position specific score matrices. Proteins. 2003;53(Suppl. 6):573–578. [PubMed]
26. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB. Protein disorder prediction: implications for structural proteomics. Structure. 2003;11:1453–1459. [PubMed]
27. Vullo A, Bortolami O, Pollastri G, Tosatto SC. Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Res. 2006;34:W164–W168. [PMC free article] [PubMed]
28. McGuffin LJ. Intrinsic disorder prediction from the analysis of multiple protein fold recognition models. Bioinformatics. 2008;24:1798–1804. [PubMed]
29. Mizianty MJ, Stach W, Chen K, Kedarisetti KD, Disfani FM, Kurgan L. Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics. 2010;26:i489–i496. [PMC free article] [PubMed]
30. Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN. PONDR-FIT: a meta-predictor of intrinsically disordered amino acids. Biochim. Biophys. Acta. 2010;1804:996–1010. [PMC free article] [PubMed]
31. Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B. Improved disorder prediction by combination of orthogonal approaches. PLoS One. 2009;4:e4433. [PMC free article] [PubMed]
32. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007;35:D301–D303. [PMC free article] [PubMed]
33. Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN, et al. DisProt: the database of disordered proteins. Nucleic Acids Res. 2007;35:D786–D793. [PMC free article] [PubMed]
34. Mika S, Rost B. UniqueProt: creating representative protein sequence sets. Nucleic Acids Res. 2003;31:3789–3791. [PMC free article] [PubMed]
35. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
36. Pollastri G, McLysaght A. Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics. 2005;21:1719–1720. [PubMed]
37. Pollastri G, Martin AJ, Mooney C, Vullo A. Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information. BMC Bioinformatics. 2007;8:201. [PMC free article] [PubMed]
38. Baldi P, Pollastri G. The principled design of large-scale recursive neural network architectures–dag-rnns and the protein structure prediction problem. J. Mach. Learn. Res. 2003;4:575–602.
39. Noivirt-Brik O, Prilusky J, Sussman JL. Assessment of disorder predictions in CASP8. Proteins. 2009;77(Suppl. 9):210–216. [PubMed]
40. Sollich P, Krogh A. Learning with ensembles: how over-fitting can be useful. Adv. Neural Inform. Processing Sys. 1996;8:190–196.
41. Albrecht M, Tosatto SC, Lengauer T, Valle G. Simple consensus procedures are effective and sufficient in secondary structure prediction. Protein Eng. 2003;16:459–462. [PubMed]
42. Ali KM, Pazzani MJ. Error reduction through learning multiple descriptions. Mach. Learn. 1996;24:173–202.
43. Sirota FL, Ooi HS, Gattermayer T, Schneider G, Eisenhaber F, Maurer-Stroh S. Parameterization of disorder predictors for large-scale applications requiring high specificity by using an extended benchmark dataset. BMC Genomics. 2010;11(Suppl. 1):S15. [PMC free article] [PubMed]
44. Hemsley MJ, Mazzotta GM, Mason M, Dissel S, Toppo S, Pagano MA, Sandrelli F, Meggio F, Rosato E, Costa R, et al. Linear motifs in the C-terminus of D. melanogaster cryptochrome. Biochem. Biophys. Res. Commun. 2007;355:531–537. [PubMed]
45. Vanhee P, Stricher F, Baeten L, Verschueren E, Lenaerts T, Serrano L, Rousseau F, Schymkowitz J. Protein-peptide interactions adopt the same structural motifs as monomeric protein folds. Structure. 2009;17:1128–1136. [PubMed]
46. Marsella L, Sirocco F, Trovato A, Seno F, Tosatto SC. REPETITA: detection and discrimination of the periodicity of protein solenoid repeats by discrete Fourier transform. Bioinformatics. 2009;25:i289–i295. [PMC free article] [PubMed]
47. Trovato A, Seno F, Tosatto SC. The PASTA server for protein aggregation prediction. Protein Eng. Des. Sel. 2007;20:521–523. [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
