This article presents the web server iPDA, which aims at identifying the disordered regions of a query protein. Automatic prediction of disordered regions from protein sequences is an important problem in structural biology. The proposed classifier DisPSSMP2 differs from several existing disorder predictors in its use of position-specific scoring matrices with respect to physicochemical properties (PSSMP), where the physicochemical properties adopted here explicitly take the disorder propensity of amino acids into account. The web server iPDA integrates DisPSSMP2 with several other sequence predictors in order to investigate the functional role of the detected disordered regions. The predicted information includes sequence conservation, secondary structure, sequence complexity and hydrophobic clusters. According to the proportion of secondary structure elements predicted, iPDA dynamically adjusts the cutting threshold for determining protein disorder. Furthermore, a pattern mining package for detecting sequence conservation is embedded in iPDA for discovering potential binding regions of the query protein, which helps to uncover the relationship between protein function and primary sequence.


INTRODUCTION
Intrinsically disordered proteins or protein regions exhibit unstable and changeable three-dimensional structures under physiological conditions (1). Although lacking fixed structures, protein disorder has been identified to carry out important functions in many biological processes (1,2). In addition, it is observed that the absence of a rigid structure allows disordered binding regions to interact with several different targets (3,4). These regions, sometimes called 'molecular recognition elements', usually undergo a disorder-to-order transition when binding to their targets (5,6). In this regard, predicting protein disorder and investigating its potential of induced folding is a necessary preliminary to understanding protein structure and function (7).
The proposed web server iPDA aims at providing an integrated environment for detecting disordered regions and exploring their functional roles. In our recent work DisPSSMP (8), it was demonstrated that the accuracy of protein disorder prediction can be greatly improved if the disorder propensity of amino acids is considered when generating the condensed position-specific scoring matrix (PSSM) features. For iPDA, we implement a two-stage classifier of Radial Basis Function Networks (RBFN) to further enhance the predictive power of DisPSSMP. Because the datasets used to train the new classifier DisPSSMP2 are unbalanced, with far more ordered than disordered residues, an alternative decision function is adopted and the cutting threshold is determined dynamically from the proportion of predicted secondary structure in the query protein.
iPDA takes an amino acid sequence as input and reports the prediction of disordered residues with graphical plots, along with various sequence characteristics that are believed to be important when investigating the so-called induced folding behavior (6). The provided information includes sequence conservation from multiple sequence alignment (ClustalW) (9), concurrent sequence conservation from pattern mining (WildSpan) (10,11), secondary structure prediction (Jnet and PSIPRED) (12,13), low-complexity regions (CARD) (14) and hydrophobic clusters. Romero et al. stated that low-complexity regions are usually located in long disordered regions (15), where sequence complexity is measured by Shannon's entropy. In addition, Ferron et al. mentioned in their recent study that hydrophobic clusters and secondary structures can provide distinct clues for investigating induced folding (6). Meanwhile, we observe that sequence conservation is essential for disordered regions to maintain their functionality. Therefore, iPDA further provides a pattern mining utility to detect motifs in specified disordered and/or ordered regions in order to predict potential intra- and inter-molecular interactions.

METHODS
The architecture of iPDA is shown in Figure 1. Given an amino acid sequence, iPDA performs various sequence analyses by invoking several well-established predictors. In addition, iPDA provides two utilities for examining missing residues in PDB structures. The details of how each predicting package is incorporated into iPDA are discussed in the following subsections.

Protein disorder prediction
In our recent work DisPSSMP (8), a condensed PSSM with respect to physicochemical properties (PSSMP) was considered when generating feature profiles to build the classifier, where the PSSMP merges several amino acid columns of a PSSM that belong to a certain property into a single column. Besides, DisPSSMP decomposed each conventional physicochemical property of amino acids into two disjoint groups which have a propensity for order and disorder, respectively. The experimental results revealed that the PSSMP features with disorder propensity considered perform better than both the PSSMP features from traditional physicochemical properties and the original PSSM features on this problem.
The web server iPDA implements a two-stage classifier of RBFN, named DisPSSMP2, to further enhance the predictive power of DisPSSMP. Figure 2 shows the system flow, in which the procedure for generating the feature set FS-PSSMP-4 from a protein sequence was described in (8). In the first stage, the RBFN outputs the probabilities of being disordered and ordered for a given residue R_i, named Disorder(R_i) and Order(R_i), respectively. While our previous predictor DisPSSMP takes the larger probability as the prediction, in this article the values of Disorder(R_i) and Order(R_i) are collected with a sliding window to generate the feature set used in the second stage. In DisPSSMP2, the size of the sliding window and the cutting threshold for predicting disorder used in the second stage have to be determined through cross-validation. For the training and validation processes, six datasets have been extracted from different databases, as summarized in Table 1. The datasets PDB693 and D184 were first collected when developing DisPSSMP (8). However, only about 30% of the ordered residues in the training set were included by DisPSSMP when constructing the classifier. In order to fully exploit the knowledge present in the training datasets, DisPSSMP2 uses all ordered residues in the datasets PDB652 and D184, and further adds another globular protein set named G200 (17) to enhance the accuracy of predicting ordered residues. [PDB652 excludes the sequences of PDB693 used in (8) that share more than 70% sequence identity with any protein sequence in the other training sets, as determined by running CD-HIT (21), resulting in 652 proteins.]
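The sliding-window step that feeds the second stage can be illustrated with a minimal sketch. It assumes the first-stage outputs are available as (Disorder, Order) pairs per residue; the function name `window_features`, the zero padding at the chain termini and the flat feature layout are illustrative assumptions, not details taken from DisPSSMP2. The default window of 47 is the value reported in the text.

```python
from typing import List, Tuple

def window_features(
    scores: List[Tuple[float, float]],
    window: int = 47,
    pad: Tuple[float, float] = (0.0, 0.0),  # assumption: termini are zero-padded
) -> List[List[float]]:
    """Collect first-stage (Disorder, Order) pairs around each residue.

    Returns one flat feature row of length 2 * window per residue.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    n = len(scores)
    features = []
    for i in range(n):
        row: List[float] = []
        for j in range(i - half, i + half + 1):
            # Positions outside the chain fall back to the padding pair.
            d, o = scores[j] if 0 <= j < n else pad
            row.extend((d, o))
        features.append(row)
    return features
```

Each row would then be passed to the second-stage classifier; how iPDA actually treats residues near the termini is not stated in the text.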
Because the datasets used to train the RBFN classifier are unbalanced (284 059 ordered versus 76 481 disordered residues), an alternative decision function is adopted to avoid the problem of under-prediction (17). A residue is predicted as disordered if [Disorder(R_i) − Order(R_i) + 1]/2 is greater than a cutting threshold. The 936 protein chains used for training are partitioned into five groups. According to the 5-fold cross-validation, the performance of DisPSSMP2 is about the same when the cutting threshold is set between 0.3 and 0.4. It has been observed that unstructured proteins on average contain fewer secondary structure elements than globular proteins (17). In this regard, we propose setting the cutting threshold of DisPSSMP2 dynamically according to the estimated proportion of secondary structure in the query protein. Since we expect DisPSSMP2 to predict more disorder for the practical use of iPDA, the cutting threshold is set to '0.4 − the proportion of coils × 0.2', which yields a cutting threshold lower than 0.3 if the proportion of coils is greater than 50%. The window size used in the second stage is also determined through cross-validation. Window sizes between 35 and 59 perform similarly; thus, a window size of 47 is adopted. To evaluate the performance of DisPSSMP2, the benchmark proposed by Yang et al. (18) is employed as the blind testing data, as listed in Table 1.
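The decision rule and the dynamic threshold described above amount to a few lines of arithmetic. The sketch below is an illustration only: the function names are hypothetical, and it assumes the proportion of coils is supplied as a fraction between 0 and 1 (for example, derived from the Jnet prediction).

```python
from typing import List, Tuple

def disorder_score(disorder: float, order: float) -> float:
    """Map the two classifier outputs onto a single score in [0, 1]."""
    return (disorder - order + 1.0) / 2.0

def cutting_threshold(coil_fraction: float) -> float:
    """Threshold = 0.4 - (proportion of coils) * 0.2, as stated in the text."""
    if not 0.0 <= coil_fraction <= 1.0:
        raise ValueError("coil_fraction must lie between 0 and 1")
    return 0.4 - coil_fraction * 0.2

def predict_disorder(
    outputs: List[Tuple[float, float]], coil_fraction: float
) -> List[bool]:
    """Return True for residues whose score exceeds the dynamic threshold."""
    threshold = cutting_threshold(coil_fraction)
    return [disorder_score(d, o) > threshold for d, o in outputs]
```

With no predicted coil the threshold is 0.4; with 50% coil it is 0.3; with a fully coiled prediction it drops to 0.2, so proteins with little regular secondary structure are labeled as disordered more readily.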

Sequence conservation
It has been observed that residues within structural domains usually have higher conservation scores than those in domain linkers (19). To derive conservation information, the homologues of a query protein are collected by invoking PSI-BLAST (20) against the Swiss-Prot database with an e-value threshold of 0.01. Redundant sequences are then removed by executing CD-HIT (21) with the threshold set to 70%. Using these homologues, iPDA provides two levels of sequence conservation to investigate the functional regions of the query protein.
The sequence conservation with respect to a single position is calculated from the multiple sequence alignment generated by ClustalW (9). The conservation of a given position is defined as the proportion of aligned sequences that carry, at that position, the amino acid type observed in the query protein. Only the top 10% conserved residues are highlighted by iPDA.
Next, a second level of sequence conservation, called concurrent conservation, is derived by employing sequential pattern mining. The employed algorithm is named WildSpan for its ability to generate patterns across large wildcard regions (11). WildSpan is used in the web server MAGIIC-PRO (10) to detect functional signatures directly from unaligned sequences. A pattern generated by WildSpan contains residues that are simultaneously conserved but widely separated in the protein sequence. Hsu et al. (22) observed that 90% of the concurrent conserved blocks discovered by WildSpan interact with at least one of the other blocks in space. Since disordered fragments of a protein might undergo a disorder-to-order transition to interact with each other when binding ligands or other proteins, it is expected that their conservation propensity will be revealed by ClustalW and that the concurrent conservation can be discovered by WildSpan.
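The single-position conservation score can be sketched as follows, under one reading of the definition given above: the fraction of alignment rows that share the query's amino acid in each column. The function names, the inclusion of the query row in the count and the treatment of gaps (columns where the query has a gap are skipped; gaps in homologues count as mismatches) are assumptions made for illustration, not documented iPDA behavior.

```python
from typing import List

def position_conservation(query: str, homologues: List[str]) -> List[float]:
    """Per-residue conservation for the query row of an alignment.

    All rows must come from the same alignment and use '-' for gaps.
    """
    if any(len(row) != len(query) for row in homologues):
        raise ValueError("all aligned rows must have the same length")
    rows = [query] + homologues  # assumption: the query itself is counted
    scores = []
    for col, aa in enumerate(query):
        if aa == "-":
            continue  # column does not correspond to a query residue
        matches = sum(1 for row in rows if row[col] == aa)
        scores.append(matches / len(rows))
    return scores

def top_conserved(scores: List[float], fraction: float = 0.10) -> List[int]:
    """Indices of the most conserved query residues (top 10% by default)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Ties at the cut-off are broken by sequence order here, which is an arbitrary choice.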

Iterative pattern mining
As more and more disordered regions of proteins are found to be functionally significant, mining conserved patterns in disordered regions is essential for understanding protein function (23). However, Brown et al. (24) observed that disordered regions usually have higher evolutionary rates than ordered regions, which makes it difficult to detect the conserved patterns of disordered regions when the entire protein is used as the query. Therefore, iPDA provides users with an iterative mining strategy. Users can select regions of interest or mask unwanted segments of the query protein, and then invoke WildSpan iteratively to find conserved fragments other than the most highly conserved positions. Two parameters are requested upon calling WildSpan: (1) b stands for the  (2) k is the minimum number of such patterns wanted. The default setting for the first call of WildSpan in iPDA is b = 3 and k = 1.

Secondary structure
Since induced folding regions have been shown to have secondary structure propensity, iPDA provides prediction results from two secondary structure predictors, Jnet (v0.1) and PSIPRED (v2.5) (12,13). These predictors were selected to complement each other, because our recent study showed that the available secondary structure prediction packages behave divergently, especially on short secondary structure elements (17). Confidence is higher when the two predictors agree, while users should treat the prediction with caution when the results are inconsistent. The information predicted by Jnet is also used by DisPSSMP2 in determining the cutting threshold of the classifier.

Low-complexity regions
Low-complexity regions of a protein are usually disordered, but disordered regions do not always have low complexity (15). iPDA adopts CARD (14) to predict low-complexity regions. This information helps to strengthen confidence in the disordered regions predicted by DisPSSMP2.
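As a reminder of what sequence complexity means here, the Introduction notes that it can be measured by Shannon's entropy. The sketch below computes that entropy over a sliding window. It is not the CARD algorithm used by iPDA; the function names and the window length of 12 are hypothetical choices for illustration.

```python
import math
from collections import Counter
from typing import List

def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    # sum of p * log2(1 / p) over the residue types present in the segment
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def windowed_entropy(sequence: str, window: int = 12) -> List[float]:
    """Entropy of every full-length window along the sequence.

    The window length is an illustrative assumption, not an iPDA setting.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    return [
        shannon_entropy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]
```

Windows dominated by one or two residue types give values close to zero, which is the signature of a low-complexity region.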

Hydrophobic clusters
Many studies have shown that the amino acid composition of disordered regions differs from that of ordered regions (25). For example, aromatic hydrophobic amino acids are known to be favored in ordered regions and are thus less common in disordered regions (26). Callebaut et al. categorized the 20 amino acids into three groups for hydrophobic cluster analysis and identified 'VILFMYW' as hydrophobic residues (27). It has been shown that structured segments have more hydrophobic clusters than linker segments. Thus, the hydrophobic cluster information is provided to validate the prediction of ordered regions. Here, iPDA assigns a position to a hydrophobic cluster if more than 5 of the 11 residues formed by the position itself and its 10 neighbors (five on its left and five on its right) belong to the hydrophobic group 'VILFMYW'.
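The hydrophobic cluster rule stated above translates directly into a windowed count. In this minimal sketch the function name is hypothetical, and the handling of positions near the termini (the window is simply truncated) is an assumption, since the text does not say how iPDA treats them.

```python
from typing import List

HYDROPHOBIC = frozenset("VILFMYW")  # hydrophobic group of Callebaut et al.

def hydrophobic_cluster_flags(
    sequence: str, flank: int = 5, min_count: int = 5
) -> List[bool]:
    """Flag positions whose 11-residue neighborhood is mostly hydrophobic.

    A position is flagged when more than min_count residues among itself
    and its flank neighbors on each side are hydrophobic.
    """
    seq = sequence.upper()
    flags = []
    for i in range(len(seq)):
        # Assumption: windows are truncated at the sequence termini.
        window = seq[max(0, i - flank): i + flank + 1]
        count = sum(1 for aa in window if aa in HYDROPHOBIC)
        flags.append(count > min_count)
    return flags
```

For example, in a stretch of six hydrophobic residues flanked by glycines, only the six hydrophobic positions are flagged, because each of their windows contains all six while the neighboring windows contain at most five.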

Retrieving PDB missing residues
In the study of protein disorder, it is of interest to examine missing coordinates of backbone atoms in PDB structures. Residues present in the SEQRES records but not in the ATOM records are called missing residues (18). iPDA provides a utility to find missing residues in the PDB database. All PDB chains are preprocessed to construct a protein-structure mapping according to their Swiss-Prot entry names or AC numbers. By marking the missing residues on the protein chains aligned by ClustalW, iPDA provides a clear view of which segments might be unstable. Disordered proteins usually activate their biological functions when undergoing disorder-to-order transitions. Therefore, protein segments that are disordered in some PDB entries but ordered in others deserve more attention in further analyses. Additionally, iPDA provides a similar utility for finding missing residues among all PDB chains belonging to a SCOP superfamily/family.

RESULTS AND DISCUSSIONS
In this section, we first evaluate how DisPSSMP2 performs in comparison with DisPSSMP and other existing packages for disorder prediction. After that, several interesting examples are provided to illustrate how iPDA helps users to explore the functional roles of the detected disorder regions. Many measures have been introduced to evaluate the performance of protein disorder predictors (8,18,28,29). Since sensitivity, specificity and accuracy listed in Table 2 are seriously affected by the relative frequency of the target classes (29), two more appropriate measures are included in Table 2 to reveal the properties of different packages. The first one, Matthews' correlation coefficient, is widely used in many bioinformatics problems (30,31). The other evaluation measure, named probability excess, was recommended by CASP (28,29) and Yang et al. (18) for this problem.
To evaluate the performance of DisPSSMP2, we use the benchmark proposed by Yang et al. (18), comprising 239 proteins. When preparing the training data of DisPSSMP2, redundancy between the training and testing data was avoided using the same criterion adopted by Yang et al. (18). This benchmark also helps to judge whether a predictor tends to over-predict or under-predict disorder (8,17,18,32). As listed in Table 3, DisPSSMP2 performs better than DisPSSMP no matter which evaluation measure is used. The improvement of DisPSSMP2 comes mainly from including more ordered residues as training samples and from the two-stage architecture employed. In addition, the existing packages for predicting protein disorder are ranked by their probability excess (Prob. Excess) in Table 3. It should be noted that these packages were trained with different databases and that some of them define protein disorder differently from us. Although many methods achieve specificity in excess of 90%, they usually have low sensitivity. Since iPDA aims to discover potential disorder-to-order transitions, the employed predictor should deliver high sensitivity for disordered regions without a marked drop in specificity.

Table 2. Definition of measures employed in this study
Note: TP and TN are the numbers of correctly classified disordered and ordered residues, respectively; FP is the number of ordered residues incorrectly classified as disordered; FN is the number of disordered residues incorrectly classified as ordered.
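For reference, the evaluation measures named in this section can be computed from the four confusion counts as sketched below. The formulas are the standard definitions of sensitivity, specificity, accuracy and Matthews' correlation coefficient, with probability excess taken as sensitivity + specificity − 1; they are assumed, not confirmed, to match the definitions in Table 2, and the function name is hypothetical.

```python
import math
from typing import Dict

def disorder_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard per-residue measures, with disorder as the positive class."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both disordered and ordered residues are required")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
        "prob_excess": sensitivity + specificity - 1.0,
    }
```

A predictor that labels every residue as ordered reaches high accuracy on an unbalanced benchmark but has a probability excess of zero, which is why the last two measures are preferred for this problem.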
Next, we provide some examples discussed in (1) to illustrate how iPDA facilitates the study of protein disorder and induced folding. The first example is the DNA-binding protein GCN4. According to the prediction shown in Figure 3A, this protein might be largely unstructured. Meanwhile, the region 225-281 is predicted to contain substantial helical content. WildSpan also indicates high concurrent conservation in this area. Figure 3B shows that one pattern found by WildSpan identifies the important residues of the DNA-binding region. Similar findings are observed for the proteins NFATC1 and RXR discussed in (1). In many cases, we observed that regions undergoing disorder-to-order transitions when binding DNA usually possess both high disorder and secondary structure propensity, and that at least one pattern is found within such a region, indicating potential intra- and/or inter-molecular interactions. This observation is further supported by another protein, SecA, which undergoes a local disorder-to-order transition upon ADP binding at high temperature (44). Part of the result of analyzing SecA is shown in Figure 4A. It shows that the range 500-600 exhibits both disorder and concurrent conservation. In this region, WildSpan detects 26 residues, 10 of which are predicted as disordered. We highlight these 10 residues as red sticks in Figure 4B to examine their positions with respect to the ADP molecule.
Another example of a disordered region containing functional motifs is the protein p53. The iPDA result is shown in Figure 5. In the disordered N-terminal domain (NTD) of p53, a short motif 'FxxLW', called the MDM2 functional motif, is discussed by Dawson et al. (45). The key residues are detected by ClustalW, as well as by the second run of WildSpan (b = 3 and k = 1). Those residues were not found in the first run of WildSpan, because the DNA-binding domain of p53 is more conserved than the MDM2-binding domain. If only the first disordered region (1-117) predicted by DisPSSMP2 is selected, the motif is detected, as shown in Figure 5A and B. In Figure 5C, an available PDB structure shows the interaction of this polypeptide with the protein MDM2.
CONCLUSION
iPDA provides comprehensive information for annotating the disordered regions of a query sequence. The integrated resource recognizes intrinsically unstructured proteins and helps to tell whether a disordered protein or protein fragment tends to fold upon binding other molecules. According to the experiments conducted in this study, the disorder predictor DisPSSMP2 achieves higher sensitivity than other existing packages performing a similar task, without sacrificing specificity. In addition, iPDA employs sequential pattern mining to identify concurrent conservation iteratively, from highly conserved regions to less conserved regions, one at a time. It is observed in many cases that disordered regions undergoing disorder-to-order transitions upon binding usually exhibit high concurrent conservation and clear secondary structure propensity. This association deserves further study in the near future.