Nucleic Acids Res. Aug 2007; 35(15): e96.
Published online Aug 1, 2007. doi:  10.1093/nar/gkm562
PMCID: PMC1976432

Meta-prediction of protein subcellular localization with reduced voting

Abstract

Meta-prediction seeks to harness the combined strengths of multiple predicting programs with the hope of achieving predicting performance surpassing that of all existing predictors in a defined problem domain. We investigated meta-prediction for the four-compartment eukaryotic subcellular localization problem. We compiled an unbiased subcellular localization dataset of 1693 nuclear, cytoplasmic, mitochondrial and extracellular animal proteins from Swiss-Prot 50.2. Using this dataset, we assessed the predicting performance of 12 predictors from eight independent subcellular localization predicting programs: ELSpred, LOCtree, PLOC, Proteome Analyst, PSORT, PSORT II, SubLoc and WoLF PSORT. The Gorodkin correlation coefficient (GCC) was one of the performance measures. Proteome Analyst is the best individual subcellular localization predictor tested in this four-compartment prediction problem, with GCC = 0.811. A reduced voting strategy eliminating six of the 12 predictors yields a meta-predictor (RAW-RAG-6) with GCC = 0.856, substantially better than all tested individual subcellular localization predictors (P = 8.2 × 10⁻⁶, Fisher's Z-transformation test). The improvement in performance persists when the meta-predictor is tested with data not used in its development. This and similar voting strategies, when properly applied, are expected to produce meta-predictors with outstanding performance in other life sciences problem domains.

INTRODUCTION

In the past decade, increased availability of large amounts of life sciences data, including low-throughput data accumulated by generations of scientists over a half-century and high-throughput data acquired through newly developed biotechnologies, has coincided with great advances in data analysis and modeling techniques, most notably in machine learning. This has led to an increase in computational prediction programs in various important domains of life sciences research. In more and more problem domains, multiple prediction programs have emerged from independent efforts by different groups. These programs differ in the data features they use, in the methods or algorithms they apply in the classification tasks, or in both. These prediction programs may be complementary; i.e. one program performs better for one type of data under one set of circumstances, but another prediction program performs better for another type of data or under other circumstances. By proper exploitation of the combined strengths of these prediction programs, it may be possible to construct meta-predictors whose performance surpasses that of all existing prediction programs.

The meta-prediction problem is one that seeks to construct a prediction program (termed a meta-predictor), which makes predictions by organizing and processing the prediction results of a number of other prediction programs (termed element predictors). The meta-predictor takes the output of element predictors as its sole input. No explicit attention is paid to the feature definition or the underlying classification algorithms of the individual element predictors. Rather, the strengths and weaknesses of each element predictor, and the similarities and differences between different element predictors, are visible to the meta-predictor only through the predictions they make. The hope of meta-prediction is to develop a meta-predictor that can combine the strengths of the element predictors and produce more accurate predictions than any of the element predictors. In this study, we focus on the meta-prediction of subcellular localization of proteins.

Subcellular localization is a key functional characteristic of eukaryotic proteins. Most proteins must be localized to the correct subcellular compartment or organelle in order to properly execute their biological function(s). Cooperating proteins must be present in the same location in order for them to interact. Since Nakai and Kanehisa's pioneering work (1), a large number of computational prediction programs have been developed in this field [see recent reviews, e.g. (2,3)]. These programs use many different data features, such as N-terminal signal sequence information [TargetP (4), PSORT (5) and iPSORT (6)]; amino acid composition [NNPSL (7), PLOC (8), FUZZY_LOC (9) and SubLoc (10)]; evolutionary information obtained by multiple sequence alignment or PSI-BLAST, and/or calculated physicochemical properties of the proteins [Proteome Analyst (11), LOCSVMPSI (12), ESLpred (13) and LOCtree (14)]; 3D structural data [LOC3d (15)]; or even gene expression data (16). They also use many different classification methods, such as expert systems (PSORT, iPSORT); artificial neural networks (ANN) (LOCnet, LOC3d, TargetP); k-nearest neighbor (k-NN) [PSORT II (5)]; Naive Bayes (NB) classifier (Proteome Analyst); fuzzy k-NN (FUZZY_LOC); or support vector machines (SVM) (SubLoc, LOCSVMPSI, PLOC, ELSpred and LOCtree).

These different data features and classification methods may give these prediction programs different, complementary strengths. In this study, we develop meta-predictors that harness the combined strengths of these individual element predictors. We first compiled an unbiased subcellular localization dataset that does not overlap with any data used in the development of these predictors; we then examined the performance of these predictors using this unbiased dataset; and explored several voting-based strategies for constructing meta-predictors. We show that, using a simple reduced voting strategy, an excellent meta-predictor can be developed, with a predicting performance substantially exceeding that of all element predictors, and that this meta-predictor's excellent performance persists with data not used in its development.

MATERIALS AND METHODS

Compilation of MetaSCL06 dataset

In this study, we focus on the subcellular localization predictions of animal proteins. The unbiased protein subcellular localization dataset MetaSCL06 was compiled from Swiss-Prot Release 50.2 (12 July 2006). The compilation procedure consisted of four steps: (1) assembling an unbiased set of proteins, (2) assigning class labels to the proteins based on gene ontology (GO) annotations in Swiss-Prot entries, (3) assigning class labels to the proteins based on the comment field in Swiss-Prot entries, and (4) manual reconciliation of protein sets from steps 2 and 3 (Figure 1).

Figure 1.
Compiling the MetaSCL06 dataset. Nuc: nuclear; Cyt: cytoplasmic; Mit: mitochondrial; Ext: extracellular.

Step 1: Assembling an unbiased set of proteins

For unbiased testing, the dataset compiled for this study should not contain data used in the development of any element prediction programs. An approach similar to (17) was taken, where the original report of each individual prediction program was carefully examined for descriptions about the data sources used in the development of the program. In the original reports of all but two prediction programs (PSORT and PSORT II), the Swiss-Prot database was explicitly stated as the original data source used in development, and the release numbers of the database were also provided. The latest release used in the development of these prediction programs was 45.0 (used in the development of WoLF PSORT), with release date 25 October 2004. This date was chosen as the cutoff date. For PSORT and PSORT II, since these two predictors were developed much earlier than other programs, and the web servers for these two programs have not been updated since November 1999 (nearly five years prior to the cutoff date), it is highly unlikely that any data used in the development of these two programs would be included in our protein dataset.

All animal protein sequences bearing an initial entry date after 25 October 2004 in Swiss-Prot 50.2 formed the starting point of our unbiased protein dataset. All protein sequences shorter than 30 residues were discarded, because predictions made on such short sequences were much less reliable for all element predictors (data not shown). In total, 14 246 proteins were retained at this step.
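
The two filters in Step 1 (entry date after the cutoff, sequence length of at least 30) can be sketched as follows. This is only an illustration under stated assumptions: the records are made-up toy tuples rather than parsed Swiss-Prot entries, and the names `CUTOFF`, `MIN_LENGTH` and `is_unbiased_candidate` are hypothetical.

```python
from datetime import date

CUTOFF = date(2004, 10, 25)  # release date of Swiss-Prot 45.0, taken from the text
MIN_LENGTH = 30              # minimum sequence length, taken from the text


def is_unbiased_candidate(entry_date, sequence):
    """Keep entries first created after the cutoff and at least MIN_LENGTH long."""
    return entry_date > CUTOFF and len(sequence) >= MIN_LENGTH


# Hypothetical toy records: (accession, initial entry date, sequence).
records = [
    ("TOY001", date(2005, 3, 1), "M" * 120),    # kept
    ("TOY002", date(2004, 10, 25), "M" * 120),  # dropped: not after the cutoff
    ("TOY003", date(2006, 1, 10), "M" * 29),    # dropped: shorter than 30
]

kept = [acc for acc, entry_date, seq in records
        if is_unbiased_candidate(entry_date, seq)]
print(kept)
```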

Step 2: Assigning class labels to proteins based on GO annotations

In this study, we focus on classifying proteins localized in four subcellular compartments: nuclear, cytoplasmic, mitochondrial and extracellular.

Roughly 20% of the Swiss-Prot entries contain the Category C (denoting ‘cellular components’) GO annotations in their DR field, based on which a class label indicating one of the four subcellular compartments can be assigned to the corresponding protein. The GO annotations are considered more reliable than the comment annotations in the CC field (see Step 3 below) because they resulted from an additional round of manual curation by Swiss-Prot staff. However, assigning class labels based on GO annotations is not a straightforward task, because multiple GO terms frequently appear in the annotation of a single Swiss-Prot entry, and these terms often stand in parent–child relationships within the GO hierarchy. In this hierarchy (represented as a directed acyclic graph, or DAG), GO terms corresponding to the four subcellular compartments are interconnected through their parent terms. A procedure was developed to assign class labels to as many proteins as possible with consistent GO annotations (see Supplementary Methods and Supplementary Table 1). At this step, 555 proteins (305 nuclear, 29 cytoplasmic, 105 mitochondrial and 116 extracellular) in the unbiased protein set were assigned class labels.

Step 3: Assigning class labels to proteins based on annotations in the comment field

All proteins in the unbiased protein set obtained at Step 1 were fed into a keyword filter, in which the comment field (CC) was checked against a list of keywords. Class labels were assigned to the proteins using an established procedure (see Supplementary Methods and Supplementary Table 2). At this step, 1595 proteins (500 nuclear, 277 cytoplasmic, 154 mitochondrial and 664 extracellular) received class labels.

Step 4: Manual reconciliation

Finally, all proteins that received class labels in Steps 2 and 3 were subjected to manual reconciliation. Entries were removed in cases of uncertainty or when the class labels assigned in those two steps conflicted. The final MetaSCL06 dataset includes 1693 proteins (607 nuclear, 173 cytoplasmic, 222 mitochondrial and 691 extracellular). This dataset is available as Supplementary Table 3.

Inconsistent annotation in GO categories and/or comment fields may represent rare but real cases of individual proteins present in multiple subcellular compartments. For simplicity, these proteins were removed from the MetaSCL06 dataset.

Compilation of MetaSCL07 dataset

The MetaSCL07 dataset is a validation set that was not used in meta-predictor development. This dataset was compiled from Swiss-Prot Release 51.6 (6 February 2007), with the same procedure used in the compilation of MetaSCL06. All entries of proteins bearing an initial entry date on or before 12 July 2006 (date of Release 50.2) were removed. This dataset includes 579 proteins (145 nuclear, 50 cytoplasmic, 144 mitochondrial and 240 extracellular). This dataset is available as Supplementary Table 4.

Selection of element predictors

In order to be usable as an element predictor for the meta-prediction problem, a prediction program needs to be accessible online or be available in downloadable form. Several predicting programs, including NNPSL, FUZZY_LOC, LOCnet and LOCSVMPSI, were excluded from consideration, because the implementations of these prediction programs are no longer available. Since a vast majority of the remaining prediction programs take full-length protein sequences as their only input, we focused on these. Programs requiring structural information (e.g. LOC3d) were excluded. Programs that calculate structure-related features internally (e.g. PSORT, PSORT II, LOCtree and ELSpred) were acceptable because these features are calculated based on the protein sequences, and the latter are the only input needed from the user.

A majority of the prediction programs make predictions on at least four major subcellular compartments: nuclear, cytoplasmic, mitochondrial and extracellular. Thus we focused on prediction of these four compartments. Two prediction programs, TargetP and iPSORT, were excluded because they make predictions on mitochondrial, chloroplast and secretory pathway proteins, but do not make predictions on nuclear, cytoplasmic or extracellular proteins.

The final list of prediction programs and the chosen element predictors are shown in Table 1. A total of 12 element predictors were chosen, derived from eight prediction programs. Each program is discussed below:

Table 1.
Summary of the 12 element predictorsa

ELSpred (13). ELSpred uses the one-versus-the-rest SVM as the underlying classification method. It makes predictions into the four common subcellular localization compartments: nuclear, cytoplasmic, mitochondrial and extracellular. ELSpred provides five prediction options, each corresponding to a different feature formulation scheme. ELSpred_comp uses the compositions of the 20 amino acids as its features. ELSpred_physicochemical uses 33 physicochemical properties as its features. ELSpred_dipeptide defines features using dipeptide compositions. The features used in ELSpred_EuPSI are constructed from three iterations of EuPSI-BLAST, which yield the similarity between the query protein and 2427 eukaryotic proteins. ELSpred_hybrid uses a feature scheme that combines the four schemes above. These five prediction options are treated as different element predictors in this study.

LOCtree (14). LOCtree uses amino acid compositions of the proteins as its features. It goes through a three-level binary tree-structured process with a binary SVM model working at each node in the tree. In addition to the four common subcellular localization compartments, LOCtree also makes predictions about whether a protein is an organelle protein, and about whether a nuclear protein is a DNA-binding protein.

PLOC (8). PLOC uses five different types of compositions (amino acids, amino acid pairs, one gapped amino acid pairs, two gapped amino acid pairs and three gapped amino acid pairs) as its features. The predictions are made by one-versus-the-rest SVMs followed by voting. In addition to the four common subcellular localization compartments, PLOC also makes predictions into six other subcellular localization compartments: cytoskeleton, endoplasmic reticulum, the Golgi apparatus, lysosome, peroxisome and plasma membrane.

Proteome Analyst (11). Proteome Analyst adopts features calculated with PSI-BLAST against the Swiss-Prot database, and employs a Naïve Bayes (NB) algorithm for making predictions. Besides the four common subcellular compartments, Proteome Analyst also makes predictions into five additional subcellular compartments: endoplasmic reticulum, Golgi apparatus, lysosome, peroxisome and plasma membrane.

PSORT and PSORT II (5). PSORT and PSORT II utilize a large number of features, including the presence of N-terminal sorting signals, the presence of RNA/DNA-binding motifs, amino acid compositions and some calculated structural information. PSORT is a knowledge-based system with a set of ‘if-then’ rules. PSORT II employs a k-NN learning algorithm. Besides the four common subcellular compartments, PSORT also makes predictions into endoplasmic reticulum, the Golgi apparatus, lysosome, microbody and plasma membrane, while PSORT II makes predictions into cytoskeleton, endoplasmic reticulum, the Golgi apparatus, plasma membrane, peroxisome and secretory vesicles.

SubLoc (10). SubLoc uses amino acid compositions as its features. Four one-versus-the-rest SVMs are trained to make predictions on a given protein into one of the four common subcellular localizations.

WoLF PSORT (18). WoLF PSORT defines features using amino acid compositions and N-terminal signals that are encoded by AAindex, and also adopts some PSORT features. A k-NN classifier is trained following the WoLF feature selection and weighting procedure. WoLF PSORT makes predictions into cytoskeleton, endoplasmic reticulum, the Golgi apparatus, lysosome, peroxisome and plasma membrane in addition to the four common subcellular compartments.

Obtaining and pre-processing prediction results of element predictors

Prediction jobs were submitted to each of the element prediction programs with the protein sequences in the MetaSCL06 and MetaSCL07 datasets. Some of the element prediction servers, ELSpred, PSORT, PSORT II and PLOC, do not provide a batch-processing option. For these prediction servers, simple Java programs were developed and used to handle the job submission and result retrieval. Other Java programs were developed to parse and analyze the prediction results returned from the element prediction servers. Some prediction programs, including PLOC, ELSpred_EuPSI and ELSpred_hybrid, provide in their output the most likely subcellular compartment, but other prediction programs generate numerical scores in their output, for example, the ‘reliability indices’ produced by LOCtree and SubLoc, the ‘certainty score’ produced by ELSpred_comp, ELSpred_physicochemical and ELSpred_dipeptide, and the percentage scores produced by Proteome Analyst and PSORT II. When multiple compartments appeared in the output with numerical scores, the one with the highest value was picked as the predicted compartment. Two of the prediction servers (PLOC and WoLF PSORT) can predict two compartments for a single protein both with the highest scores (e.g. ‘nuclear’ and ‘cytoplasmic’). In these cases, both predictions were considered valid and they were given equal weights when the predicting performance of the prediction program was evaluated.
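
The rule for turning numerical scores into a predicted compartment, including equal weights for tied top scores, can be sketched as below. The function name `top_compartments` and the score dictionaries are hypothetical; the actual server output formats and the authors' Java parsers are not described in enough detail to reproduce.

```python
def top_compartments(scores):
    """Map {compartment: score} to {compartment: weight} for the top-scoring entries.

    A single best compartment gets weight 1.0. Compartments tied for the
    highest score share the weight equally. No scores means no prediction.
    """
    if not scores:
        return {}
    best = max(scores.values())
    winners = [name for name, value in scores.items() if value == best]
    return {name: 1.0 / len(winners) for name in winners}


# Hypothetical scores for two proteins.
print(top_compartments({"nuclear": 0.7, "cytoplasmic": 0.2, "extracellular": 0.1}))
print(top_compartments({"nuclear": 0.5, "cytoplasmic": 0.5}))
```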

Performance measures

For a two-class classification problem, commonly used performance measures include sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC) (19). These measures are defined as follows:

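$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

and

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$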

where TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative samples in the classification, respectively.
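
As a quick check of the four two-class measures in their standard textbook form, the sketch below computes them from the confusion counts. The function name `binary_measures` and the counts are made up for illustration and are not taken from the study.

```python
import math


def binary_measures(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy, MCC) from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal total is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


# Hypothetical counts: 50 positives and 50 negatives.
sens, spec, acc, mcc = binary_measures(tp=40, tn=45, fp=5, fn=10)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} "
      f"accuracy={acc:.3f} MCC={mcc:.3f}")
```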

For a multi-class classification problem, the definitions of sensitivity, specificity and MCC are no longer valid, but that of accuracy continues to be useful. In addition, Gorodkin (20) defined a correlation coefficient formula, which we will call the Gorodkin correlation coefficient (GCC), a measure of predicting performance for a multi-class classifier. GCC calculates the correlation between two N × K matrices, the observation matrix $X$ and the prediction matrix $Y$, where N is the number of samples, and K is the number of classes. In the protein subcellular localization prediction problem, K is equal to 4 for meta-predictors and the element predictors making predictions into the four common subcellular compartments only (e.g. SubLoc). For element predictors that make predictions into the ‘other compartment’ class (e.g. LOCtree and PLOC), K is equal to 5. An element in the observation matrix, $x_{ij}$, is set to 1 if the ith sample is known to belong to class j, and to 0 otherwise. An element in the prediction matrix, $y_{ij}$, is set to 1 if the ith sample is predicted to belong to class j by the predictor, and to 0 otherwise. For PLOC and WoLF PSORT, if a sample is predicted into two equally probable compartments, the corresponding elements in the prediction matrix are set to 1/2 for both compartments. If no prediction is made for a sample by an element predictor, all corresponding elements in the prediction matrix are set to 0.

Given the observation matrix $X$ and the prediction matrix $Y$, GCC is defined as follows:

$$\text{GCC} = \frac{\mathrm{COV}(X, Y)}{\sqrt{\mathrm{COV}(X, X)\,\mathrm{COV}(Y, Y)}}$$

where $\mathrm{COV}(X, Y)$, $\mathrm{COV}(X, X)$ and $\mathrm{COV}(Y, Y)$ are the covariances of the corresponding matrices, each defined as the arithmetic average of the covariances of corresponding columns of the matrices.

GCC has the following desirable characteristics: it has a range of [−1, 1], just like Pearson's correlation coefficient and MCC. The more accurate the prediction is, the closer GCC is to 1. When the number of classes is equal to 2 in the classification problem, the definition of GCC reduces to the familiar MCC formula.
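
A minimal sketch of the GCC calculation follows, assuming the column-averaged covariance described in the text. The names `obs`, `pred`, `matrix_cov` and `gcc` are illustrative, and the toy two-class data are chosen only to show the reduction to MCC when K = 2.

```python
import math


def matrix_cov(a, b):
    """Average, over columns, of the covariance between matching columns of a and b."""
    n, k = len(a), len(a[0])
    total = 0.0
    for j in range(k):
        mean_a = sum(row[j] for row in a) / n
        mean_b = sum(row[j] for row in b) / n
        total += sum((a[i][j] - mean_a) * (b[i][j] - mean_b) for i in range(n)) / n
    return total / k


def gcc(obs, pred):
    """Correlation between an N x K observation matrix and an N x K prediction matrix."""
    denom = math.sqrt(matrix_cov(obs, obs) * matrix_cov(pred, pred))
    return matrix_cov(obs, pred) / denom if denom else 0.0


# Toy two-class data: 40 TP, 10 FN, 5 FP, 45 TN (class 1 = positive).
POS, NEG = [1, 0], [0, 1]
obs = [POS] * 40 + [POS] * 10 + [NEG] * 5 + [NEG] * 45
pred = [POS] * 40 + [NEG] * 10 + [POS] * 5 + [NEG] * 45

mcc = (40 * 45 - 5 * 10) / math.sqrt((40 + 5) * (40 + 10) * (45 + 5) * (45 + 10))
print(f"GCC={gcc(obs, pred):.4f}  MCC={mcc:.4f}")
```

Fractional entries (the 1/2 used for tied predictions) and all-zero rows (no prediction) need no special handling in this formulation.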

Comparison of element predictors

The element predictors are not completely compatible with one another in the types of predictions they make. First, 6 of the 12 element predictors make predictions for subcellular compartments other than the four we chose to use (Table 1). For simplicity, we lump all other subcellular compartments together for each of these programs and call them ‘other compartments’. Since the gold standard dataset (MetaSCL06) contains data for the four chosen compartments only, any prediction made into the ‘other compartments’ class by an element predictor is counted as a wrong prediction. Second, some element predictors do not make predictions for all proteins. For instance, Proteome Analyst made predictions on 1523 of the 1693 proteins in the MetaSCL06 dataset, and ELSpred_EuPSI made predictions on only 624 of the 1693 proteins in the dataset. For the sake of a fair performance comparison, we counted these ‘no prediction’ cases as wrong predictions when calculating the accuracy of these element predictors. However, we adopt an additional performance measure to assess the performance of an element predictor on the portion of the dataset for which predictions are actually made. We term this measure the ‘relative accuracy’, and calculate it as the ratio of the number of correctly predicted proteins to the number of proteins for which predictions are made. Finally, for some proteins, some element predictors (e.g. PLOC and WoLF PSORT) output two predicted subcellular compartments, which are considered equally probable by the predictors. For these samples, the prediction is considered ‘half correct’ if one of the two output compartments matches the true compartment label for the protein in the dataset.
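
The difference between accuracy and relative accuracy, together with the ‘half correct’ rule, can be illustrated with a small sketch. The function name and the toy predictions are hypothetical; giving a tied prediction half credit in both measures is our reading of the text rather than a documented detail.

```python
def accuracy_measures(truth, predictions):
    """Return (accuracy, relative accuracy).

    truth: list of true compartment labels.
    predictions: list of tuples of predicted labels per protein; an empty
    tuple means no prediction, two labels mean an equally probable tie.
    """
    credit = 0.0
    n_predicted = 0
    for label, predicted in zip(truth, predictions):
        if not predicted:
            continue  # wrong for accuracy, excluded from relative accuracy
        n_predicted += 1
        if label in predicted:
            credit += 1.0 / len(predicted)  # 1 for a single hit, 1/2 for a tie
    accuracy = credit / len(truth)
    relative_accuracy = credit / n_predicted if n_predicted else 0.0
    return accuracy, relative_accuracy


truth = ["nuclear", "cytoplasmic", "mitochondrial", "extracellular"]
predictions = [
    ("nuclear",),                # correct
    ("nuclear", "cytoplasmic"),  # half correct
    (),                          # no prediction
    ("other",),                  # 'other compartments' counts as wrong
]
acc, rel = accuracy_measures(truth, predictions)
print(acc, rel)  # 0.375 0.5
```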

Unweighted voting strategy

For a given protein in the dataset, the unweighted voting meta-predictor makes prediction $P_{uv}$ as

$$P_{uv} = \arg\max_{i} \sum_{j=1}^{12} P(i,j)$$

where i is the index of the subcellular compartments: i = 1 denotes ‘nuclear’, i = 2 denotes ‘cytoplasmic’, and i = 3 and i = 4 denote ‘mitochondrial’ and ‘extracellular’, respectively. P(i,j) describes the prediction made by the jth (j = 1, … 12) element predictor: P(i,j) = 1 if the prediction made by the jth element predictor is i, and P(i,j) = 0 otherwise. If, for a given protein, two equally probable compartments are output by an element predictor (PLOC or WoLF PSORT), the score is split so that P(i,j) = 1/2 for both compartments. The notation ‘arg max’ stands for the ‘argument of the maximum’, and it returns the value of i that leads to the highest value of the formula that follows (in this case, $\sum_{j=1}^{12} P(i,j)$). That is, for each input protein, the unweighted voting meta-predictor sums the number of element predictors that make positive predictions for each of the four subcellular localization compartments, then picks the compartment with the largest number. When two or more compartments share the highest score, one of them is picked at random.
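
The unweighted vote can be sketched as follows. The function name, the dictionary representation of P(i,j) and the toy votes are assumptions made for illustration; votes for compartments outside the four are simply not counted here.

```python
import random

COMPARTMENTS = ("nuclear", "cytoplasmic", "mitochondrial", "extracellular")


def unweighted_vote(element_predictions, rng=random):
    """Pick the compartment with the most votes; break ties at random.

    element_predictions: one dict per element predictor mapping a
    compartment to its vote (1.0, or 0.5 each for a two-way tie).
    """
    totals = dict.fromkeys(COMPARTMENTS, 0.0)
    for prediction in element_predictions:
        for compartment, vote in prediction.items():
            if compartment in totals:  # votes for other compartments are ignored
                totals[compartment] += vote
    best = max(totals.values())
    tied = [c for c in COMPARTMENTS if totals[c] == best]
    return rng.choice(tied)


votes = [
    {"nuclear": 1.0},
    {"nuclear": 0.5, "cytoplasmic": 0.5},  # split prediction
    {"cytoplasmic": 1.0},
    {"nuclear": 1.0},
    {},                                    # no prediction
]
print(unweighted_vote(votes))  # nuclear (2.5 votes against 1.5)
```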

Weighted voting strategy

The weighted voting strategy differs from the unweighted voting strategy in that the predictions made by element predictors are multiplied by a weight, which varies among predictors, before being summed up to produce the prediction of the meta-predictors. In other words, the prediction made by a weighted voting meta-predictor, Pwv, is described as

$$P_{wv} = \arg\max_{i} \sum_{j=1}^{12} w_j \, P(i,j)$$

where wj is the weight for element predictor j (j = 1 through 12).

Weights are set to reflect the predicting performance of the element predictors: an element predictor with higher predicting performance is given a higher weight.
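
A sketch of the weighted vote is given below. The predictor names `A`, `B`, `C` and their weights are hypothetical, and random tie-breaking is carried over from the unweighted case as an assumption. The example is chosen so that one strong predictor outvotes two weaker ones.

```python
import random

COMPARTMENTS = ("nuclear", "cytoplasmic", "mitochondrial", "extracellular")


def weighted_vote(element_predictions, weights, rng=random):
    """Pick the compartment with the largest weighted vote total.

    element_predictions: {predictor name: {compartment: vote}}.
    weights: {predictor name: weight}, e.g. a performance measure.
    """
    totals = dict.fromkeys(COMPARTMENTS, 0.0)
    for name, prediction in element_predictions.items():
        for compartment, vote in prediction.items():
            if compartment in totals:
                totals[compartment] += weights[name] * vote
    best = max(totals.values())
    return rng.choice([c for c in COMPARTMENTS if totals[c] == best])


predictions = {
    "A": {"mitochondrial": 1.0},
    "B": {"nuclear": 1.0},
    "C": {"nuclear": 1.0},
}
equal = {"A": 1.0, "B": 1.0, "C": 1.0}
by_performance = {"A": 0.9, "B": 0.4, "C": 0.4}  # hypothetical weights

print(weighted_vote(predictions, equal))           # nuclear (2.0 against 1.0)
print(weighted_vote(predictions, by_performance))  # mitochondrial (0.9 against 0.8)
```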

Reduced voting strategy

Although the prediction results of all element predictors are available to the meta-predictors, it is not necessary for all of them to be used. Indeed, if we exclude from consideration some of the element predictors that do not perform well, it may be possible to obtain meta-predictors with further improved performance. Thus, we applied the so-called ‘reduced voting strategy’: starting from a full (or ‘unreduced’) meta-predictor, we iteratively reduce the number of element predictors included in the construction of the meta-predictor, by picking the remaining element predictor with the lowest performance and setting its weight to 0. This process continues until only one element predictor remains in consideration. Three performance measures are used in evaluating the element predictors: accuracy, relative accuracy and GCC. Therefore, for each (unreduced) meta-predictor, there are three different ways in which the reduction can be done. They are named accuracy-guided reduction (AG), relative accuracy-guided reduction (RAG) and GCC-guided reduction (GG), respectively. In each of these reduction methods, the lowest scoring element predictors are excluded one by one, producing a series of reduced voting meta-predictors.
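
The reduction procedure amounts to ranking the element predictors by the guiding measure and dropping them one at a time from the bottom. The sketch below shows this under stated assumptions: the predictor names and scores are toy values, and `evaluate` stands in for whatever routine scores a meta-predictor built from the retained predictors.

```python
def reduction_series(guide_scores):
    """Return the retained predictors after excluding 0, 1, ..., n-1 lowest scorers."""
    ranked = sorted(guide_scores, key=guide_scores.get)  # lowest-performing first
    return [ranked[k:] for k in range(len(ranked))]


def best_reduction(guide_scores, evaluate):
    """Return (number excluded, retained predictors) with the best evaluation."""
    series = reduction_series(guide_scores)
    k = max(range(len(series)), key=lambda i: evaluate(series[i]))
    return k, series[k]


# Hypothetical guiding scores for four toy predictors.
guide = {"A": 0.9, "B": 0.7, "C": 0.4, "D": 0.3}

# Hypothetical meta-predictor performance as a function of how many are retained.
toy_performance = {4: 0.75, 3: 0.80, 2: 0.86, 1: 0.82}

k, retained = best_reduction(guide, lambda kept: toy_performance[len(kept)])
print(k, retained)  # 2 ['B', 'A']
```

In this toy run the performance first rises and then falls as predictors are removed, so the best meta-predictor lies in the middle of the series.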

RESULTS

Predicting performance of element predictors

Table 2 summarizes the predicting performance of the 12 element predictors assessed on the MetaSCL06 dataset. The performance of the element predictors varies considerably. Proteome Analyst (accuracy: 0.821, GCC: 0.811) offers the best performance among all predictors, followed by LOCtree (accuracy: 0.746, GCC: 0.663) and WoLF PSORT (accuracy: 0.733, GCC: 0.635). Some predictors from the ELSpred program (ELSpred_hybrid and ELSpred_dipeptide) are ranked among the lowest in predicting performance. Proteome Analyst is also the element predictor that offers the highest relative accuracy (0.913), followed by ELSpred_EuPSI (0.880) and LOCtree (0.766).

Table 2.
Predicting performance of element predictors using the MetaSCL06 dataset

Unweighted voting strategy

With the performance of every element predictor assessed using the unbiased MetaSCL06 dataset, we set out to explore strategies to construct meta-predictors on top of these element predictors. First, we attempted a simple unweighted voting strategy. The meta-predictor constructed using the unweighted voting strategy (accuracy: 0.754, GCC: 0.651) offers better predicting performance than the average performance of the 12 element predictors (accuracy: 0.578, GCC: 0.459), but it does not reach the performance of the most accurate element predictor (Proteome Analyst, accuracy: 0.821, GCC: 0.811) (Table 3).

Table 3.
Predicting performance of unreduced voting meta-predictors as compared with that of element predictors

Weighted voting strategy

Next, we examined a weighted voting strategy. We looked at three different weighting schemes, which correspond to the three measures we used when assessing the performance of the element predictors: (i) accuracy weighting (or AW), (ii) relative accuracy weighting (or RAW) and (iii) GCC weighting (or GW). In each of these weighting schemes, the value of the respective performance measure for a given element predictor is used as the weight of that element predictor.

As is shown in Table 3, improved performance is achieved in these weighted voting meta-predictors. All three weighted voting meta-predictors show accuracy values that approach or slightly exceed that of the most accurate element predictor, Proteome Analyst. However, using GCC, none of these meta-predictors have reached the level of Proteome Analyst (GCC: 0.811).

Reduced voting strategy

The four voting schemes (unweighted voting and three weighted voting schemes) are combined with the three reduction methods [accuracy-guided reduction (or AG), relative accuracy-guided reduction (or RAG) and GCC-guided reduction (GG)], giving rise to a total of 12 series of reduced voting meta-predictors. In each of these predictor series, the predicting performance (measured in accuracy or in GCC) shows a biphasic relationship with the number of excluded element predictors (Figure 2). When the number of excluded element predictors is small, the predicting performance increases with the number of excluded element predictors, agreeing well with our conjecture that excluding poorly performing element predictors may improve the predicting performance of the meta-predictors. The predicting performance reaches a peak when about 6–9 element predictors are excluded, then declines as more element predictors are excluded. Apparently, beyond this critical point, further removal of the more accurate element predictors is detrimental to the predicting performance of the resultant meta-predictor.

Figure 2.
Performance of reduced voting meta-predictors (accuracy on the left, GCC on the right) plotted against the number of excluded element predictors. (A) Relative accuracy weighted voting (RAW) combined with three reduction methods—accuracy-guided ...

As is shown in Table 4, the best predictor in each of the reduced voting meta-predictor series demonstrates better predicting performance than the best-performing element predictor (Proteome Analyst) in both accuracy and GCC. Most of these best reduced meta-predictors show significantly higher GCC than that of Proteome Analyst in Fisher's Z-transformation test. The meta-predictor with the best performance was found to be the relative accuracy weighted meta-predictor, reduced by relative accuracy guiding, with six element predictors excluded (denoted as RAW-RAG-6). This meta-predictor makes predictions based on the predictions made by six element predictors: ELSpred_physicochemical, ELSpred_EuPSI, LOCtree, Proteome Analyst, PSORT II and WoLF PSORT (Table 2). RAW-RAG-6 reaches a remarkable accuracy of 0.902, an absolute improvement of about 0.08 over Proteome Analyst (accuracy: 0.821); and a GCC of 0.856, significantly higher than that of Proteome Analyst (GCC: 0.811), the best element predictor examined (P = 8.2 × 10⁻⁶, Fisher's Z-transformation test).
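
For readers unfamiliar with Fisher's Z-transformation test, a minimal sketch is given below. It assumes a one-sided comparison of two correlation coefficients treated as coming from independent samples, each of size 1693 (the MetaSCL06 dataset size); the text does not state these details, so they are assumptions. Under them, the GCC values quoted above give a P-value of roughly 8 × 10⁻⁶, in line with the reported figure.

```python
import math


def fisher_z_one_sided(r1, r2, n1, n2):
    """One-sided test that correlation r1 exceeds r2 (independent samples assumed).

    Returns the z statistic and the upper-tail normal probability.
    """
    z = (math.atanh(r1) - math.atanh(r2)) / math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z > z) for a standard normal
    return z, p


# GCC of RAW-RAG-6 against GCC of Proteome Analyst, both on 1693 proteins.
z, p = fisher_z_one_sided(0.856, 0.811, 1693, 1693)
print(f"z = {z:.2f}, one-sided P = {p:.1e}")
```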

Table 4.
Predicting performance of reduced voting meta-predictors

RAW-RAG-6 with data not used in its development

Element predictor performance was evaluated on data not used in the development of the element predictors. To impose the same limit on RAW-RAG-6, the element predictors and RAW-RAG-6 were evaluated using the MetaSCL07 dataset, which contains data not used in RAW-RAG-6 development (Table 5). Proteome Analyst remains the element predictor with the best predicting performance based on GCC (0.783), though LOCtree offers better accuracy (0.829) than Proteome Analyst (0.775) with the MetaSCL07 dataset. The superior performance offered by RAW-RAG-6 persists with this dataset, with an accuracy of 0.888 and GCC of 0.840, significantly better than those of any element predictor.

Table 5. Predicting performance of element predictors and RAW-RAG-6 using the MetaSCL07 dataset

RAW-RAG-6 in individual compartment predictions

The problem of protein subcellular localization is commonly formulated as a multi-class classification problem. However, it can also be viewed as several individual two-class classification problems, one for each subcellular compartment. This allows one to examine the ability of a given predictor to identify proteins localized in each compartment individually. The MetaSCL06 dataset was converted into four variant datasets, one for each of the four subcellular compartments: nuclear, cytoplasmic, mitochondrial and extracellular. In the variant dataset for nuclear proteins, for instance, all proteins labeled as ‘nuclear’ were considered ‘positive’ samples, and all proteins labeled with any of the other compartments (cytoplasmic, mitochondrial and extracellular) were pooled and considered ‘negative’ samples. We evaluated the predicting performance of each of the 12 element predictors, as well as that of the RAW-RAG-6 meta-predictor, using these variant datasets.
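The conversion to two-class problems described above can be sketched as follows. This is an illustrative example rather than the authors' code: the helper names `binarize` and `accuracy_and_mcc` and the four-protein toy labels are hypothetical, and the MCC is computed with the standard Matthews formula from the confusion-matrix counts.

```python
import math


def binarize(labels, positive):
    """Map compartment labels to True (positive class) or False (all others)."""
    return [label == positive for label in labels]


def accuracy_and_mcc(truth, predicted):
    """Accuracy and Matthews correlation coefficient for two boolean lists."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    tn = sum((not t) and (not p) for t, p in zip(truth, predicted))
    fp = sum((not t) and p for t, p in zip(truth, predicted))
    fn = sum(t and (not p) for t, p in zip(truth, predicted))
    accuracy = (tp + tn) / len(truth)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


# Toy data (hypothetical): true and predicted compartments for four proteins.
true_labels = ["nuclear", "nuclear", "cytoplasmic", "extracellular"]
pred_labels = ["nuclear", "cytoplasmic", "cytoplasmic", "extracellular"]

for compartment in ("nuclear", "cytoplasmic", "mitochondrial", "extracellular"):
    acc, mcc = accuracy_and_mcc(
        binarize(true_labels, compartment), binarize(pred_labels, compartment)
    )
    print(f"{compartment}: accuracy={acc:.3f}, MCC={mcc:.3f}")
```

Each compartment gets its own positive/negative split of the same proteins, so one multi-class prediction yields four separate two-class scores.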

RAW-RAG-6 outperforms each of the 12 element predictors in accuracy and MCC for all four two-class classification problems (Table 6). Compared with the best-performing element predictor (Proteome Analyst), the biggest improvement was achieved for the extracellular compartment, with a 2.7 percentage point increase in accuracy (from 0.956 to 0.983) and a 0.057 increase in MCC (from 0.909 to 0.966, P < 2 × 10⁻¹⁶, Fisher's Z-transformation test). It is followed by the nuclear compartment, with a 2.1 percentage point increase in accuracy (from 0.908 to 0.929) and a 0.046 increase in MCC (from 0.801 to 0.847, P = 1.4 × 10⁻⁵, Fisher's Z-transformation test). The smallest improvement is found for the cytoplasmic compartment, with a 0.7 percentage point increase in accuracy (from 0.922 to 0.929) and a 0.013 increase in MCC (from 0.617 to 0.630, P = 0.27, Fisher's Z-transformation test); the latter is not statistically significant. Overall, the RAW-RAG-6 meta-predictor performs well in these two-class classification problems, and matches or exceeds every element predictor in identifying proteins localized in each of the four subcellular compartments.

Table 6. Predicting performance of element predictors and RAW-RAG-6 in two-class predictions for the four subcellular compartments

DISCUSSION

Meta-predictors may resolve conflicting predictions

In many life science domains, several prediction programs have emerged that often have different strengths, owing to the different types of data (or different aspects of the same data) they use and/or the different classification methods adopted in their development. When more than one of these programs is used on the same data, they may produce conflicting predictions. Users are often confused and frustrated by such conflicting results, because they may lack the knowledge to make a sensible choice among them. A meta-predictor whose predicting performance exceeds that of every individual element predictor may resolve this quandary.

Meta-predictors versus element predictors

Meta-predictors cannot replace element predictors. Rather, they are enhancements. Meta-predictors are constructed from element predictors, and their performance depends on accurate predictions made by element predictors. Without good element predictors, it is not possible for good meta-predictors to be developed. In addition, meta-predictors (in particular, voting-based meta-predictors) are effective only within the scope of the prediction problem that is common to multiple element predictors. Often, element predictors make unique predictions. For example, among the prediction programs discussed in this study, only PSORT II makes predictions about protein localization to secretory vesicles. For unique predictions, one has to rely on an element predictor.

Cross-validation and future performance

We did not perform cross-validation explicitly in this development. However, all parameters of RAW-RAG-6 (the relative accuracy values of the element predictors) are calculated as sample statistics, which are insensitive to the removal of a small number of samples when the sample size is sufficiently large. The testing we performed can therefore be considered approximately equivalent to cross-validation. To demonstrate, suppose we perform an ‘explicit’ LOO (leave-one-out) cross-validation, i.e. we take 1692 samples as the ‘training dataset’ and the remaining sample as the validation data, and repeat this for 1693 iterations so that each sample is validated once. In each iteration, the parameters of the meta-predictor, calculated from the 1692 training samples, would be essentially the same as the relative accuracy values computed from the whole 1693-sample dataset. Therefore, each predictor obtained from the 1693 iterations of LOO cross-validation would be essentially the same as the predictor obtained from the entire 1693-sample dataset.
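The size of the leave-one-out effect can be bounded with a short calculation. As a simplifying assumption, the sketch below uses the plain accuracy of a single element predictor as a stand-in for its relative accuracy; the symbols k (number of correctly predicted samples), n (sample size) and c_i (1 if the held-out sample i is predicted correctly, 0 otherwise) are introduced here for illustration only.

```latex
a = \frac{k}{n}, \qquad
a_{-i} = \frac{k - c_i}{n - 1}, \qquad c_i \in \{0, 1\}

a - a_{-i} = \frac{n c_i - k}{n (n - 1)}

\left| a - a_{-i} \right| \le \frac{1}{n - 1}
  = \frac{1}{1692} \approx 5.9 \times 10^{-4}
  \qquad (n = 1693)
```

Under this assumption, removing any single sample shifts an accuracy value by less than 0.001, which is small compared with the accuracy differences between predictors reported here.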

The validation performed using the MetaSCL07 dataset suggests that the RAW-RAG-6 meta-predictor is robust. Its performance for future, unseen data is expected to be close to what was achieved in this study, assuming no changes are made to the element predictors. However, if changes take place in any of its component element predictors, the reduced voting-based meta-predictor will need adjustment.

Linear voting strategies

The linear voting strategies explored in this study are related to several well-known online learning algorithms, including Littlestone and Warmuth's weighted majority (WM) algorithm (21) and Freund and Schapire's Hedge algorithm (22). Those algorithms apply to situations in which a ‘master predictor’ makes predictions based on the opinions of several ‘experts’. If the weights are properly chosen, there are theoretical bounds on the maximal number of wrong predictions made by the ‘master predictor’, i.e. the performance of the ‘master predictor’ will not be ‘too much worse’ than that of the best ‘expert’. The meta-prediction problem discussed in this study differs from those previous studies in that ‘batch learning’, rather than ‘online learning’, applies. In other words, the training samples are assumed to be provided together, instead of one at a time. In the same paper in which the Hedge algorithm was discussed (22), Freund and Schapire introduced the well-known Adaboost algorithm, which applies to batch learning. The major difference between the meta-prediction problem discussed in this study and Adaboost and other ensemble learning algorithms [e.g. Logitboost (23) and Bagging (24)] is that in ensemble learning, the ‘element predictors’ are the results of an identical training algorithm applied to different samplings of the training data. In meta-prediction, the ‘element predictors’ are assumed to be known and unchanged, and all training data are used for all element predictors.
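For readers unfamiliar with the online setting, the following is a minimal sketch of the deterministic weighted majority idea, extended to multi-class labels by picking the label with the largest total weight. It is not the batch method used for RAW-RAG-6; the function names, the penalty factor `beta` and the alphabetical tie-breaking rule are illustrative assumptions.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label whose supporting experts carry the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Iterate labels in sorted order so ties are broken deterministically.
    return max(sorted(totals), key=totals.get)


def weighted_majority(rounds, n_experts, beta=0.5):
    """Online weighted majority over (expert_predictions, true_label) rounds.

    Every expert starts with weight 1; after each round, the weight of each
    expert that was wrong is multiplied by beta (0 <= beta < 1).
    Returns the final weights and the number of master-predictor mistakes.
    """
    weights = [1.0] * n_experts
    mistakes = 0
    for expert_predictions, truth in rounds:
        if weighted_vote(expert_predictions, weights) != truth:
            mistakes += 1
        weights = [
            weight * beta if prediction != truth else weight
            for weight, prediction in zip(weights, expert_predictions)
        ]
    return weights, mistakes
```

In the batch setting of this study, by contrast, the weights are fixed once from the whole training dataset instead of being updated sample by sample.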

CONCLUSIONS

The successful development of RAW-RAG-6 demonstrates the effectiveness of voting-based strategies for meta-prediction problems. Proper employment of voting-based strategies is likely to lead to good meta-predictors in other life sciences problem domains.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We thank Drs A. Banerjee and W. Pan at the University of Minnesota for inspiring discussions. We also thank the Supercomputing Institute, University of Minnesota for computational resources, and W. Gong for technical assistance. T.L. acknowledges the support of NIH (1R21CA126209) and Minnesota Medical Foundation. Funding to pay the Open Access publication charges for the article was provided by NIH/NCI.

Conflict of interest statement. None declared.

REFERENCES

1. Nakai K, Kanehisa M. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics. 1992;14:897–911. [PubMed]
2. Rost B, Liu J, Nair R, Wrzeszczynski KO, Ofran Y. Automatic prediction of protein function. Cell. Mol. Life Sci. 2003;60:2637–2650. [PubMed]
3. Donnes P, Hoglund A. Predicting protein subcellular localization: past, present, and future. Genomics Proteomics Bioinformatics. 2004;2:209–215. [PubMed]
4. Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 2000;300:1005–1016. [PubMed]
5. Nakai K, Horton P. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem. Sci. 1999;24:34–36. [PubMed]
6. Bannai H, Tamada Y, Maruyama O, Nakai K, Miyano S. Extensive feature detection of N-terminal protein sorting signals. Bioinformatics. 2002;18:298–305. [PubMed]
7. Reinhardt A, Hubbard T. Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 1998;26:2230–2236. [PMC free article] [PubMed]
8. Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003;19:1656–1663. [PubMed]
9. Huang Y, Li Y. Prediction of protein subcellular locations using fuzzy k-NN method. Bioinformatics. 2004;20:21–28. [PubMed]
10. Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001;17:721–728. [PubMed]
11. Lu Z, Szafron D, Greiner R, Lu P, Wishart DS, Poulin B, Anvik J, Macdonell C, Eisner R. Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics. 2004;20:547–556. [PubMed]
12. Xie D, Li A, Wang M, Fan Z, Feng H. LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acids Res. 2005;33:110. [PMC free article] [PubMed]
13. Bhasin M, Raghava GP. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res. 2004;32:419. [PMC free article] [PubMed]
14. Nair R, Rost B. Mimicking cellular sorting improves prediction of subcellular localization. J. Mol. Biol. 2005;348:85–100. [PubMed]
15. Nair R, Rost B. Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins. 2003;53:917–930. [PubMed]
16. Drawid A, Gerstein M. A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J. Mol. Biol. 2000;301:1059–1075. [PubMed]
17. Klee EW, Ellis LB. Evaluating eukaryotic secreted protein prediction. BMC Bioinformatics. 2005;6:256. [PMC free article] [PubMed]
18. Horton P, Park K-J, Obayashi T, Nakai K. The 4th Annual Asia Pacific Bioinformatics Conference APBC06; Taipei, Taiwan. 2006. pp. 39–48.
19. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta. 1975;405:442–451. [PubMed]
20. Gorodkin J. Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem. 2004;28:367–374. [PubMed]
21. Littlestone N, Warmuth MK. The weighted majority algorithm. Information and Computation. 1994;108:212–261.
22. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 1997;55:119–139.
23. Friedman J, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting. Ann. Stat. 2000;28:337–374.
24. Breiman L. Bagging predictors. Machine Learning. 1996;24:123–140.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press