PLoS One. 2010; 5(4): e9931.
Published online Apr 1, 2010. doi:  10.1371/journal.pone.0009931
PMCID: PMC2848569

A New Method for Predicting the Subcellular Localization of Eukaryotic Proteins with Both Single and Multiple Sites: Euk-mPLoc 2.0

Darren P. Martin, Editor

Abstract

Information on the subcellular locations of proteins is important for in-depth studies of cell biology. It is also very useful for proteomics, systems biology and drug development. However, most existing methods for predicting protein subcellular location cover only 5 to 12 location sites. They are also limited to single-location proteins and hence fail for multiplex proteins, which can simultaneously exist at, or move between, two or more location sites. Multiplex proteins of this kind often possess important biological functions worthy of special notice. A new predictor called “Euk-mPLoc 2.0” was developed by hybridizing gene ontology information, functional domain information, and sequential evolutionary information through three different modes of pseudo amino acid composition. It can be used to identify eukaryotic proteins among the following 22 locations: (1) acrosome, (2) cell wall, (3) centriole, (4) chloroplast, (5) cyanelle, (6) cytoplasm, (7) cytoskeleton, (8) endoplasmic reticulum, (9) endosome, (10) extracell, (11) Golgi apparatus, (12) hydrogenosome, (13) lysosome, (14) melanosome, (15) microsome, (16) mitochondria, (17) nucleus, (18) peroxisome, (19) plasma membrane, (20) plastid, (21) spindle pole body, and (22) vacuole. Compared with the existing methods for predicting eukaryotic protein subcellular localization, the new predictor is much more powerful and flexible, particularly in dealing with proteins with multiple locations and proteins without available accession numbers. For a newly constructed stringent benchmark dataset, which contains both single- and multiple-location proteins and in which no protein has ≥25% pairwise sequence identity to any other in the same location, the overall jackknife success rate achieved by Euk-mPLoc 2.0 is more than 24% higher than that of any existing predictor. As a user-friendly web-server, Euk-mPLoc 2.0 is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/euk-multi-2/. For a query protein sequence of 400 amino acids, the web-server takes about 15 seconds to yield the predicted result; the longer the sequence, the more time it usually needs. It is anticipated that the novel approach and the powerful predictor presented in this paper will have a significant impact on Molecular Cell Biology, Systems Biology, Proteomics, Bioinformatics, and Drug Development.

Introduction

With the avalanche of protein sequences generated in the post-genomic era, numerous efforts have been made to develop methods for predicting protein subcellular localization based on sequence information (see, e.g., [1], [2], [3], [4], [5], [6], [7], [8] as well as the long list of references cited in two comprehensive review articles [9], [10]). However, far less effort has been devoted to proteins that may simultaneously exist at, or move between, two or more different subcellular locations. Proteins with multiple locations or dynamic features of this kind are particularly interesting because they may have some very special biological functions worthy of notice [11], [12]. In particular, as pointed out by Millar et al. [13], recent evidence indicates that an increasing number of proteins have multiple locations in the cell.

About two years ago, a web-server predictor [14] was developed for dealing with the eukaryotic systems that contain both single-location and multiple-location proteins. The predictor is called Euk-mPLoc, where “m” stands for “multiple” meaning it can be used to deal with multiplex proteins as well. The Euk-mPLoc predictor was established by hybridizing the “higher-level” GO (gene ontology [15]) approach and PseAAC (pseudo amino acid composition [16], [17]) approach. Its power mainly came from the GO approach because proteins formulated in the GO database space would be clustered in a manner much better reflecting the distribution of their subcellular locations, as elucidated in [18].

However, the existing version of Euk-mPLoc has the following shortcomings. (1) For the prediction engine to exploit the GO approach, the accession number of a query protein is required as part of the input; many proteins, such as synthetic and hypothetical proteins, or newly discovered sequences not yet deposited in databanks, do not have accession numbers and hence cannot be treated with the GO approach. (2) Even when accession numbers are available, such proteins cannot always be meaningfully formulated in a GO space because the current GO database is still far from complete. (3) Although the PseAAC approach, a complement to the GO approach in Euk-mPLoc, can take into account some partial sequence-order effects, the original PseAAC [16], [19] lacked the functional domain and sequential evolution information that may considerably affect the prediction quality.

The present study was devoted to developing a new and more powerful predictor of eukaryotic protein subcellular localization by addressing the above three problems.

Materials and Methods

Protein sequences were collected from the Swiss-Prot database at http://www.ebi.ac.uk/swissprot/. The detailed procedures are basically the same as those described in [14]; the only difference is that, in order to establish a more up-to-date benchmark dataset, version 55.3 of the Swiss-Prot database released on 29-Apr-2008 was adopted instead of version 50.7 released on 9-Sept-2006. After strictly following the procedures described in [14], we finally obtained a benchmark dataset $\mathbb{S}$ containing 7,766 different protein sequences that are distributed among 22 subcellular locations (Fig. 1); i.e.,

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \mathbb{S}_3 \cup \cdots \cup \mathbb{S}_{22}$$
(1)

where $\mathbb{S}_1$ represents the subset for the subcellular location of “acrosome”, $\mathbb{S}_2$ for “cell membrane”, $\mathbb{S}_3$ for “cell wall”, and so forth; while $\cup$ represents the symbol for “union” in set theory. A breakdown of the 7,766 eukaryotic proteins in the benchmark dataset $\mathbb{S}$ according to their 22 location sites is given in Table 1. To avoid redundancy and homology bias, none of the proteins in $\mathbb{S}$ has ≥25% pairwise sequence identity to any other in the same subset. The corresponding accession numbers and protein sequences are given in Online Supporting Information S1.

Figure 1
Illustration showing the 22 subcellular locations of eukaryotic proteins.

Table 1
Breakdown of the eukaryotic protein benchmark dataset $\mathbb{S}$ derived from the Swiss-Prot database (release 55.3) according to the procedures described in the Materials section.

Because the system investigated now contains both single-location and multiple-location proteins, some of the proteins in $\mathbb{S}$ may occur in two or more location sites. Therefore, it is instructive to introduce the concept of the “virtual sample”, illustrated as follows. A protein coexisting at two different location sites is counted as 2 virtual samples even though they have an identical sequence; if coexisting at three different sites, as 3 virtual samples; and so forth. Accordingly, the total number of different virtual protein samples is generally greater than the total number of different sequences. Their relationship can be formulated as follows

$$N(\text{vir}) = N(\text{seq}) + \sum_{m=1}^{M} (m-1)\, N(m)$$
(2)

where $N(\text{vir})$ is the total number of different virtual protein samples in $\mathbb{S}$, $N(\text{seq})$ the total number of different protein sequences, $N(1)$ the number of proteins with one location, $N(2)$ the number of proteins with two locations, and so forth; while $M$ is the total number of subcellular location sites (for the current case, $M = 22$ as shown in Fig. 1 and Table 1).

For the current 7,766 different protein sequences, 6,687 occur in one subcellular location, 1,029 in two locations, 48 in three locations, 2 in four locations, and none in five or more locations. Substituting these data into Eq.2, we have

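$$N(\text{vir}) = 7{,}766 + (1-1) \times 6{,}687 + (2-1) \times 1{,}029 + (3-1) \times 48 + (4-1) \times 2 = 8{,}897$$
(3)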

which is fully consistent with the figures in Table 1 and the data in Online Supporting Information S1.
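The virtual-sample bookkeeping described above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `count_virtual_samples` and the dictionary layout are assumptions made here and are not part of Euk-mPLoc 2.0.

```python
def count_virtual_samples(proteins_by_site_count):
    """Return (N_seq, N_vir) given {m: number of proteins with m locations}."""
    n_seq = sum(proteins_by_site_count.values())
    # A protein with m locations contributes (m - 1) extra virtual samples.
    extra = sum((m - 1) * n for m, n in proteins_by_site_count.items())
    return n_seq, n_seq + extra


# Counts quoted in the text for the benchmark dataset.
benchmark = {1: 6687, 2: 1029, 3: 48, 4: 2}
n_seq, n_vir = count_virtual_samples(benchmark)
print(n_seq, n_vir)  # 7766 8897
```

Equivalently, the number of virtual samples is the sum of m × N(m) over all m, which gives the same total.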

As stated in a recent comprehensive review [20], one of the most important steps in developing a powerful method for statistically predicting protein subcellular localization is to formulate protein samples with core features that are intrinsically correlated with their localization in a cell. Since the concept of pseudo amino acid composition (PseAAC) was proposed [16], it has provided a very flexible mathematical framework for investigators to incorporate their desired information into the representation of protein samples. According to its original definition, a PseAAC is any set of discrete numbers [16] that differs from the classical amino acid composition (AAC) and that is derived from a protein sequence in a way that retains some of its sequence-order and pattern information, or reflects some physicochemical and biochemical properties of the constituent amino acids. It has since been widely used to deal with many protein-related problems and sequence-related systems (see, e.g., [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42] and the long list of PseAAC-related references cited in a recent review [20]). As summarized in [20], 16 different PseAAC modes have so far been used to represent protein samples for predicting their attributes. Each of these modes has its own advantages and disadvantages. In this study, we formulate the protein samples by hybridizing the following three different modes of PseAAC.

1. GO (Gene Ontology) Representation Mode

The GO database [15] was established according to molecular function, biological process, and cellular component. Accordingly, protein samples defined in a GO database space are clustered in a way that better reflects their subcellular locations [10], [18]. However, in the original Euk-mPLoc predictor [14], the GO representation of a protein sample was derived from the GO database [43] through its accession number. Thus, when using Euk-mPLoc to perform a prediction, the accession number of the query protein is indispensable. To avoid this requirement, the following procedure is proposed to derive the GO representation mode.

Step 1

Use BLAST [44] to search the Swiss-Prot database (version 55.3) for proteins homologous to the query protein $\mathbf{P}$, with the expect value [formula e048] for the BLAST parameter.

Step 2

Those proteins which have [formula e049] pairwise sequence identity with the query protein $\mathbf{P}$ are collected into a set, $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$, called the “homology set” of $\mathbf{P}$. All the elements in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ can be deemed the “representative proteins” of $\mathbf{P}$. Because they were retrieved from the Swiss-Prot database, these representative proteins must each have their own accession numbers.

Step 3

Search each of these accession numbers collected in Step 2 against the GO database at http://www.ebi.ac.uk/GOA/ to find the corresponding GO numbers [43].

Step 4

The current GO database (version 70.0 released 10 March 2008) contains 60,020 GO numbers; thus the query protein $\mathbf{P}$ can be expressed via its representative proteins in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ by the following formulation

$$\mathbf{P}_{\text{GO}} = \left[ g_1 \;\; g_2 \;\; \cdots \;\; g_i \;\; \cdots \;\; g_{60020} \right]^{\mathbf{T}}$$
(4)

where $\mathbf{T}$ is the transposing operator, and

$$g_i = \begin{cases} 1, & \text{if a hit to the } i\text{-th GO number is found for any representative protein in } \mathbb{S}_{\mathbf{P}}^{\text{homo}} \\ 0, & \text{otherwise} \end{cases}$$
(5)

Through the above steps, we can use the GO information derived from the representative proteins in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ to formulate the query protein $\mathbf{P}$. The rationale is that homologous proteins generally share similar attributes, such as structural conformations and biological functions [45], [46], [47]. Thus, the accession number of the query protein is no longer an indispensable input for the high-level GO approach, as it was in Euk-mPLoc [14].
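Steps 1–4 can be illustrated with a minimal Python sketch. It assumes the BLAST search has already been run and that its hits are available as (accession, percent identity, expect value) tuples. The function names `homology_set` and `go_vector`, the toy accessions, and the cutoff values passed in the example are hypothetical and chosen for illustration only; a four-term GO space stands in for the 60,020-dimensional one.

```python
def homology_set(blast_hits, min_identity, max_evalue):
    """Keep accessions of hits that pass both caller-supplied cutoffs."""
    return {
        acc
        for acc, identity, evalue in blast_hits
        if identity >= min_identity and evalue <= max_evalue
    }


def go_vector(homologs, go_annotations, go_index):
    """Binary GO descriptor: component i is 1 if any homolog carries GO term i."""
    vec = [0] * len(go_index)
    for acc in homologs:
        for term in go_annotations.get(acc, ()):
            if term in go_index:
                vec[go_index[term]] = 1
    return vec


# Toy data: hypothetical accessions and illustrative cutoffs.
hits = [("ACC_A", 72.0, 1e-30), ("ACC_B", 35.0, 1e-5), ("ACC_C", 90.0, 1e-50)]
annotations = {"ACC_A": ["GO:0005634"], "ACC_C": ["GO:0005737", "GO:0005634"]}
index = {"GO:0005634": 0, "GO:0005737": 1, "GO:0005739": 2, "GO:0005886": 3}

homologs = homology_set(hits, min_identity=60.0, max_evalue=1e-3)
print(sorted(homologs))                         # ['ACC_A', 'ACC_C']
print(go_vector(homologs, annotations, index))  # [1, 1, 0, 0]
```

An empty homology set, or homologs without annotations, yields an all-zero vector, which corresponds to the case where the GO mode carries no usable information.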

The above homology-based GO extraction method is particularly useful for studying proteins that do not have UniProt accession numbers. However, it still fails in either of the following situations: (1) the query protein does not have significant homology to any protein in the Swiss-Prot database, i.e., $\mathbb{S}_{\mathbf{P}}^{\text{homo}} = \emptyset$, meaning the homology set is empty; (2) its representative proteins do not contain any useful GO information for statistical prediction based on a given training dataset.

Therefore, it is necessary to consider the following representation modes for those proteins which fail to be meaningfully defined in the GO space.

2. FunD (Functional Domain) Representation Mode

The FunD is the core of a protein and plays the major role in its function. That is why, in determining the 3-D (three-dimensional) structure of a protein by experiments (see, e.g., [48], [49]) or by computational modeling (see, e.g., [47], [50]), the first priority has always been its FunD. Using FunD information to formulate protein samples for statistical prediction was originally proposed in [51], [52], and quite encouraging results were achieved. At that time, the 2005 FunDs in the SBASE-A database [53] were used as bases to formulate the protein samples. Since then, a series of follow-up protein FunD databases have been established, such as COG [54], KOG [54], SMART [55], Pfam [56], and CDD [57]. Of these databases, CDD contains the domains imported from COG, Pfam and SMART, and hence is considerably more complete [57]. Version 2.11 of CDD contains 17,402 characteristic domains. Using each of these domains as a base vector, we can define a FunD space with 17,402 dimensions. Thus, by following procedures similar to those in [51], a protein sample can be uniquely defined through the steps described below:

Step 1

Use RPS-BLAST (Reverse PSI-BLAST) program [44] to conduct sequence alignment of the protein sequence with each of the 17,402 domain sequences in the CDD database.

Step 2

If the significance threshold value (expect value) is [formula e063] for the $i$-th domain, meaning that a “hit” is found, then the $i$-th component of the protein in the 17402-D space is assigned 1; otherwise, 0.

Step 3

The protein sample $\mathbf{P}$ in the FunD space can thus be formulated as

$$\mathbf{P}_{\text{FunD}} = \left[ d_1 \;\; d_2 \;\; \cdots \;\; d_i \;\; \cdots \;\; d_{17402} \right]^{\mathbf{T}}$$
(6)

where $\mathbf{T}$ is the transpose operator, and

$$d_i = \begin{cases} 1, & \text{if a hit is found for the } i\text{-th domain} \\ 0, & \text{otherwise} \end{cases}$$
(7)

Defined this way, the protein sample corresponds to a 17402-D vector $\mathbf{P}_{\text{FunD}}$ with each of the 17402 functional domain sequences as a base for the vector space. Such a representation includes not only some sequence-order effects but also some functional information. Since the function of a protein is closely related to its subcellular location, the FunD formulation of Eq.6 naturally incorporates factors that might be directly correlated with the protein subcellular location.
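The hit-to-vector rule of Steps 2–3 can be sketched as follows. This is a minimal illustration under stated assumptions: the RPS-BLAST run is assumed to have been summarized as a mapping from 0-based domain index to the best expect value, the function name `fund_vector` is hypothetical, and the cutoff passed in the example is a placeholder rather than the value used by the authors.

```python
def fund_vector(hit_evalues, n_domains, evalue_cutoff):
    """Binary FunD descriptor: component i is 1 if domain i has a significant hit.

    hit_evalues maps a 0-based domain index to the best expect value reported
    for that domain; domains absent from the mapping count as "no hit".
    """
    vec = [0] * n_domains
    for idx, evalue in hit_evalues.items():
        if 0 <= idx < n_domains and evalue <= evalue_cutoff:
            vec[idx] = 1
    return vec


# Toy example with 6 domains instead of 17,402; the cutoff is illustrative.
hits = {0: 1e-12, 3: 0.5, 5: 1e-4}
vec = fund_vector(hits, n_domains=6, evalue_cutoff=1e-3)
print(vec)       # [1, 0, 0, 0, 0, 1]
print(sum(vec))  # 2
```

Because only a handful of the 17,402 components are expected to be non-zero for a typical protein, a practical implementation might store just the set of hit indices instead of the full vector.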

3. SeqEvo (Sequential Evolution) Representation Mode

Since biology is a natural science with a historic dimension, all biological species have developed continuously from a very limited number of ancestral species. This is quite typical for protein sequences [47]. Their evolution involves changes of single residues, insertions and deletions of several residues, gene doubling, and gene fusion. With such changes accumulated over a long period of time, many similarities between initial and resultant amino acid sequences are eliminated, but the corresponding proteins may still share many common attributes, such as their location site in a cell. Therefore, to capture the core features and intrinsic relationships within a huge number of complicated protein sequences, it is particularly important to take the evolutionary effects into account. To this end, we incorporate the evolutionary information through the “Position-Specific Scoring Matrix” or “PSSM” [44], i.e., we express the protein $\mathbf{P}$ by an $L \times 20$ matrix as formulated by

$$\mathbf{P}_{\text{PSSM}} = \begin{bmatrix} E_{1 \to 1} & E_{1 \to 2} & \cdots & E_{1 \to 20} \\ E_{2 \to 1} & E_{2 \to 2} & \cdots & E_{2 \to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L \to 1} & E_{L \to 2} & \cdots & E_{L \to 20} \end{bmatrix}$$
(8)

where $L$ is the length of $\mathbf{P}$ (counted in the total number of its constituent amino acids), and $E_{i \to j}$ represents the score of the amino acid residue in the $i$-th position of the protein sequence being changed to amino acid type $j$ during the evolutionary process. Here, the numerical codes 1, 2, …, 20 are used to denote the 20 native amino acid types according to the alphabetical order of their single-character codes. The $L \times 20$ scores in Eq.8 were generated by using PSI-BLAST [44] to search the Swiss-Prot database (version 55.3 released on 29-Apr-2008) through three iterations with 0.001 as the $E$-value cutoff for multiple sequence alignment against the sequence of the protein $\mathbf{P}$, followed by a standard conversion given below:

$$E_{i \to j} = \frac{E^{0}_{i \to j} - \bar{E}^{0}_{i}}{\mathrm{SD}(E^{0}_{i})} \qquad (i = 1, 2, \ldots, L;\; j = 1, 2, \ldots, 20)$$
(9)

where $E^{0}_{i \to j}$ represent the original scores directly created by PSI-BLAST [44], which are generally shown as positive or negative integers (a positive score means that the corresponding mutation occurs more frequently than expected by chance, while a negative score means just the opposite); the symbol $\bar{E}^{0}_{i}$ means taking the average of $E^{0}_{i \to j}$ over $j = 1, 2, \ldots, 20$, and $\mathrm{SD}(E^{0}_{i})$ means the corresponding standard deviation. The converted values obtained by Eq.9 have a zero mean over the 20 amino acids and remain unchanged if put through the same conversion again. However, according to Eq.8, a protein of length $L$ corresponds to a matrix of $L$ rows. Hence, proteins of different lengths correspond to matrices of different dimensions. This is a hurdle to developing a predictor able to uniformly cover proteins of any length. To overcome this hurdle, one possible avenue is to represent a protein sample $\mathbf{P}$ by

$$\bar{\mathbf{P}}_{\text{PSSM}} = \left[ \bar{E}_1 \;\; \bar{E}_2 \;\; \cdots \;\; \bar{E}_{20} \right]^{\mathbf{T}}$$
(10)

where

$$\bar{E}_j = \frac{1}{L} \sum_{i=1}^{L} E_{i \to j} \qquad (j = 1, 2, \ldots, 20)$$
(11)

in which $\bar{E}_j$ represents the average score of the amino acid residues in the protein $\mathbf{P}$ being changed to amino acid type $j$ during the evolutionary process. However, if $\bar{\mathbf{P}}_{\text{PSSM}}$ of Eq.10 were used to represent the protein $\mathbf{P}$, all the sequence-order information during the evolutionary process would be erased. To avoid completely erasing the sequence-order information, the concept of PseAAC as originally proposed in [16] was utilized; i.e., instead of Eq.10, let us use the pseudo position-specific scoring matrix as given by

$$\mathbf{P}^{\lambda}_{\text{PsePSSM}} = \left[ \bar{E}_1 \;\; \bar{E}_2 \;\; \cdots \;\; \bar{E}_{20} \;\; G^{\lambda}_1 \;\; G^{\lambda}_2 \;\; \cdots \;\; G^{\lambda}_{20} \right]^{\mathbf{T}}$$
(12)

to represent the protein $\mathbf{P}$, where

$$G^{\lambda}_j = \frac{1}{L - \lambda} \sum_{i=1}^{L-\lambda} \left[ E_{i \to j} - E_{(i+\lambda) \to j} \right]^2 \qquad (j = 1, 2, \ldots, 20;\; \lambda < L)$$
(13)

meaning that $G^{1}_j$ is the correlation factor obtained by coupling the most contiguous position-specific scoring matrix scores along the protein chain for the amino acid type $j$; $G^{2}_j$ that obtained by coupling the second-most contiguous position-specific scoring matrix scores; and so forth. Note that, as mentioned in the Material section of [14], the length of the shortest protein sequence in the benchmark dataset is $L = 50$, and hence the value allowed for $\lambda$ in Eq.13 must be smaller than 50. When $\lambda = 0$, $G^{\lambda}_j$ becomes a naught element and Eq.12 degenerates to Eq.10.
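The conversion of a raw PSSM into a fixed-length descriptor can be sketched in plain Python. Several points are assumptions of this sketch and not statements about the authors' implementation: the function names `standardize_rows` and `pse_pssm` are hypothetical, the population standard deviation is used because the text does not say which estimator was applied, a constant row is mapped to zeros to avoid division by zero, and the lag-λ correlation factor is taken as the mean squared difference between scores λ positions apart. The input matrix is synthetic.

```python
from statistics import mean, pstdev


def standardize_rows(raw_pssm):
    """Z-score each row of the PSSM over its 20 amino-acid columns."""
    out = []
    for row in raw_pssm:
        mu, sd = mean(row), pstdev(row)
        # A constant row has zero spread; map it to zeros instead of dividing by 0.
        out.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    return out


def pse_pssm(pssm, lam):
    """Return 20 column means followed by 20 lag-`lam` correlation factors."""
    length, n_types = len(pssm), len(pssm[0])
    if not 0 <= lam < length:
        raise ValueError("lam must satisfy 0 <= lam < sequence length")
    means = [mean(row[j] for row in pssm) for j in range(n_types)]
    factors = [
        mean((pssm[i][j] - pssm[i + lam][j]) ** 2 for i in range(length - lam))
        for j in range(n_types)
    ]
    return means + factors


# Synthetic 4 x 20 integer matrix standing in for PSI-BLAST output.
raw = [[(i * j) % 7 - 3 for j in range(20)] for i in range(1, 5)]
pssm = standardize_rows(raw)
descriptor = pse_pssm(pssm, lam=1)
print(len(descriptor))  # 40
```

Whatever the sequence length, the descriptor always has 40 components, which is what allows proteins of different lengths to be compared in one vector space.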

A hybridization of the above three different PseAAC modes, i.e., Eq.4, Eq.6, and Eq.12, will be used to represent protein samples for establishing a new classifier for predicting eukaryotic protein subcellular localization, as described below.

4. Prediction Engine $\mathbb{C}$ and Computing Procedures

The prediction engine used in this study is the ensemble classifier $\mathbb{C}$ formed by fusing many individual OET-KNN (Optimized Evidence-Theoretic K-Nearest Neighbor) classifiers [58], [59]. According to the underlying rule of the OET-KNN classifier, a query protein is assigned to the class to which the majority of its K nearest neighbors belong. However, for most benchmark datasets, when $K > 10$ the success rate thus obtained decreases markedly. Therefore, our consideration for K can be confined to the range from 1 to 10. Accordingly, the ensemble classifier $\mathbb{C}$ can be formulated as

$$\mathbb{C} = \mathbb{C}(1) \,\forall\, \mathbb{C}(2) \,\forall\, \cdots \,\forall\, \mathbb{C}(10)$$
(14)

where the symbol $\forall$ denotes the fusing operator, $\mathbb{C}(1)$ is the individual OET-KNN classifier based on $K = 1$ nearest neighbor, $\mathbb{C}(2)$ that based on $K = 2$ nearest neighbors, and so forth. The detailed mathematical formulations for OET-KNN and $\mathbb{C}$ have been given in Eqs.22–29 in [10], where it has also been clearly elaborated how the ensemble classifier $\mathbb{C}$ works during the process of prediction. To avoid redundancy, we do not repeat the details here.
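The idea of fusing classifiers built with K = 1, 2, …, 10 can be illustrated with a deliberately simplified sketch. It substitutes a plain majority-vote K-NN with Euclidean distance for OET-KNN, whose evidence-theoretic formulation is given in [10] and is not reproduced here; the function names and the two-dimensional toy data are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Plain K-NN: majority label among the k training points nearest to `query`."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def ensemble_predict(train, query, ks=range(1, 11)):
    """Fuse the individual K-NN outcomes (K = 1..10 by default) by simple voting."""
    votes = Counter(knn_predict(train, query, k) for k in ks)
    return votes.most_common(1)[0][0]


# Toy 2-D training set: six points per class, two well-separated clusters.
train = [
    ((0.0, 0.0), "nucleus"), ((0.1, 0.0), "nucleus"), ((0.0, 0.1), "nucleus"),
    ((0.2, 0.1), "nucleus"), ((0.1, 0.2), "nucleus"), ((0.2, 0.2), "nucleus"),
    ((1.0, 1.0), "cytoplasm"), ((0.9, 1.0), "cytoplasm"), ((1.0, 0.9), "cytoplasm"),
    ((1.1, 1.0), "cytoplasm"), ((1.0, 1.1), "cytoplasm"), ((0.8, 0.9), "cytoplasm"),
]
print(ensemble_predict(train, (0.1, 0.1)))  # nucleus
print(ensemble_predict(train, (0.9, 0.9)))  # cytoplasm
```

Voting over several values of K removes the need to commit to a single K in advance, at the cost of running the neighbor search once per value.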

The prediction is processed according to the following order.

Step 1

If the query protein $\mathbf{P}$ can be expressed as a meaningful or productive descriptor in the GO database via its representative proteins in its homology set $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$, then $\mathbf{P}_{\text{GO}}$ of Eq.4 should be input into the prediction engine for identifying its subcellular location site(s); i.e.

$$\mathbb{C} \triangleright \mathbf{P}_{\text{GO}} = \left\{ \mathbb{C}(1) \,\forall\, \mathbb{C}(2) \,\forall\, \cdots \,\forall\, \mathbb{C}(10) \right\} \triangleright \mathbf{P}_{\text{GO}}$$
(15)

where $\triangleright$ represents the identification operator, and the fusion is made via a voting operation as formulated by Eqs.32–35 in [10].

Step 2

If the query protein $\mathbf{P}$ does not have significant homology to any protein in the Swiss-Prot database, i.e., $\mathbb{S}_{\mathbf{P}}^{\text{homo}} = \emptyset$ (empty set), or its representative proteins in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ do not contain any useful GO information, then both the FunD representation $\mathbf{P}_{\text{FunD}}$ of Eq.6 and the pseudo position-specific scoring matrix representation $\mathbf{P}^{\lambda}_{\text{PsePSSM}}$ of Eq.12 should be input into the prediction engine $\mathbb{C}$. The output is determined by fusing the many preliminary outcomes associated with different $K$ of $\mathbb{C}(K)$ (cf. Eq.14) and different possible $\lambda$ of the pseudo sequential evolution descriptor (cf. Eq.12); i.e.,

equation image
(16)

where the factor 10 is because $K$ in $\mathbb{C}(K)$ can be $1, 2, \ldots, 10$ and the factor 50 is because $\lambda$ in $\mathbf{P}^{\lambda}_{\text{PsePSSM}}$ can be $0, 1, \ldots, 49$ (cf. Eqs.12–13).

Step 3

To make Eqs.15–16 capable of handling proteins with multiple locations as well, the ensemble classifier $\mathbb{C}$ needs to be modified to $\mathbb{C}^{\theta}$, where $\theta$ is a threshold parameter for controlling the count of multiple location sites and optimizing the predicted results, as formulated by Eqs.39–48 in [10], where it was also elaborated how to evaluate the overall success rate when using $\mathbb{C}^{\theta}$ on a benchmark dataset containing both single- and multiple-location proteins.

The entire ensemble classifier thus established is called “Euk-mPLoc 2.0”, where “2.0” refers to an updated version evolved from Euk-mPLoc [14]. To provide an intuitive picture, a flowchart is given in Fig. 2 to illustrate the prediction process of Euk-mPLoc 2.0.

Figure 2
A flowchart to show the prediction process of Euk-mPLoc 2.0.
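The order of Steps 1 and 2 amounts to a simple fallback rule, which can be sketched as below. The function name `choose_descriptors` and the returned dictionary layout are hypothetical. Treating a GO vector with at least one non-zero component as “meaningful” is a simplifying assumption of this sketch, and the threshold-based handling of multiple locations in Step 3 is not modeled.

```python
def choose_descriptors(go_vec, fund_vec, pse_pssm_by_lambda):
    """Pick the descriptors passed to the classifier, following Steps 1-2.

    go_vec may be None (no homologs found) or a binary list;
    pse_pssm_by_lambda maps each lag value to its PsePSSM descriptor.
    """
    if go_vec is not None and any(go_vec):
        # Step 1: usable GO information exists, so only the GO mode is used.
        return {"mode": "GO", "inputs": [go_vec]}
    # Step 2: fall back to FunD plus one PsePSSM descriptor per lag value.
    ordered = [pse_pssm_by_lambda[lam] for lam in sorted(pse_pssm_by_lambda)]
    return {"mode": "FunD+PsePSSM", "inputs": [fund_vec] + ordered}


print(choose_descriptors([0, 1, 0], [0, 0], {0: [0.1]})["mode"])  # GO
print(choose_descriptors(None, [1, 0], {0: [0.1]})["mode"])       # FunD+PsePSSM
```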

Protocol Guide

For the convenience of experimental scientists, a user-friendly web-server was established for Euk-mPLoc 2.0. Below, let us give a step-by-step guide on how to use it to get the desired results.

Step 1

Open the web server at http://www.csbio.sjtu.edu.cn/bioinf/euk-multi-2/ and you will see the top page of the predictor on your computer screen, as shown in Fig. 3a. Click on the Read Me button to see a brief introduction to the Euk-mPLoc 2.0 predictor and the caveats for using it.

Figure 3
Semi-screenshot to show the prediction steps.

Step 2

Either type or copy and paste the query protein sequence into the input box at the center of Fig. 3a. The input sequence should be in the FASTA format. A sequence in FASTA format consists of a single initial line beginning with a greater-than symbol (“>”) in the first column, followed by lines of sequence data. The words right after the “>” symbol in the single initial line are optional and only used for the purpose of identification and description. All lines should be no longer than 120 characters and usually do not exceed 80 characters. The sequence ends if another line starting with a “>” appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box. For more information about FASTA format, visit http://en.wikipedia.org/wiki/Fasta_format.
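
Before pasting a sequence into the input box, it may help to check locally that the text follows the FASTA rules just described. The sketch below is one possible way to do so; the function name `parse_fasta`, the fallback header name `sequence_N`, and the decision to reject lines longer than 120 characters are assumptions made for this illustration and are not part of the web server.

```python
from typing import Dict, Iterable, Optional

MAX_LINE_LENGTH = 120  # upper limit on line length stated in the text above


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if len(line) > MAX_LINE_LENGTH:
            raise ValueError("line exceeds 120 characters")
        if line.startswith(">"):
            # A new ">" line ends the previous sequence and starts another.
            # The description after ">" is optional, so fall back to a
            # positional name (hypothetical convention) when it is empty.
            header = line[1:].strip() or f"sequence_{len(records) + 1}"
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            records[header] += line.upper()
    return records


example_text = ">query protein 1\nMKT\nAAG\n>\nGGS\n"
example_records = parse_fasta(example_text.splitlines())
```

Here the first record keeps its description as the key, while the second, which has an empty description line, receives the positional name `sequence_2`.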

Step 3

Click on the Submit button to see the predicted result. For example, if you use the sequence of query protein 1 in the Example window, the input screen should look like the illustration in Fig. 3b; after clicking the Submit button, you will see “Cell membrane; Cytoplasm; Nucleus” in the predicted result window (Fig. 3c), meaning that the protein is a multiplex one that can simultaneously occur in the “cell membrane”, “cytoplasm”, and “nucleus” compartments, fully consistent with experimental observations. However, if you use the sequence of query protein 2 in the Example window as the input, you will instead see “Cytoplasm” in the predicted result window (Fig. 3d), meaning that the protein is a single-location one residing in the “cytoplasm” compartment only, also fully consistent with experimental observations. For a protein sequence of 400 amino acids, it takes about 15 seconds before the predicted result appears on your computer screen; the longer the sequence, the more time is usually needed.

Step 4

Click on the Citation button to find the relevant papers that document the detailed development and algorithm of Euk-mPLoc 2.0.

Step 5

Click on the Data button to download the benchmark datasets used to train and test the Euk-mPLoc 2.0 predictor.

Caveat

To obtain the predicted result with the expected success rate, the entire sequence of the query protein, rather than a fragment of it, should be used as the input. A sequence with fewer than 50 amino acid residues is generally deemed a fragment. Also, if the query protein is known not to belong to any of the 22 locations shown in Fig. 1, stop the prediction, because the result thus obtained will not make any sense.
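
The length criterion in this caveat can be checked before submission. The following is a minimal sketch; the helper name `is_probable_fragment` and the constant `MIN_FULL_LENGTH` are hypothetical, and the check covers only the 50-residue rule of thumb stated above, not whether the protein belongs to one of the 22 locations.

```python
MIN_FULL_LENGTH = 50  # residues; shorter inputs are deemed fragments per the caveat


def is_probable_fragment(sequence: str, min_length: int = MIN_FULL_LENGTH) -> bool:
    """Return True if the sequence has fewer residues than min_length.

    Whitespace and line breaks are removed before counting, so a
    multi-line FASTA body can be passed in directly.
    """
    residues = "".join(sequence.split())
    return len(residues) < min_length


short_flag = is_probable_fragment("A" * 49)
full_flag = is_probable_fragment("A" * 25 + "\n" + "A" * 25)
```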

Results and Discussion

In statistical prediction, it is meaningless to state the success rate of a predictor without specifying the method and benchmark dataset used to test its accuracy. The following three cross-validation methods are often used to evaluate the accuracy of a statistical predictor: the independent dataset test, the sub-sampling (K-fold) test, and the jackknife test [60]. Of these three, the jackknife test is deemed the most objective because the independent dataset test and the sub-sampling test cannot avoid arbitrariness, as elaborated in a comprehensive review [10]. Therefore, the jackknife test has been increasingly and widely adopted to examine the power of various predictors (see, e.g., [23], [24], [25], [27], [29], [31], [34], [37], [61], [62], [63], [64], [65], [66], [67]). However, even when tested by jackknife cross-validation, the same predictor can still yield different success rates on different benchmark datasets. This is because the more stringent a benchmark dataset is in excluding homologous sequences, or the more subcellular locations it covers, the more difficult it is for a predictor to yield a high overall success rate. For instance, ProtLock [2] and HSLPred [68] are two predictors developed for identifying protein subcellular localization. Both were reported to have success rates of over 70–80% [2], [68] when tested on benchmark datasets that allow inclusion of homologous proteins with up to 90% pairwise sequence identity and cover only 4 or 5 subcellular location sites. However, when the two predictors were tested on the stringent dataset covering 16 different subcellular locations, in which none of the proteins included has [inline formula: pone.0009931.e144] pairwise sequence identity to any other in the same subset, the overall jackknife success rate achieved by ProtLock [2] dropped to 28.7% and that by HSLPred [68] to 33.1%, as reported in [58].
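
The jackknife (leave-one-out) procedure itself is simple to state in code: each protein is singled out in turn as the test sample, and the predictor is trained on all the remaining ones. The sketch below shows this for the single-label case only. The toy 1-nearest-neighbor classifier, the names `nearest_neighbor` and `jackknife_success_rate`, and the example data are assumptions for illustration; they are not the prediction engine of Euk-mPLoc 2.0.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, location label)


def nearest_neighbor(train: List[Sample], query: Sequence[float]) -> str:
    """Toy 1-nearest-neighbor classifier (squared Euclidean distance)."""
    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda sample: dist(sample[0], query))[1]


def jackknife_success_rate(
    data: List[Sample],
    classify: Callable[[List[Sample], Sequence[float]], str] = nearest_neighbor,
) -> float:
    """Leave each sample out once, predict it from the rest, and
    return the fraction of correct predictions."""
    if len(data) < 2:
        raise ValueError("jackknife test needs at least two samples")
    correct = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # all samples except the i-th
        if classify(train, features) == label:
            correct += 1
    return correct / len(data)


toy_data: List[Sample] = [
    ((0.0, 0.0), "A"), ((0.1, 0.0), "A"),
    ((1.0, 1.0), "B"), ((1.1, 1.0), "B"),
    ((0.5, 0.4), "B"),  # lies closer to the "A" samples, so it is missed
]
toy_rate = jackknife_success_rate(toy_data)
```

Because every sample is used exactly once for testing and the split is fully determined by the dataset, the result does not depend on how the data happen to be partitioned, in contrast to the independent dataset and sub-sampling tests.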

The current benchmark dataset is even more stringent because, in addition to applying the same threshold to rigorously exclude homologous sequences, it covers even more location sites, namely 22. Besides, to the best of our knowledge, apart from Euk-mPLoc [14], there is so far no other web-server predictor that can be used to predict a system with both single- and multiple-location proteins distributed among 22 different location sites. Accordingly, to demonstrate the advantage of Euk-mPLoc 2.0, it is sufficient to compare the success rates achieved by the new predictor with those achieved by Euk-mPLoc [14].

Listed in Table 2 are the results obtained with Euk-mPLoc [14] and Euk-mPLoc 2.0 on the benchmark dataset [inline formula: pone.0009931.e145] (cf. Table 1) by the jackknife cross-validation test. During the testing process, only the sequences of the proteins in Online Supporting Information S1, and not their accession numbers, were used as inputs so that the two predictors were compared under exactly the same conditions. During the jackknife cross-validation by Euk-mPLoc 2.0 and Euk-mPLoc, the false positives (over-predictions) and false negatives (under-predictions) were also taken into account to reduce the scores used for calculating the success rate. Note that it is more complicated to count the over-predictions and under-predictions for a system containing both single-location and multiple-location proteins. For the detailed calculation process, refer to Eqs. 43–48 as well as Fig. 4 in a comprehensive review [10]. As can be seen from Table 2, for such a stringent and multiplex benchmark dataset, the overall success rate achieved by Euk-mPLoc 2.0 is over 64%, which is about 25% higher than that achieved by Euk-mPLoc.
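
For a single protein, over-predictions and under-predictions can be identified by comparing the set of predicted locations with the set of observed ones. The sketch below shows only this set comparison; it does not reproduce the scoring of Eqs. 43–48 in [10], and the function name `prediction_errors` and the example inputs are hypothetical.

```python
from typing import Iterable, Set, Tuple


def prediction_errors(
    predicted: Iterable[str], observed: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (over_predictions, under_predictions) for one protein.

    over_predictions:  locations predicted but not observed (false positives)
    under_predictions: locations observed but not predicted (false negatives)
    """
    predicted_set, observed_set = set(predicted), set(observed)
    return predicted_set - observed_set, observed_set - predicted_set


over, under = prediction_errors(
    ["Cytoplasm", "Nucleus"],          # hypothetical predictor output
    ["Cytoplasm", "Cell membrane"],    # hypothetical experimental annotation
)
```

In this example one location is over-predicted and one is under-predicted, even though the number of predicted sites equals the number of observed sites, which is why counting errors for multiplex proteins requires more care than for single-location ones.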

Table 2
A comparison of Euk-mPLoc 2.0 with Euk-mPLoc in the jackknife cross-validation test on the benchmark dataset covering 22 location sites, where none of the eukaryotic proteins included has [inline formula: pone.0009931.e146] pairwise sequence identity to any other in the same location.

Finally, it should be pointed out that although Euk-mPLoc 2.0 is more powerful than the existing predictors in identifying the subcellular locations of eukaryotic proteins, there is much room for further improvement in future studies. As shown in Table 2, the success rates of Euk-mPLoc 2.0 for proteins belonging to the “melanosome” and “synapse” locations are very low. This is because, compared with most of the other 20 location sites, the numbers of proteins in these two sites are not sufficiently large (cf. Table 1 and Online Supporting Information S1) to train the prediction engine effectively. It is anticipated that, as more experimental data become available for the two sites, the situation will improve and Euk-mPLoc 2.0 will become even more powerful.

Supporting Information

Supporting Information S1

(4.45 MB PDF)

Acknowledgments

The authors wish to thank the three anonymous reviewers for their constructive comments, which were very helpful for strengthening the presentation of this paper.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 60704047), Science and Technology Commission of Shanghai Municipality (Grant No. 08ZR1410600, 08JC1410600), sponsored by Shanghai Pujiang Program and Innovation Program of Shanghai Municipal Education Commission (10ZZ17). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Nakashima H, Nishikawa K. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J Mol Biol. 1994;238:54–61. [PubMed]
2. Cedano J, Aloy P, Pérez-Pons JA, Querol E. Relation between amino acid composition and cellular location of proteins. J Mol Biol. 1997;266:594–600. [PubMed]
3. Chou KC, Elrod DW. Protein subcellular location prediction. Protein Engineering. 1999;12:107–118. [PubMed]
4. Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of Molecular Biology. 2000;300:1005–1016. [PubMed]
5. Zhou GP, Doctor K. Subcellular location prediction of apoptosis proteins. PROTEINS: Structure, Function, and Genetics. 2003;50:44–48. [PubMed]
6. Small I, Peeters N, Legeai F, Lurin C. Predotar: A tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics. 2004;4:1581–1590. [PubMed]
7. Matsuda S, Vert JP, Saigo H, Ueda N, Toh H, Akutsu T. A novel representation of protein sequences for prediction of subcellular location using support vector machines. Protein Sci. 2005;14:2804–2813. [PMC free article] [PubMed]
8. Pierleoni A, Martelli PL, Fariselli P, Casadio R. BaCelLo: a balanced subcellular localization predictor. Bioinformatics. 2006;22:e408–416. [PubMed]
9. Nakai K. Protein sorting signals and prediction of subcellular localization. Advances in Protein Chemistry. 2000;54:277–344. [PubMed]
10. Chou KC, Shen HB. Review: Recent progresses in protein subcellular location prediction. Analytical Biochemistry. 2007;370:1–16. [PubMed]
12. Glory E, Murphy RF. Automated subcellular location determination and high-throughput microscopy. Dev Cell. 2007;12:7–16. [PubMed]
13. Millar AH, Carrie C, Pogson B, Whelan J. Exploring the function-location nexus: using multiple lines of evidence in defining the subcellular location of plant proteins. Plant Cell. 2009;21:1625–1631. [PMC free article] [PubMed]
14. Chou KC, Shen HB. Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. Journal of Proteome Research. 2007;6:1728–1734. [PubMed]
15. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25:25–29. [PMC free article] [PubMed]
16. Chou KC. Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS: Structure, Function, and Genetics (Erratum: ibid, 2001, Vol44, 60) 2001;43:246–255. [PubMed]
17. Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21:10–19. [PubMed]
18. Chou KC, Shen HB. Cell-PLoc: A package of web-servers for predicting subcellular localization of proteins in various organisms. Nature Protocols. 2008;3:153–162. [PubMed]
19. Shen HB, Chou KC. PseAAC: a flexible web-server for generating various kinds of protein pseudo amino acid composition. Analytical Biochemistry. 2008;373:386–388. [PubMed]
20. Chou KC. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics. 2009;6:262–274.
21. Zhou XB, Chen C, Li ZC, Zou XY. Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. Journal of Theoretical Biology. 2007;248:546–551. [PubMed]
22. Zhang GY, Li HC, Fang BS. Predicting lipase types by improved Chou's pseudo-amino acid composition. Protein & Peptide Letters. 2008;15:1132–1137. [PubMed]
23. Nanni L, Lumini A. Genetic programming for creating Chou's pseudo amino acid based features for submitochondria localization. Amino Acids. 2008;34:653–660. [PubMed]
24. Zhang GY, Fang BS. Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou's amphiphilic pseudo amino acid composition. Journal of Theoretical Biology. 2008;253:310–315. [PubMed]
25. Zeng YH, Guo YZ, Xiao RQ, Yang L, Yu LZ, et al. Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. Journal of Theoretical Biology. 2009;259:366–372. [PubMed]
26. Qiu JD, Huang JH, Liang RP, Lu XQ. Prediction of G-protein-coupled receptor classes based on the concept of Chou's pseudo amino acid composition: an approach from discrete wavelet transform. Analytical Biochemistry. 2009;390:68–73. [PubMed]
27. Lin H, Wang H, Ding H, Chen YL, Li QZ. Prediction of Subcellular Localization of Apoptosis Protein Using Chou's Pseudo Amino Acid Composition. Acta Biotheor. 2009;57:321–330. [PubMed]
28. Lin H, Ding H, Guo FB, Zhang AY, et al. Predicting subcellular localization of mycobacterial proteins by using Chou's pseudo amino acid composition. Protein & Peptide Letters. 2008;15:739–744. [PubMed]
29. Lin H. The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. Journal of Theoretical Biology. 2008;252:350–356. [PubMed]
30. Li FM, Li QZ. Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach. Protein & Peptide Letters. 2008;15:612–616. [PubMed]
31. Jiang X, Wei R, Zhang TL, Gu Q. Using the concept of Chou's pseudo amino acid composition to predict apoptosis proteins subcellular location: an approach by approximate entropy. Protein & Peptide Letters. 2008;15:392–396. [PubMed]
32. Georgiou DN, Karakasidis TE, Nieto JJ, Torres A. Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition. Journal of Theoretical Biology. 2009;257:17–26. [PubMed]
33. Fang Y, Guo Y, Feng Y, Li M. Predicting DNA-binding proteins: approached from Chou's pseudo amino acid composition and other specific sequence features. Amino Acids. 2008;34:103–109. [PubMed]
34. Ding H, Luo L, Lin H. Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. Protein & Peptide Letters. 2009;16:351–355. [PubMed]
35. Esmaeili M, Mohabatkar H, Mohsenzadeh S. Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. Journal of Theoretical Biology. 2010;263:203–209. [PubMed]
36. Ding YS, Zhang TL. Using Chou's pseudo amino acid composition to predict subcellular localization of apoptosis proteins: an approach with immune genetic algorithm-based ensemble classifier. Pattern Recognition Letters. 2008;29:1887–1892.
37. Chen C, Chen L, Zou X, Cai P. Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine. Protein & Peptide Letters. 2009;16:27–31. [PubMed]
38. Gonzalez-Diaz H, Gonzalez-Diaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks, and connectivity indices. Proteomics. 2008;8:750–778. [PubMed]
39. Gonzalez-Diaz H, Prado-Prado F, Perez-Montoto LG, Duardo-Sanchez A, Lopez-Diaz A. QSAR Models for Proteins of Parasitic Organisms, Plants and Human Guests: Theory, Applications, Legal Protection, Taxes, and Regulatory Issues. Current Proteomics. 2009;6:214–227.
40. Gonzalez-Diaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8:1676–1690. [PubMed]
41. Gonzalez-Diaz H, Vilar S, Santana L, Uriarte E. Medicinal chemistry and bioinformatics - current trends in drugs discovery with networks topological indices. Curr Top Med Chem. 2007;10:1015–1029. [PubMed]
42. Perez-Montoto LG, Prado-Prado F, Ubeira FM, Gonzalez-Diaz H. Study of Parasitic Infections, Cancer, and other Diseases with Mass-Spectrometry and Quantitative Proteome-Disease Relationships. Current Proteomics. 2009;6:246–261.
43. Camon E, Magrane M, Barrell D, Binns D, Fleischmann W, et al. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res. 2003;13:662–672. [PMC free article] [PubMed]
44. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001;29:2994–3005. [PMC free article] [PubMed]
45. Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, et al. Protein function annotation by homology-based inference. Genome Biol. 2009;10:207. [PMC free article] [PubMed]
46. Gerstein M, Thornton JM. Sequences and topology. Curr Opin Struct Biol. 2003;13:341–343. [PubMed]
47. Chou KC. Review: Structural bioinformatics and its impact to biomedical science. Current Medicinal Chemistry. 2004;11:2105–2134. [PubMed]
48. Schnell JR, Chou JJ. Structure and mechanism of the M2 proton channel of influenza A virus. Nature. 2008;451:591–595. [PMC free article] [PubMed]
49. Wang J, Pielak RM, McClintock MA, Chou JJ. Solution structure and functional analysis of the influenza B proton channel. Nat Struct Mol Biol. 2009;16:1267–1271. [PMC free article] [PubMed]
50. Chou KC. Insights from modelling the 3D structure of the extracellular domain of alpha7 nicotinic acetylcholine receptor. Biochemical and Biophysical Research Communication. 2004;319:433–438. [PubMed]
51. Chou KC, Cai YD. Using functional domain composition and support vector machines for prediction of protein subcellular location. Journal of Biological Chemistry. 2002;277:45765–45769. [PubMed]
52. Cai YD, Zhou GP, Chou KC. Support vector machines for predicting membrane protein types by using functional domain composition. Biophysical Journal. 2003;84:3257–3263. [PMC free article] [PubMed]
53. Murvai J, Vlahovicek K, Barta E, Pongor S. The SBASE protein domain library, release 8.0: a collection of annotated protein sequence segments. Nucleic Acids Research. 2001;29:58–60. [PMC free article] [PubMed]
54. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41. [PMC free article] [PubMed]
55. Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, et al. SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 2006;34:D257–260. [PMC free article] [PubMed]
56. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, et al. Pfam: clans, web tools and services. Nucleic Acids Res. 2006;34:D247–251. [PMC free article] [PubMed]
57. Marchler-Bauer A, Anderson JB, Derbyshire MK, DeWeese-Scott C, Gonzales NR, et al. CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res. 2007;35:D237–240. [PMC free article] [PubMed]
58. Chou KC, Shen HB. Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers. Journal of Proteome Research. 2006;5:1888–1897. [PubMed]
59. Denoeux T. A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Transactions on Systems, Man and Cybernetics. 1995;25:804–813.
60. Chou KC, Zhang CT. Review: Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology. 1995;30:275–349. [PubMed]
61. Jahandideh S, Abdolmaleki P, Jahandideh M, Asadabadi EB. Novel two-stage hybrid neural discriminant model for predicting proteins structural classes. Biophys Chem. 2007;128:87–93. [PubMed]
62. Jahandideh S, Sarvestani AS, Abdolmaleki P, Jahandideh M, Barfeie M. gamma-Turn types prediction in proteins using the support vector machines. J Theor Biol. 2007;249:785–790. [PubMed]
63. Chen K, Kurgan LA, Ruan J. Prediction of protein structural class using novel evolutionary collocation-based sequence representation. J Comput Chem. 2008;29:1596–1604. [PubMed]
64. Jiang Y, Iglinski P, Kurgan L. Prediction of protein folding rates from primary sequences using hybrid sequence representation. J Comput Chem. 2008. [PubMed]
65. Yang JY, Peng ZL, Yu ZG, Zhang RJ, Anh V, et al. Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. Journal of Theoretical Biology. 2009;257:618–626. [PubMed]
66. Vilar S, Gonzalez-Diaz H, Santana L, Uriarte E. A network-QSAR model for prediction of genetic-component biomarkers in human colorectal cancer. Journal of Theoretical Biology. 2009;261:449–458. [PubMed]
67. Nanni L, Lumini A. A Further Step Toward an Optimal Ensemble of Classifiers for Peptide Classification, a Case Study: HIV Protease. Protein & Peptide Letters. 2009;16:163–167. [PubMed]
68. Garg A, Bhasin M, Raghava GP. Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J Biol Chem. 2005;280:14427–14432. [PubMed]

Articles from PLoS ONE are provided here courtesy of Public Library of Science