# PLPD: reliable protein localization prediction from imbalanced and overlapped datasets

^{1,}

^{2}Dae-Won Kim,

^{3}DoKyun Na,

^{1}Kwang H. Lee,

^{1,}

^{2}and Doheon Lee

^{1,}

^{*}

^{1}Department of BioSystems, KAIST, Daejeon City, Republic of Korea

^{2}Advanced Information Technology Research Center, KAIST, Daejeon City, Republic of Korea

^{3}School of Computer Science and Engineering, Chung-Ang University, Seoul City, Republic of Korea

^{*}To whom correspondence should be addressed. Tel: +82 42 869 4316; Fax: +82 42 869 8680; Email: rk.ca.tsiak.tfosoib@eelhd

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

Subcellular localization is one of the key functional characteristics of proteins. An automatic and efficient prediction method for the protein subcellular localization is highly required owing to the need for large-scale genome analysis. From a machine learning point of view, a dataset of protein localization has several characteristics: the dataset has too many classes (there are more than 10 localizations in a cell), it is a multi-label dataset (a protein may occur in several different subcellular locations), and it is too imbalanced (the number of proteins in each localization is remarkably different). Even though many previous works have been done for the prediction of protein subcellular localization, none of them tackles effectively these characteristics at the same time. Thus, a new computational method for protein localization is eventually needed for more reliable outcomes. To address the issue, we present a protein localization predictor based on D-SVDD (PLPD) for the prediction of protein localization, which can find the likelihood of a specific localization of a protein more easily and more correctly. Moreover, we introduce three measurements for the more precise evaluation of a protein localization predictor. As the results of various datasets which are made from the experiments of Huh *et al*. (2003), the proposed PLPD method represents a different approach that might play a complimentary role to the existing methods, such as Nearest Neighbor method and discriminate covariant method. Finally, after finding a good boundary for each localization using the 5184 classified proteins as training data, we predicted 138 proteins whose subcellular localizations could not be clearly observed by the experiments of Huh *et al*. (2003).

## INTRODUCTION

Recent advances in large-scale genome sequencing have resulted in the huge accumulation of protein amino acid sequences (1). Currently, many researchers are trying either to discover or to clarify the unknown functions of these proteins. Since knowing the subcellular localization where a protein resides can give important insight into its possible functions (2), it is indispensable to identify the subcellular localization of a protein. However, it is time consuming and costly to identify the subcellular localizations of newly found proteins entirely by performing experimental tests. Thus, a reliable and efficient computational method is highly required to directly extract localization information.

From a machine learning point of view, a task that predicts the subcellular localizations of given proteins has several characteristics which demonstrate the task's complexity (see Table 1). First, there are too many localizations in a cell. For example, according to the work of Huh *et al*. (3), there are 22 distinct subcellular localization categories in budding yeast. It means that the possibility of correct prediction of one localization is <4.55% with a random guess. Second, the prediction task is a ‘multi-label’ classification problem; some proteins may have several different subcellular localizations (1). For instance, the YBR156C can be located either in ‘microtubule’ or ‘nucleus’ according to the work of Huh *et al*. (3). Thus, a computational method should be able to handle the multi-label problem. Finally, the number of proteins in each localization is too different making a protein localization data set highly ‘imbalanced’. It is generally accepted that proteins located in some organelles are much more abundant than in others (2). It also can be checked with the data of Huh *et al*. (3); the number of proteins in ‘cytoplasm’ is 1782, while the number of proteins in ‘ER to Golgi’ is only 6 (see the second column of Table 1). All these three characteristics make the task difficult. Thus, not only good features for a protein but also a good computational algorithm is ultimately needed for the reliable prediction of protein subcellular localization.

Actually, many works have been done during the last decade or so in this field. The efforts in these works have followed several trends (see Table 2).

1) *Feature Extraction:* one trend is to try to extract good information (or features) from given proteins. One category of the features used is based on amino acid composition (AA) (1,2,4–14,15–22). Many works have used AA as the unique feature or the complementary feature of a protein owing to its simplicity and its high coverage. Prediction based on only AA features would lose sequence order information. Thus, to give sequential information to the AA, Nakashima and Nishikawa (4) also used amino acid pair composition (PairAA), Chou (5) used pseudo amino acid composition (PseAA) using sequence-order correlation (SOC) factor (6), and Park and Kanehisa (18) also used gapped amino acid composition (GapAA). For more sequence order effect, Pan *et al*. (19) used digital signal processing filter technique to the PseAA. These features based on AA have the advantage of achieving a very high coverage but may have limit on the high performance. Other researchers have used several kinds of motif information as the feature of proteins (1,9–12,23,24–26). Since Nakai and Horton (23) used protein sorting signal motifs in the N-terminal portion of a protein, some researchers (24) have used the motif in the prediction of localization. Using the 2005 functional domain sequences of SBASE-A which is a collection of well known structural and functional domain types (28), Chou and Cai (9,25) represented a protein as a vector with a 2005-dimensional functional domain composition (SBASE-FunD). They also introduced the dimensional functional domain composition (InterPro-FunD) (1,11) using the InterPro database (29). In contrast, Nair and Rost (26) represented a protein with functional annotations from the SWISS-PROT database (30). Recently, for higher prediction accuracy Cai and Chou (1,11) used Gene Ontology (GO) term as a auxiliary feature of a protein. Even though motif information and GO can improve the prediction accuracy, the information has a limited coverage of the proteins.

2) *Class coverage extension:* another trend is to increase the coverage of protein localization for practical use. At the beginning, Nakashima and Nishikawa (4) distinguished between intracellular proteins and extracellular proteins using the AA and the PairAA features. After that, many researchers enlarged the number of localization classes to 5 classes (13), to 8 classes (27), to 11 classes (23), to 12 classes (2), then to 14 classes (14). Recently Chou and Cai (1) used up to 22 localization classes using the dataset of Huh *et al*. (3), which is the biggest coverage of protein localization up to now.

3) *Computational algorithm:* to improve the prediction quality, another trend is to try to use an efficient computational algorithm in the prediction stage. Current computational methods include the following: a Least Distance Algorithm using various distance measures [a distance in PlotLock (31) that is modified from Mahalanobis distance originally introduced by Chou in predicting protein structural class (32), a Covariant discriminant algorithm (CD) in (2,27), and an augmented CD in (6)], an Artificial Neural Network approach in (15,20), a Nearest Neighbor approach in (1,11,17,23,26), a Markov Model (MM) in (21), a Bayesian Network (BN) approach in (24), and Support Vector Machines (SVMs) approach in (7,9,16,18,25). In (12), three algorithms, such as SVMs, a Hidden MM and a BN are used for improving prediction accuracy.

Even though many previous works have been done for the prediction of protein subcellular localization, none of them tackled effectively the three characteristics of protein localization prediction at the same time. For example, many existing predictors use only less than five different subcellular localizations. Moreover, very few predictors deal with the issue of multiple-localization proteins except for Chou and Cai (1). The majority only assumed that there is no multiple-localization protein. Furthermore, almost all previous methods did not consider the imbalanced problem in a given dataset. That means these methods achieve high accuracy only for the most populated localizations, such as the ‘nucleus’ and ‘cytosol’. They, however, are generally less accurate on the numerous localizations containing fewer individual proteins. Thus, a new computational method is eventually needed for more reliable prediction which should have the following characteristics: (i) it can show relatively good performance in case many classes exist, (ii) it can handle a multi-label problem and (iii) it should be robust in an imbalanced dataset. Our study is aimed to address these issues.

To achieve the purpose, we developed a PLPD method which can predict better the localization information of proteins using a Density-induced Support Vector Data Description (D-SVDD) approach. The PLPD stands for ‘Protein Localization Predictor based on D-SVDD’. The D-SVDD (33) is a general extension of conventional Support Vector Data Description (C-SVDD) (34–36) inspired by the SVMs (37). According to the work of Lee *et al*. (33), D-SVDD highly outperformed the C-SVDD. D-SVDD is one of one-class classification methods whose purpose is to give a compact description of a set of data referred to as target data. One-class classification methods are suitable for imbalanced datasets since find compact descriptions for target data independently from other data (33,36). Moreover, they are easily used for the dataset whose number of classes is big owing to linear complexity with regard to the number of classes. However, original D-SVDD is not for a multi-class and multi-label problem. For the protein localization problem, thus, we propose the PLPD method by extending the original D-SVDD method using the likelihood of a specific protein localization.

The structure of the paper is organized as follows: First, we briefly provide the information on the C-SVDD and the D-SVDD in Section 2. In this section, we also introduce the proposed PLPD method for protein localization prediction. Section 3 highlights the potentials of the proposed approach through experiments with datasets from the work of Huh *et al*. (3). Concluding remarks are presented in Section 4.

## MATERIALS AND METHODS

### Conventional support vector data description

Since C-SVDD is not well introduced in bioinformatics fileds, we first briefly describe the basic ideas of the C-SVDD. Suppose a dataset containing *n* target data, {**x**_{i} ∣ *i* = 1, … , *n*}, is given in the task of one-class classification. The basic idea of a C-SVDD (36) is to find the hypersphere (**a**, *R*) with minimum volume which includes most of the target data, where **a** and *R* are respectively the center and the radius of the solution (the hypersphere) as shown in Figure 1. To permit **x**_{i} to be located in the outside of a hypersphere, the C-SVDD introduces a slack variable ξ_{i} ≥ 0 for each target data point analogous to SVMs (38). Thus the solution of the C-SVDD is obtained by minimizing the objective function *O*:

subject to (**x**_{i} − **a**) · (**x**_{i} − **a**) ≤ *R*^{2} + ξ_{i} where the parameter *C*^{+} > 0 gives the trade-off between volume of a hypersphere and the number of errors (number of target data rejected) (36).

**...**

When negative data **x**_{l} which should not be included in a compact description of target data are available during training, C-SVDD utilizes it. In this case, the C-SVDD finds the minimum-volume hypersphere that includes most of target data and at the same time, excludes most of negative data as shown in Figure 2. Suppose we have both *n* target data and *m* negative data. (Note: the total number of training data are *N* where *N* = *n* + *m*.) By using another slack variable ξ_{l} ≥ 0 for the permission of including negative data in a compact description, the objective function of this case is written as:

subject to (**x**_{i} − **a**) · (**x**_{i} − **a**) ≤ *R*^{2} + ξ_{i} and (**x**_{l} − **a**) · (**x**_{l} − **a**) > *R*^{2} − ξ_{l} where *C*^{−} is another parameter which controls the trade-off between the volume of data description and the errors of negative data (36).

By using Lagrange Multipliers α_{k} for **x**_{k}, the dual problem of the C-SVDD is reduced to maximizing *D*(α):

subject to ${\sum}_{k=1}^{N}{y}_{k}{\alpha}_{k}=1$, 0 ≤ α_{i} ≤ *C*^{+}, and 0 ≤ α_{l} ≤ *C*^{−} where *y _{k}* is the label of data

**x**

_{k}. (Note:

*y*= 1 for a target data point, otherwise

_{k}*y*= −1.) After solving

_{k}*D*(α) with regard to α

_{k}, the

**a**of the optimal hypersphere can be calculated by $\mathbf{a}={\sum}_{k=1}^{N}{y}_{k}{\alpha}_{k}{\mathbf{x}}_{k}$, and the radius

*R*can be obtained by the distance between

**a**and any target data point

**x**

_{i}that is located on the boundary of the hyperplane.

By comparing the distance between a test data point **x**_{t} and **a** with *R*, the C-SVDD determines the decision whether **x**_{t} is the same data type with the target data or not as:

where *I* is an indicator function (35). Thus, if the distance between **x*** _{t}* and

**a**is less than

*R*, then we predict that

**x**

_{t}is included in the given target dataset; otherwise, we predict that

**x**

_{t}is not included.

Similar to SVMs, the C-SVDD can find a more flexible data description in a high-dimensional feature space without directly mapping the space using a kernel function *K*(·,·) (38). As shown in Figure 3, the C-SVDD finds a solution by mapping the input training data into a possible high-dimensional feature space using a mapping function Φ, and seeking a hypersphere in the feature space. However, instead of explicitly mapping the training data into the feature space, C-SVDD directly finds the solution in the input space using the corresponding kernel *K*(**x**_{i}, **x**_{j}) between any two data **x**_{i} and **x**_{j}, defined by:

Thus, the kernelized form of Equation 3 is:

Using the kernel trick, the C-SVDD directly finds a more flexible boundary in an original input space as shown in the left figure of Figure 3 in an original input space.

### Density-induced support vector data description

As mentioned earlier, a C-SVDD finds a compact description that includes most of target data in a high-dimensional feature space, and the data that are not fully included in the hypersphere are called support vectors (36). In a C-SVDD, these support vectors completely determine the compact description (36) even though most target data can be non-support vectors. This can be a problem in data domain description especially when support vectors do not have the characteristics of a target dataset regarding its density distribution (33).

To address the problems outlined above and to identify the optimal description more easily, Lee *et al*. (33) recently proposed a D-SVDD which is a general extension of the C-SVDD by introducing the notion of a relative density degree for each data point. Using the relative density degree, Lee *et al*. defined a density-induced distance measurement for target data making data with higher density degrees give more influence on finding a compact description. By several experiments, the D-SVDD highly improved the performance of the C-SVDD (33). However, the original work of Lee *et al*. (33) only introduced the mechanism that can incorporate the density degrees of the target data, not negative data. Thus, in this study, we also introduce the mechanism of incorporating the density degrees of negative data by defining a density-induced distance measurement for negative data.

#### 1) *Relative density degree and density induced distance*

to reflect the density distribution into searching the boundary of the C-SVDD, Lee *et al*. first introduced the notion of a relative density degree for each target data point which means the density of the region of the corresponding data point compared to other regions in a given dataset (33). Suppose we calculate a relative density degree for a target data point **x**_{i}. By using $d({\mathbf{x}}_{i},{\mathbf{x}}_{i}^{K})$, the distance between **x**_{i} and ${\mathbf{x}}_{i}^{K}$ (the *K*th nearest neighborhood of **x**_{i}), and the mean distance of *K*th nearest neighborhoods of all target data, ℑ^{K}, they defined the local density degree ρ_{i} > 0 for **x**_{i} as:

where ${\Im}^{K}=\frac{1}{n}{\sum}_{i=1}^{n}d({\mathbf{x}}_{i},{\mathbf{x}}_{i}^{K})$, *n* is the number of data in a target class, and 0 ≤ ω ≤ 1 is a weighting factor. Note that the measure reports higher local density degree ρ_{i} for the data in a higher density region: the data with lower ${\mathbf{x}}_{i}^{K}$ have higher ρ_{i} values. Moreover, a bigger ω produces higher local density degrees. In a similar manner, the relative density degrees for negative data are also calculated.

After calculating the relative density degrees, to incorporate the degrees into searching the optimal description in a C-SVDD, Lee *et al*. (33) proposed a new geometric distance called density-induced distance. They defined a positive density-induced distance ${\delta}_{i}^{+}$ between target data point **x**_{i} and the center of a hyperspherical model (**a**, *R*) of a target dataset as:

where **a** and *R* are the center and the radius of the hypersphere, respectively. The basic idea is simple. To give higher influence on the search of the minimum-sized hypersphere, they made the distance between **x**_{i} and the center **a** longer using relative density degre ρ_{i}. Note that to enclose the data point with increased distance owing to a higher relative density degree ρ_{i}, the radius of a minimum-sized hypersphere should be increased (a data point with higher ρ_{i} gives stronger influence on the boundary).

In a similar manner, we can incorporate the relative density degrees for negative data. After calculating the relative density degrees ρ_{l} for negative data **x**_{l} using Equation 7, we define another density-induced distance ${\delta}_{l}^{-}$ between negative data and **a**, the center of the hyperspherical description of a target data set, as:

where **a** and *R* are the center and the radius of a hypersphere, respectively.

Note that ${\delta}_{l}^{-}$ decreases with increasing ρ_{l}. Hence, to exclude the negative data point with decreased ${\delta}_{l}^{-}$ owing to higher relative density degree ρ_{l}, the radius of a minimum-sized hypersphere should be decreased; the negative data point with higher relative density degree gives higher degree of penalty on the search of the minimum-sized hypersphere for a target dataset.

#### 2) *Mathematical formulation of D-SVDD*

Using the density-induced distance measure ${\delta}_{i}^{+}$ for target data, Lee *et al*. reformulated the C-SVDD when only target data are available (33). Similar to C-SVDD, they permitted the possibility of training error using a slack variable ζ_{i} ≥ 0, which is the distance between the boundary Ω and **x**_{i} outside Ω. Here, ζ_{i} equals ${\left({\delta}_{i}^{+}\right)}^{2}-{R}^{2}$ for training error data, otherwise it is 0. It implies that ζ_{i} contains the information of a relative density degree of **x**_{i}.

Using the slack variable for each target data point, they deduce Equation 10 from Equation 1:

subject to ρ_{i}(**x**_{i} − **a**) · (**x**_{i} − **a**) ≤ *R*^{2} + ζ_{i}.

By introducing Lagrange Multipliers, Lee *et al*. (33) could construct the dual problem: maximize *D*(α),

subject to ${\sum}_{i=1}^{n}{\alpha}_{i}=1,0\le {\alpha}_{i}\le {C}^{+}$, and $T={\sum}_{i=1}^{n}{\alpha}_{i}{\rho}_{i}$.

After deriving α_{k} that satisfy Equation 11, **a** of the optimal description can be calculated by:

Different from the C-SVDD (36), the center **a** of the optimal hypersphere is weighted by the relative density degree ρ* _{i}* (Equation 12) where the center is shifted to a higher dense region. Moreover, the

*R*of optimal description is calculated by the ${\delta}_{i}^{+}$ distance between

**a**and any

**X**

*of which 0 < α*

_{i}_{k}<

*C*

^{+}(33).

When negative data are available, D-SVDD can also utilize them to improve the description of the target dataset. In this case, we use negative density-induced distance ${\delta}_{l}^{-}$ for negative data in order to incorporating the density distribution of negative data. Using the negative density-induced distance and another slack variable ζ_{l} for possibility of training error for each negative data point, we find the optimal hypersphere that includes most target data and excludes most negative data. Here, ζ_{l} for each negative data point is the distance between the boundary Ω and **x**_{l} inside Ω; that is, ${\zeta}_{l}={R}^{2}-{\left({\delta}_{l}^{-}\right)}^{2},({\zeta}_{l}\ge 0)$. Thus the new objective function is defined as:

subject to ${\rho}_{i}({\mathbf{x}}_{i}-\mathbf{a})\xb7({\mathbf{x}}_{i}-\mathbf{a})\le {R}^{2}+{\zeta}_{i}$ and $\frac{1}{{\rho}_{l}}({\mathbf{x}}_{l}-\mathbf{a})\xb7({\mathbf{x}}_{l}-\mathbf{a})>{R}^{2}-{\zeta}_{l}$ where ${C}^{-}>0$ is a similar control parameter with that of the C-SVDD (34).

Similar to the previous case, after constructing the Lagrangian of Equation 13, the dual representation of this case is obtained as:

subject to ${\sum}_{k=1}^{N}{y}_{k}{\alpha}_{k}=1,\text{}0\le {\alpha}_{i}\le {C}^{+},0\le {\alpha}_{l}\le {C}^{-},$ and $T={\sum}_{k=1}^{N}{y}_{k}{\alpha}_{k}{\rho}_{k}^{\prime}$ where *y _{k}* is the label of

**x**

_{k}(

*y*= 1 for a target data point, otherwise

_{k}*y*= −1), and

_{k}*N*is the total number of data (

*N*=

*n*+

*m*). Here, ${\rho}_{k}^{\prime}={\rho}_{i}$ for target data

**x**

_{i}and ${\rho}_{k}^{\prime}=1/{\rho}_{l}$ for negative data

**x**

_{l}.

Note that using ${\alpha}_{k}^{\prime}={y}_{k}{\alpha}_{k}$ and ${\rho}_{k}^{\prime}$ (instead of ρ_{k}) the dual representation of negative version of D-SVDD becomes identical to the previous case of D-SVDD (Equation 11) except for the number of data considered (*n* becomes *N* = *n* + *m*). Therefore, when negative data are available, we can just use α′_{k} and ${\rho}_{k}^{\prime}$ in the optimization problem and in the decision function of the precious case. That means there are no extra computational complications except the complication caused by the increased data size.

As seen in Equations 11 and 14, the dual forms of the objective functions of D-SVDD are represented entirely in terms of inner products of input vector pairs. Thus, we can kernelize D-SVDD where the kernelized version of the dual representation of the objective function when negative data are available is:

where *T* and constraints are the same as in Equation 14.

### PLPD for prediction of protein subcellular localization

As mentioned earlier, the prediction of protein localization is a kind of multi-class and multi-label classification problem. However, D-SVDD including conventional SVDD is not for the multi-class and multi-label classification problems; it is for one-class and mono-label classification problems. Thus, we propose a new method for the prediction of protein localization by modifying the D-SVDD.

For a multi-class and multi-label classification problem, a predictor should answer two points. First, for a multi-class problem, a predictor should report the degree of being a member of each class for a test data point **x**_{t} using some score function $f({\mathbf{x}}_{t},\xb7)$ where *f*: ℵ × Ł → ℜ. (Here, ℵ is the domain of data, Ł is the domain of class labels, and ℜ is the domain of real numbers.) That is, a label *l*_{1} is considered to be ranked higher than *l*_{2} if $f({\mathbf{x}}_{t},{l}_{1})>f({\mathbf{x}}_{t},{l}_{2})$. Second, owing to multi-label cases, the score function should be able to report multiple labels not a mono label, and the true positive labels of a protein should rank higher than any other labels. That is, if Ł_{p} is a set of the true positive labels of **x**_{t}, then a successful learning system will tend to rank the labels in the Ł_{p} higher than any other labels not in Ł_{p}.

To address the requirements mentioned above, we adopt the following procedure.

- if a training dataset is given, we divide it into a target dataset and a negative dataset by class. For a label
*l*_{i}, for instance, a data point whose label set has*l*_{i}is included in the target data; otherwise, it is included in the negative dataset. - if a target dataset and a negative dataset are prepared for each class, we find the optimal boundary for the target data by using D-SVDD as formulated in Equation 15.
- we calculate the degree of being a member of each class
*l*_{i}for a test data point**x**_{t}using the following score function*f*:where ($$f({\mathbf{x}}_{t},{l}_{i})=\frac{{R}_{i}}{d({\mathbf{x}}_{t},{\mathbf{a}}_{i})}$$(16)**a**_{i},*R*) is the optimal hypersphere of target data which are included in the class label_{i}*l*_{i}. Note that this score function reports a higher value for a test data point with smaller distance between**x**_{t}and**a**_{i}regarding to the distance of*R*_{i}. If the distance between**x**_{t}and**a**_{i}, for example, is smaller than*R*_{i}, the score function reports higher values than 1; otherwise, the score function reports smaller values than 1. That means if the location of*x*_{t}is closer to the center of the hypershpere, then the score function reports a higher degree of value. - Finally, according to the values of the score function for all classes, we rank the labels, and report them.

With this procedure, we can easily and intuitively modify the D-SVDD for a multi-class and multi-label classification problem like the prediction of protein localization by reflecting the overall characteristics of each class. We call the proposed method a PLPD.

## RESULTS AND DISCUSSION

### Data preparation and evaluation measures

#### 1) *Data preparation*

To test the effectiveness of the proposed PLPD method, we experimented with the data of Huh *et al*. (3), which is currently the biggest class coverage and is a multi-label dataset. From the website http://yeastgfp.ucsf.edu, we get 3914 unique proteins whose locations are clearly identified. The breakdown of the original 3914 different proteins (*N* = 3914) is given in the second column of Table 1. Owing to some proteins may coexist several localizations, the so-called multi-label feature as mentioned earlier, the total sum of proteins in all localizations, denoted by $\tilde{N}$, is 5184; In *N* = 3914 total different proteins, there are 2725 proteins in unique localization, 1117 proteins in two different localizations, 64 proteins in three different localizations, 7 proteins in four different localizations. And there is only 1 protein in five different localizations. Similar to the formulation by Chou and Cai (1),

Using the 3914 proteins, we represent a protein in three different ways, and make three datasets: Dataset-I, Dataset-II and Dataset-III to evaluate the performance of the proposed PLPD in various manners (see Table 1). In Dataset-I, we used AA-based features for a protein. After finding all amino acid sequences of 3914 proteins from the SWISS-PROT DataBank (30), we used PairAA features (400D; features from *a*_{21} to *a*_{420}), and GapAA up to the maximal allowed gap (9200D = 400D × 23; the minimum length among all 3914 proteins is 25; features from *a*_{421} to *a*_{9620}) for sequence information including the amino acid composition (20 Dimensions; features from *a*_{1} to *a*_{20}). That is, a protein *P* is represented as a vector in a 9620-dimensional space (9620D = 20D + 400D + 9200D) as:

This dataset is described in the third column of Table 1. As you can see in the column, the coverage of this representation is 100%; that means, all of the 3914 proteins can be represented by this manner.

In Dataset-II, we adopted a similar manner with the Chou and Cai's approach (5) using the InterPro Motifs (29). Different from Chou and Cai, we first extracted 2372 unique motifs which are occurred only in the 3914 proteins in order to delete meaningless motifs. Using the unique motif set, we represent a protein as:

where *b*_{i} = 1 if the protein has the *i*th motif in the unique motif set; otherwise, *b*_{i} = 0. The coverage of each localization is depicted in the fourth column is Table 1. With this second manner, the overall coverage is 77.08% (3017 proteins among 3914 proteins).

For Dataset-III, we combined the previous two features only for the proteins which have more than one motif in the unique motif set. In this case, thus, a protein is characterized as:

where the elements from *c*_{1} to *c*_{9620} are derived from Equation 18, and the elements from *c*_{9621} to *c*_{11992} are derived from Equation 19. The coverage is same with the Dataset-II (see the fifth column is Table 1).

#### 2) *Evaluation measures*

For multi-label learning paradigms, only one measurement is not sufficient to evaluate the performance of a predictor owing to the variety of correctness in prediction (39,40). Thus, we introduce three measurements (Measure-I, Measure-II and Measure-III) for the evaluation of a protein localization predictor. First, to check the overall success rate regarding to the total number of the unique proteins *N*, we define Measure-I as:

where *L*(*P _{i}*) is the true label set of a protein

*P*

_{i}, ${Y}_{i}^{k}$ is the predicted top-

*k*labels by a predictor, and

Note that owing to the multi-label protein localization problem, it is not sufficient to evaluate the performance of a predictor by checking only the topmost label predicted true. Thus, we check the real label set with the predicted top-*k* labels using the ψ[·, ·] function. The *k* value in Equation 22 is given by user, and we use 3 in this study since the numbers of true localization sites of most proteins are less than or equal to 3 (3).

Similar to the formulation by Chou and Cai (1), to check the overall success rate regarding to the total number of classified proteins, $\tilde{N}$, we also evaluate the performance of a predict by using Measure-II:

where ${Y}_{i}^{{k}_{i}}$ is the predicted top-*k _{i}* labels by a predictor, and the Ψ[·, ·] function returns the number of labels which is predicted correctly. Note that the

*k*value determined by the number of true labels of a protein

_{i}*P*, not by user.

_{i}As mentioned earlier, a dataset of protein localization is imbalanced in nature. Including overall success rate, thus, the information on the success rate of each class and the average rate of the success rates of each class is useful to evaluate the performance of a predictor. To achieve this, we define Measure-III as:

where μ = 22 is total number of labels or classes, *l* is a label index, ${\tilde{n}}_{l}$ is the number of proteins in the *l*th label, and

### Performance comparison

To investigate the success of the proposed PLPD method and analyze it, we conducted several tests with the three datasets (Dataset-I, Dataset-II and Dataset-III). For comparative analysis we also experimented with the ISort method (1). To the best of our knowledge, ISort method showed the best performance for the prediction of yeast protein multiple localization up to now (1). Even though the jackknife cross-validation is one of most rigorous and objective validation measures (41,42), we did a 2-fold cross-validation approach to prevent the overfitting problem to a given training dataset. For a more flexible boundary of the PLPD, we used a Gaussian RBF function for kernelization in Equation 15 owing to the Gaussian RBF function is one of most suitable functions for kernelization (36,37). In addition, we selected the model parameters, such as *C*^{+} and *C*^{−}, and the width parameter of the Gaussian RBF kernel function by using cross-validation approach (38,43) to identify the solutions of the PLPD.

The results of the ISort Method and the PLPD method for the three datasets are given in Table 3–5. As we can see in Table 3, ISort method showed 65.14, 35.91 and 10.02% according to Measure-I, Measure-II and Measure-III for the Dataset-I. For the same dataset, the PLPD method showed 73.89, 53.09 and 11.43% performance for the three measurements, respectively. This implies that the success rates of PLPD were 8.75, 17.18 and 1.41% higher than the ISort method regarding the Measure-I, the Measure-II and the Measure-III, respectively. Even though PLPD method showed better scores than the ISort method with regard to Measure-III, the two methods showed low degrees of average prediction accuracies for all localizations. For instance, ISort showed zero prediction accuracies at 10 localizations and PLPD at 9 localizations. As you can see in Table 1, those localizations whose prediction accuracies were zero have relatively small number of proteins in themselves. Thus, the prediction of localization based on only AA-based features has limitation to correctly predict all the localizations on average.

When the Dataset-II was used, the prediction accuracies of each method were higher than those of the two methods with Dataset-I. As you can see in Table 4, the ISort method showed 69.94, 44.27 and 15.33% accuracies for three measurements. On the contrary, the PLPD method showed 82.40, 56.32 and 32.41% accuracies for the three measurements. These improvements of the PLPD over the ISort were more conspicuous. Moreover, there was no zero performance for each localization in the PLPD method. In ISort method, however, there were 6 localizations(Actin, ER to Golgi, Lipid particle, Nuclear periphery, Peroxisome and Vacuolar membrane) whose prediction accuracies were zero. This means that the proposed method showed promising results even though a dataset is imbalanced when Dataset-II was used.

The best performance of each method was achieved when Dataset-III was used. Similarly, the PLPD method outperformed all the three measurements considered as shown in Table 5. The PLPD, for example, showed up to 83.49% accuracy with Measure-I, while the ISort method showed 75.90% accuracy. In regard to Measure-III, the PLPD method showed 38.78% average accuracy for all 22 localizations; it was 18.53% higher than the ISort method on average of all prediction accuracies for each localization.

In order to check the performance of PLPD with regard to Measure-III without considering the other two measurements, we did tests for the same data sets with similar manner to the previous tests. This means that we did parameter fitting process only regarding Measure-III for PLPD method. The results for Dataset-I, Dataset-II, and Dataset-III are depicted in Table 6. As you can see in Table 6, the average accuracies were highly increased. When the Dataset-III was used, for example, the average accuracy for all localization was up to 46.50%; it was 7.72% higher than the previous result of PLPD and 26.25% higher than the result of ISort method for the same dataset. It was remarkable.

From Tables 3–6, we could conclude that the proposed PLPD method outperformed the ISort method regardless of the kind of evaluation measurement and regardless of the dataset used. Moreover, Motif information could increase the prediction accuracies of the two methods considered even though its coverage is lower than AA-based information. Furthermore, the best performance was obtained when both the features were used.

Similar to the work of Cai and Chou (44), to avoid homology bias, we removed all the sequences with >40% sequence homology and after then, we performed similar experiments to the previous cases. The new datasets are depicted in Table 1 in the Supplementary Data and the results of this experiments are described in Tables 2–4 in Supplementary Data. As we can see in the tables, we observed similar phenomena with previous results that were depicted in Tables 3–5, which means that the PLPD method can play a complimentary role to existing methods, regardless of the existence of sequence homology.

Using the 5184 classified proteins as training data we predicted 138 proteins whose subcellular localizations could not be clearly observed by the experiments of Huh *et al*. (3). Since the prediction accuracies were highest when both AA-based features and the unique InterPro Motif set were used as features of a protein, we used the information together. Actually we could not know the number of true localizations of a protein whose localization is not known yet, we enlist top three localizations which have the highest likelihood to be the true localization. In some cases, proteins have similar degrees in the likelihood. Thus, to treat this issue, we quantize the degrees of likelihood and treat them in the same rank. The predicted results of first 35 unknown localization proteins are given in the Table 7, where the roman numerals indicate the rank of likelihood of the corresponding localization. (See Supplementary Data for all results.)

Among the predicted protein localizations, YBL105C, YBR072W and YDR313C are analyzed for validation as an example. As shown in Table 7, YBL105C is predicted to localize to ‘cytoplasm’ as the first rank, to ‘bud neck’ and ‘cell periphery’ as the second ranks. YBL105C (PKC1) is a Ser/Thr kinase and controls signaling pathway for cell wall integrity, for instance bud emergence and cell wall remodelling during growth (45,46). PKC1 contains C1, C2 and HR1 domains. HR1 domain targets PKC1 to bud tip and C1 domain targets it to cell periphery (47). Thus, PKC1 localizes to the ‘bud neck’ for its function. Those indicate that the localizations of YBL105C found from literature are consistent with our prediction results; ‘bud neck’ and ‘cell periphery’.

YBR072W (HSP26) is a heat shock protein which transforms into high molecular weight aggregate on heat shock and binds to non-native proteins (48,49). Depending on cellular physiological conditions, HSP26 localizes differently; accumulation in the ‘nucleus’ or wide-spread throughout the cell (48). The localizations of YBR072W found from literature, ‘nucleus’ and ‘cytoplasm’, are predicted correctly by our method.

Finally, YDR313C (PIB1) is a RING-type ubiquitin ligase containing FYVE finger domain. PIB1 binds specifically to phosphatidylinositol(3)-phosphate. Phosphatidylinositol(3)-phosphate is a product of phosphoinositide 3-kinase which is an important regulator of signaling cascade and intracellular membrane trafficking (50). The FYVE domain targets PIB1 to ‘endosome’ and ‘vacuolar membrane’ (51). The localizations of YDR313C found from literature, ‘endosome’ and ‘vacuolar membrane’, are consistent with our prediction results.

## CONCLUSION

Subcellular localization is one of the most basic functional characteristics of a protein, and an automatic and efficient prediction method for the localization is highly required owing to the need for large-scale genome analysis. Even though many previous works have been done for the task of protein subcellular localization prediction, none of them tackles effectively all the characteristics of the task: a multiple class problem (there are too many localizations), a multi-label classification problem (a protein may have several different localizations), and an imbalanced dataset (a protein dataset is too imbalanced in nature). To get more reliable results, thus, a new computational method is eventually needed.

In this paper, we developed a PLPD method for the prediction of protein localization, which can find the likelihood of the specific localization for a protein more easily and more correctly. PLPD method is developed by using the Density-induced Support Vector Data Description (D-SVDD) (33). D-SVDD is one of one-class classification methods which is suitable for imbalanced datasets since they finds a compact description of a target data independently from other data (33,36). Moreover, it is easily used for the dataset whose number of classes is big owing to linear complexity with regard to the number of classes. However, D-SVDD is originally not for a multi-class and multi-label problem. Thus, we extended the D-SVDD for the prediction of protein subcellular localization. Moreover, we have introduced three measurements (Measure-I, Measure-II and Measure-III) for the evaluation of a protein localization predictor to more precisely evaluate the predictor. As the results of three datasets which are made by the experimental results of Huh *et al*. (3), the proposed PLPD method represents a different approach that might play a complimentary role to the existing methods, such as Nearest Neighbor method and discriminate covariant method. Finally, after finding the good boundary of each localization using the 5184 classified proteins as training data, we predicted 138 proteins whose subcellular localizations could not be clearly observed by the experiments of Huh *et al*. (3).

For the reliable prediction of subcellular localizations of proteins, both good features for a protein and a good computational algorithm are ultimately needed. Actually, this study mainly focused on a good computational algorithm for protein localization prediction. In this paper, we represented proteins in three different ways: AA-based features in Dataset-I, Motif-based features in Dataset-II and AA-and-Motif-based features in Dataset-III. From this study we observed that Motif-based features are more informative than AA-based features in protein localization prediction even though Motif-based features have lower coverage than AA-based features. Moreover, best performance was achieved when both AA-based features and Motif-based features are used simultaneously. In the current study we achieved relatively high performance in the protein localization prediction problem. However, it is not sufficient yet and for better results, further study should focus on a mechanism that can extract better information from given proteins.

## SUPPLEMENTARY DATA

Supplementary data are available at NAR online.

## Acknowledgments

The authors thank the anonymous reviewers for their helpful comments. This work was supported by National Research Laboratory Grant (2005–01450) and the Korean Systems Biology Research Grant (2005–00343) from the Ministry of Science and Technology. The authors would like to thank CHUNG Moon Soul Center for supporting BioInformation and BioElectronics and the IBM SUR program for providing research and computing facilities. Funding to pay the Open Access publication charges for this article was provided by Korean Science and Engineering Foundation (KOSEF).

*Conflict of interest statement.* None declared.

## REFERENCES

*Saccharomyces cerevisiae*in bud emergence and identification of a putative upstream regulator. EMBO J. 1997;16:4924–4937. [PMC free article] [PubMed]

*Saccharomyces cerevisiae*protein kinase C to sites of polarized growth and Pkc1p targeting to the nucleus and mitotic spindle. Eukaryotic Cell. 2005;4:36–45. [PMC free article] [PubMed]

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (196K) |
- Citation

- A method to improve protein subcellular localization prediction by integrating various biological data sources.[BMC Bioinformatics. 2009]
*Tung TQ, Lee D.**BMC Bioinformatics. 2009 Jan 30; 10 Suppl 1:S43. Epub 2009 Jan 30.* - FGsub: Fusarium graminearum protein subcellular localizations predicted from primary structures.[BMC Syst Biol. 2010]
*Sun C, Zhao XM, Tang W, Chen L.**BMC Syst Biol. 2010 Sep 13; 4 Suppl 2:S12. Epub 2010 Sep 13.* - Semi-supervised protein subcellular localization.[BMC Bioinformatics. 2009]
*Xu Q, Hu DH, Xue H, Yu W, Yang Q.**BMC Bioinformatics. 2009 Jan 30; 10 Suppl 1:S47. Epub 2009 Jan 30.* - Enhancing membrane protein subcellular localization prediction by parallel fusion of multi-view features.[IEEE Trans Nanobioscience. 2012]
*Yu D, Wu X, Shen H, Yang J, Tang Z, Qi Y, Yang J.**IEEE Trans Nanobioscience. 2012 Dec; 11(4):375-85. Epub 2012 Aug 3.* - Multilabel learning for protein subcellular location prediction.[IEEE Trans Nanobioscience. 2012]
*Li GZ, Wang X, Hu X, Liu JM, Zhao RW.**IEEE Trans Nanobioscience. 2012 Sep; 11(3):237-43.*

- Proteome-wide remodeling of protein location and function by stress[Proceedings of the National Academy of Scie...]
*Lee K, Sung MK, Kim J, Kim K, Byun J, Paik H, Kim B, Huh WK, Ideker T.**Proceedings of the National Academy of Sciences of the United States of America. 2014 Jul 29; 111(30)E3157-E3166* - Proteome-wide discovery of mislocated proteins in cancer[Genome Research. 2013]
*Lee K, Byun K, Hong W, Chuang HY, Pack CG, Bayarsaikhan E, Paek SH, Kim H, Shin HY, Ideker T, Lee B.**Genome Research. 2013 Aug; 23(8)1283-1294* - mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines[BMC Bioinformatics. ]
*Wan S, Mak MW, Kung SY.**BMC Bioinformatics. 13290* - Imbalanced Multi-Modal Multi-Label Learning for Subcellular Localization Prediction of Human Proteins with Both Single and Multiple Sites[PLoS ONE. ]
*He J, Gu H, Liu W.**PLoS ONE. 7(6)e37155* - Protein subcellular localization prediction of eukaryotes using a knowledge-based approach[BMC Bioinformatics. ]
*Lin HN, Chen CT, Sung TY, Ho SY, Hsu WL.**BMC Bioinformatics. 10(Suppl 15)S8*

- PubMedPubMedPubMed citations for these articles
- SubstanceSubstancePubChem chemical substance records that cite the current articles. These references are taken from those provided on submitted PubChem chemical substance records.
- TaxonomyTaxonomyTaxonomy records associated with the current articles through taxonomic information on related molecular database records (Nucleotide, Protein, Gene, SNP, Structure).
- Taxonomy TreeTaxonomy Tree

- PLPD: reliable protein localization prediction from imbalanced and overlapped da...PLPD: reliable protein localization prediction from imbalanced and overlapped datasetsNucleic Acids Research. 2006 Oct; 34(17)4655

Your browsing activity is empty.

Activity recording is turned off.

See more...