BMC Bioinformatics. 2008; 9: 80.
Published online Feb 1, 2008. doi:  10.1186/1471-2105-9-80
PMCID: PMC2262056

ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization

Abstract

Background

Gene Ontology (GO) annotation, which describes the function of genes and gene products across species, has recently been used to predict protein subcellular and subnuclear localization. Existing GO-based prediction methods for protein subcellular localization use the known accession numbers of query proteins to obtain their annotated GO terms. An accurate prediction method for predicting subcellular localization of novel proteins without known accession numbers, using only the input sequence, is worth developing.

Results

This study proposes an efficient sequence-based method (named ProLoc-GO) that mines informative GO terms for predicting protein subcellular localization. For each protein, BLAST is used to obtain a homolog with a known accession number, which is then used to retrieve the GO annotation. A large number n of all annotated GO terms that have ever appeared is then obtained from a large set of training proteins. A novel genetic-algorithm-based method (named GOmining) combined with a support vector machine (SVM) classifier is proposed to simultaneously identify a small number m out of the n GO terms as input features to the SVM, where m << n. The m informative GO terms contain the essential GO terms annotating subcellular compartments such as GO:0005634 (Nucleus), GO:0005737 (Cytoplasm) and GO:0005856 (Cytoskeleton). Two existing data sets, SCL12 (human proteins with 12 locations) and SCL16 (eukaryotic proteins with 16 locations), with <25% sequence identity are used to evaluate ProLoc-GO, which has been implemented by using a single SVM classifier with the m = 44 and m = 60 informative GO terms, respectively. ProLoc-GO using input sequences yields test accuracies of 88.1% and 83.3% for SCL12 and SCL16, respectively, which are significantly better than those of the SVM-based methods, which achieve <35% test accuracies using amino acid composition (AAC) with acid pairs and AAC with dipeptide composition. For comparison, ProLoc-GO using known accession numbers of query proteins yields test accuracies of 90.6% and 85.7%, which are also better than those of Hum-PLoc (85.0%) and Euk-OET-PLoc (83.7%), which use ensemble classifiers with hybridization of GO terms and amphiphilic pseudo amino acid composition, for SCL12 and SCL16, respectively.

Conclusion

The growth of Gene Ontology in size and popularity has increased the effectiveness of GO-based features. GOmining can serve as a tool for selecting informative GO terms in solving sequence-based prediction problems. The prediction system using ProLoc-GO with input sequences of query proteins for protein subcellular localization has been implemented (see Availability).

Background

Gene Ontology (GO) [1] annotation, which describes the function of genes and gene products across species, has recently been utilized to predict protein subcellular and subnuclear localization. The prediction of protein localization is important for elucidating protein functions involved in various cellular processes. Additionally, the completion of various genome sequencing projects has led to the accumulation of a massive amount of gene sequence information. For example, the percentage of large-scale eukaryotic proteins with subcellular locations annotated in the Swiss-Prot database increased rapidly from 52.4% (version 49.5, released on April 18, 2006) [2] to 69.4% (version 50.7, released Sep. 11, 2006) [3]. Meanwhile, the percentage of proteins with subcellular locations annotated in the GO database increased from 44.9% [2] to 65.5% [3]. The growth of the GO database in size and popularity increases the effectiveness of GO-based features.

Some existing computational methods in the literature for predicting protein localization are described below, grouped according to the classifiers and features they use.

a) Mining informative features. The prediction methods in this group focus on mining informative features consisting of GO terms [2-5], sorting signals [6,7], amino acid composition (AAC) [8-10], k-peptide encoding vector [7,11-14], physicochemical properties of amino acids [15-17], and fusing AAC and physicochemical properties [2,4,18,19].

b) Designing efficient classifiers. Most of the following prediction methods use effective classifiers based on support vector machine (SVM) [5,10-12,14,16,17,20] or the k nearest neighbour (k-NN) classifiers [2,4,5,13,19,21].

c) Integrating informative features with efficient classifier. Methods in this group include pSLIP [17], ProLoc [18], Euk-OET-PLoc [2] and Hum-PLoc [4]. The pSLIP system utilizes five top-rank features of physicochemical properties according to the prediction accuracy of SVM using a single feature [17]. The ProLoc system uses SVM with automatic selection from physicochemical properties to predict protein subnuclear localization [18]. The two ensemble classifiers Euk-OET-PLoc [2] and Hum-PLoc [4] fuse many basic individual classifiers operated by the engine of k-NN rules, where protein sequences are represented by hybridizing the GO annotation and amphiphilic pseudo amino acid (Pse-AA) composition.

Additionally, the two efficient GO-based systems Euk-OET-PLoc [2] and Hum-PLoc [4] predict subcellular localization of proteins using their known accession numbers. However, they cannot work for novel proteins without known accession numbers. The GO-AA method [5], which uses GO terms of homologs retrieved by BLAST to assess protein similarity, can deal with novel proteins without known accession numbers for subnuclear localization prediction. Moreover, some SVM-based methods using only the features derived from input sequences, such as ProtLock with AAC [8], Ploc with AAC and acid pairs [10], and HSLPred with AAC and dipeptide composition [11], predict subcellular localization inaccurately [4]. Therefore, this study develops an accurate SVM-based method for predicting subcellular localization of novel proteins by using input sequences with BLAST.

The Gene Ontology provided by the GO Consortium [1] has quickly grown in size and popularity. The newest version (UniProt 52.0 released in September 2007) of GO [22] contained 29,383 terms in the three branches, molecular function, biological process and cellular component. The terms and relationships among them are represented by a directed acyclic graph in which vertices represent the GO terms, and edges represent the relationships among these terms. Genes can be annotated with GO terms creating gene associations that can be used for whole genome analyses [23].

GO annotation has been successfully used in various sequence-based applications, which can be classified into two groups. 1) The first group uses the GO terms and their corresponding structure information of GO graph, such as grouping GO terms to improve the assessment of gene set enrichment [24]; using GO with probabilistic chain graphs for protein classification [25,26], and prediction of subnuclear localization [5]. 2) The second group uses GO terms only without structure information, such as predicting transcription factor DNA binding preference [27] and various predictions of subcellular and subnuclear localization [2,4,5]. In the second group, protein sequences are often represented as high dimensional vectors of n binary features, where n is the total number of terms in the complete annotation set (a component of 1 if the annotation is hit, and 0 otherwise) [28]. This representation is valuable in well-known vector space clustering algorithms such as k-NN [2,4,13,19,21] and fuzzy k-NN [13,29,30]. However, because n is often large, and each gene product is generally annotated by few GO terms, the vectors became long and sparse, making the clustering rather problematic [28].

This study proposes an efficient method, named GOmining, based on an intelligent genetic algorithm (IGA) [31,32] incorporating an SVM classifier to simultaneously identify a small number m out of a large number n of GO terms as input features, where m << n. Some GO annotations corresponding to subcellular compartments are called essential GO terms for subcellular localization prediction, such as GO:0005634 (Nucleus), GO:0005737 (Cytoplasm) and GO:0005856 (Cytoskeleton), shown in Table 1. These essential GO terms are regarded as domain knowledge to be included in the feature set of m informative GO terms for subcellular localization prediction. A prediction method ProLoc-GO based on GOmining was implemented using the feature set of informative GO terms. This method performed well in predicting protein subcellular localization from input sequences only.

Table 1
Essential GO terms and their definitions

Results

Data sets

Two existing data sets SCL12 [4] and SCL16 [2] obtained from UniProtKB/Swiss-Prot database [33] were used to evaluate the proposed method ProLoc-GO. The SCL12 and SCL16 have 2041 human proteins localized in 12 human subcellular compartments and 4150 eukaryotic proteins in 16 subcellular compartments, respectively. The two data sets were operated by a culling program [34] so that those sequences had < 25% sequence identity.

The proteins in SCL12 were screened strictly using the following rules: 1) only those sequences annotated with "human" in the ID (identification) field were collected; 2) sequences annotated with ambiguous or uncertain terms, such as "potential", "probable", "probably", "maybe", or "by similarity", were excluded; 3) sequences annotated by two or more locations were excluded, and 4) sequences with fewer than 50 amino acid residues were removed [4]. The data set SCL12 was divided into two parts, SCL12L and SCL12T, with 919 and 1122 proteins, respectively. The SCL12L set was used for training and the SCL12T was used for independent testing, as shown in Table 2 [4].

Table 2
Data set SCL12. The data set SCL12 consists of SCL12L and SCL12T as the learning data set and testing data set, respectively. There are 12 essential GO terms corresponding to subcellular compartments. The number t of (t) in SCL12L represents the number ...

The proteins of SCL16 were screened according to four criteria. The first criterion is to exclude sequences annotated with "prokaryotic", because this study focused only on eukaryotic proteins. The other three criteria were the same as criteria 2–4 for SCL12 above. Table 3 shows the numbers of proteins within each compartment, where the SCL16 consists of two parts, SCL16L for training and SCL16T for independent testing. The sequences in the training and test data sets were obtained from the web servers of Euk-OET-PLoc [2] and Hum-PLoc [4].

Table 3
Data set SCL16. The data set SCL16 consists of SCL16L and SCL16T as the learning data set and testing data set, respectively. There are 15 essential GO terms corresponding to eukaryotic subcellular compartments. Note that GO:0005814 does not appear in ...

GO annotation

This study used the Gene Ontology Annotation (GOA) database [35], which includes GO annotations for non-redundant proteins from many species in the UniProtKB/Swiss-Prot database [33]. The GOA database was downloaded directly from [36] (UniProt 45.0 released in Jan. 2007). The accession numbers of proteins are required for querying the GOA database to obtain GO terms. For each protein, BLAST [37,38] was used to obtain a homolog with a known accession number, which was then used to retrieve the GO terms. The corresponding accession numbers of all protein sequences in SCL12 and SCL16 were obtained by using BLAST with h = 1 and e = 10^-9.
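The homolog lookup described above can be sketched as a filter over BLAST hits. This is a minimal sketch under stated assumptions, not the authors' pipeline: it assumes the hits are available in the 12-column tabular layout produced by BLAST+ (`-outfmt 6`), it reads h = 1 as "keep a single best hit", and the function name `best_hit` and the example rows are hypothetical.

```python
from typing import Optional

E_VALUE_CUTOFF = 1e-9  # e = 10^-9 from the text


def best_hit(tabular_output: str, max_evalue: float = E_VALUE_CUTOFF) -> Optional[str]:
    """Return the subject ID of the lowest-e-value hit, or None if no hit passes.

    Assumes the 12-column BLAST+ tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best_id, best_e = None, None
    for line in tabular_output.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue and (best_e is None or evalue < best_e):
            best_id, best_e = subject_id, evalue
    return best_id


# Made-up example rows; the accession numbers are placeholders, not real results.
example = (
    "query1\tP00001\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-50\t400\n"
    "query1\tP00002\t40.0\t180\t90\t5\t1\t180\t3\t182\t1e-5\t60\n"
)
print(best_hit(example))  # P00001
```

The returned accession number would then be used to query the GOA database; a query with no hit under the cutoff yields no GO terms at all, which corresponds to the unannotated proteins discussed next.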

Table 4 shows the GO annotation results of all proteins in the training data sets SCL12L and SCL16L. For SCL12L, the size of the complete set of all GO terms that appeared was n = 1714 from the 919 human proteins. The smallest, largest and mean numbers of GO terms annotated for individual proteins were 0, 35 and 8.3, respectively. The percentage of training proteins whose homologs were not annotated by any GO term (that is, the number of GO terms annotated is zero) was 1.31%. For SCL16L, n = 2870 GO terms were obtained from 2423 eukaryotic proteins. The smallest, largest and mean numbers of GO terms annotated were 0, 50 and 7.7, respectively. The percentage of training proteins whose homologs were not annotated was 3.96%. The proteins annotated by GO are often represented as an n-dimensional binary feature vector, where the attribute value is 1 if the corresponding GO term is annotated, and 0 otherwise.
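The binary representation described above can be illustrated with a small sketch. The helper names `build_vocabulary` and `encode` and the three-protein toy set are assumptions made for illustration; only the 0/1 encoding rule comes from the text.

```python
from typing import Iterable, List, Sequence


def build_vocabulary(annotations: Iterable[Iterable[str]]) -> List[str]:
    """Collect every GO term seen in the training set, in sorted order (size n)."""
    return sorted({term for terms in annotations for term in terms})


def encode(terms: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Map a protein's GO terms to an n-dimensional 0/1 vector."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]


# Toy training set: three proteins, one of which has no GO annotation.
training = [
    ["GO:0005634", "GO:0005737"],
    ["GO:0005737", "GO:0005856"],
    [],
]
vocab = build_vocabulary(training)
vectors = [encode(p, vocab) for p in training]
print(vocab)    # ['GO:0005634', 'GO:0005737', 'GO:0005856']
print(vectors)  # [[1, 1, 0], [0, 1, 1], [0, 0, 0]]
```

A protein whose homolog carries no GO term maps to the all-zero vector, which is why such proteins cannot be separated by GO-based features alone.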

Table 4
Results of GO annotation for all sequences in SCL12L and SCL16

To assess the prediction performance obtained from the essential GO terms alone, we counted the sequences annotated by g essential GO terms. Table 4 shows that 453 out of 919 (49.3%) sequences are annotated by only one essential GO term (g = 1) for SCL12L, where 425 sequences are correctly annotated and 28 sequences are incorrectly annotated. The other 466 sequences annotated by zero (g = 0) or more than one (g > 1) essential GO term cannot be effectively predicted. Table 2 lists the numbers of sequences which are correctly annotated by only one essential GO term for every compartment. The two GO terms GO:0005634 (Nucleus) and GO:0005739 (Mitochondrion), which correctly annotate large numbers of sequences (179 and 111, respectively), made a great contribution to the prediction accuracy of 46.2% (= 425/919).

As for SCL16L, the number of sequences annotated by only one essential GO term is 1247 out of 2423 (51.5%). Table 3 lists the numbers of sequences which are correctly annotated by only one essential GO term for every compartment. Only 48.0% (= 1162/2423) of the sequences with known accession numbers can be correctly predicted by using only the annotation of essential GO terms. According to Table 3, the three essential GO terms GO:0005634 (Nucleus, 395 out of 474), GO:0009507 (Chloroplast, 192 out of 207) and GO:0005739 (Mitochondrion, 173 out of 183) made a great contribution to prediction accuracy.
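The baseline analyzed here, predicting a compartment only when exactly one essential GO term is annotated, can be sketched as follows. The function name `predict_from_essential` is hypothetical, and the dictionary lists only the three essential terms named in the text rather than the full set in Table 1.

```python
from typing import Dict, Iterable, Optional

# Three of the essential GO terms named in the text; the full set is in Table 1.
ESSENTIAL = {
    "GO:0005634": "Nucleus",
    "GO:0005737": "Cytoplasm",
    "GO:0005856": "Cytoskeleton",
}


def predict_from_essential(terms: Iterable[str],
                           essential: Dict[str, str] = ESSENTIAL) -> Optional[str]:
    """Return a compartment only when exactly one essential term is present (g = 1)."""
    hits = [essential[t] for t in set(terms) if t in essential]
    return hits[0] if len(hits) == 1 else None


print(predict_from_essential(["GO:0005634", "GO:0003677"]))  # g = 1 -> Nucleus
print(predict_from_essential(["GO:0005634", "GO:0005737"]))  # g = 2 -> None
print(round(100 * 425 / 919, 1))  # 46.2, the SCL12L accuracy quoted in the text
```

Sequences with g = 0 or g > 1 receive no prediction under this rule, which is what limits the baseline to roughly half of the training set.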

The analytic results reveal that it is not sufficient to use only essential GO terms for accurately predicting protein subcellular localization. However, the essential GO terms play an important role in designing GO-based prediction methods.

Selected informative GO terms

Selecting a set of m informative GO terms out of n candidate GO terms is a combinatorial optimization problem C(n, m), which can be solved by using the intelligent genetic algorithm with an inheritance mechanism (IGA) [31,32]. IGA can efficiently search for the solution S_{r+1} to C(n, r+1) by inheriting a good solution S_r to C(n, r). This study proposes an efficient algorithm based on IGA, called GOmining, to identify a small set of m informative GO terms, including the essential GO terms, as features to SVM. The GOmining algorithm incorporates LIBSVM [39] using a series of binary classifiers. GOmining aims to maximize the training accuracy of prediction using 10-fold cross-validation (10-CV) when identifying the m informative GO terms.

The SVM classifier based on the selected informative GO terms as features is called SVM-IGO. To evaluate a candidate set of r informative GO terms together with the SVM parameters, the prediction accuracy of 10-CV serves as the fitness function of IGA. Figure 1 shows the results of SVM-IGO for r = 40, 41,..., 70. Table 5 lists the m = 44 informative GO terms for SCL12L obtained from the highest accuracy of 89.8% (r = 44), where the SVM parameters (C, γ) = (2^3, 2^-4). Table 6 lists the m = 60 informative GO terms for SCL16L, where the highest accuracy was 86.5%, and (C, γ) = (2^5, 2^-3).

Table 5
The m = 44 informative GO terms by applying GOmining to SCL12L. The GO terms in bold style are essential GO terms.
Table 6
The m = 60 informative GO terms by applying GOmining to SCL16L. The GO terms in bold style are essential GO terms.
Figure 1
Training accuracies of SVM-IGO and SVM-RBS performed by using SVM with a number r of selected informative GO terms.

The orthogonal experimental design with orthogonal array and factor analysis used in IGA is an efficient method for simultaneously examining the individual effects of several factors on the evaluation function [40,41]. The factors are the parameters (GO terms) that manipulate the evaluation function, and a setting of a parameter is regarded as a level of the factor. In this study, the two levels of one factor are the inclusion and exclusion of the ith GO term in the feature selection using IGA. The factor analysis can quantify the effects of individual factors on the evaluation function, rank the most effective factors and determine the best level for each factor to optimize the evaluation function. The most effective factor has the largest main effect difference (MED). Tables 5 and 6 show that the essential GO term GO:0005634 (Nucleus), which has the largest MED value, is the most effective feature for discrimination. Only one essential GO term, GO:0030198 (Extracellular matrix organization and biogenesis), belongs to the biological process branch; the other essential GO terms belong to the cellular component branch. The abbreviations M, B and C represent the three branches molecular function, biological process, and cellular component, respectively.

Evaluation of feature selection

SVM-IGO was implemented by using the m informative GO terms and the SVM classifier using (C, γ) = (2^3, 2^-4) and (C, γ) = (2^5, 2^-3) for SCL12L and SCL16L, respectively. To evaluate the effectiveness of SVM-IGO, four additional classifiers were implemented for comparison. Three classifiers, SVM-GO, k-NN-GO and fuzzy k-NN-GO, based on SVM, k-NN and fuzzy k-NN, respectively, were implemented by using all the n GO terms as features without GO term selection. The classifier SVM-RBS used SVM with a subset of the n GO terms selected by the rank-based selection (RBS) method [17,42]. The best values of parameters C and γ, determined using a step-wise approach, were applied to the SVM-based methods SVM-GO and SVM-RBS, where γ ∈ {2^-7, 2^-6,..., 2^8} and C ∈ {2^-7, 2^-6,..., 2^8}. The best values (C, γ) = (2^3, 2^-4) and (2^4, 2^-6) were applied to SVM-GO for SCL12L and SCL16L, respectively. As for SVM-RBS, (C, γ) = (2^1, 2^-3) and (2^2, 2^-2) were used for SCL12L and SCL16L, respectively. The Methods section describes SVM-RBS, k-NN-GO and fuzzy k-NN-GO in detail. Table 7 lists all prediction accuracies using 10-CV for both data sets SCL12L and SCL16L.
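The parameter search over C and γ can be pictured as a scan of a 16 × 16 grid of powers of two. The sketch below assumes an exhaustive scan, which is only one reading of the "step-wise approach" mentioned above; `grid_search` and the `dummy` score are hypothetical, and in practice the score would be the 10-CV accuracy of an SVM trained with each parameter pair.

```python
from itertools import product

EXPONENTS = range(-7, 9)  # -7, -6, ..., 8

# Every (C, gamma) pair on the 2^-7 ... 2^8 grid described in the text.
grid = [(2.0 ** c, 2.0 ** g) for c, g in product(EXPONENTS, EXPONENTS)]


def grid_search(score, candidates=grid):
    """Return the (C, gamma) pair with the highest score.

    `score` is a caller-supplied function (hypothetical here), e.g. the 10-CV
    accuracy of an SVM trained with those parameters.
    """
    return max(candidates, key=lambda pair: score(*pair))


def dummy(C, gamma):
    """Stand-in score that peaks at (2^3, 2^-4); used only to exercise the search."""
    return -abs(C - 2 ** 3) - abs(gamma - 2 ** -4)


print(len(grid))           # 256 candidate pairs
print(grid_search(dummy))  # (8.0, 0.0625)
```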

Table 7
Comparison of prediction accuracy (%) using 10-CV. Performance comparison uses prediction accuracy (%) of 10-CV.

The highest accuracies of SVM-RBS are 86.5% and 83.5% using 65 and 68 selected GO terms for SCL12L and SCL16L, respectively, as shown in Fig. 1. Table 7 shows that the three SVM-based classifiers (SVM-GO, SVM-RBS and SVM-IGO), with accuracies >80%, were better than the two k-NN based classifiers (k-NN-GO and fuzzy k-NN-GO), with accuracies <75%, for both data sets. SVM-IGO had the highest accuracies, 89.8% and 86.5% for SCL12L and SCL16L, respectively. The GO term selection method based on GOmining was more effective than RBS and the method without selection of GO terms. Furthermore, SVM with the selected GO terms as features performed better than the k-NN classifiers.

Performance comparison

The proposed ProLoc-GO method predicts the subcellular localization of an input sequence using either SVM-IGO or SVM-GO, depending on its annotation on the informative GO terms (see Methods for detail). Tables 8, 9, 10 and 11 list the results of ProLoc-GO using SCL12 and SCL16. Some existing AAC-based prediction methods, such as ProtLock [8], Least Euclidean distance [9], Ploc [10] and HSLPred [11], use only the query sequence as input data for their classifiers. Hum-PLoc [4] and Euk-OET-PLoc [2] use both the sequence and its accession number as input data. For comparison with these predictors, ProLoc-GO was evaluated using the two kinds of input data separately. The first test used only the sequence and used BLAST to obtain annotated GO terms. The second test used the known accession number of proteins directly. For the accuracy on both SCL12L and SCL16L, ProLoc-GO used leave-one-out cross-validation (LOOCV) for comparison with the other methods (see Methods section).

Table 8
Comparison of prediction accuracy (%) for SCL12. Prediction accuracies (%) for using leave-one-out cross-validation (LOOCV) on SCL12L and independent test on SCL12T are obtained from the paper [4]. The input data is sequence only (S) or sequence with ...
Table 9
Comparison of prediction accuracy (%) for SCL16. Prediction accuracies (%) of using LOOCV on SCL16L and independent test on SCL16T are obtained from the paper [2]. The input data is sequence only (S) or sequence with accession number (AN).
Table 10
Accuracies and MCC performed on SCL12
Table 11
Accuracies and MCC performed on SCL16

The test accuracies for ProLoc-GO performed on the human protein data sets SCL12L and SCL12T were 90.0% and 88.1%, respectively, where m = 44, and (C, γ) = (2^3, 2^-4) for SVM-IGO and SVM-GO. These results were much better than those (<35%) of the four sequence-based prediction methods [8-11] using only input sequences, as shown in Table 8. Ploc [10] had the highest test accuracy of 34.3% among the four AAC-based methods.

The Matthews correlation coefficient (MCC) [5,12,18] is commonly used to evaluate performance on unbalanced data sets. In addition to the overall accuracy, the MCC values were recorded because the numbers of proteins localized in the compartments are unbalanced, such as 196 for Nucleus vs. 7 for Microsome (Table 2). The MCC is defined as follows [5]:

$$\mathrm{MCC}_c = \frac{p_c s_c - u_c o_c}{\sqrt{(p_c + u_c)(p_c + o_c)(s_c + u_c)(s_c + o_c)}}, \quad c = 1, 2, \ldots, N_c \tag{1}$$

where p_c is the number of correctly predicted proteins of the location c, s_c is the number of correctly predicted proteins not in the location c, u_c is the number of under-predicted proteins, o_c is the number of over-predicted proteins, and N_c is the number of locations. The test MCC performances of ProLoc-GO were 0.822 and 0.661 for SCL12L and SCL12T, respectively. Table 10 presents the detailed results for individual compartments. The results of the five sequence-based methods reveal that the set of informative GO terms is more useful for protein subcellular localization than the AAC-based features.
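As a small worked illustration, the per-location MCC can be computed directly from the four counts defined above, using the standard form of the coefficient. The function name `mcc` and the zero-denominator convention are assumptions made here, not details taken from the paper.

```python
from math import sqrt


def mcc(p: int, s: int, u: int, o: int) -> float:
    """Matthews correlation coefficient for one location.

    p: correctly predicted proteins of the location (true positives)
    s: correctly predicted proteins not in the location (true negatives)
    u: under-predicted proteins (false negatives)
    o: over-predicted proteins (false positives)
    """
    denominator = sqrt((p + u) * (p + o) * (s + u) * (s + o))
    if denominator == 0:
        return 0.0  # convention assumed here for the degenerate case
    return (p * s - u * o) / denominator


print(mcc(p=50, s=50, u=0, o=0))    # 1.0, perfect prediction
print(mcc(p=25, s=25, u=25, o=25))  # 0.0, no better than chance
```

Because every count enters the formula, a small compartment such as Microsome influences its own MCC far more than it influences the overall accuracy.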

Performance of using known accession numbers

The accession number of each protein sequence in SCL12 and SCL16 was available for querying the GOA database. For comparison with the methods [2,4] based on the proteins with known accession numbers, ProLoc-GO using the known accession numbers of proteins as input data obtained test accuracies of 91.1% and 90.6% (MCC = 0.724) on SCL12L and SCL12T, respectively, where m = 56, (C, γ) = (2^2, 2^-1) for SVM-IGO and (C, γ) = (2^2, 2^-4) for SVM-GO. Hum-PLoc [4] using hybridization of GO terms and Pse-AA composition obtained training and test accuracies of 81.1% and 85.0% for SCL12L and SCL12T, respectively. The performance of ProLoc-GO using sequences or accession numbers as the input data was better than that of Hum-PLoc [4] using the ensemble classifiers with features of both sequence and accession number.

Tables 9 and 11 show the performance results of ProLoc-GO and Euk-OET-PLoc [2] using SCL16. ProLoc-GO using input sequences yielded test accuracies of 86.6% (MCC = 0.799) and 83.3% (MCC = 0.706) for SCL16L and SCL16T, respectively, where m = 60, (C, γ) = (2^5, 2^-3) for SVM-IGO, and (C, γ) = (2^4, 2^-6) for SVM-GO. ProLoc-GO is significantly better than all the AAC-based methods, whose test accuracies are below 35%. ProLoc-GO yields test accuracies of 89.0% and 85.7% (MCC = 0.710) for SCL16L and SCL16T, respectively, using the known accession numbers of proteins, where m = 60, (C, γ) = (2^2, 2^-3) for SVM-IGO and (C, γ) = (2^3, 2^-5) for SVM-GO. Euk-OET-PLoc [2] using the ensemble classifiers with features of both sequence and accession number obtains training and test accuracies of 81.6% and 83.7%, respectively. ProLoc-GO performed better than Euk-OET-PLoc on SCL16 using either sequences or accession numbers as the input data [2].

Analysis of informative GO terms

The GOmining method identifies a feature set of m effective GO terms, called informative GO terms, to design an accurate SVM-based prediction method. Table 12 shows the distribution of the m informative GO terms in the GO graph. For SCL12L with m = 44, GOmining selected 12 essential GO terms and 32 instructive GO terms. The 32 instructive GO terms consist of 7 GO terms from the molecular function branch, 14 terms from the biological process branch, and 11 terms from the cellular component branch, denoted as 7(M), 14(B) and 11(C), respectively. Analytical results reveal that all three branches contain instructive GO terms.

Table 12
Distribution of the m informative GO terms. Most instructive GO terms (about 80%) are not offspring of the essential GO terms; the ratios are 26/32 and 36/45 for SCL12L and SCL16L, respectively.

Due to the high correlation among GO terms in the GO graph, the feature selection of SVM should consider simultaneously a set of informative GO terms, rather than individual GO terms. Since the essential GO terms are always included, GOmining benefits from a confined search space of candidate instructive GO terms. Considering the position relationships between instructive and essential GO terms in the GO graph, instructive GO terms belonged to one of the three classes: (a) offspring but not ancestor of some essential GO term; (b) between two essential GO terms, and (c) not offspring of any essential GO term. Of the 32 instructive GO terms, 4, 2 and 26 GO terms belonged to the classes (a), (b) and (c), respectively. The 26 GO terms consist of 7(M), 14(B) and 5(C). The GO terms near the root of the GO graphs are considered to be more generic while terms near the leaves are more specific [23]. Of the instructive GO terms, 81.2% (26/32) were not offspring of any essential GO term. These analytical results reveal that the essential GO terms are informative enough in predicting subcellular localization, and are effective in confining the space of searching instructive GO terms. The other six instructive GO terms from the cellular component branch have more specific functions than the essential GO terms in discrimination of the subcellular localization.
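The three position classes can be made concrete with a small traversal over child-to-parent edges. The sketch below is illustrative only: the edges form a toy graph that follows the orderings described in the text rather than a copy of the GO graph, GO:0005783 (endoplasmic reticulum) is assumed to be the essential parent of GO:0031227, and the helper names `ancestors` and `position_class` are hypothetical.

```python
from typing import Dict, List, Set

# Toy child -> parents edges. The chain follows the ordering described in the
# text (Cytoskeleton > Microtubule organizing center > Centrosome > Centriole);
# the exact edges are an illustrative assumption, not a copy of the GO graph.
PARENTS: Dict[str, List[str]] = {
    "GO:0005815": ["GO:0005856"],
    "GO:0005813": ["GO:0005815"],
    "GO:0005814": ["GO:0005813"],
    "GO:0031227": ["GO:0005783"],  # assumed essential parent (endoplasmic reticulum)
    "GO:0005615": ["GO:0005576"],
}
ESSENTIAL = {"GO:0005856", "GO:0005814", "GO:0005783"}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All terms reachable by following parent edges upward."""
    found: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents.get(node, []))
    return found


def position_class(term: str, essential=ESSENTIAL, parents=PARENTS) -> str:
    """Classify an instructive term as 'a', 'b' or 'c' relative to essential terms."""
    below = bool(ancestors(term, parents) & essential)                 # offspring of an essential term
    above = any(term in ancestors(e, parents) for e in essential)      # ancestor of an essential term
    if below and above:
        return "b"  # between two essential terms
    if below:
        return "a"  # offspring but not ancestor
    return "c"      # not offspring of any essential term


for go_id in ["GO:0031227", "GO:0005815", "GO:0005813", "GO:0005615"]:
    print(go_id, position_class(go_id))
```

On this toy graph the two centrosome-related terms fall into class (b) and Extracellular space into class (c), matching the assignments reported in the text.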

Figures 2, 3 and 4 illustrate some of the instructive GO terms belonging to the three classes. According to Fig. 2, three instructive GO terms for SCL12L were found to belong to class (a), namely GO:0031227 (Intrinsic to endoplasmic reticulum membrane, rank 11), GO:0030662 (Coated vesicle membrane, rank 41) and GO:0017119 (Golgi transport complex, rank 21). The two terms belonging to class (b), namely GO:0005815 (Microtubule organizing center, rank 25) and GO:0005813 (Centrosome, rank 36), were found between the essential GO terms GO:0005856 (Cytoskeleton) and GO:0005814 (Centriole), as shown in Fig. 3. According to Fig. 4, five instructive GO terms belonging to class (c) were not offspring of essential GO terms: GO:0016021 (Integral to membrane, rank 3), GO:0005576 (Extracellular region, rank 4), GO:0005622 (intracellular, rank 18), GO:0005578 (Proteinaceous extracellular matrix, rank 37) and GO:0005615 (Extracellular space, rank 38).

Figure 2
Some of the selected GO terms which are offspring of essential GO terms. For SCL12L, there are three terms shown: GO:0031227, GO:0030662 and GO:0017119. For SCL16L, five GO terms are shown: GO:0009514, GO:0005681, GO:0005789, GO:0005759 and GO:0005829. ...
Figure 3
Some of the selected GO terms which are between two essential GO terms. For SCL12L, the two instructive GO terms GO:0005815 and GO:0005813 are between the essential GO terms GO:0005856 and GO:0005814. For SCL16L, GO:0005813 and GO:0000922 are offspring ...
Figure 4
Some of the selected GO terms are NOT offspring of any essential GO terms. For SCL12L, five instructive GO terms are shown belonging to cellular component branch: GO:0016021, GO:0005576, GO:0005622, GO:0005578 and GO:0005615, which are not offspring of ...

The m = 60 informative GO terms for SCL16L comprise 15 essential GO terms and 45 instructive GO terms. The 45 instructive GO terms consisted of 18(M), 13(B) and 14(C). The numbers of instructive GO terms coming from each branch were not significantly different. However, the numbers of instructive GO terms belonging to the three classes (a), (b) and (c) are 9, 0 and 36, respectively, which are very different. Of the instructive GO terms, 80% (36/45) were not offspring of any essential GO term. The 9 instructive GO terms belonging to the class (a) had 5, 2 and 2 terms, respectively, as shown in Figs. 2, 3 and 4. Class (c) has five GO terms with a dot-pattern box: GO:0005622 (intracellular), GO:0005615 (Extracellular space), GO:0020015 (Glycosome), GO:0016020 (Membrane) and GO:0045261 (Proton-transporting ATP synthase complex, catalytic core F(1)), as shown in Fig. 4.

The statistical results of instructive GO terms distributed in the three classes for both SCL12L and SCL16L reveal that the inclusion of essential GO terms can be regarded as using domain knowledge for GOmining to mine a feature set of informative GO terms. The heuristic approach (using domain knowledge) of GOmining is efficient when the GO database grows fast. Therefore, GOmining can be easily applied to other applications of sequence-based predictions using SVM with the features of informative GO terms.

Discussion

The GO database has grown in size recently, increasing the effectiveness of GO-based features. Meanwhile, the percentage of proteins with subcellular locations annotated in the GO database increased rapidly from 44.9% [2] to 65.5% [3]. It has been indicated that there is a linkage in the GO annotation process between molecular function annotation and subcellular localization annotation [43]. Therefore, the GO-based prediction method for protein subcellular localization is increasingly efficient. Because the accession number of proteins is necessary for retrieving GO terms from GO databases, existing efficient GO-based systems Euk-OET-PLoc [2] and Hum-PLoc [4] directly utilize the accession numbers of proteins and a large number n of GO terms annotated in a complete set, where n = 9918 for SCL12L [4] and n = 9567 for SCL16L [2].

To predict subcellular localizations for novel proteins, ProLoc-GO uses BLAST to find a good homolog and retrieves the annotated GO terms of that homolog, rather than those of the query protein itself. To use GO term features effectively, ProLoc-GO uses only a homolog with annotated GO terms to reduce n. Thus, n = 1714 for SCL12L and n = 2870 for SCL16L. Furthermore, a small set of m informative GO terms is selected simultaneously by GOmining. GOmining can consider internal relevant-feature correlation, instead of individual features, by using an efficient global optimization method. The distribution analysis of informative GO terms in the GO graph is consistent with the properties of GO annotation. Additionally, ProLoc-GO using input sequences is slightly worse than using the accession numbers of proteins, with accuracies of 88.1% vs. 90.6% for SCL12T, and 83.3% vs. 85.7% for SCL16T, as shown in Tables 10 and 11.

Conclusion

Computational prediction methods from primary protein sequences are fairly economical in terms of identifying large-scale eukaryotic proteins with unknown functions. The GO annotation, which describes the function of genes and gene products across species, has been used to improve the prediction of protein subcellular localization. The accession numbers of proteins are necessary to query the GOA database to obtain GO terms. Since novel proteins have no known accession numbers, BLAST was used to obtain homologies with known accession numbers to the proteins for the retrieval of GO terms.

GO annotation has grown in size and popularity. However, few studies have explored informative GO terms from the over 20,000 annotations available at present for sequence-based prediction problems. This study proposes a genetic-algorithm-based method, GOmining, which is combined with an SVM classifier to simultaneously identify a small number m out of the n GO terms as input features to the SVM, where m << n. The m GO terms include the essential GO terms annotating subcellular compartments, such as GO:0005634 (Nucleus), GO:0005737 (Cytoplasm) and GO:0005856 (Cytoskeleton). ProLoc-GO was evaluated using SVM with the GO-based features derived from two kinds of input data: sequences and known accession numbers of proteins.

ProLoc-GO yields test accuracies of 88.1% and 83.3% for SCL12 and SCL16, respectively, when using only input sequences. These results are significantly superior to those of the other SVM-based methods, which have accuracies <35% using AAC with amino acid pairs and AAC with dipeptide composition. ProLoc-GO using known accession numbers of proteins has accuracies of 90.6% and 85.7% for SCL12 and SCL16, respectively, which are also better than those of Hum-PLoc (85.0%) and Euk-OET-PLoc (83.7%).

Analysis of the m informative GO terms in the GO graph indicates that GOmining, by using an efficient global optimization method, accounts for correlation among relevant features rather than evaluating features individually. GOmining can serve as an efficient tool for mining informative GO terms for various sequence-based predictions of proteins, especially as the GO database grows rapidly. The prediction system using ProLoc-GO with protein sequences as input data for protein subcellular localization has been implemented (see Availability).

Methods

Proposed GOmining algorithm

An efficient genetic-algorithm-based method, called GOmining, is proposed for selecting informative GO terms. GOmining uses an intelligent genetic algorithm with an inheritable mechanism (IGA) [31,32], combined with an SVM classifier, to simultaneously identify a small number m out of a large number n of GO terms as input features, where m << n. Selecting the m informative GO terms from n candidate GO terms is a combinatorial optimization problem, denoted C(n, m), with a huge search space of size C(n, m) = n!/(m!(n-m)!). An IGA based on orthogonal experimental design, which uses a divide-and-conquer strategy and a systematic reasoning method, can efficiently solve this large combinatorial optimization problem.
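To give a feel for the size of this search space, the following minimal sketch evaluates C(n, m) for the values of n reported in this study and the values of m reported in the abstract (m = 44 for SCL12 and m = 60 for SCL16). The function name `search_space_size` is a hypothetical helper introduced only for this illustration.

```python
import math


def search_space_size(n: int, m: int) -> int:
    """Number of ways to choose m GO terms out of n candidates, C(n, m)."""
    return math.comb(n, m)


# n and m for the two data sets, as reported in the text.
for name, n, m in [("SCL12L", 1714, 44), ("SCL16L", 2870, 60)]:
    size = search_space_size(n, m)
    # Report only the order of magnitude; the exact integer is very long.
    print(f"{name}: C({n}, {m}) has {len(str(size))} decimal digits")
```

Exhaustive enumeration is clearly out of reach for subsets of this size, which is the motivation for a heuristic global search such as IGA.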

Leave-one-out cross-validation (LOOCV) is considered to be the most rigorous and objective test. Although bias-free, this test is computationally very demanding and is often impractical for large data sets. N-fold cross-validation provides a bias-free estimate of the accuracy at a much reduced computational cost and is also considered an acceptable test for evaluating the prediction performance of an algorithm [44]. Therefore, to keep the computation cost manageable, GOmining uses the prediction accuracy of 10-fold cross-validation (10-CV) on the entire training set of proteins as the fitness function of IGA.
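The role of 10-CV accuracy as a fitness value can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the folds are formed by a simple shuffled, non-stratified split, the classifier is passed in as a generic `train` callable (the study uses an SVM from LIBSVM), and `majority_trainer` is a toy stand-in used only to make the sketch runnable.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

Vector = Sequence[int]
Predictor = Callable[[Vector], int]
# A trainer takes training vectors and labels and returns a predict function.
Trainer = Callable[[List[Vector], List[int]], Predictor]


def kfold_indices(num_samples: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_accuracy(X: List[Vector], y: List[int], train: Trainer,
                k: int = 10, seed: int = 0) -> float:
    """Fraction of samples predicted correctly when each fold is held out once."""
    correct = 0
    for held_out in kfold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        predict = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(1 for i in held_out if predict(X[i]) == y[i])
    return correct / len(X)


def majority_trainer(train_X: List[Vector], train_y: List[int]) -> Predictor:
    """Toy classifier (assumption, for illustration): always predict the majority class."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda vector: most_common
```

In a feature-selection setting, `X` would hold only the columns selected by the candidate solution, so that each candidate subset of GO terms receives its own fitness value.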

The input of the GOmining algorithm is composed of 1) a training set of protein sequences categorized into a number of compartments (classes), and 2) the essential GO terms corresponding to the compartments. The output comprises a set of m informative GO terms and the associated parameter settings of an SVM classifier. Since novel sequences without known accession numbers rely on BLAST to obtain annotated GO terms, all training sequences obtain their GO terms through the same BLAST procedure for consistency.

Step 1: (preparation of SVM) The multi-classification problem is solved by using a series of binary classifiers of LIBSVM [39]. In this study, the kernel parameter γ and cost parameter C are tuned, where γ ∈ {2⁻⁷, 2⁻⁶, ..., 2⁸} and C ∈ {2⁻⁷, 2⁻⁶, ..., 2⁸}.

Step 2: (sequence representation) Obtain annotated GO terms from the GOA database for all training proteins using BLAST with h = 1 and e = 10⁻⁹. Let n be the total number of GO terms that appear among all proteins in the training data set. For example, n = 1714 and n = 2870 were derived for SCL12L and SCL16L, respectively. Each protein is represented as an n-dimensional binary feature vector.

Step 3: (inclusion of essential GO terms) Identify d essential GO terms out of n GO terms and number them from 1 to d. For example, d = 12 and d = 15 were found from SCL12L and SCL16L, respectively.

Step 4: (chromosome encoding) The IGA-chromosome comprises n binary IGA-genes f_i for selecting informative GO terms and two 4-bit IGA-genes for encoding γ and C, where f_i = 1 is fixed for the essential GO terms i = 1, ..., d. The ith GO term is included in the feature set of the SVM classifier if f_i = 1; otherwise, the ith GO term is excluded (f_i = 0). Figure 5 shows the sequence representation and IGA-chromosome encoding method.

Figure 5
Sequence representation and IGA-chromosome encoding method.

Step 5: (initial solution) Perform IGA to select r_start out of n GO terms, i.e., to find the solution to C(n, r_start), where the d essential GO terms are always selected. Table 13 shows the parameter settings of IGA, such as the crossover probability p_c = 0.8. The procedure of IGA is described in detail in [18].

Table 13
Control parameters of IGA used in this study

Step 6: (inheritance mechanism) The inheritance mechanism of IGA can efficiently search for the solution to C(n, r+1) by inheriting a good solution S_r to C(n, r). Obtain the solutions S_r for r = r_start+1, ..., r_end one by one using IGA [31,32]. For example, r_start = 40 and r_end = 70 were chosen on the basis of previous experience.

Step 7: (decoding chromosome) Let S_m be the most accurate solution, with m selected GO terms, among all solutions S_r. Obtain the m informative GO terms and the parameter values of γ and C.

Step 8: (robust performance) Perform Steps 5–7 for N independent runs to obtain the best of the N solutions S_m and the associated settings of the SVM parameters. The best solution is chosen by considering both high prediction accuracy and a high mean frequency with which its m selected GO terms appear across the N runs. In this study, N = 30.
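The sequence representation of Step 2 and the chromosome layout of Steps 4 and 7 can be illustrated with the minimal sketch below. Several details are assumptions made for this illustration and are not specified in the text: the order of the genes (n selection bits, then the γ gene, then the C gene), the most-significant-bit-first reading of each 4-bit gene, and the mapping of the gene value 0..15 onto the exponents -7..8. All function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

# 16 candidate exponents, -7..8, so that one 4-bit gene indexes the grid of Step 1.
EXPONENTS = list(range(-7, 9))


def to_feature_vector(protein_terms: Set[str], vocabulary: Sequence[str]) -> List[int]:
    """Step 2: n-dimensional binary vector, 1 where the GO term annotates the protein."""
    return [1 if term in protein_terms else 0 for term in vocabulary]


def bits_to_int(bits: Sequence[int]) -> int:
    """Read a bit sequence as an unsigned integer, most significant bit first (assumed)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def decode_chromosome(chromosome: Sequence[int], n: int, d: int) -> Tuple[List[int], float, float]:
    """Steps 4 and 7: split a chromosome into selected term indices, gamma and C.

    Assumed layout: n selection bits, then 4 bits for gamma, then 4 bits for C.
    """
    if len(chromosome) != n + 8:
        raise ValueError("expected n selection bits plus two 4-bit genes")
    # The first d (essential) GO terms are always selected, whatever their bits say.
    selected = [i for i in range(n) if i < d or chromosome[i] == 1]
    gamma = 2.0 ** EXPONENTS[bits_to_int(chromosome[n:n + 4])]
    cost = 2.0 ** EXPONENTS[bits_to_int(chromosome[n + 4:n + 8])]
    return selected, gamma, cost


def project(vector: Sequence[int], selected: Sequence[int]) -> List[int]:
    """Keep only the selected GO-term columns of an n-dimensional vector."""
    return [vector[i] for i in selected]


# Toy vocabulary of n = 5 GO terms; the first d = 2 are treated as essential.
vocabulary = ["GO:0005634", "GO:0005737", "GO:0005856", "GO:0003677", "GO:0006412"]
chromosome = [0, 1, 0, 1, 0] + [0, 0, 0, 0] + [1, 1, 1, 1]
selected, gamma, cost = decode_chromosome(chromosome, n=5, d=2)
vector = to_feature_vector({"GO:0005634", "GO:0003677"}, vocabulary)
reduced = project(vector, selected)
```

In this toy case the essential terms at positions 0 and 1 are kept even though the first selection bit is 0, which mirrors the constraint f_i = 1 for i = 1, ..., d.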

ProLoc-GO

As shown in Fig. 6, each query protein is first searched with BLAST (h = 1 and e = 10⁻⁹) against the Swiss-Prot database to obtain a homolog with a known accession number. If no such homolog exists, the threshold value e of BLAST is relaxed until a homolog is obtained, where h = 1 and e ∈ {10⁻⁹, 10⁻⁸, ..., 10⁻¹}. The accession number of the homolog of each protein sequence in SCL12 and SCL16 was obtained by using BLAST with h = 1 and e = 10⁻⁹. This accession number is used as input to the GOA database for retrieving the corresponding k (>1) GO terms: GO:1, GO:2, ..., GO:k. If none of the k GO terms belongs to the set of m informative GO terms, then the sequence is represented as an n-dimensional binary vector and is predicted by the SVM-GO classifier. Otherwise, the sequence is represented as an m-dimensional binary vector and is predicted by the SVM-IGO classifier. Notably, the SVM-GO classifier predicts only a very small percentage of input sequences. ProLoc-GO thus combines the two classifiers SVM-GO and SVM-IGO for subcellular localization prediction.

Figure 6
Prediction flowchart of ProLoc-GO using both classifiers SVM-IGO and SVM-GO.
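The prediction flow described above can be sketched as follows. This is a schematic under stated assumptions: the BLAST search, the GOA lookup and the two trained classifiers are passed in as plain callables (`search`, `lookup_terms`, `svm_igo`, `svm_go`), all of which are hypothetical placeholders rather than real APIs, and returning `None` when no homolog is found even at e = 10⁻¹ is an assumption, since the text does not describe that case.

```python
from typing import Callable, List, Optional, Sequence, Set

# E-value thresholds tried in order: 1e-9, 1e-8, ..., 1e-1.
E_VALUES = [10.0 ** -k for k in range(9, 0, -1)]


def find_homolog(sequence: str,
                 search: Callable[[str, float], Optional[str]]) -> Optional[str]:
    """Relax the e-value threshold step by step until a homolog accession is found."""
    for e_value in E_VALUES:
        accession = search(sequence, e_value)
        if accession is not None:
            return accession
    return None


def predict_location(sequence: str,
                     search: Callable[[str, float], Optional[str]],
                     lookup_terms: Callable[[str], Set[str]],
                     informative: Sequence[str],
                     vocabulary: Sequence[str],
                     svm_igo: Callable[[List[int]], str],
                     svm_go: Callable[[List[int]], str]) -> Optional[str]:
    """Route a query to SVM-IGO (m features) or SVM-GO (n features)."""
    accession = find_homolog(sequence, search)
    if accession is None:
        return None  # Assumption: no prediction when no homolog is found.
    terms = lookup_terms(accession)
    if terms & set(informative):
        # At least one informative GO term: use the m-dimensional vector.
        return svm_igo([1 if t in terms else 0 for t in informative])
    # No informative GO term: fall back to the full n-dimensional vector.
    return svm_go([1 if t in terms else 0 for t in vocabulary])
```

Passing the search and lookup steps in as callables keeps the routing logic testable without a BLAST installation or a copy of the GOA database.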

Fuzzy k-NN

Each protein is represented as an n-dimensional binary vector, and the generalized distance between two proteins P and P_i [2] is defined as:

$$D(P, P_i) = 1 - \frac{P \cdot P_i}{\lVert P \rVert \, \lVert P_i \rVert}, \tag{2}$$

where P·P_i is the dot product of vectors P and P_i, and ||P|| and ||P_i|| are their moduli.
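As a small worked illustration of this distance, the sketch below computes it for binary GO-term vectors. The function name `go_distance` is hypothetical, and the handling of an all-zero vector (returning the maximum distance 1) is an assumption, since the formula is undefined when a modulus is zero.

```python
import math
from typing import Sequence


def go_distance(p: Sequence[int], q: Sequence[int]) -> float:
    """Generalized distance: 1 minus the cosine of the angle between p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        # Assumption: a protein without any GO term is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


same = go_distance([1, 0, 1], [1, 0, 1])        # identical annotation
disjoint = go_distance([1, 0], [0, 1])          # no GO term in common
partial = go_distance([1, 1, 0, 0], [1, 0, 0, 0])  # one shared term out of two
```

For binary vectors the distance lies between 0 (identical sets of GO terms) and 1 (no GO term in common).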

This study determined the best value of k by using a step-wise approach where k [set membership] {1, 2,..., 10}.

The fuzzy k-NN classifier [13,29,30] is a variation of k-NN that assigns fuzzy membership values r_c(P) of a query sequence P to each class c as follows:

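$$r_c(P) = \frac{\sum_{j=1}^{k} r_c(P_j)\, \lVert P - P_j \rVert^{-2/(w-1)}}{\sum_{j=1}^{k} \lVert P - P_j \rVert^{-2/(w-1)}}, \quad c = 1, 2, \ldots, N_C, \tag{3}$$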

where the distance is calculated according to Eq. (2). In this study, the best values of the parameters (k, w) are tuned iteratively from k ∈ {1, 2, ..., 10} and w ∈ {1.05, 1.10, ..., 1.95} for the fuzzy k-NN classifier.
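The membership computation can be sketched as below, following the standard fuzzy k-NN weighting with exponent -2/(w-1). This is an illustrative sketch, not the authors' code: the function name is hypothetical, the distances to the training proteins are assumed to be precomputed, and two implementation choices are assumptions made here, namely averaging the memberships of exact matches when a distance is zero and dividing every distance by the smallest one (which leaves the ratio unchanged) to avoid numerical overflow for w close to 1.

```python
from typing import List, Sequence, Tuple


def fuzzy_knn_memberships(neighbors: Sequence[Tuple[float, Sequence[float]]],
                          k: int, w: float) -> List[float]:
    """Class memberships of a query from (distance, membership vector) pairs.

    Each pair describes one training protein: its distance to the query and
    its own class-membership vector of length N_C.
    """
    if w <= 1.0:
        raise ValueError("w must be greater than 1")
    nearest = sorted(neighbors, key=lambda item: item[0])[:k]
    n_classes = len(nearest[0][1])
    # Zero distance would give an infinite weight; average exact matches instead.
    exact = [m for dist, m in nearest if dist == 0.0]
    if exact:
        return [sum(m[c] for m in exact) / len(exact) for c in range(n_classes)]
    exponent = -2.0 / (w - 1.0)
    d_min = nearest[0][0]
    # Scaling by d_min cancels in the ratio and keeps every weight in (0, 1].
    weights = [(dist / d_min) ** exponent for dist, _ in nearest]
    total = sum(weights)
    return [sum(wt * m[c] for wt, (_, m) in zip(weights, nearest)) / total
            for c in range(n_classes)]
```

The usual decision rule for fuzzy k-NN assigns the query to the class with the largest membership value; as w approaches 1 the nearest neighbor dominates, while larger w spreads the vote more evenly over the k neighbors.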

SVM-RBS

To evaluate the proposed IGA-based feature selection method GOmining, this study implements a classifier, SVM-RBS, which uses SVM with a subset of the n GO terms chosen by the rank-based selection (RBS) method [17,42]. Previous work on ProLoc [18] showed that the univariate RBS method is inferior to multivariate feature selection by IGA for selecting physicochemical properties. First, each of the n GO terms (for example, n = 1714 for SCL12L) is ranked according to the accuracy of an SVM that uses that single feature, where the best values of the parameters (C, γ) are determined using a step-wise approach with γ ∈ {2⁻⁷, 2⁻⁶, ..., 2⁸} and C ∈ {2⁻⁷, 2⁻⁶, ..., 2⁸}. The 70 top-ranking features a_i, i = 1, ..., 70, are then picked, and the 40 top-ranking features (r = 40) are used as the initial feature set {b_1, ..., b_40}. The feature set of size r+1 is then built incrementally by adding to the current feature set the best feature b_{r+1} (the one giving the highest SVM accuracy under 10-CV) from the remaining 70−r features.
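The RBS procedure described above can be sketched as a ranking step followed by greedy forward selection. This is a minimal sketch under stated assumptions: the `score` callable stands in for the 10-CV accuracy of a tuned SVM on a given feature subset, the function name is hypothetical, and the function returns every intermediate feature set with its score, since the text does not state how the final subset size is chosen.

```python
from typing import Callable, List, Tuple

# Maps a list of feature indices to a score (e.g. 10-CV accuracy of an SVM).
Scorer = Callable[[List[int]], float]


def rank_based_selection(n: int, score: Scorer,
                         pool_size: int = 70,
                         start_size: int = 40) -> List[Tuple[List[int], float]]:
    """Rank features one at a time, then grow the feature set greedily."""
    # Univariate ranking: score each feature on its own, best first.
    ranked = sorted(range(n), key=lambda i: score([i]), reverse=True)
    pool = ranked[:pool_size]
    current = pool[:start_size]
    remaining = pool[start_size:]
    history = [(list(current), score(current))]
    while remaining:
        # Add the pooled feature that gives the best score with the current set.
        best = max(remaining, key=lambda i: score(current + [i]))
        current = current + [best]
        remaining.remove(best)
        history.append((list(current), score(current)))
    return history
```

Because features are ranked individually before any combination is scored, this procedure cannot recover a feature that is weak alone but useful jointly with others, which is the limitation that multivariate selection is meant to address.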

Authors' contributions

WLH designed the system, implemented programs, participated in manuscript preparation and carried out the detailed study. CWT designed the system and implemented programs. SWH, SFH and SYH conceived the idea of this work. Additionally, SYH supervised the whole project and participated in manuscript preparation. All authors have read and approved the final manuscript.

Availability

The prediction system using ProLoc-GO with input sequences of query proteins for protein subcellular localization has been implemented at http://iclab.life.nctu.edu.tw/prolocgo.

Acknowledgements

The authors would like to thank the National Science Council of Taiwan for financially supporting this research under the contract numbers NSC 96-2628-E-009-141-MY3 and NSC 96-2627-B-009-002.

References

  • Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. [PMC free article] [PubMed]
  • Chou KC, Shen HB. Predicting Eukaryotic Protein Subcellular Location by Fusing Optimized Evidence-Theoretic K-Nearest Neighbor Classifiers. J Proteome Res. 2006;5:1888–1897. [PubMed]
  • Chou KC, Shen HB. Euk-mPLoc: A Fusion Classifier for Large-Scale Eukaryotic Protein Subcellular Location Prediction by Incorporating Multiple Sites. J Proteome Res. 2007;6:1728–34. Epub 2007 Mar 31. [PubMed]
  • Chou KC, Shen HB. Hum-PLoc: A novel ensemble classifier for predicting human protein subcellular localization. Biochem Biophys Res Commun. 2006;347:150–157. [PubMed]
  • Lei Z, Dai Y. Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction. BMC Bioinformatics. 2006;7:491. [PMC free article] [PubMed]
  • Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting Subcellular Localization of Proteins Based on their N-terminal Amino Acid Sequence. J Mol Biol. 2000;300:1005–1016. [PubMed]
  • Nakai K, Horton P. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem Sci. 1999;24:34–35. [PubMed]
  • Cedano J, Aloy P, Pérez-Pons JA, Querol E. Relation between amino acid composition and cellular location of proteins. J Mol Biol. 1997;266:594–600. [PubMed]
  • Nakashima H, Nishikawa K. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J Mol Biol. 1994;238:54–61. [PubMed]
  • Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acid and amino acid pairs. Bioinformatics. 2003;19:1656–1663. [PubMed]
  • Garg A, Bhasin M, Raghava GP. Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J Biol Chem. 2005;280:14427–14432. [PubMed]
  • Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001;17:721–728. [PubMed]
  • Huang Y, Li Y. Prediction of protein subcellular locations using fuzzy k-NN method. Bioinformatics. 2004;20:21–28. [PubMed]
  • Yu CS, Lin CJ, Hwang JK. Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci. 2004;13:1402–1406. [PMC free article] [PubMed]
  • Bhasin M, Garg A, Raghava GPS. PSLpred: prediction of subcellular localization of bacterial proteins. Bioinformatics. 2005;21:2522–2524. [PubMed]
  • Bhasin M, Raghava GPS. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res. 2004;32:W414–419. [PMC free article] [PubMed]
  • Sarda D, Chua G, Li KB, Krishnan A. pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties. BMC Bioinformatics. 2005;6:152–163. [PMC free article] [PubMed]
  • Huang WL, Tung CW, Huang HL, Hwang SF, Ho SY. ProLoc: Prediction of protein subnuclear localization using SVM with automatic selection from physicochemical composition features. BioSystems. 2007;90:573–581. [PubMed]
  • Nanni L, Lumini A. An ensemble of K-local hyperplanes for predicting protein-protein interactions. Bioinformatics. 2006;22:1207–1210. [PubMed]
  • Cai YD, Liu XJ, Xu XB, Chou KC. Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J Cell Biochem. 2002;84:343–348. [PubMed]
  • Chou KC, Shen HB. Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition. Biochem Biophys Res Commun. 2005;337:752–756. [PubMed]
  • http://www.geneontology.org/
  • Yi G, Sze SH, Thon MR. Identifying clusters of functionally related genes in genomes. Bioinformatics. 2007;23:1053–60. doi: 10.1093/bioinformatics/btl673. Epub 2007 Jan 19. [PubMed] [Cross Ref]
  • Lewin A, Grieve I. Grouping Gene Ontology terms to improve the assessment of gene set enrichment in microarray data. BMC Bioinformatics. 2006;7:426. [PMC free article] [PubMed]
  • Carroll S, Pavlovic V. Protein classification using probabilistic chain graphs and the Gene Ontology structure. Bioinformatics. 2006;22:1871–1878. [PubMed]
  • Wolstencroft K, Lord P, Tabernero L, Brass A, Stevens R. Protein classification using ontology classification. Bioinformatics. 2006;22:e530–538. [PubMed]
  • Qian Z, Cai YD, Li Y. A novel computational method to predict transcription factor DNA binding preference. Biochem Biophys Res Commun. 2006;348:1034–1037. [PubMed]
  • Popescu M, Keller JM, Mitchell JA. Fuzzy Measures on the Gene Ontology for Gene Product Similarity. IEEE/ACM Trans Comput Biol Bioinformatics. 2006;3:263–74. [PubMed]
  • Huang WL, Chen HM, Hwang SF, Ho SY. Accurate prediction of enzyme subfamily class using an adaptive fuzzy k-nearest neighbor method. BioSystems. 2007;90:405–418. Epub 2006 Oct 26. [PubMed]
  • Keller JM, Gray MR, Givens JA. A fuzzy k-nearest neighbours algorithm. IEEE Trans Syst Man Cybern. 1985;15:580–585.
  • Ho SY, Chen JH, Huang MH. Inheritable genetic algorithm for biobjective 0/1 combinatorial optimization problems and its applications. Systems, Man and Cybernetics, Part B, IEEE Transactions on. 2004;34:609–620. [PubMed]
  • Ho SY, Shu LS, Chen JH. Intelligent evolutionary algorithms for large parameter optimization problems. Evolutionary Computation, IEEE Transactions on. 2004;8:522–541.
  • Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004;32:D115–D119. [PMC free article] [PubMed]
  • Wang GL, Dunbrack Jr. RL. PISCES: a protein sequence culling server. Bioinformatics. 2003;19:1589–1591. [PubMed]
  • Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004;32:D262–266. [PMC free article] [PubMed]
  • GOA. [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/]
  • Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. [PubMed]
  • Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
  • Chang CC, Lin CJ. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. 2001.
  • Ho SY, Hsieh CH, Chen HM, Huang HL. Interpretable gene expression classifier with an accurate and compact fuzzy rule base for microarray data analysis. BioSystems. 2006;85:165–176. [PubMed]
  • Tung CW, Ho SY. POPI: Predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties. Bioinformatics. 2007;23:942–949. [PubMed]
  • Li T, Zhang C, Ogihara M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004;20:2429–2437. [PubMed]
  • Lu Z, Hunter L. GO Molecular Function Terms Are Predictive of Subcellular Localization. Pac Symp Biocomput. 2005:151–161. [PMC free article] [PubMed]
  • Stone M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. 1974;36:111–147.

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central