- PMC3400678

# PREDICTING FLEXIBLE LENGTH LINEAR B-CELL EPITOPES

^{1,2,5} Drena Dobbs,^{3,4,5} and Vasant Honavar^{1,2,4,5}

^{1}Artificial Intelligence Laboratory, Iowa State University, Ames, IA 50010, USA

^{2}Department of Computer Science, Iowa State University, Ames, IA 50010, USA

^{3}Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50010, USA

^{4}Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50010, USA

^{5}Center for Computational Intelligence, Learning, and Discovery, Iowa State University, Ames, IA 50010, USA

^{}Corresponding author.

## Abstract

Identifying B-cell epitopes plays an important role in vaccine design, immunodiagnostic tests, and antibody production. Therefore, computational tools for reliably predicting B-cell epitopes are highly desirable. We explore two machine learning approaches for predicting flexible length linear B-cell epitopes. The first approach utilizes four sequence kernels for determining a similarity score between any arbitrary pair of variable length sequences. The second approach utilizes four different methods of mapping a variable length sequence into a fixed length feature vector. Based on our empirical comparisons, we propose FBCPred, a novel method for predicting flexible length linear B-cell epitopes using the subsequence kernel. Our results demonstrate that FBCPred significantly outperforms all other classifiers evaluated in this study. An implementation of FBCPred and the datasets used in this study are publicly available through our linear B-cell epitope prediction server, BCPREDS, at: http://ailab.cs.iastate.edu/bcpreds/.

## 1. INTRODUCTION

B-cell epitopes are antigenic determinants that are recognized and bound by receptors (membrane-bound antibodies) on the surface of B lymphocytes ^{1}. The identification and characterization of B-cell epitopes play an important role in vaccine design, immunodiagnostic tests, and antibody production. As identifying B-cell epitopes experimentally is time-consuming and expensive, computational methods for reliably and efficiently predicting B-cell epitopes are highly desirable ^{2}.

There are two types of B-cell epitopes: (i) linear (continuous) epitopes which are short peptides corresponding to a contiguous amino acid sequence fragment of a protein ^{3, 4}; (ii) conformational (discontinuous) epitopes which are composed of amino acids that are not contiguous in primary sequence but are brought into close proximity within the folded protein structure. Although it is believed that a large majority of B-cell epitopes are discontinuous ^{5}, experimental epitope identification has focused primarily on linear B-cell epitopes ^{6}. Even in the case of linear B-cell epitopes, however, antibody-antigen interactions are often conformation-dependent. The conformation-dependent aspect of antibody binding complicates the problem of B-cell epitope prediction, making it less tractable than T-cell epitope prediction. Hence, the development of reliable computational methods for predicting linear B-cell epitopes is an important challenge in bioinformatics and computational biology ^{2}.

Previous studies have reported correlations between certain physicochemical properties of amino acids and the locations of linear B-cell epitopes within protein sequences ^{7–11}. Based on that observation, several methods based on amino acid propensity scales have been proposed. For example, the methods in ^{8–11} utilized hydrophilicity, flexibility, turns, and solvent accessibility propensity scales, respectively. PREDITOP ^{12}, PEOPLE ^{13}, BEPITOPE ^{14}, and BcePred ^{15} utilized groups of physicochemical properties instead of a single property to improve the accuracy of the predicted linear B-cell epitopes. Unfortunately, Blythe and Flower ^{16} showed that propensity based methods cannot be used reliably for predicting B-cell epitopes. Using a dataset of 50 proteins and an exhaustive assessment of 484 amino acid propensity scales, they showed that the best combinations of amino acid propensities performed only marginally better than random. They concluded that the reported performance of such methods in the literature is likely to have been overly optimistic, in part due to the small size of the data sets on which the methods had been evaluated.

Recently, the increasing availability of experimentally identified linear B-cell epitopes, together with the results of Blythe and Flower ^{16}, motivated several researchers to explore the application of machine learning approaches for developing linear B-cell epitope prediction methods. BepiPred ^{17} combines two amino acid propensity scales and a Hidden Markov Model (HMM) trained on linear epitopes to yield a slight improvement in prediction accuracy relative to techniques that rely on analysis of amino acid physicochemical properties. ABCPred ^{18} uses artificial neural networks for predicting linear B-cell epitopes. Both feed-forward and recurrent neural networks were evaluated on a *non-redundant* data set of 700 B-cell epitopes and 700 non-epitope peptides, using 5-fold cross validation tests. Input sequence windows ranging from 10 to 20 amino acids were tested, and the best performance, 66% accuracy, was obtained using a recurrent neural network trained on peptides 16 amino acids in length. In the method of Söllner and Mayer ^{19}, each epitope is represented using a set of 1487 features extracted from a variety of propensity scales, neighborhood matrices, and respective probability and likelihood values. Of two machine learning methods tested, decision trees and a nearest-neighbor method combined with feature selection, the latter was reported to attain an accuracy of 72% on a data set of 1211 B-cell epitopes and 1211 non-epitopes, using a 5-fold cross validation test ^{19}. Chen et al. ^{20} observed that certain amino acid pairs (AAPs) tend to occur more frequently in B-cell epitopes than in non-epitope peptides. Using an AAP propensity scale based on this observation, in combination with a support vector machine (SVM) classifier, they reported prediction accuracy of 71% on a data set of 872 B-cell epitopes and 872 non-B-cell epitopes, estimated using 5-fold cross validation. In addition, Chen et al. ^{20} demonstrated an improvement in the prediction accuracy, to 72.5%, when the AAP propensity scale is combined with turns, accessibility, antigenicity, hydrophilicity, and flexibility propensity scales.

Existing linear B-cell epitope prediction tools fall into two broad categories. Tools in the first category, residue-based predictors, take as input a protein sequence and assign binary labels to each individual residue in the input sequence. Each group of neighboring residues with predicted positive labels defines a variable length predicted linear B-cell epitope. Residue-based prediction methods scan the input sequence using a sliding window and assign a score to the amino acid at the center of the window based on the mean score of a certain propensity scale (e.g., flexibility or hydrophilicity). The target residue is predicted positive if its score is greater than a predetermined threshold. Unfortunately, it has been shown that the performance of these methods is only marginally better than random ^{16}. BepiPred ^{17} used the information extracted using the sliding window to train an HMM and combined it with two propensity scale based methods. BcePred ^{15} combined several propensity scales and showed that the performance of the combined scales is better than the performance of any single scale.

The second category of linear B-cell epitope prediction tools consists of the epitope-based predictors. An example of such predictors is the ABCPred server ^{18}. For this server, the input is a protein sequence and an epitope length (which must be one of {20, 18, …, 10}). The server then applies a sliding window of the user specified length and passes the extracted peptides to a neural network classifier trained using an epitope dataset in which all the epitope sequences have been set to the specified epitope length by trimming longer epitopes and extending shorter ones. A limitation of this approach is that the user is forced to select one of the six available epitope lengths and cannot specify a different epitope length.

Because linear B-cell epitopes can vary in length over a broad range (see Figure 1), it is natural to train classifiers using the experimentally reported epitope sequences without trimming or extending them. Such an approach allows us to provide a linear B-cell epitope prediction tool that lets the user experiment with virtually any epitope length. In this work, we explore two machine learning approaches for predicting flexible length linear B-cell epitopes. The first approach utilizes four sequence kernels for determining a similarity score between any arbitrary pair of variable length sequences. The second approach utilizes four different methods of mapping a variable length sequence into a fixed length feature vector. Based on our empirical comparisons, we propose FBCPred, a novel method for predicting flexible length linear B-cell epitopes using the subsequence kernel. Our results demonstrate that FBCPred significantly outperforms all other classifiers evaluated in this study. An implementation of FBCPred and the datasets used in this study are publicly available through our linear B-cell epitope prediction server, BCPREDS, at: http://ailab.cs.iastate.edu/bcpreds/.

## 2. MATERIALS AND METHODS

### 2.1. Data

We retrieved 1223 unique linear B-cell epitopes of length greater than 3 amino acids from the Bcipep database ^{21}. To avoid over-optimistic performance of classifiers evaluated on the set of unique epitopes, we applied a homology reduction procedure proposed by Raghava ^{22} for reducing sequence similarity among flexible length major histocompatibility complex class II (MHC-II) peptides. Briefly, given two peptides *p*_{1} and *p*_{2} of lengths *l*_{1} and *l*_{2} such that *l*_{1} ≤ *l*_{2}, we compare *p*_{1} with each *l*_{1}-length subpeptide in *p*_{2}. If the percent identity (PID) between *p*_{1} and any subpeptide in *p*_{2} is greater than 80%, then the two peptides are deemed to be similar. For example, to compute the PID between (ACDEFGHIKLMNPQRST) and (DEFGGIKLMN), we compare (DEFGGIKLMN) with (ACDEFGHIKL), (CDEFGHIKLM), …, (IKLMNPQRST). The PID between (DEFGGIKLMN) and (DEFGHIKLMN) is 90% since nine out of 10 residues are identical.
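The windowed comparison described above can be sketched in a few lines of Python. This is only an illustration of the PID test, not the authors' implementation; the function names `max_percent_identity` and `are_similar` are hypothetical, and the sketch does not decide which member of a similar pair is discarded, since the text does not specify that.

```python
def max_percent_identity(p1: str, p2: str) -> float:
    """Best percent identity between the shorter peptide and any
    equal-length window (subpeptide) of the longer peptide."""
    short, long_ = (p1, p2) if len(p1) <= len(p2) else (p2, p1)
    n = len(short)
    best = 0.0
    for start in range(len(long_) - n + 1):
        window = long_[start:start + n]
        matches = sum(a == b for a, b in zip(short, window))
        best = max(best, 100.0 * matches / n)
    return best


def are_similar(p1: str, p2: str, cutoff: float = 80.0) -> bool:
    """Peptides are deemed similar when the best windowed PID exceeds the cutoff."""
    return max_percent_identity(p1, p2) > cutoff
```

For the example in the text, the best window of (ACDEFGHIKLMNPQRST) against (DEFGGIKLMN) is (DEFGHIKLMN), so the pair exceeds the 80% cutoff.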

Applying the above homology reduction procedure to the set of 1223 unique variable length linear B-cell epitopes yields a *homology-reduced* set of 934 epitopes. We constructed two datasets of flexible length linear B-cell epitopes: an *original* dataset, with the 1223 unique epitopes as positive examples and 1223 non-epitopes randomly extracted from SwissProt ^{23} as negative examples, and a *homology-reduced* dataset, with the *homology-reduced* set of 934 epitopes as positive examples and an equal number of negative examples extracted randomly from SwissProt sequences. In both datasets, two selection criteria were applied to the randomly extracted non-epitopes: (i) the length distribution in the negative data is identical to the length distribution in the positive data; (ii) none of the non-epitopes appears in the set of epitopes.

### 2.2. Support vector machines and kernel methods

Support vector machines (SVMs) ^{24} are a class of supervised machine learning methods used for classification and regression. Given a set of labeled training data $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$, training an SVM classifier involves finding a hyperplane that maximizes the geometric margin between positive and negative training data samples. The hyperplane is described as $f(x) = \langle w, x \rangle + b$, where $w$ is a normal vector and $b$ is a bias term. A test instance, $x$, is assigned a positive label if $f(x) > 0$, and a negative label otherwise. When the training data are not linearly separable, a kernel function is used to map nonlinearly separable data from the input space into a feature space. Given any two data samples $x_i$ and $x_j$ in an input space $X \subseteq \mathbb{R}^d$, the kernel function $K$ returns $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$, where $\Phi$ is a nonlinear map from the input space $X$ to the corresponding feature space. The kernel function $K$ has the property that $K(x_i, x_j)$ can be computed without explicitly mapping $x_i$ and $x_j$ into the feature space, but instead, using their dot product $\langle x_i, x_j \rangle$ in the input space. Therefore, the kernel trick allows us to train a linear classifier, e.g., SVM, in a high-dimensional feature space where the data are assumed to be linearly separable without explicitly mapping each training example from the input space into the feature space. This approach relies implicitly on the selection of a feature space in which the training data are likely to be linearly separable (or nearly so) and explicitly on the selection of the kernel function to achieve such separability. Unfortunately, there is no single kernel that is guaranteed to perform well on every data set. Consequently, the SVM approach requires some care in selecting a suitable kernel and tuning the kernel parameters (if any).

### 2.3. Sequence kernel based methods

String kernels ^{25–29} are a class of kernel methods that have been successfully used in many sequence classification tasks ^{25, 26, 28, 30–32}. In these applications, a protein sequence is viewed as a string defined on a finite alphabet of 20 amino acids. In this work, we explore four string kernels: spectrum ^{25}, mismatch ^{26}, local alignment ^{28}, and subsequence ^{27}, in predicting linear B-cell epitopes. A brief description of the four kernels follows.

#### 2.3.1. Spectrum kernel

Let *A* denote a finite alphabet, e.g., the standard 20 amino acids, and let *x* and *y* denote two strings defined on the alphabet *A*. For *k* ≥ 1, the *k*-spectrum is defined as ^{25}:

$$\Phi_k(x) = \left(\phi_\alpha(x)\right)_{\alpha \in A^k} \tag{1}$$

where $\phi_\alpha(x)$ is the number of occurrences of the *k*-length substring *α* in the sequence *x*. The *k*-spectrum kernel of the two sequences *x* and *y* is obtained by taking the dot product of the corresponding *k*-spectra:

$$K_k^{\mathit{spct}}(x, y) = \langle \Phi_k(x), \Phi_k(y) \rangle \tag{2}$$

The *k*-spectrum kernel captures a simple notion of string similarity: two strings are deemed similar (i.e., have a high *k*-spectrum kernel value) if they share many of the same *k*-length substrings.
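As a rough illustration of the definition, the *k*-spectrum kernel can be computed from sparse *k*-mer counts. The sketch below is a direct, unoptimized reading of the definition (not the efficient data structure of ^{25, 26}); the function names are hypothetical.

```python
from collections import Counter


def spectrum(x: str, k: int) -> Counter:
    """Sparse k-spectrum: the count of every k-length substring of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def spectrum_kernel(x: str, y: str, k: int) -> int:
    """Dot product of the two k-spectra, summed over shared k-mers only."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(count * sy[kmer] for kmer, count in sx.items())
```

For instance, "act" and "acctct" share no 3-mer, so the kernel is zero for *k* = 3, while for *k* = 2 they share "ac" and "ct".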

#### 2.3.2. Mismatch kernel

The mismatch kernel ^{26} is a variant of the spectrum kernel in which inexact matching is allowed. Specifically, the (*k, m*)-mismatch kernel allows up to *m* < *k* mismatches to occur when comparing two *k*-length substrings. Let *α* be a *k*-length substring; the (*k, m*)-mismatch feature map is defined on *α* as:

$$\Phi_{(k,m)}(\alpha) = \left(\phi_\beta(\alpha)\right)_{\beta \in A^k} \tag{3}$$

where $\phi_\beta(\alpha) = 1$ if $\beta \in N_{(k,m)}(\alpha)$ and 0 otherwise, and $N_{(k,m)}(\alpha)$ is the set of *k*-mer substrings that differ from *α* by at most *m* mismatches. Then, the feature map of an input sequence *x* is the sum of the feature vectors for the *k*-mer substrings in *x*:

$$\Phi_{(k,m)}(x) = \sum_{k\text{-mers } \alpha \text{ in } x} \Phi_{(k,m)}(\alpha) \tag{4}$$

The (*k, m*)-mismatch kernel is defined as the dot product of the corresponding feature maps in the feature space:

$$K_{(k,m)}^{\mathit{mismatch}}(x, y) = \langle \Phi_{(k,m)}(x), \Phi_{(k,m)}(y) \rangle \tag{5}$$

It should be noted that the (*k,* 0)-mismatch kernel results in a feature space that is identical to that of the *k*-spectrum kernel. An efficient data structure for computing the spectrum and mismatch kernels in $O(|x| + |y|)$ and $O(k^{m+1}|A|^{m}(|x| + |y|))$, respectively, is provided in ^{26}.
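The feature map above can be illustrated by enumerating each *k*-mer's mismatch neighborhood explicitly. This is a brute-force sketch under the stated definition, not the efficient data structure of ^{26}; the names `neighborhood`, `mismatch_features`, and `mismatch_kernel` are hypothetical, and the 20-letter alphabet constant is an assumption about the input encoding.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed one-letter alphabet


def neighborhood(kmer, m, alphabet):
    """All strings of len(kmer) that differ from kmer in at most m positions."""
    result = {kmer}
    for n_subs in range(1, m + 1):
        for positions in combinations(range(len(kmer)), n_subs):
            for letters in product(alphabet, repeat=n_subs):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result


def mismatch_features(x, k, m, alphabet=AMINO_ACIDS):
    """Sum of the neighborhood indicator vectors of every k-mer in x."""
    features = Counter()
    for i in range(len(x) - k + 1):
        features.update(neighborhood(x[i:i + k], m, alphabet))
    return features


def mismatch_kernel(x, y, k, m, alphabet=AMINO_ACIDS):
    """Dot product of the two (k, m)-mismatch feature maps."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(value * fy[beta] for beta, value in fx.items())
```

Under this formal definition the kernel value counts shared neighborhood members, so it depends on the alphabet size and is generally larger than the number of *k*-mer pairs that lie within *m* mismatches of each other. With *m* = 0 the sketch reduces to the *k*-spectrum kernel.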

#### 2.3.3. Local alignment kernel

Local alignment (LA) kernel ^{28} is a string kernel adapted for biological sequences. The LA kernel measures the similarity between two sequences by summing up scores obtained from gapped local alignments of the sequences. This kernel has several parameters: the gap opening and extension penalty parameters, *d* and *e*, the amino acid mutation matrix *s*, and the factor *β*, which controls the influence of suboptimal alignments on the kernel value. Detailed formulation of the LA kernel and a dynamic programming implementation of the kernel with running time complexity in *O*(|*x*||*y*|) are provided in ^{28}.

#### 2.3.4. Subsequence kernel

The subsequence kernel (SSK) ^{27} generalizes the *k*-spectrum kernel by considering a feature space generated by the set of all (contiguous and non-contiguous) *k*-mer subsequences. For example, if we consider the two strings “*act*” and “*acctct*”, the value returned by the spectrum kernel with *k* = 3 is 0. On the other hand, the (3, 1)-mismatch kernel will return 3 because the 3-mer substrings “*acc*”*,* “*cct*”, and “*tct*” have at most one mismatch when compared with “*act*”. The subsequence kernel considers the set (“*ac* – *t*”*,* “*a* – *ct*”*,* “*ac* – – – *t*”*,* “*a* – *c* – –*t*”*,* “*a* – – – *ct*”) of non-contiguous substrings and returns a similarity score that is weighted by the length of each non-contiguous substring. Specifically, it uses a decay factor, λ ≤ 1, to penalize non-contiguous substring matches. Therefore, the subsequence kernel with *k* = 3 will return 2λ^{4} + 3λ^{6} when applied to “*act*” and “*acctct*” strings. More precisely, the feature map $\Phi_{(k,\lambda)}$ of a string *x* is given by:

$$\Phi_{(k,\lambda)}(x) = \left(\sum_{\mathbf{i}:\, u = x[\mathbf{i}]} \lambda^{l(\mathbf{i})}\right)_{u \in A^k} \tag{6}$$

where $u = x[\mathbf{i}]$ denotes a subsequence in *x* with indices $1 \le i_1 < \dots < i_{|u|} \le |x|$ such that $u_j = x_{i_j}$ for $j = 1, \dots, |u|$, and $l(\mathbf{i}) = i_{|u|} - i_1 + 1$ is the length of the subsequence in *x*. The subsequence kernel for two strings *x* and *y* is determined as the dot product of the corresponding feature maps:

$$K_{(k,\lambda)}^{\mathit{sub}}(x, y) = \langle \Phi_{(k,\lambda)}(x), \Phi_{(k,\lambda)}(y) \rangle \tag{7}$$

This kernel can be computed using a recursive algorithm based on dynamic programming in *O*(*k*|*x*||*y*|) time and space. The running time and memory requirements can be further reduced using techniques described in ^{33}.
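For short peptides, the feature map can be checked by brute force, enumerating every index tuple instead of using the dynamic programming recursion. The sketch below is meant only to make the definition concrete; it is not the Weka implementation, its function names are hypothetical, and its cost grows with the number of *k*-subsets of positions.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_features(x, k, lam):
    """Map each length-k subsequence u of x to the sum of lam ** span
    over all index tuples that spell u (span = last index - first index + 1)."""
    feats = defaultdict(float)
    for idx in combinations(range(len(x)), k):
        u = "".join(x[i] for i in idx)
        feats[u] += lam ** (idx[-1] - idx[0] + 1)
    return feats


def subsequence_kernel(x, y, k, lam=0.5):
    """Dot product of the two (k, lam) subsequence feature maps."""
    fx = subsequence_features(x, k, lam)
    fy = subsequence_features(y, k, lam)
    return sum(value * fy[u] for u, value in fx.items() if u in fy)
```

With this literal reading of the definition, the feature of "act" in "acctct" is 2λ^{4} + 3λ^{6}, and the kernel value against the string "act" additionally carries the factor λ^{3} contributed by "act" itself.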

### 2.4. Sequence-to-features based methods

This approach has been previously used for protein function and structure classification tasks ^{34–37} and the classification of flexible length MHC-II peptides. The main idea is to map each variable length amino acid sequence into a feature vector of fixed length. Once the variable length sequences are mapped to fixed length feature vectors, we can apply any of the standard machine learning algorithms to this problem. Here, we considered SVM classifiers trained on the mapped data using the widely used RBF kernel.

We explored four different methods for mapping a variable length amino acid sequence into a fixed length feature vector: (i) amino acid composition; (ii) dipeptide composition; (iii) amino acid pairs propensity scale; (iv) composition-transition-distribution. A brief summary of each method is given below.

#### 2.4.1. Amino acid and dipeptide composition

Amino acid composition (AAC) represents a variable length amino acid sequence using a feature vector of 20 dimensions. Let *x* be a sequence of |*x*| amino acids. Let *A* denote the set of the standard 20 amino acids. The amino acid composition feature mapping is defined as:

$$\Phi_{AAC}(x) = \left(\phi_\beta(x)\right)_{\beta \in A} \tag{8}$$

where $\phi_\beta(x) = \frac{\text{number of occurrences of amino acid } \beta \text{ in } x}{|x|}$.

A limitation of the amino acid composition feature representation of amino acid sequences is that we lose the sequence order information. Dipeptide composition (DC) encapsulates information about the fraction of amino acids as well as their local order. In dipeptide composition each variable length amino acid sequence is represented by a feature vector of 400 dimensions defined as:

$$\Phi_{DC}(x) = \left(\phi_\alpha(x)\right)_{\alpha \in A^2} \tag{9}$$

where $\phi_\alpha(x) = \frac{\text{number of occurrences of dipeptide } \alpha \text{ in } x}{\text{total number of all possible dipeptides in } x}$.
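Both composition mappings are simple enough to sketch directly. The code below assumes one-letter amino acid codes, counts overlapping dipeptides (so a peptide of length |*x*| contains |*x*| − 1 of them), and assumes peptides of at least two residues; the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(x):
    """20 features: fraction of each amino acid in x."""
    return [x.count(aa) / len(x) for aa in AMINO_ACIDS]


def dipeptide_composition(x):
    """400 features: fraction of each overlapping dipeptide in x."""
    pairs = [x[i:i + 2] for i in range(len(x) - 1)]
    return [pairs.count(dp) / len(pairs) for dp in DIPEPTIDES]
```

Either vector can then be passed to any standard classifier that expects fixed length inputs.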

#### 2.4.2. Amino acid pairs propensity scale

Amino acid pairs (AAPs) are obtained by decomposing a protein/peptide sequence into its 2-mer subsequences. Chen et al. ^{20} observed that some specific AAPs tend to occur more frequently in B-cell epitopes than in non-epitope peptides. Based on this observation, they developed an AAP propensity scale defined by:

$$\theta(\alpha) = \log\left(\frac{f_\alpha^{+}}{f_\alpha^{-}}\right) \tag{10}$$

where $f_\alpha^{+}$ and $f_\alpha^{-}$ are the occurrence frequencies of AAP *α* in the epitope and non-epitope peptide sequences, respectively. These frequencies have been derived from the Bcipep ^{21} and Swissprot ^{23} databases, respectively. To avoid the dominance of an individual AAP propensity value, the scale in Eq. (10) has been normalized to a [−1, +1] interval through the following conversion:

$$\theta(\alpha) \leftarrow 2\left(\frac{\theta(\alpha) - \mathit{min}}{\mathit{max} - \mathit{min}}\right) - 1 \tag{11}$$

where *max* and *min* are the maximum and minimum values of the propensity scale before the normalization.

The AAP feature mapping, $\Phi_{AAP}$, maps each amino acid sequence, *x*, into a 400-dimensional feature space defined as:

$$\Phi_{AAP}(x) = \left(\phi_\alpha(x) \cdot \theta(\alpha)\right)_{\alpha \in A^2} \tag{12}$$

where $\phi_\alpha(x)$ is the number of occurrences of the 2-mer *α* in the peptide *x*.
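A small sketch may help connect the propensity scale to the feature map. It assumes the scale is the log ratio of pair frequencies rescaled to [−1, +1], as described above; restricting the scale to pairs observed in both sets (to avoid taking the logarithm of zero) is an assumption of this sketch, and all function names and the toy peptides are hypothetical.

```python
import math
from collections import Counter


def pair_frequencies(peptides):
    """Relative frequency of every overlapping 2-mer across a peptide set."""
    counts = Counter(p[i:i + 2] for p in peptides for i in range(len(p) - 1))
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


def aap_scale(epitopes, non_epitopes):
    """Log-ratio propensity per pair, rescaled linearly to [-1, +1].
    Needs at least two shared pairs with different propensities."""
    f_pos = pair_frequencies(epitopes)
    f_neg = pair_frequencies(non_epitopes)
    # Assumption: score only pairs seen in both sets, so log() is defined.
    raw = {a: math.log(f_pos[a] / f_neg[a]) for a in f_pos if a in f_neg}
    lo, hi = min(raw.values()), max(raw.values())
    return {a: 2 * (v - lo) / (hi - lo) - 1 for a, v in raw.items()}


def aap_features(x, scale):
    """Pair counts in x weighted by the normalized propensity of each pair."""
    counts = Counter(x[i:i + 2] for i in range(len(x) - 1))
    return {a: counts[a] * theta for a, theta in scale.items()}


# Toy data for illustration only; real frequencies come from large databases.
scale = aap_scale(["AAAC"], ["AACC"])
features = aap_features("AAC", scale)
```

In practice the dictionary would be expanded into a dense 400-dimensional vector with a fixed ordering of the pairs.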

#### 2.4.3. Composition-Transition-Distribution

The basic idea behind the Composition-Transition-Distribution (CTD) method ^{38, 39} is to map each variable length peptide into a fixed length feature vector such that standard machine learning algorithms are applicable. From each peptide sequence, 21 features are extracted as follows:

- First, each peptide sequence *p* is mapped into a string *s*_{p} defined over an alphabet of three symbols, {1, 2, 3}. The mapping is performed by grouping amino acids into three groups using a physicochemical property of amino acids (see Table 1). For example, the peptide (AIRHIPRRIR) is mapped into (2312321131) using the hydrophobicity division of amino acids into three groups (see Table 1).
- Second, for each peptide string *s*_{p}, three descriptors are derived as follows:
  - Composition (C): three features representing the percent frequency of the symbols, {1, 2, 3}, in the mapped peptide sequence.
  - Transition (T): three features representing the percent frequency of *i* followed by *j* or *j* followed by *i*, for *i, j* ∈ {1, 2, 3}.
  - Distribution (D): five features per symbol representing the fractions of the entire sequence where the first, 25, 50, 75, and 100% of the candidate symbol are contained in *s*_{p}. This yields an additional 15 features for each peptide.

Table 1 shows the division of the 20 amino acids, proposed by Chinnasamy et al. ^{40}, into three groups based on hydrophobicity, polarizability, polarity, and van der Waals volume properties. Using these four properties, we derived 84 CTD features from each peptide sequence (21 features per property). In our experiments, we trained SVM classifiers using the RBF kernel and peptide sequences represented using their amino acid sequence composition (20 features) and CTD descriptors (84 features).
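The 21 features for one property can be sketched as follows. The grouping used here is a commonly used three-group hydrophobicity division that reproduces the (AIRHIPRRIR) to (2312321131) example above; it is an assumption and should be checked against Table 1. The quantile rule in the distribution descriptor (taking the ceil(q × count)-th occurrence) is likewise an assumption, since conventions differ between implementations, and the function names are hypothetical.

```python
import math
from itertools import combinations

# Assumption: 1 = polar, 2 = neutral, 3 = hydrophobic; verify against Table 1.
HYDROPHOBICITY = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}


def encode(peptide, groups=HYDROPHOBICITY):
    """Map each residue to the symbol of the group that contains it."""
    lookup = {aa: sym for sym, members in groups.items() for aa in members}
    return "".join(lookup[aa] for aa in peptide)


def ctd(peptide, groups=HYDROPHOBICITY):
    """Return 3 composition + 3 transition + 15 distribution features (percent)."""
    s = encode(peptide, groups)
    n = len(s)
    symbols = sorted(groups)
    composition = [100.0 * s.count(sym) / n for sym in symbols]
    pairs = [s[i:i + 2] for i in range(n - 1)]
    transition = [
        100.0 * (pairs.count(a + b) + pairs.count(b + a)) / len(pairs)
        for a, b in combinations(symbols, 2)
    ]
    distribution = []
    for sym in symbols:
        positions = [i + 1 for i, c in enumerate(s) if c == sym]
        for q in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)  # symbol absent from the peptide
                continue
            # Assumption: the q-quantile occurrence is the ceil(q * count)-th one.
            rank = max(1, math.ceil(q * len(positions)))
            distribution.append(100.0 * positions[rank - 1] / n)
    return composition + transition + distribution
```

Repeating the call with the groupings for the other three properties and concatenating the results would give the 84 CTD features described above.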

### 2.5. Performance evaluation

We report the performance of each classifier using the average of 10 runs of 5-fold cross validation tests. Each classifier's performance is assessed by both threshold-dependent and threshold-independent metrics. For threshold-dependent metrics, we used accuracy (ACC), sensitivity ($S_n$), specificity ($S_p$), and correlation coefficient (*CC*). The *CC* measure has a value in the range from −1 to +1 and the closer the value to +1, the better the predictor. $S_n$ and $S_p$ summarize the accuracies of the positive and negative predictions, respectively. ACC, $S_n$, $S_p$, and *CC* are defined in Eq. (13–15), where TP, FP, TN, FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively.

$$ACC = \frac{TP + TN}{TP + FP + TN + FN} \tag{13}$$

$$S_n = \frac{TP}{TP + FN} \quad \text{and} \quad S_p = \frac{TN}{TN + FP} \tag{14}$$

$$CC = \frac{TP \times TN - FP \times FN}{\sqrt{(TN + FN)(TN + FP)(TP + FN)(TP + FP)}} \tag{15}$$

For threshold-independent metrics, we report the Receiver Operating Characteristic (ROC) curve. The ROC curve is obtained by plotting the true positive rate as a function of the false positive rate or, equivalently, sensitivity versus (1-specificity) as the discrimination threshold of the binary classifier is varied. Each point on the ROC curve describes the classifier at a certain threshold value and hence a particular choice of tradeoff between true positive rate and false positive rate. We also report the area under the ROC curve (AUC) as a useful summary statistic for comparing two ROC curves. AUC is defined as the probability that a randomly chosen positive example will be ranked higher than a randomly chosen negative example. An ideal classifier will have an AUC of 1, while a classifier that performs no better than random will have an AUC of 0.5. Any classifier performing better than random will have an AUC value that lies between these two extremes.
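The probabilistic definition of AUC can be evaluated directly by comparing every positive score with every negative score. The sketch below is quadratic in the number of examples and is meant only to illustrate the definition; counting ties as one half is a common convention assumed here, and the function name is hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive example
    receives the higher score; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A perfect ranking gives 1, a fully inverted ranking gives 0, and constant scores give 0.5.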

### 2.6. Implementation and SVM parameter optimization

We used the Weka machine learning workbench ^{41} for implementing the spectrum, mismatch, and LA kernels (RBF and SSK kernels are already implemented in Weka). We evaluated the *k*-spectrum kernel, ${K}_{k}^{\mathit{spct}}$, for *k* = 1, 2, and 3. The (*k, m*)-mismatch kernel was evaluated at (*k, m*) equal to (3, 1) and (4, 1). The subsequence kernel, ${K}_{(k,\lambda )}^{\mathit{sub}}$, was evaluated at *k* = 2, 3, and 4 and the default value for λ, 0.5. The LA kernel was evaluated using the BLOSUM62 substitution matrix, gap opening and extension parameters equal to 10 and 1, respectively, and *β* = 0.5. For the SVM classifier, we used the Weka implementation of the SMO ^{42} algorithm. For the string kernels, the default value of the *C* parameter, *C* = 1, was used for the SMO classifier. For methods that use the RBF kernel, we found that tuning the SMO cost parameter *C* and the RBF kernel parameter γ is necessary to obtain satisfactory performance. We tuned these parameters using a 2-dimensional grid search over the range *C* = 2^{−5}, 2^{−3}, …, 2^{3}, γ = 2^{−15}, 2^{−13}, …, 2^{3}.
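The grid described above is small enough to enumerate explicitly. The sketch below only builds the grid and selects the best pair; the `evaluate` callback is a hypothetical placeholder for whatever score is used to compare settings (for example, a cross-validated AUC) and is not part of Weka.

```python
from itertools import product

# Exponents step by 2, matching C = 2^-5 ... 2^3 and gamma = 2^-15 ... 2^3.
C_VALUES = [2.0 ** e for e in range(-5, 4, 2)]
GAMMA_VALUES = [2.0 ** e for e in range(-15, 4, 2)]
GRID = list(product(C_VALUES, GAMMA_VALUES))


def grid_search(evaluate):
    """Return the (C, gamma) pair that maximizes evaluate(C, gamma)."""
    return max(GRID, key=lambda pair: evaluate(*pair))
```

This amounts to 5 values of *C* and 10 values of γ, so each representation requires 50 classifier evaluations.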

## 3. RESULTS AND DISCUSSION

Table 2 compares the performance of different SVM based classifiers on the *original* dataset of unique flexible length linear B-cell epitopes. The SVM classifier trained using SSK with *k* = 4 and λ = 0.5, ${K}_{(4,0.5)}^{\mathit{sub}}$, significantly (using a statistical paired t-test ^{43} at the 0.05 level of significance) outperforms all other classifiers in terms of the AUC. The two classifiers based on the mismatch kernel have the worst AUC. The classifier trained using ${K}_{3}^{\mathit{spct}}$ is competitive with those trained using the LA kernel and ${K}_{(2,0.5)}^{\mathit{sub}}$. The last four classifiers belong to the sequence-to-features approach. Each of these classifiers has been trained using an SVM classifier and the RBF kernel but on a different data representation. The results suggest that representation of the peptides using their dipeptide composition performs better than other feature representations on the *original* dataset. Figure 2 shows the ROC curves for different methods on the *original* dataset of unique flexible length linear B-cell epitopes. The ROC curve of the ${K}_{(4,0.5)}^{\mathit{sub}}$ based classifier almost dominates all other ROC curves (i.e., for nearly any choice of specificity value, the ${K}_{(4,0.5)}^{\mathit{sub}}$ based classifier has the best sensitivity).

Figure 2. ROC curves for different methods on the *original* dataset of unique flexible length linear B-cell epitopes. The ROC curve of the ${K}_{(4,0.5)}^{\mathit{sub}}$ based classifier almost dominates all other ROC curves.

Table 3 reports the performance of the different SVM based classifiers on the *homology-reduced* dataset of flexible length linear B-cell epitopes. We note that the performance of each classifier is considerably worse than its performance on the *original* dataset of unique epitopes. This discrepancy can be explained by the existence of epitopes with significant pairwise sequence similarity in the *original* dataset. Interestingly, the SVM classifier based on the ${K}_{(4,0.5)}^{\mathit{sub}}$ kernel still significantly outperforms all other classifiers at the 0.05 level of significance. Figure 3 shows the ROC curves for different methods on the *homology-reduced* dataset of flexible length linear B-cell epitopes. Again, the ROC curve of the ${K}_{(4,0.5)}^{\mathit{sub}}$ based classifier almost dominates all other ROC curves.

Figure 3. ROC curves for different methods on the *homology-reduced* dataset of flexible length linear B-cell epitopes. The ROC curve of the ${K}_{(4,0.5)}^{\mathit{sub}}$ based classifier almost dominates all other ROC curves.

Comparing the results in Table 2 and Table 3 reveals two important issues that, to the best of our knowledge, have not been addressed before in the literature on B-cell epitope prediction. First, our results demonstrate that performance estimates reported on the basis of the *original* dataset of unique linear B-cell epitopes are overly optimistic compared to the performance estimates obtained using the *homology-reduced* dataset. Hence, we suspect that the actual performance of linear B-cell epitope prediction methods on *homology-reduced* datasets is somewhat lower than the reported performance on the original dataset of unique peptides. Second, our results suggest that conclusions regarding how different prediction methods compare to each other, when drawn on the basis of datasets of unique epitopes, may be misleading. For example, from the reported results in Table 2, one may conclude that ${K}_{3}^{\mathit{spct}}$ outperforms ${K}_{1}^{\mathit{spct}}$ and ${K}_{2}^{\mathit{spct}}$, while results on the *homology-reduced* dataset (see Table 3) demonstrate that the three classifiers are competitive with each other. Another example of a misleading conclusion drawn from the results in Table 2 is that dipeptide composition features are a better representation of the data than amino acid composition features. This conclusion is contradicted by the results in Table 3, which show that the classifier constructed using the amino acid composition representation of the data slightly outperforms the classifier constructed using the dipeptide composition of the same data.

The results in Table 2 and Table 3 show that the classifier that used the amino acid composition features outperforms the classifier that used CTD features. This is interesting because the set of amino acid composition features is a subset of the CTD features. Recalling that the CTD representation is composed of 20 amino acid composition features plus 84 physicochemical features, we conclude that the added physicochemical features did not yield additional information that was relevant for the classification task. In addition, we observed that the classifier that used the dipeptide composition outperforms the classifier that used the AAP features. This is interesting because AAP features as defined in Eq. (12) can be viewed as dipeptide composition features weighted by the propensity of each dipeptide.

### 3.1. Web server

An implementation of FBCPred is available as a part of our B-cell epitope prediction server (BCPREDS) ^{44} which is freely accessible at http://ailab.cs.iastate.edu/bcpreds/. Because it is often valuable to compare predictions of multiple methods, and consensus predictions are more reliable than individual predictions, the BCPREDS server aims at providing predictions using several B-cell epitope prediction methods. The current implementation of BCPREDS allows the user to select among three prediction methods: (i) Our implementation of AAP method ^{20}; (ii) BCPred ^{44}, a method for predicting linear B-cell epitope using the subsequence kernel; (iii) FBCPred, the method introduced in this study for predicting flexible length B-cell epitopes. The major difference between FBCPred and the other two methods is that FBCPred can predict linear B-cell epitopes of virtually any arbitrary length while for the other two methods the length has to be one of possible six values, {12, 14, …, 22}.

Another goal of the BCPREDS server is to serve as a repository of benchmark B-cell epitope datasets. The datasets used for training and evaluating BCPred and the two datasets used in this study can be freely downloaded from the web server.

## 4. SUMMARY AND DISCUSSION

We explored two machine learning approaches for predicting flexible length linear B-cell epitopes. The first approach utilizes sequence kernels for determining a similarity score between any arbitrary pair of variable length sequences. The second approach utilizes several methods of mapping a variable length sequence into a fixed length feature vector. Our results demonstrated the superior performance of the subsequence kernel based SVM classifier compared to the other SVM classifiers examined in our study. Therefore, we proposed FBCPred, a novel method for predicting flexible length linear B-cell epitopes using the subsequence kernel. An implementation of FBCPred and the datasets used in this study are publicly available through our linear B-cell epitope prediction server, BCPREDS, at: http://ailab.cs.iastate.edu/bcpreds/.
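As an illustration of how a sequence kernel scores a pair of variable length sequences, the sketch below implements a k-spectrum kernel, which counts shared contiguous k-mers. It is shown because it is the simplest kernel of this kind; it is not the subsequence kernel used by FBCPred, and the function names and the cosine-style normalization are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq (empty if len(seq) < k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=2):
    """Inner product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(count * cy[kmer] for kmer, count in cx.items() if kmer in cy)


def normalized_spectrum_kernel(x, y, k=2):
    """Scale the kernel to [0, 1] so that sequence length matters less."""
    kxx = spectrum_kernel(x, x, k)
    kyy = spectrum_kernel(y, y, k)
    if kxx == 0 or kyy == 0:
        return 0.0
    return spectrum_kernel(x, y, k) / sqrt(kxx * kyy)
```

Because the kernel is defined directly on pairs of strings, the two sequences never need to be padded or truncated to a common length; a precomputed kernel matrix of such scores can then be passed to an SVM.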

Previous methods for predicting linear B-cell epitopes (e.g., ^{15, 17, 19, 18, 20}) have been evaluated on datasets of unique epitopes without applying any homology reduction procedure as a pre-processing step on the data. We showed that performance estimates reported on the basis of such datasets are considerably over-optimistic compared to performance estimates obtained using the *homology-reduced* datasets. Moreover, we showed that using such *non homology-reduced* datasets for comparing different prediction methods may lead to false conclusions regarding how these methods compare to each other.
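The general idea of homology reduction can be sketched as a greedy filter that keeps a sequence only if it is not too similar to any sequence already kept. The sketch below is not the procedure applied in this study: the similarity measure (the `ratio()` of Python's `difflib.SequenceMatcher`, used here as a crude stand-in for alignment-based sequence identity), the 0.8 threshold, and the function names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_homology_filter(sequences, threshold=0.8):
    """Keep each sequence only if it is below threshold vs. all kept ones.

    The result depends on input order: earlier sequences win ties.
    """
    kept = []
    for seq in sequences:
        if all(similarity(seq, other) < threshold for other in kept):
            kept.append(seq)
    return kept
```

Evaluating on the filtered set prevents near-duplicate peptides from appearing on both sides of a cross-validation split, which is the source of the over-optimistic estimates discussed above.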

### 4.1. Related work

Residue-based prediction methods ^{7–11, 15, 17} assign labels to each residue in the query sequence and therefore are capable of predicting linear B-cell epitopes of variable length. However, most of these methods have been shown to be of low to moderate performance ^{16}.
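The reason per-residue labels yield variable length epitopes can be seen in a small sketch: consecutive residues labeled as epitope are merged into one segment, whatever its length. The function name, the half-open index convention, and the optional `min_len` cutoff are assumptions for illustration and do not correspond to any specific published method.

```python
def residues_to_segments(labels, min_len=1):
    """Merge runs of positive per-residue labels into (start, end) segments.

    labels: sequence of truthy/falsy values, one per residue.
    Returns half-open intervals [start, end) with length >= min_len.
    """
    segments = []
    start = None
    for i, is_epitope in enumerate(labels):
        if is_epitope and start is None:
            start = i  # a new run begins
        elif not is_epitope and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(labels) - start >= min_len:
        segments.append((start, len(labels)))
    return segments
```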

The AAP method ^{20} maps each peptide sequence into a set of fixed length numeric features and therefore can be trained using datasets of flexible length sequences. However, the performance of this method has been reported only on a dataset of 20-mer peptides.

Söllner and Mayer ^{19} introduced a method for mapping flexible length epitope sequences into feature vectors of 1478 attributes. This method has been evaluated on a dataset of flexible length linear B-cell epitopes. However, no homology reduction procedure was applied to remove highly similar sequences from the data. In addition, the implementation of this method is not publicly available.

Recently, two methods ^{45, 39} have been successfully applied to the problem of predicting flexible length MHC-II binding peptides. The first method ^{45} utilized the LA kernel ^{28} for developing efficient SVM based classifiers. The second method ^{39} mapped each flexible length peptide into the set of CTD features employed in our study in addition to some extra features extracted using two secondary structure and solvent accessibility prediction classifiers. In our study we could not use these extra features due to the unavailability of these two programs.

## Acknowledgments

This work was supported in part by a doctoral fellowship from the Egyptian Government to Yasser EL-Manzalawy and a grant from the National Institutes of Health (GM066387) to Vasant Honavar and Drena Dobbs.

## References
