
# Target Fishing for Chemical Compounds using Target-Ligand Activity data and Ranking based Methods

^{*}Currently at Pfizer Global Research and Development, Pfizer Inc., Groton, CT. nikil.wale@pfizer.com

## Abstract

In recent years the development of computational techniques that identify all the likely targets for a given chemical compound, also termed the problem of Target Fishing, has been an active area of research. Identification of the likely targets of a chemical compound helps to understand problems such as toxicity, lack of efficacy in humans, and poor physical properties associated with that compound in the early stages of drug discovery. In this paper we present a set of techniques whose goal is to rank or prioritize targets in the context of a given chemical compound such that most targets that this compound may show activity against appear higher in the ranked list. These methods are based on our extensions to the SVM and Ranking Perceptron algorithms for this problem. Our extensive experimental study shows that the methods developed in this work outperform previous approaches by 2% to 60% under different evaluation criteria.

## 1 Introduction

Target-based drug discovery, which involves selection of an appropriate target (generally a single protein) implicated in a disease state as the first step, has become the primary approach to drug discovery in the pharmaceutical industry.^{1}^{,}^{2} This was made possible through the advent of High Throughput Screening (HTS) technology in the late 1980s, which enabled rapid experimental testing of a large number of chemical compounds against the target of interest (using target-based assays). HTS is now routinely utilized to identify the most promising compounds (called *hits*) that show the desired binding/activity against this target. Some of these compounds then go through the long and expensive process of optimization, and eventually one of them goes to clinical trials. If the clinical trials are successful, it becomes a drug. HTS technology was expected to usher in a new era in drug discovery by reducing the time and money needed to find hits with a high chance of eventually becoming drugs.

However, the expansion of the candidate list of hits via HTS did not result in productivity gains in terms of actual drugs coming out of the drug discovery pipeline. One of the principal reasons ascribed to this failure is that the above approach suffers from the serious drawback of focusing only on the target of interest, thereby taking a very narrow view of the disease. As such, it may lead to unsatisfactory phenotypic effects such as toxicity, promiscuity, and low efficacy in the later stages of drug discovery.^{2}^{,}^{3} More recently, the focus has been shifting to directly screening molecules for desirable phenotypic effects using cell-based assays. This screening evaluates properties such as toxicity, promiscuity, and efficacy from the onset rather than in the later stages of drug discovery.^{2}^{-}^{4} Moreover, toxicity and off-target effects are also a focus of the early stages of conventional target-based drug discovery.^{5}^{,}^{6} From the drug discovery perspective, however, target identification and subsequent validation have become the rate-limiting step in tackling the above issues.^{7} Targets must be identified for the hits of phenotypic assay experiments and for secondary pharmacology, as the activity of a hit against all of its potential targets sheds light on its toxicity and promiscuity.^{6}^{,}^{8} Therefore, the identification of all likely targets for a given chemical compound, also called *Target Fishing*,^{4} has become an important problem in drug discovery.

In this work we focus on the target fishing problem by utilizing the available target-ligand activity data matrix. In this approach, we are given a set of targets and a set of ligands (chemical compounds) and a bipartite activity relation between the targets and the ligands in the two sets. Given a new test chemical compound not in the set, the goal is to correctly predict all the activity relations between the test compound and the targets. We address this problem by formulating it as a category ranking problem. The goal is to learn a model such that for a given test compound it ranks the targets that this compound shows activity against (relevant targets) higher than the rest of the targets (non-relevant targets). In this work, we propose a number of methods that are inspired by research in the area of multiclass/multilabel classification and protein secondary structure prediction. Specifically, we develop four methods based on support vector machines (SVM)^{9} and ranking perceptrons^{10} to solve the above ranking problem. Three of these methods try to explicitly capture dependencies between different categories to build models. Our results show that the methods proposed in this work are either competitive or substantially outperform other methods currently employed to solve this problem.

The rest of this paper is organized as follows. Section 2 describes related research in the area of target fishing. Section 3 introduces the definitions and notation used in this paper. Section 4 describes our methods for target fishing. Section 5 discusses the datasets and experimental methodology used in this work. Section 6 describes our results, and Section 7 provides concluding remarks.

## 2 Related Methods

Computational techniques are becoming increasingly popular for target fishing due to the plethora of data from high-throughput screening (HTS), microarrays, and other experiments.^{4} Given a compound whose targets need to be identified, these techniques initially assign a score to each target based on some measure of the likelihood that the compound binds to it. They then select as the compound's potential targets either those targets whose score is above a certain cut-off or a small number of the highest scoring targets. Three general classes of methods have been developed for determining the required compound-target scores. The first, referred to as *inverse docking*, contains methods that score each compound-target pair by using ligand docking approaches.^{6}^{,}^{11} The second, referred to as nearest-neighbor, contains methods that determine the compound-target scores by exploiting structural similarities between the compound and the targets' known ligands.^{12} Finally, the third, referred to as model-based, contains methods that determine these scores using various machine-learning approaches to learn models for each one of the potential targets based on their known ligands.^{13}^{-}^{15}

The first class of methods for deriving a score for each compound-target pair comes from the computational chemistry domain. Specifically, the inverse docking process docks a single compound of interest to a set of targets and obtains a score for each target against this compound. The highest scoring targets are then considered the most likely targets that this compound will bind to.^{6} This approach suffers from the serious drawback that all the targets used in the process must have their three-dimensional structure available in order to employ a docking procedure; however, the majority of target proteins do not have such information available.^{16}

The second class of methods, the nearest-neighbor based techniques, relies on the principle of the structure-activity relationship (SAR),^{17}^{,}^{18} which suggests that very similar compounds have a higher chance of overlap between the sets of targets that they show activity against.^{12} Therefore, identifying targets for a given chemical compound can be done by utilizing its structural similarity to other chemical compounds that are known to be active or inactive against certain targets. In these approaches, for a given test compound, its nearest neighbor(s) are identified from a database of compounds with known targets using some notion of structural similarity. The most likely targets for the test compound are then identified as those targets that its nearest neighbors show activity against. A ranking among these targets can be obtained by taking into account the similarity values/rankings of the nearest neighbors that these targets belong to. In these approaches the solution to the target fishing problem depends only on the underlying descriptor-space representation, the similarity function employed, and the definition of nearest neighbors.
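To make the nearest-neighbor idea concrete, the sketch below ranks targets for a query compound by summing, per target, the Tanimoto similarities of the query's k most similar database compounds that are active against that target. This is one illustrative aggregation rule, not the exact formula of any cited method, and all names are hypothetical.

```python
import numpy as np

def tanimoto(x, y):
    """Tanimoto coefficient between two descriptor frequency vectors."""
    dot = float(np.dot(x, y))
    return dot / (np.dot(x, x) + np.dot(y, y) - dot)

def nn_target_ranking(c, db_vectors, db_target_sets, k=2):
    """Rank targets for compound c: score each target by the summed
    similarity of c's k nearest neighbors that are active against it."""
    sims = np.array([tanimoto(c, x) for x in db_vectors])
    neighbors = np.argsort(-sims)[:k]          # k most similar database compounds
    scores = {}
    for i in neighbors:
        for t in db_target_sets[i]:            # targets the neighbor is active against
            scores[t] = scores.get(t, 0.0) + sims[i]
    return sorted(scores, key=scores.get, reverse=True)
```

Note that, as the text observes, the outcome depends entirely on the descriptor space, the similarity function, and the neighbor definition; no model is learned.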

Lastly, a number of methods have been proposed that explicitly build models on the given set of compounds with known targets. These techniques treat the target fishing problem as an instance of a multilabel (multi-category) prediction problem.^{15}^{,}^{19}^{,}^{20} In this setting, for a given chemical compound, each of the targets is treated as one of the potential labels and the goal is to predict all the labels (i.e., targets) that the compound belongs to (i.e., binds to). One such approach utilizes multi-category Bayesian models^{13} wherein a model is built for every target using the available SAR data. Compounds that show activity against a target are used as positive instances and the rest of the compounds are treated as negative instances. For a new compound, each of these models is used to compute the compound's likelihood of being active against the corresponding target, and the targets that obtain the highest likelihood scores are considered to be the targets for this compound. In addition, approaches have been developed that build the classification models using one-versus-rest binary support vector machines^{15} and neural networks.^{14} Note that even though the underlying machine learning problem is that of multilabel prediction, all of the above model-based methods essentially build one-vs-rest models and then produce a ranking of the possible targets by directly comparing the outputs of these models.

## 3 Definitions and Notations

The target fishing problem that we consider in this paper is defined as follows:

Definition 1 (Target Fishing Problem). Given a set of compounds (more than one) whose bioactivity against each of the targets in a given set is known to be either active or inactive, learn a model that, for a test compound, correctly predicts a ranking of all the targets according to how likely the test compound is to show activity against each of them.

Throughout this paper we will use 𝒞 = {c_{1}, … , c_{M}} to denote a library of chemical compounds, *Τ* = {*τ*_{1}, … , *τ*_{N}} to denote a set of protein targets, and will assume that they contain *M* and *N* elements, respectively. For each compound c_{i}, we will use *Τ*_{i} to denote the set of all targets that c_{i} shows activity against. Note that *Τ*_{i} ⊆ *Τ*. We will use *Τ** to denote a total ordering of the targets of *Τ*. Given two sets *A* and *B* such that *A* ⊆ *Τ* and *B* ⊆ *Τ*, we will use *A* <_{Τ*} *B* to denote that every target of *A* precedes every target of *B* in *Τ**, and *A* ≮_{Τ*} *B* otherwise.

Each compound will be represented by a topological descriptor-based representation.^{18}^{,}^{21} In this representation, each compound is modeled as a frequency vector of certain topological descriptors (e.g., subgraphs) present in its molecular graph. Each dimension's frequency counts the number of times (i.e., embeddings) the corresponding topological descriptor is present in the compound's molecular graph. We will use d to represent the dimensionality of the descriptor-space representation of the chemical compounds in 𝒞. Given a compound c and a parameter k, we define top-k to be the k predicted targets that are most likely to show activity for c. Lastly, throughout this paper we will use the terms target, category, and label interchangeably.

## 4 Methods

Our solution to the Target Fishing problem relies on the principle of the Structure Activity Relationship (SAR). Specifically, we develop solutions for the target fishing problem using SAR data by formulating it as a ranking problem. We pursue this approach because in real-world situations a practitioner might want to know the top-k most likely targets for a given compound so that they can be tested experimentally or further investigated. Therefore, if the relevant (true) targets fall within these top-k predicted targets, they will have a higher chance of being recognized. These methods are described in the rest of this section.

### 4.1 SVM-based Method

One approach for solving the ranking problem is to build for each target *τ*_{i} ∈ *Τ* a one-versus-rest binary SVM classifier. Given a test chemical compound c, the classifier for each target *τ*_{i} is then applied to obtain a prediction score f_{i}(c). The ranking *Τ** of the *N* targets is obtained by simply sorting the targets based on their prediction scores. That is,

*Τ** = argsort_{τ_{i} ∈ Τ} {f_{i}(c)},     (1)

where argsort returns an ordering of the targets in decreasing order of their prediction scores f_{i}(c). Note that this approach assumes that the prediction scores obtained from the *N* binary classifiers are directly comparable, which may not necessarily be valid. This is because different classes may be of different sizes and/or less separable from the rest of the dataset, indirectly affecting the nature of the binary model that was learned and consequently its prediction scores.^{22} This SVM-based sorting method is similar to the approach^{15} described in Section 2. We refer to this method as SVMR.
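A minimal sketch of SVMR on synthetic descriptor data, assuming scikit-learn's `SVC` as the binary SVM (the paper does not tie itself to a particular SVM implementation):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: rows of X are descriptor vectors; columns of Y are per-target activity.
rng = np.random.default_rng(0)
X = rng.random((60, 4))
Y = np.column_stack([X[:, 0] > 0.5, X[:, 1] > 0.5, X[:, 0] + X[:, 1] > 1.0])

# One-vs-rest: an independent binary SVM per target (actives vs. the rest).
models = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]

def svmr_ranking(c):
    """T* = argsort of the targets by decreasing prediction score f_i(c)."""
    scores = np.array([m.decision_function(c.reshape(1, -1))[0] for m in models])
    return np.argsort(-scores)
```

As the text notes, comparing raw `decision_function` outputs across independently trained classifiers implicitly assumes their scores are on a common scale.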

### 4.2 Cascaded SVM-based Method

A limitation of the previous approach is that by building a series of one-vs-rest binary classifiers it does not explicitly couple the information on the multiple categories that each compound belongs to during model training and as such it cannot capture dependencies that might exist between the different categories. A promising approach that has been explored to capture such dependencies is to formulate it as a cascaded learning problem.^{19}^{,}^{23}^{,}^{24} In these approaches, two sets of binary one-vs-rest classification models for each category, referred to as *L*_{1} and *L*_{2}, are connected together in a cascaded fashion. The *L*_{1} models are trained on the initial inputs and their outputs are used as input, either by themselves or in conjunction with the initial inputs, to train the *L*_{2} models. This cascaded process is illustrated in Figure 1. During prediction time, the *L*_{1} models are first used to obtain the required predictions which are used as input to the *L*_{2} models from which we obtain the final predictions. Since the *L*_{2} models incorporate information about the predictions produced by the *L*_{1} models, they can potentially capture inter-category dependencies.

Motivated by the above observation, we developed a ranking method for the target fishing problem in which both the *L*_{1} and *L*_{2} models consist of *N* binary one-vs-rest SVM classifiers, one for each target in *Τ*. The *L*_{1} models correspond exactly to the set of models built by the SVMR method discussed in the previous section. The representation of each compound in the training set for the *L*_{2} models consists of its descriptor-space based representation and its outputs from each of the *N* *L*_{1} models. Thus, each compound corresponds to a d + *N* dimensional vector, where d is the dimensionality of the descriptor space. The final ranking *Τ** of the targets for a compound c will be obtained by sorting the targets based on their prediction scores from the *L*_{2} models (${\mathcal{f}}_{\mathcal{i}}^{{L}_{2}}\left(\mathcal{c}\right)$). That is,

*Τ** = argsort_{τ_{i} ∈ Τ} {${\mathcal{f}}_{\mathcal{i}}^{{L}_{2}}\left(\mathcal{c}\right)$}.     (2)

We will refer to this approach as SVM2R.
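A sketch of the SVM2R cascade on synthetic data, again assuming scikit-learn's `SVC`. For brevity the L1 outputs used to train the L2 models are computed in-sample here; the cross-validated variant actually used for training is described in Section 5.2.3.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.random((80, 6))
Y = np.column_stack([X[:, 0] > 0.5, X[:, 1] > 0.5, X[:, 0] + X[:, 1] > 1.0])

# L1: one-vs-rest SVMs on the descriptor space (same models as SVMR).
l1 = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]

def l1_outputs(A):
    return np.column_stack([m.decision_function(A) for m in l1])

# L2 input: descriptor vector concatenated with the N L1 outputs (d + N dims),
# so the L2 models can pick up inter-category dependencies.
X2 = np.hstack([X, l1_outputs(X)])
l2 = [SVC().fit(X2, Y[:, i]) for i in range(Y.shape[1])]

def svm2r_ranking(c):
    c2 = np.hstack([c, l1_outputs(c.reshape(1, -1)).ravel()])
    scores = np.array([m.decision_function(c2.reshape(1, -1))[0] for m in l2])
    return np.argsort(-scores)
```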

A potential problem with such an approach is that the descriptor-space based representation of a chemical compound and the set of its outputs from the *L*_{1} models are not directly comparable. Therefore, in this work, we also experiment with various kernel functions that combine the d dimensional descriptor-space representation and the *N* dimensional *L*_{1} model outputs of a chemical compound. Specifically, we experiment with two forms of the kernel function. The first function is given by

*K*(c_{i}, c_{j}) = *α* *K*_{A}(c_{i}, c_{j}) + (1 − *α*) *K*_{B}(c_{i}, c_{j}),     (3)

where *α* is a user-defined parameter and *K*_{A} and *K*_{B} are the kernel functions measuring the similarity between the compound vectors formed using the descriptor-space based representation and the *L*_{1} SVM classifier outputs, respectively. The second kernel function is given by

*K*(c_{i}, c_{j}) = *K*_{A}(c_{i}, c_{j}) × *K*_{B}(c_{i}, c_{j}).     (4)

These approaches are motivated by the work on kernel fusion^{25} and tensor product kernels.^{26}
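The kernel combinations can be sketched over precomputed Gram matrices, with Tanimoto as K_A and cosine as K_B as chosen in Section 5.2.1. The convex-combination and elementwise-product forms below are the standard shapes suggested by the kernel fusion and tensor product motivations; they are a sketch, not a verbatim transcription of the paper's equations.

```python
import numpy as np

def tanimoto_kernel(X):
    """Gram matrix of the Tanimoto coefficient over descriptor vectors (K_A)."""
    G = X @ X.T
    sq = np.diag(G)
    return G / (sq[:, None] + sq[None, :] - G)

def cosine_kernel(O):
    """Gram matrix of the cosine similarity over L1 SVM output vectors (K_B)."""
    On = O / np.linalg.norm(O, axis=1, keepdims=True)
    return On @ On.T

def fused_kernel(KA, KB, alpha=0.5):
    # Weighted fusion of the two kernels with a single parameter alpha.
    return alpha * KA + (1 - alpha) * KB

def product_kernel(KA, KB):
    # Elementwise (tensor-product style) combination.
    return KA * KB
```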

### 4.3 Ranking Perceptron Method

We also developed a ranking method that exploits the potential dependencies between the categories based on the ranking perceptron.^{10} The ranking perceptron extends Rosenblatt's linear perceptron classifier^{27} to the task of learning a ranking function. The perceptron algorithm and its variants have proven to be effective in a broad range of applications in machine learning, information retrieval, and bioinformatics.^{10}^{,}^{20}^{,}^{28}

Our approach is based on the online version of the ranking perceptron algorithm proposed by Crammer and Singer to learn a ranking function on a set of categories.^{10} This algorithm takes as input a set of objects and the categories that they belong to, and learns a function that, for a given object, ranks the different categories based on the likelihood that the object binds to the corresponding targets. During the learning phase, the distinction between categories is made only via a binary decision function that takes into account whether a category is part of the object's categories (relevant set) or not (non-relevant set). As a result, even though the output of this algorithm is a total ordering of the categories, the learning is only dependent on the partial orderings induced by the set of relevant and non-relevant categories.

The pseudocode of our ranking perceptron algorithm is shown in Algorithm 1. This algorithm learns a linear model *W* that corresponds to an *N* × d matrix, where *N* is the number of targets and d is the dimensionality of the descriptor space. Using this model, the prediction score for compound c_{i} and target *τ*_{j} is given by ⟨*W*_{j}, c_{i}⟩, where *W*_{j} is the jth row of *W*, c_{i} is the descriptor-space representation of the compound, and ⟨·,·⟩ denotes a dot-product operation. Our algorithm extends the work of Crammer and Singer by introducing margin based updates and extending the online version to a batch setting. Specifically, for each training compound c_{i} that is active for targets belonging to categories *Τ*_{i} ⊆ *Τ*, our algorithm learns *W* such that the following constraints are satisfied:

⟨*W*_{q}, c_{i}⟩ − ⟨*W*_{r}, c_{i}⟩ ≥ *β*   ∀ *τ*_{q} ∈ *Τ*_{i}, ∀ *τ*_{r} ∈ *Τ* \ *Τ*_{i},     (5)

where *β* is a user-specified non-negative constant that corresponds to the separation margin. The idea behind these constraints is to force the algorithm to learn a model in which the set of relevant categories (*Τ*_{i}) for a given chemical compound c_{i} is well-separated from and ranked higher than all the non-relevant categories (*Τ* \ *Τ*_{i}). Therefore, our algorithm tries to satisfy |*Τ*_{i}| × |*Τ* \ *Τ*_{i}| constraints for each training compound c_{i} (∑_{i} |*Τ*_{i}| × |*Τ* \ *Τ*_{i}| constraints in total) to enforce a degree of acceptable separation between relevant and non-relevant categories that is controlled by *β*.

### Algorithm 1: Learning Category Weight Vectors with the Ranking Perceptron Algorithm

**input:**

- 𝒞: Set of *M* training compounds.
- *Τ*: Set of *N* targets (categories).
- (c_{i}, *Τ*_{i}): Compound c_{i} and its categories *Τ*_{i}.
- *β*: User defined margin constraint.
- d: Dimensionality of the compound's descriptor-space representation.

**output:**

- *W*: *N* × d model matrix.

```
 1: W = 0                                     {Initial model}
 2: η = 1/M                                   {Update weight}
 3: while (STOPPING CRITERION == FALSE) do
 4:   for i = 1 to M do
 5:     Τ* = argsort_{τ_j ∈ Τ} {⟨W_j, c_i⟩}
 6:     τ_s = lowest ranked target of Τ_i in Τ*
 7:     τ_t = highest ranked target of Τ \ Τ_i in Τ*
 8:     if ⟨W_s, c_i⟩ − ⟨W_t, c_i⟩ < β then
 9:       for all τ_q ∈ Τ_i : ⟨W_q, c_i⟩ − ⟨W_t, c_i⟩ < β do
10:         λ = |{τ_r ∈ Τ \ Τ_i : ⟨W_q, c_i⟩ − ⟨W_r, c_i⟩ < β}|
11:         W_q = W_q + λη c_i
12:       end for
13:       for all τ_r ∈ Τ \ Τ_i : ⟨W_s, c_i⟩ − ⟨W_r, c_i⟩ < β do
14:         λ = |{τ_q ∈ Τ_i : ⟨W_q, c_i⟩ − ⟨W_r, c_i⟩ < β}|
15:         W_r = W_r − λη c_i
16:       end for
17:     end if
18:   end for
19:   ∀ τ_j ∈ Τ, W_j = W_j / ||W_j||
20: end while
21: return W
```

During each outer iteration (lines 3–20) the algorithm iterates over all the training compounds (lines 4–18) and for each compound c_{i} it obtains a ranking *Τ** of all the categories (line 5) based on the current model *W*, and updates the model if any of the constraints in Equation 5 are violated. The check for constraint violation is done in line 8 by comparing the lowest ranked target *τ*_{s} ∈ *Τ*_{i} with the highest ranked target *τ*_{t} ∈ *Τ* \ *Τ*_{i}. If there are any constraint violations, the condition on line 8 will be true and lines 9–16 of the algorithm will be executed. The model *W* is updated by adding/subtracting a multiple of c_{i} to/from the rows of *W* involved in the pairs of targets of the violated constraints. Instead of updating the model's vectors by a constant multiple, which is usually done in perceptron training, our algorithm uses a multiple that is proportional to the number of constraints that each target violates in *Τ**. Specifically, for each target *τ*_{q} ∈ *Τ*_{i}, our algorithm (line 10) finds the number *λ* of targets *τ*_{r} ∈ *Τ* \ *Τ*_{i} that violate the margin constraint with *τ*_{q} and adds to the *q*th row of *W* (which is the portion of the model corresponding to target *τ*_{q}) a *λ* multiple of *η* c_{i}, where *η* is a small constant set to 1/*M* in our experiments. The motivation behind this proportional update is that if a relevant target *τ*_{q} follows a large number of non-relevant targets in the ordering *Τ**, *τ*_{q}'s model (*W*_{q}) needs to move towards the direction of c_{i} more than the model of another relevant target *τ*_{q′} which is followed by only a small number of non-relevant targets in *Τ**. Note that the term "follows" in the above discussion needs to be considered within the context of the margin *β*. A similar approach is used to determine the multiple of c_{i} to be subtracted from the rows of *W* corresponding to the non-relevant targets that are involved in violated constraints (lines 13–16). Our experiments (not reported here) showed that this proportional update achieved consistently better results than those achieved by constant update rules.

Since the ranking perceptron algorithm is not guaranteed to converge when the training instances are not *β*-linearly separable, Algorithm 1 incorporates an explicit *stopping criterion*. After every pass over the entire training set, it computes the average uninterpolated precision (Section 5.2.4) over all the compounds using the weights *W*, and terminates when this precision has not improved in *N* consecutive iterations. The algorithm returns the *W* that achieved the highest training precision over all iterations. We directly apply the above method on the descriptor-space representation of the training set of chemical compounds.

The predicted ranking *Τ** for a test chemical compound c is given by

*Τ** = argsort_{τ_{i} ∈ Τ} {⟨*W*_{i}, c⟩}.     (6)

We will refer to this approach as RP.
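One batch pass of the update loop in Algorithm 1 can be sketched in plain NumPy as follows. The function and argument names (`rp_pass`, `T_sets`) are ours, and the guard for never-updated all-zero rows in the normalization step is an added practical detail:

```python
import numpy as np

def rp_pass(W, X, T_sets, beta=0.1):
    """One batch pass of the margin-based ranking perceptron update
    (the body of Algorithm 1, lines 4-19), modifying W in place."""
    M = X.shape[0]
    N = W.shape[0]
    eta = 1.0 / M                              # update weight, as in line 2
    for i in range(M):
        c = X[i]
        rel = sorted(T_sets[i])                # relevant targets T_i
        non = [t for t in range(N) if t not in T_sets[i]]
        scores = W @ c
        s = min(rel, key=lambda q: scores[q])  # lowest ranked relevant target
        t = max(non, key=lambda r: scores[r])  # highest ranked non-relevant target
        if scores[s] - scores[t] < beta:       # some constraint is violated
            for q in rel:                      # push violating relevant rows toward c
                lam = sum(1 for r in non if scores[q] - scores[r] < beta)
                W[q] += lam * eta * c
            for r in non:                      # push violating non-relevant rows away
                lam = sum(1 for q in rel if scores[q] - scores[r] < beta)
                W[r] -= lam * eta * c
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0                  # guard untouched (all-zero) rows
    W /= norms                                 # row normalization (line 19)
    return W
```

Repeating `rp_pass` and tracking the training uninterpolated precision reproduces the stopping criterion described below.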

### 4.4 SVM+Ranking Perceptron-based Method

A limitation of the above ranking perceptron method relative to the SVM-based methods is that it is a weaker learner, as (i) it learns a linear model, and (ii) it does not provide any guarantees that it will converge to a good solution when the dataset is not linearly separable. In order to partially overcome these limitations we developed a scheme that is similar in nature to the cascaded SVM-based approach, but the *L*_{2} models are replaced by a ranking perceptron. Specifically, *N* binary one-vs-rest SVM models are trained, which form the set of *L*_{1} models. Similar to the SVM2R method, the representation of each compound in the training set for the *L*_{2} model consists of its descriptor-space based representation and its outputs from each of the *N* *L*_{1} models. Thus, each compound corresponds to a d + *N* dimensional vector, where d is the dimensionality of the descriptor space. Finally, a ranking model *W* is learned using the ranking perceptron of Algorithm 1. Since the *L*_{2} model is based on the descriptor-space based representation and the outputs of the *L*_{1} models, the size of *W* is *N* × (d + *N*). A recent study within the context of remote homology prediction and fold recognition has shown that this way of coupling SVMs and ranking perceptrons improves the overall performance.^{28} We will refer to this approach as SVMRP.

## 5 Materials

### 5.1 Datasets

We evaluated the methods proposed in this work using a set of assays derived from a wide variety of databases that store the bioactivity relationships between targets and sets of small chemical molecules or ligands. In particular, these databases provide us with target-ligand activity pairs.

We use the PubChem^{29} database to extract target-specific dose-response confirmatory assays. For each assay we choose compounds that show the desired activity and are confirmed as active by the database curators. Compounds that show different activity signals in different experiments against the same target are deemed inconclusive and are not used in this study. Duplicate compound entries are removed by comparing the canonical SMILES^{30} representations of these molecules. We also incorporate target-ligand pairs from the following databases: BindingDB,^{31} DrugBank,^{32} the PDSP *K*_{i} database,^{33} the KEGG BRITE database,^{34} and an evaluation sample of the WOMBAT database.^{35} All the protein targets that can be mapped to an identifier in the PDB database^{16} are extracted from this set. We then eliminate all the targets that have fewer than 10 actives to ensure that some amount of activity information is available for each target in our database. Note that a minority of the databases report binding affinities between compounds and targets (which can be converted to binds/does not bind relations) instead of activity; in this work we do not distinguish between the two. It should also be noted that the dataset used in our study consists mostly of confirmatory dose-response assays, and is therefore expected to have a lower level of noise than either single-point or High Throughput Screening data.

After the above integration and filtering steps our final dataset contains 231 targets and 27,205 compounds or ligands, with a total of 40,170 target-ligand active pairs. Note that certain compounds may show activity against two or more targets. Table 1 shows the number of compounds that belong to one or more categories. Out of the 27,205 compounds, 19,154 belong to a single category, 5,363 belong to two categories, 1,697 belong to three categories, and the rest belong to more than three categories. It should also be noted that most of these compounds have been experimentally tested for activity against only a very small subset of the 231 targets. Thus, if a compound belongs to a very small number of categories, it may simply be because it has not been experimentally tested against many targets. Figure 2 shows the distribution of actives against the 231 targets in this dataset. The dataset has a skewed distribution wherein most targets have few active compounds (fewer than one hundred) but a small number of targets have a large number of active compounds (in the thousands).

The filtered dataset contains many clinically important enzymes such as PDEs as well as many kinase proteins. It also contains many targets from clinically important virulent bacterial and viral pathogens such as tuberculosis, cholera, HIV, and herpes. In particular, the dataset contains 165 human targets, 25 non-human mammalian targets, and 41 remaining targets that are mostly bacterial, viral, and fungal. Figure 3 describes the distribution of the 231 targets across various classes. (The dataset, with the SMILES representation of each compound and the PDB ids of all the targets it binds to, can be found online^{a}.)

### 5.2 Experimental Methodology

All the experiments were performed on dual core AMD Opterons with 4 GB of memory. The following sections will review our experimental methodology to assess the performance of various methods proposed in Section 4.

#### 5.2.1 Descriptor Spaces and Similarity Measures

The similarity between chemical compounds is usually computed by first transforming them into a suitable descriptor-space representation.^{18}^{,}^{21} A number of different approaches have been developed to represent each compound by a set of descriptors. These descriptors can be based on physicochemical properties as well as topological and geometric substructures (fragments).^{30}^{,}^{36}^{-}^{40} In this study we use the ECFP descriptors^{37}^{,}^{38} as they have been shown to be very effective in the context of classification, ranked-retrieval, and scaffold-hopping.^{39}^{,}^{41} We utilize Scitegic's Pipeline Pilot^{42} to generate ECFP4, a variant of the ECFP descriptor space, for our dataset.

We use the Tanimoto coefficient^{40} (extended Jaccard similarity coefficient) to measure the similarity of chemical compounds based on their descriptor-space representation. This similarity measure was used to form the kernel function in SVMR and the kernel *K*_{A} in SVM2R (Equations 3 and 4). We also utilize the cosine similarity to measure the similarity between two compounds represented by their *L*_{1} SVM outputs (*K*_{B} in Equations 3 and 4). The Tanimoto coefficient was selected because it has been shown to be an effective way of measuring the similarity between chemical compounds using sparse descriptor-space representations,^{39}^{,}^{40}^{,}^{43} whereas the cosine function was selected because it achieved slightly better performance than the Tanimoto coefficient and variants of the Euclidean distance for this task.

#### 5.2.2 Multi-Category Bayesian Predictor

We implemented the multi-category Bayesian models as described previously^{13} and will call this approach BAYESIAN, as the original authors call their approach Laplacian-corrected multi-category Bayesian models.^{13} We compare this scheme to the methods developed in this paper. We also experimented with a nearest-neighbor based scheme^{12} and found its overall results to be comparable to those of the BAYESIAN method; therefore, we do not report results for the nearest-neighbor scheme in this paper.

#### 5.2.3 Training Methodology

For each dataset we separated the compounds into test and training sets, ensuring that the test set is never used during any part of the learning phase. We split the entire data randomly into ten parts and use nine parts for training and one part for testing. We will refer to the nine-part training set as 𝒞_{r} and the one-part test set as 𝒞_{s}.

In order to learn models using BAYESIAN, SVMR, and RP, we utilize the entire set 𝒞_{r} for training. To learn the models for the cascaded methods (SVM2R and SVMRP) we use an approach that allows us to use the entire training set (𝒞_{r}) to build both the *L*_{1} and the *L*_{2} models. This approach is motivated by the cross-validation methodology and is similar to that used in previous works.^{20}^{,}^{28} In this approach we further partition the entire training set 𝒞_{r} into ten equal-size parts. Nine out of these ten parts are used to train *N* first-level (*L*_{1}) binary classifiers (one for each target). A total of *N* prediction values are then obtained for each compound in the remaining part. This process is repeated for each of the ten parts. At the end of this process, each training instance in 𝒞_{r} has been predicted by a set of *N* binary classifiers, and these predicted values serve as descriptors of the training samples (𝒞_{r}) for the second-level learning (using the SVMRP or the SVM2R algorithm). Having learned the second-level (*L*_{2}) models, we take the entire training set 𝒞_{r} and retrain the *L*_{1} models. These *L*_{1} models are then used to obtain the output values (which form the input representation for the *L*_{2} models) for the independent test set 𝒞_{s}.

_{s}In our setup we use each part (* _{s}*) from the initial ten way split as the test set exactly once. Therefore we have ten different variants of

*and*

_{r}*. In order to be consistent with the methodology described in*

_{s}^{13}for learning BAYESIAN models, we assumed all the compound-target pairs with unknown activity as inactives. Moreover, we did not have inactivity information for many targets in our dataset so it was impossible to model most of our targets using true actives and inactives.
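The cascaded training protocol above amounts to building second-level descriptors by cross-validated prediction (stacking). The following is a minimal sketch of that step; the `train_binary` and `score` hooks, and all names, are illustrative assumptions rather than the authors' code:

```python
import random

def make_meta_features(X, Y, train_binary, score, n_targets, folds=10):
    """Build second-level descriptors by ten-fold cross-validated prediction.

    Each training compound's N-dimensional meta-descriptor is produced by
    first-level (L1) models that never saw that compound, mirroring the
    protocol described in the text. `train_binary(xs, ys)` returns a binary
    model for one target; `score(model, x)` returns its output value for x.
    """
    n = len(X)
    order = list(range(n))
    random.Random(0).shuffle(order)              # fixed seed for repeatability
    parts = [order[i::folds] for i in range(folds)]

    meta = [None] * n
    for held_out in parts:
        held = set(held_out)
        train_idx = [i for i in order if i not in held]
        # one L1 binary classifier per target, trained on the other nine parts
        models = [train_binary([X[i] for i in train_idx],
                               [Y[i][t] for i in train_idx])
                  for t in range(n_targets)]
        # N prediction values for each compound in the held-out part
        for i in held_out:
            meta[i] = [score(m, X[i]) for m in models]
    return meta
```

The second-level ranker would then be trained on `meta`, after which the first-level models are retrained on the full training set to generate the same representation for the test compounds.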

#### 5.2.4 Evaluation Metrics

During the evaluation stage, we compute predictions for the untouched test dataset. These predictions are then evaluated using uninterpolated precision^{46} and precision/recall in the top-*n* ranks.^{47}

To calculate uninterpolated precision for each test compound we utilize the following methodology. For a test compound we obtain the top-*n* ranked targets (using one of the five schemes described in this paper), where *n* is equal to the number of targets that this test compound is active against. Now, for every correctly predicted target that appears at a position in the top-*n* predictions, we compute the precision value at that position. This precision value is defined as the ratio of the number of correct targets identified up to that position over the total number of targets seen thus far (which is the same as the position index). We calculate this value for every one of the positions in the top-*n* ranking that corresponds to a correctly predicted target for the given test compound. The final uninterpolated precision value is given by summing up all the precision values obtained above and dividing by *n*.
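The computation just described can be made concrete with a short sketch (function and variable names are illustrative):

```python
def uninterpolated_precision(ranked_targets, true_targets):
    """Average precision over the top-n ranks, where n = |true_targets|.

    `ranked_targets` is the full target list ordered by predicted score;
    only the top n positions are examined, and a precision value is
    accumulated at each position holding a correctly predicted target.
    """
    n = len(true_targets)
    hits, total = 0, 0.0
    for i, target in enumerate(ranked_targets[:n], start=1):
        if target in true_targets:
            hits += 1
            total += hits / i   # precision at this position
    return total / n
```

A perfect ranking of all true targets in the top *n* positions yields a value of 1.0.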

We also calculate precision and recall values in the top one to fifteen ranked targets for each test compound. The precision value for a test compound is defined as the fraction of correct targets among the top-*n* ranked targets (where *n* ranges from one to fifteen). Note that this precision is different from the uninterpolated precision described above. The recall value is defined as the number of correctly predicted targets in the top-*n* ranked predictions divided by the total number of true targets for the test compound. Note that a high recall value indicates the ability of a scheme to identify a high fraction of the true targets for a given compound in the top-*n* ranks. The final values of uninterpolated precision, precision, and recall reported in this work are averaged over all the compounds in the test set.
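A hypothetical implementation of these top-*n* precision and recall values, for one test compound:

```python
def precision_recall_at_n(ranked_targets, true_targets, n):
    """Precision and recall over the top-n ranked targets.

    Precision: fraction of the top-n ranks occupied by true targets.
    Recall: fraction of all true targets recovered within the top-n.
    """
    top_n = ranked_targets[:n]
    correct = sum(1 for t in top_n if t in true_targets)
    return correct / n, correct / len(true_targets)
```

The reported figures would average these per-compound values over the whole test set.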

We used box plots to compare the relative performance of the different methods. These box plots were derived from the performance achieved on each of the ten independent test sets in the ten-fold cross-validation experiments. Using these box plots, the relative performance of two methods was assessed by comparing the first (lower) quartile (*q*1) of one method against the median of the other. Specifically, we consider that method *A* performs *relatively better* than method *B* if the first quartile of *A* is higher than the median of *B*. Note that this approach only provides a qualitative way to compare the relative performance of two schemes and does not convey any indication of statistical significance. We resorted to this approach because the results of the ten-fold cross-validation are not entirely independent of each other (the training compounds overlap), and as such the traditional statistical significance tests cannot be used.^{48}
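This qualitative comparison rule can be sketched with Python's standard `statistics` module (the function name and scores are illustrative):

```python
import statistics

def relatively_better(scores_a, scores_b):
    """Method A is 'relatively better' than B if A's first (lower)
    quartile exceeds B's median. A purely qualitative rule; it implies
    no statistical significance."""
    q1_a = statistics.quantiles(scores_a, n=4)[0]   # first quartile of A
    return q1_a > statistics.median(scores_b)
```

Each score list would hold one performance value per cross-validation test set.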

#### 5.2.5 Model Selection

The performance of SVM depends on the parameter that controls the trade-off between the margin and the misclassification cost (the "C" parameter in *SV M*^{light}^{49}), whereas the performance of the ranking perceptron depends on the margin *β* in Algorithm 1.

We perform a model-selection (parameter-selection) step. To perform this exercise fairly, we split our test set into two equal halves with similar distributions, sets A and B. Using set A, we vary the controlling parameters and select the best performing model on A. We then use this selected model to compute the accuracy on set B. We repeat these steps with the roles of A and B switched; the final results are the average of the two runs. When using the *SV M*^{light} program we let C take values from the set {0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 15.0, 25.0}. When using the perceptron algorithm we let the margin *β* take values from the set {0.00001, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 1.0, 2.0, 5.0, 10.0}.
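The half-split selection protocol can be sketched as follows; `evaluate(p, subset)` is an assumed hook that trains a model with parameter `p` and returns its accuracy on `subset`:

```python
def swap_halves_selection(items, grid, evaluate):
    """Pick the best parameter on one half of the test set, score it on
    the other half, swap the roles, and average the two runs.

    `items` is the test set, `grid` the candidate parameter values, and
    `evaluate` the (assumed) train-and-score hook.
    """
    half = len(items) // 2
    A, B = items[:half], items[half:]
    scores = []
    for select_on, hold_out in ((A, B), (B, A)):
        best = max(grid, key=lambda p: evaluate(p, select_on))
        scores.append(evaluate(best, hold_out))
    return sum(scores) / 2
```

This keeps the parameter choice and the reported accuracy on disjoint halves in each run.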

For SVM2R, we experiment with the kernel functions described by Equations 3 and 4. The kernel function described by Equation 4 requires no parameter tuning. For the kernel function described by Equation 3 we experiment with five different values of the parameter *α* (*α* = 0.2, 0.4, 0.5, 0.6, and 0.8). Since Equation 3 with *α* = 0.5 showed the best performance, we only present results for SVM2R using this kernel and parameter value.

## 6 Results

In this section, we will evaluate the performance of the BAYESIAN scheme as well as the four methods described in Section 4: SVMR, SVM2R, RP, and SVMRP.

We compare these methods along three directions. The first direction compares the overall performance of the methods on the entire dataset. The second direction examines the effect of the number of categories that a compound belongs to on the results obtained. We utilize the uninterpolated precision described in Section 5.2.4 as our metric for these comparisons. The third direction compares the five methods on their ability to retrieve all the relevant categories in the top-*n* ranks. In order to evaluate and compare the performance of these category ranking algorithms for this task we utilize two measures: precision in the top-*n* and recall in the top-*n* (also described in Section 5.2.4). Finally, we also compare the methods on the time they take to train the final ranking model.

### 6.1 Overall Comparison

Table 2 compares the uninterpolated precision of the five methods using the ECFP4 descriptor-space for each of the ten test sets. These results are derived using compounds that bind to at least one target (*L* ≥ 1) and therefore include all the compounds in the dataset. Figure 4(a) shows the box plot results corresponding to *L* ≥ 1.

From Table 2 it can be observed that the best performing schemes over the ten test sets are SVM2R and SVMRP; both are better than all the other methods tested in our work over these ten splits. Comparing the two shows that SVM2R performs better than SVMRP. However, from Figure 4(a) it can be observed that this difference is not substantial, as the variances in the results of SVM2R and SVMRP overlap considerably. Similarly, the next two methods, SVMR and RP, show equivalent performance as observed through the box plots. Note that the variance in the results of SVM2R and SVMRP (Figure 4(a)) is much smaller than that of their single-stage counterparts, SVMR and RP, respectively. Finally, all four of the above methods are significantly better than the BAYESIAN approach.

Table 2 also indicates that the absolute gain in performance achieved by SVM2R over the other methods is not very high (with the exception of its performance over the BAYESIAN scheme). The gains are relatively modest, with SVM2R gaining 0.4% over SVMRP, 2.0% over RP, and nearly 1.5% over SVMR on average. Similarly, SVMRP achieves a gain of only about 1.5% over RP and about 1.2% over SVMR. However, the gains are consistent across all ten test sets, highlighting the power of SVM2R and SVMRP in capturing the interclass dependencies.

Finally, it can be observed from Table 2 and Figure 4(a) that RP performs as well as SVMR (an insignificant difference of 0.6% between the two methods). RP is a simple linear learning method that does not guarantee convergence. However, it tries to capture dependencies between the different categories by explicitly trying to rank them in the context of a given test compound. SVMR, on the other hand, is based on the powerful SVM methodology and employs the Tanimoto kernel function, shown to be the most effective kernel function for chemical compounds.^{40} However, it builds independent one-vs-rest classifiers that fail to capture dependencies among the different categories and therefore does not perform better than RP.

### 6.2 Effect of the number of categories

We also investigated the effect of the number of categories (*L*) that a compound belongs to on the performance of the different methods. This type of evaluation can provide insights on the ability of these methods to identify compounds that hit more than one target and may pose problems in terms of toxicity or promiscuity.

Table 3 summarizes the average performance of the five different methods over subsets of compounds that belong to *L* or more targets in the ten test sets utilizing the uninterpolated precision metric. The first row in this table corresponds to *L* ≥ 1, which is the set of all the compounds in the dataset. Therefore the average results in the first row of Table 3 are identical to the averages reported in Table 2. The subsequent rows show the performance of different methods over compounds that belong to two or more categories. We also utilize Figures 4(a)-4(d) to compare the performance across the methods for different values of *L* and the variability within each one of them.


From this table it can be observed that, in general, as the number of targets (*L*) that a compound belongs to increases, the methods that try to capture dependencies among the different categories (SVM2R, RP, and SVMRP) perform better than the SVMR method, which does not.

Moreover, as *L* increases from one to four, the edge of SVM2R over the two ranking perceptron based schemes disappears, and SVMRP achieves the best results (Table 3 and Figures 4(a)-4(d)). However, as seen from Figures 4(a) to 4(d), the within-method variability increases considerably for *L* ≥ 3 and *L* ≥ 4. Finally, SVMR, the simple sorting based scheme, which performs slightly better than RP in absolute terms for *L* ≥ 1, performs worse than RP for *L* ≥ 2, 3, and 4.

Overall, these results indicate that the schemes that capture interclass dependencies tend to outperform schemes that do not for compounds that belong to more than one category.

### 6.3 Ability to retrieve all relevant categories

In this section we compare the ability of the five methods to identify the relevant categories in the top-*n* ranks. We analyze the results along this direction because it directly corresponds to the use-case scenario in which a user looks at the top-*n* predicted targets for a test compound and further studies or analyzes them for toxicity, promiscuity, off-target effects, pathway analysis, *etc*. If all the true targets fall in the top-*n* ranks, there is a high likelihood of successfully recognizing them in such an analysis.

For this comparison we utilize the precision and recall metrics in the top-*n* for each of the five schemes, as shown in Figures 5(a) and 5(b). These figures show the actual precision and recall values in the top-*n* as *n* varies from one to fifteen. They are obtained by averaging the precision and recall results over all the test sets used in this study; therefore, these results are averaged over all 27,205 compounds present in the dataset.

A number of trends can be observed from these figures. First, for identifying one of the correct categories or targets in the top-1 predictions, SVM2R outperforms all the other schemes in terms of both precision and recall. It is followed by SVMRP, RP, and SVMR, in that order. Second, BAYESIAN is still the worst performing method among the five compared, and the performance order of the schemes is exactly the same as the performance order for the uninterpolated results in Table 2.

However, as *n* increases from one to fifteen, the precision and recall results indicate that the best performing scheme is now SVMRP, which outperforms all the other schemes for both precision and recall. It is followed by RP, which outperforms the other three schemes (BAYESIAN, SVMR, SVM2R) for both precision and recall in the top fifteen. The performance of RP is followed by SVM2R, SVMR, and finally BAYESIAN. The ranking perceptron based methods achieve average recalls of approximately 0.96 and 0.95 (SVMRP and RP, respectively) for *n* = 15, better than the other schemes over the ten test sets (average recall rates of 0.90, 0.89, and 0.76 for SVM2R, SVMR, and BAYESIAN, respectively). Moreover, the values in Figure 5(b) show that as *n* increases from one to fifteen, the ranking perceptron based schemes perform consistently better than the others in identifying all the correct categories. The two ranking perceptron based schemes also achieve average precision values that are significantly better than those of the other schemes in the top fifteen (Figure 5(a)).

In summary, these results indicate that the ranking perceptron based methods, because of their higher recall, tend to find more of the correct categories in the top ranks than the other schemes. Thus, these methods offer a better chance of finding and analyzing the true targets of a chemical compound.

### 6.4 Performance on Dissimilar Compounds

To evaluate how well the models learned by the different methods generalize to compounds that are structurally different from those used for training, we partitioned the compounds into ten clusters and then used a leave-one-cluster-out cross-validation approach^{44} to assess the performance of the different methods. The clustering was computed using the direct *k*-way clustering algorithm of CLUTO^{45} with cosine similarity and its default clustering criterion function. The size and various characteristics of the resulting clusters, as they relate to the inter- and intra-cluster similarities, are shown in Table 4.

Table 5 shows the average uninterpolated precision achieved by the different methods on the ten clusters. From these results we can see that even though the absolute performance achieved by the methods developed in this paper is lower than the corresponding performance reported in Table 2, it is still considerably higher than what would be achieved by a random predictor (which, for the number of classes involved, is close to zero), and it is still much better than the performance achieved by the Bayesian approach. In fact, the relative performance gains achieved by the SVM- and ranking perceptron-based methods over the Bayesian approach are higher in this experiment than in the earlier one. These results indicate that the methods we developed not only effectively predict the targets of compounds drawn from the same overall distribution as those used for training (Table 2) but also generalize to compounds that are structurally different from those whose activity information is already known. Finally, comparing the relative performance of the different methods, the results of Table 5 show that SVM2R performs better than the rest of the schemes and that the approaches based on the ranking perceptron do not perform as well as they did in the earlier experiments.

### 6.5 Computational Complexity

The complexity of the ranking perceptron based methods depends on the number of compounds as well as the number of targets. If the number of compounds is *M* and the number of targets is *N*, our ranking perceptron iterates over the dataset until the stopping criterion is satisfied (Algorithm 1). The complexity of the outer loop on line 4 of Algorithm 1 is *O*(*M*). In the worst case, each compound has to resolve *O*(*N*^{2}) constraints in the inner loops on lines 8 and 13. Therefore, each iteration of the algorithm has a complexity of *O*(*MN*^{2}).
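To illustrate where the *O*(*MN*^{2}) per-iteration cost comes from, here is a schematic single pass of a margin-based ranking perceptron. This is a sketch consistent with the complexity argument, not a reproduction of Algorithm 1; the additive update rule and the `dot` hook are assumptions:

```python
def ranking_perceptron_pass(X, Y, W, beta, dot):
    """One pass over M compounds: every (relevant, irrelevant) target
    pair whose score difference violates the margin `beta` triggers an
    update, giving O(M * N^2) pair checks per iteration in the worst
    case. W holds one weight vector per target."""
    updates = 0
    for x, relevant in zip(X, Y):                # O(M) outer loop
        for r in relevant:                       # relevant targets ...
            for s in range(len(W)):              # ... vs. every other target
                if s in relevant:
                    continue
                if dot(W[r], x) - dot(W[s], x) < beta:   # margin violated
                    W[r] = [wr + xi for wr, xi in zip(W[r], x)]
                    W[s] = [ws - xi for ws, xi in zip(W[s], x)]
                    updates += 1
    return updates
```

The two inner loops over target pairs are the source of the quadratic dependence on *N*.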

The BAYESIAN approach has linear complexity with respect to the number of compounds and targets, as for each feature it computes the probability of occurrence in every target using all the compounds in the dataset. This can be computed efficiently in *O*(*M*) time. Lastly, the computational complexity of the SVM algorithm depends on the implementation as well as the kernel employed. For our setting, we use the *SV M*^{light} implementation^{49} in a one-vs-rest setting with the Tanimoto kernel. The computational complexity is quadratic with respect to the number of compounds^{49} and linear with respect to the number of targets. Therefore, the final complexity of the SVM algorithm is *O*(*M*^{2}*N*).

We also compare the actual run-time performance of the methods we developed and the BAYESIAN method on 64-bit AMD Opteron machines. We report both the training and test set run-times for each method. Table 6 summarizes the run-time results for one of the ten cross-validation folds for each of the five methods. The run-times for the other cross-validation folds were found to be very similar.

It can be observed from this table that the two-stage methods, particularly SVM2R, have the longest training times among all the methods. SVM2R is followed by SVMRP, SVMR, RP, and BAYESIAN in decreasing order of training time. Therefore, for training models, BAYESIAN is the fastest of all the methods. This is not surprising, as the BAYESIAN scheme does not perform any direct optimization. However, as seen in the preceding sections, it is also the worst performing scheme by a significant margin.

Looking at the time taken to test 2,720 compounds (Table 6), it can again be observed that SVM2R has the longest run-time, followed by SVMR, SVMRP, BAYESIAN, and RP, in that order. Thus, the fastest methods are RP and BAYESIAN, as they only require a dot product of each test compound with the weight vector of each of the 231 classes.

## 7 Discussion and Conclusion

SVM based methods have been extensively employed for the tasks of virtual screening, classification, and selectivity analysis of chemical compounds, and their performance has been assessed for these tasks.^{39}^{,}^{50}^{-}^{52} For example, Wassermann and co-workers^{52} showed that SVM yields better predictions for selectivity analysis than either nearest-neighbor based or BAYESIAN methods. Similarly, Glick and co-workers^{50} showed that SVM outperformed the BAYESIAN scheme when performance was measured as the enrichment of actives in the top 1% of high-throughput screening data. However, they also found that the performance of SVM and BAYESIAN was nearly identical when the noise level was significantly increased. In this work, we proposed methods based on SVMs and ranking perceptrons for the task of Target Fishing. Extensive experiments and analysis showed that our methods are either as good as or superior to the current state of the art.

However, a number of issues still need to be addressed. First, in this work we assumed compounds with no activity information against a target to be inactive against that target. The primary reason for this assumption was that almost no inactivity data was available for many of the targets in our dataset, which was collected from publicly available sources. A similar assumption has been made by previous studies.^{13} However, this is not a satisfactory assumption from the point of view of drug discovery, as it may cause the method to miss some rare compound-target activity. Therefore, in our future work, we will try to address the issue of unknown compound-target activity explicitly in order to arrive at practical methods for drug discovery.

Second, in this work we have not utilized target information (for example, target sequence, structure, or family) in building the ranking perceptron or SVM based models. Identification of key characteristics common to two targets (similar geometry of binding sites or similar biochemical characteristics of binding residues) might identify two targets that are likely to bind the same compound. Therefore, effective solutions can be devised using both SAR data and target information. A number of recent approaches for chemogenomics utilize SAR data as well as target information to build predictive models on the target-ligand graph.^{53}^{,}^{54} Our initial studies that tried to include target information did not yield promising results for the problem of target fishing; they were found to be no better than the SAR-data-only approaches proposed in this paper, so we did not pursue approaches that include target information in this work. However, we believe that exploring a good way of including target information is a worthwhile effort, and we plan to investigate it rigorously as part of our future work.

Finally, recent approaches have shown that interclass dependencies can be learned within a structural learning framework that utilizes structural SVMs.^{20}^{,}^{28}^{,}^{55} We also experimented with this approach using the SVM-struct algorithm.^{55} However, our preliminary results showed that it did not yield any significant gains over the ranking perceptron based approach for the large number of categories in our domain. Moreover, the computational cost of structural learning was much higher than that of the ranking perceptrons. Therefore we did not pursue the structural SVM based approach in the present work.

## Acknowledgements

This work was supported by NSF ACI-0133464, IIS-0431135, NIH RLM008713A, and by the Digital Technology Center at the University of Minnesota.

## References
