Multi-task bioassay pre-training for protein-ligand binding affinity prediction

Abstract
Protein–ligand binding affinity (PLBA) prediction is a fundamental task in drug discovery. Recently, various deep learning-based models have predicted binding affinity by taking the three-dimensional (3D) structure of protein–ligand complexes as input and have achieved notable progress. However, due to the scarcity of high-quality training data, the generalization ability of current models is still limited. Although a vast amount of affinity data is available in large-scale databases such as ChEMBL, issues such as inconsistent affinity measurement labels (i.e. IC50, Ki, Kd), different experimental conditions, and the lack of available 3D binding structures complicate the development of high-precision affinity prediction models using these data. To address these issues, we (i) propose Multi-task Bioassay Pre-training (MBP), a pre-training framework for structure-based PLBA prediction; and (ii) construct a pre-training dataset called ChEMBL-Dock with more than 300k experimentally measured affinity labels and about 2.8M docked 3D structures. By introducing multi-task pre-training to treat the prediction of different affinity labels as different tasks and by classifying relative rankings between samples from the same bioassay, MBP learns robust and transferable structural knowledge from our new ChEMBL-Dock dataset with varied and noisy labels. Experiments substantiate the capability of MBP on the structure-based PLBA prediction task. To the best of our knowledge, MBP is the first affinity pre-training model and shows great potential for future development. The MBP web server is freely available at: https://huggingface.co/spaces/jiaxianustc/mbp.


Introduction
Protein-ligand binding affinity (PLBA) is a measurement of the strength of the interaction between a target protein and a ligand drug [1]. Accurate and efficient PLBA prediction is the central task for the discovery and design of effective drug molecules in silico [2]. Traditional computer-aided drug discovery tools use scoring functions (SFs) to estimate PLBA roughly [3], but their accuracy is low. Molecular dynamics simulation methods can achieve more accurate binding energy estimation [4], but these methods are typically expensive in terms of computational resources and time. In recent years, deep learning (DL) models have been widely used to predict PLBA and are regarded as promising tools for accurate and rapid prediction. Based on the accumulated biological data, a series of DL-based scoring functions have been built, such as Pafnucy [5], OnionNet [6], Transformer-CPI [7], IGN [8], and SIGN [9]. In particular, structure-based DL models that use the 3D structure of protein-ligand complexes as inputs are the most successful; they typically use 3D convolutional neural networks (3D-CNNs) [10; 11; 12] or graph neural networks (GNNs) [8; 9] to model and extract the interactions within the protein-ligand complex structures. However, the generalizability of these data-driven DL models is limited because the number of high-quality samples in PDBbind used for model training is relatively small (approximately 5,000) [13].
Figure 1. (1) The top three panels show an example where the same protein-ligand pair has different binding affinities in assays 1-3 in terms of measurement type (IC50 vs. Ki) and value (IC50 = 10 nM vs. IC50 = 7600 nM). (2) The bottom two panels show an example of the binding of different ligands (Ligand 1 and 2) to a protein in the same assay (assay 3).
One solution to this problem is pre-training, which has been widely used in computational biology, such as molecular pre-training for compound property prediction [14; 15; 16; 17] and protein pre-training for protein folding [18; 19; 20; 21]. These pre-training models utilize data from large-scale datasets to learn embeddings, which expand the ligand chemical space and protein diversity. Therefore, pre-training affinity models on the large amount of affinity data in databases such as ChEMBL [22] and BindingDB [23] can be helpful. Nevertheless, although attempts have been made to use the data directly, such as BatchDTA [24], several challenges have so far prevented researchers from widely using ChEMBL data for PLBA. Firstly, the data were collected from various bioassays, which introduce different systematic biases and noise into the data and make comparison difficult [24; 25] (label noise problem). In some cases, the affinities of the same protein-ligand pair in different bioassays can differ by several orders of magnitude (Fig. 1). Secondly, several types of affinity measurement exist (label variety problem), such as half-maximal inhibitory concentration (IC50), inhibition constant (Ki), inhibition ratio, dissociation constant (Kd), and half-maximal effective concentration (EC50), which also cannot be compared directly. Thirdly, the unavailability of 3D structures of protein-ligand complexes within ChEMBL poses a significant limitation for researchers in training and leveraging structure-based DL models (missing conformation problem).
To solve the above problems, we propose the Multi-task Bioassay Pre-training (MBP) framework for structure-based PLBA prediction models. In general, by introducing multi-task learning and pairwise ranking within bioassay samples, MBP can make use of the noisy data in databases like ChEMBL. Specifically, the multi-task learning strategy [26] treats the prediction of different label measurement types (IC50/Ki/Kd) as different tasks, thus enabling information extraction from related but different affinity measurements. Meanwhile, although different assays can introduce different types of noise into the data, data from the same assay are relatively more comparable. Inspired by recent progress in recommendation systems [27; 28; 29], we consider the ranking between samples from the same assay, which forces the model to learn the relative relationship of samples and the differences in protein-ligand interactions. This allows MBP to learn robust and transferable structural knowledge beyond the noisy labels.
We then construct a pre-training dataset, ChEMBL-Dock, for MBP. ChEMBL-Dock contains 313,224 protein-ligand pairs from 21,686 assays, together with the corresponding experimental PLBA labels (IC50/Ki/Kd). Molecular docking software is employed to generate about 2.8M docked 3D complex structures in ChEMBL-Dock. We then implement MBP with simple and commonly used GNN models, such as GCN [30], GIN [31], GAT [32], EGNN [33], and AttentiveFP [34]. Experiments on the PDBbind core set and the CSAR-HiQ dataset show that, with MBP, even simple models can be improved and achieve performance comparable to or better than the state-of-the-art (SOTA) models. Through ablation studies, we further validate the importance of the multi-task strategy and bioassay-specific ranking in MBP.
Overall, the contributions of this paper can be summarized as follows:
• We propose the first PLBA pre-training framework MBP, which can significantly improve the accuracy and generalizability of PLBA prediction models.
• We construct a high-quality pre-training dataset, ChEMBL-Dock, based on ChEMBL, which significantly enlarges existing PLBA datasets in terms of chemical diversities.
• We show that even vanilla GNNs can significantly outperform the previous SOTA method by following the pre-training protocol in MBP.

Related Work
Protein-Ligand Binding Affinity Prediction. One critical step in drug discovery is scoring and ranking the predicted protein-ligand binding affinity. Scoring functions can be roughly divided into four main types: force-field-based, empirical-based, machine-learning-based, and 3D-structure-based [9]. Force-field-based methods aim at estimating the free energy of binding by using the first principles of statistical mechanics [6]. Despite their remarkable performance as a gold standard, they suffer from high computational overhead. Empirical-based methods [35; 36; 37] design docking and scoring functions specifically to make affinity predictions, but expert domain knowledge is needed to encode internal biochemical interactions. Machine-learning-based methods, such as random forest [38] and support vector machines (SVM) [39], aim to predict binding affinity based on a data-driven learning paradigm. These methods, however, rely on the quality of hand-crafted features and generalize poorly. Recently, due to advances in deep learning methods and the creation of structure-based protein-ligand complex datasets, many structure-based deep learning methods [40; 6; 41; 42; 15; 43; 44; 8; 9] have been developed for predicting binding affinity. Such methods directly learn the structural information of protein-ligand complexes end-to-end, avoiding manual feature design. However, due to the scarcity of high-quality training data, current methods still suffer from poor generalization in real applications.
Datasets of Protein-Ligand Binding Affinity. Existing protein-ligand binding affinity datasets can be roughly divided into three categories. The first category includes datasets such as PDBbind, BindingMOAD [45], and CSAR-HiQ [46], which contain 3D co-crystal structures of protein-ligand complexes determined by structural characterization methods and experimentally determined binding affinity values. Such datasets are small but of high quality and are widely used for training structure-based deep learning models [47; 48]. The second category contains large-scale experimentally measured protein-ligand binding affinity labels but no 3D structures, such as ChEMBL and BindingDB. The third category contains 3D structures and binding affinity values of protein-ligand complexes calculated by molecular docking [49]; the representative database is CrossDocked [50]. Due to the lack of experimental affinity labels, such datasets are often used to train generative models rather than affinity prediction models [51].
Pre-training for Biomolecules. Much effort has been devoted to biomolecular pre-training to achieve better performance on related tasks. For small molecules and proteins, a series of self-supervised pre-training methods based on molecular graphs [14; 15; 16; 17] and protein sequences [18; 19; 20; 21] have been proposed, respectively. However, these existing pre-training methods are designed for individual molecules [52], and there is still a gap in the research on pre-training methods for protein-ligand affinity.
Pairwise Learning to Rank. Learning to Rank (LTR) is an essential research topic in many areas, such as information retrieval and recommendation systems [53; 54; 55]. The common LTR solutions fall into three types: pointwise, pairwise, and listwise. Among these, pairwise LTR models are widely used in practice due to their efficiency and effectiveness. Recent years have witnessed the success of pairwise methods, such as BPR [56], RankNet [57], GBRank [58], and RankSVM [59]. In addition, recent studies have shown that the bias between labels can be effectively addressed using pairwise methods [27].

Multi-Task Bioassay Pre-training
In this section, we formalize the problem of pre-training for PLBA and then introduce our proposed multi-task bioassay pre-training framework, MBP. After defining the problem, we provide a framework overview of our solution in Section 3.1. Then, in Section 3.2, we discuss how we develop a GNN-based model as the shared-bottom encoder in multi-task learning. Finally, we present the curation process of the ChEMBL-Dock dataset used for pre-training in Section 3.3.
Problem Formulation. Conceptually, given a protein $P$, a ligand $L$, and the binding conformation $C$ of the ligand to the protein, the problem of structure-based protein-ligand binding affinity prediction is to learn a model $f(P, L, C)$ to predict the binding affinity. However, due to the rarity and high cost of ground truth 3D structure data, the training of structure-based PLBA prediction models has to be restricted to PDBbind with co-crystal structures. In this work, we aim to leverage the ChEMBL dataset, which contains large-scale protein-ligand binding affinity data but without 3D structures. As discussed in Section 1, in order to pre-train a PLBA prediction model on ChEMBL, we have to resolve three challenges, namely label variety, label noise, and missing conformation.

Framework Overview
The framework overview of MBP is illustrated in Fig. 2, which includes three main parts: (1) pre-training data pipeline; (2) multi-task learning objectives and architecture; and (3) downstream task fine-tuning. In this section, we describe them in detail.
Pre-Training Data Pipeline. Before introducing the pre-training data pipeline, we provide the necessary definitions of a bioassay and a bioassay-specific data pair in MBP.
Definition 1 (A Bioassay in MBP). A bioassay is defined as an analytical method to determine the concentration or potency of a substance by its effect on living animals or plants (in vivo) or on living cells or tissues (in vitro) [60]. In this work, we mainly focus on bioassays measuring in vitro binding of ligands to a protein target. Formally, the $i$-th bioassay is denoted as
$$A_i = \left(P_i, \{(L_{ij}, y_{ij})\}_{j=1}^{n_i}, t_i\right).$$
That is, there are $n_i$ experimental records in bioassay $A_i$, and each record measures the binding affinity $y_{ij}$ of a ligand $L_{ij}$ to the protein target $P_i$. The type of binding affinity in bioassay $A_i$ is $t_i$; in this work we consider $t_i \in \{\text{Ki}, \text{Kd}, \text{IC50}\}$. The ChEMBL dataset can be formalized as a collection of bioassays such that $D = \{A_1, A_2, \ldots\}$.
Definition 2 (A Bioassay-specific Data Pair). A bioassay-specific data pair is a six-tuple $(P_i, L_{ij}, L_{ik}, y_{ij}, y_{ik}, t_i)$, indicating that there is a bioassay $A_i$ which includes the binding measurements of ligand $L_{ij}$ and ligand $L_{ik}$ to a protein target $P_i$. The experimentally measured binding affinities (with type $t_i$) are $y_{ij}$ and $y_{ik}$, respectively.
The bioassay-specific data pairs in this work are extracted and randomly sampled from ChEMBL bioassays [22]. We first sample a bioassay $A_i$ with probability proportional to its size, i.e., $\text{Prob}(A_i) \propto n_i$. Then we randomly pick two different ligands $L_{ij}$ and $L_{ik}$ from the sampled assay $A_i$, together with their binding affinities $y_{ij}$ and $y_{ik}$. The above sampling process produces a bioassay-specific data pair $(P_i, L_{ij}, L_{ik}, y_{ij}, y_{ik}, t_i)$ as defined in Definition 2. Taking the running case shown in Fig. 2 as an example, the data pair from a ChEMBL bioassay (with AssayId = CHEMBL1216983) can be written as (Q92769, CHEMBL1213458, CHEMBL99, 125 nM, 0.65 nM, Ki).
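The sampling procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the dictionary layout, the function name `sample_pair`, the toy second assay, and the restriction to assays with at least two records are assumptions made for the example.

```python
import random

def sample_pair(assays, rng):
    """Draw one bioassay-specific data pair (P_i, L_ij, L_ik, y_ij, y_ik, t_i)."""
    # Assumption: only assays with at least two records can yield a pair.
    eligible = [a for a in assays if len(a["records"]) >= 2]
    # Prob(A_i) is proportional to n_i, the number of records in the assay.
    weights = [len(a["records"]) for a in eligible]
    assay = rng.choices(eligible, weights=weights, k=1)[0]
    # Pick two different ligands from the sampled assay (without replacement).
    (lig_j, y_j), (lig_k, y_k) = rng.sample(assay["records"], 2)
    return (assay["protein"], lig_j, lig_k, y_j, y_k, assay["type"])

# Toy data: the first assay mirrors the running example in the text (values in nM);
# the second assay is entirely hypothetical.
assays = [
    {"protein": "Q92769", "type": "Ki",
     "records": [("CHEMBL1213458", 125.0), ("CHEMBL99", 0.65)]},
    {"protein": "P_TOY", "type": "IC50",
     "records": [("LIG_A", 10.0), ("LIG_B", 7600.0), ("LIG_C", 42.0)]},
]

pair = sample_pair(assays, random.Random(0))
print(pair)
```

Because both ligands always come from the same assay, every sampled pair shares one protein target and one measurement type, which is what makes the within-pair comparison meaningful.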
Knowledge of the binding conformation of a ligand to a target protein plays a vital role in structure-based drug design, particularly in predicting binding affinity. However, the ground truth co-crystal structure of a protein-ligand complex is experimentally very expensive to determine and is therefore not available in ChEMBL. Consequently, we only have the binding affinities (e.g., $y_{ij}$ and $y_{ik}$) of a ligand to a protein without knowing their conformation and relative orientation. To solve the above missing data problem, we propose to use computationally determined docking poses as an approximation to the true binding conformations. Specifically, we construct a large-scale docking dataset named ChEMBL-Dock from ChEMBL. For each protein-ligand pair in ChEMBL, we generate its docking poses according to the following three steps. Firstly, we use the RDKit library [61] to generate 3D conformations from the 2D SMILES of the ligand. Then, the 3D structure of a protein is extracted from PDBbind according to its UniProt ID. Finally, we use the docking software SMINA [62] to generate the docking poses of the protein-ligand pair. Throughout the rest of this paper, we denote the docking conformation of protein $P_i$ and ligand $L_{ij}$ as $C_{ij}$. The detailed data curation process of ChEMBL-Dock can be found later in Section 3.3.
Overall, the pre-training data pipeline generates bioassay-specific data pairs $(P_i, L_{ij}, L_{ik}, y_{ij}, y_{ik}, t_i)$, retrieves their docking conformations $C_{ij}$ and $C_{ik}$ from the pre-processed ChEMBL-Dock dataset, and then feeds them into a multi-task learning model, which we discuss below.
Multi-task Learning Objectives and Architecture. As discussed in Section 1, there are three main challenges in applying pre-training to the PLBA problem: missing conformation, label variety, and label noise. In this section, we propose to solve the label variety and label noise problems via multi-task learning.
For the label variety challenge, it is intuitive and straightforward to introduce label-specific tasks for each type of binding affinity measurement. In this work, we define two categories of label-specific tasks, the IC50 task and the K = {Ki, Kd} task, which handle bioassay data with affinity measurement type IC50 and Ki/Kd, respectively. Here we merge Ki and Kd into a single task following [63], for two main reasons. Firstly, Ki and Kd are calculated in the same way, except that Kd only considers the physical binding, while Ki specifies the biological effect of this binding to be inhibition. They can therefore essentially be seen as the same label type. Secondly, there are far fewer Kd data points than Ki data points, which may lead to data imbalance if we were to design a separate Kd task.
For the label noise challenge, instead of applying techniques for learning with noisy labels [64], we exploit the intrinsic characteristics of bioassay data. As discussed in Section 1, the label noise challenge in ChEMBL stems mainly from its data sources and curation process. The binding affinity values from different bioassays were measured under various experimental protocols and conditions (such as temperature and pH value), leading to systematic errors between different assays. However, the binding affinity labels within the same bioassay were usually determined under similar experimental conditions. Thus, intra-bioassay data are more consistent than inter-bioassay data, and comparison within a bioassay is much more meaningful. Inspired by these characteristics of bioassay data, we design both regression tasks and ranking tasks in MBP. More formally, given a bioassay-specific data pair $(P_i, L_{ij}, L_{ik}, y_{ij}, y_{ik}, t_i)$, the regression task is to directly predict the binding affinity $y_{ij}$, while the ranking task is to compare the binding affinity values within the bioassay, i.e., to classify whether $y_{ij} < y_{ik}$ or $y_{ij} > y_{ik}$.
In summary, we have 2 × 2 = 4 tasks in MBP, namely the IC50 regression task, the IC50 ranking task, the K regression task, and the K ranking task. An illustration of these tasks can be found in Fig. 2.
As for the multi-task learning architecture, we adopt the shared-bottom technique (also known as hard parameter sharing) [65] in MBP. Such a technique shares a bottom encoder among all tasks while keeping several task-specific heads. As illustrated in Fig. 2, the model architecture consists of a shared encoder network $f_{\text{Enc}}$ and four task-specific heads: an IC50 regression head $f_{\text{IC50},\text{Reg}}$, a K = {Ki, Kd} regression head $f_{\text{K},\text{Reg}}$, an IC50 ranking head $f_{\text{IC50},\text{Rank}}$, and a K = {Ki, Kd} ranking head $f_{\text{K},\text{Rank}}$. Given a bioassay-specific data pair $(P_i, L_{ij}, L_{ik}, y_{ij}, y_{ik}, t_i)$ together with their conformations $C_{ij}$ and $C_{ik}$ from the pre-training data pipeline, the shared bottom encoder maps them into compact hidden representations shared among tasks:
$$o_1 = f_{\text{Enc}}(P_i, L_{ij}, C_{ij}), \qquad o_2 = f_{\text{Enc}}(P_i, L_{ik}, C_{ik}).$$
There are many possibilities for implementing an encoder for protein-ligand complexes, including but not limited to models based on 3D-CNN [66; 12; 50], GNN [67; 9; 8], and Transformer [68; 69; 7].
In MBP, we propose a simple and effective shared bottom encoder. For the sake of clarity, we defer its implementation details to Section 3.2 and focus on multi-task learning in this section.
For the regression task, we pick the task-specific regression head $f_{t_i,\text{Reg}}$ according to the label type $t_i \in \{\text{IC50}, \text{K}\}$ (recall that Ki and Kd have been merged into a single label type K), and predict the binding affinity to be $\hat{y}_{ij} = f_{t_i,\text{Reg}}(o_1)$. The regression loss is calculated using the mean squared error (MSE) loss between the ground truth $y_{ij}$ and the predicted value $\hat{y}_{ij}$. More formally, the regression loss is defined as
$$\mathcal{L}_{\text{Reg}} = \left(\hat{y}_{ij} - y_{ij}\right)^2. \tag{3}$$
It is worth mentioning that for a data pair, only the label $y_{ij}$ is used to compute the regression loss.
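Similarly, for the ranking task, we select the task-specific ranking head $f_{t_i,\text{Rank}}$ according to the label type $t_i \in \{\text{IC50}, \text{K}\}$, concatenate the hidden representations as $o_1 \| o_2$, and then predict the pairwise ranking to be $\hat{r}_{ijk} = f_{t_i,\text{Rank}}(o_1 \| o_2)$. The ranking loss is calculated as the binary cross entropy loss between the ground truth $\mathbb{I}[y_{ij} > y_{ik}]$ and the predicted value $\hat{r}_{ijk}$, where $\mathbb{I}[\cdot]$ denotes the indicator function. More formally, the ranking loss is defined as
$$\mathcal{L}_{\text{Rank}} = -\Big(\mathbb{I}[y_{ij} > y_{ik}] \log \hat{r}_{ijk} + \big(1 - \mathbb{I}[y_{ij} > y_{ik}]\big) \log\big(1 - \hat{r}_{ijk}\big)\Big). \tag{4}$$
The overall loss function for a bioassay-specific data pair is a weighted sum of the regression loss in Equation 3 and the ranking loss in Equation 4:
$$\mathcal{L} = \lambda \mathcal{L}_{\text{Reg}} + \mathcal{L}_{\text{Rank}},$$
where $\lambda$ is the weight coefficient for the regression loss.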
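To make the per-pair objective concrete, the following Python sketch computes the regression, ranking, and combined losses for a single data pair. It is a minimal sketch under stated assumptions: the function names, the clipping constant `eps`, and the default `lam=1.0` are hypothetical, and the combined loss applies the weight only to the regression term, which is one reading of "weight coefficient for regression loss".

```python
import math

def task_of(label_type):
    """Map a measurement type to its task: Ki and Kd share the merged K task."""
    if label_type == "IC50":
        return "IC50"
    if label_type in ("Ki", "Kd"):
        return "K"
    raise ValueError(f"unsupported label type: {label_type}")

def regression_loss(y_pred, y_true):
    """Squared error between the predicted and measured affinity of ligand j."""
    return (y_pred - y_true) ** 2

def ranking_loss(r_pred, y_j, y_k, eps=1e-12):
    """Binary cross entropy between I[y_j > y_k] and the predicted probability."""
    target = 1.0 if y_j > y_k else 0.0
    r = min(max(r_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(target * math.log(r) + (1.0 - target) * math.log(1.0 - r))

def pair_loss(y_pred, r_pred, y_j, y_k, lam=1.0):
    """Weighted sum for one pair; lam weights the regression term (assumed form)."""
    return lam * regression_loss(y_pred, y_j) + ranking_loss(r_pred, y_j, y_k)

# Toy values: affinity labels 7.2 and 5.1, predicted 6.9, ranking probability 0.8.
print(task_of("Kd"), pair_loss(6.9, 0.8, 7.2, 5.1, lam=0.5))
```

Note that only the first ligand's label enters the regression term, while both labels enter the ranking target, matching the description of the two losses.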
Overall, we introduce multi-task learning into MBP, aiming to deal with the label variety and label noise problems. In the illustrative example of MBP shown in Fig. 2, MBP accepts the bioassay-specific data pair (Q92769, CHEMBL1213458, CHEMBL99, 125 nM, 0.65 nM, Ki) and their docking poses as inputs, encodes them into hidden representations, forwards the K regression head to predict $y_{ij}$ = 125 nM, and also forwards the K ranking head to classify 125 nM > 0.65 nM.
Downstream Task Fine-Tuning. The final part of the MBP framework is the downstream task fine-tuning. Given the 3D structure of a protein-ligand complex as input, the downstream task is to predict its binding affinity. The 3D structure can be either an experimentally determined co-crystal structure or a computationally determined docking pose. We transfer and fine-tune the shared bottom encoder $f_{\text{Enc}}$ together with the regression heads $f_{\text{IC50},\text{Reg}}$ and $f_{\text{K},\text{Reg}}$ on downstream protein-ligand binding affinity datasets (such as PDBbind). The right panel of Fig. 2 shows how the transferred model predicts the Ki value for a protein-ligand complex from PDBbind (PDB ID=3g2y).

Shared Bottom Encoder
For large-scale pre-training, a simple and effective backbone model is of utmost importance. Thus, we design the shared bottom encoder based only on vanilla GNN models. To simplify, we assume the input of the shared bottom encoder is a 3-tuple $(P, L, C)$, indicating a protein $P$, a ligand $L$, and their binding conformation $C$.
Representing Protein-Ligand Complex as Multi-Graphs. The input protein-ligand complex $(P, L, C)$ is processed into three graphs: a ligand graph, a protein graph, and a protein-ligand interaction graph. We formally define the three graphs as follows.
Definition 3 (Ligand Graph). A ligand graph, denoted by $G^L = (V^L, E^L)$, is constructed from the input ligand $L$. $V^L$ is the node set where node $i$ represents the $i$-th atom in the ligand. Each node $i$ is also associated with (1) the atom coordinate $c^L_i$ retrieved from the binding conformation $C$ and (2) the atom feature vector $x^L_i$. The edge set $E^L$ is constructed according to the spatial distances among atoms. More formally, the edge set is defined to be
$$E^L = \left\{(i, j) : \|c^L_i - c^L_j\|_2 < \text{cut}^L,\ i, j \in V^L\right\},$$
where $\text{cut}^L$ is a distance threshold, and each edge $(i, j) \in E^L$ is associated with an edge feature vector $e^L_{ij}$. The node and edge features are obtained by Open Babel [70].
Definition 4 (Protein Graph). A protein graph, denoted by $G^P = (V^P, E^P)$, is constructed from the input protein $P$. $V^P$ is the node set where node $i$ represents the $i$-th residue in the protein. Each node $i$ is also associated with (1) the alpha carbon coordinate of the $i$-th residue $c^P_i$ retrieved from the binding conformation $C$ and (2) the residue feature vector $x^P_i$. The edge set $E^P$ is constructed according to the spatial distances among residues. More formally, the edge set is defined to be
$$E^P = \left\{(i, j) : \|c^P_i - c^P_j\|_2 < \text{cut}^P,\ i, j \in V^P\right\},$$
where $\text{cut}^P$ is a distance threshold, and each edge $(i, j) \in E^P$ is associated with an edge feature vector $e^P_{ij}$. The node and edge features are obtained following [71].
Definition 5 (Interaction Graph). The protein-ligand interaction graph $G^I = (V^P, V^L, E^I)$ is a bipartite graph constructed based on the protein-ligand complex, whose node set is the union of protein residues $V^P$ and ligand atoms $V^L$. The edge set $E^I$ models the protein-ligand interactions according to spatial distances. More formally,
$$E^I = \left\{(i, j) : \|c^P_i - c^L_j\|_2 < \text{cut}^I,\ i \in V^P,\ j \in V^L\right\},$$
where $\text{cut}^I$ is a spatial distance threshold for interaction, and each edge $(i, j) \in E^I$ is associated with an edge feature vector $e^I_{ij}$. The edge features are obtained following [71].
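The distance-threshold construction of the three edge sets can be illustrated with a short Python sketch. The coordinates and the cutoff values below are hypothetical toy numbers, and details such as excluding self-loops and using a strict inequality are assumptions of this sketch rather than choices confirmed by the text.

```python
import math

def within(a, b, cutoff):
    """True if two 3D points are closer than the cutoff (Euclidean distance)."""
    return math.dist(a, b) < cutoff

def intra_edges(coords, cutoff):
    """Directed edges (i, j), i != j, inside one graph (ligand or protein)."""
    n = len(coords)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and within(coords[i], coords[j], cutoff)]

def interaction_edges(protein_coords, ligand_coords, cutoff):
    """Bipartite edges (i, j) from protein residue i to ligand atom j."""
    return [(i, j)
            for i, p in enumerate(protein_coords)
            for j, l in enumerate(ligand_coords)
            if within(p, l, cutoff)]

# Toy coordinates in angstroms (hypothetical): three ligand atoms, two alpha carbons.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (10.0, 0.0, 0.0)]
protein = [(0.0, 0.0, 4.0), (20.0, 0.0, 0.0)]

print(intra_edges(ligand, cutoff=5.0))                 # ligand graph edges
print(intra_edges(protein, cutoff=8.0))                # protein graph edges
print(interaction_edges(protein, ligand, cutoff=5.0))  # interaction graph edges
```

In this toy case only the first residue lies close enough to the first two ligand atoms to form interaction edges, so the bipartite graph captures just the contacts near the binding site.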
Encoding Module (Ligand/Protein Encoder). Having represented the protein-ligand complex as multi-graphs, we feed the ligand graph $G^L$ and the protein graph $G^P$ into the ligand encoder and the protein encoder, respectively, aiming to extract informative node representations. More formally, taking $G^L$ and $G^P$ as inputs, we have
$$H^L = \text{LigandEncoder}(G^L), \qquad H^P = \text{ProteinEncoder}(G^P).$$
Here $H^L$ is the ligand embedding matrix of shape $|V^L| \times d$, and the $i$-th row of $H^L$, denoted by $h^L_i$, represents the embedding of the $i$-th ligand atom. Similarly, $H^P$ is the protein embedding matrix of shape $|V^P| \times d$, and the $i$-th row of $H^P$, denoted by $h^P_i$, represents the embedding of the $i$-th protein residue.
Encoders used here can be any GNN model, such as GCN, GAT, GIN, EGNN, or AttentiveFP. Here we briefly review GNNs under the message-passing paradigm, following [77] and [78]. For simplicity and convenience, we assume that the GNN operates on a graph $G$ with node features $x_i$ and edge features $e_{ij}$, and temporarily ignore whether it is a ligand or protein graph. The message-passing process runs for several iterations. At the $\ell$-th iteration, the message passing is defined according to a message function $M_\ell$, an aggregation function $\text{AGGREGATOR}_\ell$, and an update function $U_\ell$:
$$m_i^{(\ell)} = \text{AGGREGATOR}_\ell\left(\left\{M_\ell\big(h_i^{(\ell-1)}, h_j^{(\ell-1)}, e_{ij}\big) : j \in N(i)\right\}\right), \qquad h_i^{(\ell)} = U_\ell\big(h_i^{(\ell-1)}, m_i^{(\ell)}\big),$$
where $N(i)$ is the set of neighbors of node $i$ and $h_i^{(0)} = x_i$. Finally, after $n$ iterations of message passing, we sum up the node representations of each layer to get the final node representation, i.e., $h_i = \sum_{\ell=1}^{n} h_i^{(\ell)}$. GNNs with such layer-wise aggregation are also known as jumping knowledge networks [79].
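As a toy illustration of message passing with layer-wise summation, the sketch below runs a deliberately simple GNN on scalar node features. The specific choices are assumptions made for brevity and do not correspond to any of the GNNs named above: the message is the neighbor's current state, the aggregator is a sum, and the update averages a node's state with its aggregated message.

```python
def message_passing(features, edges, n_layers):
    """Toy message passing on scalar node features with layer-wise summation.

    Assumed (illustrative) choices: message = neighbor state, aggregator = sum,
    update = average of the node's own state and its aggregated message.
    """
    neighbors = {i: [] for i in range(len(features))}
    for i, j in edges:
        neighbors[i].append(j)  # j sends a message to i

    h = list(features)              # layer-0 states are the input features
    final = [0.0] * len(features)   # running sum over layers 1..n
    for _ in range(n_layers):
        msgs = [sum(h[j] for j in neighbors[i]) for i in range(len(h))]
        h = [0.5 * (h[i] + msgs[i]) for i in range(len(h))]
        final = [f + x for f, x in zip(final, h)]
    return final

# Path graph 0 - 1 - 2 with a signal on node 0 only.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
print(message_passing([1.0, 0.0, 0.0], edges, n_layers=2))  # [1.0, 1.0, 0.25]
```

After one layer the signal has reached node 1 only; after two layers it reaches node 2, and summing over layers keeps both the local and the more distant views in the final representation.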
Interacting Module. After extracting protein residue embeddings $h^P_i$ and ligand atom embeddings $h^L_j$ from the encoding module, the interacting module is designed to conduct knowledge fusion according to the protein-ligand interaction graph. For each protein-ligand interaction edge $(i, j) \in E^I$, we define its interaction embedding as the concatenation of the protein residue embedding $h^P_i$, the ligand atom embedding $h^L_j$, and the transformed edge features. More formally,
$$h^I_{ij} = \text{MLP}\left(h^P_i \,\|\, h^L_j \,\|\, \text{FC}(e^I_{ij})\right),$$
where $\|$ is the concatenation operator, MLP is a multilayer perceptron, and FC is a fully-connected layer.
Read-Out Module. After obtaining the interaction embeddings $h^I_{ij}$ for each protein-ligand interaction edge $(i, j) \in E^I$, we further apply an attention-based weighted sum over these embeddings to read out a global embedding $o_{\text{sum}}$ for the whole protein-ligand complex, where the attention weights are computed from an attention vector $w$ and the hyperbolic tangent function $\tanh$. In addition, a global maximum pooling operation is adopted to highlight the most informative interaction embedding, i.e., $o_{\max} = \text{MaxPool}(h^I_{ij})$. We concatenate the above two graph-level embeddings to form the final graph embedding for the protein-ligand complex $(P, L, C)$, i.e., $o = o_{\text{sum}} \| o_{\max}$.
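A possible shape of this read-out is sketched below in plain Python. The exact attention formula is not recoverable from the text here, so the sketch assumes one common variant: scores $w^\top \tanh(h)$ normalized with a softmax over all interaction edges. The function name `readout` and the toy inputs are hypothetical.

```python
import math

def readout(h_edges, w):
    """Concatenate an attention-weighted sum and a max-pool over edge embeddings.

    h_edges: list of interaction embeddings (each a list of d floats).
    w: attention vector of length d.
    Assumption: attention scores w . tanh(h) are normalized with a softmax.
    """
    d = len(w)
    scores = [sum(wi * math.tanh(x) for wi, x in zip(w, h)) for h in h_edges]
    top = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    o_sum = [sum(a * h[k] for a, h in zip(alphas, h_edges)) for k in range(d)]
    o_max = [max(h[k] for h in h_edges) for k in range(d)]
    return o_sum + o_max                     # list concatenation: o_sum || o_max

# Three toy interaction embeddings with d = 2.
embeddings = [[0.2, -1.0], [1.5, 0.3], [-0.4, 0.8]]
o = readout(embeddings, w=[1.0, 0.5])
print(len(o), o)
```

The output has dimension 2d regardless of the number of interaction edges, which is what allows complexes of different sizes to share the same task heads.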

Pre-training Dataset: ChEMBL-Dock
ChEMBL-Dock is a self-constructed dataset used for pre-training MBP.The detailed ChEMBL-based curation workflow, including the data collection & cleaning step and molecular docking step, can be found in Appendix D. Here, we compare it to other commonly used protein-ligand complex datasets in terms of the label, 3D structure, protein diversity, molecular diversity, and dataset size in Fig. S1 and Table 1.Combining the strengths of molecular docking and ChEMBL, ChEMBL-Dock provides a large-scale 3D protein-ligand complex dataset with corresponding experimental affinity labels.While the quality of the docked 3D structures of the complexes in ChEMBL-Dock is not as high as that of the 3D co-crystal structures of the protein-ligand complexes in PDBbind, ChEMBL-Dock provides a much larger number of 3D structures of protein-ligand complexes than the PDBbind database.By comparing ChEMBL-Dock and CrossDocked, two datasets generated through molecular docking, it is evident that ChEMBL-Dock exhibits a higher molecular diversity than CrossDocked, suggesting its potential to provide a more comprehensive dataset for drug discovery research.

Experiments
Experimental Setup
Downstream Datasets. Two publicly available datasets are used to comprehensively evaluate the performance of models.
• PDBbind v2016 [13] is a widely used benchmark for evaluating the performance of models in predicting PLBA. The dataset includes three overlapping subsets: the general set (13,283 3D protein-ligand complexes), the refined set (4,057 complexes selected out of the general set with better quality), and the core set (290 complexes selected as the highest quality benchmark for testing). For convenience, we refer to the difference between the refined and core subsets (3,767 complexes) as the refined set. The general set contains IC50 data and K data, while the refined set and the core set only contain K data. In this paper, the core set is used as the test set, and we train models on the refined set or the general set.
• CSAR-HiQ [46] is a publicly available dataset of 3D protein-ligand complexes with associated experimental affinity labels. Data included in CSAR-HiQ are K data. When training models on the refined set of PDBbind, CSAR-HiQ is typically used to evaluate the generalization performance of the model [9]. In this paper, we create an independent test set of 135 samples based on CSAR-HiQ by removing samples that already exist in the PDBbind v2016 refined set.
Baselines. We mainly compare MBP with four families of methods. The first family is machine learning-based methods such as Linear Regression (LR), Support Vector Regression (SVR), and RF-Score [80]. The second family is CNN-based methods, including Pafnucy [40] and OnionNet [6].
Evaluation Metrics. Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Standard Deviation (SD), and Pearson's correlation coefficient (R) are used to evaluate the performance of PLBA prediction [9]. The definition of these metrics can be found in Appendix C.
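For readers without access to the appendix, the four metrics can be sketched as follows. The RMSE, MAE, and Pearson R definitions are standard. The SD definition used here (the standard deviation of the residuals after a least-squares linear fit of the labels on the predictions, with N − 1 in the denominator) is an assumption based on common practice in PDBbind-style benchmarks and may differ from the definition in Appendix C.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def _centered_sums(y_true, y_pred):
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    sxy = sum((p - mean_p) * (t - mean_t) for t, p in zip(y_true, y_pred))
    sxx = sum((p - mean_p) ** 2 for p in y_pred)
    syy = sum((t - mean_t) ** 2 for t in y_true)
    return mean_t, mean_p, sxy, sxx, syy

def pearson_r(y_true, y_pred):
    """Pearson's correlation coefficient between labels and predictions."""
    _, _, sxy, sxx, syy = _centered_sums(y_true, y_pred)
    return sxy / math.sqrt(sxx * syy)

def sd(y_true, y_pred):
    """Residual standard deviation after fitting y_true ~ a * y_pred + b (assumed form)."""
    mean_t, mean_p, sxy, sxx, _ = _centered_sums(y_true, y_pred)
    a = sxy / sxx
    b = mean_t - a * mean_p
    resid = sum((t - (a * p + b)) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(resid / (len(y_true) - 1))

# Toy example: predictions are off by a constant of one unit.
labels, preds = [1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]
print(rmse(labels, preds), mae(labels, preds), pearson_r(labels, preds), sd(labels, preds))
```

The toy example shows why the metrics are reported together: a constant offset is penalized by RMSE and MAE but leaves R and this form of SD unaffected, since both are invariant to a linear rescaling of the predictions.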
Training Parameter Settings. The models were trained using Adam [84] with an initial learning rate of $10^{-3}$ and an $L_2$ regularization factor of $10^{-6}$. The learning rate was scaled down by a factor of 0.6 if no drop in training loss was observed for 10 consecutive epochs. For pre-training, the number of training epochs was set to 100, while for fine-tuning, the number of training epochs was set to 1000 with an early stopping rule of 70 epochs if no improvement in the validation performance was observed.
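The learning-rate decay and early-stopping rules can be sketched as a small function that replays per-epoch logs. This is a framework-independent sketch, not the authors' training loop: the function name `run_schedule`, the assumption that the validation metric is one where lower is better (such as RMSE), and the resetting of the patience counter after each decay are all choices made for illustration.

```python
def run_schedule(train_losses, val_scores, lr=1e-3, factor=0.6,
                 lr_patience=10, stop_patience=70):
    """Replay per-epoch logs and return (final_lr, last_epoch_run).

    Assumptions: lower validation score is better; the learning-rate patience
    counter restarts after every decay step.
    """
    best_train, stale_train = float("inf"), 0
    best_val, stale_val = float("inf"), 0
    for epoch, (train_loss, val_score) in enumerate(zip(train_losses, val_scores), start=1):
        # Decay the learning rate when the training loss stops dropping.
        if train_loss < best_train:
            best_train, stale_train = train_loss, 0
        else:
            stale_train += 1
            if stale_train >= lr_patience:
                lr *= factor
                stale_train = 0
        # Stop early when validation performance stops improving.
        if val_score < best_val:
            best_val, stale_val = val_score, 0
        else:
            stale_val += 1
            if stale_val >= stop_patience:
                return lr, epoch
    return lr, len(train_losses)

# Training loss keeps improving, validation score plateaus after epoch 1.
train_log = [1.0 / e for e in range(1, 101)]
val_log = [1.0] * 100
print(run_schedule(train_log, val_log))  # (0.001, 71)
```

In the toy log the training loss never stalls, so the learning rate is untouched, while the flat validation score triggers early stopping 70 epochs after its best value.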

Experimental Results
In this work, we employ five different GNNs in the shared bottom encoder of MBP, which are denoted as MBP-X (where X corresponds to the GNN used) for distinction.For example, MBP-GCN denotes the MBP model using GCN in its shared bottom encoder.Unless specified otherwise, AttentiveFP is used as the default GNN in MBP.
Overall Performance Comparison on PDBbind core set and CSAR-HiQ. We first fine-tune MBP on the PDBbind refined set and report the test performance averaged over five repetitions for each method on the PDBbind core set and CSAR-HiQ set in Table 2. It can be observed that MBP achieves the best performance across all metrics on the two publicly available datasets. In particular, MBP-AttentiveFP and MBP-EGNN outperform all competing methods on the PDBbind core set. Compared with SIGN, which is the previous SOTA method [9], MBP-AttentiveFP achieves an improvement of 4.0%, 2.7%, 6.3%, and 3.5% on RMSE, MAE, SD, and R, respectively. On the CSAR-HiQ dataset, MBP-AttentiveFP also achieves results superior to the other competing methods. For instance, it attains gains of more than 6.3%, 6.6%, 10.1%, and 4.9% on RMSE, MAE, SD, and R, respectively, compared to SIGN. Both MBP-EGNN and MBP-AttentiveFP surpass the previous best method, SIGN, and are the best two methods in the current results. In addition, it is worth noting that MBP with even the simplest GNN models (e.g., GCN) outperforms SIGN on the CSAR-HiQ dataset. This indicates that the proposed multi-task pre-training framework is able to improve the capacity and generalization of the backbone model in the PLBA prediction problem. To further evaluate the generalization performance of the proposed model, we conduct an extra experiment on the PDBbind general set. As shown in Fig. 4, compared with all baselines, MBP achieves the best performance in terms of both RMSE and MAE.
For the molecular docking method TANKBind and the molecular pre-training method Transformer-M, we conducted experiments by adhering to the reported settings of these methods and fine-tuned our MBP model accordingly to ensure a fair comparison. Due to the variations in experimental settings and limited space, we provide the detailed comparison settings and results in Appendix E. The outcomes in Table S1 and Table S2 demonstrate the effectiveness and competitiveness of MBP, even when compared to these powerful methods.

1 The result of IGN was obtained by repeating the protocol provided by its authors. The other results were taken from [9].

Ablation Studies
In this section, we conduct extensive ablation studies to investigate the role of different components in MBP. In ablation studies, all MBP models are fine-tuned on the PDBbind refined set.
Multi-Task Learning Objectives. We perform an ablation study to investigate the effect of multi-task learning. Table 3 shows the results of our MBP with different learning tasks. We have two main observations:
• Regarding regression tasks and ranking tasks, we find that on both the PDBbind core set and the independent CSAR-HiQ set, MBP with a combination of both regression and ranking tasks always outperforms MBP with only regression or ranking tasks, indicating the power of ensembling regression and bioassay-specific ranking.
• Regarding IC50 tasks and K tasks, we also find that MBP pre-trained with both IC50 and K tasks is better than that using only IC50 or K tasks. This implies that MBP is able to learn the task correlation between IC50 and K data from ChEMBL and transfer the knowledge to the PDBbind core set, which only contains Ki/Kd data.
These results justify the effectiveness of the multi-task learning objectives designed in MBP.
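To make the combination of objectives concrete, the sketch below shows one way a ranking loss over pairs from the same bioassay could be combined with a λ-weighted regression loss, with IC50 and K assays treated as separate tasks. It is a minimal illustration, not the MBP implementation: the pairwise logit (the difference of two predicted affinities), the MSE regression term, the per-task averaging, the default `lam=0.1`, and all function names are assumptions.

```python
import math
from itertools import combinations


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def assay_loss(preds, labels, lam):
    """Loss for one bioassay: pairwise ranking + lam * regression (MSE)."""
    n = len(preds)
    regression = sum((p - y) ** 2 for p, y in zip(preds, labels)) / n

    pair_losses = []
    for i, j in combinations(range(n), 2):
        if labels[i] == labels[j]:
            continue  # ties carry no ranking signal
        target = 1.0 if labels[i] > labels[j] else 0.0
        logit = preds[i] - preds[j]  # assumed pairwise score
        # Binary cross-entropy with logits for "sample i ranks above sample j".
        pair_losses.append(softplus(logit) - target * logit)
    ranking = sum(pair_losses) / len(pair_losses) if pair_losses else 0.0
    return ranking + lam * regression


def multitask_loss(batch, lam=0.1):
    """batch: list of (task, preds, labels) with task in {"IC50", "K"}.

    Predictions for each assay are assumed to come from that task's own head;
    losses are averaged within each task and summed across tasks.
    """
    per_task = {}
    for task, preds, labels in batch:
        per_task.setdefault(task, []).append(assay_loss(preds, labels, lam))
    return sum(sum(losses) / len(losses) for losses in per_task.values())
```

Pairs are only formed inside a single assay, so labels measured under different experimental conditions are never compared directly. Setting `lam=0` reduces the objective to ranking only, which corresponds to the smallest value in the λ sweep reported in the hyper-parameter study.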
GNN Used in Shared Bottom Encoder. As shown in Table 2, we benchmark and compare MBP with different GNNs in the shared bottom encoder. We choose five popular GNN models: GCN, GIN, GAT, EGNN, and AttentiveFP. EGNN and AttentiveFP are able to capture the 3D structure of biomolecules, while GCN, GIN, and GAT are mainly designed for general graphs and cannot capture structural information directly. We have two interesting observations. Firstly, all GNN models, even the vanilla GCN, achieve comparable or better performance than previous methods. For example, MBP-GCN achieves an RMSE of 1.718 on the CSAR-HiQ set, slightly better than that of SIGN.
Hyper-Parameter Studies. We ablate the weight coefficient λ of the regression loss in MBP, which is crucial to the performance of MBP. Intuitively, too small a λ may hurt the ability to predict binding affinity, while too large a λ may aggravate the label noise problem. We vary the weight coefficient λ over {0, 0.01, 0.1, 0.3, 1.0}, and then depict the tendency curves of the test RMSE w.r.t. λ in Fig. 5. As expected, too large or too small a weight coefficient leads to worse performance in different multi-task settings (i.e., IC50 tasks, K tasks, and IC50+K tasks).

Conclusion
Protein-ligand binding affinity (PLBA) is a critical measure in drug discovery. However, the limited availability of high-quality training data and the variety and noise of PLBA labels pose difficulties in improving the performance of PLBA prediction models through pre-training. In this paper, we introduce the ChEMBL-Dock dataset, which contains 313,224 3D protein-ligand complexes with experimental PLBA labels. Based on this dataset, we propose the MBP method, which addresses the label variety and noise problems through multi-task learning and assay-specific ranking tasks.

E Additional experimental results
In this section, we conduct a comprehensive comparative analysis between MBP, TANKBind, and Transformer-M. To ensure a fair evaluation, we base our comparison on the reported results from their papers.

E.1 Comparison with Transformer-M
We follow Transformer-M's methodology to train and test on the PDBbind2016 dataset and use the same dataset split as Transformer-M (see https://openreview.net/forum?id=vZTp1oPV3PC&noteId=ZEa3K6qePg for more details). Following Transformer-M, we employ the adversarial training method FLAG [85] during fine-tuning. We repeated the experiment five times. As summarized in Table S1, MBP consistently outperforms Transformer-M across multiple evaluation metrics, including RMSE, SD, and R. Notably, MBP achieves these results with far fewer model parameters: only about 1 million, compared to Transformer-M's 50 million. These findings demonstrate the effectiveness of MBP in the task of PLBA prediction, where it surpasses the powerful molecule pre-training method Transformer-M.

E.2 Comparison with TANKBind
In this section, we present a comparison between MBP and the molecular docking method TANKBind in the task of PLBA prediction. Specifically, we train and test MBP on the PDBbind2020 dataset, using the same dataset split (time split strategy) as TANKBind. To establish statistical robustness, we repeated the experiment five times, consistent with the methodology of TANKBind. The results, summarized in Table S2, show a substantial advantage of MBP over TANKBind across various evaluation metrics, including RMSE, MAE, and R. Notably, MBP achieves a 6.6% improvement in RMSE compared to TANKBind. These findings substantiate the competitiveness and effectiveness of MBP.

Figure 1: A real example of bioassay data in ChEMBL. (1) The top three panels show an example where the same protein-ligand pair has different binding affinities in assays 1-3 in terms of measurement type (IC50 vs. Ki) and value (IC50 = 10 nM vs. IC50 = 7600 nM). (2) The bottom two panels show an example of the binding of different ligands (Ligand 1 & 2) to a protein in the same assay (assay 3).

Figure 2: The framework of MBP in pre-training and fine-tuning. The solid arrows indicate the flow path of the running examples of AssayID = CHEMBL1216983 during pre-training and PDB ID = 3g2y during fine-tuning.

Figure 3: Shared bottom encoder of MBP. It contains three modules: (a) encoding module, (b) interacting module, and (c) read-out module. (d) shows the detailed GNN model of the ligand/protein encoder in the encoding module.

Figure 4: Performance improvements of baselines and MBP on the PDBbind benchmark when trained on the general set.

Figure 5: Test RMSE and MAE of MBP on the PDBbind core set with varying weight coefficients λ of the regression loss.

Table 2: Test performance comparison on the PDBbind v2016 core set and the CSAR-HiQ dataset. The mean RMSE, MAE, SD, and R (std) over 3 repetitions are reported. The best two results are highlighted in bold.

Table 3: Ablation study of MBP with different pre-training tasks. The mean RMSE, MAE, SD, and R (std) over 5 repetitions are reported. The best two results are highlighted in bold.

Table S1: Comparison with Transformer-M on Transformer-M's setting. The mean RMSE, MAE, SD, and R (std) over 5 repetitions are reported. The best results are highlighted in bold.

Table S2: Comparison with TANKBind on TANKBind's setting. The mean RMSE, MAE, SD, and R (std) over 3 repetitions are reported. The best results are highlighted in bold.