
- PMC3009464

# Real-Time Ligand Binding Pocket Database Search Using Local Surface Descriptors

^{1}École Normale Supérieure de Cachan, Computer Science Department, 61 Avenue du President Wilson, 94235 Cachan cedex, Britanny, France

^{2}Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN, 47907, USA

^{3}Department of Computer Science, College of Science, Purdue University, West Lafayette, IN, 47907, USA

^{4}Markey Center for Structural Biology, College of Science, Purdue University, West Lafayette, IN, 47907, USA

## Abstract

Due to the increasing number of structures of unknown function accumulated by ongoing structural genomics projects, there is an urgent need for computational methods for characterizing protein tertiary structures. As functions of many of these proteins are not easily predicted by conventional sequence database searches, a legitimate strategy is to utilize structure information in function characterization. Of particular interest is the prediction of ligand binding to a protein, as ligand molecule recognition is a major part of the molecular function of proteins. Predicting whether a ligand molecule binds a protein is a complex problem due to the physical nature of protein-ligand interactions and the flexibility of both binding sites and ligand molecules. However, geometric and physicochemical complementarity is observed between the ligand and its binding site in many cases. Therefore, ligand molecules which bind to a local surface site in a protein can be predicted by finding similar local pockets of known binding ligands in the structure database. Here, we present two representations of ligand binding pockets and utilize them for ligand binding prediction by pocket shape comparison. These representations are based on mapping of surface properties of binding pockets, which are compactly described either by the two dimensional pseudo-Zernike moments or the 3D Zernike descriptors. These compact representations allow fast real-time pocket searching against a database. A thorough benchmark study employing two different datasets shows that our representations are competitive with other existing methods. Limitations and potential of the shape-based methods, as well as possible improvements, are discussed.

**Keywords:** protein surface, structure-based function prediction, pocket shape, pseudo-Zernike moments, 3D Zernike descriptors, ligand binding site

## Introduction

Characterization of protein function is one of the most important tasks in bioinformatics^{1}^{–}^{4}. Taking advantage of accumulated knowledge of gene functions stored in databases^{5}^{;}^{6}, computational function prediction methods typically search for patterns in the sequence or structure of a query protein that are similar to those of known proteins in databases. Traditionally, methods for sequence database searches^{7}, including homology search^{8}^{–}^{10}, functional domain search^{11}^{;}^{12}, and motif search^{13}^{;}^{14}, have been widely used for function prediction, since sequence information is almost always available for genes of unknown function. Recent years have seen the development of advanced approaches for sequence-based function prediction^{15}^{–}^{19}, which achieve improved accuracy and coverage in genome-scale function assignment. Moreover, many function prediction methods have been developed that utilize other types of data, such as protein-protein interaction data^{20}^{;}^{21}, gene expression data^{22}, and text mining^{23}^{;}^{24}, or a combination of such heterogeneous data^{25}^{;}^{26}. Of particular recent importance is functional characterization of proteins from their tertiary structures, since an increasing number of protein structures of unknown function have been solved by ongoing structural genomics projects^{27}^{;}^{28}. As of this writing, there are more than 3000 protein structures of unknown function in the Protein Data Bank (PDB)^{29} that are awaiting functional characterization.

Roughly speaking, tertiary structure information can be used for function prediction either by considering global fold or local structure of proteins. The former approach utilizes the observation that the evolutionary relationships of proteins could be better tracked by considering overall protein fold similarity to reach a further evolutionary distance where proteins share barely detectable sequence similarity^{30}^{–}^{32}. FINDSITE^{33} is one such method that utilizes global structure information to predict function. It uses groups of template structures of distant homologs of a target protein identified by threading. However, caution is needed in inferring function from the global structure similarity since there are protein folds which are adopted by many different proteins^{34}. On the other hand, the latter approaches aim to capture local geometry of known functional sites or small ligand molecule binding sites. Since local methods directly search for geometrical and/or physicochemical properties of functional sites, it could be possible to predict molecular functions of proteins which lack homology to proteins of known function^{35}.

Local structure-based function prediction can be logically divided into two parts: 1) detection of characteristic local sites, such as pockets, in a given protein surface, and 2) matching the local sites against a database of known functional site patterns. Of particular interest in the first step is the detection of pocket regions, since binding of small ligand molecules occurs at pocket regions in many cases^{36}. Therefore, if a protein is known to bind a ligand molecule, the binding site itself can be well predicted by just identifying pockets^{37}^{;}^{38}. We have shown in our previous work that ligand binding sites of proteins can be identified as one of the three largest pockets in the protein surface in 95% of the cases^{36}.

Toward the goal of identifying potential ligand binding sites in proteins, several methods have been developed. SURFNET^{39} searches for a gap in a protein surface by fitting spheres inside the convex hull. POCKET^{40} and PHECOM^{41} also use probe spheres. PocketPicker^{42} and LIGSITE^{43} locate a protein onto a three dimensional (3D) grid and scan it for protein-void-protein events in many directions, whereas VisGrid^{36} uses the visibility of surface points to find pockets. PocketDepth^{44} clusters grid cells using information of the depth of the grid cells. CAST^{37} computes a Voronoi diagram of a protein and identifies pockets as void tetrahedrons. Several methods consider additional information, such as sequence conservation^{45}^{–}^{47} and energetics^{48}^{–}^{50}, which are often combined while considering geometrical shape.

Algorithms used for matching local sites are closely interrelated with the representation of the local sites. In Catalytic Site Atlas^{51}, AFT^{52}, and SURFACE^{53}, where a local site is represented as a set of few residue positions, the root mean square deviation (RMSD) of equivalent amino acid residues is computed. In SitesBase^{54}, atoms in ligand binding sites are compared using geometric hashing. Another functional local site database, eF-site^{55}, represents protein surface as a graph with nodes characterized by local geometry and the electrostatic potential, and hence uses a maximum subgraph algorithm for seeking similar sites. Recently, Thornton and her colleagues explored the use of spherical harmonics in representing and comparing protein pockets^{56}^{;}^{57}. They compared ligand surface shape with pocket sizes^{56} and also did pocket to pocket comparison. Garutti and Bock proposed a 2D representation of binding sites by computing a collection of 2D histograms (spin-images) associated to surface points^{58}. Comparing two sites consists in finding highly-correlated pairs of spin-images that also satisfy geometric criteria.

In this paper, we introduce two approaches for representation and comparison of properties of ligand binding pockets. In the first approach, the shape and the electrostatic potential of a binding pocket are mapped onto a two dimensional (2D) picture, which is then represented by the pseudo-Zernike moments. The pseudo-Zernike moments are a series expansion of a 2D function; hence a pocket is represented compactly by a vector of coefficients assigned to terms in the series. This representation is conceptually very different from, for example, the spin-image representation^{58}: the spin-images require many 2D images per pocket, a simple correlation coefficient to compare images, and a computationally expensive geometric matching procedure, whereas our method uses only one 2D image, with mathematically more elaborate descriptors and inexpensive pocket matching. In the second approach, we employ the 3D Zernike descriptors^{59}^{;}^{60}, a series expansion of a 3D function, to directly represent 3D pocket surface properties. These two compact representations allow a fast real-time search against a database of known pockets. For example, a search against all pockets in PDB would take only a few seconds. Employing two different datasets of ligand binding pockets, we compare the performance of pocket retrieval by the 2D and 3D pocket representations as well as other existing methods. We also investigate how well our methods perform when ligand-free binding pockets or predicted binding pockets are used as queries. Limitations and possible improvements of the shape-based methods are discussed.

## Materials and Methods

In this work, we propose two binding pocket description models. The first model uses 2D moments to represent a mapping of pocket surface properties onto a 2D image. We compare three different 2D moments, the pseudo-Zernike, 2D Zernike, and Legendre moments, in terms of invariance upon rotation of pockets and the accuracy of pocket retrieval from a database. The second model uses the 3D Zernike descriptors, which are 3D moments-based descriptors that represent the 3D shape and properties of binding pockets.

### 2D pocket model using 2D moments

This binding pocket description model is based on ray-casting and 2D moments. Intuitively, a binding pocket is represented as a spherical panoramic picture viewed from its center of gravity. We then compute the pseudo-Zernike moments, which are 2D image descriptors, of the panoramic picture. Throughout this paper, the *surface* of a protein refers to the Connolly surface^{61}, which is a commonly used definition in protein surface visualization and surface-related computations. Following the Interact Cleft Model used in Kahraman’s work^{56}, a ligand binding pocket (*BP*) is the surface of protein heavy atoms (*i.e.* atoms other than hydrogen) which are within 5Å of any heavy atom of the bound ligand. We define *G* as the center of gravity of *BP*, provided it does not lie inside the protein volume; otherwise, *G* is defined as any of the closest points outside of *BP*. The *opening* of *BP* is defined as the set of rays starting at *G* and not intersecting *BP*. In total, 64800 (=180 × 360) rays are shot from *G*, one in each (θ, φ) direction.
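The pocket definition above amounts to a distance filter over heavy atoms. The following is a minimal sketch of that filter, not the authors' implementation: the `(element, (x, y, z))` atom layout and the function name `pocket_atoms` are assumptions made for illustration, and the cutoff is passed explicitly rather than hard-coded.

```python
import math


def pocket_atoms(protein_atoms, ligand_atoms, cutoff):
    """Return protein heavy atoms within `cutoff` angstroms of any ligand heavy atom.

    Atoms are (element, (x, y, z)) tuples; this layout is an assumption of the sketch.
    """
    # Hydrogens are ignored on both the ligand and the protein side.
    ligand_heavy = [xyz for element, xyz in ligand_atoms if element != "H"]
    selected = []
    for element, xyz in protein_atoms:
        if element == "H":
            continue
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_heavy):
            selected.append((element, xyz))
    return selected
```

The pocket surface itself would then be the part of the Connolly surface contributed by the selected atoms; computing that surface is outside the scope of this sketch.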

### Ray-casting of outermost binding pocket surface

We now describe a ray-casting strategy^{62} to represent a *BP* as seen from *G*. A 3D Cartesian coordinate system ($\vec{x}$, $\vec{y}$, $\vec{z}$) specific to a *BP* is defined as follows (Figure 1): the point *G* is the origin of the coordinate system and the unit vector of the x-axis, $\vec{x}$, is defined as a vector collinear with the average of the vectors that define the opening of *BP*. In cases where the opening is empty, the x-axis is arbitrarily defined. In the first (Kahraman) dataset, 19 out of 100 pockets have an empty opening. However, defining their x-axes arbitrarily still produces robust descriptors: the mean AUC of shape-only descriptors of these pockets is only 0.7% lower than the mean dataset AUC. We will later use 2D rotationally invariant moments on the ($\vec{y}$, $\vec{z}$) plane. Therefore, the remaining two vectors, $\vec{y}$ and $\vec{z}$, can be defined arbitrarily, as long as the basis ($\vec{x}$, $\vec{y}$, $\vec{z}$) is orthogonal. This choice of coordinate system provides a good approximation of rotation invariance for binding pocket descriptors, as seen in the section on pseudo-Zernike moments below.
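The construction of the pocket-specific coordinate system can be sketched as follows. This is an illustrative sketch only: the function name `pocket_frame`, the fallback axis for an empty opening, and the way the two remaining axes are chosen are assumptions, since the text only requires that they complete an orthogonal basis.

```python
import math


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def pocket_frame(opening_dirs):
    """Return an orthonormal basis (x, y, z) with x along the mean opening direction.

    `opening_dirs` holds the direction vectors of the rays that do not hit the pocket.
    """
    mean = [sum(d[i] for d in opening_dirs) for i in range(3)]
    if math.sqrt(sum(c * c for c in mean)) < 1e-12:
        x = (1.0, 0.0, 0.0)  # empty (or cancelling) opening: arbitrary choice
    else:
        x = _unit(mean)
    # Any helper vector not parallel to x works; y and z are arbitrary by design.
    helper = (0.0, 0.0, 1.0) if abs(x[2]) < 0.9 else (0.0, 1.0, 0.0)
    y = _unit(_cross(helper, x))
    z = _cross(x, y)
    return x, y, z
```

Because the later moments are rotationally invariant in the ($\vec{y}$, $\vec{z}$) plane, the particular helper vector used here should not matter for the final descriptor.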

Using spherical coordinates, we define a spherical function *f*(θ, φ) that describes the outermost surface of *BP* on [0,2π]×[0,π]:

In this equation, the subscript *i* is used in the event that a ray intersects *BP* multiple times, but such a situation is very rare. Figure 1 sketches the definition of *f* in two dimensions by projecting the scene on a fictional plane containing *G*. The function *f* is a piecewise continuous spherical function. Since it is only used to describe the shape of the pocket, *f* can be normalized such that its highest value is 1. In order to compute 2D moments, the function *f* has to be mapped to a 2D plane.
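To make the ray-casting step concrete, here is a minimal sketch that samples an outermost-distance function on a (θ, φ) grid. It rests on several assumptions that are not taken from the paper: the pocket surface is approximated by a union of spheres (the paper uses the Connolly surface), φ is treated as the polar angle measured from the x-axis and θ as the azimuth around it, and rays that miss the pocket are assigned the value 0. The origin plays the role of *G*.

```python
import math


def ray_sphere_hits(direction, center, radius):
    """Distances t > 0 at which a unit ray from the origin hits a sphere."""
    b = sum(d * c for d, c in zip(direction, center))  # projection of the center on the ray
    disc = b * b - (sum(c * c for c in center) - radius * radius)
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return [t for t in (b - root, b + root) if t > 0]


def ray_direction(theta, phi):
    """Unit vector; phi is the polar angle from the x-axis, theta the azimuth (assumed)."""
    return (math.cos(phi),
            math.sin(phi) * math.cos(theta),
            math.sin(phi) * math.sin(theta))


def outermost_surface(spheres, n_theta=360, n_phi=180):
    """Sample the farthest hit distance on an n_phi x n_theta grid, scaled to a maximum of 1."""
    image = []
    for j in range(n_phi):
        phi = math.pi * (j + 0.5) / n_phi
        row = []
        for i in range(n_theta):
            d = ray_direction(2 * math.pi * i / n_theta, phi)
            hits = [t for c, r in spheres for t in ray_sphere_hits(d, c, r)]
            row.append(max(hits) if hits else 0.0)  # 0 marks the opening (assumed)
        image.append(row)
    peak = max(max(row) for row in image)
    if peak > 0:
        image = [[v / peak for v in row] for row in image]
    return image
```

With the default grid, this casts 180 × 360 = 64800 rays, matching the count given in the text.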

The protein surface electrostatic potential can also be mapped in the same fashion by defining the value *f*(θ, φ) as the surface electrostatic potential at the outermost intersection between the ray and the protein surface. We used the Finite Difference Poisson-Boltzmann (FDPB) solver of the BALL library^{63} version 1.2 (http://www.ball-project.org/) for computing the electrostatic potential. The grid spacing is set to 0.8Å, the solvent dielectric constant is 78.0, and the PARSE force field^{64} is used to assign atomic charges and radii, all of which are the default parameters for calculating the electrostatic potential with the FDPB solver in the BALL library.

### Projection of 3D surface to 2D plane

Numerous methods exist for projecting a spherical function onto a plane, because no construction preserves all three of the following spherical properties together: area, shape, and distance. We chose a scheme which is a special case of the equi-rectangular (distance preserving) projection, named the *plate-carrée* projection. This consists of mapping the surface representation, *f*(θ, φ), to a 2D plane (Fig. 1). By this mapping, the opening of the pocket corresponds to θ=0 and the bottom of the pocket (θ=π) is projected to the center of the image. Hence, rotations around the x-axis of the pocket (changes to θ) correspond to rotations around the center of the image, modulo distortions due to the projection. Computing 2D moments that are invariant to rotation around the center of the image compensates for the lack of a reference for θ (*i.e.* the arbitrary definition of the z-axis). Empirically, this projection is satisfactory because it does not distort shapes of a binding pocket beyond recognition by image descriptors (see Results). Projected surfaces and electrostatics of sample binding pockets are shown in Figure 2. The resolution of these pictures is 360×180, since the coordinates are mapped to integer values of (θ, φ). In the following, we describe the 2D image descriptors examined in this study for representing the projections.

### Pseudo-Zernike moments

The pseudo-Zernike (p-Z) moments^{65} are commonly used in optics and are shown to be less sensitive to noise than conventional (two dimensional) Zernike moments^{66}^{;}^{67}. The p-Z moments use a set of complete and orthogonal basis functions defined over the unit circle (x^{2}+y^{2} ≤1) as follows:

$$V_{nm}(x,y)=R_{nm}(\rho)\,e^{jm\theta},\qquad R_{nm}(\rho)=\sum_{s=0}^{n-|m|}\frac{(-1)^{s}\,(2n+1-s)!}{s!\,(n-|m|-s)!\,(n+|m|+1-s)!}\,\rho^{n-s}$$(2)

where $\rho =\sqrt{{x}^{2}+{y}^{2}}$, *θ* = tan^{−1}(*y*/*x*), *n* ≥ 0, and |*m*| ≤ *n*. Using the polynomials, the p-Z moments of the order *n* and the repetition *m* for a 2D image *f*(*x*, *y*) are defined as:

$$A_{nm}=\frac{n+1}{\pi}\iint_{x^{2}+y^{2}\le 1} f(x,y)\,V_{nm}^{*}(x,y)\,dx\,dy$$(3)

The asterisk (*) denotes the complex conjugate. Please refer to the previous papers^{65}^{;}^{67} for more mathematical details. In this study, we used n = 5 for most of the computation. Among the moments used in image processing, we chose the p-Z moments for 2D binding pocket representation for the following reasons. First, they are orthogonal over the unit circle, thus information is not redundant between moments. Second, from the mathematical point of view, the magnitudes of these moments are rotationally invariant around the center of the image, which is a required property given the coordinate system we used in the model of binding pockets. Third, previous comparative studies show that these moments are among the most tolerant to noise for shape description^{67}^{–}^{69}.
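As an illustration of how such moments can be evaluated on a discrete picture, the sketch below uses the standard pseudo-Zernike radial polynomial and approximates the integral by a sum over pixel centers. It assumes a square image already mapped onto the unit disk (how the 360×180 picture is mapped to the disk is not specified in the text), and the function names are hypothetical. Only the magnitudes of the moments are kept, since those are the rotation-invariant quantities.

```python
import cmath
import math


def pz_radial(n, m, rho):
    """Pseudo-Zernike radial polynomial R_nm(rho) for 0 <= |m| <= n."""
    k = abs(m)
    return sum(
        (-1) ** s * math.factorial(2 * n + 1 - s)
        / (math.factorial(s) * math.factorial(n - k - s) * math.factorial(n + k + 1 - s))
        * rho ** (n - s)
        for s in range(n - k + 1)
    )


def pz_moment(image, n, m):
    """Discrete approximation of the moment of order n and repetition m."""
    size = len(image)
    total = 0j
    for row in range(size):
        y = (2 * row + 1) / size - 1  # pixel center mapped to [-1, 1]
        for col in range(size):
            x = (2 * col + 1) / size - 1
            rho = math.hypot(x, y)
            if rho > 1:
                continue  # outside the unit disk
            theta = math.atan2(y, x)
            # Multiply by the complex conjugate of the basis function.
            total += image[row][col] * pz_radial(n, m, rho) * cmath.exp(-1j * m * theta)
    pixel_area = (2 / size) ** 2
    return (n + 1) / math.pi * total * pixel_area


def pz_descriptor(image, order):
    """Magnitudes |A_nm| for 0 <= m <= n <= order."""
    return [abs(pz_moment(image, n, m)) for n in range(order + 1) for m in range(n + 1)]
```

Under these assumptions, an order of 5 gives 21 magnitudes per picture, and rotating the picture by 90° about its center leaves them unchanged up to rounding.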

### 2D Zernike moments

For comparison with the p-Z moments, we also employ the 2D Zernike moments and the Legendre moments, both of which are common alternative choices in the field of image analysis. The difference between the 2D Zernike moments and the p-Z moments is the radial function *R*_{nm}(ρ) in Eqn. 2. The 2D Zernike moments use the following radial function in the polynomials:

$$R_{nm}(\rho)=\sum_{s=0}^{(n-|m|)/2}\frac{(-1)^{s}\,(n-s)!}{s!\left(\frac{n+|m|}{2}-s\right)!\left(\frac{n-|m|}{2}-s\right)!}\,\rho^{n-2s}$$(4)

with |*m*| ≤ *n* and *n* − |*m*| even.
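The standard Zernike radial polynomial can be evaluated directly from its factorial form. The sketch below is illustrative (the function name is hypothetical) and enforces the parity condition on *n* − |*m*|.

```python
import math


def zernike_radial(n, m, rho):
    """Zernike radial polynomial R_nm(rho); requires |m| <= n and n - |m| even."""
    k = abs(m)
    if k > n or (n - k) % 2:
        raise ValueError("need |m| <= n and n - |m| even")
    return sum(
        (-1) ** s * math.factorial(n - s)
        / (math.factorial(s)
           * math.factorial((n + k) // 2 - s)
           * math.factorial((n - k) // 2 - s))
        * rho ** (n - 2 * s)
        for s in range((n - k) // 2 + 1)
    )
```

The parity condition means that, for the same maximum order, there are fewer 2D Zernike moments than p-Z moments.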

### Legendre moments

The Legendre moments of order (*m*+*n*) for a 2D image *f*(*x*, *y*) are defined as

$$L_{mn}=\frac{(2m+1)(2n+1)}{4}\int_{-1}^{1}\int_{-1}^{1}P_{m}(x)\,P_{n}(y)\,f(x,y)\,dx\,dy$$(5)

where *m*, *n* = 0, 1, 2, …, ∞ and P_{m}(x) is the Legendre polynomial of order *m*:

$$P_{m}(x)=\frac{1}{2^{m}\,m!}\frac{d^{m}}{dx^{m}}\left(x^{2}-1\right)^{m}$$(6)

The 2D image function *f*(*x*, *y*) can be written as a series expansion in terms of the Legendre polynomials over the square [−1 ≤ x, y ≤ 1]. For more mathematical details of the 2D Zernike and the Legendre moments, refer to the literature^{67}. The 2D Zernike and the Legendre moments are computed for the same 2D pictures of pockets (Fig. 2).
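A discrete version of the Legendre moments can be sketched as below. The polynomial is evaluated with the Bonnet recurrence instead of a derivative formula, and the double integral is approximated by a sum over pixel centers mapped to [−1, 1]; both are implementation choices of this sketch, and the function names are hypothetical.

```python
def legendre(m, x):
    """Legendre polynomial P_m(x) via the Bonnet recurrence."""
    if m == 0:
        return 1.0
    p_prev, p = 1.0, x
    for k in range(1, m):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p


def legendre_moment(image, m, n):
    """Discrete approximation of the Legendre moment L_mn of a rectangular image."""
    rows, cols = len(image), len(image[0])
    total = 0.0
    for r in range(rows):
        y = (2 * r + 1) / rows - 1  # pixel center mapped to [-1, 1]
        for c in range(cols):
            x = (2 * c + 1) / cols - 1
            total += legendre(m, x) * legendre(n, y) * image[r][c]
    pixel_area = (2 / cols) * (2 / rows)
    return (2 * m + 1) * (2 * n + 1) / 4 * total * pixel_area
```

Unlike the p-Z and 2D Zernike magnitudes, these moments are defined on a square and change when the picture is rotated, which is why pre-alignment matters for them.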

### 3D pocket model using 3D Zernike descriptors

In this model, binding pockets are extracted in the same way as in the previous 2D moments-based pocket model and are represented by the 3D Zernike descriptors (3DZD). It was previously shown by our group and others that the 3DZD are effective in comparing global surface shape^{70}^{–}^{72}, local surface regions^{73}^{;}^{74}, and surface physicochemical properties^{75} of proteins. Naturally, the 3DZD can also be applied for comparing the shapes of small ligand molecules^{76}. Recently we have developed a surface shape-based protein docking prediction method named LZerD, which uses the 3DZD for detecting complementarity of surface shapes^{77}. In this work we examine how well the 3DZD perform in representing and comparing local shapes (binding pockets) of proteins.

### 3D Zernike descriptors

The 3DZD is a series expansion of a 3D function, which allows a compact representation of a 3D object (*i.e.* a 3D function). The mathematical foundation of the 3DZD was laid by Canterakis^{60}, and Novotni and Klein^{59} then applied it to 3D shape retrieval. Below we provide a brief mathematical derivation of the 3DZD. See the two papers^{59}^{;}^{60} for more details.

The pocket surface is extracted in the same way as in the 2D moments-based pocket models but with a different distance threshold value of 8Å to identify ligand binding atoms in the protein, since 8Å gave better results for the 3DZD than 5Å. Then, the pocket surface is placed on a 3D grid. To represent the surface shape, each grid cell (voxel) is assigned 1 if it is on the surface and 0 otherwise. Values of other physicochemical properties, such as the electrostatic potential, are also assigned only to the surface voxels. The resulting voxels with values on them are considered as a 3D function, *f*(**x**), which is expanded into a series in terms of the Zernike-Canterakis basis^{59} defined by the collection of functions

$$Z_{nl}^{m}(r,\vartheta,\varphi)=R_{nl}(r)\,Y_{l}^{m}(\vartheta,\varphi)$$(7)

with −*l* ≤ *m* ≤ *l*, 0 ≤ *l* ≤ *n*, and (*n* − *l*) even. The spherical harmonics^{78}, ${Y}_{l}^{m}(\vartheta ,\varphi )$, are the angular portion of an orthogonal set of solutions to Laplace’s equation, given by:

$$Y_{l}^{m}(\vartheta,\varphi)=N_{l}^{m}\,P_{l}^{m}(\cos\vartheta)\,e^{im\varphi}$$(8)

where ${N}_{l}^{m}$ is a normalization factor,

$$N_{l}^{m}=\sqrt{\frac{2l+1}{4\pi}\,\frac{(l-m)!}{(l+m)!}}$$(9)

and ${P}_{l}^{m}$ is the associated Legendre function. *R*_{nl}(*r*) are radial functions defined by Canterakis, constructed so that ${Z}_{nl}^{m}(r,\vartheta ,\varphi )$ are polynomials when written in terms of Cartesian coordinates. ${Z}_{nl}^{m}(r,\vartheta ,\varphi )$, which are currently written in spherical coordinates, are converted into Cartesian coordinate functions ${Z}_{nl}^{m}(\mathbf{x})$ in the following three steps:

- The conversion between spherical coordinates, (*r*, *ϑ*, *ϕ*), and Cartesian coordinates, **x** = (*x*, *y*, *z*), is defined as
  $$\mathbf{x}=\mid \mathbf{x}\mid \zeta =r\zeta =r(\sin\vartheta \sin\varphi ,\sin\vartheta \cos\varphi ,\cos\vartheta )$$(10)
- Using Eqn. 8, we define a function ${e}_{l}^{m}$ in Cartesian coordinates, which is later used for rewriting the 3D Zernike function (Eqn. 7) into Cartesian coordinates. The harmonics polynomials ${e}_{l}^{m}$ are defined as
  $${e}_{l}^{m}(\mathbf{x})\equiv {r}^{l}{Y}_{l}^{m}(\vartheta ,\varphi )={r}^{l}{c}_{l}^{m}{\left(\frac{ix-y}{2}\right)}^{m}{z}^{l-m}\sum _{\mu =0}^{\lfloor \frac{l-m}{2}\rfloor}\binom{l}{\mu}\binom{l-\mu}{m+\mu}{\left(-\frac{{x}^{2}+{y}^{2}}{4{z}^{2}}\right)}^{\mu},$$(11)
  where ${c}_{l}^{m}$ are normalization factors
  $${c}_{l}^{m}={c}_{l}^{-m}=\frac{\sqrt{(2l+1)(l+m)!(l-m)!}}{l!}.$$(12)
- Using the harmonics polynomials ${e}_{l}^{m}$, the 3D Zernike functions (Eqn. 7) can be rewritten in Cartesian coordinates:
  $${Z}_{nl}^{m}(\mathbf{x})={R}_{nl}(r){Y}_{l}^{m}(\vartheta ,\varphi )=\left(\sum _{\nu =0}^{k}{q}_{kl}^{\nu}{\mid \mathbf{x}\mid}^{2\nu}{r}^{l}\right)\cdot {Y}_{l}^{m}(\vartheta ,\varphi )=\left(\sum _{\nu =0}^{k}{q}_{kl}^{\nu}{\mid \mathbf{x}\mid}^{2\nu}\right)\cdot {e}_{l}^{m}(\mathbf{x})$$(13)
  where 2*k* = *n* − *l* and the coefficients ${q}_{kl}^{\nu}$ are determined as follows to guarantee the orthonormality of the functions within the unit sphere,
  $${q}_{kl}^{\nu}=\frac{{(-1)}^{k}}{{2}^{2k}}\sqrt{\frac{2l+4k+3}{3}}\binom{2k}{k}{(-1)}^{\nu}\frac{\binom{k}{\nu}\binom{2(k+l+\nu )+1}{2k}}{\binom{k+l+\nu}{k}}.$$(14)

Now the 3D Zernike moments of *f*(**x**) are defined as the coefficients of the expansion in this orthonormal basis, *i.e.* by the formula

$${\mathrm{\Omega}}_{nl}^{m}=\frac{3}{4\pi}\int_{\mid \mathbf{x}\mid \le 1} f(\mathbf{x})\,\overline{{Z}_{nl}^{m}(\mathbf{x})}\,d\mathbf{x}.$$(15)

Finally, the moments are collected into (2*l*+1) dimensional vectors ${\mathrm{\Omega}}_{nl}=({\mathrm{\Omega}}_{nl}^{l},{\mathrm{\Omega}}_{nl}^{l-1},{\mathrm{\Omega}}_{nl}^{l-2},{\mathrm{\Omega}}_{nl}^{l-3},\dots ,{\mathrm{\Omega}}_{nl}^{-l})$ and the rotational invariance is obtained by defining the 3DZD, *F*_{nl}, as the norms of the vectors Ω_{nl}:

$$F_{nl}=\lVert {\mathrm{\Omega}}_{nl}\rVert =\sqrt{\sum_{m=-l}^{l}{\left|{\mathrm{\Omega}}_{nl}^{m}\right|}^{2}}$$(16)

The parameter *n* is called the order of the 3DZD. The order determines the resolution (*i.e.* the number of terms in the series expansion) of the descriptor, and *n* defines the range of *l*. A 3DZD is a series of invariants (Eqn. 16) for each pair of *n* and *l*, where *n* ranges from 0 to the specified order. We use *n* = 20, which yields a total of 121 invariants, because it was shown to provide sufficient accuracy in previous works on shape comparison^{59}^{;}^{70}.
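The count of 121 invariants follows directly from the index constraints 0 ≤ *l* ≤ *n* and (*n* − *l*) even. A short sketch (the function name is hypothetical) that enumerates the (*n*, *l*) pairs makes this explicit:

```python
def zernike3d_index_pairs(order):
    """All (n, l) pairs with 0 <= l <= n <= order and (n - l) even."""
    return [(n, l) for n in range(order + 1) for l in range(n + 1) if (n - l) % 2 == 0]


pairs = zernike3d_index_pairs(20)
print(len(pairs))  # number of rotation-invariant values in one descriptor
```

Each order *n* contributes ⌊*n*/2⌋ + 1 pairs, so summing over *n* = 0, …, 20 gives 121.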

As for the surface electrostatic potentials, 3DZD is computed separately for the pattern of positive values and for the negative values and later combined in the following way^{75}: First, voxels with a positive electrostatic potential value are kept but all the other voxels with a negative electrostatic potential value are reset with a value of zero. Then 3DZD of the pattern of the positive values in the cubic grid is computed. Next, similarly, voxels with a negative electrostatic potential value are kept but all the other voxels are reset with a value of zero. Then 3DZD of the pattern of the negative values is computed. Then, the two 3DZDs, one for voxels with a positive value and another one for voxels with a negative value are combined, yielding a descriptor with 2×121= 242 invariants. This is because Eqn. 16 does not differentiate positive and negative values, but only a pattern of non-zero values in the 3D space. Finally, we normalize numbers in a descriptor by the norm of the descriptor. This normalization is found to reduce dependency of 3DZD on the number of voxels used to represent a protein.
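The sign-splitting procedure for the electrostatic potential can be sketched as follows. This is only an outline under stated assumptions: voxels are held in a dictionary keyed by grid index (absent keys count as zero), `descriptor_fn` stands in for any routine that returns a 3DZD for a set of voxel values (it is a placeholder, not a real library call), and the final normalization uses the Euclidean norm, which the text does not specify.

```python
import math


def split_by_sign(voxels):
    """Separate voxel values into a positive-only and a negative-only pattern."""
    positive = {idx: v for idx, v in voxels.items() if v > 0}
    negative = {idx: v for idx, v in voxels.items() if v < 0}
    return positive, negative


def electrostatic_descriptor(voxels, descriptor_fn):
    """Concatenate the descriptors of both patterns and normalize the result."""
    positive, negative = split_by_sign(voxels)
    combined = list(descriptor_fn(positive)) + list(descriptor_fn(negative))
    norm = math.sqrt(sum(v * v for v in combined))  # Euclidean norm (assumed)
    return [v / norm for v in combined] if norm > 0 else combined
```

With a 121-value 3DZD plugged in as `descriptor_fn`, the combined vector has 242 entries, as described above.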

### Scoring function for binding ligand prediction

The proposed binding pocket model is tested in terms of its performance in retrieving pockets of the same binding ligand type as a query pocket. For a given query pocket of a protein, the *k* “closest” pockets in a benchmark dataset (described below) are retrieved. The closeness (*i.e.* distance) of two pockets is defined by either the Manhattan distance, the Euclidean distance, or the correlation coefficient-based metric of the descriptors of the two pockets. The Manhattan distance of two pockets, *P*_{a} and *P*_{b}, is defined as:

$$d_{M}(P_{a},P_{b})=\sum_{i=1}^{N}\left|A_{i}^{P_{a}}-A_{i}^{P_{b}}\right|$$(17)

The Euclidean distance:

$$d_{E}(P_{a},P_{b})=\sqrt{\sum_{i=1}^{N}\left(A_{i}^{P_{a}}-A_{i}^{P_{b}}\right)^{2}}$$(18)

The correlation coefficient-based distance:

$$d_{c}(P_{a},P_{b})=1-\mathrm{corr}\left(A^{P_{a}},A^{P_{b}}\right)$$(19)

Here, $A_{i}^{P_{a}}$ and $A_{i}^{P_{b}}$ are the *i*-th values of the descriptors of pockets *P*_{a} and *P*_{b}, *N* is the total number of values in a descriptor, and corr denotes the correlation coefficient of the two descriptor vectors. The correlation coefficient-based metric, *d*_{c}, equals zero when two descriptors correlate perfectly.

Using the *k* closest pockets to a query based on one of the distances (Eqns. 17, 18, 19) described above, the scoring function for a binding pocket of a ligand type *F* is defined as

where *l*(*i*) denotes the ligand type (AMP, FAD, etc.) of the *i*-th closest pocket to the query, *n* is the number of pockets of the type *F* in the database, and the function δ_{X,Y} equals 1 if X is of type Y, and 0 otherwise. The first term considers the top *k* closest pockets to the query, with a higher score assigned to a pocket with a higher rank. The second term normalizes the score by the number of pockets of the same type *F* included in the database. The ligand with the highest *Pocket_score* is predicted to bind to the query pocket.
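The three distances and the retrieval of the *k* closest pockets can be sketched in a few lines. The sketch assumes the Pearson correlation coefficient for the correlation-based distance and a database held as a list of `(ligand_type, descriptor)` pairs; both are assumptions for illustration, and the scoring function itself is deliberately not implemented here.

```python
import math


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation_distance(a, b):
    """1 - Pearson correlation; zero for perfectly correlated descriptors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)  # undefined for constant descriptors


def k_closest(query, database, k, distance=euclidean):
    """Ligand types of the k database pockets closest to the query descriptor."""
    ranked = sorted(database, key=lambda entry: distance(query, entry[1]))
    return [ligand for ligand, _ in ranked[:k]]
```

Because each descriptor is a short fixed-length vector, ranking a whole database reduces to one pass of vector distances followed by a sort, which is what makes real-time search feasible.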

### Volumetric representation of pockets by spherical harmonics

In addition, our pocket representation is compared with a 3D volumetric representation of pockets by spherical harmonics, which was developed by the authors of the benchmark dataset of ligand pockets^{56} we use in this study. Among the three pocket shape approximation models proposed in their paper^{56}, we compare our results with the Interact Cleft Model. The Interact Cleft Model defines the volume of a ligand binding pocket by SURFNET^{39}, which places trial spheres of a certain range of sizes within 0.3 Å of protein atoms interacting with the bound ligand. The atoms interacting with the ligand are determined by HBPLUS^{79}. The model uses spherical harmonics functions for representing the volume of a pocket. Since spherical harmonics are not invariant to rotation, a pocket needs to be pose normalized. (An alternative to the prior pose normalization is to store amplitudes of frequencies of spherical harmonics to achieve rotation invariance^{80}.) A pocket volume is first shifted so that its center of gravity is placed at the origin of the coordinate system. Then the pocket volume is rotated so that its moment of inertia tensor becomes diagonal with maximal values in *x* followed by *y* then followed by *z*. Now the outermost surface points of the pocket volume are considered as a spherical function *f*(θ, φ) on a unit sphere, and it is expanded as a series of spherical harmonics:

where the order l_{max} is set to 16, *Re*[*Y*_{lm}(θ, φ)] is the real part of the spherical harmonic functions, and *c*_{lm} are the associated coefficients. The similarity of two pockets is measured by the Euclidean distance (Eqn. 18) of the vectors of coefficients *c*_{lm} of the two pockets. For more details of the procedure, refer to their papers^{56}^{;}^{57}.

### Benchmark datasets of ligand binding pockets

We used two datasets for benchmarking pocket retrieval performance of the methods. The first dataset, compiled by Kahraman *et al.*^{56}, is used to compare the performance of our methods with the previous 3D volumetric representation of pockets by spherical harmonics (the Kahraman set). This dataset consists of 100 proteins, each of which binds one of the following nine different ligands: adenosine monophosphate (AMP), adenosine-5′-triphosphate (ATP), flavin adenine dinucleotide (FAD), flavin mononucleotide (FMN), glucose (GLC), heme (HEM), nicotinamide adenine dinucleotide (NAD), phosphate (PO4), or steroid (STR). Abbreviations of the ligand names are shown in parentheses. The PDB IDs of ligand binding proteins in the dataset are listed in Table 1A. The tertiary structures of these proteins have been solved by X-ray crystallography and only structures which bind their cognate ligand are used. The proteins are each selected from different homologous families in the CATH database^{81} (*i.e.* H-level in CATH) so that they are not closely related evolutionarily.

The second dataset contains a total of 175 proteins, each of which binds one of 12 ligand molecules (Table 1B). This dataset is constructed based on the ligand bound and unbound protein pairs listed in Table 4 in the paper by Huang & Schroeder^{46}. They used the dataset for benchmarking pocket identification methods. Their original list can also be found at http://kiharalab.org/visgrid_suppl/, as we have also used it in our previous study^{36}. From this list, we first discarded proteins which bind non-natural ligands. Then, we consulted the PDBsum database^{82} and removed entries if they do not have a sufficient number of other non-homologous PDB entries (with a sequence identity of less than 30%) that bind the same ligand molecule. This set is called the Huang dataset. The Kahraman set and the Huang set do not overlap in terms of either proteins or ligand types. The purpose of this dataset is twofold: to test the proposed methods on another dataset and also to investigate the performance of the methods when unbound pockets are used as queries.

## Results

### Effect of rotation to the three 2D moments

To begin with, we examine the effect of rotation of pockets on the 2D moments-based pocket models, namely, the p-Z, the 2D Zernike, and the Legendre moments. In projecting the pocket geometry to a 2D plane, a degree of freedom still exists around the x-axis, which is defined as the direction from the center to the opening of the pocket (Fig. 1). It should also be noted that rotation invariance in the projected 2D space does not ensure rotation invariance in the original 3D space. Thus, rotation should not alter the pocket descriptors to the level that the recognition of pockets of the same ligand type becomes impractical.

Here, a ligand binding pocket is rotated arbitrarily and the difference of the moments caused by the rotation (the rotation error) is evaluated. Concretely, the AMP binding pocket of asparagine synthetase (PDB: 12AS) is rotated around the axis $\vec{x}+\vec{y}+\vec{z}$ of an arbitrary coordinate system whose origin is located at the center of gravity of the pocket. We computed and compared the moments of the pocket at each rotated position with and without pre-alignment. First, we simply computed the moments of the pocket at each rotated position and compared them with the ones computed at the original position (*i.e.* without pre-alignment of pockets). Second, for each rotated pocket, we aligned it with the pocket at the original position before computing the moments (*i.e.* with pre-alignment). The pre-alignment consists of the following steps. The x-axes of the two pockets are aligned, then the z-axis is defined such that its principal moment of inertia (PMI) is maximized over all possible directions on the plane orthogonal to the x-axis. From the mathematical point of view, the 2D Zernike and the p-Z moments are invariant upon rotation around the axis while the Legendre moments are not. Thus, it is expected that the comparison without pre-alignment gives better results for the 2D Zernike and p-Z moments than for the Legendre moments. The comparison with the pre-alignment is performed to see whether the Legendre moments show comparable performance with the other two moments. The error is defined as the ratio of the Euclidean distance between the moments of the pocket at a rotated position and at the original position relative to the average Euclidean distance of the pocket (at the original position) to the pockets of other ligand types in the Kahraman dataset.
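The rotation error defined above is a simple ratio of distances between descriptor vectors. A minimal sketch, with hypothetical function and argument names, is:

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rotation_error(original, rotated, other_type_descriptors):
    """Distance caused by rotation, relative to the mean distance to other-ligand pockets."""
    baseline = sum(euclidean(original, d) for d in other_type_descriptors)
    baseline /= len(other_type_descriptors)
    return euclidean(original, rotated) / baseline
```

An error well below 1 indicates that rotating the pocket moves its descriptor much less than the typical separation from pockets of other ligand types.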

In Figure 3, the rotation error of the three moments is plotted with and without the pre-alignment of the pockets. First, as expected, the p-Z and 2D Zernike moments show lower error than the Legendre moments when pockets are not pre-aligned. Next, when pockets are pre-aligned, the error of all three moments is reduced remarkably. However, the p-Z and 2D Zernike moments still show a smaller error than the Legendre moments. Summing the error values over all rotation angles (the x-axis of the plot) for the pre-aligned results shows that the p-Z moments have the smallest total error, 3.49, while the totals for the 2D Zernike and the Legendre moments are 4.47 and 10.47, respectively.

### Pocket retrieval performance by the three 2D moments and the 3DZD

Next, we compare the 2D pocket models using the three 2D moments and the 3D pocket model using the 3DZD in terms of actual performance of identifying binding pockets of the same ligand. Note that pockets are pre-aligned for computing the Legendre moments. Figure 4 shows the Receiver Operating Characteristic (ROC) curves of the three 2D moments and the 3DZD, averaged over the search results for the different ligand binding pockets in the benchmark dataset. Concretely, given a query pocket, pockets in the database which are within a threshold Euclidean distance (Eqn. 5) are retrieved and then evaluated by computing the false positive rate (x-axis) and the true positive rate (y-axis). Varying the threshold value from strict to more permissive values yields the ROC curve. The false positive rate of a set of retrieved pockets for a query is defined as the ratio of the number of retrieved pockets of a different ligand (*i.e.* false positives) to the total number of pockets of a different ligand (*i.e.* false positives and true negatives) in the dataset. The true positive rate is the ratio of the number of correctly retrieved pockets (*i.e.* true positives) to the total number of pockets of the same type in the dataset.
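As a rough illustration of this evaluation, the sketch below sweeps the distance threshold for a single query and integrates the resulting curve with the trapezoidal rule. The function names and the toy distances are hypothetical; the averaging over queries used for Figure 4 is not shown.

```python
import numpy as np

def roc_curve_by_threshold(distances, is_same_ligand):
    """Sweep the distance threshold from strict to permissive and
    return the false and true positive rates at each step."""
    distances = np.asarray(distances, dtype=float)
    positives = np.asarray(is_same_ligand, dtype=bool)
    n_pos = positives.sum()
    n_neg = (~positives).sum()
    fpr, tpr = [0.0], [0.0]
    for threshold in np.unique(distances):      # sorted ascending
        retrieved = distances <= threshold
        tpr.append((retrieved & positives).sum() / n_pos)
        fpr.append((retrieved & ~positives).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy query: distances to five database pockets and whether each one
# binds the same ligand as the query (made-up values)
distances = [0.1, 0.2, 0.3, 0.4, 0.5]
same_ligand = [True, False, True, False, False]
fpr, tpr = roc_curve_by_threshold(distances, same_ligand)
area = auc(fpr, tpr)
print(area)
```

In this toy case five of the six (same ligand, different ligand) pairs are ordered correctly, so the area is 5/6.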

The results are shown in Figure 4. First, all four descriptors perform better than random retrieval. Second, the p-Z and the 2D Zernike moments show almost identical performance on this plot, which is significantly better than that of the Legendre moments with the pre-alignment of the pockets. The 3D pocket model with the 3DZD has slightly higher true positive rates than the p-Z and 2D Zernike moments when the false positive rate is small (0 to around 0.5) and lower ones over the latter half of the false positive rate range (around 0.5 to 1.0). Quantitative computation of the Area Under the Curve (AUC) of the ROC curve (upper half of Table 2, results using the “Pocket shape only” descriptor) shows that the p-Z moments, the 2D Zernike moments, and the 3DZD have an identical AUC value of 0.66 when the Euclidean distance is used. Note that this value is larger than the result obtained with the spherical harmonics (0.64). The p-Z moments perform slightly better than the 2D Zernike moments and the 3DZD when the Manhattan distance (d_{M}) is used. Since the p-Z moments show the best performance among the three 2D moments (Legendre, p-Z, and 2D Zernike), we decided to use the p-Z moments with the Euclidean distance in the subsequent experiments, and further compared their performance with the 3D pocket model using the 3DZD. We have also tested the p-Z moments with the pocket pre-alignment, but the improvement was not significant (0.75% improvement in the AUC value). Therefore, the pocket pre-alignment is not used in the following experiments.

### Combining pocket size information

Kahraman *et al.* reported that pocket retrieval accuracy improves when the shape descriptor by spherical harmonics is combined with pocket volume information^{56}. Inspired by their idea, we explore combinations of pocket shape by the p-Z moments or the 3DZD and the pocket size using a weighting factor, *w*. These two pieces of information are combined in the descriptor of a pocket, *P*_{a}, as the following vector:

$$P_a = \left(w \cdot S_{P_a},\; A_1^{P_a},\; A_2^{P_a},\; \ldots,\; A_N^{P_a}\right)$$

where S_{Pa} is the size of the pocket *P*_{a}, A_{k}^{Pa} is the *k*-th value of the moments of the pocket *P*_{a} (the pseudo-Zernike, the 2D Zernike, the Legendre, or the 3DZD), and *N* is the total number of values of the moments. Thus, using the vector above, the Euclidean distance between the descriptors of two pockets, *P*_{a} and *P*_{b}, becomes:

$$d_E(P_a, P_b) = \sqrt{w^2 \left(S_a - S_b\right)^2 + \sum_{k=1}^{N} \left(A_k^{P_a} - A_k^{P_b}\right)^2}$$

where *S*_{a} and *S*_{b} are the sizes of the two pockets. As the size of a pocket, we use the average distance from the center of gravity G of the pocket to the pocket surface. Table 3 shows the size of the nine different types of pockets in the Kahraman set and twelve ligand types in the Huang set. The average distance has a significant correlation coefficient of 0.853 to the molecular mass of ligands (g/mol).

In Figure 5, the weighting factor *w* of the pocket size term (Eqn. 23) is searched from 1.0 to 8.0 (with an interval of 0.5) for the p-Z moments and 0.01 to 0.08 (with an interval of 0.01) for the 3DZD, and the average AUC value over different pocket types is computed. In addition to the value for weighting factor *w*, we also examined different resolution of the p-Z moments and the 3DZD, *i.e.* the number of terms in the moments (x-axis). Mathematically, a target function is perfectly described by an infinite number of terms in the moments. However, practically using too many terms is inefficient and may even be harmful for our purpose, because the primary goal of this work is to compare and retrieve pockets of the same type that are not exactly identical in shape, rather than to describe a pocket’s shape as accurately as possible.
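The size-weighted comparison described in this section can be sketched as follows. How *w* enters the descriptor is an assumption here: the sketch multiplies the pocket size by *w* before taking the Euclidean distance. The function names and numbers are hypothetical.

```python
import numpy as np

def combined_descriptor(size, moments, w):
    """Prepend the weighted pocket size to the moment values.
    Assumption: the size enters the vector as w * size."""
    return np.concatenate(([w * size], np.asarray(moments, dtype=float)))

def combined_distance(size_a, moments_a, size_b, moments_b, w):
    """Euclidean distance between two size-weighted descriptors."""
    desc_a = combined_descriptor(size_a, moments_a, w)
    desc_b = combined_descriptor(size_b, moments_b, w)
    return float(np.linalg.norm(desc_a - desc_b))

# Toy pockets (made-up values): sizes differ by 1.0,
# and the moment vectors are 5.0 apart
d_weighted = combined_distance(5.0, [0.0, 0.0], 6.0, [3.0, 4.0], w=4.5)
d_shape_only = combined_distance(5.0, [0.0, 0.0], 6.0, [3.0, 4.0], w=0.0)
print(d_weighted, d_shape_only)
```

Setting *w* to zero recovers the shape-only distance, which makes it easy to see how much the size term contributes for a given weight.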

**Figure 5.** **A**, the pseudo-Zernike moments; **B**, the 3DZD.

For the p-Z moments (Fig. 5A), we find that fifteen terms (which correspond to orders up to n = 4 in Eqn. 2) give a sufficient AUC value and that using more terms does not improve the results. In terms of the weight *w*, 4.5 gives the highest AUC value. For the 3DZD (Fig. 5B), the order of 20 with the weight of 0.04 gave the highest AUC value. The optimal weight is much smaller for the 3DZD because the average norm of the 3DZD is two orders of magnitude lower than that of the p-Z moments.

The bottom half of Table 2 summarizes the effect of adding the pocket size information using the weight of 4.5 for the three 2D moments and 0.04 for the 3DZD. The AUC value increases consistently when the pocket size is added, for all the combinations of moments and distance metrics tested. Among all the combinations tested in Table 2, the best AUC value, 0.81, is achieved by the 3DZD with the pocket size using the Euclidean distance. The descriptor with the p-Z moments comes second with 0.79, followed by the 2D Zernike (0.78) and Legendre moments (0.77). The values of the 3DZD, the p-Z and the 2D Zernike moments are higher than the AUC value achieved by a spherical harmonics-based descriptor combined with the pocket volume proposed by Kahraman *et al*.^{56} (the rightmost column).

### The number of top scoring pockets to consider in the Pocket_score

For a given query pocket, the Euclidean distance is computed against all pockets in the dataset and then the final prediction of the binding ligand is made using *Pocket_score* (Eqn. 7). Since the final prediction depends on the number of closest pockets to consider (the parameter *k* in Eqn. 20), we examined the effect of the value of *k* on the resulting success rate. In Figure 6, the average success rate of the nine ligands in the Kahraman set for *k* = 1 to 35 is plotted. Fig. 6A shows the results of the 2D pocket model with the p-Z moments, while Fig. 6B shows the results of the 3DZD. Three pocket descriptors are tested: the surface shape (G in Fig. 6) combined with the pocket size, the surface electrostatic potential (E) combined with the pocket size (*w* = 4.5 for the p-Z and 0.04 for the 3DZD as determined in Fig. 5), and the average of the distances given by these two descriptors (G+E). In the Top1 results, a prediction counts as successful only when the highest scoring ligand is the correct one, while the Top3 results allow the correct answer to lie among the three highest scoring ligands.

When only the top scoring ligand is counted (TOP1 curves in Fig. 6), increasing the number of closest pockets considered in the scoring function does little to improve the accuracy. However, it makes a dramatic improvement when the top three ligands are considered (TOP3). In the TOP3 prediction, the success rate improves sharply, by roughly forty to fifty percentage points, when *k* is set to 10 or higher as compared with the results with *k* = 1. The improvement is more significant for the 3DZD than for the p-Z 2D pocket model. For all three pocket descriptors, the success rate gradually increases until *k* reaches about 15 to 25. We decided to use *k* = 24 for both the p-Z moments and the 3DZD in the subsequent analysis, because it gives the second best success rate by the pocket shape descriptor (G) in the TOP3 prediction (82.0% for the p-Z and 90.0% for the 3DZD) and also gives good performance by the combination of the shape and the electrostatic potential descriptor (G+E).
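The role of *k* and the Top1/Top3 criteria can be illustrated with a small sketch. The exact form of *Pocket_score* is not reproduced in this section, so a simple rank-weighted vote (weight 1/rank) stands in for it; this weighting, the function names, and the toy hits are all hypothetical.

```python
from collections import defaultdict

def rank_ligands(neighbors, k):
    """Rank ligand types using the k closest database pockets.
    neighbors: list of (distance, ligand) pairs.
    The 1/rank weighting is a stand-in, not the paper's Pocket_score."""
    closest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    scores = defaultdict(float)
    for rank, (_, ligand) in enumerate(closest, start=1):
        scores[ligand] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism
    return sorted(scores, key=lambda lig: (-scores[lig], lig))

def is_hit(ranking, true_ligand, top_n):
    """Top1 / Top3 style check: is the true ligand among the first top_n?"""
    return true_ligand in ranking[:top_n]

# Toy search result for a FAD-binding query (made-up distances)
neighbors = [(0.2, "HEM"), (0.3, "FAD"), (0.4, "FAD"),
             (0.9, "ATP"), (0.5, "FAD")]
ranking_k1 = rank_ligands(neighbors, k=1)
ranking_k4 = rank_ligands(neighbors, k=4)
print(ranking_k1, ranking_k4)
```

With *k* = 1 the single closest pocket (HEM) decides the prediction, whereas with *k* = 4 the three FAD pockets that follow it outweigh that hit.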

### Retrieval accuracy of individual pocket types in the Kahraman dataset

In the previous sections, we examined the pocket retrieval performance of three different 2D moments and a 3D pocket model using the 3DZD and determined the parameters for the pocket retrieval. Here we further discuss the retrieval accuracy of individual pocket types. We first show the results on the Kahraman dataset, as it was used to examine the types of moments and the parameters. We then show the results on the Huang dataset, which is independent of the Kahraman set. For both datasets, the performance of the p-Z moments and the 3DZD is compared.

Table 4 gives the success rate of retrieving individual binding ligands using *w* = 4.5 for the p-Z moments (Table 4A) and *w* = 0.04 for the 3DZD (Table 4B), with *k* = 24 in both cases. For both the p-Z moments and the 3DZD, all three pocket descriptors (G, E, G+E) perform far better than random retrieval. The pocket shape descriptor (G) consistently shows the best average success rate in the TOP3 prediction for the p-Z moments (76.3%) and for the 3DZD (82.7%). Adding the electrostatic information to the shape information, *i.e.* the G+E descriptor, makes a small improvement in the TOP1 prediction in the case of the p-Z moments (Table 4A, from G: 41.2% to G+E: 41.4%), but gives the same success rate for the 3DZD (Table 4B, both G and G+E give 36.1%). In terms of the TOP3 prediction, adding the electrostatic information slightly deteriorates the success rate for both the p-Z moments and the 3DZD. This is consistent with a recent observation by the Thornton group, which reports that the electrostatic potential in ligand binding pockets is highly variable within families^{83}. Comparing the performance of the p-Z moments and the 3DZD, the p-Z moments show a slightly higher success rate in the TOP1 prediction while the 3DZD shows a higher success rate in the TOP3 prediction. The success rate differs from ligand to ligand, and these trends are quite consistent for both the p-Z moments and the 3DZD. For example, in TOP3 with the shape and the size descriptor (G), ATP, GLC, and FAD perform well while FMN and STR show poorer results. This implies that the difference in performance for each ligand is attributed not to the characteristics of our 2D/3D pocket models but to the actual similarity or divergence of pocket shapes of particular ligand types. The low retrieval accuracy of FMN can be explained by two main factors. First, among the three smallest ligand types (GLC, FMN and STR), FMN is the most flexible one, with an average RMSD of 1.08 Å. Second, it is the third most under-represented ligand type in the Kahraman dataset (6 structures), hence random retrieval will be relatively less accurate.

Table 4A also gives the retrieval results obtained by using the pocket size alone (given in Table 3). It turned out that PO4 can be perfectly retrieved by just using the size, as it is the smallest ligand in the Kahraman set. The overall retrieval accuracy with the pocket size is 50.6% for the Top3 value, which might seem relatively high. But this is qualitatively consistent with the observation by Kahraman *et al.*^{56}, who made the dataset and reported that the AUC value obtained using size information alone is as high as 0.73, as compared with 0.77 achieved by combining the pocket shape and the size information (Table 2). In comparison with the performance by the pocket shape and size descriptor (G) of the p-Z moments (Table 4A) and the 3DZD (Table 4B), the addition of the shape information improves or ties the result in all the cases of Top1 and Top3 values except for three (the Top1 value for NAD and the Top3 value for HEM by the p-Z moments, and the Top1 value for AMP by the 3DZD). Thus, overall the pocket shape information makes an effective contribution to improving the retrieval accuracy.

The pocket retrieval success rate in Table 4 roughly agrees with the all-against-all pocket distances shown in Figure 7, which visualizes the Euclidean distance of all the pocket pairs. Figures 7A and 7B show the distance by the p-Z moments and the 3DZD, respectively. It can be seen that all the ligand binding pockets have a close distance to the other pockets of the same type (*i.e.* diagonal squares are in darker gray). However, ligands with a poor retrieval success rate also show similarity to the other ligand types. For example, AMP binding pockets seem to be close to some of the ATP binding pockets, and FMN binding pockets have a close distance to ATP binding pockets. HEM and NAD binding pockets also seem to be similar. Figure 8 examines the mutual distance of the binding pockets of four individual ligands, AMP, ATP, FMN, and STR. For all four ligands, binding pockets which are relatively distant from the other members of the same ligand type tend to fail in binding ligand prediction. For example, in the case of AMP binding pockets with the 3DZD, 8gpbA and 1c0aA failed in the TOP3 results.


To better understand the pocket prediction process, we closely examined search results for individual FAD binding pockets by the p-Z moments (Table 5). Table 5 shows two successful and two failed cases. 1eviB is a very successful case where five other FAD binding pockets are retrieved within the top 5 ranks. In the case of 1e8gB, the search retrieved three FAD binding pockets contaminated with HEM binding pockets, which resulted in the second rank in the final prediction. On the other hand, 1cqxA did retrieve FAD binding sites within the top 25, but the top hits are dominated by four other ligand types. No FAD binding pocket is retrieved within the top 25 in the case of 1jr8B.

### Retrieval Accuracy on the Huang dataset

We further tested the pocket models with the p-Z moments and the 3DZD on another independent dataset using the same parameters determined based on the Kahraman dataset (Table 6). This dataset, the Huang dataset, consists of 175 proteins, each of which binds one of twelve ligands, and has no overlap in proteins or ligand types with the Kahraman set.

On this dataset, both the p-Z moments and the 3DZD show a lower success rate by the pocket shape descriptor (G) as compared to the results for the Kahraman set (Table 4). For example, the Top3 success rate by the pocket shape descriptor of the p-Z moments is 75.6%, which is a small decrease from 76.3% shown for the Kahraman set, while the shape descriptor of the 3DZD shows a larger decrease, from 82.7% (on the Kahraman set) to 59.8%. On the other hand, we observe a higher success rate of the electrostatic potential descriptor (E) on this dataset relative to the Kahraman set, both by the p-Z moments and the 3DZD. Both of our pocket models consistently show significantly better performance than random retrieval. In the case of the 3DZD (Table 6B), the combination of the shape and the electrostatic potential descriptors (G+E) shows improvement over the shape descriptor (G) due to the aforementioned two reasons, *i.e.* the decrease in the success rate by the shape descriptor (G) and the good performance by the electrostatic potential descriptor (E) on this dataset.

### Comparison with existing methods

Next, we compare the pocket retrieval performance of our methods with two other existing methods, eF-Seek^{84} (http://ef-site.hgc.jp/eF-seek/) and SitesBase^{54} (http://www.modelling.leeds.ac.uk/sb/). Both methods mainly use geometrical shape information for quantifying similarity of pockets. eF-Seek represents the protein surface as triangle meshes and employs a graph matching algorithm to seek similar local sites stored in the eF-Site database^{55}. SitesBase, on the other hand, uses a geometric hashing algorithm to identify equivalent atom constellations between pairs of ligand binding sites. Readers should be reminded that in principle an entirely fair comparison of existing servers is not possible, as each method has been trained on a different dataset and currently contains a different set of binding pockets in its own database. Hence this comparison is meant only to provide a rough idea of how our methods perform in comparison with others.

Since the databases used by these methods are different, we first identified proteins in Tables 1A and 1B that are included in the databases of both eF-Seek and SitesBase. Then, among the 21 ligand types in total in Table 1 (nine in Table 1A and twelve in Table 1B), we selected twelve ligands for which most of the listed proteins are included in the eF-Seek and SitesBase databases. They are AMP, ATP, FAD, FMN, GLC, HEM, NAD, F6P, GAL, GUN, MMA, and PLM. The proteins for these ligands which are also found in eF-Seek and SitesBase are underlined with a single or double line. For each of these ligands, three randomly selected proteins are used as queries (those underlined with a double line in Table 1). eF-Seek and SitesBase return different numbers of pockets in a result file. Out of the proteins they return, we only selected underlined proteins and computed the evaluation values, *i.e.* Top1, Top3, and AUC values. The query protein, which appears as the top hit, is discarded in the evaluation. For eF-Seek results, we sorted the retrieved proteins by the distance from the ideal score, which we define as

We ranked the eF-Seek results in this way because eF-Seek does not rank the retrieved proteins but only provides a 2D plot of the Z-score and the coverage. The maximum Z-score value for the queries is 13.4 and the maximum coverage value is 1.0. The AUC value of the two methods is computed by adding the selected proteins in Table 1 that are missing from the results, in random order, to the end of the list of retrieved proteins, assuming that these proteins are retrieved in that order. The AUC was computed ten times for eF-Seek and SitesBase, and the average and the standard deviation are recorded.
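The padding procedure used for the AUC of the two servers can be sketched as below: proteins missing from a result list are appended in random order, and the AUC is averaged over repeated shuffles. The function names, the fixed seed, and the toy lists are hypothetical; only the ten repetitions follow the text.

```python
import random
import statistics

def rank_auc(labels):
    """AUC of a ranked list, best hit first.
    labels[i] is True when the i-th hit binds the same ligand as the query."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    seen_neg = 0
    correct_pairs = 0
    for is_pos in labels:
        if is_pos:
            # Negatives ranked below this positive are correctly ordered
            correct_pairs += n_neg - seen_neg
        else:
            seen_neg += 1
    return correct_pairs / (n_pos * n_neg)

def padded_auc(returned, missing, repeats=10, seed=0):
    """Append the missing proteins in random order and average the AUC."""
    rng = random.Random(seed)
    values = []
    for _ in range(repeats):
        tail = list(missing)
        rng.shuffle(tail)
        values.append(rank_auc(list(returned) + tail))
    return statistics.mean(values), statistics.stdev(values)

# Toy result: the server returned three hits and missed three proteins
mean_auc, std_auc = padded_auc([True, False, True], [True, False, False])
print(round(mean_auc, 3), round(std_auc, 3))
```

The standard deviation reflects only the random placement of the missing proteins, so it shrinks to zero when every missing protein has the same label.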

The results are summarized in Table 7. The 3DZD pocket model shows the best performance among all methods in terms of all the metrics, *i.e.* Top1, Top3, and AUC. The p-Z pocket model comes second in terms of the AUC and ranks third, following SitesBase, in the Top1 and Top3 values. Overall, our 2D and 3D pocket models show better or comparable performance to the two methods compared.

### Performance with ligand-free pockets

The shape of a ligand binding pocket changes slightly upon binding of the ligand molecule. To investigate how well our pocket models perform for ligand-free pockets, here we searched the Huang set (Table 1B) with ligand-free pocket shapes. The Huang set is very suitable for this type of testing since it originates from a list of pairs of ligand-bound and ligand-free forms of binding sites^{46}. Ligand binding amino acid residues in the ligand-free proteins are identified by sequence and structural alignment with the ligand-bound proteins. The RMSD value between ligand-bound and ligand-free proteins ranges from 0.19Å to 2.48Å with an average value of 0.86Å. This is consistent with a thorough survey by Brylinski and Skolnick^{85}, which reports that the average RMSD between ligand-bound and ligand-free forms is 0.74Å.

Figure 9 shows the AUC values of the ligand-bound pockets (filled dots) and ligand-free pockets (open squares) of the twelve ligand types using the shape descriptor (G in Tables 4 and 5). Figure 9A shows the results by the p-Z moments and Figure 9B those by the 3DZD. The AUC values of ligand-free pockets are not particularly worse than those of ligand-bound pockets. This may be because the shapes of most of the ligand-free pockets do not differ much from their ligand-bound forms when compared with the variation of shapes among the ligand-bound pockets (the largest average RMSD over all the pairs of ligand-binding pockets within the same ligand type is 3.19Å, which is in the case of RTL). In many cases, the AUC value of the ligand-free pockets is similar to the value of the closest ligand-bound pockets (shown in open circles).

### Performance with predicted pockets

In an actual binding ligand prediction, binding pockets may not be known beforehand and hence need to be predicted. To simulate this scenario, we examined the retrieval accuracy obtained by using a predicted pocket in a query protein. A pocket in a query protein is predicted by LIGSITE^{43}, which identifies pocket regions solely by geometrical information. Table 8 shows the retrieval accuracy by the p-Z moments and the 3DZD on the Kahraman dataset. Compared to Tables 4A and 4B, the TOP3 value dropped to almost half for both the p-Z moments and the 3DZD. The AUC value also shows a large decrease, from 0.79 to 0.52 for the p-Z moments and from 0.81 to 0.53 for the 3DZD. We have also constructed a database of pockets predicted by LIGSITE and queried it with the predicted pockets, but this procedure did not improve the results (data not shown). The retrieval performance with predicted binding pockets largely relies on the accuracy of binding pocket predictions. We can expect that the results will improve by employing recent, more accurate binding pocket prediction methods^{45}^{;}^{86}, which combine geometrical information with sequence information.

### Implementation and computational speed of the algorithm

The programs for computing the pocket models and those for performing the pocket comparison were written in C. We implemented the program, named Pocket-Surfer, on a web page, http://kiharalab.org/pocket-surfer/. In the demo version, searches within the benchmark dataset by using the p-Z moments can be performed by specifying a PDB ID in Table 1 as a query, so that readers can experience how the method works and reproduce the results reported in this paper. In addition, we have also implemented an alpha version, from which a user-specified PDB entry can be used as a query. In this version, a pocket is detected by LIGSITE^{43} in a query protein and the detected pocket is scanned against the benchmark dataset of the 100 pockets. In both versions, pockets are compared in terms of shape and size.

The running speed of each subtask of the programs is shown in Table 9. These numbers are the average of five executions on a Linux computer with a Pentium 4 3.0 GHz processor. Here, the largest pocket on the surface of a protein, 1h2h, is extracted by LIGSITE^{43}. Next, the p-Z moments and the 3DZD are computed for the pocket, which is then compared with the 100 pockets in the database used in this study, and finally the 100 pockets are sorted by the distance to the query pocket of 1h2h to make the binding ligand prediction for the query. It should be noted that the database search can be performed very rapidly with our method once the descriptor of the query pocket is computed. Searching against the database of the 100 pockets took only 0.0125 seconds for the p-Z moments and 0.023 seconds for the 3DZD. We also measured the search speed for a database of 200 pockets (by doubling the number of pockets in the dataset), which resulted in 0.014 and 0.034 seconds for the p-Z moments and the 3DZD, respectively. Extrapolating these two measured search speeds, a search against a larger database of 62200 pockets, which is the number of entries in the whole PDB database as of the writing of the paper, would take only 0.94 and 6.85 seconds by the p-Z moments and the 3DZD, respectively. Recall that pocket descriptors to be stored in the database can be pre-computed just once, as our pocket comparison method does not need pose normalization of pockets for comparison. Note that this is much faster than eF-Seek, where a search on the website takes at least a couple of hours and often a couple of days.
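The extrapolated search times can be checked with a short calculation. The sketch assumes that the search time grows linearly with the number of pockets and fits a straight line through the two measured points; the function name `extrapolate_search_time` is hypothetical.

```python
def extrapolate_search_time(n1, t1, n2, t2, n_target):
    """Fit a line through two (database size, time) measurements
    and evaluate it at n_target. Assumes linear scaling."""
    slope = (t2 - t1) / (n2 - n1)       # seconds per additional pocket
    intercept = t1 - slope * n1         # fixed overhead in seconds
    return intercept + slope * n_target

# Measured search times for 100 and 200 pockets, as reported above
pz_time = extrapolate_search_time(100, 0.0125, 200, 0.014, 62200)
zd_time = extrapolate_search_time(100, 0.023, 200, 0.034, 62200)
print(f"{pz_time:.2f} s, {zd_time:.2f} s")   # prints "0.94 s, 6.85 s"
```

Under this linear assumption the two measurements reproduce the reported estimates of 0.94 and 6.85 seconds for 62200 pockets.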

## Discussion

In this paper, we have compared two methods for describing ligand binding pockets, one that uses the two dimensional p-Z moments and another one using the 3DZD. Our pocket representations are quite versatile in the sense that many different properties of pockets, not only the geometrical shape but also various physicochemical properties, can be naturally represented and combined. The 2D pocket model with the p-Z moments and the 3D model with the 3DZD successfully retrieve pockets with the same binding ligand molecule in 76.3% and 82.7% of the cases within the top 3 closest hits (Tables 4A & 4B). The performance of our methods compares favorably with that of similar methods, including the spherical harmonics^{56}-based method, eF-Seek, and SitesBase, in terms of the pocket retrieval success rate.

A significant advantage of our method is its very fast computational speed for a database search. Using Pocket-Surfer, searching a binding pocket within the whole PDB database could be performed on a desktop computer in 1 second using the 2D pocket model and in about 7 seconds using the 3D pocket model. Note that the aforementioned spherical harmonics-based method needs pose normalization of pockets as a preprocessing step of shape comparison, since the spherical harmonics vary when an object is placed in different orientations. Pose normalization not only adds computational cost but could also introduce errors in the comparison of protein shapes, since they are almost globular and determining the principal axes may not be robust.

As discussed in the Introduction, a local structure-based function prediction procedure includes two steps: detection of a potential ligand binding pocket in a query protein, followed by matching the query pocket against a database of known binding pockets. This study focuses on the second step, database searching. We showed that the two pocket models we introduced, one using the p-Z moments and another one using the 3DZD, perform reasonably well and compare favorably with the other existing methods. However, the experiment with pockets predicted by LIGSITE (Table 8) indicates that performance also depends largely on the accuracy of the binding pocket detection step. Therefore, establishing a well coordinated procedure of detecting (predicting) and searching binding pockets is left as an important future direction of this work. Another key development for improvement is to investigate combinations of other features of pockets into the descriptor, such as hydrophobicity, flexibility or the degree of residue conservation. An intrinsic weakness of any shape-based method is that it is sensitive to changes of pocket shape due to any cause, including the flexible nature of pockets, prediction errors, and binding of water molecules. Thus, combining other features is aimed at compensating for the limitation of shape information.

The urgency and importance of computational characterization of protein structures have become clear as an increasing number of solved protein structures remain of unknown function. However, computational protein structure analysis significantly lags behind sequence analysis^{7} in a practical sense, since until recently almost no methods for global/local structure comparison had been developed that run fast enough to handle large-scale data. In contrast, real-time sequence database search was realized more than a decade ago by BLAST^{8} and FASTA^{10}, and most existing sequence analysis methods^{7} for homology, domain^{12}^{;}^{87}, and motif searches^{14}^{;}^{88} can be performed in real time. As a solution for fast protein structure database search, we have recently proposed to represent proteins with their surface shape^{73}^{;}^{89} and employed the 3D Zernike descriptors, which achieve real-time global protein structure database search^{70}^{;}^{71}^{;}^{75}. Along the same line, here we developed a method for real-time protein local pocket shape search. We believe that the fast protein local shape comparison method together with the global shape comparison methods have paved the way for further developments of fast and convenient protein structure analysis methods.

## Acknowledgments

This work was supported by grants from the National Institutes of Health (R01GM075004, U24 GM077905) and the National Science Foundation (DMS0604776, DMS0800568, EF0850009, IIS0915801). The authors are grateful to Gregg Thomas for proofreading the manuscript.

## Reference List
