Virtual screening scheme (adapted from Walters et al1).
A main goal of virtual screening is to select activity-enriched sets of molecules, or single molecules exhibiting a desired activity, from the space of all synthetically accessible structures. Currently the most advanced HTS techniques allow for testing of ~10^5 compounds per day, and a typical corporate screening library contains several hundred thousand samples. Although these facts alone represent a technological revolution, the turnover numbers are still extremely small compared with the total size of chemical space.1 As a consequence, even ultra-HTS combined with fast, parallel combinatorial chemistry can only be successful if a reasonable pre-selection of molecules (or molecular building blocks) for screening is done. Otherwise this approach will more or less represent a random search with a very small probability of success. While HTS and ultra-HTS have made significant progress in recent years, we should bear in mind that it will be very costly to screen a million compounds for activity in all the new receptor assays (estimated $0.1 to $10 per compound per screen). Even if a company has these resources, it rarely has access to a diverse one-million-compound screening library. Thus it can be advantageous to integrate VS tools into the drug discovery process to find leads with novel scaffolds, starting either from competitor compounds described in the literature or from a proprietary, existing scaffold. Once a reliable VS process has been defined, it can save resources and limit experimental efforts by suggesting defined sets of molecules.
To reliably calculate prediction values or properties, the molecules under investigation must be represented in a suitable fashion. In other words, the appropriate level of abstraction must be defined to perform rational VS. A convenient way to do this is to employ molecular descriptors, which can be used to generate molecular encoding schemes ranging from general properties (e.g., lipophilicity, molecular weight, total charge, volume in solution, etc.) to very specific structural and pharmacophoric attributes (e.g., multi-point pharmacophores, field-based descriptors).2 Filtering tools can be constructed using a simplistic model relating the descriptors to some kind of bioactivity or molecular property. However, the selection of appropriate descriptors for a given task is not trivial, and careful statistical analysis is required.
Graphical representations of four different types of trees. Human perception easily classifies these patterns as “tree”—a difficult task for technical information processing systems.
A chemist's decision as to which molecules to synthesize next is usually based on the available facts about a particular project, expert knowledge that was acquired over years, and to some extent on intuition. Software containing knowledge about a particular, limited, real-world problem can assist in this decision-making process (Definition 2.1).11,12 It is important to note that virtual screening systems are thought to complement the abilities of a human expert, e.g., by analyzing very large sets of data and prioritizing many different designs. Some aspects of human decision-making and reasoning can be adapted or mimicked by “intelligent” software, but many features will probably remain a domain of human perception, cognition, and intuition.
Expert systems are computer programs that help to solve problems at an expert level.12
Some important scaffold structures that are amenable to solid phase combinatorial synthesis
Additional important inference rules are the modus tollens, and the chain rule which combines several implications:
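In standard propositional notation, with A, B, and C as generic propositions chosen here for illustration (they are not taken from the original), the two rules can be written as:

$$\frac{A \rightarrow B, \quad \neg B}{\neg A} \quad \text{(modus tollens)}$$

$$\frac{A \rightarrow B, \quad B \rightarrow C}{A \rightarrow C} \quad \text{(chain rule)}$$

For example, from “if a compound is a pyrimidine, it is active in assay X” and “this compound is not active in assay X”, modus tollens yields “this compound is not a pyrimidine”.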
Deductive logical programming using predicate logic is based on such rules.13,14 They can be used to derive hypotheses from true facts, i.e., they only consider the syntactic structure of the expressions. Several applications of logical reasoning systems have been described in the context of drug design purposes.15–17 Induction seems to be especially suited to perform learning from examples, and the chemical similarity approach can be regarded as founded on this concept, as illustrated by the following simplifying case:
It must be stressed that induction is useful to derive new hypotheses but is not a legal inference in a strict sense. This simply means that the conclusion “All pyrimidines are active in assay X” can be wrong. In contrast, deduction is denoted a legal inference because if only true axioms are given then the conclusions drawn by deduction are also true. For further details on logical reasoning and inferencing, see the literature.13,18–20 Inductive logic programming (ILP)18 represents a relatively new addition to the field of logic programming, which seems to be appropriate for SAR and SPC (structure-property correlation) modeling tasks.15 According to Plotkin,21 an inductive learning task can be described using a background theory (facts and rules), sets of positive and negative examples (e.g., active and inactive molecules), a candidate hypothesis and a partial ordering system for alternative hypotheses, where the following conditions apply:
1. The background knowledge should not be sufficient to explain all positive examples; otherwise the problem would already be solved (prior necessity).
2. The background knowledge should be consistent with all negative and positive examples (prior satisfiability).
3. The background knowledge and the hypothesis should together explain all positive examples (posterior sufficiency) and should not contradict any of the negative examples (strong posterior consistency); in the presence of noise, logical consistency is sufficient (weak posterior consistency).
4. If there are several hypotheses which fulfill conditions 1, 2, and 3, then the most general hypothesis should be selected as the result.
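The first three conditions can be illustrated with a deliberately simplified sketch. It replaces logical entailment by a "coverage" test: the background knowledge and the hypothesis are modelled as predicates that return True when they explain an example. The function name, the toy molecules, and their attributes are all hypothetical and serve only as an illustration; a real ILP system reasons over clauses, not over Python predicates.

```python
def check_ilp_conditions(background, hypothesis, positives, negatives):
    """Check prior necessity, prior satisfiability, posterior sufficiency and
    strong posterior consistency in a toy coverage model.

    background, hypothesis: predicates (example -> bool) that return True
    when they explain an example (a stand-in for logical entailment).
    """
    def explained(example):
        # Background knowledge and hypothesis taken together.
        return background(example) or hypothesis(example)

    return {
        # Background alone must NOT already explain every positive example.
        "prior_necessity": not all(background(e) for e in positives),
        # Background alone must not explain (contradict) a negative example.
        "prior_satisfiability": not any(background(e) for e in negatives),
        # Background plus hypothesis must explain every positive example.
        "posterior_sufficiency": all(explained(e) for e in positives),
        # Background plus hypothesis must not explain any negative example.
        "strong_posterior_consistency": not any(explained(e) for e in negatives),
    }


# Made-up toy data: each molecule is a dict of illustrative attributes.
positives = [
    {"ring": "pyrimidine", "known_active": True},
    {"ring": "pyrimidine", "known_active": False},
]
negatives = [{"ring": "benzene", "known_active": False}]

background = lambda mol: mol["known_active"]          # facts already on record
hypothesis = lambda mol: mol["ring"] == "pyrimidine"  # candidate rule

result = check_ilp_conditions(background, hypothesis, positives, negatives)
```

With this toy data all four checks hold. A hypothesis that accepts every molecule would still satisfy posterior sufficiency but would fail strong posterior consistency, because it also covers the negative example.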
A Venn diagram representing the class relationship of the 20 genetically coded amino acids according to Taylor.25 This grouping of residues has been successfully applied to finding generalizing patterns in amino acid sequences.
One particular such rule is listed below. It was generated by a modified version of PROMIS and gives an idea of characteristic features found in peptide substrates of the mitochondrial processing peptidase (MPP):26–28
In a given matching sequence, this MPP cleavage site pattern starts with the class “ALL”. Moving towards the C-terminal end of the sequence, the next position must match one of the residues described by “VERY_HYDROPHOBIC OR SMALL OR LYSINE”, the second next residue must be positively charged, and so on. Such machine-generated rules can help to find and understand function-determining patterns in amino acid sequences. This general feature extraction approach complements other pattern matching routines used in sequence analysis.29,30 Its principle of using generic molecular descriptors (here: residue classes) is very similar to establishing an SAR model for drug design by adaptive rule formation.
By help of virtual library construction hitherto unknown parts of chemical space can easily be explored, and
the speed and throughput of virtual testing (fitness or quality calculations) can be far ahead of what is possible by means of “wet bench” experimental systems.
| Database | No. of molecules | Description |
|---|---|---|
| ACD (a) | > 250,000 | Available Chemicals Directory; catalogue of commercially available specialty and bulk chemicals from over 225 international suppliers |
| Beilstein (b) | > 7,000,000 | Covers organic chemistry from 1779 |
| CSD (c) | > 200,000 | Cambridge Structural Database; experimentally determined three-dimensional structures of small molecules |
| CMC (a) | > 7,000 | Comprehensive Medicinal Chemistry database; structures and activities of drugs having generic names (on the market) |
| MDDR (a) | > 85,000 | MACCS-II (MDL) Drug Data Report; structures and activity data of compounds in the early stages of drug development |
| MedChem (d) | > 35,000 | Medicinal Chemistry database; pharmaceutical compounds |
| SPRESI (d) | > 3,400,000 | Substances and bibliographic data abstracted from the world's chemical literature |
| WDI (e) | > 50,000 | World Drug Index; pharmaceutical compounds from all stages of development |

(a) Molecular Design Limited, San Leandro, CA, U.S.A.
(b) Beilstein Informationssysteme GmbH, Frankfurt, Germany.
(c) CSD Systems, Cambridge, U.K.
(d) Daylight Chemical Information Systems Inc., Claremont, CA, U.S.A.
(e) Derwent Information, London, U.K.
Assessment of the diversity of a compound library is often a first step in virtual screening. The most relevant approach is clearly to assess the diversity space using chemical criteria, and several algorithms are now available to do that. It is likely that after diversity analysis and extensive experimental screening of the library, at several targets and target classes, the structure-activity database will point to areas of success and failure in terms of identifying leads. Thus the library may be said to be “GPCR-modulator rich”, “kinase-inhibitor poor”, etc. An experiment-based understanding of the screening library diversity should also reveal compounds that are “frequent hitters”, i.e., compounds that are not necessarily chemically reactive, but have structures that repeatedly bind to a range of targets via unspecific interactions or cause a false-positive signal for other assay-inherent reasons. Clearly, removal of these compounds from the library is an advantage in HTS, as is an understanding of the reason for their promiscuity of interaction. A further issue relates to identifying a screening library subset, ostensibly representative of the diversity of the whole library, that is screened at all targets, usually as a priority in the screening campaign. Assessment of chemical versus operational understanding of diversity is critical in the design of the library subset. Moreover, there are advantages in screening the whole library. First, since HTS or uHTS is generally unconstrained by cost or compound usage, it is as easy to screen 250k compounds as it is to screen 25k. Second, a full-library screening campaign increases the likelihood of finding actives, especially for difficult targets, as well as finding multiple structurally distinct leads.
Indeed, a direct comparison reported from Pfizer noted that screening a representative library missed 32 of the 39 leads that were found by screening the whole library.31 Alternatively, Pharmacopeia have reported that receptor antagonists for the CXCR2 receptor and the human bradykinin B1 receptor were derived from the same 150k compound library, made using the same four combinatorial steps. Notably, this library was neither based on known leads in the GPCR field nor specifically targeted towards GPCRs. On the other hand, researchers at Organon reported that it is possible to rationally select various “actives” from large databases using appropriate “diversity” selection and “representativity” methods.32
The introduction of combinatorial chemistry and HTS, together with the availability of large compound collections, has put us in the comfortable position of having a large number of hits to choose from for lead optimization, at least for certain classes of drug targets. We anticipate that while the size of the compound libraries and the number of high-throughput screens will continue to increase, leading to a larger number of hits, the number of leads actually being followed up per project will remain roughly the same. The challenge is to select the most promising candidates for further exploration, and computational techniques will play a very important role in this process. Assuming a hit rate of 0.1–1% and a compound-collection size of 10^6 compounds, we have (or will have) about 1k–10k hits that are potential starting points for further work. It is important to realize that while the screening throughput has increased significantly, the throughput of a traditional chemistry lab has not. While it is true that automated and/or parallel chemistry is now routinely used, there are still many molecules that are not amenable to these more automated and high-throughput approaches. Therefore the question is: “How can subsequent lead optimization fully exploit this vast amount of information?” Computational techniques can be used to address the question in a variety of ways:33
Many of the hits are false positive or “frequent-hitters”, which means that the observed effect in the assay is not due to the specific binding to a certain pocket in the molecular target. Docking techniques can be used to place all compounds from the chemical library in the binding pocket and score them. If HTS data are available, a comparison of the in silico docking results and the assay data can be used to prioritize the hits and focus the subsequent work on the more promising candidates. If no HTS data are available (e.g., when no assay amenable to HTS is possible or if no hits were obtained) then docking can be used to select compounds for biological testing. It should be mentioned that correlating docking results with HTS data will surely give a negative bias against allosteric inhibitors. The results obtained by docking procedures must therefore be analyzed with great care.
Many of the compounds have undesirable properties such as a low solubility, or high lipophilicity. In silico prediction tools can be used to rank the HTS hits. While it is generally true that insufficient PC properties can be remedied in the lead optimization process (e.g., large, lipophilic sidechains can be removed, or a “prodrug” can mask highly acidic groups), it may be advisable if possible to focus on compounds without obvious liabilities.
Toxicity and metabolic stability are extremely important parameters in the process of evaluating a compound for further development. In the past, these parameters were only taken into account at the later stages of the drug discovery process, partly because these parameters are time-consuming to establish and partly because small modifications to a molecule are known to have dramatic effect on these parameters. However, the availability of large databases on toxicity and metabolism has now increased the chance to sensibly relate chemical structures to these effects and to develop alert systems that again can be used to prioritize the hits. There exist however mixed views of this. While one may believe that ADMET predictions are obviously important and will become increasingly accurate in the future, it might still be reasonable to hesitate to apply such methods as a basis to prioritize HTS hits. It is likely that these hits will have little resemblance with the final clinical candidate, so these predictions may not be very relevant so early on. However, once there is a lead series, we should accelerate our effort to formulate a system-specific ADMET tool that can handle these lead structures/classes.
Chemical similarity searching is a straightforward practical approach to identify candidate molecules by pair-wise comparison of compounds. In its simplest form, the result of a similarity search in a compound database is a ranked list, where high-ranking structures are considered to be more similar to the query in a certain sense than low-ranking molecules. If either the query structure(s) or the database structures or both structures reveal a certain (desired or undesired) property or activity, some conclusions may be drawn for the molecules under investigation. Structures are compared based on a similarity value that is calculated from their molecular descriptors. There are two assumptions inherent to this idea, representing the hypothesis “if molecule A is more similar to the query molecule R than molecule B, then molecule A might more likely show some biological activity that is comparable to the activity of R”:
The molecular representation (descriptor) is assumed to appropriately cover those molecular attributes which are relevant for the underlying SAR/SPR.
The similarity measure applied is assumed to accurately relate differences in molecular descriptions to differences in the quality function (Principle of Strong Causality).35
In the past, the analysis of assay data was primarily performed by medicinal chemists, who looked at the active compounds and then decided which hits the efforts should be focused on. This practice has two limitations. First, with the increase in the number of experimentally determined hits, this approach becomes increasingly ineffective, and computational techniques are increasingly used to classify the hits and derive hypotheses. Second, one should keep in mind that it is basically impossible for a human being to also take into account the large number of inactive compounds. The development of pharmacophore hypotheses, for example, typically requires the incorporation of information on inactive compounds.
By similarity searching, sets of candidate structures can be rapidly compiled from databases or virtual chemical libraries. Practical experience shows that such hypotheses are often weak and there clearly is no cure-all recipe or generally valid hypothesis leading to success in chemical similarity searching. Nevertheless, similarity searching provides a useful concept. A practicable measure of success can be expressed by an enrichment factor, ef, giving the ratio of the fraction of active molecules in the selected subset compared to the fraction of actives in the total pool (database). This value may be regarded as an estimate of the enrichment obtained compared to a random selection of molecules, as given by Equation 2.1.
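The verbal definition of the enrichment factor translates directly into a short calculation. The sketch below follows the wording of the text (fraction of actives in the selected subset divided by the fraction of actives in the total pool); the function and argument names are hypothetical, and the exact symbols of Equation 2.1 are not reproduced here.

```python
def enrichment_factor(actives_selected, n_selected, actives_total, n_total):
    """Ratio of the active fraction in a selected subset to the active
    fraction in the whole pool; 1.0 corresponds to random selection."""
    if n_selected <= 0 or n_total <= 0:
        raise ValueError("subset and pool must contain at least one molecule")
    if actives_total <= 0:
        raise ValueError("the pool must contain at least one active molecule")
    fraction_subset = actives_selected / n_selected
    fraction_pool = actives_total / n_total
    return fraction_subset / fraction_pool


# Illustrative numbers only: 10 actives among 100 selected molecules,
# drawn from a pool of 10,000 molecules that contains 50 actives.
ef = enrichment_factor(10, 100, 50, 10_000)
```

A value above 1 indicates that the selection contains proportionally more actives than the pool; a value below 1 means the selection performed worse than a random pick.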

“The molecular descriptor is the final result of a logical and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.” (according to Todeschini and Consonni)2
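Data scaling is usually the first step of chemical similarity searching, feature extraction, hypothesis generation, and other types of virtual screening and machine learning. The most frequently applied scaling methods include scaling by range (Eq. 2) and scaling by standard deviation (autoscaling, Eq. 3). For most applications autoscaling is the method of choice, leading to data with zero mean and unit variance. In some cases, vector normalization to length one is a necessary preprocessing procedure (Eq. 4).

$$x'_{ik} = \frac{x_{ik} - x_k^{\min}}{x_k^{\max} - x_k^{\min}} \tag{2}$$

where i is the row index, and k is the column index of the raw data matrix X.

$$x'_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}, \qquad \bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_{ik}, \qquad s_k = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ik} - \bar{x}_k\right)^2} \tag{3}$$

$$x'_{ik} = \frac{x_{ik}}{\lVert x_k \rVert}, \qquad \lVert x_k \rVert = \sqrt{\sum_{i=1}^{n} x_{ik}^2} \tag{4}$$

In Equations 3 and 4, n is the number of objects (molecules). Autoscaling results in data vectors scaled to a length of $\sqrt{n-1}$.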
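The three scaling methods can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the standard deviation is assumed to be the sample standard deviation (n − 1 in the denominator), and constant columns are left unscaled to avoid division by zero, which is a choice made here rather than something stated in the text.

```python
import numpy as np


def range_scale(X):
    """Scale every column (descriptor) of X to the interval [0, 1]."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant columns
    return (X - x_min) / span


def autoscale(X):
    """Centre every column to zero mean and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)          # assumption: sample standard deviation
    std = np.where(std > 0, std, 1.0)    # guard constant columns
    return (X - X.mean(axis=0)) / std


def normalize_columns(X):
    """Scale every column vector of X to Euclidean length one."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)
    norms = np.where(norms > 0, norms, 1.0)
    return X / norms


# Illustrative raw data: 4 molecules (rows) described by 2 descriptors (columns).
X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0], [7.0, 5.0]])
X_auto = autoscale(X)
```

After autoscaling, each column has zero mean and unit variance, and its Euclidean length equals the square root of n − 1, where n is the number of rows.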
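Various similarity measures exist that can be used for chemical similarity searching. Very often a distance value $d_{AB}$ between pairs of molecules A and B (i.e., their descriptors $\xi_A$ and $\xi_B$ containing n elements each) forms the basis on which a similarity value is calculated. The frequently used Manhattan distance (also called Hamming distance or City-Block distance; Eq. 5) and the Euclidean distance (Eq. 6) are the first two examples of a general distance metric, the Minkowski or $L_p$-metric (Eq. 7; see Eq. 1.9).

$$d_{AB} = \sum_{i=1}^{n} \left| \xi_{Ai} - \xi_{Bi} \right| \tag{5}$$

$$d_{AB} = \sqrt{\sum_{i=1}^{n} \left( \xi_{Ai} - \xi_{Bi} \right)^2} \tag{6}$$

$$d_{AB} = \left( \sum_{i=1}^{n} \left| \xi_{Ai} - \xi_{Bi} \right|^{p} \right)^{1/p} \tag{7}$$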
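Because the Manhattan and Euclidean distances are the p = 1 and p = 2 cases of the Minkowski metric, a single function covers all three. The sketch below uses plain Python sequences as descriptor vectors; the function names are hypothetical.

```python
def minkowski_distance(xi_a, xi_b, p=2.0):
    """Minkowski (L_p) distance between two descriptor vectors of equal length."""
    if len(xi_a) != len(xi_b):
        raise ValueError("descriptor vectors must have the same number of elements")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(xi_a, xi_b)) ** (1.0 / p)


def manhattan_distance(xi_a, xi_b):
    """City-block distance: the Minkowski metric with p = 1."""
    return minkowski_distance(xi_a, xi_b, p=1.0)


def euclidean_distance(xi_a, xi_b):
    """Euclidean distance: the Minkowski metric with p = 2."""
    return minkowski_distance(xi_a, xi_b, p=2.0)


# Two illustrative two-element descriptors.
d1 = manhattan_distance([0.0, 0.0], [3.0, 4.0])
d2 = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```

For these two vectors the Manhattan distance is the sum of the coordinate differences (3 + 4), whereas the Euclidean distance is the length of the straight line between the points.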
The similarity measure based on the Minkowski distance can be used to express molecular similarity (Eq. 8). Completely similar or identical structures have a similarity value of $s_{AB} = 1$; completely dissimilar molecules have $s_{AB} = 0$.

$$s_{AB} = 1 - \frac{d_{AB}}{d_{AB}(\max)} \tag{8}$$
where $d_{AB}(\max)$ represents the maximal pair-wise distance found in the data set under investigation, e.g., the maximal distance between the query structure and a database compound. Many additional distance and similarity measures have found application in chemical similarity searching.36,37 The Tanimoto coefficient is probably the best-known similarity index that is applied to the comparison of bitstring representations of molecules (although its application is not restricted to dichotomous variables). The set-theoretic definition of the Tanimoto coefficient is given by Equation 9, where $\chi_A$ is the number of bits set to 1 in the bitstring vector coding for molecule A, and $\chi_B$ is the number of bits set to 1 in the bitstring vector coding for molecule B. The range of values of the Tanimoto similarity measure is [0,1] for dichotomous variables.
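A similarity search with either measure can be sketched as follows. The distance-based similarity is taken as one minus the distance divided by the maximal distance between the query and any database compound, as described in the text. For bitstrings, the Tanimoto coefficient is computed in its usual set-theoretic form, the number of bits set in both strings divided by the number of bits set in at least one of them. The function names are hypothetical, and returning 0 for two all-zero bitstrings is a convention chosen here.

```python
import numpy as np


def minkowski_similarity(query, database, p=2.0):
    """Similarity of each database row to the query: 1 - d / d_max."""
    query = np.asarray(query, dtype=float)
    database = np.asarray(database, dtype=float)
    d = np.sum(np.abs(database - query) ** p, axis=1) ** (1.0 / p)
    d_max = d.max()
    if d_max == 0.0:                      # every compound equals the query
        return np.ones_like(d)
    return 1.0 - d / d_max


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two equally long 0/1 bitstrings."""
    a = np.asarray(bits_a, dtype=bool)
    b = np.asarray(bits_b, dtype=bool)
    either = np.count_nonzero(a | b)
    if either == 0:                       # convention for two empty bitstrings
        return 0.0
    return np.count_nonzero(a & b) / either


def rank_by_similarity(scores):
    """Indices of the database compounds, most similar first."""
    return np.argsort(-np.asarray(scores), kind="stable")


# Illustrative query and database descriptors.
scores = minkowski_similarity([0.0, 0.0], [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
ranking = rank_by_similarity(scores)
```

Note that the distance-based similarity depends on the data set through the maximal distance, so its values are comparable only within one search, whereas the Tanimoto coefficient of two bitstrings is independent of the rest of the database.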

Structures retrieved by similarity searching taking Midazolam (left) as the query structure. Top line: Tanimoto/Daylight method; Bottom line: CATS method.
Coding a chemical structure by the CATS topological atom type descriptor. A 2D molecular structure (a) is converted to the molecular graph (b), generalizing atom types are assigned (c), and the frequency of every atom pair with a distance between 1 and 10 bonds is determined (d). For five atom types (lipophilic, L; hydrogen bond donor, D; hydrogen bond acceptor, A; positively charged, P; negatively charged, N) there are 15 possible pairs, resulting in a 15 × 10 = 150-dimensional histogram representing a molecular structure. In (d) an L-A pair over nine bonds is shown.
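The counting step described in this caption can be sketched in a few lines. This is a simplified illustration and not the CATS software itself: atom types are assumed to be supplied by the caller (one letter per atom, or None for untyped atoms), every atom carries at most one type, and no scaling of the counts is applied. The input format and function names are hypothetical.

```python
from collections import deque
from itertools import combinations_with_replacement

ATOM_TYPES = "LDAPN"  # lipophilic, donor, acceptor, positive, negative
# The 15 unordered type pairs: LL, LD, LA, LP, LN, DD, DA, ...
PAIRS = ["".join(p) for p in combinations_with_replacement(ATOM_TYPES, 2)]


def bond_distances(adjacency, start):
    """Breadth-first search: shortest path length (in bonds) from start
    to every reachable atom of the molecular graph."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def atom_pair_histogram(atom_types, bonds, max_dist=10):
    """Count typed atom pairs separated by 1..max_dist bonds.

    atom_types: list with one of 'L', 'D', 'A', 'P', 'N' or None per atom.
    bonds: list of (i, j) atom index pairs.
    Returns a flat list of 15 * max_dist counts (pair-major order).
    """
    n_atoms = len(atom_types)
    adjacency = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)

    hist = [0] * (len(PAIRS) * max_dist)
    for i in range(n_atoms):
        if atom_types[i] is None:
            continue
        dist = bond_distances(adjacency, i)
        for j in range(i + 1, n_atoms):
            if atom_types[j] is None or j not in dist:
                continue
            d = dist[j]
            if 1 <= d <= max_dist:
                # Order the two letters as in ATOM_TYPES so that A-L and L-A coincide.
                pair = "".join(sorted((atom_types[i], atom_types[j]), key=ATOM_TYPES.index))
                hist[PAIRS.index(pair) * max_dist + (d - 1)] += 1
    return hist


# Toy chain of four atoms: L - (untyped) - A - L
hist = atom_pair_histogram(["L", None, "A", "L"], [(0, 1), (1, 2), (2, 3)])
```

The toy chain contains three typed pairs: one L-A pair one bond apart, one L-A pair two bonds apart, and one L-L pair three bonds apart. All other entries of the 150-element histogram remain zero.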
Schematic of the NAPAP-thrombin complex. On the left the most important interactions between the thrombin inhibitor NAPAP and the thrombin active site are shown. A simple pharmacophore model of thrombin activity is given on the right. L: lipophilic, D: hydrogen-bond donor, A: hydrogen-bond acceptor, P: positively charged or ionizable group. Pharmacophore models can be used for similarity searching and de novo design exercises.
A pharmacophore or pharmacophoric pattern is the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response. (IUPAC recommendation 1997)42
Structure of mibefradil (1), a calcium channel blocking agent, and four selected isofunctional hits (2–5), which were retrieved by virtual database screening using the CATS software. Taking structure 5 as a query for CATS, a closely related structure (6) to mibefradil is retrieved. RTTC: recombinant T-type calcium channel, FLIPR: fluorometric imaging plate reader.
A common theme in molecular feature extraction is the transformation of raw data to a new co-ordinate system, where the axes of the new space represent “factors” or “latent variables”—features that might help to explain the shape of the original distribution. By far the most widely applied statistical feature extraction method in drug design belongs to the class of factorial methods: principal component analysis (PCA).12,48,49 PCA performs a linear projection of data points from the high-dimensional space to a low-dimensional space. In addition to PCA, non-linear projection methods like self-organizing maps (SOM), encoder networks, and Sammon mapping are sometimes employed in drug design projects.50 Since none of these methods require the knowledge of target values (e.g., inhibition constants, properties) or class membership (e.g., active/inactive assignments) they are termed “unsupervised”. Unsupervised procedures can be used to perform a first data analysis step, complemented by supervised methods later during the adaptive molecular design process.
Principal component analysis. PCA performs a projection of the m-dimensional data matrix X down to a d-dimensional subspace by means of the projection matrix $L^T$, yielding the object co-ordinates in this subspace, S (Eq. 10; Eq. 11). S is termed the score matrix with n rows (objects, molecules) and d columns (principal components). L is termed the loading matrix with d columns and m rows (one per original variable), and T denotes the matrix transpose.


Principal components (PC) of a set of two-dimensional data. The original co-ordinate system is spanned by x1 and x2. The orthogonal score vectors s1 and s2 are calculated according to the criterion of maximum variance.
A simple but useful algorithm for PCA is given by the NIPALS (nonlinear iterative partial least squares) technique. The following description of the algorithm is taken from Otto.12
Step 0
Scale the raw data matrix by the mean and normalize to length one.
Step 1
Estimate the loading vector $l^T$.
Step 2
Compute the score vector: $s = X l$.
Compare the new and the old score vector. If the deviations of the elements of the two vectors are within a threshold (e.g., < $10^{-5}$) then go to Step 5, otherwise go to Step 3.
Step 3
Compute new loadings: $l^T = s^T X$.
Normalize the loading vector to length one:
$$l^T = \frac{l^T}{\sqrt{l^T l}}$$
Step 4
Repeat from Step 2 if the number of iterations does not exceed a predefined threshold (e.g., 100 iterations); otherwise go to Step 5.
Step 5
Determine the matrix of residuals: $E = X - s\,l^T$.
If the number of principal components is equal to the number of previously fixed or desired components then go to Step 7; otherwise continue at Step 6.
Step 6
Use the matrix of residuals E as the new X-matrix and compute additional principal components s and loadings $l^T$ by means of Step 1.
Step 7
As a result, the matrix X is represented by a principal component model according to Eq. 10.
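The steps above can be sketched with NumPy. Several details are assumptions made for this illustration: Step 0 is read as column-wise mean-centring followed by scaling each column to length one, the initial loading estimate in Step 1 is taken from the row of the current matrix with the largest norm, and the principal component model of Eq. 10 is assumed to have the usual form in which X is approximated by the product of the score matrix and the transposed loading matrix. The function name is hypothetical.

```python
import numpy as np


def nipals_pca(X, n_components, tol=1e-5, max_iter=100):
    """NIPALS sketch: returns scores S (n x d) and loadings L (m x d)."""
    X = np.asarray(X, dtype=float)
    # Step 0 (assumed reading): centre each column, then scale it to length one.
    X = X - X.mean(axis=0)
    norms = np.linalg.norm(X, axis=0)
    X = X / np.where(norms > 0, norms, 1.0)

    n, m = X.shape
    S = np.zeros((n, n_components))
    L = np.zeros((m, n_components))
    E = X.copy()
    for a in range(n_components):
        # Step 1 (assumption): start from the row of E with the largest norm.
        l = E[np.argmax(np.linalg.norm(E, axis=1))].copy()
        l /= np.linalg.norm(l)
        s_old = np.zeros(n)
        for _ in range(max_iter):                  # Step 4: bounded iteration
            s = E @ l                              # Step 2: score vector
            if np.max(np.abs(s - s_old)) < tol:    # Step 2: convergence check
                break
            l = E.T @ s                            # Step 3: new loadings
            l /= np.linalg.norm(l)                 # Step 3: length one
            s_old = s
        s = E @ l                                  # keep s consistent with final l
        S[:, a], L[:, a] = s, l
        E = E - np.outer(s, l)                     # Steps 5 and 6: deflation
    return S, L


# Illustrative data: four strongly correlated descriptors plus a little noise.
rng = np.random.default_rng(0)
t = rng.normal(size=30)
X = np.column_stack([t, 2 * t, -t, 0.5 * t]) + 0.05 * rng.normal(size=(30, 4))
S, L = nipals_pca(X, n_components=4)
```

Because each residual matrix is orthogonal to the loadings already extracted, the loading vectors come out mutually orthogonal, and with as many components as variables the product of scores and transposed loadings reproduces the preprocessed data matrix.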
The aim of Sammon's algorithm is to project points from a high-dimensional (m-dimensional) space to a low-dimensional (n-dimensional) space, which is usually two-dimensional. It is conventionally applied to exploratory data analysis. The original algorithm is an iterative method based on a gradient search.52 It finds a data distribution in the n-dimensional target space so that as much as possible of the original distribution in the m-dimensional space is preserved. In this non-linear mapping (NLM), the inter-point distances between vectors in the lower-dimensional space approximate the corresponding distances in the original m-dimensional space. (Note: The idea of Kruskal's mapping procedure is very similar to Sammon's mapping algorithm: again the inter-point distances in the n-dimensional space approximate the corresponding distances in the m-dimensional space, but the original distances are transformed by some monotonic, increasing function.53) The basis for mapping is given by the inter-point distance matrix. Multidimensional scaling (MDS) is a related technique which is also based on a similarity or dissimilarity matrix. A thorough comparison between Sammon's and Kruskal's mapping and MDS can be found elsewhere.54 Sammon's mapping is an optimization procedure starting from an initial configuration of n-dimensional vectors (e.g., randomly chosen or by taking the n columns of the m-dimensional matrix X with maximum variances). A reasonable error function E is called “Sammon's stress” (Eq. 12), measuring how well a distribution of k points in the n-space matches the distribution of k points in the m-space (i.e., the difference between the distance matrix of the original vector set, $d^{*}$, and that of the projected vector set, $d$):

$$E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{\left( d^{*}_{ij} - d_{ij} \right)^{2}}{d^{*}_{ij}} \tag{12}$$
An optimization algorithm is applied to decrease the stress, e.g., a steepest descent method to search for the minimum of the error function. Having found the distribution in the n-space after the t-th iteration, the new setting at time t+1 is given by Equation 13.

where η is the learning rate, and

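Sammon's stress and a single descent update can be sketched as follows. The stress is computed in its standard form (squared differences between original and projected inter-point distances, each divided by the original distance, normalized by the sum of the original distances). The update shown is plain steepest descent with learning rate eta; this is an assumption made for illustration and may differ from the exact form of Equation 13. All function names are hypothetical, and the points are assumed to be pairwise distinct in both spaces.

```python
import numpy as np


def _pairwise(P):
    """Matrix of Euclidean distances between all rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def sammon_stress(X_high, Y_low):
    """Sammon's stress between original points and their projection."""
    d_star = _pairwise(np.asarray(X_high, dtype=float))
    d = _pairwise(np.asarray(Y_low, dtype=float))
    iu = np.triu_indices(len(d_star), k=1)        # every pair i < j once
    return float(np.sum((d_star[iu] - d[iu]) ** 2 / d_star[iu]) / np.sum(d_star[iu]))


def sammon_gradient(X_high, Y_low):
    """Gradient of the stress with respect to the projected co-ordinates."""
    Y = np.asarray(Y_low, dtype=float)
    d_star = _pairwise(np.asarray(X_high, dtype=float))
    d = _pairwise(Y)
    c = d_star[np.triu_indices(len(Y), k=1)].sum()
    np.fill_diagonal(d_star, 1.0)                 # placeholders: avoid 0/0 on
    np.fill_diagonal(d, 1.0)                      # the diagonal (weight becomes 0)
    W = (d_star - d) / (d_star * d)
    return (-2.0 / c) * (W.sum(axis=1)[:, None] * Y - W @ Y)


def descent_step(X_high, Y_low, eta=0.1):
    """One steepest-descent update of the projected points."""
    return np.asarray(Y_low, dtype=float) - eta * sammon_gradient(X_high, Y_low)


# Illustrative data: six points in four dimensions, random 2D start.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y = rng.normal(size=(6, 2))
stress_start = sammon_stress(X, Y)
```

Repeating `descent_step` with a suitably small eta lowers the stress step by step; the stress is zero only if all inter-point distances are reproduced exactly.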
The Sammon projection map complements those obtained with the SOM algorithm (vide infra), auto-associative neural networks (AANN), multilayer perceptron (MLP, see Chapter 3) and principal component (PC) feature extractor. Lerner demonstrated for the example of chromosome classification that Sammon's (unsupervised) mapping is superior to classification based on the AANN and PC feature extractor and highly comparable with that based on the (supervised) MLP.55 Further thorough comparison of these and other supervised and unsupervised methods and application to a wide range of classification and feature extraction tasks must be performed to substantiate these findings. Additional information and different approaches to the NLM task can be found elsewhere.56–59
Encoder networks for nonlinear mapping of high-dimensional data. Neurons are drawn as circles, weights are represented by lines. Input neurons (white) are fan-out units, hidden-layer units (black) have a sigmoidal or linear activity, and the output neurons (gray) are linear
a) symmetrical network architecture attempting to reproduce the input patterns by going through a low-dimensional internal representation. Factor 1 and Factor 2 are the score values (co-ordinates) in the low-dimensional (here: two-dimensional) map.
b) conventional feed-forward network with two output neurons. The outputs represent the low-dimensional scores.
Architecture of a self-organizing map (SOM). Network containing (6 × 5) = 30 neurons. Each neuron is a four-dimensional vector represented by a stack of four cubes. An input signal (pattern vector x) leads to a response of a single neuron (“winner-takes-all”, gray-colored). Usually the top-down view of an SOM is shown.
Architecture of a self-organizing map (SOM). A toroidal SOM (top-down view). The neurons in the first and the second neighborhood to the gray-shaded neuron are indicated by black lines. The star symbol is in the second neighborhood of the neuron.
Principle of vector quantization. In this example, two-dimensional data vectors (pattern vectors; open arrowheads) form two distinct clusters. During the vector quantization process neuron vectors (filled arrowheads) move toward the centers of the clusters, thereby forming the cluster centroids.
Kohonen's algorithm represents a strikingly efficient way for mapping similar patterns, given as vectors close to each other in input space, onto contiguous locations in the output space.67 This is achieved by introducing a topology to the SOM neuron layer. The simplest topology is a chain of neurons, followed by a two-dimensional grid. Topological mapping can be achieved by two simple rules:
Locate the best-matching neuron (winner neuron).
Increase matching at this unit and its topological neighbors.
For the first rule only vector distances between the input patterns x and the neurons w must be calculated (Eq.15). The number of comparisons needed depends linearly on the size of the self-organizing system C which can be expressed by the number of neurons, c.

The complete SOM algorithm can be formulated as follows:

Step 1
Initialize the self-organizing map A to contain N = N1 × N2 neurons $c_i$:
$$A = \{c_1, c_2, \ldots, c_N\}$$
with reference vectors $w_{c_i} \in \mathbb{R}^n$ chosen randomly according to p(ξ) from the set of training patterns.
Initialize the connection set C to form a rectangular N1 × N2 grid.
Initialize the time parameter t = 0.
Step 2
Generate at random an input signal ξ according to p(ξ).
Step 3
Determine the winner neuron according to Equation 15

Step 4
Adapt each neuron r according to
where the Hamming distance d1 is used to measure the neuron-to-neuron distance on the SOM grid, and a Gaussian neighborhood around the winner neuron s is used

with the standard deviation of the Gaussian:

and

The time-dependent calculation of σ and ε requires initial and final values that must be defined prior to SOM training.
Step 5
Increase the time parameter: t = t + 1.
Step 6
If t < tmax then continue with Step 2, otherwise terminate.
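The six steps can be sketched with NumPy as shown below. The adaptation rule and the schedules are assumptions chosen for this illustration because they are common in the SOM literature: each neuron is moved towards the input by a step size ε(t) weighted with a Gaussian of its city-block grid distance to the winner, and both σ and ε decay exponentially from an initial to a final value. The function name and all default parameter values are hypothetical.

```python
import numpy as np


def train_som(data, n1, n2, t_max=2000, sigma_i=3.0, sigma_f=0.1,
              eps_i=0.5, eps_f=0.005, seed=0):
    """Train a rectangular n1 x n2 SOM; returns weights of shape (n1, n2, dim)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)

    # Step 1: reference vectors drawn at random from the training patterns,
    # neurons arranged on a rectangular grid, t = 0.
    weights = data[rng.integers(len(data), size=n1 * n2)].copy()
    grid = np.array([(i, j) for i in range(n1) for j in range(n2)])

    for t in range(t_max):                       # Steps 5 and 6: time loop
        xi = data[rng.integers(len(data))]       # Step 2: random input signal
        # Step 3: winner = neuron with the smallest distance to the input.
        s = np.argmin(np.linalg.norm(weights - xi, axis=1))

        # Assumed schedules: exponential decay from initial to final value.
        frac = t / t_max
        sigma = sigma_i * (sigma_f / sigma_i) ** frac
        eps = eps_i * (eps_f / eps_i) ** frac

        # Step 4: Gaussian neighborhood over the city-block grid distance.
        d1 = np.abs(grid - grid[s]).sum(axis=1)
        h = np.exp(-(d1 ** 2) / (2.0 * sigma ** 2))
        weights += eps * h[:, None] * (xi - weights)

    return weights.reshape(n1, n2, -1)


# Illustrative data: two well-separated clusters in two dimensions.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
som = train_som(data, 4, 3, t_max=500)
```

Since every update moves a reference vector part of the way towards a training pattern, all neurons remain inside the region spanned by the data. A large initial σ orders the map globally, and the shrinking neighborhood later allows local fine-tuning.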
Stages of SOM adaptation. A planar (10 × 10) SOM was trained to map a two-dimensional data distribution (small black spots). The receptive fields of the final map are indicated by Voronoi tessellation in the lower left projection. A and B denote two “empty” neurons, i.e., there are no data points captured by these neurons. The simulation was performed using the SOM tutorial software written by H.S. Loos and B. Fritzke;113 Figures were adapted from the graphical www output.

Many variations and extensions of Kohonen's algorithm have been published ever since his original paper appeared. For a recent overview, see for example a volume by Oja and Kaski.73 One major limitation of the original SOM algorithm is that the dimension of the output space and the number of neurons must be predefined prior to SOM training. Self-organizing networks with adapting network size and dimension provide more advanced and sometimes more adequate solutions to data mining and feature extraction.72 A disadvantage of Kohonen-networks can be the comparatively long training time needed, especially if large data sets are used since every data vector must be presented several times to the network for weight adaptation. Hybrid multi-layered UNN which can be trained extremely fast employing very large data sets have already been developed.71,74,75 These systems can provide an alternative classification tool to Kohonen-networks if real-time or on-line computation is required, e.g., for control and analysis of HTS results. Usually they contain more than two layers of neurons where from layer to layer a more subtle data classification is performed, and only parts of the network are adapted during one training cycle. This reduces the training time needed since a data vector is not compared to every weight vector as in classical Kohonen-networks. Especially combinations of supervised and unsupervised learning techniques are under steady development and represent a very active area of current research. The authors of this book are convinced that such systems will become an indispensable part of bio- and chemoinformatics in the field of drug discovery. Several practical applications of the SOM to compound classification, drug design, and chemical similarity searching are described in the following part of this Chapter.
Structures of two potent thrombin inhibitors: PPACK (left) and Argatroban (right).
Virtual screening for potential antidepressants by a self-organizing map. The SOM represents a topology-preserving visualization of a high-dimensional chemical space spanned by 150 descriptors (CATS). The distribution of a set of 597 known antidepressants is indicated by gray shading (white: only antidepressants; black: only other drugs; gray: mixed cluster). A separation of antidepressant agents and “other” drugs can be observed. The two known antidepressants imipramine and fluoxetine were predicted to fall in the “antidepressants area” [neuron (4/7) and neuron (4/9)], and the NK1 inhibitor NKP-608 is located on an “activity island” on the map [neuron (7/4)].
Comparison of substance classes and compound libraries is a further application area of the SOM. In the following example, this method was used to compare drugs to a compilation of “nondrugs” in “Ghose & Crippen” space.94–96 Each molecule was coded by a 120-dimensional vector giving the fragment counts of 120 molecular fragments defined by Ghose and coworkers.97,98 For graphical display the molecule distributions in this 120-dimensional two-point pharmacophore space were projected onto a toroidal map consisting of (15 × 15) = 225 neurons (clusters). To determine the raw classification accuracy of an SOM, the correlation coefficient, cc, according to Matthews was calculated (Eq. 18).99 In Equation 18, P is the number of correct positive predictions, N is the number of correct negative predictions, O is the number of false-positive predictions (overprediction), and U is the number of false-negative predictions (underprediction).
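In the symbols defined above, the Matthews coefficient takes the standard form cc = (P·N − O·U) / sqrt[(N + U)(N + O)(P + U)(P + O)], ranging from −1 (every prediction wrong) through 0 (no better than random) to +1 (perfect prediction). A small Python helper makes the calculation concrete. The function name and the convention of returning 0 when the denominator vanishes are assumptions made for this sketch.

```python
import math

def matthews_cc(P, N, O, U):
    """Matthews correlation coefficient from prediction counts.

    P: correct positive predictions   N: correct negative predictions
    O: false positives (overprediction)
    U: false negatives (underprediction)
    """
    denom = math.sqrt((N + U) * (N + O) * (P + U) * (P + O))
    if denom == 0:
        # Assumed convention: undefined cases are reported as 0 (no correlation).
        return 0.0
    return (P * N - O * U) / denom

# Illustrative counts only (not data from this chapter):
print(matthews_cc(50, 50, 0, 0))    # perfect classification
print(matthews_cc(25, 25, 25, 25))  # random-like classification
```

For an SOM, the four counts would be obtained by labelling each neuron with a class (for example by majority vote of the training molecules it captures) and then counting how the molecules of each class fall onto neurons of each label.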

SOM projection of a chemical space filled with 4,998 drugs and 4,282 nondrugs. The frequencies of 120 Ghose & Crippen fragments were used to encode each molecule. Each square represents a cluster of molecules (Voronoi region). Note that the (10 × 10) map forms a torus. Data sets courtesy of J. Sadowski.
a) the ratio of drugs and nondrugs clustered is shown by grey scale shading (white: pure nondrug cluster, black: pure drug cluster).
b) binary classification of the distribution shown in (a). The Matthews correlation coefficient for this classification is cc = 0.48.
The extension of this approach to a comparison of natural products and trade drugs and consequent virtual library design was done by Lee and Schneider.96 It was demonstrated that natural compounds provide interesting novel scaffold architectures, which can be used in combinatorial drug design approaches. However, in most cases the scaffolds will have to be modified to provide synthetic feasibility and stability and prevent adverse pharmacokinetic effects. Taking such a natural scaffold in combination with synthetic side-chains might become a typical strategy in future drug design.100
SOM showing the distribution of 5726 trade drugs (left) and a 40 × 40 = 1600-member combinatorial library (right) in CATS topological pharmacophore space. Apparently these two compound collections do not overlap significantly.
The applicability of the SOM for mapping elements of protein structure, such as secondary structure elements or surface pockets, was demonstrated recently.101–103 Knowledge of the 3D structure of a target protein undoubtedly is a rich source of information for computer-aided drug design. Of special interest are the size and form of the active site, and the distribution of functional groups and lipophilic areas. Because the number of solved X-ray structures of proteins, and thus the amount of available information, is rapidly increasing, it is desirable to address questions related to coverage of the protein structure universe, conserved patterns of functional groups, or common ligand binding motifs.104,105 It is evident that such an analysis cannot be performed by visual inspection of structural models only. Automatic procedures for analysis, prediction, and comparison of macromolecular structures, in particular potential binding sites in proteins, will be very helpful tools.106 One such implementation of a computational method developed at Roche includes four steps:103 i) automated detection of protein surface pockets; ii) generation of a property-encoded solvent accessible surface (SAS) for each pocket; iii) generation of correlation vectors of the SAS to obtain rotation- and translation-invariant descriptors; and iv) SOM projection of these vectors onto a low-dimensional display. As a result, a two-dimensional map is obtained showing the distribution of surface cavities in a chemical property space. This method was originally applied to a set of 176 proteins from the Protein Data Bank (PDB)107 containing a catalytically active zinc ion in the active site. On the resulting SOM, the active site pockets were clearly separated from other surface cavities, with only a small degree of misclassification. A more detailed analysis revealed that the automated mapping of the active sites accurately reflects established enzyme classification.
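Step iii) can be illustrated with a simple distance-binned autocorrelation. The sketch below is not the Roche implementation; it assumes that a pocket surface is given as a set of points with one scalar property value each (for example a lipophilicity score), and the function name, bin width, and number of bins are hypothetical. Because only pairwise distances enter the calculation, the resulting vector does not change when the pocket is rotated or translated.

```python
import numpy as np

def correlation_vector(coords, props, bin_width=1.0, n_bins=10):
    """Distance-binned autocorrelation of a property on surface points.

    coords: (n, 3) point coordinates; props: (n,) property value per point.
    Element k sums props[i] * props[j] over all point pairs whose distance
    lies in [k * bin_width, (k + 1) * bin_width).
    """
    coords = np.asarray(coords, dtype=float)
    props = np.asarray(props, dtype=float)
    # All unordered point pairs (i < j).
    i, j = np.triu_indices(len(coords), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    bins = np.floor(dist / bin_width).astype(int)
    keep = bins < n_bins  # pairs beyond the last bin are ignored
    vec = np.zeros(n_bins)
    # Unbuffered accumulation so repeated bin indices are summed correctly.
    np.add.at(vec, bins[keep], props[i][keep] * props[j][keep])
    return vec
```

Vectors of this kind, one per pocket and of fixed length regardless of pocket size, could then be used as the input of step iv), the SOM projection.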
Such a projection and analysis technique can give new insight into local structural similarities between enzymes that have completely different folds and functions. Furthermore, the SOM mapping technique allowed for the correct classification of surface pockets derived from proteins that were not contained in the training set. We are convinced that this and similar techniques bear significant potential for automated protein structure analysis and drug design.108 If possible, the analysis of macromolecular (target) features should parallel feature extraction from sets of known ligands to obtain desired novel designs.