Proc Natl Acad Sci U S A. 2007 July 3; 104(27): 11334–11339.
Published online 2007 June 26. doi: 10.1073/pnas.0702965104.
PMCID: PMC2040899
Applied Mathematics, Evolution
Defining functional distance using manifold embeddings of gene ontology annotations
Gilad Lerman* and Boris E. Shakhnovich
*Department of Mathematics, University of Minnesota, Minneapolis, MN 55455; and
Program in Bioinformatics, Boston University, Boston, MA 02215
To whom correspondence may be addressed. E-mail: lerman/at/umn.edu or Email: borya/at/bu.edu
Communicated by Ronald R. Coifman, Yale University, New Haven, CT, April 9, 2007.
Author contributions: G.L. and B.E.S. performed research and wrote the paper.
Received June 7, 2006.
Although rigorous measures of similarity for sequence and structure are now well established, the problem of defining functional relationships has been particularly daunting. Here, we present several manifold embedding techniques to compute distances between Gene Ontology (GO) functional annotations and consequently estimate functional distances between protein domains. To evaluate accuracy, we correlate the functional distance to the well established measures of sequence, structural, and phylogenetic similarities. Finally, we show that manual classification of structures into folds and superfamilies is mirrored by proximity in the newly defined function space. We show how functional distances place structure–function relationships in biological context resulting in insight into divergent and convergent evolution. The methods and results in this paper can be readily generalized and applied to a wide array of biologically relevant investigations, such as accuracy of annotation transference, the relationship between sequence, structure, and function, or coherence of expression modules.
Keywords: kernel methods, diffusion geometry, domain evolution, functional annotation, homology modeling
One of the fundamental questions in biology deals with the interrelationship between structure, function, and evolution. The need to precisely and quantitatively measure evolutionary relationships encouraged the development of robust and accurate sequence (1) and structure (2, 3) comparison methods. The importance of these algorithms to computational biology cannot be overestimated. For example, the efficacy of transferring functional annotation depends on the precision of these sequence and structure comparison algorithms (4, 5). Although significant progress has been made in defining distance between sequences and structures, a rigorous understanding of functional distance is still limited.
At first glance, the notion of functional distance is qualitative and subjective. The development of annotation systems that depict function in a machine readable format was the first step in treating functional annotation rigorously. For example, the Gene Ontology (GO) (6) has become the gold standard for describing molecular functions of genes and proteins. However, the GO is not naturally amenable to measuring distance. One complication is an intrinsic bias in annotation where large numbers of unrelated genes share the same annotation (e.g., ATPase), making those categories uninformative. Previous attempts at identifying functional relationships between genes focused mostly on calculating statistical over-representation of functional categories (7). These methods are well suited for quantifying coherence of function in sets of genes, but not useful for exploring structure–function or sequence–function relationships.
Recently, researchers have recognized the importance of measuring distance between annotations (8) and proposed a simple measure of distance using the shortest path algorithm (9). However, these kinds of distances lack resolution and are complicated by somewhat arbitrary characteristics of the ontology, e.g., when annotations on the same level differ in their degree of generality. Accordingly, we show that functional metrics based on shortest path algorithms perform significantly worse than methods based on diffusion-type manifold embedding (10) proposed in this work.
Defining distances between functional categories is integrally important due to potential insights into the coevolution of sequence, structure, and function (11). For example, function, broadly defined as all activities performed by a set of sequences that fold into a domain structure, can be represented as a weighted subgraph of the GO directed acyclic graph (DAG) (12). This representation of function was used to establish the importance of considering homology relationships in a phylogenetic context. In this paper, we introduce more accurate and sensitive functional distances based on diffusion-type manifold embeddings of GO annotations to explore the structure–function relationship in detail.
Manifold embedding techniques are based on kernels (see definition in Materials and Methods), which have already been successfully applied to various problems in bioinformatics (13). In particular, computational approaches aimed at integrating various data sets have explored the effect of adding GO kernels for use in subsequent classification by SVM (14). Although our approach also employs kernels defined on GO, there are several fundamental differences. Most importantly, we apply these kernels to quantify functional distances as opposed to applications centered on classification of data into specific categories. Moreover, our approach naturally extends the notion of functional distance to protein domains by using the geometric interpretation of the manifold embedding (see Materials and Methods). Finally, we apply functional distances to exploring coevolution of sequence, structure, and function.
Functional distances defined here via diffusion-type manifold embedding techniques allow for increased sensitivity and arbitrary levels of granularity. Using our measures of functional distance, we can estimate the average divergence of function with respect to structure, sequence or phylogenetic similarity. Although clearly an area of active research, we show that functional distances are already accurate enough to discover specific relationships between protein domain functions. Finally, we show how functional distances can be used to explore divergent as well as convergent evolution.
Defining Functional Distance.
The molecular function component of the GO represents functional annotations as nodes on a DAG (6). We can capitalize on the hierarchical structure of the DAG to define local distances between functional annotations. Consider that there are only 20 possible annotations at the top, and >2,000 on the fifth level of the Gene Ontology. Thus, comparison at the top level of hierarchy will be, by design, less precise than at the bottom level. One way to address this would be to model the inherent bias of the ontology by taking into account node usage (14). For example, when a large proportion of proteins are coannotated with a pair of GO terms, the distance between these nodes on the GO DAG will be large because their cooccurrence is not specifically correlated with shared function.
Thus, the basic idea behind building an appropriate kernel is that GO terms shared by few protein sequences will be assigned small local distances or equivalently high values of local similarities. Alternatively, general annotations appearing at the top of the ontology will be assigned large local distances (or small similarities). Using the intuition outlined above, we form a graph where weights represent local similarities and use several techniques of manifold and graph embedding to calculate global distances between functional annotations. Embedding strategies exploit the underlying geometry of the graph and can implicitly correct ambiguities in the ontology. Finally, we use a global measure of distance between GO terms in combination with representation of domain function as a GO subgraph (12) to compute meaningful functional distances between protein domains [see Materials and Methods and supporting information (SI) Text].
Correlating Functional Distance with Sequence, Structure, and Phylogenetic Proximity.
We use the well known correlations of function with sequence, structure (12), and phylogenetic profiles (15) to evaluate the efficacy of using manifold embedding to quantify functional relationships between domains. The embedding procedure involves defining local similarity weights as described above and using them to form a kernel (the types of kernels used here and their direct relation to the notion of manifold embedding are described in Materials and Methods). The choice of kernel is arbitrary, but integrally important in the definition of distance. Thus, we compared the performance of several kernels in their ability to accurately represent functional distance between protein domains. We report results for four different choices of kernels. The first three are formed by diffusion-type kernels, whereas the fourth is similar to previously proposed shortest distance between GO annotations (9).
We use Z scores (2) from DALI (16) to quantify structural proximity, BLAST (1) for sequence similarity and mutual information (MI) between phylogenetic profiles (15) for phylogenetic similarity (see Materials and Methods). We find that functional distances between protein domains calculated using diffusion-type kernels correlate well with sequence alignment, structural proximity and phylogenetic similarity (Fig. 1 a–c). Importantly, the dynamic range of the correlations is very large and the averaging due to binning almost insignificant. On the other hand, the distance metric based on the shortest path algorithm shows no significant correlation with either homology or phylogenetic similarity (Fig. 1d). A clear benefit of developing a rigorous functional distance metric is the comparison of functional information in sequence, structure alignment, and phylogenetic profiling.
Fig. 1. Correlation of functional distance with sequence, structure, and phylogenetic similarity. Functional distances between domains are based on manifold embedding of GO annotations using three different kernels and also formed by geodesic distances.
One thing to note from Fig. 1 is the dependence of the observed correlations on the choice of kernel. For example, the correlation between protein structure similarity and functional distance can be described by a first-order exponential decay, along the full range from far-diverged folds (Z = 6) to superfamily (Z = 9), and closely related proteins that belong to the same structural family (Z > 12) (11) and often share the same function. This behavior is similar for all diffusion-type kernels considered in the present work. However, the rate of exponential decay depends on the kernel. We observed that the LLE kernel shows the slowest decay rate (T = 0.54; T is the mean lifetime), whereas the inverse Laplacian kernel (pseudoinverse of the graph Laplacian) shows the steepest one (T = 0.23) (Table 1). Furthermore, it is reasonable to assume that sequence alignment will relate to functional distance through a logistic function. Indeed, good sequence alignment is highly informative of similarity in function, whereas above a certain threshold, sequence alignment provides little information about functional proximity. Once again, we note that the different diffusion kernels can be characterized by the slope of the transition in the logistic fit. Consistent with results on correlation with structure, we find that the LLE kernel shows the shallowest slope (S = 6.38), whereas the inverse Laplacian has the steepest transition slope (S = 13.88) (Table 1).
Table 1. Fitted parameters to observed correlations between functional distance and sequence, structure, or phylogenetic similarity measures
We find that the differences in the observed correlations between functional distances derived from each kernel and sequence, structure and phylogenetic proximity measures can provide insight into the behavior of the kernel at different scales of resolution. The chosen diffusion-type kernels (pseudoinverse of the graph Laplacian, LLE and diffusion powers) emphasize different ranges of interaction between GO annotations and consequently result in range-specific resolutions. Specifically, the LLE kernel corresponds to a low power of diffusion and thus emphasizes shorter-range interactions between annotations. The diffusion kernel of power m = 7 represents a functional distance with good resolution at medium distances because it takes into account longer paths along the unified GO annotation graph. Consequently, the range of approximately linear correlation with sequence alignment shortens. Finally, the inverse Laplacian takes into account all powers of diffusion and thus incorporates all paths along the unified GO annotation graph. Therefore, it has impressive resolution at longer functional distances.
Consistent with the explanation presented above, both the structural alignment and sequence alignment show increasingly sharp transitions when applying the LLE kernel, then the diffusion kernel with power m = 7, and finally the inverse Laplacian kernel (Table 1). Thus, manifold embedding of GO can produce functional distances at the needed resolution by choosing a kernel appropriate to the specific application. For maximum resolution at small functional distances, the LLE kernel is most appropriate, whereas maximum resolution at long distances can be achieved by using the inverse Laplacian kernel. However, as expected, the qualitative behavior of the correlations remains the same for all choices of diffusion kernels.
Building a Functional Domain Universe Graph.
Next, we wanted to explore whether our definition of functional distance that correlates on average with sequence, structural, and phylogenetic similarities (Fig. 1) is accurate enough to yield biologically meaningful insights into the structure–function relationships of specific protein domains. We begin by creating a graph where nodes are domains colored by SCOP (17) fold annotation, and edges represent functional proximity calculated using the diffusion kernel (with m = 7). The graph is transformed into an unweighted version using an empirically derived threshold of F = 0.23. The resulting graph (Fig. 2a) illustrates both the specific functional relationships between individual domains and global relationships between folds and functions.
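The thresholding step can be illustrated with a minimal sketch. It assumes that two domains are linked when their functional distance is at most the threshold; the function name `threshold_graph`, the input format, the domain labels, and the toy distances are all hypothetical, and only the value 0.23 is taken from the text.

```python
def threshold_graph(distances, threshold=0.23):
    """Build an unweighted adjacency map from pairwise functional distances.

    `distances` maps frozenset({a, b}) -> functional distance (hypothetical format).
    Assumption: a pair is linked when its distance is <= threshold.
    """
    nodes = set()
    for pair in distances:
        nodes.update(pair)
    adjacency = {node: set() for node in nodes}
    for pair, dist in distances.items():
        if dist <= threshold:
            a, b = tuple(pair)
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Toy distances between three made-up domains
toy_distances = {
    frozenset({"dom1", "dom2"}): 0.05,
    frozenset({"dom1", "dom3"}): 0.40,
    frozenset({"dom2", "dom3"}): 0.20,
}
graph = threshold_graph(toy_distances)
```

In this toy input, `dom1` and `dom3` are not linked directly but remain connected through `dom2`, which is the kind of one-step relationship between clusters discussed for the domain universe graph.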
Fig. 2. Functional domain universe graph and structure-function coevolution. (a) Functional domain universe graph. The vertices of the graph represent protein domains, whereas its edges represent functional similarity between domains.
Two things become immediately apparent from functional embedding of the protein domain universe. First, at short functional distances, domains sharing fold classification form clusters sharing common function. Second, at intermediate functional distances, clusters of domains with related functions are proximal on the graph. For example, DNA-binding domains form a cluster that is close to the cluster containing exonuclease domains and transcription factor domains. As another example, Rossmann fold domains performing oxidoreductase activity are separated by only one step from domains with dehydrogenase activity.
Although the graph shows separation of domains by fold and function, the structure–function relationship is clearly multifaceted. Functional clusters are not entirely monochromatic; that is, functions are usually fulfilled by domains of several different folds. Some folds are also multifunctional and appear in clusters that are far from each other, e.g., ferredoxins. Other folds are more functionally exclusive and only participate in clusters that are in close proximity, e.g., TIM beta/alpha barrels, which mostly carry out enzymatic functions. Finally, it appears that this representation of relationships between protein domain functions captures the separation of folds into functionally related superfamilies (17).
Exploring Structure–Function Coevolution.
Interestingly, there are certain domains that link proximal clusters. These domains may represent the intermediates in the evolutionary path from one function to another. For example, consider two clusters (labeled B and E in Fig. 2a) populated mostly by 3-helical bundle domains. The B cluster contains domains responsible for DNA binding. Domains in this cluster bind to DNA nonspecifically (a representative structure is 1hlv, which is a centromere binding protein; ref. 18). On the other hand, the cluster labeled E is dominated by domains with the same 3-helical bundle structure, but those that bind to specific DNA sequences. These are mostly domains that carry out transcription initiation activity (a representative structure is the engrailed transcription factor 2hdd; ref. 19). Interestingly, there is one domain that also has the structure of a 3-helical bundle that is functionally proximal to both clusters and appears as the connecting hub. This domain is coded by a family of gamma-delta resolvases (1gdt; ref. 20). This is a family of proteins that binds to imperfectly conserved sequences (21) (Fig. 2b).
Clearly, sequence binding specificity is not explicitly described by GO. However, the 3-helical bundles are a remarkable example of how GO embedding and the subsequent graph theoretical treatment can uncover relationships between structures by placing their functions in biological context. Subsequent application of evolutionary trace methods to the three families can uncover the residues responsible for the differential binding specificity of the 3-helical bundles and their mutational dynamics.
Specificity of DNA binding in 3-helical bundle domains is an example of divergent evolution where sequences are related by common ancestry (22). On the other hand, convergent evolution is often defined as two proteins with no apparent homology performing the same function (22). An additional benefit of defining functional distances is that we can easily detect instances of convergence by examining domains with close functional distance and no structural similarity. For example, using functional distances, we easily confirmed the well documented case of convergence of tRNA synthetases [1pys (23) and 1a8h (24), F score = 0.001 and Z score < 2].
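The screen for convergence candidates described above amounts to a simple filter over domain pairs. The sketch below is only an illustration: the function name, the tuple layout, the domain labels, and both cutoffs are assumptions (the Z cutoff mirrors the Z score < 2 quoted in the example, whereas the F cutoff is arbitrary), and the scores are invented for demonstration.

```python
def convergence_candidates(pairs, f_max=0.01, z_max=2.0):
    """Select domain pairs that are functionally close but structurally dissimilar.

    `pairs` is an iterable of (domain_a, domain_b, f_score, z_score) tuples.
    Both cutoffs are illustrative assumptions, not values prescribed by the paper.
    """
    return [
        (a, b)
        for a, b, f_score, z_score in pairs
        if f_score <= f_max and z_score < z_max
    ]


# Invented scores for hypothetical domains
toy_pairs = [
    ("domA", "domB", 0.001, 1.5),   # close in function, no structural similarity
    ("domA", "domC", 0.001, 15.0),  # close in function, similar structure
    ("domB", "domC", 0.500, 1.0),   # dissimilar structure, distant function
]
candidates = convergence_candidates(toy_pairs)
```

Only the first toy pair passes both conditions. Pairs like the second one, which are close in both function and structure, are more naturally read as divergence from a common ancestor.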
Machine readable representations of function, e.g., GO, are a necessary first step toward high-throughput functional annotation of data from whole-genome sequencing and structural genomics projects. Although these databases represent an intuitively appealing representation of function, they are not immediately amenable to accurate definitions of functional distance. Using nonlinear manifold embedding techniques, we were able to define distances between functional annotations and use those to quantify distances between protein domains. We find that diffusion kernels perform remarkably well in creating an accurate global distance metric applicable to quantifying functional relationships between protein domains.
As an example of specific insights that can be uncovered using the proposed distance metric, we explore functional relationships between 3-helical bundle domains which form two clusters in function space. These functional clusters turn out to be separable by the specificity of DNA binding. The family of sequences that are functionally similar to both clusters binds with intermediate specificity. We were also able to confirm examples of convergence where domains sharing close functional proximity appear to have evolved independently. Further exploration of this representation of the protein domain universe will undoubtedly uncover many more insights into the relationship between evolution of structure and function.
Kernel-based functional distance metrics have several important advantages over previously described methods (14), Euclidean measures (12), and shortest path algorithms (9). First, the diffusion-type manifold embedding techniques give rise to distances taking into account both the geometry of the ontology and intrinsic biases in annotations in a robust way (insensitive to small amounts of noise). In particular, distances between subgraphs of annotation (e.g., those representing protein domains) have a clear geometric interpretation. Secondly, manifold embedding learns distances between annotations, rather than using kernels for classifications or defining distances between genes. Consequently, this approach is more natural for evaluating and comparing relationships between sequence, structure, and function as opposed to previous metrics that focused on applying GO kernels as part of a heterogeneous dataset for classification of protein–protein interactions (14). As a result, these methods are significantly more general and can be applied in calculations of functional distances between arbitrary numbers of genes. Additionally, techniques presented here can be easily adapted to other ontologies. Finally, correlations with sequence, structure (2, 15) and phylogenetic proximity (Fig. 1) show that metrics based on diffusion-type manifold embedding are significantly more accurate than previously proposed measures (9).
Having the ability to estimate “distance” in function space is fundamental to computational biology in the postgenomic era. A variety of computational tasks including assessment of annotation accuracy from homology modeling and module detection from microarray data can be facilitated by an accurate measurement of functional relationship between genes.
The GO DAG (6) can be found at www.geneontology.org. For structural proximity calculations, we use the Dali domain dictionary (2). The list of domains (3306) can be found at romi.bu.edu/kernel_mapping/dali.txt. We use ASTRAL (25) to determine the SCOP (17) annotation for each domain. We use BLAST (1) to compare domain sequences. Matlab codes computing the following functional distances between annotations and protein domains can be found in www.math.umn.edu/~lerman/supp/protein_distance. More specific details of the methods are discussed in SI Text.
Annotating Each Structure as a Subgraph on the GO.
To annotate structures using GO (6), we use the strategy (12) of collecting all annotations for sequences (from NRDB; ref. 26) that fold into the structure and reconstructing all paths up to the root of the GO DAG.
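As a minimal sketch of this annotation strategy, the code below collects every node and edge lying on any path from a set of starting terms up to the root of a DAG. The function name, the `parents` dictionary format, and the toy term names are hypothetical; real GO identifiers and the NRDB sequence lookup are not modeled.

```python
def domain_subgraph(annotations, parents):
    """Collect all nodes and edges on paths from the given terms up to the root.

    `annotations`: iterable of term identifiers attached to a domain's sequences.
    `parents`: dict mapping a term to the list of its parent terms in the DAG.
    Returns (nodes, edges), where edges are (child, parent) tuples.
    """
    nodes, edges = set(), set()
    stack = list(annotations)
    while stack:
        term = stack.pop()
        if term in nodes:
            continue  # already expanded via another path
        nodes.add(term)
        for parent in parents.get(term, []):
            edges.add((term, parent))
            stack.append(parent)
    return nodes, edges


# Toy DAG with made-up term names; "nuclease" has two parents
toy_parents = {
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "nuclease": ["catalysis", "dna_binding"],
}
nodes, edges = domain_subgraph(["nuclease"], toy_parents)
```

Because a term may have several parents, the result is a subgraph rather than a single path, which is why the domain function is represented as a subgraph of the DAG.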
Local Similarities Between GO Annotations.
Formally, we form a unified graph G whose nodes are all GO annotations appearing in protein domains and whose edges are the union of all edges of subgraphs representing protein domains. The local similarity weight $w_{ij}$ on an edge connecting annotations i and j is defined as follows: $w_{ij} = 1/n_{ij}$, where $n_{ij}$ is the number of domain subgraphs containing that edge.
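The weight definition can be sketched in a few lines. The code below counts, for each edge, how many domain subgraphs contain it and takes the reciprocal. Treating edges as undirected is an assumption made here for simplicity, and the function name and toy data are hypothetical.

```python
from collections import Counter


def local_similarities(domain_edge_sets):
    """Compute w_ij = 1 / n_ij over the union of all domain subgraph edges.

    `domain_edge_sets`: list of edge collections, one per protein domain.
    Edges are stored as frozensets, i.e. treated as undirected (an assumption).
    """
    counts = Counter()
    for edge_set in domain_edge_sets:
        # Deduplicate within a domain so each domain counts an edge at most once
        for edge in {frozenset(e) for e in edge_set}:
            counts[edge] += 1
    return {edge: 1.0 / n for edge, n in counts.items()}


# Three made-up domain subgraphs sharing the edge t1-root
domains = [
    {("t1", "root"), ("t2", "root")},
    {("t1", "root")},
    {("t1", "root"), ("t3", "t1")},
]
weights = local_similarities(domains)
```

The edge shared by all three toy domains receives the smallest weight, which reflects the intent that widely used annotations contribute little local similarity.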
Similarities by Diffusion and LLE Kernels.
A (positive definite) kernel K for the unified graph is a real symmetric matrix whose size is N, the number of vertices of the unified graph, and whose eigenvalues are nonnegative. Its elements $K_{i,j}$ represent local similarities between corresponding graph nodes (i and j). The diffusion kernels are based on a local diffusion process on the unified graph. We first normalize the local similarity weights defined above by the degree matrix D, which is defined as follows:
$$D_{ii} = \sum_{j} w_{ij}, \qquad D_{ij} = 0 \ \text{ for } i \neq j.$$

The normalized matrix

$$D^{-1} W, \qquad W = (w_{ij}),$$

represents local transition probabilities between GO annotations. Its symmetric version is

$$D^{-1/2} \, W \, D^{-1/2}.$$
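The normalization can be checked numerically with a short sketch. It assumes the standard reading of the text, namely that the transition matrix is $D^{-1}W$ and its symmetric version is $D^{-1/2} W D^{-1/2}$; the function name and the toy weight matrix are hypothetical, and plain Python lists are used instead of a numerical library.

```python
import math


def normalize(W):
    """Return (P, S) for a symmetric nonnegative weight matrix W (list of lists).

    P = D^{-1} W is row-stochastic; S = D^{-1/2} W D^{-1/2} is symmetric.
    Assumes every node has positive degree.
    """
    n = len(W)
    degrees = [sum(row) for row in W]
    P = [[W[i][j] / degrees[i] for j in range(n)] for i in range(n)]
    S = [
        [W[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
        for i in range(n)
    ]
    return P, S


# Toy weights on a three-node path graph
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.5],
    [0.0, 0.5, 0.0],
]
P, S = normalize(W)
```

Each row of `P` sums to one, so it can be read as a set of transition probabilities, whereas `S` is symmetric and is therefore the more convenient starting point for building a kernel.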

Following Coifman et al. (10), the diffusion kernel of “power” (or transition step) m is the matrix
[Equation image: zpq02707-6381-m04.jpg]

A related diffusion kernel, suggested by Ham et al. (27), is formed by taking the pseudoinverse of the graph Laplacian, that is,
[Equation image: zpq02707-6381-m05.jpg]

A similar LLE (locally linear embedding) (28) kernel is obtained by following Ham et al. (27): We denote by e the uniform column vector of size N and length 1, that is, its elements are all equal to $1/\sqrt{N}$. We then set
[Equation image: zpq02707-6381-m06.jpg]

Finally, we denote by $\lambda_{\max}$ the largest eigenvalue of M and form the LLE kernel by the formula
[Equation image: zpq02707-6381-m07.jpg]

Other forms of diffusion kernels (29, 30) are described in SI Text.
Distances Between Annotations and Their Relation to Manifold Embedding.
Given a kernel K, we compute the distance d(x, y) between GO annotations x and y as follows:
$$d(x, y) = \sqrt{K(x, x) + K(y, y) - 2K(x, y)}. \qquad [1]$$

This formula has a straightforward interpretation. Any kernel K can be written in the form: $K(x, y) = \langle F(x), F(y) \rangle$, where F embeds the graph vertices into a Euclidean space (usually referred to as feature space). Consequently, Eq. 1 can be written as:

$$d(x, y) = \lVert F(x) - F(y) \rVert.$$

The distance d(x,y) thus represents the Euclidean distance between the embedded annotations (in feature space).
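The feature-space interpretation can be verified on a toy example. The sketch below assumes that Eq. 1 has the usual kernel-induced form, $d(x, y)^2 = K(x, x) + K(y, y) - 2K(x, y)$, and builds K as the Gram matrix of explicit feature vectors so that the result can be compared with the ordinary Euclidean distance. The function name and the feature vectors are hypothetical.

```python
import math


def kernel_distance(K, x, y):
    """Distance induced by a positive semidefinite kernel matrix K (list of lists)."""
    squared = K[x][x] + K[y][y] - 2.0 * K[x][y]
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


# Toy check: K is the Gram matrix of explicit feature vectors F(x)
features = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
K = [[sum(a * b for a, b in zip(u, v)) for v in features] for u in features]

d01 = kernel_distance(K, 0, 1)
d12 = kernel_distance(K, 1, 2)
```

Here the distances computed from K alone agree with the Euclidean distances between the feature vectors, even though the function never sees the vectors themselves.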
Assuming that the graph approximates a low-dimensional manifold or another continuous geometric structure, we view the graph embedding, F, as an approximation to a corresponding manifold embedding. The embedding and its corresponding distance are determined by the choice of kernel, which reflects geometric properties of the underlying graph or manifold. Indeed, when applying the diffusion kernel of power m (10), the corresponding distances measure the rate of connectivity between vertices according to paths of length m. The distances obtained by the inverse Laplacian represent the expected time to travel from one vertex to another vertex and then back to the original vertex (27). The LLE distance is similar to a diffusion kernel with low powers. The corresponding LLE embedding tries to preserve local distances to nearest points along the graph (see SI Text).
In the SI Text, we discuss efficient numerical evaluation of the functional distances for different kernels and large N.
The geodesic distances were calculated using Dijkstra's algorithm on the global GO graph (with local distances $n_{ij}$).
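For reference, a compact version of Dijkstra's algorithm is sketched below using a binary heap. The adjacency format, node names, and edge lengths are hypothetical; in the setting described above the edge lengths would be the counts $n_{ij}$.

```python
import heapq


def dijkstra(adjacency, source):
    """Shortest-path lengths from `source` in a graph with nonnegative edge lengths.

    `adjacency`: dict mapping node -> dict of neighbor -> edge length.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, length in adjacency.get(node, {}).items():
            candidate = d + length
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Toy undirected graph with integer edge lengths
toy_graph = {
    "a": {"b": 3, "c": 1},
    "b": {"a": 3, "c": 1, "d": 2},
    "c": {"a": 1, "b": 1},
    "d": {"b": 2},
}
geodesic = dijkstra(toy_graph, "a")
```

In the toy graph the direct edge from `a` to `b` has length 3, but the route through `c` has length 2, so the geodesic distance uses the detour.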
Distances Between Subgraphs Representing Protein Domain Functions.
We define the distance d(x,A) between a node x and the set of vertices A as follows:
$$d(x, A) = \min_{y \in A} d(x, y).$$

The distance between the two sets of vertices A and B is then computed using the formula:
[Equation image: zpq02707-6381-m11.jpg]

Variants of this “distance” and their properties are discussed in refs. 31 and 32.
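To make the set-based construction concrete, the sketch below takes the node-to-set distance to be the minimum over the set (an assumption consistent with the usual definition) and combines the two directions with the modified Hausdorff distance of ref. 32. This is one representative variant only; the formula actually used for the domain distances is not reproduced in this text and may differ. The function names and toy sets are hypothetical.

```python
def point_to_set(x, A, d):
    """Distance from node x to the nearest member of the non-empty set A."""
    return min(d(x, y) for y in A)


def modified_hausdorff(A, B, d):
    """Modified Hausdorff distance: the larger of the two mean point-to-set distances."""
    mean_ab = sum(point_to_set(a, B, d) for a in A) / len(A)
    mean_ba = sum(point_to_set(b, A, d) for b in B) / len(B)
    return max(mean_ab, mean_ba)


def abs_diff(x, y):
    """Toy ground distance on the real line."""
    return abs(x - y)


A = [0.0, 1.0]
B = [1.0, 4.0]
d_ab = modified_hausdorff(A, B, abs_diff)
```

In practice `d` would be the kernel-induced distance between GO annotations and `A`, `B` the node sets of two domain subgraphs. Averaging over nodes makes the result less sensitive to a single outlying annotation than the classical Hausdorff distance, at the cost of no longer being a true metric.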
Phylogenetic Similarity Between Protein Domains Based on Phylogenetic Profiles (P Score).
We evaluate the phylogenetic similarity between structures by BLASTing (1) the set of nonredundant sequences found to fold into each domain against all fully sequenced genomes. The similarity between any two domains is then just the empirical mutual information, MI, between their phylogenetic profiles (15). If x and y are two phylogenetic profiles, then
$$\mathrm{MI}(x, y) = \sum_{i=0}^{1} \sum_{j=0}^{1} p_{ij}(x, y) \, \log \frac{p_{ij}(x, y)}{p_i(x) \, p_j(y)},$$

where $p_{ij}(x, y)$, i, j = 0, 1, describe the frequencies of occurrence of all four possible combinations of presence (i/j = 1) or absence (i/j = 0) in the same genome for the two domains, $p_i(x)$, i = 0, 1, are the frequencies of occurrence (i = 1) or absence (i = 0) in profile x and $p_j(y)$, j = 0, 1, are defined similarly. MI will be maximal if $p_{00}(x, y) = p_{11}(x, y) = 0.5$. That is, half of the terms of the two phylogenetic vectors are perfectly correlated (as a measure of nonorthologous gene displacement; ref. 33), whereas the terms in the other half are perfectly anticorrelated.
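The empirical mutual information between two presence/absence profiles can be computed directly from counts, as in the sketch below. The base of the logarithm is not stated in the text, so base 2 (bits) is an assumption here; the function name and the toy profiles are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two binary profiles.

    `x` and `y` are equal-length sequences of 0/1 flags, one entry per genome.
    Combinations that never occur contribute nothing to the sum.
    """
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        mi += p_ij * math.log2(p_ij / ((px[i] / n) * (py[j] / n)))
    return mi


# Toy profiles over four genomes
mi_same = mutual_information([1, 1, 0, 0], [1, 1, 0, 0])
mi_independent = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])
```

The first toy pair has $p_{00} = p_{11} = 0.5$ and reaches the maximum of one bit, while the second pair has all four combinations equally frequent and yields zero.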
Curve Fitting (Fig. 1).
All curve fitting was done using Origin 7 SR1 (www.originlab.com). Exponential decay was modeled using the equation
[Equation image: zpq02707-6381-m13.jpg]

The values of T when correlating functional distance with structure and phylogenetic similarity are reported in Table 1. The correlation between sequence alignment and functional distance was modeled by the logistic function:
[Equation image: zpq02707-6381-m14.jpg]

Here, the slope reported in Table 1 is simply
[Equation image: zpq02707-6381-m15.jpg]

All fitted functions had coefficients of determination in the range $0.89 < R^2 < 0.97$.
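As a rough illustration of how a mean lifetime T and a coefficient of determination can be obtained, the sketch below fits a decay of the form $y = A e^{-x/T}$ by ordinary least squares on $\log y$. This is a swapped-in, simplified technique, not the Origin procedure: the zero-offset model form is an assumption (the fitted equation itself is not reproduced in this text), the function names are hypothetical, and the data are synthetic and noise-free.

```python
import math


def fit_exponential_decay(xs, ys):
    """Fit y = A * exp(-x / T) by least squares on log(y); requires all ys > 0.

    Returns (A, T), where T plays the role of the mean lifetime.
    """
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_x
    return math.exp(intercept), -1.0 / slope


def r_squared(ys, predictions):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data generated from known parameters
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(-x / 0.5) for x in xs]
A, T = fit_exponential_decay(xs, ys)
r2 = r_squared(ys, [A * math.exp(-x / T) for x in xs])
```

A log-linear fit weights small values of y more heavily than a nonlinear least-squares fit on the original scale, so on noisy data the two approaches can give noticeably different values of T.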
Supplementary Material
Supporting Text
Acknowledgments
We thank Mark Green and Institute for Pure and Applied Mathematics (University of California, Los Angeles) for inviting us to participate in a proteomics workshop, where we first met and started our discussion that led to this paper. G.L. thanks Ronald R. Coifman, Stephane Lafon, and Mauro Maggioni for introducing him to diffusion geometries and for forwarding him some of their papers and software. B.S. thanks Eugene Shakhnovich, Nick Grishin, Tim Reddy, and Joe Mellor for fruitful discussions and critical reading of the manuscript. G.L. is supported by National Science Foundation Grant 0612608.
Abbreviations
GO, Gene Ontology
DAG, directed acyclic graph.

Footnotes
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0702965104/DC1.
1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Nucleic Acids Res. 1997;25:3389–3402.
2. Dietmann S, Holm L. Nat Struct Biol. 2001;8:953–957.
3. Shindyalov IN, Bourne PE. Nucleic Acids Res. 2001;29:228–229.
4. Sauder JM, Arthur JW, Dunbrack RL Jr. Proteins. 2000;40:6–22.
5. Gerstein M, Levitt M. Protein Sci. 1998;7:445–456.
6. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Nat Genet. 2000;25:25–29.
7. Berriz GF, King OD, Bryant B, Sander C, Roth FP. Bioinformatics. 2003;19:2502–2504.
8. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M. Science. 2003;302:449–453.
9. Lord PW, Stevens RD, Brass A, Goble CA. Bioinformatics. 2003;19:1275–1283.
10. Coifman RR, Lafon S, Lee AB, Maggioni M, Nadler B, Warner F, Zucker SW. Proc Natl Acad Sci USA. 2005;102:7432–7437.
11. Shakhnovich BE, Max Harvey J. J Mol Biol. 2004;337:933–949.
12. Shakhnovich BE. PLoS Comput Biol. 2005;1:e9.
13. Schölkopf B, Tsuda K, Vert J-P. Kernel Methods in Computational Biology. Cambridge, MA: MIT Press; 2004.
14. Ben-Hur A, Noble WS. Bioinformatics. 2005;21(Suppl 1):i38–i46.
15. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Proc Natl Acad Sci USA. 1999;96:4285–4288.
16. Dietmann S, Park J, Notredame C, Heger A, Lappe M, Holm L. Nucleic Acids Res. 2001;29:55–57.
17. Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG. Nucleic Acids Res. 2004;32:D226–D229.
18. Tanaka Y, Nureki O, Kurumizaka H, Fukai S, Kawaguchi S, Ikuta M, Iwahara J, Okazaki T, Yokoyama S. EMBO J. 2001;20:6612–6618.
19. Tucker-Kellogg L, Rould MA, Chambers KA, Ades SE, Sauer RT, Pabo CO. Structure (London). 1997;5:1047–1054.
20. Yang W, Steitz TA. Cell. 1995;82:193–207.
21. Graham KS, Dervan PB. J Biol Chem. 1990;265:16534–16540.
22. Ponting CP, Russell RR. Annu Rev Biophys Biomol Struct. 2002;31:45–71.
23. Mosyak L, Reshetnikova L, Goldgur Y, Delarue M, Safro MG. Nat Struct Biol. 1995;2:537–547.
24. Sugiura I, Nureki O, Ugaji-Yoshikawa Y, Kuwabara S, Shimada A, Tateno M, Lorber B, Giege R, Moras D, Yokoyama S, Konno M. Structure (London). 2000;8:197–208.
25. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. Nucleic Acids Res. 2004;32:D189–D192.
26. Holm L, Sander C. Bioinformatics. 1998;14:423–429.
27. Ham J, Lee DD, Mika S, Scholkopf B. Proceedings of the Twenty-First International Conference on Machine Learning; Menlo Park, CA: AAAI Press; 2004. pp. 47–54.
28. Roweis ST, Saul LK. Science. 2000;290:2323–2326.
29. Kondor RI, Lafferty J. Machine Learning: Proceedings of the Nineteenth International Conference (ICML); San Francisco: Morgan Kaufmann; 2002. pp. 315–322.
30. Belkin M, Niyogi P. Neural Computation. 2003;15:1373–1396.
31. Memoli F, Sapiro G. Found Comp Math. 2005;5:313–347.
32. Dubuisson MP, Jain AK. Proceedings of the 12th IAPR; Los Alamitos, CA: IEEE Comp Soc Press; 1994. pp. 566–568.
33. Koonin EV, Mushegian AR, Bork P. Trends Genet. 1996;12:334–336.
