Copyright: This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration, which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.

Superfamily Assignments for the Yeast Proteome through Integration of Structure Prediction with the Gene Ontology

1 Department of Biochemistry, University of Washington, Seattle, Washington, United States of America
2 Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
3 Department of Biology, Department of Computer Science, and Center for Comparative Functional Genomics, New York University, New York, New York, United States of America
4 Howard Hughes Medical Institute, University of Washington, Seattle, Washington, United States of America

Andrej Sali, Academic Editor, University of California San Francisco, United States of America

* To whom correspondence should be addressed. E-mail: dabaker/at/u.washington.edu

Received May 4, 2006; Accepted January 12, 2007.

Abstract

Saccharomyces cerevisiae is one of the best-studied model organisms, yet the three-dimensional structure and molecular function of many yeast proteins remain unknown. Yeast proteins were parsed into 14,934 domains, and those lacking sequence similarity to proteins of known structure were folded using the Rosetta de novo structure prediction method on the World Community Grid. These structural data were integrated with process, component, and function annotations from the Saccharomyces Genome Database to assign yeast protein domains to SCOP superfamilies using a simple Bayesian approach. We have predicted the structure of 3,338 putative domains and assigned SCOP superfamily annotations to 581 of them.
We have also assigned structural annotations to 7,094 predicted domains based on fold recognition and homology modeling methods. The domain predictions and structural information are available in an online database at http://rd.plos.org/10.1371_journal.pbio.0050076_01. Author Summary The three-dimensional structure of a protein can reveal much about that protein's evolutionary relationships and functions. Such information about all the proteins in an organism—the proteome—would offer a more global view of these relationships, but solving each structure individually would be a formidable task. In this study, we have parsed all Saccharomyces cerevisiae proteins into nearly 15,000 distinct domains and then used de novo structure prediction methods together with worldwide distributed computing to predict structures for all domains lacking sequence similarity to proteins of known structure. To overcome the uncertainties in de novo structure prediction, we combined these predictions with data on the biological process, function, and localization of the proteins from previous experimental studies to assign the domains to families of evolutionarily related proteins. Our genome-wide domain predictions and superfamily assignments provide the basis for the generation of experimentally testable hypotheses about the mechanism of action for a large number of yeast proteins. Introduction The yeast Saccharomyces cerevisiae is one of the most widely studied organisms, yet a large fraction of its proteins are of unknown structure and/or unknown function. Knowledge of the structure of a protein is critical to understand how it functions, and hence, a complete set of protein structures for yeast is desirable, but difficult to accomplish experimentally. The accuracy of de novo structure prediction methods, although far from the accuracy of experimental structures, has improved in recent years. 
The Rosetta de novo structure prediction method [1–4] is currently one of the best methods available for predicting the structure of proteins lacking obvious homology to known structures [5–8]. Application of Rosetta to genome-wide annotation has been limited by the difficulty of distinguishing accurate from inaccurate predictions and the computational cost associated with scaling the procedure to whole genomes. Initial results have been encouraging, showing promise on subsets of protein families and prokaryotic genomes [9,10]. We have previously [11] predicted structures for short Pfam families without structural information, and showed that a simple confidence function could partially separate correct structure predictions from incorrect predictions. There is a rich body of work on the relationship between superfamily (encoded in databases such as SCOP [12–14] and CATH [15]) and function (encoded in databases such as Kyoto Encyclopedia of Genes and Genomes [KEGG] [16] and Gene Ontology [GO] [17]). Although many superfamilies have been shown to carry out multiple functions, Hegyi and Gerstein [18] found that the majority of structure superfamilies carry out one or a few molecular functions, and conversely, that the majority of functions are carried out by one or a few SCOP superfamilies. This relationship can be exploited when predicting to which structure superfamily a protein belongs [9]. We describe an integrated approach for assigning protein domains to structure superfamilies that combines de novo structure predictions with GO function, process, and component annotations. We first parse all yeast proteins into putative structural domains using the Ginzu method [7,19]. Ginzu predicts domain boundaries by applying a hierarchy of sequence-based methods beginning with searching for homologs of known structure using PSI-BLAST [20] and ending by parsing block patterns in multiple sequence alignments (MSAs). 
After running Ginzu on the full proteome, we applied the Rosetta structure prediction method to domains shorter than 150 amino acids for which no homolog of known structure was found. The top structure predictions were compared to protein domains of known structure using the MAMMOTH protein structure comparison program [21]. The reliability of an assignment to a protein structure superfamily derived from these structure comparisons was evaluated using a logistic regression–based confidence function optimized on a large training set of Rosetta models for proteins of known structure. Superfamily predictions of increased accuracy were obtained by integrating GO function, component, and process annotations [17,22] from the Saccharomyces Genome Database [23] with the structure prediction data using a simple Bayesian approach. We predicted structures for 3,338 domains and have annotated 581 of them with novel SCOP superfamily assignments. The domain predictions, the predicted structures, and superfamily assignments are accessible at http://rd.plos.org/10.1371_journal.pbio.0050076_01. Results Predicting Structural Domains A total of 6,238 open reading frames (ORFs) were parsed into structural domains using Ginzu [7,19]. Ginzu was used successfully in Critical Assessment of Techniques for Protein Structure Prediction 6 (CASP6) to delineate domains within query proteins by sequentially searching for (1) sequence-detectable homology to the Protein Data Bank (PDB) using PSI-BLAST [20], (2) more-remote fold recognition hits to PDB structures [24,25], (3) hits to Pfam-conserved sequence family domains [26,27], and (4) block patterns in MSAs. This hierarchical application of methods is organized so that methods providing more reliable information are applied first, thus accuracy is not sacrificed as we apply multiple methods in an attempt to maximize comprehensive coverage of the genome. 
A total of 14,934 domains were predicted, of which 38% had a sequence-detectable homolog of known structure, and an additional 9% could confidently be annotated by fold recognition methods. A summary of the genome-wide domain parses is presented in Table 1, and a complete list of domain predictions is presented in Table S1.
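The domain counts quoted throughout the paper can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the text; the one assumption is that the 38% figure corresponds to the PSI-BLAST share, obtained here by subtracting the 1,361 fold recognition annotations from the 7,094 domains annotated by PSI-BLAST and fold recognition combined.

```python
# Domain counts quoted in the paper.
TOTAL_DOMAINS = 14_934
PSIBLAST_PLUS_FR = 7_094      # annotated by PSI-BLAST or fold recognition
FOLD_RECOGNITION = 1_361      # annotated by fold recognition
SHORT_UNASSIGNED = 4_006      # shorter than 150 residues, no structural link
TRANSMEMBRANE = 668           # omitted (predicted transmembrane helices)
MCM_ASSIGNED = 404            # superfamily assigned by MCM alone
GO_ASSIGNED = 177             # additionally assigned after GO integration


def percent(part, whole):
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


# ASSUMPTION: PSI-BLAST-only count is the combined count minus fold recognition.
psiblast_only = PSIBLAST_PLUS_FR - FOLD_RECOGNITION
remaining = TOTAL_DOMAINS - PSIBLAST_PLUS_FR
folded = SHORT_UNASSIGNED - TRANSMEMBRANE
novel_assignments = MCM_ASSIGNED + GO_ASSIGNED

print(percent(psiblast_only, TOTAL_DOMAINS), percent(FOLD_RECOGNITION, TOTAL_DOMAINS))
print(remaining, folded, novel_assignments)
```

Under that assumption the numbers are mutually consistent: the two percentages round to 38 and 9, and the remaining, folded, and newly assigned counts come out to 7,840, 3,338, and 581.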
Fold Recognition Although the confident fold recognition results generated as part of this study are not the main focus of this paper, they provide a wealth of information on proteins for which there are no detectable sequence homologs of known structure. The results for 1,361 domain annotations using fold recognition are detailed in Table S1 and are available at http://rd.plos.org/10.1371_journal.pbio.0050076_01. Protein Structure Prediction A total of 4,006 yeast protein domains shorter than 150 amino acids (a practical length limit for the Rosetta method) and not linked to known structures by PSI-BLAST or fold recognition methods were identified by Ginzu: 668 of these contained predicted transmembrane helices and were omitted; the remaining 3,338 domains were folded using the Rosetta de novo method. Ten thousand structure models were generated for each of these remaining 3,338 domains using the Rosetta de novo method [1,2,28] and then condensed to 30 representative models by clustering. The size of the calculation is significant and is estimated at 12 million CPU hours, or 1,350 CPU years. This calculation was performed on the World Community Grid (WCG) parallel grid computing facility provided by IBM (http://wcgrid.org). Superfamily Assignment by Structure Comparison The 30 representative models for each domain were compared to a database of experimentally determined protein structure domains (based on ASTRAL; see Materials and Methods) with representatives from all SCOP (version 1.67) superfamilies and evaluated using a confidence function (referred to as the MAMMOTH Confidence Metric [MCM]) described below. The confidence of a given prediction for a given protein-domain is estimated based on features resulting from the Rosetta structure prediction, clustering, and structure–structure matching steps (using MAMMOTH [21]). 
The primary improvement in the confidence function over our previous work [11] is the inclusion of the contact order (CO; average sequence separation of contacting amino acids [29]) of the residues superimposed in the MAMMOTH structure–structure alignment of the predicted structure with the matched structure; this CO term penalizes less-significant matches dominated by local contacts such as single long alpha helices.

Figure 1
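To make the contact order term concrete, here is a minimal sketch that computes the average sequence separation of contacting residues over a chosen set of aligned positions. The function name, the use of one representative atom per residue, and the 8 Å cutoff are illustrative assumptions; the paper's modified MAMMOTH and the original CO definition [29] may use different atoms and thresholds.

```python
import math


def contact_order(coords, aligned, cutoff=8.0):
    """Mean sequence separation |i - j| over contacting residue pairs.

    coords  : list of (x, y, z) tuples, one representative atom per residue
    aligned : iterable of residue indices belonging to the matched region
    cutoff  : contact distance in angstroms (ASSUMED value, not from the paper)
    """
    idx = sorted(set(aligned))
    total_separation = 0
    n_contacts = 0
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            if math.dist(coords[i], coords[j]) <= cutoff:
                total_separation += j - i
                n_contacts += 1
    return total_separation / n_contacts if n_contacts else 0.0


# A straight rod: only sequence-local contacts.
rod = [(3.8 * i, 0.0, 0.0) for i in range(10)]
# A hairpin: residues 5..9 fold back alongside residues 4..0.
hairpin = [(3.8 * i, 0.0, 0.0) for i in range(5)] + [
    (3.8 * (9 - i), 4.8, 0.0) for i in range(5, 10)
]

co_rod = contact_order(rod, range(10))
co_hairpin = contact_order(hairpin, range(10))
print(co_rod, co_hairpin)
```

The rod, whose contacts are all local, gets a much lower value than the hairpin, which is the behavior the CO term relies on to down-weight matches made up of isolated long helices.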
The confidence estimates derived from our SCOP benchmark set are likely to be somewhat inflated when applied to the yeast protein set for two reasons; first, as discussed in the following section, the domain boundaries are derived directly from experimental structures in our SCOP benchmark, but are subject to error for the yeast proteins, and second, in the SCOP benchmark set, there is by construction always at least one closely related structure in the correct superfamily, whereas proteins with novel folds in yeast may not belong to any pre-existing superfamily. Below and in Materials and Methods, we describe tests on two additional validation sets that include the above sources of error (and thus allow for the estimation of the effects of such errors on structure superfamily prediction). Although there is a non-negligible presence of errors in domain parsing and superfamily assignment, our results show that the superfamily assignments generated herein (see Table S2) should be valuable for stimulating the generation of experimentally testable hypotheses about the structure and often the mechanism of action of these proteins. Superfamily Assignment through Integration of Structure Predictions with Function There is a strong relationship between the function of a protein and its structural superfamily [18]. Most commonly, proteins in the same superfamily carry out one or a few functions. The reverse is also true; often only one or a few superfamilies are found to carry out a specific function. We derived probability distributions, P(GO|SF), that relate SCOP superfamily (SF) to molecular function, biological process, and cellular component (GO). We also constructed probability distributions, P(SF|D), that give the probability of a given superfamily, given the predicted structures (D), that is derived from the distributions of PMCM for a target, as described in Materials and Methods. 
These distributions were integrated to determine the degree to which a superfamily prediction is simultaneously compatible with the structure predictions and the functional annotation available for a given protein, using:
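The equation that followed this sentence is not present in this copy. A form consistent with the derivation outlined in Materials and Methods (Bayes' rule, with the GO annotation and the structure predictions treated as conditionally independent given the superfamily) would be the proportionality below. It is a reconstruction under that assumption, not the authors' typeset equation.

```latex
P(SF \mid D, GO) \;\propto\; P(SF \mid D)\, P(GO \mid SF),
\qquad
\sum_{SF} P(SF \mid D, GO) = 1
```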
The superfamily distributions derived from the structure prediction data alone (P(SF|D)), from the GO annotations alone (P(SF|GO)), and from the two together (P(SF|D,GO)) are compared in Figure 2.
Internal Standards: Additional Validation of Confidence Metric Using Proteins Solved after Calculations Were Completed

The true performance of these technologies cannot be assessed on the benchmark dataset because the domain boundaries of this set are perfect (derived from known structures in the ASTRAL database). A subset of the proteins without links to known structure at the start of this project now has strong homology to a structure that has since been solved (see Figure 4).
Importantly, we were able to use these sets of recently solved proteins to better characterize the errors associated with Ginzu domain predictions of differing confidence. We found that a subset of the incorrect domain parses, which significantly diminish the chances of correctly predicting fold and function, is easily removed using a simple filter (described in Materials and Methods). This domain-prediction filter allows us to recover more-accurate predictions for multi-domain proteins. We were able to classify 50% of the amino acids from the 6,238 attempted ORFs to SCOP superfamilies, which is significantly higher than the 35% coverage achieved by a sequence-based hidden Markov model approach [30].

Novel SCOP Superfamily Assignments

In this section, we discuss several protein complexes with components assigned to superfamilies by both GO-integration and MCM approaches. These predictions and the much larger set of predictions in the database accompanying this paper provide a basis for hypothesis generation and experimental testing, but it must be borne in mind that there is a significant probability that any single prediction is incorrect, as indicated by our estimates of error.

The mediator complex, a large complex containing 24 polypeptides [31], has been shown to be required for transcriptional activation in many eukaryotic organisms and to play key roles in transmitting regulatory information to the pre-initiation complex. During transcriptional initiation, it interacts with the RNA polymerase II holoenzyme and the promoter region. The role of the mediator complex in transcriptional regulation, the complete makeup of this complex, and its dynamic composition throughout different cell and developmental states (in response to specific regulators) are active areas of research.
To date, several studies have explored the overall makeup of the complex by probing protein–protein interactions [31] and by electron microscopy of purified mediator complex, but to our knowledge, this complex has eluded higher resolution methods such as crystallographic analysis. Although there exists an extensive body of work on the overall function of this complex, the roles, positions, and structures of most of the individual polypeptide components remain undetermined. We find confident superfamily predictions for several proteins within this complex that were not structurally annotated prior to this work. Table 4 outlines these predictions, as well as their sources and confidence estimates. Several proteins in the Mediator head domain are predicted to contain DNA-binding domains. In addition, multiple head domain proteins are predicted to be long helical bundles, potentially serving as scaffolds. ROX3 contains two predicted domains, see Figure 5
TIF35 (Figure 5 The mitochondrial ribosome, or the mitoribosome, shares a number of protein components with bacterial ribosomes, but it is believed that the mitoribosomes have comparatively more proteins than their bacterial counterparts; many of the proteins associated with the mitoribosome have no detectable sequence similarity to other mitochondrial proteins [33]. We have predicted the structure for two components known to be associated with the mitoribosome [34,35]. MRPL37 (Figure 5 INH1 (Figure 5 Data Access All data are accessible via the Yeast Resource Center (YRC) public data repository [38] at http://rd.plos.org/10.1371_journal.pbio.0050076_01. The data will also be made available in other formats upon request. Discussion Comprehensive generation of three-dimensional structures with resolution or reliability of those determined by X-ray crystallography or nuclear magnetic resonance (NMR) is currently beyond the capabilities of any protein structure prediction method; these methods can, however, play an important role in generating structural annotations for whole genomes due to the much lower investment of resources required per protein domain. In this work, we have shown that it is possible to: (1) generate protein structure models on a genome-wide scale, (2) automate the assessment of the structure prediction quality, (3) convert the results into pre-existing encodings of structure in the form of SCOP superfamily classifications, and (4) augment the model-based assignment of SCOP superfamily by integrating with pre-existing function, process, and component information encoded in the GO database. We were able to assign SCOP superfamilies to 7,094 of the 14,934 predicted domains in yeast using PSI-BLAST and fold recognition methodology. A total of 4,006 of the remaining 7,840 domains were short enough (less than 150 amino acids) for de novo structure prediction. Of these, 668 were omitted because they contained at least one predicted transmembrane helix. 
Low-resolution structure models were built for the remaining domains using Rosetta; of these, 404 were assigned to superfamilies with confidence using MCM, and an additional 177 were assigned with confidence after integrating with GO process, component, and function annotations. A significant challenge in carrying out this work was the magnitude of the computation required for generating de novo structure predictions for large numbers of domains. Robust and fast methodology, efficient data storage, analysis tools, and data organization were required. Our use of distributed computing (http://wcgrid.org), innovative database architecture [39,40], and fully automatic methods were essential for this full-genome annotation. Yeast is particularly interesting because it is the focus of a vast global research effort. Future work will include an ongoing effort to scale this procedure to over 150 completely sequenced genomes as well as to employ recently developed higher resolution structure prediction methods [41] that produce more-accurate and reliable models, but require significantly greater computational resources per protein domain. The information content in the predicted structures may be further leveraged by integration with other data such as global quantitative measurements of mRNA, protein expression levels, DNA–protein, and protein–protein interactions. Such datasets are available for yeast and several other organisms as part of ongoing functional genomics efforts, and integration of these data types with the predicted structures should contribute to the annotation of protein functions. Materials and Methods Benchmark set; folding representatives from SCOP. Two representative domains from each SCOP [12–14] superfamily were folded using the Rosetta de novo method [1,2,28]. Superfamilies without members shorter than 200 amino acids were excluded, as were proteins for which Rosetta failed to produce predictions within a reasonable time. 
One thousand models were generated for each domain. This resulted in structure predictions for 998 domains for which the structures have been experimentally determined. The predicted structures were clustered by root mean square deviation (RMSD), and the centers of the top 30 clusters were compared to a domain database generated from ASTRAL 1.67 (reduced to 40% sequence identity) [42,43] using a modified version of MAMMOTH [21] that calculates the contact order of the aligned regions of the predicted structure and the ASTRAL domain. An overview of the statistics is presented in Table 5, and a detailed description of the results in Table S4.
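The clustering step can be pictured with a small sketch. The text does not spell out the clustering algorithm, so the code below uses a generic greedy neighbor-count scheme as a stand-in: repeatedly take the model with the most unassigned neighbors within an RMSD radius as a cluster center. The function name and the `radius` parameter are hypothetical, and the pairwise RMSD matrix is assumed to be precomputed.

```python
def cluster_centers(dist, radius, top_k=30):
    """Greedy neighbor-count clustering on a precomputed distance matrix.

    dist   : square list of lists of pairwise RMSD values
    radius : RMSD threshold for cluster membership (ASSUMED parameter)
    top_k  : maximum number of clusters to return (30 in the text)
    Returns a list of (center_index, member_indices), largest cluster first.
    """
    unassigned = set(range(len(dist)))
    clusters = []
    while unassigned and len(clusters) < top_k:
        # Choose the model with the most unassigned neighbors (ties: lowest index).
        center = max(
            sorted(unassigned),
            key=lambda i: sum(1 for j in unassigned if dist[i][j] <= radius),
        )
        members = sorted(j for j in unassigned if dist[center][j] <= radius)
        clusters.append((center, members))
        unassigned -= set(members)
    return clusters


# Toy example: six "models" on a line, with distance = absolute difference.
positions = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
toy_clusters = cluster_centers(toy_dist, radius=0.5)
print(toy_clusters)
```

Because each new center is chosen only among models not yet assigned, cluster sizes are non-increasing, so taking the first 30 results corresponds to taking the 30 largest clusters under this scheme.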
The MAMMOTH Confidence Metric. The MCM estimates the probability that the MAMMOTH match between predicted structure and the ASTRAL domain (see previous section) has identified the correct superfamily and is based on the closeness of match (MAMMOTH Z-score), the length of the two proteins involved, LAstral and Lpredicted, the CO of the region of the predicted structure that was superimposable on the experimental structure, and the degree to which Rosetta converged during the generation of the set of predicted conformations (converg below; estimated during the clustering step). The general formula for the confidence functions is given in Equation 2, and the weights of the parameters (a, b, c, d, and the constant C) for the three models described in the following paragraph are presented in Table 3.
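As a rough illustration of a logistic regression confidence function of this kind, the sketch below combines the four features named in the text (MAMMOTH Z-score, a length ratio, contact order, and convergence) with weights a, b, c, d and a constant C. Which weight pairs with which feature, the definition of the length ratio, and the function names are assumptions, and no numeric weights are given here because Table 3 is not reproduced in this copy. The model selector follows the 15% secondary structure thresholds described in the next paragraph.

```python
import math


def mcm_confidence(z_score, length_ratio, contact_order, convergence, weights):
    """Logistic confidence in (0, 1) from four match features.

    weights : (a, b, c, d, C); the feature-to-weight pairing is an ASSUMPTION.
    """
    a, b, c, d, const = weights
    linear = (
        a * z_score
        + b * length_ratio
        + c * contact_order
        + d * convergence
        + const
    )
    return 1.0 / (1.0 + math.exp(-linear))


def select_model(frac_helix, frac_strand, threshold=0.15):
    """Pick the class-specific model from predicted secondary structure content."""
    if frac_helix > threshold and frac_strand < threshold:
        return "alpha"
    if frac_strand > threshold and frac_helix < threshold:
        return "beta"
    return "alpha/beta"


# Placeholder weights for demonstration only; NOT the fitted values from Table 3.
demo_weights = (1.0, 1.0, 1.0, 1.0, 0.0)
neutral = mcm_confidence(0.0, 0.0, 0.0, 0.0, demo_weights)
stronger = mcm_confidence(2.0, 0.0, 0.0, 0.0, demo_weights)
print(neutral, stronger, select_model(0.5, 0.05))
```

In practice one would hold three weight tuples, one per class, and call `select_model` on the PsiPred-predicted helix and strand fractions to choose among them.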
This model is similar to that used in previous studies [11], with two improvements. First, we have fit three separate logistic regression models, one for all-alpha proteins, one for all-beta proteins, and one for mixed alpha/beta proteins; the size of the benchmark set and the fact that we are fitting a small number of parameters allow for this trifurcation of the benchmark set. Second, we compute the CO [29] over the matched region. This penalizes the scenario in which small numbers of long secondary structure elements (usually helices) are aligned; the CO term as well as the length ratio corrects for the overly confident score we would otherwise calculate based on convergence and MAMMOTH Z-score alone. We used 5-fold cross-validation to fit each of the three secondary structure class–specific confidence functions. For selecting between the three models for a query protein, we use secondary structure content predicted by PsiPred [44]. The alpha model is used for proteins with more than 15% predicted alpha-helical content and less than 15% predicted beta-strand content. The beta model is used for proteins with more than 15% predicted beta-strand content and less than 15% predicted alpha-helical content. The alpha/beta model is used for all other domains.

Estimating superfamily probabilities, given the structure predictions. Given a set of predicted structures D for a given protein, we estimate the probability P(SF|D) that the protein belongs to superfamily SF as follows. Each superfamily is initially assigned a probability corresponding to the maximum PMCM value for that superfamily over the top five PMCM values for all predicted conformations for the query protein; probabilities less than 0.2 are set to zero. If the sum of the raw probabilities is greater than 0.8, they are scaled linearly so that the sum is 0.8. Because of the uncertainties of de novo structure prediction, these scaled probabilities, Pscaled(SF|D), are then linearly combined with the background superfamily distribution, P(SF) (Equation 3):
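The equation itself is missing from this copy, so the following sketch fills in one plausible reading of the procedure: take the per-superfamily maximum over the top five PMCM values, zero out values below 0.2, rescale to a sum of at most 0.8, and distribute the remaining probability mass according to the background P(SF). The last step, adding (1 − Σ Pscaled) · P(SF), is an assumption chosen because it yields non-zero probabilities that sum to 1; the function and parameter names are hypothetical.

```python
def superfamily_given_structure(top_matches, background, floor=0.2, cap=0.8):
    """Estimate P(SF|D) from the top-scoring MAMMOTH matches.

    top_matches : list of (superfamily, p_mcm) for the top five P_MCM values
    background  : dict superfamily -> P(SF); must sum to 1 and cover all SFs
    """
    # Maximum P_MCM per superfamily.
    raw = {}
    for sf, p in top_matches:
        raw[sf] = max(raw.get(sf, 0.0), p)
    # Probabilities below the floor are set to zero (dropped).
    raw = {sf: p for sf, p in raw.items() if p >= floor}
    total = sum(raw.values())
    # Scale linearly so the sum does not exceed the cap.
    if total > cap:
        raw = {sf: p * cap / total for sf, p in raw.items()}
        total = cap
    # ASSUMPTION: leftover mass is spread according to the background.
    leftover = 1.0 - total
    return {sf: raw.get(sf, 0.0) + leftover * p_bg for sf, p_bg in background.items()}


demo_background = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
demo_matches = [("a", 0.7), ("a", 0.5), ("b", 0.3), ("c", 0.1), ("b", 0.25)]
demo_posterior = superfamily_given_structure(demo_matches, demo_background)
print(demo_posterior)
```

In the toy example, superfamily "c" falls below the 0.2 floor and "d" has no match at all, yet both keep a small non-zero probability from the background, which leaves room for GO evidence to promote them later.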
The final distributions, P(SF|D), are guaranteed to have non-zero probabilities for every superfamily, and to sum to 1. The background distribution P(SF) ensures that (1) we do not disregard useful functional information at the integration with GO stage and (2) that we do not over interpret the confidence values derived from the benchmark training set. Integration of function. We obtain P(SF|D,GO) of a superfamily, SF, given both protein structure prediction, D, and GO annotations, GO, using Bayes' rule and the assumption that P(GO,D|SF) ~ P(GO|SF)*P(D|SF):
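Equations 4 and 5 are not present in this copy. Given the statement that Bayes' rule is used together with the factorization P(GO,D|SF) ≈ P(GO|SF)·P(D|SF), they would plausibly read as follows. The equation numbers follow the references in the text; the final approximation of P(GO|D) by P(GO) is an assumption, made because the text lists P(GO) among the quantities computed.

```latex
\begin{align}
P(SF \mid D, GO)
  &= \frac{P(D, GO \mid SF)\, P(SF)}{P(D, GO)}
  \approx \frac{P(GO \mid SF)\, P(D \mid SF)\, P(SF)}{P(D, GO)} \tag{4} \\
P(D \mid SF)
  &= \frac{P(SF \mid D)\, P(D)}{P(SF)} \tag{5} \\
\text{so that}\quad
P(SF \mid D, GO)
  &\approx \frac{P(GO \mid SF)\, P(SF \mid D)\, P(D)}{P(D, GO)}
  = \frac{P(GO \mid SF)\, P(SF \mid D)}{P(GO \mid D)}
  \approx \frac{P(GO \mid SF)\, P(SF \mid D)}{P(GO)} \notag
\end{align}
```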
After substituting Equation 5 into Equation 4, both P(SF) and P(D) cancel. P(SF|D) is computed as described in the previous section, and P(GO|SF), P(GO), and P(SF) are computed from proteins in the PDB that are annotated with GO function, component, or process and also classified in SCOP. To deal with cases in which there is a single function annotation for a given superfamily, we allow for the possibility that the uniqueness of this mapping is due to under-sampling of superfamily space (as represented by the PDB) or function space (as represented by GO) by adding pseudo-counts distributed according to the background superfamily distribution, Pastral95(SF), computed from ASTRAL 1.67 culled so that no sequences are more than 95% identical.
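The pseudo-count equation is missing from this copy. One common way to implement "pseudo-counts distributed according to a background distribution" is shown below, applied to the distribution over superfamilies for a given GO term. Both the exact form and the choice to smooth P(SF|GO) rather than P(GO|SF) are assumptions; the default M = 4 is the optimal value reported in the text.

```python
def smoothed_sf_given_go(counts, background, m=4.0):
    """Pseudo-count-smoothed distribution over superfamilies for one GO term.

    counts     : dict superfamily -> number of PDB proteins with this GO term
    background : dict superfamily -> P_astral95(SF), summing to 1
    m          : total pseudo-count mass (regularization parameter M)
    """
    n = sum(counts.values())
    return {
        sf: (counts.get(sf, 0) + m * p_bg) / (n + m)
        for sf, p_bg in background.items()
    }


# A GO term observed only in superfamily "a" still leaves mass for "b" and "c".
demo_bg = {"a": 0.5, "b": 0.3, "c": 0.2}
demo_smoothed = smoothed_sf_given_go({"a": 6}, demo_bg)
print(demo_smoothed)
```

With six observations and M = 4, the observed superfamily receives 0.8 and the unobserved ones share the remaining 0.2 in proportion to the background, so a GO term seen in only one superfamily no longer forces a probability of exactly 1.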
The parameter M (a regularization parameter controlling the relative contribution of our pseudo-counts) was estimated by carrying out function assignment given the superfamily over the benchmark set: we chose M to minimize the classification error estimated using 10-fold cross-validation. The overall procedure was relatively insensitive to the value of M over the range one to ten, with an optimal value of four. The P(SF|GO) distributions are too diffuse for confident superfamily prediction from GO annotations alone; hence, the integration with the structure prediction data is critical for accurate superfamily predictions. Equation 6 relies on the assumption that the functional annotations are independent and mutually exclusive, which is not the case. (GO is a directed acyclic graph [DAG], with an implicit conditional dependence of lower nodes on parent nodes.) Nodes can have multiple parents; thus, the probabilities of the child nodes of a more general term are not guaranteed to sum to the probability of the parent term. To circumvent this problem, we assigned the combined probability for each superfamily by taking the maximum probability for that superfamily given the predicted structures and all functions, i.e., Equation 7:
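A minimal sketch of this max-over-annotations step, followed by the normalization described in the next paragraph, is given below. It assumes the per-term score has the form P(SF|D)·P(GO|SF)/P(GO), which is a reconstruction rather than the authors' typeset equation; the function name and the dictionary layout are hypothetical.

```python
def integrate_go(p_sf_given_d, p_go_given_sf, p_go, go_terms):
    """Combine structure-based and GO-based evidence for each superfamily.

    p_sf_given_d  : dict superfamily -> P(SF|D)
    p_go_given_sf : dict superfamily -> dict GO term -> P(GO|SF)
    p_go          : dict GO term -> P(GO)
    go_terms      : GO terms annotated to the query protein
    """
    combined = {}
    for sf, p_d in p_sf_given_d.items():
        # ASSUMED per-term score; take the maximum over the protein's GO terms.
        scores = [
            p_d * p_go_given_sf.get(sf, {}).get(term, 0.0) / p_go[term]
            for term in go_terms
        ]
        combined[sf] = max(scores) if scores else p_d
    total = sum(combined.values())
    if total == 0:
        return combined
    # Normalize so the values sum to one across superfamilies.
    return {sf: score / total for sf, score in combined.items()}


demo_result = integrate_go(
    p_sf_given_d={"a": 0.6, "b": 0.4},
    p_go_given_sf={"a": {"g1": 0.1, "g2": 0.5}, "b": {"g1": 0.2, "g2": 0.1}},
    p_go={"g1": 0.2, "g2": 0.25},
    go_terms=["g1", "g2"],
)
print(demo_result)
```

Taking the maximum rather than a product over GO terms avoids treating parent and child terms in the DAG as independent evidence, at the cost of ignoring any additional support from the non-maximal terms.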
Finally, P(SF|D,GO) for any given protein domain is normalized to sum to one; thus, confident assignments are not made when there are strong matches to more than one superfamily.

Datasets for evaluation. The performance of the MCM and the GO integration was evaluated on two independent datasets. The first dataset, from the HPF project, consists of 768 predicted domains that now have a homolog with a known structure that is classified in SCOP 1.69. The homologs were identified by searching the predicted domains against all sequences from ASTRAL 1.69 and selecting those with a PSI-BLAST e-value less than 1 × 10⁻³. We also require that the shorter of the two sequences is more than 80% of the length of the longer one, and that 60% or more of the predicted domain is aligned with the ASTRAL domain. These domains are part of an ongoing project in which we predict structures for over 150 genomes; although domains with any homology to known structures are excluded, a number of structures have been solved and classified in SCOP during the 18 months the project has been running. The scope of this separate project prohibited us from carrying out fold recognition calculations on these domains, and since domains that can be assigned using fold recognition methods will on average have higher MAMMOTH structural similarities to known structures than domains that cannot be assigned, results from this dataset represent an upper bound on performance on the dataset in this paper. The second dataset was generated in the same way as the HPF set, but limited to yeast domains. The proteins from which these domains are derived have been subjected to fold recognition and hence give a better estimate of the true performance. This dataset is, however, too small for statistically significant conclusions to be made.

Domain filter.
Based on inspection of the results on the HPF dataset, domains from predicted two-domain proteins are excluded if both domains are predicted using less-confident methods (MSA, unassigned, or Pfam domains), or if the domain under consideration is an MSA domain regardless of the neighboring domain type. A large fraction of these proteins in fact have a single domain, and correct superfamily matches are quite unlikely when models are generated only from domain fragments.

Data production. The generation of structure predictions was divided into three completely automated steps: pre-processing, production (the running of Rosetta), and post-processing (clustering, superfamily assignment, and function integration). The pre-processing protocol includes domain prediction; prediction of secondary structure, disordered regions, transmembrane helices [45], and signal peptides [46]; and generation of the local structure fragments and other files necessary for running Rosetta. This step was conducted in-house on two 64-CPU Linux clusters. The production step, generating 10,000 structure predictions per domain, was completed in collaboration with IBM running Rosetta on the World Community Grid as part of a larger effort, and is estimated to have used 12 million CPU hours, or 1,350 CPU years. The post-processing step was performed in-house (using the same hardware as the pre-processing step), and included clustering and superfamily assignment by MCM and GO integration. The resulting dataset is complex, and is stored, queried, organized, and analyzed using an open-source software package, 2DDB [39,40], of our own construction.

Table S1: Complete Listing of Domain Predictions for All ORFs in Yeast. All 14,934 domains predicted from the 6,238 sequences are presented in detail. (4.7 MB PDF)

Table S2: Protein Structure Predictions with PMCM ≥ 0.8. The most confident predictions using the MCM are listed.
(133 KB PDF)

Table S3: Protein Structure Prediction with P(SF|D,GO) ≥ 0.8. The most confident predictions using the GO-integration strategy are listed. (77 KB PDF)

Table S4: Benchmark Results. Best predicted structure: the best predicted structure by RMS among the 1,000 created; Top5: the cluster centers of the five largest clusters; Best Match: the best domain match from all 30 cluster centers. (476 KB PDF)

Acknowledgments

We thank IBM (Viktors Berstis, Bill Bovermann, Rick Alther, and Robin Willner) for dedicated access to the World Community Grid (http://www.wcgrid.org) and for porting Rosetta to the grid-client. We also thank Phil Bradley and Bill Noble for helpful discussions.
Footnotes ¤ Current address: Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America Competing interests. The authors have declared that no competing interests exist. Author contributions. LM, TND, RB, and DB conceived and designed the experiments. LM performed the experiments. LM, RB, and DB analyzed the data. LM, MR, CEMS, and DC contributed reagents/materials/analysis tools. LM, RB, and DB wrote the paper. Funding. This work was funded by the National Center for Research Resources of the National Institutes of Health by a grant to TND entitled “Comprehensive Biology: Exploiting the Yeast Genome,” P41 RR11823, the Howard Hughes Medical Institute, and the U.S. Department of Defense USAMRAA W81XWH-04-1-0307. References