Structural genomics is the largest contributor of novel structural leverage

Copyright © The Author(s) 2009

1 Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th St., New York, NY 10032 USA
2 Northeast Structural Genomics Consortium (NESG) and Columbia University Center for Computational Biology and Bioinformatics (C2B2), Columbia University, 1130 St. Nicholas Ave. Rm. 802, New York, NY 10032 USA
3 Department of Biomedical Informatics, Columbia University, 630 West 168th St., New York, NY 10032 USA
4 Center for Advanced Biotechnology, Department of Molecular Biology and Biochemistry and Northeast Structural Genomics Consortium (NESG), Rutgers University, 679 Hoes Lane, Piscataway, NJ USA
5 Protein Structure Initiative Knowledge Base & RCSB PDB, Department of Chemistry and Chemical Biology, Rutgers University, 610 Taylor Rd., Piscataway, NJ 08854-8087 USA
6 New York SGX Research Center for Structural Genomics (NYSGXRC), Department of Systems and Computational Biology, Department of Biochemistry, Albert Einstein College of Medicine, New York, NY USA
7 Joint Center for Structural Genomics (JCSG), Burnham Research Institute, La Jolla, CA USA
8 Midwest Center of Structural Genomics (MCSG), Biosciences Division, Argonne National Laboratory and Department of Structural Biology, University College of London (UCL), London, WC1E 6BT UK
9 New York Consortium on Membrane Protein Structure (NYCOMPS), Department of Biochemistry, Robert Wood Johnson Medical School, University of Medicine and Dentistry of New Jersey, Piscataway, NJ 08854 USA
10 New York Consortium on Membrane Protein Structure (NYCOMPS), Columbia University, 1130 St. Nicholas Ave. Rm. 802, New York, NY 10032 USA

Corresponding author: Burkhard Rost, Phone: +1-212-8514669, Email: rost/at/rostlab.org.
Received November 15, 2008; Accepted December 8, 2008.
Abstract

The Protein Structure Initiative (PSI) at the US National Institutes of Health (NIH) is funding four large-scale centers for structural genomics (SG). These centers systematically target many large families without structural coverage, as well as very large families with inadequate structural coverage. Here, we report a few simple metrics that demonstrate how successfully these efforts optimize structural coverage: while the PSI-2 (2005 to now) contributed more than 8% of all structures deposited into the PDB, it contributed over 20% of all novel structures (i.e. structures for protein sequences with no structural representative in the PDB on the date of deposition). The structural coverage of the protein universe represented by today’s UniProt (v12.8) has increased linearly from 1992 to 2008; structural genomics has contributed significantly to the maintenance of this growth rate. Success in increasing novel leverage (defined in Liu et al. in Nat Biotechnol 25:849–851, 2007) has resulted from systematic targeting of large families. PSI’s per-structure contribution to novel leverage was over 4-fold higher than that for non-PSI structural biology efforts during the past 8 years. If the success of the PSI continues, it may take just another ~15 years to cover most sequences in the current UniProt database.

Keywords: Protein structure determination, Structural genomics, Evolution, Protein universe

Introduction

Systematic targeting of the largest families without structural coverage

The US contribution to Structural Genomics (SG), the Protein Structure Initiative (PSI), is funded by the National Institutes of Health-National Institute of General Medical Sciences (NIH-NIGMS). The second 5-year phase of the initiative, PSI-2, began in 2005.
Four large-scale Structural Genomics Centers were created for high-throughput production of protein structures (JCSG, MCSG, NESG, NYSGXRC), as well as six Specialized Research Centers, with both groups charged with continuing to develop the technologies needed for large-scale protein structure determination [30]. The four large-scale production centers are currently poised to generate over 3,000 entirely new experimental 3D structures of proteins for the biomedical research community, in addition to the over 1,300 structures that originated from the pilot phase. At the end of the first 3 of those 5 years, the four centers had already deposited almost 2,000 new 3D structures (data from TargetDB, [9]). Through the development and advancement of biochemical, robotic, NMR, crystallographic and computational techniques, SG centers are decreasing the cost and time required to determine a protein structure in order to advance the structural coverage of sequence space and biomedical research. The development and advancement of high-throughput protein production and protein structure determination pipelines are critical to the eventual characterization of protein structure space, to expanding our understanding of molecular evolution, and to addressing biomedical problems such as drug discovery.

Metrics of success

Several metrics of success have been developed to monitor the evolution of structural genomics during PSI [8, 18, 22]. These include (i) total numbers of PDB depositions, (ii) numbers of distinct sequences (<98% pairwise sequence identity) for which an experimental structure is determined, (iii) numbers of ‘novel structures’, defined as a structure for a protein having <30% sequence identity with any protein structure already in the PDB, (iv) first 3D structure from a particular domain family, (v) first 3D structure from a particular functional class of proteins, (vi) protein structures which provide a novel testable hypothesis about function, and other metrics.
In the following paragraphs we outline some of these metrics, which relate to the value of experimental 3D structures in providing useful structural information about homologous protein sequences.

Modeling leverage of experimental structures

Homologous proteins from different organisms, defined as those that have evolved from a common recent ancestral protein, usually share similar 3D structures [10, 28, 31, 35]. Therefore, the PSI does not aim at producing structures of every protein from every organism. Instead, the PSI aims to identify structural domains in proteins, systematically organize these protein domains into sequence-structure families, and determine the 3D structure of one or a few representatives from many of these families. The ultimate goal is to attain structural coverage for every major protein domain family found in nature. Almost 50,000 experimentally determined 3D structures have been deposited into the PDB [4]. However, this accounts for less than 1% of the ~6 million protein sequences deposited into UniProt [2]. As genome sequencing technologies advance, sequence data is being generated at an ever increasing pace, not only for complete genomes of organisms but even for entire ecologies of hundreds or thousands of microorganisms (metagenomics) [12, 36, 39]. Accordingly, the rate of discovery of new protein sequences will continue to increase much faster than the rate of protein structure determination. The fact that homologous protein domains have similar structures enables the application of homology, or comparative, modeling methods [17, 32]. Comparative modeling leverages the information provided by each experimental structure manyfold. For example, it has been proposed that experimental determination of 3D structures for one representative of each of the largest 1,000–2,000 protein domain families would be sufficient to allow modeling, at some approximate level, of more than half of all the residues in all of UniProt [21, 38].
The “modeling leverage” of a particular 3D structure (modeling template) depends on several factors, including (i) the sequence similarity between the template with known experimental structure and target proteins of unknown structure, (ii) the method of comparative modeling, and (iii) the criteria by which a model is judged to be “useful”. The third factor (what is good enough?) can be especially difficult to ascertain: rather inaccurate models (e.g. just the overall fold) are sufficient for some important applications, while other applications may require very high accuracy models. Benchmark studies suggest that sequence similarity of >40% over >50 residues generally provides models with heavy atom root-mean-square deviations of <2.5 Å from the true experimental structure [6, 11, 16, 24–27]. However, templates that are less similar in sequence to the target may provide even more accurate models, and templates that are more similar in sequence may yield less accurate ones. Leverage also must be defined with respect to what portion of the target protein can be modeled from the experimental template, leading to metrics for full protein models, protein domain models, or residue models per experimental template. Modeling leverage also needs to be defined with respect to a particular sequence database, e.g. with respect to a particular version of UniProt.

Structural coverage

The concept of modeling leverage is intimately associated with the concept of structural coverage, i.e. the number or percentage of a particular set of protein sequences, domains, or residues, which can be modeled from a particular set of experimental protein templates. Structural coverage of the protein universe (i.e. a particular version of UniProt), of an entire proteome of an organism (e.g. the human proteome), of an ecology of organisms (e.g. all human gut microorganisms), or of a system of co-functioning proteins (e.g.
proteins associated with a particular biological process), are all key metrics in measuring the success of SG that depend on the definition of modeling leverage.

Novel modeling leverage and novel coverage

Related to the concept of modeling leverage is the concept of novel modeling leverage [22], operationally defined as the number of proteins/domains/residues that could not be modeled (based on the above specific definition of leverage) as of the date the subject experimental structure was deposited into the public PDB [22]. The novel leverage provided by a set of experimental 3D structures across a particular set of protein sequences defines the novel coverage provided by these structures. This concept of leveraging experimental structures, and particularly novel leverage, has been fundamental to the process of target selection by large-scale centers during PSI-2. In particular, the large-scale centers systematically target the largest protein domain families for which we currently have little or no structural coverage.

The need for a standard convention

The modeling leverage value of a particular experimental structure, or the coverage of a set of sequences by a set of structures, depends upon the details of the thresholds defined for sequence similarity that can be expected to provide a “useful” model, as outlined above. There are also certain technical issues which may or may not be accounted for in any method of assessing novel leverage. Examples of such issues, which are not accounted for in the current work, include: (i) while a sequence may be modeled from a structure already in the PDB on the date of deposition of the subject structure, the subject structure may allow more accurate modeling of this sequence, and (ii) one may or may not discount the novel modeling leverage of a particular structure by the modeling leverage of experimentally determined structures subsequently deposited in the PDB.
It is simply not possible to define universal thresholds or criteria of model accuracy that are appropriate for the full range of applications for which models are used. Thus, the novel leverage reported for the same data by different groups may vary widely. Here, we adopt as a convention the definitions and thresholds proposed by Liu et al. [22] for assessing modeling leverage, novel modeling leverage, and the corresponding metrics of novel coverage. This is a convenient measure of “modelability” that is easily reproducible with relatively modest computing resources (the analysis presented here consumed less than 2 CPU-years).

Methods

Data set

All data about the status of structural genomics targets were taken from TargetDB [9]. Leverage, novel leverage, and the corresponding metrics for coverage were determined by the method of Liu et al. [22]. The basic concept is the following. We begin with a fixed version of UniProt, in this case release 12.8 from Feb 2008, containing 5,678,599 protein sequences with 1,851,231,082 residues. For this version we compile the number of proteins and residues that align (PSI-BLAST E-value < 10^-10, 3 iterations on UniProt, one on PDB with background estimates based on UniProt size; for more details see Liu et al. [22]) to any protein of experimentally determined 3D structure deposited into the PDB at a given time point T = T0. Novel leverage is then everything that was not covered by this simple alignment protocol at T0 and has arisen from structures added to the PDB at T1 > T0; total leverage is computed as everything covered by any structure in the PDB under this criterion.

Novel structures

We loosely refer to an experimental structure (more precisely, the structure specified by a particular PDB identifier) as a novel structure if at least 50 residues of this structure could be used to create novel leverage.
This implies, in particular, that novelty was not constrained by any definition of structural similarity between the new coordinate set and any other structure already in the PDB. When compiling per-residue estimates for novel leverage, we did not apply any such threshold; instead, any single residue that could not have been modeled before counted.

Novel leverage versus novel coverage

Leverage and coverage are related metrics that differ essentially only in the perspective they provide:
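The display that followed this sentence did not survive extraction. As a hedged reconstruction only, using the symbols DB, DS and E0 that appear in the surrounding text, the two quantities might be written as below. The set notation, the symbol S for a set of experimental structures, and E(p, S) for the best PSI-BLAST E-value between sequence p and any structure in S are our assumptions, not the authors' notation.

```latex
\begin{align}
\mathrm{Leverage}(S) &= \bigl|\{\, p \in DB : E(p, S) < E_0 \,\}\bigr| \\
\mathrm{Coverage}(S) &= 100 \cdot
  \frac{\bigl|\{\, p \in DS : E(p, S) < E_0 \,\}\bigr|}{|DS|}
\end{align}
```

Read this way, leverage is a count over the reference database DB, while coverage is the corresponding percentage over a chosen sequence set DS; per-residue versions would count aligned residues instead of proteins.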
In the context of this work, we used DB = UniProt 12.8 (Feb. 2008) and E0 = PSI-BLAST E-value < 10^-10. Coverage often is compiled with respect to the same database as leverage, i.e. DS = DB. In fact, this is the metric that we compiled for this work. However, we have also compiled coverage values for the set of proteins in particular organisms, e.g. focusing on the structural coverage for the human proteome [29]. In principle, leverage and coverage are symmetric: both can be compiled on the same data set, and the only essential difference is that one counts numbers, the other percentages. Both leverage and coverage can be computed on a per-structure, per-residue or per-annum basis. Frequently, we also compiled those numbers as sums over all PSI structures, in light of the sum over all PDB structures and/or over all PDB structures without those PSI structures. The measures for leverage and coverage as defined above have a severe problem: they do not distinguish at all between structures that provide new information and those that simply confirm the information we already have in the PDB. This effectively implies that the measures as defined above do not capture a scientifically relevant reality. This problem is easy to fix: all we need to do is to compile the leverage/coverage at a given time and then define the novelty provided by new structures as the added leverage and coverage. We have introduced this simple metric as “novel leverage” and “novel coverage”, and defined them by:
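The formal definition that followed here was lost in extraction. The idea described in the prose (credit each structure only with the sequences that no earlier deposition already covered) can be illustrated with a minimal Python sketch. The function names, the input layout (precomputed alignment hits per structure) and the tie-breaking by identifier for equal deposition dates are all illustrative assumptions, not the authors' pipeline.

```python
from datetime import date

def novel_leverage(hits, deposited):
    """Assign each structure the sequences it covers for the first time.

    hits:      dict mapping structure id -> set of sequence ids that align
               to it below the E-value threshold (assumed precomputed).
    deposited: dict mapping structure id -> deposition date.
    Returns a dict mapping structure id -> set of newly covered sequences.
    """
    covered = set()   # sequences covered by structures deposited earlier
    novel = {}
    # Walk through structures in order of deposition date; ties are broken
    # by identifier (an assumption, the source does not specify this).
    for sid in sorted(hits, key=lambda s: (deposited[s], s)):
        new = hits[sid] - covered
        novel[sid] = new
        covered |= new
    return novel

def novel_coverage(novel, db_size):
    """Express novel leverage as a percentage of the reference database."""
    return {sid: 100.0 * len(seqs) / db_size for sid, seqs in novel.items()}

# Toy example with hypothetical identifiers.
example_hits = {"A": {"p1", "p2"}, "B": {"p2", "p3"}, "C": {"p1"}}
example_dates = {
    "A": date(2001, 1, 1),
    "B": date(2003, 5, 1),
    "C": date(2006, 2, 2),
}
example_novel = novel_leverage(example_hits, example_dates)
```

In the toy example, structure B is credited only with p3 because p2 was already covered by the earlier structure A, and C receives no novel leverage at all. A per-residue variant would track covered residues rather than whole sequences.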
With the same choices as above: DB = UniProt 12.8, and E0 = PSI-BLAST E-value < 10^-10. The deposition date in the PDB entry decides whether or not a structure is novel. One important and desired consequence of this definition is the following. Assume you solved a structure that has high impact in the sense that many groups use it as a basis for molecular replacement to determine more accurate structures of the same or of a similar protein sequence. Then the first structure in this family of structures is recognized for the novel information it provided on the date it was deposited in the PDB. The problem that remains, and that we have not yet addressed convincingly, is how to measure the benefit of a structure that allows better models to be built for proteins for which we can already build models. As indicated by Eq. 2A, only sequences that match the sequence of the template with the minimal threshold (E-value < 10^-10) count.

Results and discussion

Every other novel structure from the USA now from the PSI

A primary goal of the PSI has been the development of automation and robotics for large-scale protein structure determination. It took a few years to scale the pipelines up to “high-throughput” levels; currently some 600–800 protein structures per year (i.e. two structures per day). Progress is evident: over 1,300 structures originated from the pilot phase PSI-1 (2000–2005), and after 3 of the 5 years of PSI-2, the large-scale centers have already deposited almost 2,000 new 3D structures. This success is also evident in the increased contribution from PSI to all experimental structures deposited into the PDB: over the course of PSI-2 (2005/07/01–2008/09/19), PSI centers have contributed almost 9% of all structures world-wide and all structural genomics (SG) centers have contributed almost 18% of all structures (data not shown).
As the PSI is entirely financed by the NIGMS at the NIH in the USA, its contribution should be compared directly to structures deposited into the PDB from US-based laboratories: in the first 3.25 years of PSI-2 (labeled 2005–2009, where 2009 represents only the first quarter of Year 4), PSI-2 centers alone had contributed about 18% of all structures deposited by US structural biology groups (Fig. 1
Given that novel leverage is an important criterion in PSI-2 target selection, PSI-generated structures also possess more novel leverage than structures from non-structural-genomics groups (Figs. 1
Structural coverage of sequence space continues to increase

We froze a version of the entire sequence space known in Feb. 2008 (UniProt 12.8) and then estimated to what extent the structures deposited into the PDB at a given time point could have been used to structurally cover this sequence universe. We compiled two separate values, one estimating the per-protein coverage that considers a new arrival to cover a new protein when at least 50 consecutive residues were aligned above the threshold (E-value < 10−10, Methods; orange in Figs. 1
The PSI contribution to the coverage added by US structures is now exceeding the 50% mark, i.e. PSI-2 contributes more novel leverage, and hence more coverage than all other US efforts (Fig. 1

Overall, the structural coverage of UniProt 12.8 increased slowly, up until about 1992 (Fig. 2

If we reset the coverage clock to zero at the beginning of PSI, and compute the gain over the structural coverage in a given year (Fig. 2

PSI per-protein gain in novel leverage is 3–4 fold higher than PDB without PSI

The success of PSI-2 in increasing novel leverage in a competitive environment is being demonstrated most clearly when we compile the annual increase in novel structural coverage per deposited structure. The per-structure leverage of PSI has consistently been 5–8 times higher than the corresponding number for non-SG structures (Fig. 3

Although novel leverage per structure has been dropping, the total number of novel structures solved by PSI groups has increased in each year of the program. Has this sufficed to counterbalance the increase in the difficulty of the task? One answer is provided by Fig. 2
PSI has by now targeted and worked on most of the largest 16,787 sequence-structure families with prokaryotic representatives. PSI-2 continues to pick the largest remaining families; however, those families are becoming smaller. The novel leverage of all non-PSI structures in the PDB is also decreasing. This is partly due to the same reason: the largest families are either structurally covered or continue to evade structure determination. Furthermore, as already discussed, the generation of novel leverage becomes increasingly challenging. Does this imply that attempts at experimentally determining structures for new sequence-structure families will be doomed? Despite efforts in optimizing novel leverage and providing structures for as yet uncharacterized domain sequence families, structural genomics has not discovered many truly novel structures [1, 5, 7, 8, 13, 15, 18]. Indeed, the discovery of previously unobserved protein structure space (new geometries and principles not seen before) is becoming increasingly difficult [22]. This implies that (i) we now know most protein structure geometries or folds and (ii) on average, staying within the vicinity of known structures is more likely to result in a successful structure determination. By design, PSI-2 has been attempting, and succeeding, in targeting proteins which are not similar to proteins with known structures, i.e. to increase the odds of discovering new territories through the development of high-throughput pipelines and technologies. To rephrase this in a common analogy: by focusing on protein domain families with no structural representatives, PSI-2 has systematically targeted and succeeded in reaching “higher-hanging fruits”.

Many other criteria for success

Structural genomics, by design, is a hypothesis-generating instead of a hypothesis-driven endeavor.
It shares this aspect with many new high-throughput genomics projects in the evolving molecular biology discipline although, unlike other genomics projects, structural genomics continues to generate very high-resolution, detailed molecular data. The success of the PSI is reflected by many aspects, which range from increasing the speed of structure determination and deposition (both dramatically increased during this decade), through high literature impact and extreme reduction in the number of papers per structure, to the push of automation and robotics, which increases the diverse biophysical measurements readily available to researchers in related fields with different expertise. Objective criteria that allow the monitoring of the degree to which scientific endeavors deliver what they promised are naturally becoming integral parts of a landscape in which the funding for science shrinks, while the challenges for the scientist arguably increase, and in which an increasing fraction of all science is funded by temporary grants. Here, we have demonstrated that PSI-2 has been extremely successful by the aims it posed at the start: it contributed substantially toward the increase of novel leverage, to the extent that a future without PSI will clearly imply a considerable lengthening of the time needed to cover today’s protein universe. Given that the PSI was successful in meeting the milestones that the PSI commission posed, the aim now is to finish with a wider perspective that considers the optimization of structural coverage as a means and not as an end. One aspect of structural genomics is the adventure of mapping unknown spaces. We seek connections to create maps. These objectives require the coincidence of a wealth of sequences and structures in spaces that have hardly been experimentally covered (i.e.
families of unknown function) but appear to be extremely important, as demonstrated by the annotations for the universal family of EVE/PUA/PUA-like proteins enriched by structural genomics [5]. All these connections contribute to the understanding of protein evolution. The PSI has covered an immense fraction of the prokaryotic sequence space in terms of generating protocols, reagents, and experimental data. This wealth is available today through the PSI Materials Repository (http://www.hip.harvard.edu/gateway/) and through the PSI Knowledge Base (http://kb.psi-structuralgenomics.org/). A relatively small fraction of the target families have so far yielded experimental structures, but this “small” fraction now contributes over one-third of the novel leverage worldwide, providing structural templates for over 300,000 new reliable protein structure models. Another long-term impact is the contribution toward making structure an integral part of molecular biology and toward converting structure determination from an amazing art mastered by few into a pipeline accessible to many. Clearly, the cost reduction and the development of sophisticated semi-automated high-throughput pipelines contributed immensely to making this happen. Without structural genomics, today’s level of automation would not have been reached at all. The development of cheaper sequencing techniques was certainly not a goal of the human sequencing project. But those techniques have been changing biology immensely over the last decade.

16–20 years to go to complete coverage of sequence universe?

How much more is left to do? The following rationale provides an oversimplified answer. Firstly, we have estimated that at least 20% of all residues in proteomes are not viable targets for structural genomics because they encode complex integral membrane proteins, long continuous coiled-coil regions, long regions that are natively unstructured, and leftovers from partial models (e.g.
model A covers domain D1 from residues 6–55, model B covers domain D2 from residues 61–100 in a protein of 100 residues; this leaves 10 residues, 1–5 and 56–60, as non-viable targets) [21]. Most of these 20% of the residues are in short regions not assigned to a particular domain and are probably some sort of domain linkers and embellishments. Put differently, 80% per-residue coverage implies “completion”. Secondly, today’s coverage is about 40%, i.e. 40% (80–40) remains to be done. Thirdly, extrapolating from Fig. 2

Clearly, the assumption of identical growth is overly optimistic: the rate has been kept at a linear growth only due to the focused effort of structural genomics. Given that PSI-2 has already cloned almost all the largest viable families, it is clear that the future leverage will be lower. Moreover, as new genomes are sequenced, only a fraction of these sequences map to known protein domain families, and the uncovered protein universe continues to grow. Furthermore, it might be argued that the 40% of the residues that remain to be structurally explored will constitute proteins that are much more challenging for structure determination than those in the 40% of the residues that are covered today. If so, structural genomics methods might fail to capture those residues in these much more challenging classes of proteins, and our assumption of a constant growth rate might be inappropriate. True, this might be so, and we have no scientific argument to dispel this concern. However, we can move back into the past and pretend to estimate for what was then the future: e.g. if we had taken the growth rate from 1994 to 2000 to estimate the coverage of 2008, we would have been completely right (Fig. 2

Where from here?

We have established structural genomics as an extremely efficient way to discover new areas in the protein universe that will undoubtedly continue to invoke testable hypotheses for years to come. Will the trend continue?
Can we extrapolate from today’s data, or will we need something completely different to efficiently cover what remains? Clearly, we have to improve structure determination for sequence-structure families from eukaryotes. Today, it requires some 5–10-fold more resources to determine the structure for an average eukaryotic protein than for an average prokaryotic protein. A considerable fraction of the untouched sequence space falls into sequence-structure families that exclusively represent eukaryotes. Clearly, targeting this important domain becomes an important objective. Another fact of PSI-2 was that structure determination has so far succeeded for less than 30% of all families targeted. Developing techniques that allow a substantial increase in this yield appears to be another important goal. The final question is how much the part of the universe without structural coverage will differ from the part we cover today. Clearly, we need to find ways to make structural genomics work for types of proteins for which it has so far had only limited success, including membrane proteins, eukaryotic proteins, and secreted proteins. Are there any new structural principles out there that remain to be discovered and that totally elude today’s techniques for structure determination? Biology is so full of innovation and surprise that the answer will clearly be in the affirmative. To what extent this will be the case remains utter speculation. However, we have strong evidence that a considerable part of what is left falls into the category of proteins that are unusually flexible, or intrinsically unstructured, and that possibly do not adopt regular structures without a binding partner. Do we therefore have to step up in terms of complexity and attack the problem of a structural genomics for complexes? Clearly, this will be one of the important challenges for both the short term and the long term of structural genomics.
Acknowledgments

Thanks to Marco Punta (Columbia) for critical comments on the manuscript and work; to Paul Glick and Guy Yachdav (Columbia) for computer assistance; and to all those who contribute to making structural genomics an exciting scientific breakthrough. This work was supported primarily by the grant U54-GM074958-01 to the Northeast Structural Genomics consortium (NESG). PSI would not be possible without the dedication and continued support of grant agencies and in particular of those administrating grants. We would especially like to acknowledge many at NIGMS, in particular Jeremy Berg, John Norvell, Charles Edmonton, and Jerry Li. Last, not least, thanks to Amos Bairoch (SIB, Geneva), Rolf Apweiler (EBI, Hinxton), Phil Bourne (San Diego Univ.), John Westbrook (Rutgers), Helen Berman (Rutgers), and their crews for maintaining excellent databases, and to all experimentalists who enabled this analysis by making their data publicly available.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
References
1. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG (2008) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36:D419–D425. doi:10.1093/nar/gkm993
2. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M et al (2004) UniProt: the universal protein knowledgebase. Nucleic Acids Res 32:D115–D119. doi:10.1093/nar/gkh131
3. Berman HM, Burley SK, Chiu W, Sali A, Adzhubei A, Bourne PE, Bryant SH, Dunbrack RL Jr, Fidelis K, Frank J et al (2006) Outcome of a workshop on archiving structural models of biological macromolecules. Structure 14:1211–1217. doi:10.1016/j.str.2006.06.005
4. Berman H, Henrick K, Nakamura H, Markley JL (2007) The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res 35:D301–D303. doi:10.1093/nar/gkl971
5. Bertonati C, Punta M, Fischer M, Yachdav G, Forouhar F, Zhou W, Kuzin AP, Seetharaman J, Abashidze M, Ramelot TA et al (2008) Structural genomics reveals EVE as a new ASCH/PUA-related domain. Proteins. doi:10.1002/prot.22287
6. Bhattacharya A, Wunderlich Z, Monleon D, Tejero R, Montelione GT (2008) Assessing model accuracy using the homology modeling automatically software. Proteins 70:105–118. doi:10.1002/prot.21466
7. Bourne PE, Allerston CK, Krebs W, Li W, Shindyalov IN, Godzik A, Friedberg I, Liu T, Wild D, Hwang S et al (2004) The status of structural genomics defined through the analysis of current targets and structures. Pac Symp Biocomput 9:375–386
8. Chandonia JM, Brenner SE (2005) Implications of structural genomics target selection strategies: Pfam5000, whole genome, and random approaches. Proteins 58:166–179. doi:10.1002/prot.20298
9. Chen L, Oughtred R, Berman HM, Westbrook J (2004) TargetDB: a target registration database for structural genomics projects. Bioinformatics 20:2860–2862. doi:10.1093/bioinformatics/bth300
10. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823–826
11. Fernandez-Fuentes N, Rai BK, Madrid-Aliste CJ, Fajardo JE, Fiser A (2007) Comparative protein structure modeling by combining multiple templates and optimizing sequence-to-structure alignments. Bioinformatics 23:2558–2565. doi:10.1093/bioinformatics/btm377
12. Fraser-Liggett CM (2005) Insights on biology and evolution from microbial genome sequencing. Genome Res 15:1603–1610. doi:10.1101/gr.3724205
13. Gerstein M, Edwards A, Arrowsmith CH, Montelione GT (2003) Structural genomics: current progress. Science 299:1663. doi:10.1126/science.299.5613.1663a
14. Grant A, Lee D, Orengo C (2004) Progress towards mapping the universe of protein folds. Genome Biol 5:107. doi:10.1186/gb-2004-5-5-107
15. Harrison A, Pearl F, Sillitoe I, Slidel T, Mott R, Thornton J, Orengo C (2003) Recognizing the fold of a protein structure. Bioinformatics 19:1748–1759. doi:10.1093/bioinformatics/btg240
16. Koh IYY, Eyrich VA, Marti-Renom MA, Przybylski D, Madhusudhan MS, Narayanan E, Grana O, Valencia A, Sali A, Rost B (2003) EVA: evaluation of protein structure prediction servers. Nucleic Acids Res 31:3311–3315. doi:10.1093/nar/gkg619
17. Kopp J, Schwede T (2004) The SWISS-MODEL repository of annotated three-dimensional protein structure homology models. Nucleic Acids Res 32:D230–D234. doi:10.1093/nar/gkh008
18. Levitt M (2007) Growth of novel protein structural data. Proc Natl Acad Sci USA 104:3183–3188. doi:10.1073/pnas.0611678104
19. Liu J, Rost B (2003) Domains, motifs, and clusters in the protein universe. Curr Opin Chem Biol 7:5–11. doi:10.1016/S1367-5931(02)00003-0
20. Liu J, Rost B (2004) CHOP: parsing proteins into structural domains. Nucleic Acids Res 32:W569–W571. doi:10.1093/nar/gkh481
21. Liu J, Hegyi H, Acton TB, Montelione GT, Rost B (2004) Automatic target selection for structural genomics on eukaryotes. Proteins 56:188–200. doi:10.1002/prot.20012
22. Liu J, Montelione GT, Rost B (2007) Novel leverage of structural genomics. Nat Biotechnol 25:849–851. doi:10.1038/nbt0807-849
23. Marsden RL, Orengo CA (2008) Target selection for structural genomics: an overview. Methods Mol Biol 426:3–25. doi:10.1007/978-1-60327-058-8_1
24. Marti-Renom MA, Stuart A, Fiser A, Sanchez R, Melo F, Sali A (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325. doi:10.1146/annurev.biophys.29.1.291
25. Marti-Renom MA, Madhusudhan MS, Fiser A, Rost B, Sali A (2002) Reliability of assessment of protein structure prediction methods. Structure 10:435–440. doi:10.1016/S0969-2126(02)00731-1
26. Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A (2005) Critical assessment of methods of protein structure prediction (CASP)-round 6. Proteins 61:3–7. doi:10.1002/prot.20716
27. Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A (2007) Critical assessment of methods of protein structure prediction-round VII. Proteins 69(Suppl 8):3–9. doi:10.1002/prot.21767
28. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540
29. Nair R, Fajardo E, Fiser A, Godzik A, Jaroszewski L, Marsden R, Orengo C, Rost B (2008) Progress at PSI—milestones measuring the success of structural genomics in the USA. Columbia University, New York.
30. Norvell JC, Berg JM (2007) Update on the protein structure initiative. Structure 15:1519–1522. doi:10.1016/j.str.2007.11.004
31. Orengo CA, Michie AD, Jones DT, Swindells MB, Thornton JM (1997) CATH—a hierarchic classification of protein domain structures. Structure 5:1093–1108. doi:10.1016/S0969-2126(97)00260-8
32. Pieper U, Eswar N, Braberg H, Madhusudhan MS, Davis FP, Stuart AC, Mirkovic N, Rossi A, Marti-Renom MA, Fiser A et al (2004) MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 32:D217–D222. doi:10.1093/nar/gkh095
33. Pieper U, Eswar N, Davis FP, Braberg H, Madhusudhan MS, Rossi A, Marti-Renom M, Karchin R, Webb BM, Eramian D et al (2006) MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res 34:D291–D295. doi:10.1093/nar/gkj059
34. Redfern OC, Harrison A, Dallman T, Pearl FM, Orengo CA (2007) CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Comput Biol 3:e232. doi:10.1371/journal.pcbi.0030232
35. Sander C, Schneider R (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9:56–68. doi:10.1002/prot.340090107
36. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428:37–43. doi:10.1038/nature02340
37. Watson JD, Todd AE, Bray J, Laskowski RA, Edwards A, Joachimiak A, Orengo CA, Thornton JM (2003) Target selection and determination of function in structural genomics. IUBMB Life 55:249–255. doi:10.1080/1521654031000123385
38. Yeats C, Lees J, Reid A, Kellam P, Martin N, Liu X, Orengo C (2008) Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res 36:D414–D418. doi:10.1093/nar/gkm1019
39. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W et al (2007) The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 5:e16. doi:10.1371/journal.pbio.0050016