Copyright © 2007 by The National Academy of Sciences of the USA

Applied Biological Sciences

Quantitative assessment of protein function prediction from metagenomics shotgun sequences

*Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany; and ‡Max Delbrück Centre for Molecular Medicine, D-13092 Berlin, Germany

§To whom correspondence should be addressed. E-mail: bork/at/embl.de

Edited by Michael S. Waterman, University of Southern California, Los Angeles, CA, and approved July 17, 2007

Author contributions: E.D.H., A.H.S., C.v.M., and P.B. designed research; E.D.H., A.H.S., T.D., L.J.J., and J.R. performed research; E.D.H., A.H.S., T.D., I.L., L.J.J., and J.R. analyzed data; and E.D.H., A.H.S., and P.B. wrote the paper.

†Present address: Institute of Molecular Biology, Y55-L76, University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland.

Received March 23, 2007.

Abstract

To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation.
Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.

Keywords: fatty acid, heme, neighborhood, environmental genomics, metagenome annotation

Recent years have seen an explosion in the amount of shotgun sequence data gathered from diverse natural environments. Since 2004, almost 2 billion base pairs resulting from published large-scale metagenomics sequencing projects have been deposited [as of January of 2007 (1–8)], eclipsing the entire 764 Mbp of previously sequenced microbial genomes (9). Large-scale environmental sequencing efforts have the potential to considerably enhance our understanding of cellular processes, identify ubiquitous as well as unique biological functions in each environment, and close the gaps in our knowledge between genotype, phenotype, and environment. Until the identified ORFs are correctly annotated with biological functions, however, we are simply left with a vast amount of information but no contextual knowledge, analogous to the early days of genome sequencing.

Currently, characterizing an unknown sequence involves comparing it to sequences or protein domains of known function in public databases, usually by using BLAST (10) or other homology search tools (11). By applying BLAST-based annotation methods to newly sequenced genomes, functions can typically be assigned to ≈70% of the gene products (11–13). Unfortunately, these predictions have been estimated to include 13–15% database propagation errors (14) and are only possible if the unknown sequence has at least one BLAST hit. To complement homology-based function prediction, particularly in prokaryotes, additional information from genomic neighborhood (15, 16), phylogenetic profiles (17), gene coexpression (18), and gene fusion (19, 20) has been used and combined (18, 21).
As yet, only the exploitation of genomic neighborhood (including gene fusions) is feasible in the context of metagenomic shotgun data.

In the first large-scale shotgun metagenomics projects from four diverse and complex environments [tropical surface water from the Sargasso Sea near Bermuda (2), farm soil from Minnesota (4), an acidophilic biofilm from an iron ore mine in northern California (1), and three samples from “whale fall” carcasses on the deep Pacific and Antarctic ocean floor (4)], functions have been predicted based on sequence similarity for only 27–48% of the 1.4 million genes in the different samples [see supporting information (SI) Table 1]. This implies that for the majority of proteins in the environment, functions remain unknown, and no attempt has yet been made to discover novel functionality. Furthermore, for each project, different methods, parameters, and even definitions of function were used, which are often not easily accessible to the community, making a comparison of the different samples difficult.

To be able to comprehensively predict functions from various metagenomics samples and to get a consistent overview of function in different environments, we developed a sensitive prediction protocol that complements BLAST- and domain-based function predictions with newly developed and adapted gene neighborhood methods. Applying this protocol to the samples revealed a considerable predictive power, indicating that function can be inferred for most of the genes on earth; yet the majority of functions appear to reside in numerous rare, small protein families that remain largely unexplored.

Results and Discussion

An Operational Definition of Protein Function. Biological function is a fuzzy term summarizing a complex concept applicable to different spatial scales (22, 23). At the molecular and (sub-)cellular level, an operational framework with clearly defined terms and thresholds is therefore required when attempting to quantify protein function.
To infer specific function from existing database annotations by using homology, we require similarity to an environmental (partial) ORF >60 bits, corresponding roughly to an e-value of 10⁻⁸ in Uniref90 searches (4). This level of sequence similarity is rather strict in terms of homology identification but without further analysis may be insufficient to distinguish between paralogs and orthologs, thus not capturing all functional features such as enzyme substrate specificity. It is, however, sufficient to capture basic functionality. We used a hierarchical classification scheme, favoring manual annotation, to divide environmental ORFs and, for comparison, 124 prokaryotic proteomes into four categories based on the level of functional annotation possible: (i) those with strong similarity to, or in the genomic neighborhood of, a gene with specific functional annotation; (ii) those with strong similarity to genes with nonspecific functional information, weak but significant similarity to genes with any functional annotation, or in the genomic neighborhood of either of these; (iii) those with strong similarity to, or in the genomic neighborhood of, a gene of unknown function; (iv) those with neither similarity to sequences in annotated databases nor significant genomic neighborhood (Fig. 1).
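The homology part of this hierarchical scheme can be sketched as follows. This is a minimal illustration that follows the database order and bit-score cutoffs given in Materials and Methods; neighborhood evidence is left out. The class and field names are hypothetical. Two points are assumptions rather than statements from the text: ORFs in COG class "S" are treated as members of a family of unknown function, and the 40-bit cutoff is reused for similarity to uncharacterized UniRef90 clusters.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    SPECIFIC = "specific function"
    NONSPECIFIC = "nonspecific function"
    UNKNOWN_FAMILY = "no function assigned, member of a family"
    SINGLETON = "no function assigned, singleton"


@dataclass
class OrfEvidence:
    """Homology evidence for one ORF (all field names are hypothetical)."""
    kegg_hit: bool = False                    # ORF maps to KEGG
    cog_class: Optional[str] = None           # COG functional class letter, if mapped
    uniref_characterized_bits: float = 0.0    # best hit to a characterized UniRef90 cluster
    uniref_uncharacterized_bits: float = 0.0  # best hit to an uncharacterized cluster
    has_domain: bool = False                  # SMART or Pfam A domain found


STRONG_BITS = 60.0  # cutoff for specific function (from the text)
WEAK_BITS = 40.0    # cutoff for remote similarity (from the text)


def classify(orf: OrfEvidence) -> Category:
    """Apply the databases in order, manually curated sources first."""
    if orf.kegg_hit:
        return Category.SPECIFIC
    if orf.cog_class is not None:
        if orf.cog_class == "R":
            return Category.NONSPECIFIC
        if orf.cog_class == "S":
            # Assumption: a COG of unknown function still counts as a family.
            return Category.UNKNOWN_FAMILY
        return Category.SPECIFIC
    if orf.uniref_characterized_bits > STRONG_BITS:
        return Category.SPECIFIC
    if orf.has_domain or orf.uniref_characterized_bits > WEAK_BITS:
        return Category.NONSPECIFIC
    # Assumption: the 40-bit cutoff also applies to uncharacterized clusters.
    if orf.uniref_uncharacterized_bits > WEAK_BITS:
        return Category.UNKNOWN_FAMILY
    return Category.SINGLETON
```

For example, `classify(OrfEvidence(uniref_characterized_bits=45.0))` falls through the KEGG, COG, and strong-similarity checks and is reported as nonspecific, which mirrors the "weak but significant similarity" clause of category (ii).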
We used sequence similarity to infer functional information from the KEGG (24), COG (12), UniRef90 (25), SMART (26), and Pfam (27) databases (see Materials and Methods for parameter choices, benchmarks, and definitions of functional annotation). We used gene neighborhood evidence from the STRING database (21) and adapted existing gene neighborhood function prediction methods, based on intergenic distance and evolutionary conservation, for use in fragmented shotgun metagenomics data. First, we exploited the fact that intergenic distances tend to be shorter between genes of the same operon than between operons (28). Although several operon prediction methods have been introduced that are based solely on intergenic distances (28–31), they are species-specific, trained with experimentally verified transcript information (28), and/or require the context of a complete genome. Here, we calibrated directly on each sample to establish the likelihood of being functionally associated, given a positional distance within a read. Second, we used the fact that neighboring ORFs are more likely to be functionally associated if they are conserved over long evolutionary distances (15, 16, 32). We recorded multiple occurrences of neighboring genes, measured the sequence similarity of the respective neighborhoods to each other, and derived a metric based on evolutionary distance. We then combined these measures for intergenic and evolutionary distance to predict functional relationships between genes in the metagenomic data (see Materials and Methods).

Consistent Functional Characterization of ORFs in Four Environmental Data Sets. By combining homology searches and neighborhood methods, we were able to infer specific functional information for 76% of the 1.4 million predicted environmental ORFs and a more general level of functional information for another 7% (dark and light green segments, respectively, of the outermost ring in Fig. 2).
In the original reports of the metagenomics data sets, specific functions were assigned to 27–48% of the predicted gene products (1, 2, 4), indicating marked differences in the function prediction protocols caused by various technical issues such as the stringency of BLAST cutoffs, the choice of functional databases, and variations in gene calling (a comparison is presented in SI Table 1; for an expanded comparison see ref. 9). Because our benchmarks and manual confirmations of parameter settings show a negligible false-positive rate (see Materials and Methods), we believe that the near doubling in functional assignments is not caused by a looser function definition or more spurious assignments but is due to better utilization of existing functional information. The latter uncovers marked trends such as overrepresentation at the gene, family, or pathway level, in line with earlier studies (4) (SI Table 5). For example, we find that bacterial chemotaxis, flagellar assembly, and type III secretion genes are 3-fold more frequent in the genomes than the metagenomes (dominated by the surface sea water data set), perhaps because of the futility of bacterial motility in strong ocean currents. On the other hand, genes involved in amino acid metabolism as well as in the biosynthesis of nucleotides, carbohydrates, and lipids are significantly underrepresented in the genomes as compared with the metagenomes, perhaps because of the bias toward sequencing obligate pathogens, which tend to acquire these compounds from their hosts.

Comparison of Environmental Samples. Among the four environments, the fraction of functional assignments differs considerably, as it does between organisms (Fig. 2).

Predicting Functional Novelty: In-Depth Analysis of Two Neighborhood-Based Findings.
Whereas homology-based methods require additional analysis to identify novel functions (e.g., via novel subgroups in a characterized sequence family), neighborhood methods can directly provide novel functional associations. Novelty can be obtained either by (i) seeing unexpected functional coupling of known genes or (ii) assigning unknown genes to known processes. The first is evident in the fact that there are as many as 5,851 pairs of neighboring COGs unique to metagenomes, even though these COGs occur individually in the 124 prokaryotic genomes, implying many novel functional interactions. These frequently include enzymes involved in amino acid biosynthesis with novel links to numerous protein degradation and regulatory proteins, probably reflecting the different nutritional constraints (SI Table 6). The second can be seen in the 75,448 ORFs (5% of the total) that are solely characterized by neighborhood. Here, we provide detailed functional annotation for two families: a previously uncharacterized gene family associated with a well known pathway (heme biosynthesis) and a transcription factor, unique to the Sargasso Sea data set, that potentially regulates the coupling of two opposing processes (fatty acid biosynthesis and degradation). These and other functional predictions, including annotations for nearly half a million previously uncharacterized proteins, are available online (www.bork.embl.de/Docu/harrington).

Neighborhood information can help characterize a gene family if members of that gene family occur next to different genes belonging to the same pathway in different species. By using such a query, we discovered members of a large uncharacterized gene family (COG1981) with several hundred ORFs in the surface sea water and whale fall samples, adjacent to various enzymes from the well studied heme biosynthesis pathway (Fig. 3).
Whereas the heme-associated gene family had previously been observed in fully sequenced genomes, another family of 20 members was found exclusively in the surface sea water samples by using our clustering procedure (see Materials and Methods). Even though no homology could be found by using our automated methods, detailed analysis revealed weak but significant similarity to a family of helix–turn–helix (HTH) transcription factors. An examination of its neighboring genes implies that this family is found in a variety of species, the most closely related being Actinobacteria. As the genes are on various contigs with differing gene orders, we could assign it to an entire operon that additionally contains three downstream genes consistently occurring in the same orientation. The first downstream gene of unknown function (NOG05011) has been observed in completely sequenced genomes; in-depth sequence and secondary structure analyses suggest an enzymatic function (data not shown). The second and third genes of this potential operon (COG1024 and COG1960) catalyze successive steps of the β-oxidation of fatty acids (usually involved in degradation) (38, 40). Interestingly, this invariant operon, apparently controlled by the newly predicted transcriptional regulator, frequently occurs downstream of various genes involved in fatty acid biosynthesis (Fig. 3).

Functional Prediction vs. Functional Diversity. As more environments are explored, we expect that core protein functions (for example, translational machinery) will be seen repeatedly and will dominate every sample. Novel, rare, and perhaps environment-specific functions, on the other hand, might not be classifiable because they are not yet captured by the experimental studies that underlie most current knowledge about biological function.
To reconcile our gene-centric view of the data with a function-based one, we performed an all-against-all similarity search of all predicted ORFs in all four environments, clustered the results into gene families, and recorded their functional status according to our operational definition (see Fig. 4).
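The first step of this family construction can be sketched as follows, assuming the all-against-all hits are available in the standard 12-column BLAST tabular format, where the bit score is the last column. The 60-bit cutoff is the one stated in Materials and Methods. The function names are hypothetical, and the Markov clustering step that the authors then apply is not reproduced, so the result is plain single-linkage groups rather than the published families.

```python
from collections import defaultdict


def parse_blast_tabular(lines, min_bits=60.0):
    """Yield (query, subject, bits) for non-self hits scoring above min_bits."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bits = fields[0], fields[1], float(fields[11])
        if query != subject and bits > min_bits:
            yield query, subject, bits


def single_linkage_families(edges):
    """Group proteins connected by any chain of hits (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _bits in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    families = defaultdict(set)
    for node in list(parent):
        families[find(node)].add(node)
    return sorted(sorted(members) for members in families.values())
```

ORFs without any hit above the cutoff never enter the graph in this sketch; they would have to be added separately as single-member families.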
Materials and Methods

Sequence Data and Similarity Searches. We analyzed published microbial shotgun sequence data from four environmental samples, totaling 1,438,944 genes: 1,086,400 genes from tropical surface water from the Sargasso Sea (2), 183,586 genes from farm soil from Minnesota (4), 122,146 genes from isolated whale fall carcasses (4), and 46,862 genes from an acidophilic biofilm from an iron ore mine (1). In parallel, we analyzed 344,619 genes from 124 prokaryotic genomes from the STRING database (21) (SI Table 7). Analyses were carried out at three different levels of stringency; the figures reported here use a bit score cutoff of 60 bits for orthology assignment (a prerequisite for predicting specific functions) and 40 bits for homology assignment (for details of parameter exploration, see SI Text). To map functionally characterized domains to metagenomic ORFs, we scanned the HMM profile signatures from Pfam (27) and SMART (26) against the metagenomic sequences by using HMMER (http://hmmer.wustl.edu/) software and applied the corresponding family-specific cutoffs.

Gene Family Analysis. We grouped genes from all four environmental data sets into 206,217 gene families by first constructing a single-linkage graph of an all-against-all BLAST (60-bit cutoff), with nodes representing proteins, and edges representing BLAST hits between proteins weighted by BLAST bit scores. This graph was then clustered by using Markov chain linkage clustering with an inflation value of 1.1 (44, 45) (SI Table 8).

Function Prediction Using Sequence Similarity. ORFs were assigned to KEGG pathways and COGs by using the method described by Tringe et al. (4) with a 60-bit cutoff. For the 124 prokaryotic genomes, the KEGG and COG assignments from the STRING database were used. ORFs were also compared against the UniRef90 database, divided into functionally characterized and uncharacterized clusters (see SI Text), and annotated with domains from the SMART and Pfam databases.
These annotations were combined in a hierarchical manner, favoring manually annotated databases, placing each ORF into one of the above categories. By definition, any ORF that mapped to KEGG was considered to have a specific function assigned. Of the remaining ORFs, those that mapped to a COG were considered to have a specific function assigned, with the exception of those in functional classes “R” and “S,” which were considered to have nonspecific and no function assigned, respectively. The remaining ORFs were considered to have specific functional annotation if they had strong similarity (>60 bits) to functionally characterized UniRef90 clusters, and nonspecific functional annotation if they contained a domain from the SMART or Pfam A database or had remote similarity (>40 bits) to functionally characterized UniRef90 clusters. All other ORFs were considered to have no function assigned: those with similarity to uncharacterized UniRef90 clusters were considered to be part of a family, and the rest singletons.

Function Prediction Using Genomic Neighborhood. Using the contig positions of the ORFs in each data set, we constructed a list of pair-wise neighborhoods. For this analysis, we considered only codirectionally transcribed genes (for the treatment of overlapping genes, see SI Text). To investigate the conservation of neighborhoods, we constructed a graph for each set of homologous neighborhoods. An edge was placed between two neighborhoods if there were BLAST hits >60 bits between both pairs of genes, except in cases where a gene from one neighborhood hit both genes in the other. This graph was then used to construct clusters of neighborhoods representing a conserved gene pair.
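The edge rule for homologous neighborhoods can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a neighborhood is represented as a tuple of two gene identifiers, either pairing of the genes across the two neighborhoods is accepted, and clusters are taken to be connected components of the resulting graph. The text does not specify the last two points, and all function names are hypothetical.

```python
from itertools import combinations


def make_hit_lookup(hits, min_bits=60.0):
    """Return hit(x, y) -> True if x and y share a BLAST hit above min_bits."""
    strong = {frozenset((a, b)) for a, b, bits in hits if a != b and bits > min_bits}
    return lambda x, y: frozenset((x, y)) in strong


def conserved(n1, n2, hit):
    """Edge rule: both gene pairs hit each other, and no gene hits both partners."""
    (a1, b1), (a2, b2) = n1, n2
    # Exclusion from the text: a gene from one neighborhood hits both genes of the other.
    for gene, (x, y) in ((a1, n2), (b1, n2), (a2, n1), (b2, n1)):
        if hit(gene, x) and hit(gene, y):
            return False
    # Assumption: either pairing of the two genes counts as conservation.
    return (hit(a1, a2) and hit(b1, b2)) or (hit(a1, b2) and hit(b1, a2))


def cluster_neighborhoods(neighborhoods, hit):
    """Connected components of the neighborhood graph (assumed clustering)."""
    adjacency = {n: set() for n in neighborhoods}
    for n1, n2 in combinations(neighborhoods, 2):
        if conserved(n1, n2, hit):
            adjacency[n1].add(n2)
            adjacency[n2].add(n1)
    seen, clusters = set(), []
    for start in neighborhoods:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return sorted(clusters)
```

The exclusion keeps tandem paralogs from being counted as a conserved pair: if one gene matches both members of another neighborhood, the match says little about which partner is the true counterpart.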
To estimate the evolutionary distance over which a neighborhood is conserved, we adapted a weighting scheme used for multiple sequence alignment (46) to derive a score with the property that it will be low for small clusters of closely related sequences and large for clusters with distantly related sequences.

For each metagenomic data set, we then constructed a benchmark set of pair-wise neighborhoods where both genes have a KEGG mapping. At each intergenic and evolutionary distance within the benchmark set, we determined the proportion of neighborhoods that map to the same KEGG pathway. This relationship was then interpolated and used to derive a value P for each neighborhood in the data set, corresponding to the probability that a pair of genes in a neighborhood is functionally related (SI Figs. 6 and 8–11 and SI Table 2). We also applied this method to individual organisms (SI Figs. 7 and 12 and SI Table 3) to assess the effect of species-specific genome architectures on the method. It is clear that the relationship between intergenic and evolutionary distance and P is highly species-specific. The vast majority of P values exceed the random expectation (16%, the probability that a random pair of genes map to the same KEGG pathway). To ensure that we were dealing with high quality predictions, we considered a pair of genes to be functionally linked only if the P value was >0.4 [found to have an accuracy approaching 70% at the level of functional modules (47)]. For the ORFs that map to COGs, additional neighborhood information was taken from the STRING database (see SI Text).
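The calibration of P can be sketched as follows, assuming a benchmark list of (intergenic distance, evolutionary score, same-pathway flag) triples. The paper interpolates the relationship; this sketch substitutes a coarser lookup over fixed bins, so it is a simplification and not the published procedure. The bin edges and function names are hypothetical, and only the 0.4 threshold comes from the text.

```python
from collections import defaultdict


def bin_index(value, upper_edges):
    """Index of the first bin whose upper edge is >= value; the last bin is open-ended."""
    for i, upper in enumerate(upper_edges):
        if value <= upper:
            return i
    return len(upper_edges)


def calibrate(benchmark, dist_edges, evo_edges):
    """Per-bin proportion of benchmark pairs that share a KEGG pathway."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [same-pathway pairs, all pairs]
    for dist, evo, same_pathway in benchmark:
        key = (bin_index(dist, dist_edges), bin_index(evo, evo_edges))
        counts[key][0] += int(same_pathway)
        counts[key][1] += 1
    return {key: same / total for key, (same, total) in counts.items()}


def functionally_linked(dist, evo, table, dist_edges, evo_edges, threshold=0.4):
    """Apply the P > 0.4 rule; pairs in bins without benchmark data are not linked."""
    p = table.get((bin_index(dist, dist_edges), bin_index(evo, evo_edges)))
    return p is not None and p > threshold
```

Because the text reports that the relationship is highly species-specific, a table of this kind would be calibrated separately for each data set instead of being shared across samples.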
Acknowledgments

We thank the P.B. group for helpful discussions. E.D.H. was supported by the European Community's FP6 Marie Curie Fellowship for Early Stage Training (E-STAR) under contract number MEST-CT-2004-504640. This work was supported by the European Union 6th Framework Program (Contract No. LSHG-CT-2004-503567).

Footnotes

The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/cgi/content/full/0702636104/DC1.

References

1. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF. Nature. 2004;428:37–43.
2. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, et al. Science. 2004;304:66–74.
3. Hallam SJ, Putnam N, Preston CM, Detter JC, Rokhsar D, Richardson PM, DeLong EF. Science. 2004;305:1457–1462.
4. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, et al. Science. 2005;308:554–557.
5. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard NU, Martinez A, Sullivan MB, Edwards R, Brito BR, et al. Science. 2006;311:496–503.
6. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE. Science. 2006;312:1355–1359.
7. Garcia Martin H, Ivanova N, Kunin V, Warnecke F, Barry KW, McHardy AC, Yeates C, He S, Salamov AA, Szeto E, et al. Nat Biotechnol. 2006;24:1263–1269.
8. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. Nature. 2006;444:1027–1031.
9. Raes J, Harrington ED, Singh AH, Bork P. Curr Opin Struct Biol. 2007;17:362–369.
10. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. J Mol Biol. 1990;215:403–410.
11. Bork P, Koonin EV. Nat Genet. 1998;18:313–318.
12. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al. BMC Bioinformatics. 2003;4:41.
13. Huynen MA, Snel B, von Mering C, Bork P. Curr Opin Cell Biol. 2003;15:191–198.
14. Brenner SE. Trends Genet. 1999;15:132–133.
15. Dandekar T, Snel B, Huynen M, Bork P. Trends Biochem Sci. 1998;23:324–328.
16. Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N. Proc Natl Acad Sci USA. 1999;96:2896–2901.
17. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Proc Natl Acad Sci USA. 1999;96:4285.
18. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D. Nature. 1999;402:83–86.
19. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Science. 1999;285:751.
20. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA. Nature. 1999;402:86–90.
21. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P. Nucleic Acids Res. 2005;33:D433–D437.
22. Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y. J Mol Biol. 1998;283:707–725.
23. Bork P, Serrano L. Cell. 2005;121:507–509.
24. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. Nucleic Acids Res. 2004;32:D277–D280.
25. Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al. Nucleic Acids Res. 2006;34:D187–D191.
26. Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P. Nucleic Acids Res. 2006;34:D257–D260.
27. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, et al. Nucleic Acids Res. 2004;32:D138–D141.
28. Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J. Proc Natl Acad Sci USA. 2000;97:6652–6657.
29. Price MN, Huang KH, Alm EJ, Arkin AP. Nucleic Acids Res. 2005;33:880–892.
30. Okuda S, Katayama T, Kawashima S, Goto S, Kanehisa M. Nucleic Acids Res. 2006;34:D358–D362.
31. Yan Y, Moult J. Proteins. 2006;64:615–628.
32. Korbel JO, Jensen LJ, von Mering C, Bork P. Nat Biotechnol. 2004;22:911–917.
33. Raes J, Korbel JO, Lercher MJ, von Mering C, Bork P. Genome Biol. 2007;8:R10.
34. Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, Baptista D, Bibbs L, Eads J, Richardson TH, Noordewier M, et al. Science. 2005;309:1242–1245.
35. von Mering C, Hugenholtz P, Raes J, Tringe SG, Doerks T, Jensen LJ, Ward N, Bork P. Science. 2007;315:1126–1130.
36. Torsvik V, Ovreas L. Curr Opin Microbiol. 2002;5:240–245.
37. Yayanos AA. Annu Rev Microbiol. 1995;49:777–805.
38. Michal G. Biochemical Pathways: An Atlas of Biochemistry and Molecular Biology. New York: Wiley; 1999.
39. Frankenberg N, Moser J, Jahn D. Appl Microbiol Biotechnol. 2003;63:115–127.
40. Yang XY, Schulz H, Elzinga M, Yang SY. Biochemistry. 1991;30:6788–6795.
41. Lakin-Thomas PL, Brody S. Annu Rev Microbiol. 2004;58:489–519.
42. Alexandre G, Greer-Phillips S, Zhulin IB. FEMS Microbiol Rev. 2004;28:113–126.
43. Bebout BM, Garcia-Pichel F. Appl Environ Microbiol. 1995;61:4215–4222.
44. Enright AJ, Van Dongen S, Ouzounis CA. Nucleic Acids Res. 2002;30:1575–1584.
45. van Dongen S. A Cluster Algorithm for Graphs. Amsterdam: National Research Institute for Mathematics and Computer Science in The Netherlands; 2000.
46. Gerstein M, Sonnhammer EL, Chothia C. J Mol Biol. 1994;236:1067–1078.
47. von Mering C, Zdobnov EM, Tsoka S, Ciccarelli FD, Pereira-Leal JB, Ouzounis CA, Bork P. Proc Natl Acad Sci USA. 2003;100:15428–15433.