Copyright © 2003 Oxford University Press

Gene3D: structural assignments for the biologist and bioinformaticist alike

1 Biomolecular Structure and Modelling Group, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
2 Department of Crystallography, Birkbeck College, Malet Street, Bloomsbury, London WC1E 7HX, UK
3 EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

To whom correspondence should be addressed. Tel: +44 2076797193; Email: c.orengo/at/ucl.ac.uk
Present address: Stuart C. G. Rison, Royal Veterinary College, Department of Pathology and Infectious Diseases, Royal College Street, London NW1 0TU
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Received August 15, 2002; Revised October 2, 2002; Accepted October 2, 2002.

Abstract

The Gene3D database (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/) provides structural assignments for genes within complete genomes. These are available via the internet from either the World Wide Web or FTP. Assignments are made using PSI-BLAST and subsequently processed using the DRange protocol. The DRange protocol is an empirically benchmarked method for assessing the validity of structural assignments made using sequence searching methods; appropriate assignment statistics are collected and made available. Gene3D links assignments to their appropriate entries in relevant structural and classification resources (PDBsum, CATH database and the Dictionary of Homologous Superfamilies). Release 2.0 of Gene3D includes 62 genomes, 2 eukaryotes, 10 archaea and 40 bacteria. Currently, structural assignments can be made for between 30 and 40 percent of any given genome.
In any genome, around half of the genes assigned a structural domain are assigned a single domain and the other half are assigned multiple structural domains. Gene3D is linked to the CATH database and is updated with each new update of CATH.

INTRODUCTION

Considerable progress has been made in the field of genome annotation in the past five years and it is now evident that some structural or functional annotation can be provided for most of the genes in any given organism (6–12). Currently, the state of the art allows up to 80% (7) of the genes in any given organism to be assigned functional or structural annotation. Most annotation methods rely almost solely on inheriting functional annotation via sequence comparison, but one must exercise a degree of caution when interpreting such results. This is particularly pertinent when considering the annotation of distant homologues [~30% sequence identity, (13)]. Structural annotation is often useful when assessing the functional annotations of these homologues. Use of structural data enables 3D models to be built to inform functional predictions (14,15).

Gene3D aims to provide the biologist with reliable precalculated relationships to protein structures and, as a result, the relevant links to the functional and structural data curated within the CATH domain structure classification database. These data can then be used as the starting point for homology modelling or evolutionary studies. A related resource, SUPERFAMILY (16), is linked to the SCOP structural database (17).

METHODS

The Gene3D database is derived from data produced by the DomainFinder algorithm (18) and the DRange protocol (2). This resource is created by scanning the sequences from the CATH structural domains against a large database derived from the non-redundant sequence database from GenBank that contains the sequences from the completed genomes.
The PSI-BLAST (1) iterative database search algorithm is used (19) to scan CATH database sequences against the GenBank sequences. Preprocessing is carried out by DomainFinder, and the DRange protocol selects and validates the putative structural annotations suggested by DomainFinder. Gene3D and the associated DRange protocol are described below.

DomainFinder and DRange

The Gene3D population process is illustrated in Figure 1.
In the subsequent step, the DomainFinder algorithm is used to convert the ‘raw’ hits into ‘Ranges’ (18). These ‘Ranges’ act as descriptors which indicate which regions of a gene putatively belong to which CATH Homologous Superfamilies (Fig. 1C).

RESULTS

Gene3D is the repository for structural assignments verified using the DRange protocol and is available on-line at http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/. This protocol is applied to all complete genomes released. In May 2002, Gene3D included whole genome structural assignments for 66 genomes. The data are also available via the CATH FTP site at ftp://ftp.biochem.ucl.ac.uk/pub/Gene3D/.

Typical assignment statistics for four genomes in the database, one from each of the major branches of life (one multicellular eukaryote, one unicellular eukaryote, an archaeon and a bacterium), are presented in Table 1. Between ~22% and ~55% of the genes in a given organism in the database receive annotation with at least one structural domain. Of these genes, usually around half are annotated with a single domain and the other half are assigned multiple domains (see Fig. 2).
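The coverage figures quoted above (the fraction of genes with at least one assigned domain, and the split between single-domain and multi-domain genes) can be illustrated with a minimal sketch. It assumes that each gene's ‘Ranges’ are available as a list of (start, end, superfamily) tuples; the gene names, coordinates and the `coverage_summary` helper are hypothetical and are not part of Gene3D or the DRange protocol.

```python
from collections import Counter

# Hypothetical toy data: gene ID -> list of (start, end, CATH superfamily) ranges.
# The superfamily codes are illustrative only. Genes with an empty list
# received no structural assignment.
assignments = {
    "geneA": [(5, 120, "3.40.50.300")],
    "geneB": [(1, 90, "1.10.10.10"), (110, 260, "3.40.50.300")],
    "geneC": [],
    "geneD": [(20, 180, "2.60.40.10")],
    "geneE": [],
}


def coverage_summary(assignments):
    """Return the annotated fraction of genes and the single/multi-domain split."""
    counts = Counter()
    for ranges in assignments.values():
        if not ranges:
            counts["unassigned"] += 1
        elif len(ranges) == 1:
            counts["single"] += 1
        else:
            counts["multi"] += 1
    annotated = counts["single"] + counts["multi"]
    total = len(assignments)
    return {
        # Fraction of all genes with at least one structural domain.
        "annotated_fraction": annotated / total if total else 0.0,
        # Fractions below are relative to annotated genes only.
        "single_fraction": counts["single"] / annotated if annotated else 0.0,
        "multi_fraction": counts["multi"] / annotated if annotated else 0.0,
    }


summary = coverage_summary(assignments)
print(summary)
```

In this toy genome three of five genes are annotated, two of them with a single domain; on real data the same two ratios correspond to the per-genome values reported in Table 1 and Fig. 2.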
Cursory inspection of the assignment data shows that bacterial and archaeal genomes pick up approximately the same ratios of the various types of CATH domains and that no single genome appears to be strongly biased in the type of CATH domains it utilises (see Fig. 3).
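One way to compare domain-type ratios between genomes is sketched below. It assumes that "type" refers to the CATH class, which is the first field of a CATH code; the text does not state which level of the hierarchy was compared, so this is an interpretation. The genome names, the domain lists and the `class_ratios` helper are hypothetical.

```python
from collections import Counter

# CATH class names keyed by the first field of a CATH code.
CATH_CLASSES = {
    "1": "mainly alpha",
    "2": "mainly beta",
    "3": "alpha beta",
    "4": "few secondary structures",
}


def class_ratios(superfamilies):
    """Return the fraction of assigned domains that fall in each CATH class."""
    counts = Counter(code.split(".")[0] for code in superfamilies)
    total = sum(counts.values())
    return {
        name: (counts.get(key, 0) / total if total else 0.0)
        for key, name in CATH_CLASSES.items()
    }


# Hypothetical per-genome domain assignments (illustrative superfamily codes).
genomes = {
    "bacterium_X": ["3.40.50.300", "3.40.50.300", "1.10.10.10", "2.60.40.10"],
    "archaeon_Y": ["3.40.50.300", "1.10.10.10", "3.30.70.270", "2.40.50.140"],
}

ratios = {name: class_ratios(codes) for name, codes in genomes.items()}
for name, per_class in ratios.items():
    print(name, per_class)
```

Similar ratio profiles across genomes, as in this toy input, are the pattern the paragraph above describes for bacterial and archaeal genomes.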
In the database, the eukaryotic genomes pick up the least annotation, which may be due to a prokaryotic bias in the structures that are deposited within the PDB (20).

The Gene3D Web Server

The Gene3D web server is made up of a number of interlinked web pages which allow the retrieval of data on specific genes within the represented genomes. Each genome features an entry page (Fig. 4A).
DISCUSSION

The data within Gene3D are there to provide biologists and bioinformaticists with an initial stepping stone from which structural, functional and evolutionary studies can begin. In future, we hope to integrate Pfam domain assignments (12) to maximise the annotated coverage of genomes and we also hope to provide alignments of the CATH or Pfam domains to the genes that they matched. It is our hope that by integrating Pfam domain assignments, we can provide the assignments for most, if not all, of the genes in the complete genomes.

That we can annotate so much of the complete genome sequences from the structure databases alone suggests that we may not need to solve structures for every sequence but rather for every sequence family containing relatives of high sequence identity (for example ~40%). In such families, homology modelling could then be used to predict the structures of all the relatives from one representative structure. This bodes well for the success of the structural genomics projects.

REFERENCES

1. Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
2. Buchan,D., Shepherd,A., Lee,D., Pearl,F., Rison,S., Thornton,J. and Orengo,C. (2002) Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database. Genome Res., 12, 503–514.
3. Laskowski,R. (2001) PDBsum: summaries and analyses of PDB structures. Nucleic Acids Res., 29, 221–222.
4. Pearl,F., Martin,N., Bray,J., Buchan,D., Harrison,A., Lee,D., Reeves,G., Shepherd,A., Sillitoe,I., Todd,A., Thornton,J. and Orengo,C. (2001) A rapid classification protocol for the CATH domain database to support structural genomics. Nucleic Acids Res., 29, 223–227.
5. Bray,J., Todd,A., Pearl,F., Thornton,J. and Orengo,C. (2000) The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues. Protein Eng., 13, 153–165.
6. Gerstein,M. (1997) A structural census of genomes: comparing bacterial, eukaryotic and archaeal genomes in terms of protein structure. J. Mol. Biol., 274, 562–576.
7. Teichmann,S., Chothia,C. and Gerstein,M. (1999) Advances in structural genomics. Curr. Opin. Struct. Biol., 9, 390–399.
8. Muller,A., MacCallum,R. and Sternberg,M. (1999) Benchmarking PSI-BLAST in genome annotation. J. Mol. Biol., 293, 1257–1271.
9. Iliopoulos,I., Tsoka,S., Andrade,M., Janssen,P., Audit,B., Tramontano,A., Valencia,A., Leroy,C., Sander,C. and Ouzounis,C. (2001) Genome sequences and great expectations. Genome Biol., 2, Interactions0001.
10. Apweiler,R., Biswas,M., Fleischmann,W., Kanapin,A., Karavidopoulou,Y., Kersey,P., Kriventseva,E., Mittard,V., Mulder,N., Phan,I. and Zdobnov,E. (2001) Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucleic Acids Res., 29, 44–48.
11. Kanehisa,M., Goto,S., Kawashima,S. and Nakaya,A. (2002) The KEGG databases at GenomeNet. Nucleic Acids Res., 30, 42–46.
12. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S., Griffiths-Jones,S., Howe,K., Marshall,M. and Sonnhammer,E. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
13. Todd,A., Orengo,C. and Thornton,J. (2001) Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol., 307, 1113–1143.
14. Laskowski,R., Luscombe,N., Swindells,M. and Thornton,J. (1996) Protein clefts in molecular recognition and function. Protein Sci., 5, 2438–2452.
15. Luscombe,N., Laskowski,R. and Thornton,J. (1997) NUCPLOT: a program to generate schematic diagrams of protein–nucleic acid interactions. Nucleic Acids Res., 25, 4940–4945.
16. Gough,J. and Chothia,C. (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res., 30, 268–272.
17. Lo Conte,L., Brenner,S., Hubbard,T., Chothia,C. and Murzin,A. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–272.
18. Pearl,F., Lee,D., Bray,J., Buchan,D., Shepherd,A. and Orengo,C. (2002) The CATH extended protein-family database: providing structural annotations for genome sequences. Protein Sci., 11, 233–244.
19. Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
20. Westbrook,J., Feng,Z., Jain,S., Bhat,T., Thanki,N., Ravichandran,V., Gilliland,G., Bluhm,W., Weissig,H., Greer,D., Bourne,P. and Berman,H. (2002) The Protein Data Bank: unifying the archive. Nucleic Acids Res., 30, 245–248.
21. Michie,A., Orengo,C. and Thornton,J. (1996) Analysis of domain structural class using an automated class assignment protocol. J. Mol. Biol., 262, 168–185.