Nucleic Acids Res. Jul 1, 2003; 31(13): 3799–3803.
PMCID: PMC168962

OntoBlast function: from sequence similarities directly to potential functional annotations by ontology terms

Abstract

OntoBlast allows one to find information about potential functions of proteins by presenting a weighted list of ontology entries associated with similar sequences from completely sequenced genomes identified in a BLAST search. It combines, in a single analysis step, the search for sequence similarities in several species with the association of information stored in ontologies. From each identified ontology term a list of genes, which share the functional annotation, can be retrieved. The OntoBlast function is an integral part of the ‘Ontologies TO GenomeMatrix’ tool which provides an alternative entry point from ontology terms to the Genome–Matrix database. OntoBlast's web interface is accessible on the ‘Ontologies TO GenomeMatrix Gate’ page at http://functionalgenomics.de/ontogate/.

INTRODUCTION

The integration of sequence data with information from functional analyses of genes is an important and challenging task. Functional annotations of sequences allow first insights into the processes in which a gene product might be involved. A possible way to provide such an annotation is the association of a gene or protein sequence with predefined terms describing known characterised functions. A widely used structured vocabulary of this type is the Gene Ontology (GO) resource (1), which consists of three ontologies describing molecular functions, biological processes and cellular components.

A growing number of associations of genes, gene products and database identifiers to GO terms are readily available via the internet, either from the GO website (http://www.geneontology.org/#indices, http://www.geneontology.org/#annotations), the GO Annotation@EBI (2) (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/) or other external species-specific databases. There are also several tools for browsing and searching the GO ontologies and for displaying associated entries in external databases, using either one of the many GO browsers (http://www.geneontology.org/#tools) or the SRS service (http://srs.ebi.ac.uk/). A few tools allow the use of GO terms to locate associated human and/or mouse genes (CGAP GO Browser: http://cgap.nci.nih.gov/Genes/GOBrowser/ or the EP GO Browser: http://ep.ebi.ac.uk/EP/GO/), to identify relations between GO terms and diseases [Genes2Diseases (3): http://www.bork.embl-heidelberg.de/g2d/] or to use protein database accession numbers to retrieve the corresponding GO terms (ProToGO: http://www.protogo.cs.huji.ac.il/).

Recently, BLAST servers have been made available that combine their search results directly with annotations from GO. One such service is the GOst software tool, which can be accessed from the AmiGO browser (http://www.godatabase.org/cgi-bin/go.cgi). Another BLAST server that automatically retrieves associated GO terms is GOblet (http://goblet.molgen.mpg.de/), developed as a project within the NGFN (Nationales Genomforschungsnetz) in Germany.

The primary purpose of OntoBlast (OB) is not just to provide a value-added BLAST server, but to generate a list of ontology terms associated with the query sequence. These terms serve as entry points linking to the Genome–Matrix (GM, http://genome-matrix.org), a multi-species gene-region database which was introduced at the HGM2002 meeting in Shanghai (4) (http://hgm2002.hgu.mrc.ac.uk/Abstracts/Publish/WorkshopPosters/WorkshopPoster01/hgm0023.htm) and will be described elsewhere (A. Hewelt et al., in preparation). Using these links as part of the ‘Ontologies TO GenomeMatrix’ tool, it is possible to identify, in a second step, genes which are related to the original query sequence not by structural similarity, but by shared functional annotations. The information accessible from the GM database can then assist in the analysis and evaluation of the proposed associations between sequence and functions.

Sequence similarity is a frequently used feature to generate annotations and has also been used, together with protein domain analysis, in systematic protein annotation projects working with the Gene Ontology (5,6).

MATERIALS AND METHODS

Source data

Genome data from nine species have been prepared for this tool and correspond to the datasets used for the GM project as shown in Table 1.

Amino acid sequences have been extracted from SWISS-PROT (14) (http://www.ebi.ac.uk/swissprot/), TrEMBL (http://www.ebi.ac.uk/trembl/) and species-specific databases. At present the three ontologies from Gene Ontology are implemented (data have been downloaded from the GO ftp site ftp://ftp.geneontology.org/), but it is planned to add other vocabularies such as the EC enzyme classification. GO terms listed after a BLAST search are linked to GM entries via categorised tables and sorted gene lists.

This information, linking each ontology term (and its parents) to all associated genes (including homologs), as well as to the corresponding amino acid sequences, is pre-calculated on a regular basis to ensure an up-to-date and fast display. About 58 000 such data files are produced for each of the cumulative and non-cumulative versions.
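The pre-calculation of term-to-gene tables can be illustrated with a minimal sketch. It assumes that the cumulative version propagates each gene's annotation to all parent terms, while the non-cumulative version keeps only direct annotations; this reading, the toy term names (`root`, `mid`, `leaf`) and the function names are assumptions for illustration and do not come from the OntoBlast implementation.

```python
from collections import defaultdict

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "leaf": ["mid"],
    "mid": ["root"],
    "root": [],
}

# Hypothetical direct annotations: gene -> ontology terms.
DIRECT = {
    "geneA": ["leaf"],
    "geneB": ["mid"],
}


def ancestors(term, parents):
    """Return all ancestors of a term (the term itself is excluded)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def build_term_index(direct, parents, cumulative):
    """Map each term to the sorted list of genes associated with it.

    With cumulative=True a gene is also listed under every ancestor
    of the terms it is directly annotated with.
    """
    index = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            index[term].add(gene)
            if cumulative:
                for parent in ancestors(term, parents):
                    index[parent].add(gene)
    return {term: sorted(genes) for term, genes in index.items()}


non_cumulative = build_term_index(DIRECT, PARENTS, cumulative=False)
cumulative = build_term_index(DIRECT, PARENTS, cumulative=True)
```

In this toy case the non-cumulative index lists only `geneB` under `mid`, whereas the cumulative index lists both genes under `mid` and `root`.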

Web interface

The main part of the user interface consists of two frames. The ‘search frame’ first displays the form in which the query sequence and the parameters for the BLAST search are entered, and later the actual BLAST result. The ‘list frame’ shows the weighted list of ontology terms. Selecting such a term displays the table with the GM entry links in the search frame.

The query sequence can be entered as a plain amino acid sequence or in FASTA format. Either all databases or only those from selected species can be searched. The ontology terms in the list frame are grouped by the ontology they belong to. Each entry consists of a number of dots indicating the term's position (depth) in the hierarchy of the ontology, followed by the term name (which links to the corresponding pre-calculated table; this table serves as the entry point to the GM and also shows the complete upwards branch of the ontology tree). Underneath the term, a list of the gene names that are part of the BLAST result and are associated with that term is shown, together with the weighting number for the term. Each name links to the corresponding entry in the BLAST result, making it possible to look up its E-value and with that the degree of similarity to the query sequence. The weighting numbers are used to sort the term list and provide a very simplistic way to judge the likely quality of the potential association with the query sequence. These numbers are calculated by multiplying the E-values of all sequences in the BLAST result associated with the term in question. They give an indication of how strong the evidence for a term is relative to other terms: the lower the number, the stronger the sequence similarities and the more trustworthy the association. Other important factors are the absolute number of different genes associated with the term and the number of different species from which they originate.
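The weighting scheme described above can be sketched in a few lines. The sketch multiplies the E-values of all hits associated with a term and sorts terms by ascending weight; the hit tuples, the gene-to-term mapping and all names are hypothetical illustration data, not output of the actual tool.

```python
import math
from collections import defaultdict


def weight_terms(hits, term_map):
    """Compute a weighting number per ontology term.

    hits:     list of (gene, species, evalue) tuples from a BLAST result
    term_map: dict mapping gene -> list of associated ontology terms
    Returns a list of dicts sorted by ascending weight (strongest first).
    """
    by_term = defaultdict(list)
    for gene, species, evalue in hits:
        for term in term_map.get(gene, []):
            by_term[term].append((gene, species, evalue))

    rows = []
    for term, members in by_term.items():
        rows.append({
            "term": term,
            # Product of the E-values of all hits carrying this term.
            "weight": math.prod(evalue for _, _, evalue in members),
            "n_genes": len({gene for gene, _, _ in members}),
            "n_species": len({species for _, species, _ in members}),
        })
    rows.sort(key=lambda row: row["weight"])
    return rows


# Hypothetical example data.
example_hits = [
    ("g1", "species_1", 1e-10),
    ("g2", "species_2", 1e-5),
    ("g3", "species_1", 1e-3),
]
example_terms = {
    "g1": ["zinc binding"],
    "g2": ["zinc binding", "vision"],
    "g3": ["vision"],
}
ranked = weight_terms(example_hits, example_terms)
```

Here ‘zinc binding’ ranks first because the product of its two E-values is lower than that of ‘vision’. Note that a plain floating-point product can underflow to zero when many very small E-values are multiplied; summing logarithms would be one way around this, but that goes beyond what the text describes.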

The standard BLAST output in the search frame shows the gene names as direct links to summary information pages in their respective source databases (which provides an easy way to see alternative names and known biological information for each gene). Each gene name is followed by the species name in brackets (except for mouse and human genes, which can be identified by their ENSEMBL ids). To speed up the display, sequence alignments are only shown if selected in the search form. If alignments are included, the score (bits) numbers for those results are links that can be used to jump to the corresponding alignment display.

RESULTS

Example searches

A search only requires pasting an amino acid sequence into the query field. The default settings (search all databases, E-value threshold 0.001, cumulative mode off) should usually be kept for the first search. If the cumulative mode is on, the whole trees with all parent terms of the matching GO terms are included, but this can easily clutter the listing. If no matching sequences are found, the E-value threshold can be increased, although this also increases the likelihood of false positive results. After clicking the BLAST button, the BLAST search is performed and the two result frames are displayed after a short time, depending on the length of the query sequence and the free capacity on the server (BLAST version 2.2.4 with default query sequence filtering).
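The effect of the search settings can be illustrated with a small sketch that filters a list of hits by E-value threshold and, optionally, by species. The default values mirror those stated above; the `SearchSettings` class, the `filter_hits` function and the example hits are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSettings:
    """Hypothetical container for the search form defaults."""
    evalue_threshold: float = 0.001
    cumulative: bool = False
    species: tuple = ()  # empty tuple means: search all databases


def filter_hits(hits, settings):
    """Keep hits at or below the E-value threshold and in the selected species.

    hits: list of (gene, species, evalue) tuples.
    """
    kept = []
    for gene, species, evalue in hits:
        if evalue > settings.evalue_threshold:
            continue
        if settings.species and species not in settings.species:
            continue
        kept.append((gene, species, evalue))
    return kept


# Hypothetical example hits.
example_hits = [
    ("g1", "S.pombe", 1e-20),
    ("g2", "C.elegans", 5e-3),
    ("g3", "D.melanogaster", 0.5),
]

default_result = filter_hits(example_hits, SearchSettings())
relaxed_result = filter_hits(example_hits, SearchSettings(evalue_threshold=0.01))
```

With the default threshold only `g1` is kept; raising the threshold to 0.01 also admits `g2`, which illustrates how a more permissive threshold returns more hits of weaker similarity.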

In order to check the reliability of the function, a large number of amino acid sequences have been used for searching and the results have been compared with known data. Sequences were chosen for which the underlying gene has no recorded direct association with GO terms, but has links to either InterPro (15) entries or other information suggesting a certain functionality of the gene product. The GO terms found by this tool can therefore not originate directly from the query sequence, but only from other sequences that show similarities to it.

Figure 1 shows the content of the two result frames after searching with the Schizosaccharomyces pombe sequence SPBC337.11. The GeneDB entry for this gene does not provide any GO ids, but suggests as related functions ‘Zinc-containing alcohol dehydrogenase superfamily’ (InterPro, IPR002085) and ‘Zinc-binding dehydrogenase’ (Pfam). The BLAST search result shows 26 sequences (E-value set to the default 0.001) originating from seven different species. The weighted list of GO terms clearly agrees with the functions suggested by GeneDB. Eleven matching sequences from four species are associated with the molecular function ontology term ‘zinc binding’ and nine sequences from three species with the term ‘alcohol dehydrogenase, zinc dependent’ (these are the same two GO terms that are provided by the InterPro entry IPR002085).

Figure 1.
The content of the two result frames after searching with the S.pombe sequence SPBC337.11.

The association to ‘NADPH:quinone reductase’ is also supported by the InterPro description, which mentions that this family includes NADP-dependent quinone oxidoreductase. It also states that the enzyme has been recruited as an eye lens protein in some species. This correlates with the associations to ‘structural constituent of eye lens’ (molecular function ontology), as well as ‘vision’ and ‘sensory organ development’ (biological process ontology).

Tables 2–5 show summary results of further searches with various sequences. Many more searches have been performed which are not shown but gave similar positive results. Relevant information found in species-specific databases or InterPro is indicated to allow a comparison with the search results obtained with OB. In the column labelled ‘BLAST result’ the number of similar sequences is shown (using the indicated E-value) together with the number of different species from which they originated. The column labelled ‘GO associations’ lists all GO terms associated with the sequences shown in the previous column (the numbers in brackets show the number of sequences associated with the GO term, followed by the number of different species they belong to, followed by the weighting number of the term).
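As a small illustration of the bracketed triple described for the ‘GO associations’ column, the following sketch formats one entry from its three numbers. The slash separator, the number formatting and the weighting value used in the example are assumptions for illustration; the tables themselves may use a different layout.

```python
def format_go_association(term, n_sequences, n_species, weight):
    """Format one 'GO associations' entry.

    The bracket holds the number of sequences, the number of species
    and the weighting number, in that order. The '/' separator and the
    scientific notation are assumptions of this sketch.
    """
    return f"{term} ({n_sequences}/{n_species}/{weight:.1e})"


# The counts 11 and 4 follow the 'zinc binding' example in the text;
# the weighting number 1e-50 is a made-up placeholder.
entry = format_go_association("zinc binding", 11, 4, 1e-50)
```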

Table 2.
Results of the search with S.pombe gene SPBC146.09c (SWISS-PROT/TrEMBL ID: Q9Y802; E-value: 0.001)
Table 5.
Results of the search with C.elegans gene B0303.11 (SWISS-PROT/TrEMBL ID: P34261; E-value: 0.001)

CONCLUSION

OntoBlast provides a quick and simple way to test whether a potential function can be predicted for an unknown sequence, whether similar sequences associated with an ontology entry can be found in any of the searchable species databases, and which other genes share those ontology terms. While results can certainly contain a number of false positive ontology terms that have been selected through insignificant sequence similarities, the examples show that if a positive functional correlation between similar sequences exists, it can be highlighted by the OB tool. A large number of sequences with known function from all included species have been tested in order to check the reliability and specificity of the returned results. Many functions suggested by the resulting ontology terms showed a clear and correct correlation with the known protein information. In general, sequence similarities with an E-value <1.0e−4 gave reasonable assignments to ontology terms. Replacing the very simple weighting number, which is now used to sort the ontology term list, with a more sophisticated mechanism that includes the statistical likelihood of associating certain GO terms with genes from certain species would probably make it possible to distinguish even more clearly between significant and random assignments. The simultaneous comparison with all genes from nine species has the advantage that assignments are often supported by hits from two, three or more species.

Further analysis is greatly assisted by direct links from the resulting ontology terms to all associated genes in the Genome–Matrix database (including their surrounding gene regions linked to orthologous genes from other species), providing unique and direct access to a large collection of relevant structural and functional information from many dispersed data sources.

Table 3.
Results of the search with D.melanogaster gene FBgn0004395 (SWISS-PROT/TrEMBL ID: Q960U9; E-value: 0.001)
Table 4.
Results of the search with D.melanogaster gene FBgn0010292 (SWISS-PROT/TrEMBL ID: P51406; E-value: 0.001)

ACKNOWLEDGEMENTS

The Genome–Matrix is a project of the Vertebrate Genomics department (Head: Hans Lehrach) of the Max-Planck-Institute for Molecular Genetics, Berlin (MPI) in cooperation with the Resource Centre of the German Human Genome Project gGmbH, Berlin (RZPD). Human and mouse specific matrices have been compiled by Steffen Hennig (MPI), ortholog data calculated by Alia Ben Kahla (MPI), and GUI and database programming carried out by Andreas Hewelt, Qing Dong and Ram Narang (RZPD).

REFERENCES

1. The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nature Genet., 25, 25–29.
2. Camon,E., Magrane,M., Barrell,D., Binns,D., Fleischmann,W., Kersey,P., Mulder,N., Dinn,T., Maslen,J., Cox,A. and Apweiler,R. (2003) The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL and InterPro. Genome Res., 13, 662–672.
3. Perez-Iratxeta,C., Bork,P. and Andrade,M.A. (2002) Association of genes to genetically inherited diseases using data mining. Nature Genet., 31, 316–319.
4. Hewelt,A., Ben Kahla,A., Hennig,S., Nagel,A., Himmelbauer,H., Zehetner,G., Haas,S., Vingron,M., Yaspo,M.L. and Lehrach,H. (2002) The GenomeMatrix Information Retrieval System, Poster Abstracts of HGM2002 (Human Genome Meeting, April 14–17, 2002, Shanghai, China). Genome Informatics and Annotation, Abstract 23.
5. Xie,H., Wasserman,A., Levine,Z., Novik,A., Grebinskiy,V., Shoshan,A. and Mintz,L. (2002) Large-scale protein annotation through gene ontology. Genome Res., 12, 785–794.
6. Schug,J., Diskin,S., Mazzarelli,J., Brunk,B.P. and Stoeckert,C.J.,Jr (2002) Predicting gene ontology functions from ProDom and CDD protein domains. Genome Res., 12, 648–655.
7. Hubbard,T., Barker,D., Birney,E., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V., Down,T. et al. (2002) The Ensembl genome database project. Nucleic Acids Res., 30, 38–41.
8. Clamp,M., Andrews,D., Barker,D., Bevan,P., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V. et al. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res., 31, 38–42.
9. Harris,T.W., Lee,R., Schwarz,E., Bradnam,K., Lawson,D., Chen,W., Blasier,D., Kenny,E., Cunningham,F., Kishore,R. et al. (2003) WormBase: a cross-species database for comparative genomics. Nucleic Acids Res., 31, 133–137.
10. The FlyBase Consortium (2003) The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res., 31, 172–175.
11. Steen,R.G., Kwitek-Black,A.E., Glenn,C., Gullings-Handley,J., Van Etten,W., Atkinson,O.S., Appel,D., Twigger,S., Muir,M., Mull,T. et al. (1999) A high-density integrated genetic linkage and radiation hybrid map of the laboratory rat. Genome Res., 9, AP1–AP8.
12. Weng,S., Dong,Q., Balakrishnan,R., Christie,K., Costanzo,M., Dolinski,K., Dwight,S.S., Engel,S., Fisk,D.G., Hong,E. et al. (2003) Saccharomyces Genome Database (SGD) provides biochemical and structural information for budding yeast proteins. Nucleic Acids Res., 31, 216–218.
13. Bahl,A., Brunk,B., Crabtree,J., Fraunholz,M.J., Gajria,B., Grant,G.R., Ginsburg,H., Gupta,D., Kissinger,J.C., Labo,P. et al. (2003) PlasmoDB: the Plasmodium genome resource. A database integrating experimental and computational data. Nucleic Acids Res., 31, 212–215.
14. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I. et al. (2003) The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
15. Mulder,N.J., Apweiler,R., Attwood,T.K., Bairoch,A., Barrell,D., Bateman,A., Binns,D., Biswas,M., Bradley,P., Bork,P. et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press