Nucleic Acids Res. Jul 1, 2009; 37(Web Server issue): W296–W299.
Published online Apr 30, 2009. doi:  10.1093/nar/gkp268
PMCID: PMC2703916

The Microbe browser for comparative genomics

Abstract

The Microbe browser is a web server providing comparative microbial genomics data. It offers comprehensive, integrated data from GenBank, RefSeq, UniProt, InterPro, Gene Ontology and the Orthologs Matrix Project (OMA) database, displayed along with gene predictions from five software packages. The Microbe browser is updated daily from the source databases and includes all completely sequenced bacterial and archaeal genomes. The data are displayed in an easy-to-use, interactive website based on Ensembl software. The Microbe browser is available at http://microbe.vital-it.ch/. Programmatic access is available through the OMA application programming interface (API) at http://microbe.vital-it.ch/api.

INTRODUCTION

About a thousand complete microbial genomes have been sequenced to date [961 genomes in the Genomes On Line Database (GOLD) on 1 April 2009 (1)], and many different methods have been used to predict genes, yielding large differences in gene annotation even across closely related species. No single computational method yet achieves perfect gene predictions. Furthermore, very few entries have been kept up-to-date in the primary databases such as GenBank (2). We therefore felt that it was important to provide a unified interface to the various gene prediction packages to allow biologists to evaluate them in their genomic and evolutionary contexts.

This leads to another important computational challenge, namely the identification of orthologs. Many studies, such as the prediction of gene function, phylogenetic reconstruction and genomic context analyses, depend on accurate predictions of orthology. Among genes that share a common ancestor, only genes that are separated by a speciation event are actual orthologs (3). To address the need for reliable ortholog sources, several initiatives have been created for better ortholog prediction [see (4) for a comparison]. Among these resources, the Orthologs Matrix Project (OMA) stands out for its efficient and robust computational method, which allows continuous updating with novel genomes (5), and for its ability to exclude non-orthologs, which confers high reliability in the prediction of true orthologous relationships (4).

Interactive genome browsers have proved invaluable to the community for visualizing genes and experimental data in their genomic context, and as hubs connecting many biomedical databases (6,7). Genome browsers also provide comparative genomics information by displaying homologous regions in a single view. However, most browsers concentrate on eukaryotic genomes, so that biologists working on microbial genomes are restricted to standalone programs such as the Artemis Comparison Tool (8) or web sites such as the Joint Genome Institute's Integrated Microbial Genomes tools (http://img.jgi.doe.gov/) or GeneDB (http://www.genedb.org) that are more complex to use, can only handle a few genomes at a time and do not integrate as much information via a single interface.

Derived genomic databases that connect and expand reference databases are important in particular for automated analyses such as dataset comparisons. The EBI Genome Reviews database (9) provides complete genome sequence and annotation data, continuously updated and extended with automated and manual annotation in UniProtKB (10). The NCBI RefSeq resource (11) provides a coherent set of sequences, genes and transcripts, some of which have been manually annotated. Frustratingly, the EBI and NCBI resources use distinct sets of identifiers (UniProtKB accession number and protein_id for EBI; RefSeq accession number, GeneID and GI number for NCBI), which makes it hard to navigate between databases using different references. Furthermore, UniProtKB curators not only extend and standardize annotation, but also modify gene sequences by changing translational start site predictions, correcting frameshifts or adding genes missing from the original submission. This information is propagated to Genome Reviews but not to the source DDBJ/EMBL/GenBank entries, which can only be modified by the original submitter. This introduces an additional divergence between databases, as it becomes non-trivial to identify the ‘same’ gene in two different databases where the gene might have neither the same identifier scheme nor the same coordinates.

The Integr8 database (9) aggregates curated information on completely sequenced genomes, including taxonomy down to the precise strain level, and cross-references to all chromosomes and plasmids comprising the complete genome.

We introduce the Microbe browser, a web server that uses the Integr8 database to organize and correlate genomic sequences and annotation from the GenBank, Genome Reviews and RefSeq databases. We use the powerful Ensembl web code (7) to present the resulting data in a fully interactive, user-friendly and platform-independent manner.

METHODS

Source data are retrieved daily from primary public servers. Integr8 and Genome Reviews are the source of genome data, including curated gene sets and annotation and cross-references to UniProtKB, InterPro, Gene Ontology and the Protein Data Bank. GenBank and RefSeq are the source of NCBI cross-references (RefSeq accession, GeneID and GI number). The OMA database provides orthology predictions for pairs of genes. Pre-computed gene predictions from the Glimmer (12), GeneMark, GeneMarkHMM (13) and Prodigal (http://compbio.ornl.gov/prodigal) packages are provided by the NCBI, and predictions by the EasyGene method (14) are downloaded from the EasyGene web site (http://servers.binf.ku.dk/cgi-bin/easygene/search).

The Genome Reviews data are used as a reference, because they incorporate substantial automatic and manual annotation from the gold standard UniProtKB knowledgebase (10). Cross-references from GenBank and RefSeq genes are merged into Genome Reviews records based on the position of the 3′-end of the genes. This makes it possible to correctly map not only genes for which no cross-references exist between the databases, but also those for which the 5′-end (start site) may have been changed by UniProtKB curators.
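
The 3′-end matching described above can be illustrated with a minimal Python sketch. It is not the code used by the Microbe browser: the `Gene` record, the `merge_xrefs` function and the identifiers in the usage lines are hypothetical, and the sketch assumes 1-based inclusive coordinates with start ≤ end and at most one reference gene per replicon, strand and 3′-end position.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Hypothetical gene record; coordinates are 1-based, inclusive, start <= end."""
    gene_id: str
    replicon: str
    start: int
    end: int
    strand: str  # "+" or "-"
    xrefs: dict = field(default_factory=dict)


def three_prime_key(gene):
    """Return (replicon, strand, 3'-end coordinate) for a gene.

    On the forward strand the 3'-end is the largest coordinate;
    on the reverse strand it is the smallest.
    """
    pos = gene.end if gene.strand == "+" else gene.start
    return (gene.replicon, gene.strand, pos)


def merge_xrefs(reference_genes, other_genes):
    """Copy cross-references onto reference genes sharing the same 3'-end.

    Genes are matched on replicon, strand and 3'-end only, so a differing
    5'-end (start site) does not prevent a match.
    Returns the genes of other_genes that could not be matched.
    """
    index = {three_prime_key(g): g for g in reference_genes}
    unmatched = []
    for gene in other_genes:
        target = index.get(three_prime_key(gene))
        if target is None:
            unmatched.append(gene)
        else:
            target.xrefs.update(gene.xrefs)
    return unmatched


# Usage with made-up records: the first reference gene has a start site
# that differs from the NCBI record, but both share the same 3'-end.
reference = [Gene("GR_0001", "chr", 100, 400, "+"),
             Gene("GR_0002", "chr", 500, 800, "-")]
ncbi = [Gene("NC_a", "chr", 130, 400, "+", {"GeneID": "111"}),
        Gene("NC_b", "chr", 500, 830, "-", {"GeneID": "222"}),
        Gene("NC_c", "chr", 900, 1200, "+", {"GeneID": "333"})]
leftover = merge_xrefs(reference, ncbi)
```

Keying on the 3′-end rather than on full coordinates is what makes the mapping robust to revised start sites. Genes left in `leftover` would need separate handling, for example genes present in only one of the databases.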

USAGE

The Microbe browser home page is used for organism selection and search term input; the search term can be a gene name or a cross-reference to any of the source databases. Several view pages are available; the three most informative are detailed below. The user can easily navigate across those pages, and detailed online help is available.

The gene report page integrates data on gene sequence and annotation, orthologs and cross-references to the major biological databases.

The chromosome view pages (Figure 1) display the original genome annotation submitted in the DDBJ/EMBL/GenBank source databases, the modified annotation from UniProtKB (via Genome Reviews) and the gene predictions of several popular packages.

Figure 1.
Chromosome view of Mycobacterium tuberculosis CDC 1551. GenBank source annotation (black full boxes), Genome Reviews reference annotation from UniProtKB (coloured full boxes) and predictions from five software packages (hollow boxes). UniProtKB curators ...

The chromosome comparison pages (Figure 2) display regions surrounding orthologous genes in two or more organisms, highlighting orthology relationships between them and revealing cases of synteny (co-localized orthologs). This display is suited to comparing a few species with detailed positional information, whereas specialized software has been proposed to visualize synteny across dozens of species in a summarized display (15).

Figure 2.
Chromosome comparison view of regions around the pgl gene in Escherichia coli O157:H7, E. coli K12 and Yersinia pestis KIM5. Genes are coloured by InterPro families, and orthologous gene pairs are connected. Among the adjacent mod, ybh and bio genes present ...
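
The notion of synteny as co-localized orthologs can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the method used by the browser: the function name `syntenic_blocks`, its parameters and the gene identifiers in the usage lines are hypothetical, orthology is assumed to be one-to-one, and two orthologs count as co-localized only if they are immediate neighbours in the second genome.

```python
def syntenic_blocks(order_a, order_b, orthologs, min_size=2):
    """Group genes of genome A into runs whose orthologs are adjacent in B.

    order_a, order_b: gene identifiers in chromosomal order.
    orthologs: dict mapping a gene of A to its ortholog in B (one-to-one).
    Returns a list of blocks, each a list of (gene_a, gene_b) pairs
    containing at least min_size pairs.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []

    def close_block():
        # Keep the current run only if it is long enough.
        if len(current) >= min_size:
            blocks.append(list(current))
        current.clear()

    for gene_a in order_a:
        gene_b = orthologs.get(gene_a)
        if gene_b is None or gene_b not in pos_b:
            # No ortholog in B: the run of co-localized orthologs ends here.
            close_block()
            continue
        if current and abs(pos_b[gene_b] - pos_b[current[-1][1]]) != 1:
            # The ortholog is not next to the previous one in B.
            close_block()
        current.append((gene_a, gene_b))
    close_block()
    return blocks


# Usage with made-up identifiers: a1-a3 have orthologs that are
# neighbours in B (in inverted order), a4 has none, a5 stands alone.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b9", "b3", "b2", "b1", "b5"]
pairs = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b5"}
blocks = syntenic_blocks(order_a, order_b, pairs)
```

Using the absolute difference of positions lets the sketch detect conserved gene order in either orientation. A more tolerant version could allow small gaps between neighbouring orthologs.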

For software developers, programmatic access to the orthology relationships is available via web services through the OMA APIs at http://microbe.vital-it.ch/api.

CONCLUSION

Designed primarily for biomedical researchers, the Microbe browser provides an easy-to-use, interactive view for visualizing gene predictions, orthology and synteny relationships and for navigating across databases. Data originate from established bioinformatics databases: DDBJ/EMBL/GenBank source genomic data, annotation and cross-references to the major biological databases retrieved from Genome Reviews and RefSeq, pairwise gene orthology predictions from OMA, and alternative gene predictions from several prediction packages. Future developments will include fungal genomes and metagenomic data.

FUNDING

Swiss Institute of Bioinformatics. Funding for open access charge: Swiss Institute of Bioinformatics and Ecole Polytechnique Fédérale de Lausanne.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We thank R. Fabbretti and V. Flegel for IT support, and also A. Auchincloss, T. Lima and A. Kapopoulou for critical reading.

REFERENCES

1. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2008;36:D475–D479.
2. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2009;37:D26–D31.
3. Fitch WM. Distinguishing homologous from analogous proteins. Syst. Zool. 1970;19:99–113.
4. Altenhoff AM, Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comp. Biol. 2009;5:e1000262.
5. Roth AC, Gonnet GH, Dessimoz C. Algorithm of OMA for large-scale orthology inference. BMC Bioinformatics. 2008;9:518.
6. Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, et al. The UCSC Genome Browser Database: update 2009. Nucleic Acids Res. 2009;37:D755–D761.
7. Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, et al. Ensembl 2009. Nucleic Acids Res. 2009;37:D690–D697.
8. Carver T, Berriman M, Tivey A, Patel C, Böhme U, Barrell BG, Parkhill J, Rajandream MA. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics. 2008;24:2672–2676.
9. Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I, et al. Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res. 2005;33:D297–D302.
10. UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 2009;37:D169–D174.
11. Pruitt KD, Tatusova T, Klimke W, Maglott DR. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 2009;37:D32–D36.
12. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 1999;27:4636–4641.
13. Borodovsky M, Mills R, Besemer J, Lomsadze A. Prokaryotic gene prediction using GeneMark and GeneMark.hmm. Curr. Protoc. Bioinformatics. 2003;Chapter 4:Unit 4.5.
14. Nielsen P, Krogh A. Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics. 2005;21:4322–4329.
15. Lemoine F, Labedan B, Lespinet O. SynteBase/SynteView: a tool to visualize gene order conservation in prokaryotic genomes. BMC Bioinformatics. 2008;9:536.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press