Copyright © 2008 The Author(s)

Strepto-DB, a database for comparative genomics of group A (GAS) and B (GBS) streptococci, implemented with the novel database platform ‘Open Genome Resource’ (OGeR)

Institute for Microbiology, Technische Universität Braunschweig, Spielmannstrasse 7, 38106 Braunschweig, Germany

*To whom correspondence should be addressed. Tel: +49 531 391 5810; Fax: +49 531 391 5854; Email: i.retter/at/tu-bs.de

Received August 15, 2008; Revised September 19, 2008; Accepted September 22, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Streptococci are the causative agents of many human infectious diseases including bacterial pneumonia and meningitis. Here, we present Strepto-DB, a database for the comparative genome analysis of group A (GAS) and group B (GBS) streptococci. The known genomes of various GAS and GBS contain a large fraction of distributed genes that are absent from other strains or serotypes of the same species. Strepto-DB identifies the homologous proteins deduced from the genomes of interest. It allows for the elucidation of the GAS and GBS core- and pan-genomes via genome-wide comparisons. Moreover, an intergenic region analysis tool provides alignments and predictions for transcription factor binding sites in the non-coding sequences. An interactive genome browser visualizes functional annotations. Strepto-DB (http://oger.tu-bs.de/strepto_db) was created using OGeR, the Open Genome Resource for comparative analysis of prokaryotic genomes. OGeR is a newly developed open source database and tool platform for the web-based storage, distribution, visualization and comparison of prokaryotic genome data.
The system automatically creates the dedicated relational database and web interface and imports an arbitrary number of genomes derived from standardized genome files. OGeR can be downloaded at http://oger.tu-bs.de.

INTRODUCTION

The development of cost-efficient DNA sequencing methods has caused an explosion of prokaryotic genome sequencing projects (1,2). The exploration of new genome sequences is strongly supported by the availability of related genomes that can be used as templates. Correspondingly, strain-specific properties can be traced back to differences in the genomes of the compared strains. Comparing the gene composition of several bacterial genomes from different strains of the same species revealed that only a fraction of the genes is shared among the analyzed strains. This so-called core-genome is complemented by a fraction of distributed genes that are present in some strains and absent in others (3). The supra- or pan-genome of a species is defined as the core-genome plus all distributed genes. It became clear that the pathogenicity of certain bacteria strongly depends on the fraction of distributed genes in the genome (4).

Due to the medical impact of pathogenic Streptococcus pyogenes (GAS) and Streptococcus agalactiae (GBS) infections, several genome projects focused on the elucidation of serotypic variants of these Gram-positive bacteria (5). Several streptococcal genomes are available at the NMPDR (6), a database that focuses on microbial pathogens. Moreover, comprehensive databases provide comparative analysis features for prokaryotic genomes. These include MicrobesOnline (7), IMG (8) and GenoList (9), amongst others. However, for S. agalactiae it was predicted that the available reservoir of distributed genes is so large that new genes will be discovered even after hundreds of genomes have been sequenced (10). Therefore, comparative analyses of GAS and GBS genomes require the incorporation of all available sequence data.

Sequencing projects usually require confidential data handling prior to publication of the results. For this purpose, local data storage and analysis is essential. Several software tools have recently been published that offer local solutions for comparative analysis of prokaryotic genome data. PSAT (11) is a web tool that visualizes the conservation of gene order among a given set of organisms. Although PSAT supplies a very useful overview of the relatedness of different genomes, it does not offer the query functions of a typical genome database, i.e. direct gene and protein queries with detailed information on the obtained results. These features are provided, for example, by JCoast (12), a tool for the comparative analysis of prokaryotic genomes that is based on GenDB (13). However, JCoast is a local solution that does not support the distribution of data by a web server.

For this reason, we have developed Open Genome Resource (OGeR) as a generic web-accessible database and bioinformatics tool platform for the storage, visualization and comparative analysis of prokaryotic genome data. OGeR is designed to assist biologists in reading and interpreting genome files. The system is very flexible as it supports the import of an arbitrary selection of prokaryotic genome DNA sequence flat files. After the initial installation, the system is generated automatically, so that updating to new genome releases is very simple. Thus, OGeR can aid annotation and controlled data distribution in sequencing projects that depend on confidential data handling.

In this article, the functionalities of Strepto-DB are introduced as an example application of OGeR. The database Strepto-DB provides an up-to-date resource for all GAS and GBS genomes that are currently publicly available, including unfinished WGS sequences. It supplies a convenient platform for the (pan-)genome analysis and interpretation of GAS and GBS.
Strepto-DB was developed as part of the ERA-NET PathoGenoMics project that conducts a comprehensive comparative molecular analysis of GAS and GBS pathogenesis (http://www.pathogenomics-era.net).

FEATURES OF STREPTO-DB

Data content, exploration and visualization

The current Strepto-DB release 8.8 provides access to 13 GAS genomes, 8 GBS genomes and 7 plasmids. These comprise 41804 protein-coding genes, including 902 ‘unique’ genes for which no orthologs could be detected in any of the other strains (Table 1). To visualize the respective sizes of pan-genomes and core-genomes, Venn diagrams are provided as Supplementary Data.
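
The relationship between core-genome, pan-genome, distributed genes and ‘unique’ genes can be illustrated with a minimal Python sketch. It assumes that each strain has already been reduced to a set of ortholog-group identifiers; the function name, strain names and group identifiers below are hypothetical and are not taken from Strepto-DB or OGeR.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def summarize_gene_content(
    groups_by_strain: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str], Set[str], Set[str]]:
    """Return (core, pan, distributed, unique) sets of ortholog groups.

    core        : groups present in every strain
    pan         : groups present in at least one strain
    distributed : pan minus core
    unique      : groups present in exactly one strain
    """
    if not groups_by_strain:
        raise ValueError("at least one strain is required")
    gene_sets = list(groups_by_strain.values())
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    distributed = pan - core
    # Count in how many strains each ortholog group occurs.
    occurrence = Counter(group for genes in gene_sets for group in genes)
    unique = {group for group, n in occurrence.items() if n == 1}
    return core, pan, distributed, unique


# Hypothetical toy data: three strains, five ortholog groups.
toy_strains = {
    "strain_A": {"grp1", "grp2", "grp3"},
    "strain_B": {"grp1", "grp2", "grp4"},
    "strain_C": {"grp1", "grp3", "grp5"},
}
core, pan, distributed, unique = summarize_gene_content(toy_strains)
print(sorted(core), len(pan), sorted(distributed), sorted(unique))
```

In this toy example only `grp1` belongs to the core-genome, while `grp4` and `grp5` are ‘unique’ in the sense used above.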

The query options of the Strepto-DB web interface are summarized in Table 2. The database can be searched by gene and protein names, gene ontology (GO) and other functional annotation terms. Sequences can be searched either as strings and regular expressions or by BLAST. A genome viewer provides a scalable overview of the locus of the genes of interest on the chromosome.

For each gene, Strepto-DB provides a gene entry and a corresponding protein entry that comprise functional annotation including GO terms and EC numbers, respectively. Furthermore, links to external data resources are provided. These include EMBL-Bank (14), UniProt (15), Integr8 (16), ExPASy (17), NCBI Gene and Protein (18), KEGG (19), BRENDA (20) and PRODORIC (21). For gene entries, the genomic context is visualized as a map in an interactive genome browser that centers on a gene when it is selected by a mouse click. The selected gene is marked in red. Below this genome map, the genome browser displays a frame plot of the GC content. The genome browser also displays the DNA sequence of the respective genome section with coding regions in color. At the bottom of the gene entry, the Genomic Data field provides the gene sequence in various formats and the option for download in FASTA format.
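
As a rough illustration of the kind of quantity shown beneath the genome map, the following Python sketch computes GC content in fixed-size windows along a DNA sequence. This is a simplification: it reports one value per window rather than the per-codon-position values of a true GC frame plot, and the function names, window size and example sequence are assumptions made for illustration, not part of the Strepto-DB implementation.

```python
from typing import List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (start, GC fraction) for each full window along seq."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Hypothetical 18-nt example sequence, split into three windows of 6 nt.
example_seq = "ATGCGCGCATATATGCGC"
profile = windowed_gc(example_seq, window=6, step=6)
for start, gc in profile:
    print(f"{start:>3}  {gc:.2f}")
```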

Search for homologous proteins and intergenic region analysis

Strepto-DB allows for the alignment of both coding and non-coding DNA sequences within the Streptococcus genomes of interest. Homologous proteins were pre-calculated by reciprocal BLAST searches. The proteome comparison query supplies an overview of the conservation of proteins between different strains. After the selection of a reference genome and one or more comparison genomes, this query returns lists of those proteins that are conserved between the selected strains. In addition, each protein entry provides a list of homologous proteins. On demand, the identified homologs are aligned with the MUSCLE alignment tool (22) and displayed with the Jalview visualization software (23). Furthermore, the genomic context of the various homologous genes can be displayed as genome maps. As an example, see Figure 1.
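
The pre-calculation of homologs by reciprocal BLAST searches can be sketched in Python as follows. The sketch assumes BLAST results in the standard 12-column tabular format, in which the E-value is the 11th column, and uses the E-value cutoff of 1e-5 reported for Strepto-DB in the implementation section. Whether OGeR additionally requires best hits is not stated in the text, so this sketch accepts any pair of proteins that hit each other below the cutoff. The function names and protein identifiers are hypothetical.

```python
import csv
from typing import FrozenSet, Iterable, Set, Tuple


def parse_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Set[Tuple[str, str]]:
    """Collect (query, subject) pairs from tabular BLAST output below the cutoff."""
    hits: Set[Tuple[str, str]] = set()
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if query != subject and evalue <= max_evalue:
            hits.add((query, subject))
    return hits


def reciprocal_pairs(hits: Set[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Keep only pairs that were found in both search directions."""
    return {frozenset((q, s)) for (q, s) in hits if (s, q) in hits}


# Hypothetical BLAST rows; only the IDs and the E-value (11th field) matter here.
filler = ("85.0", "200", "30", "0", "1", "200", "1", "200")
rows = [
    ("gasA_001", "gbsB_101") + filler + ("1e-60", "250"),
    ("gbsB_101", "gasA_001") + filler + ("2e-58", "245"),  # reciprocal, significant
    ("gasA_002", "gbsB_102") + filler + ("1e-20", "90"),   # one direction only
    ("gasA_003", "gbsB_103") + filler + ("0.5", "20"),     # reciprocal, above cutoff
    ("gbsB_103", "gasA_003") + filler + ("0.4", "21"),
]
blast_lines = ["\t".join(r) for r in rows]
homolog_pairs = reciprocal_pairs(parse_hits(blast_lines))
print(sorted(sorted(pair) for pair in homolog_pairs))
```

Only the first pair passes both the reciprocity check and the E-value cutoff; the other two candidate pairs are rejected for the reasons given in the comments.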

In the intergenic regions, conserved DNA sequence motifs can function as regulator binding sites. Thus, an analysis of the intergenic DNA sequences might reveal information on the regulation of the respective downstream genes. Strepto-DB provides an intergenic region analysis that is composed of three tools. First, a BLAST search aligns the intergenic region DNA sequence of choice with the intergenic regions of the corresponding homologous genes of other Streptococcus strains. This similarity search can be started by a mouse click on the region of interest on the homologs' genome map. Second, selected intergenic regions can be analyzed for conserved sequence motifs with the MEME motif discovery tool (25). Third, each intergenic region entry includes a link to the Virtual Footprint analysis tool (21). Virtual Footprint uses position weight matrices from the PRODORIC database to predict transcription factor binding sites within the promoter region of a gene. Taken together, these methods provide supplementary evidence for potential regulator binding sites, generating hypotheses for experimental verification.

THE ‘OPEN GENOME RESOURCE’ (OGeR) PLATFORM FOR THE COMPARATIVE ANALYSIS OF PROKARYOTIC GENOMES

OGeR is generically applicable for the storage and comparison of related prokaryotic genomes. Strepto-DB was set up and is maintained with OGeR and thus illustrates its functionality: the Strepto-DB database and all features of its web interface were compiled automatically.

System architecture

OGeR consists of three components: a relational database, a setup that processes input data and imports them into the database, and a web interface that queries the database (Figure 2).

Implementation and local installation

OGeR is implemented as a PHP application that uses an Apache web server and operates on a PostgreSQL database. Local installation requires a Linux operating system and the installation of the corresponding PHP and Apache software packages. For the creation of a new OGeR-based database, the OGeR setup procedure requests the required information and imports the desired genomes into the system. Data download is performed by the wget program. Local genome sequences can be imported in EMBL or GenBank format.

Subsequently, homologous proteins are determined by an all-against-all BLAST search (26) of the proteins that are annotated in the imported flat files. Because the all-against-all BLAST search scales quadratically with the number of proteins, this step limits the number of genomes that can be imported in a reasonable amount of time on given computing hardware. The BLAST results are evaluated to detect homologous proteins. Here, ‘homology’ is defined as a double reciprocal BLAST hit with a given maximal E-value. For Strepto-DB, an E-value cutoff of 1e-5 was chosen. Finally, the setup finishes with the creation of a new web interface for the database. Detailed installation instructions facilitate the installation and setup procedure.

The OGeR web interface uses CGView (27) for the genome viewer. Multiple alignments are performed with MUSCLE (22) and depicted with Jalview (23). As CGView and Jalview are implemented as Java applets, the client web browser requires a Java installation. However, multiple alignments can alternatively be shown in a simple view that does not depend on Java.

CONCLUDING REMARKS

We have implemented a simple integrated database and bioinformatics platform named OGeR for the comparative analysis of related genomes. This platform was subsequently employed for comparative genomic analyses of 21 Streptococcus genomes, resulting in the Strepto-DB database.
Conserved and distributed genes were deduced for the analyzed strains and used for core- and pan-genome prediction.

FUNDING

German Bundesministerium für Bildung und Forschung (ERA-NET grant 0313936C to J.K. and R.M.) and Deutsche Forschungsgemeinschaft (Sonderforschungsbereich 578 to I.B. and R.M.). Funding for open access charges: Deutsche Forschungsgemeinschaft (Sonderforschungsbereich 578).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We would like to thank Bernd Hoppe for excellent technical assistance and financial management.

REFERENCES

1. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2008;36:D475–479.
2. Medini D, Serruto D, Parkhill J, Relman DA, Donati C, Moxon R, Falkow S, Rappuoli R. Microbiology in the post-genomic era. Nat. Rev. Micro. 2008;6:419–430.
3. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. The microbial pan-genome. Curr. Opin. Genet. Dev. 2005;15:589–594.
4. Ehrlich GD, Hiller NL, Hu FZ. What makes pathogens pathogenic. Genome Biol. 2008;9:225.
5. Lefébure T, Stanhope MJ. Evolution of the core and pan-genome of Streptococcus: positive selection, recombination, and genome composition. Genome Biol. 2007;8:R71.
6. McNeil LK, Reich C, Aziz RK, Bartels D, Cohoon M, Disz T, Edwards RA, Gerdes S, Hwang K, Kubal M, et al. The National Microbial Pathogen Database Resource (NMPDR): a genomics platform based on subsystem annotation. Nucleic Acids Res. 2007;35:D347–353.
7. Alm EJ, Huang KH, Price MN, Koche RP, Keller K, Dubchak IL, Arkin AP. The MicrobesOnline Web site for comparative genomics. Genome Res. 2005;15:1015–1022.
8. Markowitz VM, Szeto E, Palaniappan K, Grechkin Y, Chu K, Chen IMA, Dubchak I, Anderson I, Lykidis A, Mavromatis K, et al. The integrated microbial genomes (IMG) system in 2007: data content and analysis tool extensions. Nucleic Acids Res. 2008;36:D528–533.
9. Lechat P, Hummel L, Rousseau S, Moszer I. GenoList: an integrated environment for comparative analysis of microbial genomes. Nucleic Acids Res. 2008;36:D469–474.
10. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc. Natl Acad. Sci. USA. 2005;102:13950–13955.
11. Fong C, Rohmer L, Radey M, Wasnick M, Brittnacher M. PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes. BMC Bioinformatics. 2008;9:170.
12. Richter M, Lombardot T, Kostadinov I, Kottmann R, Duhaime M, Peplies J, Glockner F. JCoast - a biologist-centric software tool for data mining and comparison of prokaryotic (meta)genomes. BMC Bioinformatics. 2008;9:177.
13. Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski J, Linke B, Rupp O, Giegerich R, et al. GenDB–an open source genome annotation system for prokaryote genomes. Nucleic Acids Res. 2003;31:2187–2195.
14. Cochrane G, Akhtar R, Aldebert P, Althorpe N, Baldwin A, Bates K, Bhattacharyya S, Bonfield J, Bower L, Browne P, et al. Priorities for nucleotide trace, sequence and annotation data capture at the Ensembl Trace Archive and the EMBL Nucleotide Sequence Database. Nucleic Acids Res. 2008;36:D5–12.
15. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. 2008;36:D190–195.
16. Mulder NJ, Kersey P, Pruess M, Apweiler R. In silico characterization of proteins: UniProt, InterPro and Integr8. Mol. Biotechnol. 2008;38:165–177.
17. Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, Bairoch A. ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 2003;31:3784–3788.
18. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008;36:D13–21.
19. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008;36:D480–484.
20. Barthelmes J, Ebeling C, Chang A, Schomburg I, Schomburg D. BRENDA, AMENDA and FRENDA: the enzyme information system in 2007. Nucleic Acids Res. 2007;35:D511–514.
21. Münch R, Hiller K, Grote A, Scheer M, Klein J, Schobert M, Jahn D. Virtual Footprint and PRODORIC: an integrative framework for regulon prediction in prokaryotes. Bioinformatics. 2005;21:4187–4189.
22. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5:113.
23. Clamp M, Cuff J, Searle SM, Barton GJ. The Jalview Java alignment editor. Bioinformatics. 2004;20:426–427.
24. Tettelin H, Masignani V, Cieslewicz MJ, Eisen JA, Peterson S, Wessels MR, Paulsen IT, Nelson KE, Margarit I, Read TD, et al. Complete genome sequence and comparative genomic analysis of an emerging human pathogen, serotype V Streptococcus agalactiae. Proc. Natl Acad. Sci. USA. 2002;99:12391–12396.
25. Bailey TL, Williams N, Misleh C, Li WW. MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 2006;34:W369–373.
26. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
27. Grant JR, Stothard P. The CGView Server: a comparative genomics tool for circular genomes. Nucleic Acids Res. 2008;36:W181–184.