Nucleic Acids Res. Jan 2009; 37(Database issue): D408–D411.
Published online Oct 21, 2008. doi:  10.1093/nar/gkn749
PMCID: PMC2686588

PEDANT covers all complete RefSeq genomes

Abstract

The PEDANT genome database provides exhaustive annotation of nearly 3000 publicly available eukaryotic, eubacterial, archaeal and viral genomes with more than 4.5 million proteins by a broad set of bioinformatics algorithms. In particular, all completely sequenced genomes from the NCBI's Reference Sequence collection (RefSeq) are covered. The PEDANT processing pipeline has been sped up by an order of magnitude through the utilization of precalculated similarity information stored in the similarity matrix of proteins (SIMAP) database, making it possible to process newly sequenced genomes immediately as they become available. PEDANT is freely accessible to academic users at http://pedant.gsf.de. For programmatic access Web Services are available at http://pedant.gsf.de/webservices.jsp.

INTRODUCTION

Since its first announcement in 1997 (1), the PEDANT genome database has steadily grown to become one of the most comprehensive collections of automatically annotated genomes. As of September 2008, PEDANT covers all complete genomes as provided by the RefSeq (2) database. In total, 861 completely sequenced genomes from all three domains of life as well as 2081 complete viral genomes are available (Table 1). Here, we define a ‘complete genome’ as a genome whose chromosomal datasets exist as RefSeq records or Ensembl (3) entries and for which genes have been predicted. For those eukaryotic genomes (currently 33) that are available from both RefSeq and Ensembl, we provide the annotation of both versions. This results in a total of 2975 genome databases with 4.5 million proteins occupying 3.1 TB of storage. All PEDANT databases are continuously updated. For example, assignments of genes to the MIPS Functional Catalog (FunCat) (4) have recently been recalculated using the new 2.1 version of FunCat (http://mips.gsf.de/projects/funcat).

Table 1.
The number of species from major taxonomic groups contained in the PEDANT genome database as of September 2008

The current version of the software driving the PEDANT web site, which we refer to as PEDANT3, represents an industry-strength Java workbench that supports large-scale grid computing and utilizes a work-flow-based processing engine (D. Frishman et al., manuscript in preparation). Dozens of custom workflows are available: generic workflows for eukaryotic, prokaryotic and viral genomes as well as more specialized workflows supporting specific genome groups (gram-positive versus gram-negative bacteria, fungi, plants), data types (EST collections, raw contigs without any predicted Open Reading Frames (ORFs), protein-only datasets, etc.) and bioinformatics methods (e.g. alternative gene prediction techniques). Advanced protein and DNA viewers implemented using server-side Java provide graphical representation of protein annotation features as well as genetic elements on chromosomes.

NEW FEATURES AND IMPROVEMENTS

Genome import pipeline

Given the quick pace of genome sequencing, keeping track of currently available data and obtaining them from source databases for local processing is a time-consuming and technically challenging task. To organize a more efficient import of genomes into PEDANT from various sources, we set up a specialized processing pipeline (Figure 1). In the first step, we acquire a list of available genomes from each genome resource. We then determine the Entrez genome project ID by using the Entrez Programming Utilities (eUtils, http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) to query the NCBI databases (5) for genome project information. If available, we use the genome project ID as the primary key for a given genome; otherwise the NCBI taxonomy ID is used. The advantage of genome project IDs is that they are stable, in contrast to taxonomy IDs, which may change, especially for the species/strains of newly sequenced genomes. The genome IDs are then stored in our local meta-database, which also serves as the basis for generating the full genome list for the PEDANT web page.
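The key selection rule described above can be illustrated with a minimal sketch. The class and function names are hypothetical and do not reflect the actual PEDANT meta-database schema; the sketch only assumes that a genome project ID may be absent while a taxonomy ID is always known.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenomeKey:
    """Primary key for a genome in a local meta-database (hypothetical layout)."""
    kind: str   # "project" for a genome project ID, "taxonomy" for the fallback
    value: int


def choose_primary_key(project_id: Optional[int], taxonomy_id: int) -> GenomeKey:
    """Prefer the stable genome project ID; fall back to the NCBI taxonomy ID."""
    if project_id is not None:
        return GenomeKey("project", project_id)
    return GenomeKey("taxonomy", taxonomy_id)
```

Recording the kind of identifier alongside its value keeps the two ID spaces apart, so a project ID and a taxonomy ID that happen to share a number cannot collide.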

Figure 1.
UML activity model of the PEDANT genome import and processing pipeline. Symbols according to the UML 2.0 specification (http://www.uml.org) for activity diagrams.

Data retrieval procedures have been adapted to several different sources of genome information. For downloading RefSeq genomes, we use a patched version (retry on connection timeouts, improved error handling) of the NCBI ToolBox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox) program. For Ensembl genomes, we install the provided MySQL database dumps (ftp://ftp.ensembl.org/pub/current_mysql) at our local MySQL server and extract the genomic data directly.

Retrieval of genomes not contained in RefSeq or Ensembl can only be done in a semi-automatic fashion with manual verification. In many cases, RefSeq lists the sequencing centers involved, from which the original data can be obtained. Another useful resource for locating genomes is the Genomes OnLine Database (GOLD) (6). We then retrieve the assembly and annotation data directly from the sequencing centers and check them for missing sequences, nonunique identifiers and unusual formatting. If the gene annotation data are missing or only available as a draft version (especially for fungal genomes), gene predictions are carried out or existing models are improved, depending on the annotation project (7,8).
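The three checks mentioned above (missing sequences, nonunique identifiers, unusual formatting) might look roughly as follows for FASTA input. This is an illustrative sketch rather than the PEDANT implementation: the function name, the report layout and the choice of the IUPAC nucleotide alphabet as the definition of "usual" characters are all assumptions.

```python
from collections import Counter
from typing import Dict, FrozenSet, List

# IUPAC nucleotide codes; pass a different alphabet for protein data.
IUPAC_DNA: FrozenSet[str] = frozenset("ACGTUNRYKMSWBDHV")


def check_fasta(text: str, alphabet: FrozenSet[str] = IUPAC_DNA) -> Dict[str, List[str]]:
    """Report records without sequence, duplicated IDs and unexpected characters."""
    records = []            # list of (identifier, sequence) in file order
    current_id = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                records.append((current_id, "".join(chunks)))
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            chunks = []
        elif current_id is not None:
            # Sequence lines before the first header are ignored.
            chunks.append(line)
    if current_id is not None:
        records.append((current_id, "".join(chunks)))

    counts = Counter(identifier for identifier, _ in records)
    problems: Dict[str, List[str]] = {
        "missing_sequence": [],
        "duplicate_id": sorted(i for i, n in counts.items() if n > 1),
        "bad_characters": [],
    }
    for identifier, sequence in records:
        if not sequence:
            problems["missing_sequence"].append(identifier)
        elif set(sequence.upper()) - alphabet:
            problems["bad_characters"].append(identifier)
    return problems
```

A report of this kind only flags candidates; as the text notes, the final decision still requires manual verification.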

Integration of PEDANT and SIMAP

Calculating and updating protein similarities and domain assignments is the most time-consuming and computationally expensive task in our genome annotation pipeline. Previously, BLASTP (9) and InterProScan (10) searches required up to 80% of the total CPU time of the PEDANT genome annotation workflow. To cope with the large number of newly sequenced genomes and to keep the data in PEDANT up-to-date, a radical reduction of this computational effort has become necessary.

The most obvious answer to this problem is to utilize high-performance computing facilities and avoid redundant calculations. The similarity matrix of proteins (SIMAP) (11) provides precalculated and up-to-date all-against-all alignments as well as domain assignments for essentially all publicly available protein sequences (21 million as of this writing). Our recent efforts to integrate PEDANT with SIMAP made it possible to avoid computationally intensive BLASTP and InterProScan runs and have led to a dramatic acceleration of the genome annotation work. Compared with de novo calculations, retrieving similarities and domains from the SIMAP database reduces the required CPU time by factors between 5 and 60. A typical bacterial genome with 3000 predicted genes can be processed at MIPS in <40 min using 60 Sun Grid Engine (SGE, http://gridengine.sunsource.net) nodes.

To generate and obtain these data, we have developed a computational workflow that coordinates the tasks between PEDANT and SIMAP. The first step in this workflow involves the import and maintenance of genome sequences and primary annotation provided by the respective source databases in PEDANT. In a subsequent step, SIMAP automatically retrieves protein and sequence data from PEDANT. If novel protein sequences previously unknown to SIMAP have been imported, their similarities to all other protein sequences and their domain architecture are calculated in SIMAP by utilizing large public resource computing facilities (12). As soon as the precalculated data are completely available in SIMAP, a notification event is triggered to start the SIMAP-based methods in PEDANT. These methods have been implemented as remote Enterprise Java Bean (EJB) invocations, which allow for rapid and efficient retrieval of data from SIMAP. One method designed to replace BLASTP retrieves homologs from a composite nonredundant database that includes PDB, UniProt/Swissprot, UniProt/TrEMBL, as well as all protein sequences already present in PEDANT. The second method which serves as a substitute for InterProScan retrieves precalculated protein domain assignments considering all InterPro member databases according to the InterPro XML format specification, except for the TMHMM (13), SignalP (14) and TargetP (15) methods which are run by PEDANT itself considering the appropriate genomic context (i.e. gram stain for signal peptides).
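The central idea of this workflow, computing only for sequences that are not yet known and retrieving everything else, can be sketched as follows. The sketch assumes that sequences are identified by a hash of their normalized content; this is an illustrative choice, and the function names and the set of known keys are hypothetical rather than part of the SIMAP or PEDANT interfaces.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def sequence_key(sequence: str) -> str:
    """Content-based key: identical sequences map to the same key."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def split_known_and_novel(
    proteins: Dict[str, str], precalculated: Set[str]
) -> Tuple[List[str], List[str]]:
    """Separate proteins whose results can be retrieved from those needing computation."""
    known: List[str] = []
    novel: List[str] = []
    for protein_id, sequence in proteins.items():
        if sequence_key(sequence) in precalculated:
            known.append(protein_id)
        else:
            novel.append(protein_id)
    return known, novel
```

Only the proteins in the second list would need to be submitted for similarity and domain calculations; once those finish, all results for the genome can be retrieved in a single pass.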

Web Services

The comprehensive collection of nearly 3000 extensively annotated genomes provides a unique foundation for data mining and large-scale investigation of genome properties. While information on a limited number of genes of interest can be conveniently explored using the PEDANT web interface, any computational analysis of genomes at large necessitates local access to data. However, the large amount of annotation data computed for 4.5 million PEDANT proteins makes systematic dissemination of database dumps or flat files impractical (although we do provide them upon request). Instead, we offer simple, transparent and programming language-independent remote access based on Web Service technology. This service has been implemented as a document-style, SOAP-based Web Service (see http://www.w3.org/TR/soap12-part0). It can be easily integrated into users' own applications, since libraries for accessing this kind of service exist for most programming languages. The functions provided by the Web Service are described in a Web Services Description Language (WSDL) file (see http://www.w3.org/TR/wsdl), which allows for automatic generation of a client program, e.g. by using the Perl SOAP::Lite (http://www.soaplite.com) or the Java Axis (http://ws.apache.org/axis/java/index.html) libraries.
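For readers unfamiliar with document-style SOAP, the following sketch shows how a SOAP 1.2 request envelope can be assembled with the Python standard library alone. The envelope namespace is the one defined by the SOAP 1.2 specification; the operation name, service namespace and parameter used in the usage note are placeholders, since the real names are defined in the service's WSDL file. In practice a WSDL-aware client library would generate such messages automatically.

```python
import xml.etree.ElementTree as ET
from typing import Dict

SOAP12_NS = "http://www.w3.org/2003/05/soap-envelope"


def build_envelope(operation: str, service_ns: str, params: Dict[str, object]) -> bytes:
    """Wrap one request element, with simple child parameters, in a SOAP 1.2 envelope."""
    ET.register_namespace("soap", SOAP12_NS)
    envelope = ET.Element(f"{{{SOAP12_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP12_NS}}}Body")
    # The request element lives in the service's own namespace.
    request = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        # Parameters are emitted as unqualified children (an assumption).
        ET.SubElement(request, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")
```

For example, `build_envelope("listOrganisms", "urn:example:pedant3", {"maxResults": 10})` yields a message that could be sent in an HTTP POST with the content type `application/soap+xml`; both the operation and the namespace in this call are hypothetical.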

The PEDANT3 WSDL file can be found at http://mips.gsf.de/webservice/pedant3/Pedant3AccessBeanService/Pedant3AccessWebService?wsdl. At present the service provides the following query types:

  1. return the list of organisms processed in PEDANT,
  2. return the computational methods used to annotate a particular organism,
  3. return a result overview (e.g. which functional category appears how many times) for a certain method in a certain organism,
  4. return the genetic elements of an organism,
  5. return the result of a certain method for a single genetic element or for a whole genome ordered by its genetic elements.

For the last query type, it is possible to search in both directions: the service can return all genetic elements having a certain property (e.g. a certain functional attribute), or all properties of a certain genetic element (e.g. all functional attributes of a protein). Furthermore, in the former case it is possible to query several genomes at once. For BLASTP- and SIMAP-based methods, the results can be restricted by an E-value cutoff. A detailed overview of the Web Service functionality can be found at http://pedant.gsf.de/webservices.jsp.

The PEDANT3 Web Service encapsulates the complicated internal data structures of the PEDANT database and returns the results in a generic format that consists of key-value pairs of properties assigned to a given genetic element. This generic format assures that the end-user client software will not have to be reprogrammed if new methods are introduced into the PEDANT system.
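A client-side view of such a generic result format can be sketched as below, covering both search directions and an optional E-value cutoff. The record layout, field names and function names are assumptions made for illustration; they do not reproduce the actual message schema of the PEDANT3 Web Service.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class Annotation(NamedTuple):
    """One key-value property assigned to a genetic element (hypothetical layout)."""
    element: str
    key: str
    value: str
    evalue: Optional[float] = None   # only set for similarity-based methods


def _passes(record: Annotation, max_evalue: Optional[float]) -> bool:
    # The cutoff applies only to records that carry an E-value.
    return max_evalue is None or record.evalue is None or record.evalue <= max_evalue


def properties_of(
    records: Iterable[Annotation], element: str, max_evalue: Optional[float] = None
) -> Dict[str, List[str]]:
    """All properties of one genetic element, grouped by key."""
    result: Dict[str, List[str]] = {}
    for record in records:
        if record.element == element and _passes(record, max_evalue):
            result.setdefault(record.key, []).append(record.value)
    return result


def elements_with(
    records: Iterable[Annotation], key: str, value: str,
    max_evalue: Optional[float] = None,
) -> List[str]:
    """All genetic elements that carry a given property."""
    return [
        record.element
        for record in records
        if record.key == key and record.value == value and _passes(record, max_evalue)
    ]
```

Because neither function depends on a particular key, a property produced by a newly introduced annotation method is handled without any change to the client code.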

DISCUSSION

There is no fixed release cycle for PEDANT. As soon as new genomes become available at RefSeq or any other listed genome resource, they are imported, processed and made available via the web server. However, since SIMAP has a monthly release cycle, the computation of a genome by PEDANT is typically finished roughly 1 month after its import. Since the PEDANT3 software is now stable and all genomes from the previous version, PEDANT2, have been either migrated or reimported into PEDANT3, we took PEDANT2 and its Web Service offline. We also discarded all incomplete genomes previously available via PEDANT2 because the new high-throughput technologies now allow genome sequencing projects to be finished within a very short time frame.

In the future, genomes from further resources [i.e. UCSC Genome Browser Database (16), Vega (17)] will be imported and previously imported genomes will be kept up-to-date. We are also in the process of supplementing the PEDANT web site with several new features, including display of genome project information [RefSeq status, source sequencing centers, whole-genome shotgun (WGS) (18) sequencing coverage, number of records, etc.], taxonomic selection of genomes and improved search capabilities. A cross-genome index for precomputed annotations is nearly finished and will be available online shortly. This will allow for comparison of genomes based on their annotated features, such as domain content, functional categories and structural folds.

FUNDING

Funding for open access charge: Helmholtz Gemeinschaft.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We are grateful to Volker Stümpflen for assistance with the Web Services.

REFERENCES

1. Frishman D, Mewes H.-W. Pedantic genome analysis. Trends Genet. 1997;13:415–416.
2. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–D65.
3. Hubbard TJP, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al. Ensembl 2007. Nucleic Acids Res. 2007;35:D610–D617.
4. Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Güldener U, Mannhaupt G, Münsterkötter M, et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004;32:5539–5545.
5. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008;36:D13–D21.
6. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2008;36:D475–D479.
7. Güldener U, Mannhaupt G, Münsterkötter M, Haase D, Oesterheld M, Stümpflen V, Mewes H.-W, Adam G. FGDB: a comprehensive fungal genome resource on the plant pathogen Fusarium graminearum. Nucleic Acids Res. 2006;34:D456–D458.
8. Kämper J, Kahmann R, Bölker M, Ma L.-J, Brefort T, Saville BJ, Banuett F, Kronstad JW, Gold SE, Müller O, et al. Insights from the genome of the biotrophic fungal plant pathogen Ustilago maydis. Nature. 2006;444:97–101.
9. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
10. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R. InterProScan: protein domains identifier. Nucleic Acids Res. 2005;33:W116–W120.
11. Rattei T, Tischler P, Arnold R, Hamberger F, Krebs J, Krumsiek J, Wachinger B, Stümpflen V, Mewes H.-W. SIMAP–structuring the network of protein similarities. Nucleic Acids Res. 2008;36:D289–D292.
12. Rattei T, Walter M, Arnold R, Anderson D, Mewes W. Using public resource computing and systematic pre-calculation for large scale sequence analysis. Lect. Notes Bioinform. 2007;4360:11–18.
13. Kahsay RY, Gao G, Liao L. An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes. Bioinformatics. 2005;21:1853–1858.
14. Bendtsen JD, Nielsen H, von Heijne G, Brunak S. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 2004;340:783–795.
15. Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 2000;300:1005–1016.
16. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, et al. The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res. 2008;36:D773–D779.
17. Wilming LG, Gilbert JGR, Howe K, Trevanion S, Hubbard T, Harrow JL. The vertebrate genome annotation (Vega) database. Nucleic Acids Res. 2008;36:D753–D760.
18. Staden R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Res. 1979;6:2601–2610.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
