Nucleic Acids Res. Jan 1, 2005; 33(Database issue): D622–D627.
Published online Dec 17, 2004. doi:  10.1093/nar/gki040
PMCID: PMC539994

openSputnik—a database to ESTablish comparative plant genomics using unsaturated sequence collections

Abstract

The public expressed sequence tag collections are continually being enriched with high-quality sequences that represent an ever-expanding range of taxonomically diverse plant species. While these sequence collections provide biased insight into the populations of expressed genes available within individual species and their associated tissues, the information is conceivably of wider relevance in a comparative context. When we consider the available expressed sequence tag (EST) collections of summer 2004, most of the major plant taxonomic clades are at least superficially represented. Investigation of the five million available plant ESTs provides a wealth of information that has applications in modelling the routes of plant genome evolution and the identification of lineage-specific genes and gene families. Over four million ESTs from over 50 distinct plant species have been collated within an EST analysis pipeline called openSputnik. The ESTs were resolved down into approximately one million unigene sequences. These have been annotated using orthology-based annotation transfer from reference plant genomes and using a variety of contemporary bioinformatics methods to assign peptide, structural and functional attributes. The openSputnik database is available at http://sputnik.btk.fi.

INTRODUCTION

Complete genome sequencing has become the standard modus operandi for bacterial genomics, and tens of eukaryotic genomes have also been completely sequenced (see http://www.genomesonline.org). Plant genomics is, however, frequently hindered by the typically large and repetitive nature of the genome. Certain plant species have genome sizes that dwarf the human genome; the 1C genome size for broad bean (Vicia faba) is at least 26 000 Mb (Plant DNA C-values database), or over eight times the size of the human genome. The selection of candidate plant genomes for complete sequencing is, therefore, based on the scientific and anthropocentric value of the plant and the feasibility of a meaningful sequencing and assembly strategy. While several diverse plant species [Arabidopsis thaliana (1), Oryza sativa (2,3) and Populus trichocarpa] have been or will shortly be completely sequenced, the majority of plant genomes remain largely inaccessible. Arabidopsis and rice are certainly model plant systems, but they are neither truly representative of any other given species nor general indicators of gene content across the whole plant kingdom. The first forays into comparative plant genomics using Arabidopsis and rice as reference genomes have demonstrated that there is a remarkable degree of underlying sequence diversity between these species (2,3). This firmly supports the need to at least sample the protein-coding component of more taxonomically ‘exotic’ plant genomes.

cDNA preparation and expressed sequence tag (EST) sequencing remain a dominant methodology for accessing the protein-coding (and expressed) portion of the genome. Many laboratories are independently sequencing very large numbers of sequences from a broad and bio-diverse spectrum of plant species (Figure 1). EST sequences retain their exalted status for several reasons [for a review see (4)].

  1. They are technically simple to produce and cheap to sequence.
  2. ESTs provide a robust approximation of the expressed gene content of the parental genome under given sampling conditions and can be used for primitive expression profiling between tissues (5).
  3. The extensive redundancy typical of EST collections also allows for the selection of putative molecular markers (6,7).
  4. cDNAs may be used as a substrate for arraying, to create cDNA microarrays; this allows for true gene expression profiling (8).
Figure 1
A depiction of the phylogenetic relationships among the major plant lineages as published previously (23). The evolutionary tree has been overlaid with the names of plant species having large EST collections (>5000 sequences) that are available ...

With more than 5.4 million sequences from over 320 species, the current public plant EST sequence databases (EMBL release 80) (9) are a valuable and contextually rich but under-utilized resource. If we consider just the large EST collections with over 5000 ESTs, 5.1 million ESTs from 74 species are represented. These species, while highly biased towards the key plant taxonomic clades of the rosids, asterids and monocots, still include representative species from other key taxonomic groups. The collections include unicellular representatives (the red and brown algae and lower plants), gymnosperms, basal angiosperms and the angiosperms. With such a wealth of signals for investigation of the underlying genomic changes in gene content, protein structures and domain composition, the EST collections surely deserve detailed analysis and investigation.

The openSputnik database has been designed as an interim platform for the exhaustive annotation and analysis of EST sequences in a comparative context. In addition to clustering sequences, a peptide sequence is identified, thus providing a more sensitive target for the identification of functional and structural features. Sequences are placed in context with the currently available complete plant genomes and are associated with other clustered EST collections. The openSputnik database thus creates a platform upon which the intricate patterns of generalist house-keeping genes and lineage-specific gene families may be teased apart. The completed EST project annotations are available as a searchable web resource. While the provision of an integrated resource containing a diverse mixture of clustered and contextually placed unigene sequences is not unique [e.g. TIGR Gene Indices database (10), NCBI Unigenes (11) or PlantGDB at Iowa State University (12)], the openSputnik database is currently distinct in its focus on functionally describing unigene sequences on the basis of both orthologous gene annotations and the application of bioinformatics methods for ab initio annotations.

IMPLEMENTATION AND STARTING MATERIAL

The openSputnik database has been programmed using the Java programming language and utilizes the PostgreSQL relational database management system to archive and retrieve sequences and their annotations. Therefore, openSputnik is largely platform-independent and has been implemented using a server–client model to allow for calculation in a distributed and heterogeneous computational environment. The methods implemented within openSputnik are described as functional objects and the analytical pathway is described as a directed acyclic graph (Figure 2). The current version of openSputnik utilizes the complete public plant EST collection that was available from the European Molecular Biology Laboratory (EMBL) at the start of Spring 2004 (EMBL release 78). A rule was imposed so that EST collections of at least 4500 sequences would be included. Over four million EST sequences representing 55 distinct plant species were identified using this rule. These sequences were loaded into the openSputnik database schema.
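The inclusion rule and the directed acyclic graph described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the Java used by openSputnik, and the step names, the `PIPELINE` mapping and both function names are hypothetical; only the 4500-sequence threshold comes from the text.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

MIN_COLLECTION_SIZE = 4500  # inclusion threshold stated in the text


def select_species(est_counts, minimum=MIN_COLLECTION_SIZE):
    """Return, sorted by name, the species whose EST collection meets the threshold."""
    return sorted(name for name, count in est_counts.items() if count >= minimum)


# Hypothetical step names (not taken from openSputnik).
# Each key maps to the set of steps that must finish before it can run.
PIPELINE = {
    "import_embl": set(),
    "trim_vector": {"import_embl"},
    "mask_low_complexity": {"trim_vector"},
    "cluster": {"mask_low_complexity"},
    "assemble": {"cluster"},
    "predict_peptide": {"assemble"},
    "annotate": {"assemble", "predict_peptide"},
}


def execution_order(graph):
    """Return one valid execution order; graphlib raises CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())
```

Describing the steps as a graph rather than a fixed script means that independent steps could, in principle, be dispatched to different machines once their prerequisites are complete, which fits the distributed environment mentioned above.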

Figure 2
A simplification of the directed acyclic graph that describes the analytical pipeline used to build the openSputnik database. As starting material, species-specific EMBL flat files are imported and all annotations are retained. This creates a sequence ...

SEQUENCE CLUSTERING

Prior to sequence clustering, ESTs were aggressively trimmed of any likely residual vector or polylinker sequences using the Crossmatch application (P. Green, unpublished data) and the National Center for Biotechnology Information (NCBI) UniVec database. Sequences <55 nt in length were excluded at this stage. To prevent the aggregation of sequences on the basis of low complexity sequence islands, all low complexity sequences were masked using the RepeatBeater algorithm (Biomax informatics, Martinsried, Germany). The masked sequences were clustered into pools of related sequences using a suffix tree based approach (HPT2 algorithm; Biomax informatics). To encourage the aggregation of sequences, HPT2 was run using a similarity threshold of 0.7 and a number of network iterations equalling the number of masked ESTs. The resulting clusters were assembled into unigene sequences using the CAP3 algorithm with standard settings. Within the larger EST collections, some HPT2-identified clusters contain many members. To simplify the analysis, larger clusters were truncated to an arbitrary maximum of 2500 ESTs. As a consequence, some individual ESTs representing the most highly expressed genes are absent from their cognate unigenes.
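The two size rules in this step (excluding sequences shorter than 55 nt and capping clusters at 2500 members) can be sketched as follows. This is an illustrative Python sketch, not the openSputnik code: the function names are hypothetical, and because the text does not say which members of an oversized cluster are retained, the sketch simply assumes the first 2500 in input order.

```python
MIN_LENGTH_NT = 55       # sequences shorter than this are excluded (from the text)
MAX_CLUSTER_SIZE = 2500  # arbitrary cluster cap (from the text)


def drop_short(sequences, min_length=MIN_LENGTH_NT):
    """Keep only sequences of at least min_length nucleotides."""
    return {sid: seq for sid, seq in sequences.items() if len(seq) >= min_length}


def truncate_cluster(members, max_size=MAX_CLUSTER_SIZE):
    """Split a cluster into (kept, dropped) lists.

    Assumption: members are kept in input order; the source does not
    state the selection criterion.
    """
    return members[:max_size], members[max_size:]
```

Returning the dropped members alongside the kept ones makes it straightforward to record which ESTs did not contribute to their unigene.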

PEPTIDE PREDICTION

It is probable that each derived unigene sequence represents an expressed and properly spliced mRNA. Extensive amounts of either 5′-untranslated region (5′-UTR) or 3′-UTR may exist within the unigene sequences. The identification of a meaningful peptide sequence lends value to the dataset by allowing us to exclude sequences of low protein-coding potential, and additionally allows the use of peptide-annotation algorithms. ESTScan (13) models have been trained for each of the underlying species. Training data were produced by identifying probable open reading frame (ORF) sequences from a BLASTX (14) analysis against the Swiss-Prot (15) database arbitrarily filtered at 1E−10. ESTScan was used with the derived model to predict the most likely peptide for each unigene sequence. The numbers of ESTs, unigenes and peptides are shown for each of the 55 openSputnik plant species along with estimates of actual coding potential and redundancy across the individual libraries (Table 1).
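The selection of training candidates by E-value can be sketched as below. The sketch makes several assumptions that are not in the text: that the BLASTX results are available as 12-column tab-separated lines (query in column 1, subject in column 2, E-value in column 11), that the cutoff is inclusive, and that only the best hit per unigene is of interest. The function name is hypothetical.

```python
E_VALUE_CUTOFF = 1e-10  # filter value stated in the text


def best_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return {query: (subject, evalue)} for the lowest E-value hit per query.

    Hits with an E-value above the cutoff are ignored. The column layout
    is an assumption (12-column tab-separated BLAST output).
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

Unigenes that pass this filter would supply the probable ORFs used to train the species-specific model; those that fail are simply left out of the training set.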

Table 1.
Table summarizing the sequence content of the openSputnik database

DATABASE CONTENTS

The unigene sequences and peptides from each of the included species have been annotated using a selection of bioinformatics tools that are relevant to comparative genomics and biological understanding. Sequences are annotated for structural and functional characters using InterPro domains (16), TMHMM for the identification of transmembrane domains (17), TargetP for the prediction of organellar targeting (18) and SignalP for the prediction of signal peptides (19). The BLAST algorithm is used to reflect similarities of individual sequences with known proteins in the Swiss-Prot database, with predicted proteins in the UniProt database (20) and with organism-specific sets of proteins including, but not restricted to, A. thaliana, O. sativa and aggregated plant proteins. The complete sequence collections are summarized using the MIPS catalogue of functionally annotated proteins (Funcat) (21) and Gene Ontology terms (22). A collection of methods has been implemented to provide the typical figures and charts that are often seen in EST collection publications. Graphical representations of sequence lengths, number of ESTs within unigenes and clone-library representation are all included. Also included are reports summarizing the functional distribution of unigenes using both GOSlims and the MIPS Funcat.
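A functional-distribution report of the kind mentioned above amounts to counting unigenes per category. The following Python sketch assumes the category labels have already been assigned to each unigene; the function name and the example labels are hypothetical and are not GOSlim or Funcat identifiers.

```python
from collections import Counter


def functional_distribution(annotations):
    """Count unigenes per functional category.

    `annotations` maps a unigene identifier to an iterable of category
    labels; a unigene contributes at most once to each category.
    """
    counts = Counter()
    for categories in annotations.values():
        counts.update(set(categories))  # de-duplicate labels within a unigene
    return counts


# Usage with made-up labels:
example = {
    "unigene_1": ["metabolism", "transport"],
    "unigene_2": ["metabolism", "metabolism"],
    "unigene_3": [],
}
report = functional_distribution(example)
```

Unigenes without any assigned category, such as `unigene_3` here, do not appear in the counts, so the proportion of unclassified sequences would need to be reported separately.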

DATABASE ACCESS

A query interface to the openSputnik database is provided by a web application product written for the Zope web application server. The openZputnik portal at http://sputnik.btk.fi provides access to all core EST collections through a single unified interface. Selecting EST projects will display a list of all available projects. When an openSputnik collection is selected, an interface that provides routes to the underlying data will be displayed. Different methods are included for EST sequences, unigene sequences and peptide sequences. Additionally, a page is included to access sequences on the basis of pre-computed reports, and a BLAST server is included so that sequences may be identified on the basis of similarity to a known sequence. Sequences may be identified on the basis of a variety of criteria including, but not restricted to, GC content, length, name or predicted function.
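Two of the search criteria named above, GC content and length, are simple to compute from the sequence itself. The Python sketch below is illustrative only and does not reflect the portal's query code; the function names and parameters are hypothetical, and ambiguous bases such as N are assumed to be excluded from the GC calculation.

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T).

    Returns 0.0 when the sequence contains no unambiguous bases.
    """
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def select_sequences(sequences, min_gc=0.0, max_gc=1.0, min_length=0):
    """Return sorted identifiers of sequences within the GC range and length bound."""
    return sorted(
        sid
        for sid, seq in sequences.items()
        if len(seq) >= min_length and min_gc <= gc_content(seq) <= max_gc
    )
```

For example, `select_sequences(seqs, min_gc=0.5, min_length=4)` would return the identifiers of sequences at least 4 nt long whose GC fraction is 0.5 or higher.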

When a sequence is selected, a single-page summary report is displayed for the sequence. This summarizes key information that includes, wherever appropriate, the best BLAST matches, functional information and physical attributes. Navigation tabs are provided so that a user may access all primary information derived from or associated with a single sequence.

DATA AVAILABILITY AND FUTURE DIRECTIONS

All data within the openSputnik database is freely available to the scientific community. Please contact the author to request the inclusion of additional methods. The analytical pipeline may be applied to novel and proprietary sequence collections as either a collaboration with, or as a service of, the Bioinformatics Core facility provided at the Turku Centre for Biotechnology. The openSputnik SQL schema and complete database dumps are available upon request. The source code to the openSputnik engine and core reporting architecture is being open-sourced and released to Source Forge (www.sourceforge.com).

The openSputnik group will prepare one or two releases of the clustered plant unigenes per year. Additional plant species will be included in the pipeline as they exceed our arbitrary size threshold. Additional groups of organisms will be integrated in the future, with a comparative mammalian unigene database planned for spring 2005. Additional emphasis is being placed on the creation of generic reports that can distil the essence of large and heterogeneous sequence collections. Further synchronization of the completed resources with the Gene Ontology and dynamic integration and comparison of groups of species are in progress. The challenge is to stay abreast of the ever-growing collections of sequences and the novel bioinformatics methodologies that offer us the ability to better understand the nuances within our sequence collections.

REFERENCES

1. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.
2. Yu,J., Hu,S., Wang,J., Wong,G.K., Li,S., Liu,B., Deng,Y., Dai,L., Zhou,Y., Zhang,X. et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science, 296, 79–92.
3. Goff,S.A., Ricke,D., Lan,T.H., Presting,G., Wang,R., Dunn,M., Glazebrook,J., Sessions,A., Oeller,P., Varma,H. et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science, 296, 92–100.
4. Rudd,S. (2003) Expressed sequence tags: alternative or complement to whole genome sequences? Trends Plant Sci., 8, 321–329.
5. Satou,Y., Kawashima,T., Kohara,Y. and Satoh,N. (2003) Large scale EST analyses in Ciona intestinalis: its application as Northern blot analyses. Dev. Genes Evol., 213, 314–318.
6. Thiel,T., Michalek,W., Varshney,R.K. and Graner,A. (2003) Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor. Appl. Genet., 106, 411–422.
7. Kota,R., Rudd,S., Facius,A., Kolesov,G., Thiel,T., Zhang,H., Stein,N., Mayer,K. and Graner,A. (2003) Snipping polymorphisms from large EST collections in barley (Hordeum vulgare L.). Mol. Genet. Genomics, 270, 24–33.
8. Drmanac,R. and Drmanac,S. (1999) cDNA screening by array hybridization. Methods Enzymol., 303, 165–178.
9. Kulikova,T., Aldebert,P., Althorpe,N., Baker,W., Bates,K., Browne,P., van den Broek,A., Cochrane,G., Duggan,K., Eberhardt,R. et al. (2004) The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 32, D27–D30.
10. Quackenbush,J., Cho,J., Lee,D., Liang,F., Holt,I., Karamycheva,S., Parvizi,B., Pertea,G., Sultana,R. and White,J. (2001) The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res., 29, 159–164.
11. Wheeler,D.L., Church,D.M., Federhen,S., Lash,A.E., Madden,T.L., Pontius,J.U., Schuler,G.D., Schriml,L.M., Sequeira,E., Tatusova,T.A. et al. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res., 31, 28–33.
12. Dong,Q., Schlueter,S.D. and Brendel,V. (2004) PlantGDB, plant genome database and analysis tools. Nucleic Acids Res., 32, D354–D359.
13. Iseli,C., Jongeneel,C.V. and Bucher,P. (1999) ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 138–148.
14. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
15. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
16. Mulder,N.J., Apweiler,R., Attwood,T.K., Bairoch,A., Barrell,D., Bateman,A., Binns,D., Biswas,M., Bradley,P., Bork,P. et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.
17. Krogh,A., Larsson,B., von Heijne,G. and Sonnhammer,E.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol., 305, 567–580.
18. Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne,G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005–1016.
19. Bendtsen,J.D., Nielsen,H., von Heijne,G. and Brunak,S. (2004) Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol., 340, 783–795.
20. Apweiler,R., Bairoch,A., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res., 32, D115–D119.
21. Mewes,H.W., Frishman,D., Guldener,U., Mannhaupt,G., Mayer,K., Mokrejs,M., Morgenstern,B., Munsterkotter,M., Rudd,S. and Weil,B. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 30, 31–34.
22. Harris,M.A., Clark,J., Ireland,A., Lomax,J., Ashburner,M., Foulger,R., Eilbeck,K., Lewis,S., Marshall,B., Mungall,C. et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 32, D258–D261.
23. Pryer,K.M., Schneider,H., Zimmer,E.A. and Ann Banks,J. (2002) Deciding among green plants for whole genome studies. Trends Plant Sci., 7, 550–554.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press