• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of bioinfoLink to Publisher's site
Bioinformatics. Dec 1, 2008; 24(23): 2798–2800.
Published online Nov 7, 2008. doi:  10.1093/bioinformatics/btn520
PMCID: PMC2639278

GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus

Abstract

The NCBI Gene Expression Omnibus (GEO) represents the largest public repository of microarray data. However, finding data in GEO can be challenging. We have developed GEOmetadb in an attempt to make querying the GEO metadata both easier and more powerful. All GEO metadata records as well as the relationships between them are parsed and stored in a local MySQL database. A powerful, flexible web search interface with several convenient utilities provides query capabilities not available via NCBI tools. In addition, a Bioconductor package, GEOmetadb that utilizes a SQLite export of the entire GEOmetadb database is also available, rendering the entire GEO database accessible with full power of SQL-based queries from within R.

Availability: The web interface and SQLite databases available at http://gbnci.abcc.ncifcrf.gov/geo/. The Bioconductor package is available via the Bioconductor project. The corresponding MATLAB implementation is also available at the same website.

Contact: yidong/at/mail.nih.gov

1 INTRODUCTION

The NCBI Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) represents the largest public repository of microarray data in existence (Edgar et al., 2002; Barrett et al., 2007). The Bioconductor project (Gentleman et al., 2004) contains hundreds of state-of-the-art methods for the analysis of microarray and genomics data. Previously we published software, called GEOquery (Davis and Meltzer, 2007), which effectively establishes a bridge between GEO microarray data and Bioconductor and facilitates reanalysis using novel and rigorous statistical and bioinformatic tools. However, a difficulty that remains in dealing with GEO is to find, based on the experimental metadata, the microarray data that are of interest especially for large-scale and programmatic access of GEO data. As part of the NCBI Entrez search system, GEO can be searched online via web pages or using NCBI eUtils (http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html). However, the NCBI/GEO web search is not yet full featured, particularly for programmatic access. NCBI eUtils offers another option for finding data within the vast stores of GEO, but it is cumbersome to use, often requiring multiple complicated eUtils queries to get the relevant information. GEOmetadb was developed in an attempt to make querying the GEO metadata both easier and more effective. GEOmetadb includes a web-based query engine, supported by a MySQL database backend, with several convenient utilities and a Bioconductor package, called GEOmetadb, which queries a locally installed GEOmetadb SQLite database that we update regularly and supply for download; each can be used independently of the other.

2 RESULTS

2.1 GEO metadata parsing

GEO has an open, adaptable design that can handle variety and a minimum information about a microarray experiment (MIAME)-compliant (Brazma et al., 2001) infrastructure that promotes fully annotated submissions. The basic record types in GEO include Platforms (GPL), Samples (GSM), Series (GSE) and DataSets (GDS), of which GDS records are assembled by GEO curators and others are supplied by submitters. Essentially, information in each GEO record can be divided into two parts, a metadata part and the data part. The information in metadata part is critical for finding GEO microarray data of interest. NCBI offers several different methods to access GEO records, which we utilize to capture all GEO metadata for different GEO data types accordingly. Hypertext preprocessor (PHP, http://www.php.net) functions were written to parse, extract, reformat, construct data elements and interact with a MySQL database (http://www.mysql.com/) for storage and querying. The PHP function for parsing GDS SOFT files was adopted from the EzArray software (Zhu et al., 2008). The GEOmetadb MySQL database was designed to store parsed GEO metadata and relationships between them (Fig. 1). All data in GEOmetadb are faithfully parsed from GEO and no attempt is made to curate, semantically recode, or otherwise clean up GEO data. All field names are also taken from GEO records except for minor changes to improve usability in SQL queries. Fields containing multiple records are generally stored as delimited text within the same record; this denormalization significantly reduces complexity and improves efficiency of queries. SQLite 3 database (http://www.sqlite.org/) is a widely used, cross-platform SQL database engine which is a self-contained, embeddable, serverless, transactional SQL database engine. The RSQLite package (James, 2008) includes an embedded SQLite database engine and can interact with any SQLite database; each database exists as a simple file, which is easily exchanged and is platform independent. An R script converts the GEOmetadb MySQL database to an SQLite 3 database file that contains data identical to those in the GEOmetadb MySQL database. The SQLite version of GEOmetadb is maintained and distributed for local installation.

Fig. 1.
Diagram of GEO entity relationships in GEOmetadb.

2.2 GEOmetadb bioconductor package

The GEOmetadb Bioconductor package is simply a thin wrapper around the GEOmetadb SQLite database. The package also includes extensive documentation and example queries. The function getSQLiteFile is the standard method for downloading and unzipping the most recent GEOmetadb SQLite file from the server. The function geoConvert performs conversion of one GEO entity to other associated GEO entities, providing a very fast, convenient mapping between GEO types. To convert ‘GPL96’ to other possible GEO entities in the GEOmetadb.sqlite:

An external file that holds a picture, illustration, etc.
Object name is btn520i1.jpg

The example provided below utilizes RSQLite function dbGetQuery to extract all affymetrix GeneChips that have .CEL supplementary submission to GEO.

An external file that holds a picture, illustration, etc.
Object name is btn520i2.jpg

2.3 The GEOmetadb online search tool

The GEOmetadb online search tool is a web-based search interface for searching, viewing and downloading GEO metadata stored in the GEOmetadb MySQL database. GEO metadata records can be searched by individual data type or by a flexible, efficient, powerful combined GSE-GPL-GSM search, as shown in Figure 2, where GEO entities in the tables are linked by relationships between them. Essential query fields are provided with drop-down menu for popular entries, and keyword search for full-text querying from multiple text fields in GEO. Other features include multiple field query, query within results, creating lists, flexible display options, downloading and detailed views of any results.

Fig. 2.
Screen-capture of GEOmetadb online search: combined GSE-GPL-GSM search.

3 CONCLUSIONS

With the continued growth in the volume and complexity of microarray data available via NCBI GEO, it is critical that researchers have efficient, flexible, powerful methods for querying those data. While GEO offers several options for finding microarray data, GEOmetadb provides an alternative, yet much more flexible and efficient, set of tools for both online and programmatic access to GEO metadata. We expect that improved access to GEO metadata will not only enhance researchers’ abilities to find data of interest, but also provide a possibility for users to create a customized GEO metadata database, e.g. annotating experiments with controlled vocabulary and integrating with other biological data sources.

ACKNOWLEDGEMENTS

We would like to thank BioInforx for the BxAF search functionality used on the web query pages and the NCBI GEO staff for valuable input and support during the development process.

Funding: Intramural Research Program of the NIH, National Cancer Institute, Bethesda, USA.

Conflict of Interest: none declared.

REFERENCES

  • Edgar R, et al. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30:207–210. [PMC free article] [PubMed]
  • Barrett T, et al. NCBI GEO: mining tens of millions of expression profiles—database and tools update. Nucleic Acids Res. 2007;35:D760–D765. [PMC free article] [PubMed]
  • Brazma A, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet. 2001;29:365–371. [PubMed]
  • Gentleman RC, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80.1–R80.16. [PMC free article] [PubMed]
  • Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and Bioconductor. Bioinformatics. 2007;23:1846–1847. [PubMed]
  • Zhu Y, et al. EzArray: a web-based highly automated Affymetrix expression array data management and analysis system. BMC Bioinformatics. 2008;9:46–55. [PMC free article] [PubMed]
  • James DA. RSQLite: SQLite interface for R. R package version 0.6-8. 2008 http://cran.r-project.org/web/packages/RSQLite/index.html.

Articles from Bioinformatics are provided here courtesy of Oxford University Press

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

  • PubMed
    PubMed
    PubMed citations for these articles

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...