Entrez GEO Profiles and Entrez GEO DataSets query tutorial
Background
GEO stores a wide assortment of high-throughput experimental data,
processed in a variety of ways. To enable statistical
analyses, GEO data are assembled into comparable sets, or GEO
DataSets (GDS). A GDS represents a curated collection of biologically
and statistically comparable GEO samples. Samples within a dataset refer to
the same platform, that is, they interrogate a common set of elements.
Calculations are performed on the VALUE column provided by the submitter
in GEO sample data tables. These value measurements are assumed to be
calculated in an equivalent manner for each sample within a dataset,
that is, considerations such as background processing and normalization
are consistent across the dataset. GDSs form the basis of GEO's query,
data display, and analysis tools.
GDS experiment descriptions and value measurements are
stored in two NCBI Entrez databases:
Entrez GEO DataSets:
Entrez GEO DataSets contains curated GEO dataset definitions to facilitate identification of experiments of interest.
Entrez GEO DataSets can be searched with any text found in either the curated DataSet or the original
submitter-supplied GEO records that make up the DataSet.
Entrez GEO Profiles:
Entrez GEO Profiles stores individual, dataset-specific gene expression profiles. Entrez
GEO Profiles may be used to query for specific genes of interest, profiles of
interest based on flagged significant effects or similar expression profile patterns, and
related profiles based on sequence similarity.
Query construction
Data of interest may be located by entering text in the Entrez GEO DataSets
or Entrez GEO Profiles search boxes (or similarly using the "query DataSets" and "query Gene profiles"
boxes on the GEO home page).
As with other Entrez databases, searches may be refined using a Boolean phrase
restricted to any number of supported attribute fields.
The Preview/Limits link on the Entrez GEO DataSets and Entrez GEO Profiles pages
assists greatly in the construction of complex queries.
Alternatively, complex search statements can be written and executed directly in the search boxes.
To perform such a search, specify the search terms, their fields, and the Boolean operations
to perform on the terms using the following syntax:
term [field] OPERATOR term [field]
where term(s) are the search terms, the field(s) are the search fields and qualifiers, and the OPERATOR(s)
are the Boolean operators (AND, OR, NOT). The indexes (available on the Preview/Limits page) may be used to
browse and/or select the terms by which data are described.
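The syntax above can be sketched as a small Python helper that assembles a query string. The function names (clause, join_clauses) are hypothetical and chosen only for illustration; the sketch assumes that multi-word terms are quoted and that the field tag follows the term directly, as in the examples later in this tutorial.

```python
BOOLEAN_OPERATORS = ("AND", "OR", "NOT")

def clause(term, field=None):
    """Format one search term, quoting multi-word terms and appending an optional [field]."""
    quoted = f'"{term}"' if " " in term else term
    return f"{quoted}[{field}]" if field else quoted

def join_clauses(operator, *clauses):
    """Join already formatted clauses with one Boolean operator."""
    if operator not in BOOLEAN_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(clauses)

query = join_clauses(
    "AND",
    clause("dual channel", "Experiment Type"),
    clause("metastasis"),             # no field tag: searched as plain text
    clause("human", "Organism"),
)
print(query)
```

The resulting string can be pasted into the search box as is; the helper does not check that a field name is one of the supported fields.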
Fields for Entrez GEO DataSets include:
[All Fields] [Author] [Experiment Publication Date] [Experiment Type] [Filter] [GDS Creation/Update Date] [GDS Text] [GEO Accession] [GEO Description/Title Text]
[Gene Name or Description] [Number of Samples] [Number of Platform Probes] [Organism] [Platform Reporter Type] [Reporter Identifier]
[Sample Source] [Sample Title] [Sample Value Type] [Submitter Institute]
[Subset Description] [Subset Variable Type]
Fields for Entrez GEO Profiles include:
[All Fields] [Experiment Type] [Filter] [Flag Information] [Flag Type] [GDS Text] [GEO Accession] [GEO Description/Title Text] [GI]
[Gene Description] [ID_REF] [Max Value Rank] [Min Value Rank] [Number of Samples] [Organism] [Platform Reporter Type] [Ranked Standard Deviation]
[Reporter Identifier] [Sample Source] [Sample Value Type]
The following query examples further demonstrate how to effectively mine GEO data.
Search for an experiment of interest
Search for datasets of interest using Entrez GEO DataSets. Entrez GEO DataSets retrievals
display the dataset title, a brief experiment description, organism, experimental variables,
and links to the complete GDS record, parent platform, and reference series records.
Example:
To identify all dual channel nucleotide microarray experimental
datasets exploring metastasis in humans, use the "query DataSets" (or Entrez GEO DataSets) box, and enter:
"dual channel"[Experiment Type] AND metastasis AND human[Organism]
Search for a gene of interest
Search for a gene of interest using Entrez GEO Profiles.
Entrez GEO Profiles retrievals display individual, precomputed, dataset-specific gene
expression/molecular abundance profile charts. Click the chart thumbnails to
view a breakdown of the dataset experimental design. Retrievals also display the available gene identifier information
(gene name, GenBank accession, clone ID, ORF), mapping information,
the dataset title, and additional flags regarding outliers and detection calls.
Retrievals are listed in order of most-interesting-first, based on a scoring scheme that considers
flagged effects, expression level, outliers, and variability.
Profiles for a gene of interest may be located by entering all or part of the gene name,
gene symbol, alias, GenBank accession number, clone ID, ORF name, or the reference identifier
(ID_REF) from the parent platform.
Example:
To view profiles of kallikrein family genes across all datasets, use the "query Gene profiles"
(or Entrez GEO Profiles) box, and enter:
kallikrein
To limit these kallikrein retrievals to datasets investigating progesterone, enter:
kallikrein AND progesterone[GDS Text]
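The same query string can also be submitted programmatically. The sketch below only builds a request URL and does not send it. It assumes the NCBI E-utilities esearch endpoint and the Entrez database name geoprofiles, neither of which is covered in this tutorial; the function name esearch_url and the default result limit are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: E-utilities esearch endpoint; not described in this tutorial.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(database, query, max_results=20):
    """Build (but do not send) an esearch request URL for an Entrez query."""
    params = {"db": database, "term": query, "retmax": max_results}
    # urlencode percent-encodes the brackets and turns spaces into '+'.
    return f"{ESEARCH_URL}?{urlencode(params)}"

profiles_url = esearch_url("geoprofiles", "kallikrein AND progesterone[GDS Text]")
print(profiles_url)
```

Keeping URL construction separate from the network call makes it easy to inspect the encoded query before sending it with any HTTP client.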
Search for interesting/significant/specific gene expression profiles
Several fields are available to
refine an Entrez GEO Profiles search to
help identify interesting or significant molecular abundance
profiles.
GEO datasets are partitioned into subsets that reflect
experimental design. Genes are flagged as having significant effects in
relation to subset types if the values or ranks pass a threshold of statistical
difference between any subset containing more than one sample and another subset. Thus, queries can be made
for genes that show interesting expression profiles with regard to experimental subsets.
Example:
To view profiles showing interesting value subset effects in either dataset GDS186 or GDS187,
use the "query Gene profiles" (or Entrez GEO Profiles) box, and enter:
(GDS186 OR GDS187) AND "value subset effect"[Flag Type]
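Queries of this shape, several accessions combined with OR and then restricted by a flag, can be generated with a short Python sketch. The function names (any_of, flagged_profiles) are hypothetical; the parentheses are kept so that the OR group is evaluated before the AND.

```python
def any_of(*terms):
    """Group alternative terms with OR inside parentheses."""
    return "(" + " OR ".join(terms) + ")"

def flagged_profiles(accessions, flag_type):
    """Restrict a group of dataset accessions to profiles carrying one flag type."""
    return f'{any_of(*accessions)} AND "{flag_type}"[Flag Type]'

query = flagged_profiles(["GDS186", "GDS187"], "value subset effect")
print(query)
```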
The value measurements of each sample in a dataset are rank-ordered.
It is possible to refine searches to view genes with profiles
that fall within a specified abundance bracket.
Example:
To view profiles that fall into the top 1% abundance rank bracket in at least one sample in dataset GDS186,
use the "query Gene profiles" (or Entrez GEO Profiles) box, and enter:
GDS186 AND 100[Max Value Rank]
A range can also be specified. To view the top 5%, enter:
GDS186 AND 96:100[Max Value Rank]
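The two examples above suggest a simple rule for turning "top N%" into a rank clause. The sketch below assumes that ranks are whole percentile bins from 1 to 100 with 100 as the most abundant, which is consistent with the examples but is not stated explicitly in this tutorial; the function name top_percent_clause is hypothetical.

```python
def top_percent_clause(percent, field="Max Value Rank"):
    """Return a rank clause covering the top `percent` percentile bins."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be a whole number between 1 and 100")
    low = 100 - percent + 1
    # A single bin is written as a bare number, a wider bracket as low:high.
    rank = "100" if low == 100 else f"{low}:100"
    return f"{rank}[{field}]"

top_one = "GDS186 AND " + top_percent_clause(1)
top_five = "GDS186 AND " + top_percent_clause(5)
print(top_one)
print(top_five)
```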
Variability for each gene is calculated using the
standard deviation of rank across the dataset. It is possible to refine searches to view highly
variable gene expression profiles across a dataset.
Example:
To view the top 1% most variable molecular abundance profiles in dataset GDS186,
use the "query Gene profiles" (or Entrez GEO Profiles) box, and enter:
GDS186 AND 100[Ranked Standard Deviation]
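To make the idea of "standard deviation of rank" concrete, the following is a rough Python sketch and not GEO's actual procedure. It assumes that values are ranked within each sample into percentile bins from 1 to 100, that ties are broken by input order, and that the population standard deviation is used; the function names and the toy data are hypothetical. The further step of ranking the resulting standard deviations is not shown.

```python
import math
from statistics import pstdev

def percentile_ranks(values):
    """Rank the values of one sample into bins 1..100 (100 = highest value)."""
    count = len(values)
    order = sorted(range(count), key=lambda i: values[i])
    ranks = [0] * count
    for position, index in enumerate(order):
        ranks[index] = math.ceil((position + 1) * 100 / count)
    return ranks

def rank_variability(table):
    """Map each gene to the standard deviation of its ranks across samples.

    `table` maps a gene name to its list of values, one per sample.
    """
    genes = list(table)
    sample_count = len(table[genes[0]])
    ranks_by_gene = {gene: [] for gene in genes}
    for sample in range(sample_count):
        column = [table[gene][sample] for gene in genes]
        for gene, rank in zip(genes, percentile_ranks(column)):
            ranks_by_gene[gene].append(rank)
    return {gene: pstdev(ranks) for gene, ranks in ranks_by_gene.items()}

# Toy data: geneB changes position in the second sample, displacing geneA.
toy = {
    "geneA": [10, 10, 10],
    "geneB": [5, 20, 5],
    "geneC": [1, 1, 1],
    "geneD": [50, 50, 50],
}
variability = rank_variability(toy)
print(variability)
```

In this toy table geneC and geneD keep the same rank in every sample and so have zero rank variability, while geneA and geneB swap positions in one sample and score above zero.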
Search for a sequence of interest
The GEO BLAST tool queries Entrez
GEO Profiles for molecular abundance profiles of interest based on
nucleotide sequence similarity. The GEO BLAST database contains all
GenBank identifiers represented on microarray platforms or SAGE
libraries in GEO. This interface is helpful for identifying sequence
homologs of interest, e.g., related gene family members, and for
cross-species comparisons.