NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
Nat Methods. Author manuscript; available in PMC Jul 27, 2009.
Published in final edited form as:
PMCID: PMC2716375
NIHMSID: NIHMS59171

AILUN: Re-annotating Gene Expression Data Automatically

To the editor

Gene Expression Omnibus (GEO) [1] is a public repository for gene expression data. While the amount of data in GEO has grown exponentially, the number of publications citing GEO has grown only linearly. The difficulty in reusing the data lies in mapping the probes in GEO data sets to established gene identifiers, and these mappings can change as annotations for the underlying sequences change [2]. Microarray results therefore need to be re-evaluated with the latest probe annotations. Several previous efforts have re-annotated microarray probe identifiers [3,4], but only for a few platforms and species.

We built a fully automated system, AILUN, to periodically re-annotate all types of microarrays in GEO by relating every probe ID to Entrez Gene IDs. First, we collected all gene identifiers from Entrez Gene and UniGene and built a Universal Gene Identifier Table (UGIT). We then matched each column of every GEO platform against UGIT to find the best-matching column and its type of external identifier, and annotated each probe ID with Entrez Gene IDs (Supplementary Methods and Supplementary Fig. 1 online).
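
The column-matching step can be illustrated with a minimal Python sketch. It is not the AILUN implementation, whose details are in the Supplementary Methods. The toy UGIT, the scoring rule (count of cells found under the most common identifier type), and the names best_matching_column and annotate are all assumptions made for illustration.

```python
from collections import Counter

# Toy stand-in for UGIT (assumption): identifier -> (identifier type, Entrez Gene ID).
UGIT = {
    "NM_000546": ("RefSeq", 7157),
    "NM_007294": ("RefSeq", 672),
    "TP53": ("Symbol", 7157),
    "BRCA1": ("Symbol", 672),
}


def best_matching_column(platform):
    """Return (column name, identifier type, hit count) for the best column.

    `platform` maps column names to lists of cell values, one per probe.
    A column is scored by how many of its cells are found in UGIT under
    its single most common identifier type (an assumed scoring rule).
    """
    best = (None, None, 0)
    for name, cells in platform.items():
        types = Counter(UGIT[c][0] for c in cells if c in UGIT)
        if not types:
            continue  # no cell of this column is a known identifier
        id_type, hits = types.most_common(1)[0]
        if hits > best[2]:
            best = (name, id_type, hits)
    return best


def annotate(platform, probe_column):
    """Map each probe ID to an Entrez Gene ID, or None if no match is found."""
    candidates = {k: v for k, v in platform.items() if k != probe_column}
    column, id_type, _ = best_matching_column(candidates)
    probes = platform[probe_column]
    if column is None:
        return {probe: None for probe in probes}
    annotations = {}
    for probe, cell in zip(probes, platform[column]):
        entry = UGIT.get(cell)
        matched = entry is not None and entry[0] == id_type
        annotations[probe] = entry[1] if matched else None
    return annotations


# Hypothetical three-probe platform table.
platform = {
    "ID": ["p1", "p2", "p3"],
    "GB_ACC": ["NM_000546", "NM_007294", "XX_unknown"],
    "DESCRIPTION": ["tumor protein p53", "BRCA1 DNA repair", "unknown"],
}
result = annotate(platform, "ID")
```

In this toy case the GB_ACC column is selected because two of its cells are known RefSeq identifiers, and the third probe is left unannotated.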

UGIT contained 75 million (M) gene identifiers of 90 types for 3585 species. AILUN successfully re-annotated 66% of gene expression platforms, enabling reuse of 77% of samples across 79 species. The platform annotation coverage of AILUN was 5 times that of GEO (Table 1), and the annotations were 94% identical for probes annotated by both AILUN and GEO. To validate, we compared the annotations of Affymetrix U133A 2.0 across AILUN, GEO, and NetAffx [5], using Brainarray [3], which is based on probe sequence matching, as the gold standard. AILUN tied NetAffx at 97% precision and 97% recall, and outperformed GEO, which reached 98% precision but only 86% recall (Supplementary Tables 1-3 and Supplementary Discussion online).
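
As a rough illustration of how precision and recall might be computed against a gold standard, the Python sketch below scores probe-to-gene mappings. The exact definitions used in the validation are given in the Supplementary Discussion. Here precision is assumed to be the fraction of annotated probes that agree with the gold standard, and recall the fraction of gold-standard probes that were annotated correctly. The function name and the example data are hypothetical.

```python
def precision_recall(predicted, gold):
    """Compare probe -> gene ID mappings against a gold standard.

    Probes mapped to None are treated as unannotated.
    """
    made = {p: g for p, g in predicted.items() if g is not None}
    truth = {p: g for p, g in gold.items() if g is not None}
    # An annotation is correct when it agrees with the gold standard.
    correct = sum(1 for p, g in made.items() if truth.get(p) == g)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical data: four gold-standard probes, three of them annotated.
gold = {"p1": 7157, "p2": 672, "p3": 1956, "p4": 7422}
predicted = {"p1": 7157, "p2": 672, "p3": None, "p4": 2064}
precision, recall = precision_recall(predicted, gold)
```

Under these definitions, a source that annotates fewer probes can have high precision and still have low recall, which is the pattern reported for GEO.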

Table 1
Performance comparison. AILUN and GEO are compared on the number of re-annotated array platforms and the number of samples enabled for reuse.

The server (http://ailun.stanford.edu) offers four functions to help users re-annotate platforms. Platform annotation adds the latest annotations to any uploaded result file. Cross-species mapping maps platform annotations to other species. Platform comparison compares any two platforms to find corresponding probes that map to the same gene. Gene search finds platforms and samples deposited in GEO for any list of genes.
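
The idea behind platform comparison can be sketched in Python as a join on Entrez Gene ID. This only illustrates the concept and is not the server's code. The function name corresponding_probes and the probe names are hypothetical.

```python
from collections import defaultdict


def corresponding_probes(platform_a, platform_b):
    """Pair probes from two platforms that map to the same Entrez Gene ID.

    Each platform is a dict of probe ID -> Entrez Gene ID (None if unannotated).
    """
    # Index the second platform by gene so each lookup is a dictionary access.
    by_gene = defaultdict(list)
    for probe, gene in platform_b.items():
        if gene is not None:
            by_gene[gene].append(probe)
    pairs = []
    for probe, gene in platform_a.items():
        for other in by_gene.get(gene, []):
            pairs.append((probe, other, gene))
    return sorted(pairs)


# Hypothetical annotated platforms.
a = {"a1": 7157, "a2": 672, "a3": None}
b = {"b1": 672, "b2": 7157, "b3": 7157}
pairs = corresponding_probes(a, b)
```

A gene measured by several probes on one platform yields one pair per probe, so the mapping is many-to-many rather than one-to-one.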

ACKNOWLEDGEMENTS

Supported by Lucile Packard Foundation for Children's Health, National Library of Medicine (K22 LM008261), National Institute of General Medical Sciences (R01 GM079719), Howard Hughes Medical Institute, and Pharmaceutical Research and Manufacturers of America Foundation. We thank Alex Skrenchuk and Annie Chiang from Stanford University for computer support and manuscript review, respectively.

Footnotes

COMPETING INTERESTS STATEMENT

The authors declare no competing financial interests.

REFERENCES

1. Barrett T, et al. Nucleic Acids Res. 2007;35:D760–765.
2. Perez-Iratxeta C, Andrade MA. BMC Bioinformatics. 2005;6:183.
3. Dai M, et al. Nucleic Acids Res. 2005;33:e175.
4. Tsai J, et al. Genome Biol. 2001;2 SOFTWARE0002.
5. Liu G, et al. Nucleic Acids Res. 2003;31:82–86.