About GEO2R

Background

GEO2R is an interactive web tool that allows users to compare two or more groups of Samples in a GEO Series in order to identify genes that are differentially expressed across experimental conditions. Results are presented as a table of genes ordered by significance.

GEO2R performs comparisons on original submitter-supplied processed data tables using the GEOquery and limma R packages from the Bioconductor project. Bioconductor is an open source software project based on the R programming language that provides tools for the analysis of high-throughput genomic data. The GEOquery R package parses GEO data into R data structures that can be used by other R packages. The limma (Linear Models for Microarray Data) R package is one of the most widely used statistical tools for identifying differentially expressed genes. It handles a wide range of experimental designs and data types and applies multiple-testing corrections to P-values to help control the occurrence of false positives. Thus, GEO2R provides a simple interface that allows users to perform R statistical analysis without command line expertise.

Unlike GEO's other DataSet analysis tools, GEO2R does not rely on curated DataSets; it interrogates the original Series Matrix data file directly. This allows a greater proportion of GEO data to be analyzed in a timely manner. However, because this tool can access and analyze almost any GEO Series, regardless of data type and quality, the user must be aware of the points described in the Limitations and caveats section.
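To make "interrogating the Series Matrix file directly" more concrete, the following Python sketch reads the data table out of Series Matrix text. It is an illustration only, not the code GEO2R runs (GEO2R uses GEOquery in R). It assumes the table sits between `!series_matrix_table_begin` and `!series_matrix_table_end` marker lines, is tab-delimited, and has the probe identifier in the first column. The function name, the example accessions and the handling of empty cells as missing values are all made up for illustration.

```python
import csv
import io


def read_series_matrix_table(text):
    """Return (sample_ids, rows) from the data table of Series Matrix text.

    rows maps each probe identifier to a list of values, one per Sample.
    Empty cells are returned as None (an assumption of this sketch).
    """
    table_lines = []
    in_table = False
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            break
        elif in_table:
            table_lines.append(line)

    reader = csv.reader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    header = next(reader)          # first table line: ID column, then Samples
    sample_ids = header[1:]
    rows = {}
    for fields in reader:
        if not fields:
            continue
        rows[fields[0]] = [float(v) if v else None for v in fields[1:]]
    return sample_ids, rows


# Tiny made-up example; the accessions and probe names are placeholders.
EXAMPLE = "\n".join([
    '!Series_geo_accession\t"GSE0000"',
    "!series_matrix_table_begin",
    '"ID_REF"\t"GSM1"\t"GSM2"',
    '"probe_a"\t7.5\t8.25',
    '"probe_b"\t\t3.0',
    "!series_matrix_table_end",
])

sample_ids, rows = read_series_matrix_table(EXAMPLE)
print(sample_ids, rows)
```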

How to use

Enter a Series accession number

If you followed a link from a Series record, the GEO accession box will already be populated. Otherwise, enter a Series accession number in the box, e.g., GSE25724. If the Series is associated with multiple Platforms, you will be asked to select the Platform of interest.

Define Sample groups

In the Samples panel, click 'Define groups' and enter names for the groups of Samples you plan to compare, e.g., test and control. Up to 10 groups can be defined. At least two groups must be defined in order to perform the test. Groups can be removed using the [X] feature next to the group name.

Assign Samples to each group

Screenshot of GEO2R samples table

To assign Samples to a group, highlight the relevant Sample rows. Multiple rows may be highlighted either by dragging the cursor over contiguous Samples or by using the Ctrl or Shift keys. When the relevant Samples are highlighted, click the group name to assign those Samples to the group. Repeat for each group. Not all Samples in a Series need to be selected for the test to work.
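The grouping rules above (at least two and at most 10 groups, with some Samples optionally left unassigned) can be sketched as a small Python data structure. This is only an illustration of the bookkeeping, not how the GEO2R interface stores its state. The function name and the decision to reject a Sample listed in two groups are assumptions of this sketch.

```python
def assign_groups(all_samples, groups):
    """Map every Sample accession to its group name, or None if unassigned.

    all_samples: list of Sample accessions in the Series.
    groups: dict of group name -> list of Sample accessions.
    """
    if not 2 <= len(groups) <= 10:
        raise ValueError("define between 2 and 10 groups")

    membership = {sample: None for sample in all_samples}
    for name, members in groups.items():
        for sample in members:
            if sample not in membership:
                raise KeyError(f"{sample} is not part of this Series")
            if membership[sample] is not None:
                # Assumption: a Sample belongs to at most one group.
                raise ValueError(f"{sample} is already in {membership[sample]}")
            membership[sample] = name
    return membership


# Placeholder accessions; GSM5 is deliberately left out of the comparison.
membership = assign_groups(
    ["GSM1", "GSM2", "GSM3", "GSM4", "GSM5"],
    {"test": ["GSM1", "GSM2"], "control": ["GSM3", "GSM4"]},
)
selected = [s for s, g in membership.items() if g is not None]
print(membership, selected)
```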

Use the Sample metadata columns to help determine which Samples belong to which group. The table is populated with Accession, Title, Source name and individual Characteristics fields from the Sample records. You can change which fields are displayed using the Columns box at the upper right corner of the table, and the columns can be sorted by clicking the table headers.

Perform the test

After Samples have been assigned to groups, click [Top 250] to run the test with default parameters.

Alternatively, you can use features in the other tabs to first assess the Sample value distributions or to edit the default test parameters. For example, you can select an alternative P-value adjustment method in the Options tab, then go back to the GEO2R tab and click [Top 250] to run the test with the revised parameters. Details regarding each edit option are provided in the Edit options and features section below.

Interpret the results table

Screenshot of GEO2R results table

Results are presented in the browser as a table of the top 250 genes ranked by P-value. Genes with the smallest P-value are the most significant. Click on a row to reveal the gene expression profile graph for that gene. Each red bar in the graph represents the expression measurement extracted from the value column of the original submitter-supplied Sample record. The Sample accession numbers and group names are listed along the bottom of the chart.

Use the Select columns feature to modify which data and annotation columns are included in the table. Information about the meaning of the data columns is provided in the Summary statistics section.

If you want to edit the test parameters, you can do so in the Options tab, then go back to the GEO2R tab and click Recalculate to apply the edits.

To see more than the top 250 results, or if you want to save the results, the complete results table may be downloaded using the Save all results button. The downloaded files are tab-delimited and suitable for opening in a spreadsheet application such as Excel.
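As an alternative to a spreadsheet, a downloaded tab-delimited results file can be filtered in a few lines of Python. The sketch below assumes the file has a header row containing the `adj.P.Val` and `P.Value` columns described in the Summary statistics section. The `ID` and `logFC` columns in the example, the function names and the 0.05 cutoff are illustrative assumptions, not GEO2R defaults.

```python
import csv
import io


def load_results(text):
    """Parse tab-delimited results text into a list of dict rows."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["adj.P.Val"] = float(row["adj.P.Val"])
        row["P.Value"] = float(row["P.Value"])
        rows.append(row)
    return rows


def significant(rows, cutoff=0.05):
    """Keep rows whose adjusted P-value is below cutoff, smallest P-value first."""
    kept = [r for r in rows if r["adj.P.Val"] < cutoff]
    return sorted(kept, key=lambda r: r["P.Value"])


# Made-up rows standing in for a downloaded file.
EXAMPLE = (
    "ID\tadj.P.Val\tP.Value\tlogFC\n"
    "probe_a\t0.2\t0.05\t0.4\n"
    "probe_b\t0.01\t0.0001\t2.1\n"
    "probe_c\t0.03\t0.002\t-1.5\n"
)

for row in significant(load_results(EXAMPLE)):
    print(row["ID"], row["adj.P.Val"])
```

For a file on disk, the same functions can be applied to the text returned by reading the file.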

Edit options and features

Value distribution

This feature allows you to calculate and view the distribution of the values for the Samples you have selected. Values are the original submitter-supplied data upon which GEO2R calculations are performed. Viewing the distribution is important for determining whether your selected Samples are suitable for comparison; see Limitations and caveats for more information. Median-centered values generally indicate that the data are normalized and cross-comparable.

Value distributions may be viewed graphically as a box plot. The graphic can be saved by right-clicking on the image. Alternatively, the distribution can be exported as a tab-delimited number summary table.
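The kind of number summary that underlies a box plot can be reproduced with the Python standard library. The sketch below computes a minimum, quartiles and maximum for each Sample, and the gap between the largest and smallest Sample medians as a rough indicator of median-centering. The quartile method and the function names are choices made for this sketch, so the figures may differ slightly from those in a GEO2R export.

```python
from statistics import median, quantiles


def number_summary(values):
    """Return (min, Q1, median, Q3, max) for one Sample's values."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def median_range(samples):
    """Return the gap between the largest and smallest Sample medians.

    A gap that is small relative to the spread of the values suggests the
    Samples are roughly median-centered.
    """
    medians = [median(values) for values in samples.values()]
    return max(medians) - min(medians)


# Made-up log-scale values for two placeholder Samples.
samples = {
    "GSM_A": [6.1, 7.0, 7.9, 8.4, 9.2],
    "GSM_B": [6.0, 7.1, 8.0, 8.5, 9.1],
}
for name, values in samples.items():
    print(name, number_summary(values))
print("median range:", median_range(samples))
```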

Options

Apply adjustment to the P-values: Limma provides several P-value adjustment options. These adjustments, also called multiple-testing corrections, attempt to correct for the occurrence of false positive results. The Benjamini & Hochberg false discovery rate method is selected by default because it is the most commonly used adjustment for microarray data and provides a good balance between discovery of statistically significant genes and limitation of false positives. If you want to change the adjustment method, go to the Options tab and select another method. References for each method are provided below. The adjusted P-values are listed in the Adj P-value column of the results table.
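To show what the default adjustment does, here is a minimal Python sketch of the Benjamini & Hochberg procedure. It is written for illustration and is not the code GEO2R runs. Each raw P-value is scaled by the number of tests divided by its rank, and the scaled values are then forced to be monotone so that a larger raw P-value never receives a smaller adjusted value.

```python
def adjust_bh(pvalues):
    """Return Benjamini & Hochberg adjusted P-values in the input order."""
    n = len(pvalues)
    # Visit P-values from largest to smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position                  # 1-based rank in ascending order
        scaled = pvalues[index] * n / rank
        running_min = min(running_min, scaled)
        adjusted[index] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 6) for p in adjust_bh(raw)])   # [0.02, 0.04, 0.04, 0.02]
```

In this example the two smallest raw P-values both end up at 0.02, because the monotonicity step caps 0.01 x 4 / 2 and 0.005 x 4 / 1 at the same value.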

Apply log transformation to the data: The GEO database accepts a variety of data value types, including logged and unlogged data. Limma expects data values to be in log space. To address this, GEO2R has an auto-detect feature that checks the values of selected Samples and automatically performs a log2 transformation on values determined not to be in log space. Alternatively, the user can select Yes to force log2 transformation, or No to override the auto-detect feature. The auto-detect feature only considers Sample values that have been assigned to a group, and applies the transformation in an all-or-none fashion.
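The three settings (auto-detect, Yes, No) and the all-or-none behaviour can be sketched in Python as follows. The detection rule used here, which treats the data as unlogged when the 99th percentile of the pooled values exceeds 100, is a hypothetical heuristic chosen for illustration; the rule GEO2R actually applies can be checked in the R script tab. Function names, the threshold and the choice to return None for values that cannot be logged are likewise assumptions of this sketch.

```python
import math


def looks_unlogged(values, high=100.0):
    """Hypothetical rule: unlogged if the 99th percentile exceeds `high`."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return ordered[index] > high


def prepare(samples, mode="auto"):
    """Apply log2 to all Samples or to none.

    samples: dict of Sample accession -> list of values, restricted to
    Samples that have been assigned to a group.
    mode: "auto", "yes" (force log2) or "no" (never transform).
    """
    if mode == "auto":
        pooled = [v for values in samples.values() for v in values if v is not None]
        transform = looks_unlogged(pooled)
    elif mode in ("yes", "no"):
        transform = mode == "yes"
    else:
        raise ValueError("mode must be 'auto', 'yes' or 'no'")

    if not transform:
        return samples
    # Assumption: missing or non-positive values become None after log2.
    return {
        name: [math.log2(v) if v is not None and v > 0 else None for v in values]
        for name, values in samples.items()
    }


raw_counts = {"GSM1": [8, 1024, 256], "GSM2": [16, 512, 64]}
print(prepare(raw_counts))
```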

Category of Platform annotation to display on results: Select which category of annotation to display on results. Gene annotations are derived from the corresponding Platform record. Two types of annotation are possible:

NCBI generated annotation is available for many records. These annotations are derived by extracting stable sequence identification information from the Platform and periodically querying against the Entrez Gene and UniGene databases to generate consistent and up-to-date annotation. Gene symbol and Gene title annotations are selected by default. Other categories of NCBI generated annotation include GO terms and chromosomal location information.

Submitter supplied annotation is available for all records. These represent the original Platform annotations provided by the submitter. Note that there is a lot of diversity in the style and content of submitter supplied annotations and they may not have been updated since the time of submission.

Profile graph

This tab allows you to view a specific gene expression profile graph by entering the corresponding identifier from the ID column of the Platform record. This feature does not perform any calculations; it merely displays the expression values of the gene across Samples. Sample groups do not need to be defined for this feature to work.

R script

This tab prints the R script used to perform the calculation. This information can be saved and used as a reference for how results were calculated.

Limitations and caveats

The GEO database is a public repository that archives thousands of original high-throughput functional genomic studies submitted by the scientific community. These studies represent a large diversity of experimental types and designs, and contain data that are processed and normalized using a wide variety of methods. GEO2R can access and analyze almost any GEO Series, regardless of data type and quality, so the user must be aware of the following limitations and caveats.

Check that Sample values are comparable: GEO2R operates on Series Matrix files which contain data extracted directly from the VALUE column of Sample tables. Submitters are asked to supply normalized data in the VALUE column, rendering the Samples cross-comparable. The majority of GEO data do conform to this rule. GEO2R applies no further processing other than to perform a log2 transformation on values determined not to be in log space (see Options section). However, some studies, such as dual channel loop design data, may generate values that do not have a common reference and are not directly comparable. Some studies may contain Sample value data that are not normalized, or have a design such that the Samples were never intended to be directly compared. Yet other studies do not have sufficient replicate Samples to perform a robust statistical analysis. Users should examine the original Series to understand the experimental design, and check the 'Data processing' field or VALUE description in the original Sample records for information on what the values represent. The box plot feature on the Value distribution tab is provided to help users assess whether the distributions of values across Samples are median-centered, which is generally indicative that the data are normalized and cross-comparable.

Data type restriction: GEO2R operates on data in Series Matrix files which contain data extracted directly from the VALUE column of Sample tables. Some categories of GEO Samples do not have data tables (e.g., high-throughput sequencing or genome tiling arrays) and thus cannot be analyzed using GEO2R.

Within-Series restriction: GEO2R operates on Series Matrix files. Thus, analyses are restricted to Samples that occur within one Series; it is not possible to perform cross-Series comparisons.

Failed jobs: Occasionally, a GEO2R analysis will fail because some aspect of the input data is not compatible with the GEOquery or limma packages. In such cases, native Bioconductor errors are reported.

255 Sample limit: GEO2R operates on data in Series Matrix files. These files contain a maximum of 255 Samples, so Series containing more than 255 Samples cannot currently be examined using GEO2R.

10 minute timeout: GEO2R currently has a 10 minute cutoff imposed on job processing. If the Series you are attempting to analyze has a large number of Samples and/or genes, the analysis may not run to completion.

More information and references

Summary statistics

GEO2R provides the following summary statistics as generated by the limma topTable function. More information about each statistic is provided in chapter 10 of the limma users guide.

adj.P.Val: P-value after adjustment for multiple testing. This column is generally recommended as the primary statistic by which to interpret results. Genes with the smallest P-values will be the most reliable.
P.Value: Raw P-value.
t: Moderated t-statistic (only available when two groups of Samples are defined).
B: B-statistic, or log-odds that the gene is differentially expressed (only available when two groups of Samples are defined).
logFC: Log2-fold change between two experimental conditions (only available when two groups of Samples are defined).
F: Moderated F-statistic, which combines the t-statistics for all the pair-wise comparisons into an overall test of significance for that gene (only available when more than two groups of Samples are defined).
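
As a worked illustration of the logFC column: for a simple two-group comparison with no additional covariates or weights, the log2-fold change from a linear model equals the difference between the group means of the log2 values. The Python sketch below computes that difference and converts it back to a linear fold change. The direction of the subtraction (test minus control) and the function names are assumptions of this sketch; in practice the sign depends on how the groups are ordered.

```python
from statistics import mean


def log_fold_change(test_values, control_values):
    """Difference of group means of log2 values (test minus control)."""
    return mean(test_values) - mean(control_values)


def fold_change(log_fc):
    """Convert a log2-fold change back to a linear-scale ratio."""
    return 2.0 ** log_fc


# Made-up log2 expression values for one gene.
test = [9.0, 11.0]
control = [7.0, 9.0]
lfc = log_fold_change(test, control)
print(lfc, fold_change(lfc))   # 2.0 4.0
```

A logFC of 2.0 therefore corresponds to a four-fold higher mean signal in the test group, and a logFC of -1.0 to a two-fold lower one.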

General references

  • Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 3.
  • Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York, pages 397-420.
  • Davis, S., and Meltzer, P. S. (2007). GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 23(14), 1846-1847.
  • R documentation: Table of Top Genes from Linear Model Fit

Adjustment test references

  • R documentation: Adjust P-values for Multiple Comparisons
  • Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289-300.
  • Benjamini, Y., and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29, 1165-1188.
  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65-70.
  • Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383-386.
  • Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-803.
  • Shaffer, J. P. (1995). Multiple hypothesis testing. Annual Review of Psychology, 46, 561-576.
  • Sarkar, S. (1998). Some probability inequalities for ordered MTP2 random variables: a proof of Simes conjecture. Annals of Statistics, 26, 494-504.
  • Sarkar, S., and Chang, C. K. (1997). Simes' method for multiple hypothesis testing with positively dependent test statistics. Journal of the American Statistical Association, 92, 1601-1608.
  • Wright, S. P. (1992). Adjusted P-values for simultaneous inference. Biometrics, 48, 1005-1013.

Last modified: July 26, 2016