Submitting high-throughput sequence data to GEO
- Assembling your submission
- Metadata spreadsheetNew version available!
- Processed data files
- Raw data files
- Uploading your submission
- General Information
Assembling your submission Back to top
GEO accepts next generation sequence data that examine quantitative gene expression, gene regulation, epigenomics or other aspects of functional genomics using methods such as RNA-seq, miRNA-seq, ChIP-seq, RIP-seq, HiC-seq, methyl-seq, etc. We process all components of your study, including the samples, project description, processed data files, and we submit the raw data files to the Sequence Read Archive (SRA) on your behalf.
Once you have determined that GEO is an appropriate resource for your data type (see categories of data we do and do not accept), data should be submitted using the spreadsheet-based submission method described below. Alternatively, if your metadata are already in a database, and you can generate and export data in SOFT text format, you may prefer to use SOFT format.
There are three required components for the spreadsheet-based submission method:
- a metadata spreadsheet
- processed data files
- raw data files
Details about each component are described below.
-
Metadata spreadsheet (Updated May 24, 2023)
Metadata refers to descriptive information about the overall study, individual samples, all protocols, and references to processed and raw data file names. Information is supplied by completing all fields of a metadata template spreadsheet. Guidelines on the content of each field are provided within the spreadsheet.
-
Processed data files
Processed data are a required part of GEO submissions. The final processed data are defined as the data on which the conclusions in the related manuscript are based. We do not expect standard alignment files (e.g., BAM, SAM, BED) as processed data since conclusions are expected to be based on further-processed data. When standard alignments are the only processed data available, please write to us to inquire about whether your data are suitable for submission to GEO. Requirements for processed data files are not fully standardized and will depend on the nature of the experiment:
-
Expression profiling analysis usually generates quantitative data for features of interest.
Features of interest may be genes, transcripts, exons, miRNA, or some other genetic entity.
Two levels of data are often generated:
- raw counts of sequencing reads for the features of interest, and/or
- normalized abundance measurements, e.g., output from Cufflinks, Cuffdiff, DESeq, edgeR, etc.
- Either or both of these data types may be supplied as processed data. They may be formatted either as a matrix table or individual files for each sample. Provide complete data with values for all features (e.g., genes) and all samples, not only lists of differentially-expressed genes.
- ChIP-Seq data might include peak files with quantitative data, tag density files, etc. Common formats include WIG, bigWig, bedGraph.
Features (e.g., genes, transcripts) in processed data files should be traceable using public accession numbers or chromosome coordinates. The reference assembly used (e.g., hg19, mm9, GCF_000001405.13) should be provided in the metadata spreadsheet.
A description of the format and content of processed data files should be provided in the metadata spreadsheet data processing fields.
If you provide WIG, bedGraph, GFF, or GTF files, please refer to the UCSC file format FAQ for format requirements.
-
Expression profiling analysis usually generates quantitative data for features of interest.
Features of interest may be genes, transcripts, exons, miRNA, or some other genetic entity.
Two levels of data are often generated:
-
Raw data files
Raw data are a required part of GEO submissions. We will submit raw data files to SRA for you. The raw data files should be the original files containing reads and quality scores, as generated by the sequencing instrument (unless the raw files are barcoded/multiplexed, see below for further instructions).
Raw Data File Formats: Acceptable file formats include FASTQ, as well as other formats described in the SRA File Format Guide. Files that do not conform to supported format requirements will be deleted from our systems.
Barcode/Multiplexed Data: Whenever possible, we do require that files be demultiplexed so that each barcoded sample ends up with a dedicated run file. However, for single-cell sequencing studies (e.g. 10x Genomics, Drop-Seq, InDrops), we can support the submission of multiplexed files in cases where these files are required for reanalysis in your pipeline, or when demultiplexing would create an unmanageable number of files.
Paired-end Experiments: We usually expect 2 files per run (4 files per run when sequences and qualities are included in separate files). If submitting FASTQ files, please submit the original unedited files from the Illumina pipeline. Edited files may not be processed correctly by SRA.
MD5 Checksums: We recommend that submitters provide MD5 checksums for their raw data files. The checksums are used to verify file integrity. Checksums can be calculated using the following methods:
- Unix: md5sum <file>
- OS X: md5 <file>
- Windows: Application required. Many are available for free download.
Data File Compression: Individual files can be compressed to speed transfer, but this is not required. Acceptable compression formats are gzip and bzip2 (i.e. files ending with a .gz or .bz2 extension). Never compress binary files (e.g., BAM, bigWig, bigBed), and DO NOT upload ZIP archives (files with a .zip extension).
Uploading your submission Back to top
There are two steps for submission:
1. Transfer all your files to the GEO FTP server.
If your files are >5 terabytes, please contact us and do not transfer your files until you hear back from us.
2. After the FTP transfer is complete, notify GEO using the Submit to GEO web form
General Information Back to top
Data provision and standards
GEO sequence submission procedures are designed to encourage provision of MINSEQE elements:
|
Administration
All standard GEO administration and processing procedures apply to sequence submissions. These include:
More information on these aspects is provided in our FAQ. |
Categories of sequence submissions processed by GEO Back to top
GEO accepts | GEO does not accept |
Studies concerning quantitative gene expression, gene regulation, epigenetics, or other functional genomic studies. Examples include:
If you have questions about whether GEO can accept your data type, please e-mail GEO. |
For more information about submitting data to NCBI, please refer to the Submission Wizard. |
---|