Submitting high-throughput sequence data to GEO and SRA
Introduction
GEO accepts various categories of functional genomic sequence data generated by next-generation sequencing methodologies (e.g., Illumina, 454 Life Sciences, and Applied Biosystems SOLiD).
We accept data for studies that examine gene expression, gene regulation, epigenetics, methylation status, or other studies where
measuring molecular abundance is central to the experimental design (see below for links to example records).
Data provision and standards
The GEO database supports and encourages provision of all elements
of a study with a view to facilitating comprehensive interpretation of an experiment (see draft MINSEQE proposal).
GEO sequence submission procedures are designed to encourage provision of all the following elements:
- thorough descriptions of the biological samples under investigation, and the procedures to which they were subjected
- thorough descriptions of the protocols used to generate and process the data
- processed data files (e.g., filtered sequence reads, detection counts, alignment files, peak files, etc.)
- original short read format sequence files, which will be uploaded to NCBI's Short Read Archive sequence database
Administration
All standard GEO administration and processing procedures apply to sequence submissions. These include:
- Unique and stable GEO accession numbers are issued to experiments; these accessions can be cited in manuscripts
- GEO accession numbers are typically issued within 5 business days after completion of submission
- Data can be held private until publication
- Reviewers can have password-controlled access to private records
- Submitters can update their records at any time
More information on these aspects is provided in our FAQ.
Categories of sequence submissions processed by GEO
GEO accepts:
Studies concerning gene expression, gene regulation, epigenetics, or other functional genomic studies.
Examples include:
- RNA-seq (example)
- small RNA discovery and profiling (example)
- ChIP-seq (example)
- methyl-seq (example)
- bisulfite sequencing (example)
- digital gene expression tag profiling (example)
- traditional SAGE (see Web submission instructions)
If you have questions about whether GEO can accept your data type, please do not hesitate to contact us at geo@ncbi.nlm.nih.gov.
GEO does not process:
- whole genome sequencing
- metagenomic sequencing
- resequencing projects
- survey sequencing, whole exome, etc.
For information on how to submit these types of data to NCBI, please refer to the SRA home page or contact the Short Read Archive database at sra@ncbi.nlm.nih.gov.
Important: Human subject data
For all studies involving human subjects, it is the submitter's responsibility
to ensure that the data and files supplied to GEO protect participant privacy in accordance with
all applicable laws, regulations and institutional policies. Make sure to remove any direct personal
identifiers from your submission. These identifiers are listed in
http://privacyruleandresearch.nih.gov/research_repositories.asp, footnote 1.
Deposit instructions
Sequence data may be submitted using one of the following formats:
- GEOarchive spreadsheet format: Recommended for most submissions. Full instructions are provided in the section below.
- SOFT format: Suitable if your metadata are already in a database and you can generate and export data in SOFT plain text format. » Complete instructions
GEOarchive spreadsheet format
GEOarchive has three components:
[1] Metadata spreadsheet
[2] Processed data files
[3] Raw data files
Details about each component are described in the following table:
| Metadata spreadsheet |
'Metadata' refers to descriptive information and protocols for the overall experiment and individual Samples,
and references to external processed and raw data file names.
Information is supplied by completing all fields of a metadata template spreadsheet:
Illumina metadata spreadsheet (template and example)
454 metadata spreadsheet (template and example)
Guidelines on the content of each field are provided within the spreadsheets.
NOTE: These templates may change slightly in coming months, so please download the template immediately before you intend to use it.
|
| Processed data file(s) |
Requirements for processed data files are not yet fully standardized and will depend on the nature of the experiment.
Multiple types and levels of processed data files per Sample can be accepted. For example, a ChIP-seq Sample
would typically have alignment files (e.g., bed format) and peak files (e.g., wig format), while a miRNA profiling experiment would typically have filtered,
unique sequence reads with counts and mappings. The file names should be referenced as appropriate in the Metadata spreadsheet. Please consider
including a 'readme' file with your submission that describes the content of each column in the processed data table files.
|
| Raw data files |
The raw data files should be the original short read format sequence and quality files.
The names of these files should be referenced as appropriate in the Metadata spreadsheet.
It is very important to provide raw data files with your submission.
These files will be uploaded to NCBI's Short Read Archive
sequence database, which has tools to help users view, query, browse and download sequence data. Also, without raw data your
submission may not meet the requirements of the journal you are publishing with. We understand that the volumes
of raw data can be very large and difficult to transfer. Please contact us
if you need advice on this matter.
Barcode data:
At this time, we prefer that submitters split run files so that each
barcoded sample ends up with a dedicated run file based on the barcode sequences.
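As an illustration only (this is not a GEO-provided tool), splitting a run file by barcode could be sketched as follows. The sketch assumes the barcode is the first few bases of each read in a 4-line FASTQ file; the function names, the `barcode_to_sample` mapping, and the output file naming are hypothetical. Reads are written out whole, with the barcode left in place.

```python
from collections import defaultdict


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) for each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        sequence = handle.readline()
        separator = handle.readline()
        quality = handle.readline()
        yield tuple(line.rstrip("\n") for line in (header, sequence, separator, quality))


def split_by_barcode(handle, barcode_to_sample):
    """Group records by the barcode found at the start of each read.

    barcode_to_sample is a hypothetical mapping such as {"ACGT": "sampleA"}.
    Reads that match no barcode are collected under "unassigned".
    """
    groups = defaultdict(list)
    for record in read_fastq(handle):
        sequence = record[1]
        sample = "unassigned"
        for barcode, name in barcode_to_sample.items():
            if sequence.startswith(barcode):
                sample = name
                break
        groups[sample].append(record)
    return dict(groups)


def write_groups(groups, prefix):
    """Write one FASTQ file per sample, named <prefix>_<sample>.fastq (assumed naming)."""
    paths = []
    for sample, records in sorted(groups.items()):
        path = f"{prefix}_{sample}.fastq"
        with open(path, "w") as out:
            for record in records:
                out.write("\n".join(record) + "\n")
        paths.append(path)
    return paths
```

This version holds all records in memory, which is only practical for small files; a real run file would need a streaming approach that writes each record as it is read.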
Accepted file types and packaging instructions are as follows:
Technology | Accepted file type(s) | Notes
Illumina (please choose one of these three options) |
.srf (preferred option) |
The illumina2srf utility is available in the Staden io_lib package.
To produce an SRF file for a lane's worth of data, change the working directory to the run folder and run:
illumina2srf -R -P -N <run>:%l:%t: -n %x:%y -o <center_name>_<run>_<lane>.srf s_<lane>_*_seq.txt
where <center_name> is the short name of the sequencing center or other individual name, <run> is the flowcell name for the run
(for example 080117_EAS56_0068), and <lane> is the desired lane.
Please produce one SRF file per lane.
Please do not compress the SRF files as the format is nearly optimal in terms of compression. |
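When several lanes need converting, the command quoted above can be assembled per lane. The following is a minimal sketch that only builds the argument list and does not run anything; the function name and the example center name are hypothetical, and the flags are copied from the command line above rather than checked against any particular io_lib version.

```python
import glob
import os


def build_illumina2srf_command(center_name, run, lane, run_folder="."):
    """Return the illumina2srf argument list for one lane.

    File names are returned relative to run_folder, so the command is meant
    to be executed with the run folder as the working directory.
    """
    pattern = os.path.join(run_folder, f"s_{lane}_*_seq.txt")
    seq_files = sorted(os.path.basename(path) for path in glob.glob(pattern))
    output_name = f"{center_name}_{run}_{lane}.srf"
    return [
        "illumina2srf", "-R", "-P",
        "-N", f"{run}:%l:%t:",
        "-n", "%x:%y",
        "-o", output_name,
        *seq_files,
    ]


if __name__ == "__main__":
    # One command per lane, as requested above ("CENTER" is a made-up center name).
    for lane in range(1, 9):
        print(" ".join(build_illumina2srf_command("CENTER", "080117_EAS56_0068", lane)))
```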
.fastq
(see example)
|
Contains base calls and phred-like quality scores per read.
It is important that fastq files contain all the original data, i.e., the reads should be:
- complete (not trimmed subsequences), and
- unfiltered (e.g., it is not sufficient to submit only sequences that align)
Please generate one fastq file per run.
|
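Before transfer, a quick structural check of a fastq file can catch obvious problems. The sketch below is an assumption-laden illustration, not a GEO validation tool: it assumes plain 4-line records, and the function name is hypothetical. It counts records, confirms that each sequence has a quality string of the same length, and reports the set of read lengths seen.

```python
def check_fastq(handle):
    """Return (record_count, read_lengths) for a 4-line-per-record FASTQ stream.

    Raises ValueError on the first malformed record.
    """
    count = 0
    read_lengths = set()
    while True:
        header = handle.readline()
        if not header:  # end of file
            break
        if not header.strip():  # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"record {count + 1}: header does not start with '@'")
        if not separator.startswith("+"):
            raise ValueError(f"record {count + 1}: separator line does not start with '+'")
        if len(sequence) != len(quality):
            raise ValueError(f"record {count + 1}: sequence and quality lengths differ")
        count += 1
        read_lengths.add(len(sequence))
    return count, read_lengths
```

More than one read length in the result may hint that reads were trimmed, but it is not proof either way; the check cannot tell whether reads were filtered out.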
_qseq
(see example)
|
Contains base calls and phred-like quality scores per read.
It is important to package these files in the form: <all data from one lane>.tar.gz
|
_seq.txt _prb.txt (see example)
|
_seq.txt contains base calls per read
_prb.txt contains per channel pseudo-phred quality scores
Provision of _sig2.txt text files is optional; they contain phase-corrected signal intensity values.
It is important to package these files in the form: <all data from one lane>.tar.gz
We cannot process these data if they are packaged incorrectly.
|
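The per-lane packaging described for the _qseq and _seq.txt/_prb.txt options can be sketched as below. This is one possible approach under stated assumptions: files for a lane are assumed to be named `s_<lane>_*` (extending the `s_<lane>_*_seq.txt` pattern used earlier), the `_qseq.txt` suffix is a guess at the full file name, and the function name and archive name are hypothetical.

```python
import glob
import os
import tarfile

# Suffixes for the file types named in the table; "_qseq.txt" is an assumed full suffix.
LANE_SUFFIXES = ("_seq.txt", "_prb.txt", "_sig2.txt", "_qseq.txt")


def package_lane(run_folder, lane, archive_path):
    """Bundle all matching files for one lane into a single .tar.gz archive.

    Returns the list of file names stored in the archive.
    """
    pattern = os.path.join(run_folder, f"s_{lane}_*")
    members = sorted(p for p in glob.glob(pattern) if p.endswith(LANE_SUFFIXES))
    if not members:
        raise FileNotFoundError(f"no files found for lane {lane} in {run_folder}")
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in members:
            # Store bare file names so the archive unpacks into a flat folder.
            tar.add(path, arcname=os.path.basename(path))
    return [os.path.basename(p) for p in members]
```

A call such as `package_lane("/path/to/run_folder", 1, "lane1.tar.gz")` would produce one archive holding only lane 1 files.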
454 |
.sff |
Contains flowgram (base call, phred quality score, flow value).
The .sff files should reflect the sequencing run setup.
If the entire picotitre plate was used, then one .sff file per run should be submitted.
If the picotitre plate was divided into two or more regions, then a .sff file for each region should be submitted.
If a .sff file contains more than one run, or more than one region in the run,
please break up this file into constituent parts using the sfffile utility from the 'Off Rig' software package provided by Roche.
The read names found in the .sff file are meaningful and reflect the addressing scheme for the picotitre plate as well as a globally unique run id.
Please do not rewrite these names, as this addressing information would be lost.
The .sff file format is nearly optimal in terms of footprint, so there is little to be gained by compressing the files further.
Therefore, please provide .sff files uncompressed.
Your sequencing data may have been produced by the 454 contract sequencing center (454MSC).
Please ask 454MSC to provide .sff files for your project. |
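A quick way to confirm that a file is an uncompressed .sff file is to inspect its first bytes. The sketch below assumes the publicly documented SFF common header layout (big-endian fields beginning with the magic value ".sff"); the function name is hypothetical, and the field layout should be checked against the format specification before relying on it.

```python
import struct

SFF_MAGIC = 0x2E736666  # the ASCII bytes ".sff"
# magic, version, index_offset, index_length, number_of_reads,
# header_length, key_length, number_of_flows_per_read, flowgram_format_code
HEADER_FORMAT = ">I4sQIIHHHB"


def read_sff_header(data):
    """Parse the fixed part of an SFF common header from the first bytes of a file."""
    if data[:2] == b"\x1f\x8b":
        raise ValueError("file looks gzip-compressed; .sff files should be uncompressed")
    size = struct.calcsize(HEADER_FORMAT)
    if len(data) < size:
        raise ValueError("not enough data for an SFF header")
    (magic, version, index_offset, index_length, number_of_reads,
     header_length, key_length, flows_per_read, format_code) = struct.unpack(
        HEADER_FORMAT, data[:size])
    if magic != SFF_MAGIC:
        raise ValueError("missing '.sff' magic number")
    return {
        "version": version,
        "number_of_reads": number_of_reads,
        "header_length": header_length,
        "key_length": key_length,
        "number_of_flows_per_read": flows_per_read,
        "flowgram_format_code": format_code,
    }
```

In practice the input would come from something like `open(path, "rb").read(64)`. The check says nothing about whether a file holds more than one run or region.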
AB SOLiD |
.srf |
Instructions for converting SOLiD system reads to .srf files using solid2srf are provided on the
Applied Biosystems solid2srf site.
Please do not compress the SRF files as the format is nearly optimal in terms of compression. |
HeliScope |
to be determined |
Please contact us
for instructions if you want to submit HeliScope data |
Data submission
Zip, rar, or tar all the files described above into a single archive named using your GEO user ID,
e.g. <GEO user ID>_files.zip, and then transfer it to us using one of the FTP methods outlined below.
After transferring files, please send an e-mail to geo@ncbi.nlm.nih.gov
with the following information:
- GEO account user name;
- Name(s) of the archive file(s) deposited;
- Public release date (up to 1 year from now).
Files we cannot identify will be removed from our FTP site without being processed.
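The archive and notification steps above could be scripted along these lines. This is a minimal sketch, not an official client: the function names and the e-mail wording are hypothetical, only the three requested items come from the list above, and the FTP transfer itself is left out.

```python
import os
import zipfile


def build_submission_archive(user_id, file_paths, out_dir="."):
    """Zip the given files into <user_id>_files.zip and return the archive path."""
    archive_path = os.path.join(out_dir, f"{user_id}_files.zip")
    # ZIP_STORED adds no further compression to the stored files.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for path in file_paths:
            archive.write(path, arcname=os.path.basename(path))
    return archive_path


def notification_email(user_name, archive_names, release_date):
    """Return an e-mail body listing the three items requested above."""
    return "\n".join([
        f"GEO account user name: {user_name}",
        f"Archive file(s) deposited: {', '.join(archive_names)}",
        f"Public release date: {release_date}",
    ])
```

The release date is passed through as plain text; checking that it falls within the allowed one-year window is left to the caller.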
These submission procedures and requirements will be refined in coming months.
However, the accession numbers we assign to your data are stable and will not change.
If you have any suggestions or concerns regarding any of these issues, please
email us at geo@ncbi.nlm.nih.gov.