Submitting high-throughput sequence data to GEO



Introduction

GEO archives various categories of functional genomic sequence data generated by next-generation sequencing methodologies. We accept data for studies that examine gene expression profiling, gene regulation and epigenetics. To view existing next-generation sequence records in GEO, follow the links to example records for specific data categories, or browse all next-generation sequence studies in GEO DataSets.


Data provision and standards

GEO sequence submission procedures are designed to encourage provision of the following elements:
  • Thorough descriptions of the biological samples under investigation, and procedures to which they were subjected
  • Thorough descriptions of the protocols used to generate and process the data
  • Final processed (or summary) data on which the conclusions in associated manuscripts are based
  • Original raw data files containing sequence reads and quality scores, which will be uploaded to NCBI's Sequence Read Archive (SRA) database

Administration

All standard GEO administration and processing procedures apply to sequence submissions. These include:
  • Unique and stable GEO accession numbers are issued to studies; these accessions can be cited in manuscripts
  • GEO accession numbers are typically issued within 5 business days after completion of submission
  • Data can be held private until publication
  • Reviewers can have password-controlled access to private records
  • Submitters can update their records at any time
More information on these aspects is provided in our FAQ.




Categories of sequence submissions processed by GEO


GEO accepts

Studies concerning quantitative gene expression, gene regulation, epigenetics, or other functional genomic studies.

Examples include:
  • mRNA profiling (example)
  • small RNA profiling (example)
  • ChIP-Seq (example)
  • methyl-Seq (example)
  • bisulfite sequencing (example)
  • digital gene expression tag profiling (example)
  • traditional SAGE (see Web submission instructions)

If you have questions about whether GEO can accept your data type, please do not hesitate to contact us at geo@ncbi.nlm.nih.gov.

GEO does not process

  • transcriptome or transcript assemblies: submit the raw reads to SRA and the assembly data to the Transcriptome Shotgun Assembly Database
  • whole genome sequencing
  • metagenomic sequencing
  • resequencing or copy number projects
  • survey sequencing, whole exome, etc.

For information on how to submit these types of data to NCBI, please refer to the Submit sequence data to NCBI guidelines.


    Important: Human subject data

    For all studies involving human subjects, it is the submitter's responsibility to ensure that the data and files supplied to GEO protect participant privacy in accordance with all applicable laws, regulations and institutional policies. Make sure to remove any direct personal identifiers from your submission. These identifiers are listed at http://privacyruleandresearch.nih.gov/research_repositories.asp.
    If there are patient privacy concerns regarding making data fully public through GEO, please submit to NCBI's dbGaP database. dbGaP has controlled access mechanisms and is an appropriate resource for hosting sensitive patient data.



    Deposit instructions


    Sequence data may be submitted using one of the following formats:

    • GEOarchive spreadsheet format: recommended for most submissions. Full instructions are provided below.

    • SOFT format: suitable if your metadata are already in a database and you can generate and export data in SOFT plain text format. See the complete SOFT format instructions.


    GEOarchive spreadsheet format

    GEOarchive has three components:
    1. Metadata spreadsheet
    2. Processed data files
    3. Raw data files
    Details about each component are described in the following table:

    Metadata spreadsheet

    Metadata refers to descriptive information and protocols for the overall study and individual samples, and references to processed and raw data file names. Information is supplied by completing all fields of a metadata template spreadsheet. Guidelines on the content of each field are provided within the spreadsheet.

    NOTE: The template is frequently updated, so please download it immediately before you intend to use it.

    Processed data files

    The final processed data are defined as the data on which the conclusions in the related manuscript are based. Requirements for processed data files are not yet fully standardized and will depend on the nature of the experiment:
    • Expression profiling data can often be presented as a matrix listing features of interest (such as gene, transcript, exon, miRNA) and normalized abundance measurements (e.g. RPKM) for each Sample.
    • ChIP-Seq data might include tag density files, peak files with quantitative data, etc. Common formats include bed, wig, and tab-delimited text.

    The identifiers used to annotate processed data files should be traceable, either through publicly available identifiers or through chromosome coordinates together with the genome assembly build and version on which the data are based.

    Submitters should provide a thorough description of the format and content of the files.

    If you provide BED, WIG, bedGraph, GFF, or GTF files, please refer to the UCSC file format FAQ for requirements.

    Alignment files (e.g. BAM, SAM) should not be supplied as processed data files.
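
As an illustration of the expression matrix described above, the following is a minimal sketch, not a GEO requirement. It assumes the standard RPKM definition (reads per kilobase of feature per million mapped reads) and, for simplicity, treats the sum of the counts in each Sample as the total number of mapped reads. The function names, gene identifiers, and column layout are hypothetical.

```python
import csv


def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


def write_expression_matrix(handle, feature_lengths, counts_by_sample):
    """Write a tab-delimited matrix: one row per feature, one column per Sample.

    feature_lengths:  {feature_id: length in bp}
    counts_by_sample: {sample_name: {feature_id: raw read count}}
    Assumption: the total mapped reads of a Sample is the sum of its counts
    (must be nonzero).
    """
    samples = sorted(counts_by_sample)
    totals = {s: sum(counts_by_sample[s].values()) for s in samples}
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id"] + samples)
    for feature in sorted(feature_lengths):
        row = [feature]
        for s in samples:
            count = counts_by_sample[s].get(feature, 0)
            value = rpkm(count, feature_lengths[feature], totals[s])
            row.append(f"{value:.3f}")
        writer.writerow(row)
```

Passing an open text file as `handle` produces a file that can be listed as a processed data file; the accompanying description should state which identifiers and genome build the rows refer to.
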

    Raw data files

    We will submit raw data files to NCBI's Sequence Read Archive (SRA) database for you. The raw data files should be the original files containing reads and quality scores, as generated by the sequencing instrument.

    Raw Data File Formats: Accepted file formats and packaging instructions are listed in the table below. Files that do not conform to supported format requirements will be deleted from our systems. More information about accepted raw data file formats is provided in the SRA File Format Guide.

    Barcode/Multiplexed Data: Submitters are required to de-multiplex their raw data files prior to submission so that each barcoded sample ends up with a dedicated run file. Reads should not be trimmed.

    Paired-end Experiments: We usually expect two files per run (four files per run when sequences and qualities are included in separate files). Submitters should provide the average insert size of the molecules sequenced (excluding linkers, adapters, etc.) and the standard deviation of the insert sizes. Include this information in the library construction protocol of the metadata spreadsheet.
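
The average insert size and its standard deviation can be computed from a list of measured insert sizes. The sketch below assumes the sizes (in bp, already excluding linkers and adapters) are available as a Python list; the function name and the wording of the returned sentence are hypothetical, not a format GEO prescribes.

```python
import statistics


def insert_size_summary(insert_sizes_bp):
    """Summarize insert sizes for the library construction protocol field."""
    mean = statistics.mean(insert_sizes_bp)
    # Sample standard deviation; needs at least two measurements.
    stdev = statistics.stdev(insert_sizes_bp)
    return (
        f"Average insert size: {mean:.0f} bp; "
        f"standard deviation: {stdev:.0f} bp"
    )
```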

    MD5 Checksums: We recommend that submitters provide MD5 checksums for their raw data files. The checksums are used to detect errors introduced during FTP transfer. Checksums can be calculated using the following methods:
    • Unix: md5sum <file>
    • OS X: md5 <file>
    • Windows: An application is required. Many are available as free downloads.

    Data File Compression: We recommend that submitters compress (e.g., gzip) their raw data files prior to submission to shorten the FTP transfer time. If files are compressed, provide the MD5 checksums of the compressed files.
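
Where a command-line tool is not available, the compression and checksum steps can also be scripted. The following is a minimal sketch using only the Python standard library; the function names are hypothetical. It compresses a file with gzip and then computes the MD5 checksum of the compressed file, which is the file that would be transferred.

```python
import gzip
import hashlib
import shutil


def gzip_file(src_path):
    """Compress src_path to src_path + '.gz' and return the new path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use constant, which matters for raw sequence files that can be many gigabytes in size.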


    Raw Data File Formats

    Technology, accepted file types, and notes:

    Illumina
    • fastq (see example): Text files with 4 lines per read. Do not combine data from multiple lanes into a single fastq file. There should be 1 fastq file for each lane (2 fastq files for paired-end experiments). Do not submit raw files which use a non-standard Illumina quality scoring system since these files may not be processed correctly. Contact GEO if you have any questions or concerns.
    • Illumina_native_qseq (see example): Tab-delimited text files with 11 columns.
        • Illumina pipeline versions 1.3 and later: the basecalling program Bustard creates a single qseq file for the lane (two files for mate pairs).
        • Illumina pipeline versions prior to 1.3: Bustard creates one qseq file per tile (i.e., multiple files per lane). It is important to package these files in the form: <all data from one lane>.tar.gz. Do not include any non-qseq files. Do not gzip the individual qseq files.
    • scarf (see example): Text file with one read per line. Lines are colon-separated and include read name, sequence, and quality.
    • Illumina_native (seq and prb) (see example): Tab-delimited sequence (_seq.txt) and qualities (_prb.txt) files. A native format generated by Illumina pipeline version 1.3 and earlier. It is important to package these files in the form: <all data from one lane>.tar.gz
    • srf: See the SRA File Format Guide for instructions.

    AB SOLiD
    • SOLiD_native_csfasta and SOLiD_native_qual (see example): AB SOLiD native sequence files (e.g., .csfasta) and quality files (e.g., _QV.qual). Both files must be submitted. Do not tar the files. For paired-end sequencing data, submit mate-pair files (e.g., F3 and R3).
    • srf: See the SRA File Format Guide for instructions.

    454
    • 454_native_seq and 454_native_qual: 454 native sequence files (e.g., .fna or .seq) and quality files (e.g., .qual). Both files must be submitted. Do not tar the files.
    • sff: See the SRA File Format Guide for instructions. Do not compress sff files.
    • fastq: Demultiplexed 454 raw data files from barcoded experiments may be submitted in fastq format.

    HeliScope
    • Helicos_native (see example): Text files similar to fastq format with 4 lines per read. See the SRA File Format Guide.

    Complete Genomics, Ion Torrent, and PacBio
    • N/A: Please contact geo@ncbi.nlm.nih.gov for submission instructions.
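
Because files that do not conform to a supported format are deleted, it can be worth checking a fastq file locally before transfer. The sketch below is a basic sanity check, not the validation GEO or SRA performs: it assumes an uncompressed fastq file with exactly 4 lines per read, a header line starting with "@", a separator line starting with "+", and sequence and quality strings of equal length. The function name is hypothetical.

```python
def count_fastq_reads(path):
    """Count reads in a 4-line-per-read fastq file, raising ValueError on problems."""
    reads = 0
    with open(path, "r", encoding="ascii") as handle:
        while True:
            header = handle.readline()
            if header == "":
                break  # clean end of file
            sequence = handle.readline().rstrip("\r\n")
            separator = handle.readline()
            quality = handle.readline().rstrip("\r\n")
            record = reads + 1
            if not header.startswith("@"):
                raise ValueError(f"record {record}: header does not start with '@'")
            if not separator.startswith("+"):
                raise ValueError(f"record {record}: third line does not start with '+'")
            if not sequence or len(sequence) != len(quality):
                raise ValueError(f"record {record}: sequence and quality lengths differ")
            reads += 1
    return reads
```

A truncated file, for example one cut off during an interrupted copy, fails one of these checks at its last record.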


    Data submission


    Zip, rar, or tar all the files described above into a single archive named using your GEO user ID (), e.g. _files.zip, and then transfer to us using one of the FTP methods outlined below. Alternatively, you can create a directory on the FTP site and transfer individual files. The directory should be named using your GEO user ID (/).

    Do not transfer files unless you are confident that you have a complete submission that includes all required raw data files, processed data files and metadata spreadsheet. We do not have the resources to store incomplete submissions. Incomplete submissions will be deleted from our systems.
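
The packaging step can be scripted so that an archive is only created when all three components are present. This is a minimal sketch using the Python standard library; the function name, the `user_id` value, and the file paths are hypothetical, and the archive name follows the user-ID-plus-"_files.zip" pattern described above. Files are stored without further compression on the assumption that raw data files are already gzipped.

```python
import os
import zipfile


def build_submission_archive(user_id, metadata_path, processed_paths,
                             raw_paths, out_dir="."):
    """Create <user_id>_files.zip containing the metadata, processed and raw files."""
    if not processed_paths or not raw_paths:
        raise ValueError("processed and raw data files are both required")
    all_paths = [metadata_path, *processed_paths, *raw_paths]
    missing = [p for p in all_paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError("missing files: " + ", ".join(missing))
    archive_path = os.path.join(out_dir, f"{user_id}_files.zip")
    # ZIP_STORED: no recompression of files that are already compressed.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for path in all_paths:
            archive.write(path, arcname=os.path.basename(path))
    return archive_path
```

The function refuses to build an archive when any listed file is absent, which guards against transferring an incomplete submission.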

    After transferring files, please send an e-mail to geo@ncbi.nlm.nih.gov with the following information:
    1. GEO account user name ();
    2. Name(s) of the archive file(s) deposited;
    3. Public release date (up to 3 years from now).
    It is important to send us this e-mail notification because files we cannot identify will be removed from our FTP site without being processed. We do not send automated confirmation that files have been received. You should expect to receive an e-mail from a curator within 5 business days after you send us the e-mail notification.

    If you have any questions or concerns regarding data transfer, please e-mail us.







