File Format Guide

Introduction

This page reviews the submission file formats currently supported by the Sequence Read Archives (SRA) at NCBI, EBI, and DDBJ, and gives guidance to submitters about current and future file formats and policies regarding SRA submissions.

Some things to keep in mind:

  • The SRA is a raw data archive, and requires per-base quality scores for all submitted data. Therefore, FASTA and other sequence-only formats are not sufficient for submission! FASTA can, however, be submitted as a reference sequence(s) for BAM files or as part of a FASTA/QUAL pair (see below).
  • SRA accepts binary files such as BAM, SFF, and HDF5 formats and text formats such as FASTQ.

BAM files

Binary Alignment/Map (BAM) files are one of the preferred SRA submission formats. BAM is a compressed, binary version of the Sequence Alignment/Map (SAM) format (see SAMv1 (.pdf)). BAM files can be decompressed to a human-readable text format (SAM) using SAM/BAM-specific utilities (e.g., samtools) and can contain unaligned sequences as well. SRA recommends aligning to an unmodified known reference, if possible, to enable subsequent users to view the alignments in the Sequence Viewer or to compare the alignments with other alignments on the same reference.

SAM is a tab-delimited format that includes both the raw read data and information about the alignment of each read to a known reference sequence(s). There are two main sections in a SAM file, the header and the alignment (sequence read) sections, each of which is described below. Note that this documentation focuses on the SAM format with respect to submission of BAM files to the SRA (i.e., SRA does not accept SAM files for submission). A more comprehensive discussion of the format specifications can be found at the samtools website.

SAM Header Example:

@HD    VN:1.4    SO:coordinate
@SQ    SN:CHROMOSOME_I    LN:15072423    UR:ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates/Caenorhabditis_elegans/WBcel215/Primary_Assembly/assembled_chromosomes/FASTA/chrI.fa.gz    AS:ce10    SP:Caenorhabditis elegans
@SQ    SN:CHROMOSOME_II    LN:15279345    UR:ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates/Caenorhabditis_elegans/WBcel215/Primary_Assembly/assembled_chromosomes/FASTA/chrII.fa.gz    AS:ce10    SP:Caenorhabditis elegans
@RG    ID:1    PL:ILLUMINA    LB:C_ele_05    DS:WGS of C elegans    PG:BamIndexDecoder
@PG    ID:bwa    PN:bwa    VN:0.5.10-tpx

Ideally, the SN value should be a versioned accession (e.g., NC_003279.7, rather than CHROMOSOME_I). This will allow the SRA to unambiguously identify the reference sequence(s) and process the BAM file with minimal intervention. Otherwise, submitters are strongly encouraged to include the UR tag (a URL/URI from which the reference sequence(s) can be obtained) and the AS tag to clearly define which assembly has been used (as above).

If the data are instead aligned to a local or submitter-defined set of references (including any modifications to accessioned assemblies), then the submitter must include a reference FASTA file along with each submitted BAM file. Note: the FASTA header line(s) MUST match the SN names provided in the BAM file exactly.

Deviation from these recommended practices will require manual intervention by SRA staff in order to process a BAM file and can delay completion of a submission and acquisition of accession numbers.
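As a rough pre-submission aid, the header recommendations above can be checked with a short script. The following is a minimal sketch, not an SRA tool: it assumes the header has already been exported as tab-delimited SAM text, the function names are hypothetical, and the accession pattern is only a heuristic for values that look like NC_003279.7.

```python
import re

# Heuristic (an assumption, not an official SRA rule) for a versioned
# accession such as NC_003279.7: letters, optional "_", digits, ".", version.
VERSIONED_ACCESSION = re.compile(r"^[A-Z]{1,6}_?\d+\.\d+$")


def parse_sq_lines(header_text):
    """Return one dict of TAG -> value for every @SQ line in SAM header text."""
    records = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue
        records.append(dict(f.split(":", 1) for f in fields[1:] if ":" in f))
    return records


def ambiguous_references(header_text):
    """List SN values that are neither versioned accessions nor carry UR and AS tags."""
    flagged = []
    for sq in parse_sq_lines(header_text):
        name = sq.get("SN", "")
        if VERSIONED_ACCESSION.match(name):
            continue
        if "UR" in sq and "AS" in sq:
            continue
        flagged.append(name)
    return flagged
```

Any name returned by ambiguous_references is a reference that would probably need either a versioned accession, UR and AS tags, or an accompanying reference FASTA file.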

SAM Alignment Example:

3658435    145    CHROMOSOME_I    1    0    100M    CHROMOSOME_II    2716898    0    GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT    @CCC?:CCCCC@CCCEC>AFDFDBEGHEAHCIGIHHGIGEGJGGIIIHFHIHGF@HGGIGJJJJJIJJJJJJJJJJJJJJJJJJJJJHHHHHFFFFFCCC    RG:Z:1    NH:i:1    NM:i:0
5482659    65    CHROMOSOME_I    1    0    100M    CHROMOSOME_II    11954696    0    GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT    CCCFFFFFHHGHGJJGIJHIJIJJJJJIJJJJJIJJGIJJJJJIIJIIJFJJJJJFIJJJJIIIIGIIJHHHHDEEFFFEEEEEDDDDCDCCCAAA?CC:    RG:Z:1    NH:i:1    NM:i:0

In the examples above, the header and alignment sections are internally consistent: each aligned read has an RNAME (reference sequence name, 3rd field) that matches an SN tag value from the header (e.g., CHROMOSOME_I), and, if provided, the alignment read group optional field (RG:Z:) is consistent with the read group ID in the header (1). It is also important to ensure that the FLAG fields (2nd field in each line) are correctly set for the data. The SRA pipeline will attempt to resolve incorrect FLAG values, but sufficiently incorrect values can lead to processing errors. The SRA does not archive optional and non-standard tags/field values contained in the alignment section. However, the entire header section of the BAM file is retained. Additionally, although the SAM format allows an equal sign (=) in the sequence field to represent a match to the reference sequence, or only an asterisk (*) in both the sequence and quality fields, the SRA processing software does not recognize either of these notations.
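The consistency rules in this paragraph can be sketched as a small check over SAM text. This is an illustrative sketch under stated assumptions: the input is tab-delimited SAM text (for example, produced by decompressing a BAM file), the function name check_alignment_lines is hypothetical, and only the rules named above are tested.

```python
def check_alignment_lines(header_text, alignment_text):
    """Return (line_number, problem) tuples for alignment lines that break the rules above."""
    sn_names, rg_ids = set(), set()
    for line in header_text.splitlines():
        fields = line.split("\t")
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@SQ" and "SN" in tags:
            sn_names.add(tags["SN"])
        elif fields[0] == "@RG" and "ID" in tags:
            rg_ids.add(tags["ID"])

    problems = []
    for number, line in enumerate(alignment_text.splitlines(), start=1):
        fields = line.split("\t")
        if len(fields) < 11:
            problems.append((number, "fewer than 11 mandatory fields"))
            continue
        rname, seq, qual = fields[2], fields[9], fields[10]
        # RNAME must match an SN value from the header ('*' means unaligned).
        if rname != "*" and rname not in sn_names:
            problems.append((number, "RNAME %s not found in header" % rname))
        # '=' in SEQ, or '*' for both SEQ and QUAL, is valid SAM but not recognized by SRA.
        if "=" in seq or (seq == "*" and qual == "*"):
            problems.append((number, "SEQ/QUAL notation not recognized by SRA"))
        # An RG:Z: optional field must name a read group declared in the header.
        for optional in fields[11:]:
            if optional.startswith("RG:Z:") and optional[5:] not in rg_ids:
                problems.append((number, "read group %s not in header" % optional[5:]))
    return problems
```

An empty result only means that none of these particular problems were found; it does not guarantee that the file will process without intervention.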

Please note that unexpected notations used to indicate paired reads can lead to a failure to recognize the pairs and to an improper SRA archive (i.e., paired reads are treated like fragments). For example, using :0 and :1 at the end of the read names is atypical and is currently not recognized as an indication of read 1 and read 2 in a pair. It would be better to exclude these notations and give the two reads the same name. Expected notations for particular platforms will work; for example, Illumina read names with /1 or /2 appended are recognized. Further, neglecting to set the proper bits for paired reads in the SAM/BAM flags (e.g., the multi-segment template bit, value 1; the first segment bit, value 64; and the last segment bit, value 128) or splitting paired reads into separate BAM files can result in an improper SRA archive or failure to generate the SRA archive.
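To make the pairing bits concrete, the sketch below decodes the three FLAG bits mentioned above (values 1, 64, and 128). The function name classify_pair_flag and the returned labels are hypothetical choices for illustration.

```python
PAIRED = 0x1          # template has multiple segments (bit value 1)
FIRST_SEGMENT = 0x40  # first segment in the template (bit value 64)
LAST_SEGMENT = 0x80   # last segment in the template (bit value 128)


def classify_pair_flag(flag):
    """Describe the pairing information carried by a SAM FLAG value."""
    paired = bool(flag & PAIRED)
    first = bool(flag & FIRST_SEGMENT)
    last = bool(flag & LAST_SEGMENT)
    if not paired:
        # First/last bits without the paired bit are contradictory.
        return "inconsistent" if (first or last) else "unpaired"
    if first and not last:
        return "read 1"
    if last and not first:
        return "read 2"
    # Paired bit set, but the position in the pair cannot be determined.
    return "ambiguous"
```

For the two records in the SAM alignment example, FLAG 145 (1 + 16 + 128) classifies as "read 2" and FLAG 65 (1 + 64) as "read 1".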

Note: When submitting BAM files of aligned reads to the SRA, you must also specify an assembly (the reference genome that your reads were aligned against). You can identify your reference assembly by its name or accession from the NCBI Assembly database. UCSC and Ensembl assembly names may also be used. If the assembly is not available from a public repository, you will need to submit your own (local) assembly in FASTA format (reference_fasta) along with your BAM file.

CRAM files

Another acceptable SRA submission format is the CRAM format (see CRAMv3 (.pdf)). Files received in this format are converted to the BAM format for processing. The references provided in this format are treated in the same manner as BAM references, with the added possibility of a check against the European Nucleotide Archive (ENA) CRAM reference registry.

SFF files

In the absence of a BAM file, Standard Flowgram Format (SFF) is the preferred input format for 454 Life Sciences (now part of Roche) data; IonTorrent data can also be submitted as SFF. Extensive technical details about the format can be obtained here.

Note: Submitters of SFF data should ensure that the data are demultiplexed (if barcoded); this is particularly common in pyrotag / 16S rRNA amplicon sequencing.

HDF5 files

HDF5 is a data model, library, and file format for storing and managing data. The SRA accepts bas.h5 and bax.h5 file submissions for PacBio-based submission and .fast5 files for submissions related to MinION Oxford Nanopore.

PacBio

Submission of data from the RS II instrument requires one (1) bas.h5 file and three (3) bax.h5 files. Do not link more than one PacBio RS II to an SRA run, and please do not change the bax.h5 file names from those indicated in the bas.h5 file.

Depending on the platform used for your PacBio sequencing project, the following data files with respective extensions are produced and required for SRA submission.

Data files delivered by each PacBio RS platform:

PacBio RS
  1. xxxx.metadata.xml (optional but desirable)
  2. xxxx.bas.h5
PacBio RS II
  1. xxxx.metadata.xml (optional but desirable)
  2. xxxx.bas.h5 (optional but desirable)
  3. xxxx.1.bax.h5
  4. xxxx.2.bax.h5
  5. xxxx.3.bax.h5

Please be sure to list the files for each SMRT Cell in a separate Run or on a separate row of your sra_metadata sheet.

PacBio documentation on bax.h5 / bas.h5 format: bas.h5ReferenceGuide.pdf.

MinION Oxford Nanopore

In this case, there are 1-3 sequences per fast5 HDF file (one spot of information) and the entire set of fast5 files should be submitted in a tar.gz file. You must submit the fast5 files generated after base calling.
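Packing the fast5 files into a single tar.gz file can be done with any tar utility. The following is a minimal sketch using the Python standard library; the function name bundle_fast5 and the directory layout are assumptions for illustration, and the script does not verify that the files were generated after base calling.

```python
import tarfile
from pathlib import Path


def bundle_fast5(directory, archive_path):
    """Write every *.fast5 file under `directory` into a gzip-compressed tar archive.

    Returns the number of files added.
    """
    root = Path(directory)
    files = sorted(root.rglob("*.fast5"))
    if not files:
        raise ValueError("no .fast5 files found in %s" % directory)
    with tarfile.open(str(archive_path), "w:gz") as archive:
        for path in files:
            # Store paths relative to the input directory, not absolute paths.
            archive.add(str(path), arcname=path.relative_to(root).as_posix())
    return len(files)
```

For example, bundle_fast5("basecalled_reads", "run1_fast5.tar.gz") would collect all fast5 files below the (hypothetical) basecalled_reads directory.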

Learn more about this platform at the Oxford Nanopore Technologies website.

HDF5 tools

HDF5 tools: http://www.hdfgroup.org/products/hdf5_tools

FASTQ files

Fastq consists of a defline that contains a read identifier and possibly other information, nucleotide base calls, a second defline, and per-base quality scores, all in text form. There are many variations.

The following terms and formats are defined in general:

  • Identifier and other information: text string terminated by white space.
  • Bases: fastq sequence should contain standard base calls (ACTGactg) or unknown bases (Nn) and can vary in length.
  • Qualities options:

    Decimal-encoding, space-delimited [0-9]+ | <quality>\s[0-9]+
    Phred-33 ASCII [\!\"\#\$\%\&\'\(\)\*\+,\-\.\/0-9:;<=>\?\@A-I]+
    Phred-64 ASCII [\@A-Z\[\\\]\^_`a-h]+

    Quality string length should be equal to sequence length.

    In a limited set of cases, log odds or non-ASCII numerical quality values will succeed during an SRA submission.

Files from various platforms employing this format are acceptable:

@<identifier and expected information>
<sequence>
+<identifier and other information OR empty string>
<quality>

Here, each instance of Identifier, Bases, and Qualities is newline-separated. Extra information added beyond the <identifier and expected information> examples is likely to be discarded/ignored.
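The four-line layout and the general rules above can be sketched as a simple record check. This is a minimal sketch, not the SRA parser: it assumes strictly four lines per record (no wrapped sequence lines) and ASCII-encoded qualities, and the name parse_fastq is hypothetical.

```python
import re

# Standard base calls plus unknown bases, as listed in the definitions above.
BASES = re.compile(r"^[ACGTNacgtn]+$")


def parse_fastq(text):
    """Return (identifier, sequence, quality) tuples, raising ValueError on malformed records."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text is not a whole number of 4-line records")
    records = []
    for i in range(0, len(lines), 4):
        defline, sequence, second_defline, quality = lines[i:i + 4]
        name_parts = defline[1:].split()
        if not defline.startswith("@") or not name_parts:
            raise ValueError("line %d: missing @identifier" % (i + 1))
        if not second_defline.startswith("+"):
            raise ValueError("line %d: second defline must start with '+'" % (i + 3))
        if not BASES.match(sequence):
            raise ValueError("line %d: unexpected base characters" % (i + 2))
        if len(quality) != len(sequence):
            raise ValueError("line %d: quality length differs from sequence length" % (i + 4))
        # The identifier is the text up to the first white space.
        records.append((name_parts[0], sequence, quality))
    return records
```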

As indicated above, the Qualities string can be space-separated numeric Phred scores or an ASCII string of the Phred scores with the ASCII character value = Phred score plus an offset constant used to place the ASCII characters in the printable character range. There are 2 predominant offsets: 33 (0 = !) and 64 (0=@).
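The relationship between characters and scores can be illustrated as follows. This is a minimal sketch; the function names are hypothetical, and the caller is assumed to already know which offset a file uses.

```python
def decode_ascii_qualities(quality, offset=33):
    """Convert an ASCII quality string to Phred scores: score = ord(char) - offset."""
    if offset not in (33, 64):
        raise ValueError("offset must be 33 or 64")
    scores = [ord(char) - offset for char in quality]
    if any(score < 0 for score in scores):
        raise ValueError("character below the offset; wrong offset for this string?")
    return scores


def decode_decimal_qualities(quality):
    """Convert space-delimited decimal qualities to Phred scores."""
    return [int(token) for token in quality.split()]


def encode_ascii_qualities(scores, offset=33):
    """Convert Phred scores to an ASCII quality string: char = chr(score + offset)."""
    return "".join(chr(score + offset) for score in scores)
```

With offset 33, "!" decodes to 0 and "I" to 40; with offset 64, "@" decodes to 0 and "h" to 40.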

Paired-end FASTQ

Paired reads are generally a forward read paired with a reverse read, although there are some instances where this is not the case.

Paired-end data submitted in FASTQ format should be submitted in one of two formats:

  1. As separate files for forward and reverse reads, in which the reads are in the same order.
  2. As interleaved, or "8-line", FASTQ, in which forward and reverse reads alternate in the file and are in order (i.e., read "1F", followed by read "1R", then read "2F", then "2R").

SRA supports the following forward/reverse read indicators: '/1' and '/2' at the end of the read name or newer Illumina style '1:Y:18:ATCACG' and '2:Y:18:ATCACG'.

Note: Concatenated FASTQ (in which all forward reads are followed by all reverse reads) is not supported.
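The two supported layouts are related: separate forward and reverse files in the same order can be merged into interleaved FASTQ. The sketch below assumes records have already been parsed into (identifier, sequence, quality) tuples, where the identifier is the read name up to the first white space, and it only handles the '/1' and '/2' suffix indicators; the function names are hypothetical.

```python
def base_name(identifier):
    """Strip a trailing /1 or /2 read indicator, if present."""
    if identifier.endswith(("/1", "/2")):
        return identifier[:-2]
    return identifier


def interleave(forward, reverse):
    """Merge same-order forward and reverse records into 1F, 1R, 2F, 2R, ... order."""
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse files contain different numbers of reads")
    merged = []
    for fwd, rev in zip(forward, reverse):
        if base_name(fwd[0]) != base_name(rev[0]):
            raise ValueError("reads out of order: %s vs %s" % (fwd[0], rev[0]))
        merged.extend([fwd, rev])
    return merged
```

The name comparison is what catches files whose reads are not in the same order, which is the requirement stated for separate files.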

Platform specific FASTQ files

454 fastq

@<454_universal_accession>

Under Roche 454, SRA accepts both 'pre-split' and 'post-split' 454 fastq sequences. Paired 'post-split' 454 reads must be provided in separate files or in the interleaved format. 'Split' means the 454 linker has been located/removed and used to split the sequence into biological read pairs (and all other technical reads have been removed).

Ion Torrent fastq

@<Run_ID>:<Chip_Row_Coordinate>:<Chip_Column_Coordinate>

In the same manner as Roche 454, SRA only accepts 'pre-split' Ion Torrent sequences or 'post-split' Ion Torrent single read fragments in a fastq form. Paired 'post-split' Ion Torrent reads will require submission in a BAM file. 'Split' means the Ion Torrent linker has been located/removed and used to split the sequence into biological read pairs (and all other technical reads have been removed).

Recent Illumina fastq

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index>

<index> values for Illumina fastq can be barcodes.

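As an illustration of the recent Illumina defline layout, the fields can be pulled apart with a regular expression. This is a sketch under assumptions: the pattern expects numeric lane, tile, and coordinate fields and Y/N for the filter field, the group names are hypothetical, and real files may contain variations that it does not cover.

```python
import re

ILLUMINA_DEFLINE = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run_number>\d+):(?P<flowcell_id>[^:\s]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x_pos>\d+):(?P<y_pos>\d+)"
    r"\s+(?P<read>\d+):(?P<is_filtered>[YN]):(?P<control_number>\d+):(?P<index>\S*)$"
)


def parse_illumina_defline(defline):
    """Return a dict of defline fields, or None if the line does not match the layout."""
    match = ILLUMINA_DEFLINE.match(defline)
    return match.groupdict() if match else None
```

All values are returned as strings; a defline in the older Illumina layout returns None.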
Older Illumina fastq

@<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read>

<index> values for Illumina fastq can be barcodes.

QIIME de-multiplexed sequences in fastq

@<SampleID-based_identifier> <Original_information> orig_bc=<original_barcode> new_bc=<corrected_barcode> bc_diffs=<0|1>

PacBio CCS (Circular Consensus Sequence) or RoI (Read of Insert) read

@<MovieName>/<ZMW_number>

PacBio CCS subread

@<MovieName>/<ZMW_number>/<subread-start>_<subread-end>

Helicos fastq with a fixed ASCII-based Phred value for quality

@VHE-242383071011-15-1-0-2

Characteristic use of a quality '/', which gives a Phred value of 14.

The native format for Helicos is FASTA, so converting to FASTQ requires creating a default quality score. The default value selected by the SRA team is '14'.
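The fixed quality value can be illustrated with a small conversion. This is only a sketch of the idea, not the SRA conversion code; the function name is hypothetical, and the Phred+33 offset is assumed because '/' (ASCII 47) corresponds to 14 under that offset.

```python
DEFAULT_PHRED = 14  # default value selected by the SRA team, per the text above


def fasta_record_to_fastq(name, sequence, phred=DEFAULT_PHRED, offset=33):
    """Build a FASTQ record in which every base gets the same default quality."""
    quality = chr(phred + offset) * len(sequence)
    return "@%s\n%s\n+\n%s\n" % (name, sequence, quality)
```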

FASTA files

Fasta files adhering to the definition lines described in the fastq section are acceptable, too, although fastq is preferred (a file type of fastq should still be specified). The SRA assigns a default quality value of 30 in this case and expects this format:

>(identifier and other information)
<sequence>

FASTA with QUAL file pairs

Fasta files may be submitted with corresponding qual files, too. These are recognized in the SRA data processing pipeline as equivalent to fastq and should be specified as fastq when submitting the data files.

Files from some platforms (mostly older Illumina and Roche 454) employing this format are acceptable and the entries in the pair of files should look like:

File 1

>READNAME
BASES

File 2

>READNAME
QUALITIES

Where READNAME must be identical between files for a given read, and QUALITIES are generally in whitespace-separated decimal values.

Note the following guidelines for FASTA/QUAL pairs of files:

In a given pair of files, there must be the same number of reads in both. For a given read, there must be the same number of BASES and QUALITIES, i.e., if the BASES are trimmed to remove barcodes, then the same scores must be removed from the QUALITIES, etc.
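These guidelines can be checked before submission with a short script. The following is a minimal sketch: the function names are hypothetical, QUALITIES are assumed to be whitespace-separated decimal values, and records are compared in file order.

```python
def parse_named_records(text):
    """Parse '>NAME' records; return (name, body) pairs with body lines joined by spaces."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            records.append((name, []))
        elif records:
            records[-1][1].append(line)
    return [(name, " ".join(parts)) for name, parts in records]


def check_fasta_qual_pair(fasta_text, qual_text):
    """Return a list of problems found in a FASTA/QUAL pair (empty if none)."""
    bases = parse_named_records(fasta_text)
    quals = parse_named_records(qual_text)
    problems = []
    if len(bases) != len(quals):
        problems.append("files contain different numbers of reads")
    for (base_name, sequence), (qual_name, qualities) in zip(bases, quals):
        if base_name != qual_name:
            problems.append("name mismatch: %s vs %s" % (base_name, qual_name))
            continue
        n_bases = len("".join(sequence.split()))
        n_quals = len(qualities.split())
        if n_bases != n_quals:
            problems.append("%s: %d bases but %d qualities" % (base_name, n_bases, n_quals))
    return problems
```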

CSFASTA with QUAL Files

The files have an optional header that is identified by lines that begin with the hash/pound/number sign (#). The HEADER can be defined as:

# <date> <path> [--flag]* --tag <tag> --minlength=<length> --prefix=<prefix> <path>
# Cwd: <path>
# Title: <flowcell>

The permissible CSFASTA format is as follows:

#HEADER (multiple lines)
>TAGNAME
BASES

The permissible QUAL format is as follows:

#HEADER (multiple lines)
>TAGNAME
QUALITIES

As with FASTA/QUAL pairs, there are several rules for pairs of CSFASTA/QUAL files. TAGNAME must be identical between files for a given read, and QUALITIES are generally in whitespace-separated decimal values.

Note the following guidelines for CSFASTA/QUAL pairs of files:

In a given pair of files, there must be the same number of reads in both. For a given read, there must be the same number of color space digits and QUALITIES, i.e., the BASES line is typically 1 character longer than the number of QUALITIES (due to the color space indexing base that begins each BASES string). HEADER must be identical between paired files.
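The same kind of check can be sketched for CSFASTA/QUAL pairs. This sketch assumes two-line records (>TAGNAME followed by a single BASES or QUALITIES line) and treats the first character of BASES as the indexing base, so it expects exactly one more character than there are quality values; the text above says this is only typically the case. The function names are hypothetical.

```python
def split_header(text):
    """Split file text into (header_lines, record_lines); header lines start with '#'."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if not line.startswith("#")]
    return header, records


def check_csfasta_qual_pair(csfasta_text, qual_text):
    """Return a list of problems found in a CSFASTA/QUAL pair (empty if none)."""
    cs_header, cs_lines = split_header(csfasta_text)
    qual_header, qual_lines = split_header(qual_text)
    problems = []
    if cs_header != qual_header:
        problems.append("headers differ")
    if len(cs_lines) != len(qual_lines):
        problems.append("files contain different numbers of record lines")
    # Walk the two-line records that both files have in common.
    for i in range(0, min(len(cs_lines), len(qual_lines)) - 1, 2):
        cs_name, bases = cs_lines[i], cs_lines[i + 1]
        qual_name, qualities = qual_lines[i], qual_lines[i + 1]
        if cs_name != qual_name:
            problems.append("name mismatch: %s vs %s" % (cs_name, qual_name))
        elif len(bases) - 1 != len(qualities.split()):
            problems.append("%s: color count does not match quality count" % cs_name)
    return problems
```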

Also see SOLiD™ Data Format and File Definitions Guide (.pdf)

Legacy Formats

These formats are still accepted by SRA, but are considered out-of-date and not recommended for submission. If you are able to update your files to a more common format please do so before submitting to SRA.

SRF files

SRF is a generic format for DNA sequence data, with sufficient flexibility to store data from current and future DNA sequencing technologies. It provides a single input file format for all downstream applications, along with a read lookup index that enables downstream formats to reference reads without duplicating all of the read-specific information.

Sequence Read Format (SRF) homepage: http://srf.sourceforge.net/

Native Illumina

Submitters may submit native data from the primary analysis output of the Illumina GA.

The filetype is Illumina_native and constituent files for a run should be tarred together into a single tar file.

Illumina GA readname can be defined as follows:

<flowcell> = [a-zA-Z0-9_-]{2}+
<lane> = 1..8
<tile> = 1..1024
<x> = 1..4096
<y> = 1..4096
<sep> ::= [_\t]
READNAME ::= [<flowcell><sep> | s_]<lane><sep><tile><sep><x><sep><y>

Within a related set of files, reads are grouped by tile. Reads should be fixed length, and the number of quality scores and bases is the same in each.

Allowed characters:

BASES: AaCcTtGgNn

QUALITIES: [\!\"\#\$\%\&\'\(\)\*\+,\-\.\/0-9:;<=>\?\@A-I]+ or [\@A-Z\[\\\]\^_`a-h]+

QSEQ

The basecalling program Bustard emits a _qseq.txt file for each lane (two files for mate pairs). Paired-end data are presented in the orientation in which they were sequenced (5'-3' & 3'-5').

Each read is contained on a single line with tab separators in the following format:

  • Machine name: Unique identifier of the sequencer.
  • Run number: Unique number to identify the run on the sequencer.
  • Lane number: Positive Integer (currently 1-8).
  • Tile number: Positive Integer.
  • X coordinate of the spot: Integer (can be negative).
  • Y coordinate of the spot: Integer (can be negative).
  • Index: Positive Integer (no indexing should have a value of 1).
  • Read Number: 1 for single reads; 1 or 2 for paired-ends.
  • Sequence (BASES)
  • Quality: the calibrated quality string (QUALITIES).
  • Filter: Did the read pass filtering? 0 - No, 1 - Yes.
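The field list above maps directly onto a tab-split of each line. The following is a minimal sketch with hypothetical field and function names; it converts only the fields described as integers and leaves the rest as strings.

```python
QSEQ_FIELDS = (
    "machine", "run", "lane", "tile", "x", "y",
    "index", "read_number", "sequence", "quality", "filter",
)


def parse_qseq_line(line):
    """Split one tab-separated _qseq.txt line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(QSEQ_FIELDS):
        raise ValueError("expected %d fields, found %d" % (len(QSEQ_FIELDS), len(values)))
    record = dict(zip(QSEQ_FIELDS, values))
    for key in ("lane", "tile", "x", "y", "read_number"):
        record[key] = int(record[key])  # x and y may be negative
    # Filter field: 0 means the read failed filtering, 1 means it passed.
    record["passed_filter"] = record.pop("filter") == "1"
    return record
```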

Machine Specific Information

File types accepted by platform in approximate order of preference (formats that are least desirable marked with '*', those with uncertain outcome marked with '?'):

Illumina

bam, fastq, qseq, fasta+qual*?, native*, srf*?

SOLiD

bam, csfasta + QV.qual, srf*?

Roche 454 (formerly 454 Life Sciences)

bam, sff, fastq, fasta+qual*?

IonTorrent

bam, sff, fastq, fasta+qual*?

PacBio

bam, hdf5, fastq

MinION Oxford Nanopore

hdf5, fastq

Helicos

bam, fastq

Capillary (Sanger)

bam, fastq*?

CompleteGenomics

native, bam*

Complete Genomics format: see CG Data File Formats. This format requires providing tarred versions of the ASM, LIB, and MAP sub-directories for a successful submission to take place. Additionally, processing of reference sequences occurs in the same manner as for BAM and CRAM files. For this format, please contact SRA prior to submission.


Contact SRA

Contact SRA staff for assistance at sra@ncbi.nlm.nih.gov


Last updated: 2018-02-06T15:27:56Z