NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

SRA Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010-.


File Format Guide

Last Update: March 18, 2014.


This document reviews the file formats currently supported by the Sequence Read Archives (SRA) at NCBI, EBI, and DDBJ, and gives guidance to submitters about current and future file formats and policies regarding SRA submissions.

The SRA is part of the International Nucleotide Sequence Database Collaboration (INSDC), which sets policies and goals for the partner databases. This document is intended to be compatible with INSDC policies.


This document guides submitters of sequencing data in order to:

  • Specify which data formats are currently supported by SRA.
  • Enable submitters to validate and convert data prior to submission to avoid unnecessary data transfers.
  • Improve the speed of submission processing.
  • Reduce the probability of failed submissions.
  • Improve other services provided by SRA by freeing up time previously spent correcting and transforming data.

This document guides depositors to the Archives so that they may:

  • Understand how to prepare data for submission to one of the Archives.
  • Know what formats are supported by Archives facilities or toolkits, and which ones may have to be developed by the user.
  • Understand why technical issues limit the usage of certain file formats.

External Documents and Links

Revision History

  • Reviewed by NCBI 2014-03-18
  • Reviewed by NCBI 2012-07-11
  • Reviewed by NCBI 2009-10-01
  • Reviewed by EBI 2009-10-07
  • Reviewed by DDBJ 2009-11-27

Overview of Input Formats

General Considerations

The SRA is a “raw data” archive, and requires per-base quality scores for all submitted data. Thus, unlike GenBank and some other NCBI repositories, FASTA and other sequence-only formats are not sufficient for submission. FASTA can, however, be submitted as a reference sequence(s) for BAM files or as part of a FASTA/QUAL pair (see below).

The SRA data model has transitioned from “dumps” of whole flowcell lanes or production runs into a semi-curated database of sample-specific sequencing libraries. This has implications for the types of data that we accept. Most specifically, barcoded/batch files should be split into per-sample data files (“demultiplexed”). Demultiplexing makes the sample-to-data linkage unambiguous in our database and should improve both the clarity and usability of submitted data. Please email sra@ncbi.nlm.nih.gov if you have specific questions about data requirements vis-à-vis samples.

Conversion to the SRA archive format (described below) is NOT required for submission. However, the SRA Toolkit can be used to “test load” your files locally if you would like to validate them prior to submission. BAM files can be evaluated with ‘bam-load’ and FASTQ files can be evaluated with ‘latf-load’ (first released in Toolkit version 2.3.5). These load utilities are effectively stand-alone and can be run by most submitters. Other SRA loading utilities, such as ‘sff-load’ and ‘abi-load’, depend on SRA XML documents and are recommended only for advanced users. If you elect to test load your data file(s) and encounter problems or have questions, please email sra@ncbi.nlm.nih.gov.

Preferred Formats

The SRA generally prefers to obtain “container files”. Container in this context means an unambiguous binary file. These are objects that contain both the data and a description or specification of the data. Examples include BAM, SFF, and PacBio HDF5 formats. Containers have the following advantages:

  • All data for a given library is contained in one file.
  • Data are indexed for random access.
  • Data are already compressed, so additional compression with gzip or other utilities is discouraged.
  • Data are streamable (can be read from one input handle).
  • Data are self-identifying (file type can be interrogated with the ‘file’ utility).
  • Data come with run-time configuration and execution parameters, including run date, instrument name, flowcell name, processing program and version, etc.
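
As an illustration of the self-identifying property listed above, the following minimal Python sketch guesses a container type from the leading bytes of a file. The function name sniff_container is hypothetical and is not part of any SRA tool. The signatures it checks (the HDF5 signature, the “.sff” magic number, and the “BAM\1” magic inside a gzip-compatible BGZF block) are taken from the respective format specifications rather than from this guide.

```python
import gzip
import io
import zlib


def sniff_container(head):
    """Guess a container type from the first bytes of a file (hypothetical helper)."""
    if head.startswith(b"\x89HDF\r\n\x1a\n"):
        return "HDF5"
    if head.startswith(b".sff"):
        return "SFF"
    if head.startswith(b"\x1f\x8b"):
        # gzip-compatible stream; a BAM file decompresses to data starting with b"BAM\x01"
        try:
            payload = gzip.GzipFile(fileobj=io.BytesIO(head)).read(4)
        except (OSError, EOFError, zlib.error):
            return "gzip (unknown payload)"
        return "BAM" if payload == b"BAM\x01" else "gzip (unknown payload)"
    return "unknown"
```

A text format such as FASTQ or SAM has no such signature, which is one reason the sketch can only report it as “unknown”.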

Text formats, such as FASTQ, are supported, but are not the preferred submission medium. Poorly defined specifications and high variability within these formats tend to lead to a higher frequency of failed or problematic submissions. Wherever possible, submitters are encouraged to submit data in a container format, as described above.

Figure 1 shows the hierarchy of input file types supported by the SRA. Table 1 shows which properties of input data file formats are supported.



Figure 1 – Input file types supported by the SRA

Table 1 – Properties of input data file formats

File model | Archive ready? | Streamable on load? | Auxiliary data? | Run metadata? | Compressed? | Indexed? | Read names parseable? | Read names indexable?
Illumina native | Y | Y | N | N | N | N | N | N

BAM (Binary Sequence Alignment/Map)

BAM is the preferred submission format for the SRA. BAM is the binary (compressed and indexed) version of SAM. BAM files can be read out as human-readable SAM with BAM/SAM-specific utilities such as SAMtools. Although BAM compression (BGZF) is gzip-compatible, decompressing a BAM file with gzip/gunzip yields binary BAM records rather than SAM text. SAM is a generic tab-delimited format that includes both the raw read data and information about the alignment of each read to one or more known reference sequences. A SAM file has two main sections, the header and the alignment (sequence read) sections, each of which is described below. Note that this documentation focuses on the SAM format with respect to submission of BAM files to the SRA. A more comprehensive discussion of the format specifications can be found at the SAMtools website.

SAM Header Section

Each line in a SAM header begins with ‘@’, followed by a two-character code that identifies the type of information encoded in the line. A typical SAM file can contain HD (header), SQ (reference sequence) line(s), RG (read group) line(s), and PG (program) descriptions in the header section. An example SAM header is shown below. Note that this is to highlight the format – not all sections and tags are required.

@HD    VN:1.4    SO:coordinate
@SQ    SN:CHROMOSOME_I    LN:15072423    UR:    AS:ce10    SP:Caenorhabditis elegans
@SQ    SN:CHROMOSOME_II    LN:15279345    UR:    AS:ce10    SP:Caenorhabditis elegans
@SQ    SN:CHROMOSOME_III    LN:13783700    UR:    AS:ce10    SP:Caenorhabditis elegans
@SQ    SN:CHROMOSOME_IV    LN:17493793    UR:    AS:ce10    SP:Caenorhabditis elegans
@SQ    SN:CHROMOSOME_V    LN:20924149    UR:    AS:ce10    SP:Caenorhabditis elegans
@SQ    SN:CHROMOSOME_X    LN:17718866    UR:    AS:ce10    SP:Caenorhabditis elegans
@RG    ID:1    PL:ILLUMINA    LB:C_ele_05    DS:WGS of C elegans    PG:BamIndexDecoder
@PG    ID:bwa    PN:bwa    VN:0.5.10-tpx

Ideally, the “SN” value should be a versioned accession (e.g., NC_003279.7, rather than “CHROMOSOME_I”). This will allow the SRA to unambiguously identify the reference sequence(s) and process the BAM file with minimal intervention. Barring that, submitters are strongly encouraged to use the “UR” (URL/URI that can be used to obtain the reference sequence(s)) and “AS” tags to clearly define which assembly has been used (as above). If the data are instead aligned to a “local” or submitter-defined set of references (including any modifications to accessioned assemblies), then the submitter must include a “reference FASTA” along with each submitted BAM file. The FASTA header line(s) must match the “SN” names provided in the BAM file exactly. Deviation from these recommended practices will require manual intervention by SRA staff in order to process a BAM file and can delay completion of a submission.
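
The requirement that reference FASTA header lines match the “SN” names exactly can be checked before submission. The following is a minimal Python sketch under stated assumptions: the SAM header is available as text (for example, exported with a SAM/BAM utility), and the function names sq_names, fasta_names, and missing_references are hypothetical. It compares the whole FASTA header line (after “>”) with each SN value.

```python
def sq_names(sam_header):
    """Collect the SN value of every @SQ line in a SAM header."""
    names = []
    for line in sam_header.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def fasta_names(fasta_text):
    """Collect every FASTA header line, without the leading '>'."""
    return [line[1:].strip() for line in fasta_text.splitlines()
            if line.startswith(">")]


def missing_references(sam_header, fasta_text):
    """Return SN names that have no exactly matching FASTA header."""
    available = set(fasta_names(fasta_text))
    return [name for name in sq_names(sam_header) if name not in available]
```

An empty list from missing_references suggests that every reference named in the header has a matching FASTA record; the comparison is case-sensitive by design.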

SAM Alignment Section

The alignment section contains the sequence and quality information, ideally in a sorted order to reduce file size and improve indexing. Each read is contained on a single line, and all fields are tab-delimited and in an order defined by the SAM specification guide. Aside from the read ID (QNAME in SAM jargon), SEQ, and QUAL fields, most other fields are determined and reported by the software used to generate the SAM/BAM file and should not be manually edited. Below is an example alignment section that continues from the above example header.


Note that the header and alignment section are internally consistent: Each read has an RNAME (reference, 3rd value) that matches an SN tag value from the header (e.g., “CHROMOSOME_I”), and the read group tag (“RG:Z:”) is consistent with the read group ID in the header (“1”). It is also important to ensure that the FLAG fields (2nd value in each line) are correctly set for the data; the SRA pipeline will attempt to resolve incorrect FLAG values, but sufficiently incorrect values can lead to processing errors.
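
The internal consistency described above can be sketched in Python as follows. This is an illustrative check only, not the SRA pipeline's logic: the function name check_sam_consistency is hypothetical, the input is assumed to be SAM text (header followed by alignment lines), and FLAG values are not examined.

```python
def check_sam_consistency(sam_text):
    """Report alignment lines whose RNAME or RG tag is not declared in the header."""
    references, read_groups, problems = set(), set(), []
    for number, line in enumerate(sam_text.splitlines(), 1):
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            # Header line: collect SN values from @SQ and ID values from @RG
            tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
            if fields[0] == "@SQ" and "SN" in tags:
                references.add(tags["SN"])
            elif fields[0] == "@RG" and "ID" in tags:
                read_groups.add(tags["ID"])
            continue
        if len(fields) < 11:
            problems.append((number, "fewer than 11 mandatory fields"))
            continue
        rname = fields[2]  # third column; '*' means no reference
        if rname != "*" and rname not in references:
            problems.append((number, f"RNAME {rname!r} not in @SQ header"))
        for tag in fields[11:]:
            if tag.startswith("RG:Z:") and tag[5:] not in read_groups:
                problems.append((number, f"read group {tag[5:]!r} not in @RG header"))
    return problems
```

Each reported problem is a (line number, message) pair, so an empty list indicates that the checked fields agree with the header.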

External Documents and Links

SAMtools software and SAM-format specification documents:

Standard Flowgram Format (SFF)

454 Life Sciences (now part of Roche) and NCBI developed SFF to encode 454 flowgrams. In the absence of a BAM file, SFF is the preferred input format for 454 data. IonTorrent data can also be submitted as SFF. Extensive technical details about the format can be obtained here. In general, though, submitters of SFF data should ensure that the data are demultiplexed (if barcoded) – this is particularly common in pyrotag / 16S rRNA amplicon sequencing.

External Documents and Links

Tools for viewing and processing SFF files:

PacBio HDF5

Pacific Biosciences uses HDF5, a container file with a directory-like structure, to store raw data. The SRA accepts both bas.h5 and bax.h5 file submissions. Note that submission of data from the RS II instrument requires one (1) bas.h5 file and three (3) bax.h5 files.

External Documents and Links

PacBio documentation on bax.h5 / bas.h5 format (PDF): Reference Guide.pdf

HDF5 tools:


FASTQ

FASTQ is not a formally specified file format, but a convention similar to “FASTA”. It consists of read name headers, nucleotide base calls, and per-base quality scores in text form. There are many variations.

The following terms and formats are defined in general:

 READNAME = Text string terminated by white space.
    BASES = [ACGTNactgn.]+
QUALITIES = [0-9]+ | <quality>\s[0-9]+ (Decimal-encoding, whitespace or tab-delimited)
            [\!\"\#\$\%\&\'\(\)\*\+,\-\.\/0-9:;<=>\?\@A-I]+ (Phred-33 ASCII)
            [\@A-Z\[\\\]\^_`a-h]+ (Phred-64 ASCII)

The permissible FASTQ format is simply:

@READNAME
BASES
+[READNAME]
QUALITIES

where READNAME, BASES, and QUALITIES are each newline-separated.

As indicated above, the QUALITIES string can be whitespace-separated numeric Phred scores or an ASCII string of the Phred scores with the ASCII character value = Phred score plus an offset constant used to place the ASCII characters in the printable character range. There are 2 predominant offsets: 33 (0 = !) and 64 (0=@).
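
The relationship between the ASCII character and the Phred score can be sketched in Python as follows. The function names are hypothetical. The guess_offset helper relies only on the character ranges given in the definitions above (Phred-33 uses ‘!’ through ‘I’, Phred-64 uses ‘@’ through ‘h’) and returns None when a string fits both ranges.

```python
def decode_qualities(qual, offset=33):
    """Convert an ASCII quality string to Phred scores: score = ord(char) - offset."""
    scores = [ord(char) - offset for char in qual]
    if any(score < 0 for score in scores):
        raise ValueError("character below the chosen offset; wrong offset?")
    return scores


def parse_decimal_qualities(qual):
    """Parse whitespace- or tab-delimited decimal Phred scores."""
    return [int(token) for token in qual.split()]


def guess_offset(qual):
    """Guess the offset from the character range, or return None if ambiguous."""
    if any(char < "@" for char in qual):
        return 33  # characters below '@' cannot occur with offset 64
    if any(char > "I" for char in qual):
        return 64  # characters above 'I' fall outside the Phred-33 range above
    return None
```

For example, decode_qualities("!I") gives scores 0 and 40 with the default offset of 33, and decode_qualities("@h", offset=64) gives the same scores with offset 64.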

Paired-end FASTQ

Paired-end data submitted in FASTQ format should be submitted in one of two formats: (1) As separate files for forward and reverse reads, in which the reads are in the same order. (2) As interleaved, or “8-line”, FASTQ, in which forward and reverse reads alternate in the file and are in order (i.e., read “1F”, followed by read “1R”, then read “2F”, then “2R”, etc.).

Concatenated FASTQ (in which all forward reads are followed by all reverse reads) is not supported.
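
To illustrate the interleaved (“8-line”) layout, the following minimal Python sketch merges separate forward and reverse inputs into interleaved order. It assumes four-line FASTQ records with no line wrapping and reads in the same order in both inputs; the function names fastq_records and interleave are hypothetical.

```python
from itertools import zip_longest


def fastq_records(lines):
    """Group an iterable of lines into four-line FASTQ records."""
    stripped = (line.rstrip("\n") for line in lines)
    for record in zip_longest(*[stripped] * 4):
        if None in record:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(forward_lines, reverse_lines):
    """Yield lines in the order 1F, 1R, 2F, 2R, and so on."""
    pairs = zip_longest(fastq_records(forward_lines), fastq_records(reverse_lines))
    for forward, reverse in pairs:
        if forward is None or reverse is None:
            raise ValueError("forward and reverse inputs differ in read count")
        yield from forward
        yield from reverse
```

Because both inputs are consumed lazily, open file handles can be passed directly without loading the files into memory.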


FASTA files may be submitted if accompanied by corresponding QUAL files. These are recognized in the SRA data processing pipeline as equivalent to FASTQ and should be specified as “fastq” when submitting the data files. Borrowing from the FASTQ description above, the general format for FASTA/QUAL pairs is:

>READNAME
BASES

>READNAME
QUALITIES

Where READNAME must be identical between files for a given read, and QUALITIES are generally whitespace- or tab-separated decimal values. Note the following guidelines for FASTA/QUAL pairs of files:

  • In a given pair of files, there must be the same number of reads in both.
  • For a given read, there must be the same number of BASES and QUALITIES, i.e., if the BASES are trimmed to remove barcodes, then the same scores must be removed from the QUALITIES, etc.
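
The two guidelines above can be checked with a short Python sketch. It assumes that each record starts with a “>” header line, that the read name is the first whitespace-delimited token of that header, and that QUAL values are decimal; the function names read_named_records and check_fasta_qual are hypothetical.

```python
def read_named_records(text):
    """Parse '>NAME' records into (name, body) pairs; body lines are joined by spaces."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            records.append((name, []))
        elif records:
            records[-1][1].append(line)
    return [(name, " ".join(body)) for name, body in records]


def check_fasta_qual(fasta_text, qual_text):
    """Return a list of problems found in a FASTA/QUAL pair (empty if none)."""
    sequences = read_named_records(fasta_text)
    qualities = read_named_records(qual_text)
    problems = []
    if len(sequences) != len(qualities):
        problems.append(f"read count differs: {len(sequences)} vs {len(qualities)}")
    for (seq_name, bases), (qual_name, scores) in zip(sequences, qualities):
        if seq_name != qual_name:
            problems.append(f"name mismatch: {seq_name} vs {qual_name}")
        n_bases = len(bases.replace(" ", ""))
        n_scores = len(scores.split())
        if n_bases != n_scores:
            problems.append(f"{seq_name}: {n_bases} bases but {n_scores} qualities")
    return problems
```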

Vendor-specific FASTQ variants

Illumina FASTQ

There are two general styles of FASTQ produced by Illumina machines. The older format is emitted from Gerald, the secondary analysis pipeline. This format contains 64-offset (ASCII ‘@’ = 0) quality encoding. Paired end data are presented in the orientation in which they were sequenced (5’-3’-3’-5’).

The index and read number labels are defined as:

  • Index: string. Currently 0, should be the index of the multiplexed sample in barcoded experiments, for example @EAS51_105_FC20G7EAAXX_R1:1:1:471:409#ATCACG/2
  • Index values are processed and stored as SRA spot groups
  • Read Number: 1 for single reads; 1 or 2 for paired ends.



The newer Illumina FASTQ variant (as of CASAVA 1.8) uses 33-offset quality encoding (ASCII ‘!’ = 0) and has a different READNAME format:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>

Specific example:

@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
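
The READNAME layout above can be split into its labeled parts with a regular expression. The following Python sketch is one possible reading of that layout, not an official parser: the function name parse_casava18 is hypothetical, and the pattern assumes numeric run, lane, tile, position, read, and control fields and a Y/N filter flag, as in the example.

```python
import re

CASAVA18_NAME = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r" (?P<read>\d+):(?P<is_filtered>[YN]):(?P<control>\d+):(?P<index>\S*)$"
)


def parse_casava18(header):
    """Return the named parts of a CASAVA 1.8 style header, or None if it does not match."""
    match = CASAVA18_NAME.match(header)
    return match.groupdict() if match else None
```

Applied to the specific example above, the sketch reports instrument EAS139, run 136, flowcell FC706VJ, lane 2, tile 5, read 1, and index sequence ATCACG. Older-style headers that carry the index after “#” do not match and return None.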


The 454 READNAME is a 14-character alphanumeric string that encodes the plate, region, and raster address of the read. The plate name is an encoding of a timestamp plus a one-character hash value, which makes it virtually unique. The region is a two-digit decimal indicating the gasket division (there is always at least one gasket). The raster coordinate encodes the x and y coordinates on the plate, modulo 4096, in base 36. Paired-end data are presented in the orientation in which they will be aligned to a reference (5’-3’-5’-3’), which is the same orientation in which they were sequenced.

 <plate> = [A-Z0-9]{7}
<region> = [0-9]{2}
    <xy> = [A-Z0-9]{5}
QUALITIES = [0-9]+ | <quality>\s[0-9]+
 READNAME = <plate><region><xy>

Reads are sorted in order of READNAME. Records are variable length. Files are analogous to FASTA/QUAL (described above), and should be specified as ‘fastq’ in SRA submissions.
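
The READNAME bindings above translate directly into a regular expression. The following Python sketch only validates and splits a name; it does not decode the timestamp or the raster coordinates. The function name split_454_readname is hypothetical, and the read name in the usage note is made up for illustration.

```python
import re

NAME_454 = re.compile(
    r"^(?P<plate>[A-Z0-9]{7})(?P<region>[0-9]{2})(?P<xy>[A-Z0-9]{5})$"
)


def split_454_readname(name):
    """Split a 14-character 454 read name into plate, region, and xy, or return None."""
    match = NAME_454.match(name)
    return match.groupdict() if match else None
```

For the hypothetical name C3U5GWL01CBXT2, the sketch yields plate C3U5GWL, region 01, and xy CBXT2.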

The grammar for the 454 sequence file:


The grammar for the 454 qualities file:


Helicos FASTQ

Bindings for Helicos FASTQ are:

<flowcell> ::= VHE-[0-9]+
<channel> ::= 1-25
<field> ::= 1-1100
<camera> ::= [1234]
<position> ::= 1-100000
<sep> ::= [-]
READNAME ::= <flowcell><sep><channel><sep><field><sep><camera><sep><position>
QUALITIES ::= [!-I]+

A single record grammar is:


For example,


Helicos reads can be variable in length, but the number of BASES and QUALITIES must be the same for a given read.
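
The Helicos READNAME bindings above can be checked with a short Python sketch. The function name parse_helicos_readname is hypothetical, the numeric ranges are copied from the bindings, and the read name in the usage note is made up for illustration.

```python
import re

HELICOS_NAME = re.compile(
    r"^(?P<flowcell>VHE-[0-9]+)-(?P<channel>[0-9]+)-(?P<field>[0-9]+)"
    r"-(?P<camera>[1-4])-(?P<position>[0-9]+)$"
)
# Inclusive ranges taken from the bindings above
LIMITS = {"channel": (1, 25), "field": (1, 1100), "position": (1, 100000)}


def parse_helicos_readname(name):
    """Return the parts of a Helicos read name, or None if it violates the bindings."""
    match = HELICOS_NAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    for key, (low, high) in LIMITS.items():
        if not low <= int(parts[key]) <= high:
            return None
    return parts
```

For the hypothetical name VHE-242383071011-15-1-2-17, the sketch reports channel 15, field 1, camera 2, and position 17. Note that the flowcell itself contains the separator character, which is why a regular expression is used instead of a plain split.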

SOLiD native

SOLiD users may submit CSFASTA and QUAL files as SOLiD native data. Primary analysis output of the SOLiD system is in color space. Paired end data are presented in the same orientation in which they were sequenced (5’-3’-5’-3’).

Specific bindings for the ABI SOLiD System are:

<flowcell> = [a-zA-Z0-9_-:]{2}+
   <slide> = 0..1
   <panel> = 1..4096
       <X> = 1..4096
       <Y> = 1..4096
     BASES =  [TtGg][0123\.]+
 QUALITIES = [0-9]+ | <quality>\s[0-9]+ 
     <sep> = [_]
  READNAME = <flowcell><sep><slide><sep><panel><sep><X><sep><Y>
   TAGNAME = <panel><sep><X><sep><Y><sep><tag>

The interpretation of the separator (<sep>) is right associative. Reads are sorted in panel order within a given set of related files. All SOLiD data are fixed length.

The files have an optional header that is identified by lines that begin with the hash/pound/number sign (#). The HEADER can be defined as:

# <date> <path> [--flag]* --tag <tag> --minlength=<length> --prefix=<prefix> <path>
# Cwd: <path>
# Title: <flowcell>

The grammar for the CSFASTA file is:

#HEADER (multiple lines)

The grammar for the QUAL file is:

#HEADER (multiple lines)

As with FASTA/QUAL pairs, there are several rules for pairs of CSFASTA/QUAL files. TAGNAME must be identical between files for a given read, and QUALITIES are generally in whitespace or tab-separated decimal values. Note the following guidelines for CSFASTA/QUAL pairs of files:

  • In a given pair of files, there must be the same number of reads in both.
  • For a given read, there must be the same number of color space digits and QUALITIES, i.e., the BASES line is typically 1 character longer than the number of QUALITIES (due to the color space indexing base that begins each BASES string).
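
The length rule for CSFASTA/QUAL pairs can be expressed in a few lines of Python. This is a minimal sketch that uses the BASES binding given above; the function name csfasta_lengths_agree is hypothetical.

```python
import re

# BASES binding from above: one indexing base followed by color space digits
CS_BASES = re.compile(r"^[TtGg][0123.]+$")


def csfasta_lengths_agree(bases, qualities):
    """True if the number of color space digits equals the number of quality values."""
    if not CS_BASES.match(bases):
        raise ValueError("BASES does not match the color space binding")
    color_digits = len(bases) - 1  # exclude the leading indexing base
    return color_digits == len(qualities.split())
```

For example, the BASES string T0123 has four color space digits, so it agrees with a QUALITIES line of four values and disagrees with a line of three.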

External Documents and Links

Applied Biosystems documentation on 2 base encoding (PDF):

Complete Genomics (CG) native

The SRA is able to process Complete Genomics data, though these data require a unique workflow for submission: the SRA pulls the data in its native directory structure directly from S3. Please contact the SRA for details (sra@ncbi.nlm.nih.gov).

External Documents and Links

Complete Genomics documentation on formats:

Legacy formats

Sequence Read Format (SRF)

SRF is a community standard developed by James Bonfield and Asim Siddiqui. It has been used to contain large amounts of Illumina and SOLiD data for deposit and has served as a backing storage format. Several implementations exist. The io_lib-based implementations are maintained as part of the Staden package.

External Documents and Links

Sequence Read Format (SRF) homepage:

Illumina native

Submitters may submit native data from the primary analysis output of the Illumina GA. The filetype is “Illumina_native” and constituent files for a run should be tarred together into a single tar file.

Illumina GA readname can be defined as follows:

<flowcell> = [a-zA-Z0-9_-]{2}+
    <lane> = 1..8
    <tile> = 1..1024
       <X> = 0..4096
       <Y> = 0..4096
<sep> ::= [_:\t]
READNAME ::= [<flowcell><sep> | s_]<lane><sep><tile><sep><X><sep><Y>

The interpretation of the separator (<sep>) is right associative. Within a related set of files, reads are grouped by tile. Reads should be fixed length, and the number of quality scores and bases is the same in each.

Allowed characters:

    BASES = [AaCcTtGgNn\.]+
QUALITIES = [\!\"\#\$\%\&\'\(\)\*\+,\-\.\/0-9:;<=>\?\@A-I]+


qseq

The basecalling program Bustard emits a _qseq.txt file for each lane (two files for mate pairs). Paired-end data are presented in the orientation in which they were sequenced (5’-3’-3’-5’).

Each read is contained on a single line with tab separators in the following format:

  • Machine name: unique identifier of the sequencer.
  • Run number: unique number to identify the run on the sequencer.
  • Lane number: positive integer (currently 1-8).
  • Tile number: positive integer.
  • X: x coordinate of the spot. Integer (can be negative).
  • Y: y coordinate of the spot. Integer (can be negative).
  • Index: positive integer. No indexing should have a value of 1.
  • Read Number: 1 for single reads; 1 or 2 for paired ends.
  • Sequence (BASES)
  • Quality: the calibrated quality string. (QUALITIES)
  • Filter: Did the read pass filtering? 0 - No, 1 - Yes.
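
The eleven tab-separated fields listed above map naturally onto a small record type. The following Python sketch is an illustration only: the names QseqRecord and parse_qseq_line are hypothetical, and the run number and index are kept as strings so that no assumption is made about their exact form.

```python
from typing import NamedTuple


class QseqRecord(NamedTuple):
    machine: str
    run: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    read_number: int
    bases: str
    qualities: str
    passed_filter: bool


def parse_qseq_line(line):
    """Split one tab-separated qseq line into the eleven fields listed above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, found {len(fields)}")
    machine, run, lane, tile, x, y, index, read, bases, quals, flt = fields
    if len(bases) != len(quals):
        raise ValueError("BASES and QUALITIES differ in length")
    # X and Y may be negative; int() accepts a leading minus sign
    return QseqRecord(machine, run, int(lane), int(tile), int(x), int(y),
                      index, int(read), bases, quals, flt == "1")
```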

seq, prb, int

The _seq.txt, _prb.txt, and _int.txt files are emitted by Bustard, the primary analysis program. Illumina pipeline versions 1.3 and earlier produced tab-delimited files in the following formats, first defined in version 1.1 of the GA pipeline:

The sequence text files (_seq.txt) have this format:


The qualities text files (_prb.txt) have four scores per base call in this format:

<READNAME>\t{%d %d %d %d}+  with value range [-40,40]

The intensity text files (_int.txt) have four scores per base call in this format:

<READNAME>\t{%5.1f %5.1f %5.1f %5.1f}+  with value range [-16384.0,16383.0]

Each of these files was either presented tile by tile, or in one file per lane. The number of reads must be equal between the input files for a lane. Illumina pipeline versions 1.4 and later could only produce these files by running Bustard under non-default conditions.
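
As an illustration of the _prb.txt layout described above (a READNAME, a tab, then four integer scores per base call), the following Python sketch groups the scores per base and checks the stated value range. The function name parse_prb_line is hypothetical, and the sketch accepts either spaces or tabs between the score groups because the layout above does not fix that separator.

```python
def parse_prb_line(line):
    """Return (readname, [(s1, s2, s3, s4), ...]) for one _prb.txt line."""
    readname, _, rest = line.rstrip("\n").partition("\t")
    values = [int(token) for token in rest.split()]
    if not values or len(values) % 4 != 0:
        raise ValueError("expected four scores per base call")
    if any(not -40 <= value <= 40 for value in values):
        raise ValueError("score outside the range [-40, 40]")
    return readname, [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
```

The number of tuples returned equals the read length, which makes it easy to compare against the corresponding _seq.txt record.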

Illumina scarf

Another text file output by the Gerald analysis stage is a single colon-separated file with one record per line containing the read name, sequence, and quality.
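
Because Illumina read names themselves contain colons, a scarf record is most safely split from the right. The following Python sketch assumes that the quality field contains no colon (as is the case for 64-offset ASCII or decimal qualities); the function name parse_scarf_line is hypothetical, and the record in the usage note is made up for illustration.

```python
def parse_scarf_line(line):
    """Split a scarf record into (readname, bases, qualities), splitting from the right."""
    readname, bases, qualities = line.rstrip("\n").rsplit(":", 2)
    return readname, bases, qualities
```

For the hypothetical record HWI-EAS102:1:2:3:4:ACGTN:abcde, the sketch returns the read name HWI-EAS102:1:2:3:4, the sequence ACGTN, and the quality string abcde.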

Overview of SRA output formats

SRA native format (VDB)

SRA files do not have a fixed format, but are actually portable database files (VDB; “vertical database”) with embedded schema. The schema is recorded on a per-object basis, allowing us to change schema over time while ensuring that older databases remain accessible. The database-like structure of SRA data files allows relatively simple interconversion between multiple different formats. The various ‘dump’ utilities of the SRA Toolkit are specifically designed to provide this conversion. The utility ‘vdb-dump’ can be used to interrogate the native SRA data format directly.


The Toolkit utility ‘sam-dump’ can be used to output any SRA data file into SAM format. Note that only data submitted with alignment data (e.g., submitted as aligned BAM) will output aligned SAM. All other datasets will output unaligned, un-headered SAM.


All SRA data can be converted to FASTQ format using ‘fastq-dump’. Since SRA data are stored in a concatenated form, it is important to note that specific options may have to be invoked in order for paired-end fastq to be formatted correctly during output. It is recommended that new users review fastq-dump documentation to ensure proper output formatting before committing to large dataset extractions.


Only those datasets submitted as SFF are suitable for conversion back into SFF format. All other submitted data formats lack the information required to generate SFF. Consequently, the utility ‘sff-dump’ will provide a clear error message if a given dataset cannot be converted to SFF.


All SRA data can be output into color space data. The utility ‘abi-dump’ can be used to output CSFASTA and QUAL data files (with appropriate options, fastq-dump can be used to output “CSFASTQ” format).

Illumina native formats

All SRA data can be output into Illumina native format, as it is functionally similar to FASTQ. The Toolkit utility ‘illumina-dump’ can be used to output data into “standard” Illumina native, or qseq depending on the options invoked.
