NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

SRA Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010-.

Bookshelf ID: NBK47532

Submission Guide

Created: January 13, 2009; Last Update: January 20, 2012.

1 Overview

1.1 Scope

The Sequence Read Archive (SRA) at NCBI accepts primary sequencing data from so-called “next generation” sequencing platforms, including Roche 454®, Illumina®, Life Technologies SOLiD®, Helicos Biosciences HeliScope®, CompleteGenomics®, and Pacific Biosciences SMRT®.

Sequencing data from these platforms should be submitted to the SRA rather than the regular Trace Archive. The Trace Archive is intended as the repository of sequencing data from gel/capillary platforms (Applied Biosystems 370® and 3730®, Megabace, and Licor sequencers).

1.3 Notices

Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government, and shall not be used for advertising or product endorsement purposes.

2 Terms of Usage

2.1 Permanence

Accessions issued by the SRA are always maintained and never reused. If a desired record has been withdrawn, then a message to this effect will be displayed to anyone who tries to access it. If a record has been superseded by a successor record, this fact will be presented to anyone trying to access it. Only in rare cases where the record needs to be expunged from the archive will a user not be able to access it.

2.2 Authentication

Submissions are managed through secure channels. These channels include PDA, NIH level login through CIT, and FTP accounts secured by passwords. We will correspond with submitters via email about submission and curation issues, but we do not exchange data by email. At this time MyNCBI is used for authenticating to the SRA submission pages or accounts.

Please keep your PDA and file transfer accounts secure. Please do not reuse someone else’s accounts. Center accounts are provided for the convenience of automated pipelines and where multiple users need to manage submissions. The authentication information for such an account should be maintained securely by the Center. Accounts may be disabled or withdrawn after a long period of disuse in order to comply with NCBI security requirements.

2.3 Limitations

The Sequence Read Archive at NCBI is a public resource and the decision whether to submit data to this resource is the responsibility of the submitter. Prospective submitters should be aware of the following issues:

Never submit data without the permission of the principal investigator.

Most human data gathered from research subjects are under strict privacy controls and/or usage restrictions and must be handled with protections as determined by the research institution’s Institutional Review Board (IRB), the funding agencies, and the laws of the United States or the submitter’s home country. The dbGaP resource at NCBI may be a more appropriate broker for human sequencing data requiring controlled access due to these considerations. Data from whole genome, transcriptome, epigenome, and metagenome (which may include human contaminants) may fall into this category. Data gathered from human subjects, certain cell lines, and metagenomes may be covered as well.

Data submitted as part of a journal manuscript may have a publication embargo placed on it by the journal editors. The submitter can place a “hold until publish” restriction on the submission to the SRA as part of the submissions process.

Data that might relate to patents and intellectual property may be submitted to NCBI, but the submitter is responsible for ensuring that procedures and policies of his/her institution or company are observed.

Some environmental data gathered in the territory of certain countries, including territorial waters, may have sovereign legal restrictions on their use. NCBI cannot accept such data since NCBI is not able to enforce any usage restrictions.

Submitters must ensure that data obtained as part of a criminal investigation is free of any judicial restrictions on its use.

Submitters are responsible for obtaining all necessary permissions from the collecting institution for forensic and paleontological data.

The United States and many other countries have laws governing trade in endangered species. Please be aware that nucleotide material gathered from such samples may be subject to restrictions. Over-specific metadata accompanying submissions may also be inappropriate when the samples are rare or endangered.

2.4 Modification

NCBI allows submitters to modify their records. Such requests must be formally entered using the SRA submission mechanisms. Informal requests by email will not be accepted. Only the center or individual that created the record can change it. Please write NCBI if you have changed affiliations and wish to update old records. This may require agreement from the original institution.

2.5 Curation

From time to time records deposited at the SRA must be updated with changes needed in order that the data continue to conform with the data model for the archive, to update data as it changes (for example finalizing publication information), to change data that are clearly wrong (for example correcting external references to other data or resources), and to add additional relevant metadata as they become available. NCBI will contact the submission owner on a best effort basis. The submission owner should maintain up to date contact information with NCBI to receive word of such changes.

Actual instrument data are not changed by NCBI. Only the submitter can make such modifications.

2.6 Availability

While NCBI tries to maintain maximum uptime of its servers on a 24x7 basis, no guarantee of availability is offered to users. Submissions that are interrupted by downtime may have to be restarted by the user.

Technical assistance is available on a limited basis during business hours USA Eastern Time. There is no guarantee for level of service regarding manual assistance.

3 Data Model

The SRA data model is discussed in detail elsewhere, but here is a brief overview.

The SRA tracks the following five objects:

Study - Identifies the sequencing study or project and may contain multiple experiments.

Sample - Identifies the organism, isolate, or individual being sequenced. Each unique source will require a sample. Samples may be reused in multiple studies.

Experiment - Specifies the sample, sequencing protocol, sequencing platform, and data processing that will result in one or more runs.

Run - Identifies run data files, the experiment they are contained in, and any runtime parameters gathered from the sequencing instrument.

Analysis - Packages data associated with sequence read objects that are intended for downstream usage or that otherwise need an archival home. Examples include assemblies, alignments, spreadsheets, QC reports, and read lists.

Image srasubmit_f1.jpg

In addition, all details concerning submissions are contained in a separate document called Submission, which contains center specific submitting information, contacts, actions for the archive, and an optional file manifest.

Objects can be archived in the SRA at different points in time, and multiple submission documents can be submitted. For example, study, sample, and experiment objects can be created at an early stage, with run data being submitted as the data are produced.

All SRA objects created with XML files can be referenced by an alias. Prior to receiving an accession, the alias is the unique identifier of the metadata object. Each alias must be unique for an object type within that submitter’s namespace. Once the metadata object has been created, it can be referenced either through the alias or by accession.
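The uniqueness rule for aliases can be checked locally before any XML is sent. The following is a minimal Python sketch, not part of the SRA tooling; the function name find_duplicate_aliases and the (object type, alias) tuple layout are illustrative assumptions.

```python
from collections import Counter


def find_duplicate_aliases(objects):
    """Return the sorted (object_type, alias) pairs that occur more than once.

    `objects` is an iterable of (object_type, alias) tuples drawn from one
    submitter's namespace. The tuple layout is an assumption for illustration.
    """
    counts = Counter(objects)
    return sorted(pair for pair, n in counts.items() if n > 1)
```

The same alias may appear under two different object types without conflict, so the pair, not the alias alone, is what gets counted.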

4 Obtaining NCBI Accounts Needed for Submission

4.1 Establish a NCBI Identity

Before interacting with NCBI, please obtain a personal identity account. This will allow you to make submissions, track results, change records now or later, and hold or release records. There are two kinds of NCBI identity sufficient to do business with SRA:

NCBI PDA - NCBI-created and managed account for primary data submitters. If you belong to a submitting Center and will play a role in monitoring and maintaining primary data submissions, please identify this fact through your account profile and also email SRA with this information.

NIH - For any NIH personnel who has credentials managed through CIT and can use their NIH identity to login

4.2 Establish a Center Name

Certain submitters might find it necessary to establish a more formal relationship for their sequencing center or lab. These are typically large groups that are planning to use an automated system to create submissions and transfer their files.

                     | Individuals                                    | Centers
Tracking             | Interactive tool only                          | XML telemetry available
Submissions          | Low frequency                                  | Often > 10 per year
Contact Information  | PDA account                                    | Permanent contacts required
Time to Submit       | Immediate                                      | Requires setup
Size of Files        | Files usually < 10 GB (due to FTP constraints) | Any file size
Users Able to Update | Submitter only                                 | Any account linked to center
Maintenance          | Interactive tool only                          | Interactive or XML updates
Status Notices       | No notices                                     | Notice for: updates, outages
Uploads              | FTP only                                       | FTP or Aspera

If your lab, center, or group has submitted in the past, there might already be a center established. Please check the following list for your center or lab: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/reports/Centers/centers.tab Please contact sra/at/ncbi.nlm.nih.gov to be added to an established Center’s user list or to provide information for creating a new Center.

To create a new Center, please provide the following information:

1. suggested center abbreviation (16 char max)

2. center name (full)

3. center URL

4. center mailing address (including country and postcode)

5. phone number (main phone for center or lab)

6. contact person (someone likely to remain at the location for an extended time)

7. contact email (ideally a service account monitored by several people)
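Before writing to NCBI, the seven items above can be checked for completeness. This is a minimal Python sketch under stated assumptions: the dictionary keys and the function name check_center_request are hypothetical, and only the 16-character limit comes from the list above.

```python
# Hypothetical field names for the seven items requested above.
REQUIRED_FIELDS = (
    "abbreviation", "name", "url", "address", "phone",
    "contact_person", "contact_email",
)
MAX_ABBREVIATION_LENGTH = 16  # "16 char max" from the list above


def check_center_request(request):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Treat absent and blank values the same way.
        if not str(request.get(field, "")).strip():
            problems.append("missing " + field)
    abbreviation = str(request.get("abbreviation", "")).strip()
    if len(abbreviation) > MAX_ABBREVIATION_LENGTH:
        problems.append("abbreviation longer than 16 characters")
    return problems
```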

Please read section 5.3 Transmitting Data to NCBI to select a method of data transfer for your center.

5 Submitting Data

5.1 Understanding Submission Modes

5.1.1 High Throughput Submissions

NCBI has a fully automated pipeline for processing data files and XML metadata for submissions to SRA. NCBI can provide recommendations and feedback for those submitters wishing to automate their submission process to SRA.

5.1.2 Individual Submissions

Individual submitters may submit files through private FTP or Aspera to NCBI. These accounts are shared by all such submitters, but the files once written are not readable even by the submitter. In this case the center name is “Individual” and submissions are not tracked by institution. Please ensure that contact information written into the submission documents will allow NCBI to confirm receipt of the files and to communicate any problems.

To submit individually, please

  • Create a NCBI PDA account

  • Complete submission metadata on the SRA website.

  • Write sra/at/ncbi.nlm.nih.gov to request the FTP address of the current private FTP account or to be sent a private key for using the Aspera account. Please include the accession of the completed submission when requesting the upload information.

  • For FTP, use put to transmit the file(s) to the private FTP box.

  • For Aspera, use the ascp program to transfer data files to the private account.

The Individual Submissions channel is intended for small submissions or uploads of test submissions when the submitter does not yet have a dedicated account.

5.1.3 Interactive Submissions

A web tool for creation and management of SRA submissions is available at the following location. http://www.ncbi.nlm.nih.gov/Traces/sra_sub/sub.cgi?&m=submissions&s=default

An NCBI PDA account is required to use the interactive tool.

5.2 Packaging Data for Submission

Multiple data files may be packed into a tar archive; however, each file within the tar archive must be individually specified in a run to be linked properly.

5.2.1 Data for Interactive Submissions

The interactive tool does not have the ability to transmit files. Users will need to use Aspera or FTP to transmit their data files once their submission metadata is complete.

5.2.2 Bulk Submissions

For bulk submissions where XML documents describing the submission and metadata accompany the run data, please follow these guidelines:

  • Always include a submission.xml file with your submission

  • Please do not send bare XML files; always package XML in an uncompressed tar file.

  • Please send run data files separately from XML.

  • Please do not compress data files. This delays processing of the submission.
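The packaging guidelines above can be sketched with Python's standard tarfile module. This is an illustration only, not an NCBI tool; the function name package_xml is an assumption. Opening the archive with mode "w" (rather than "w:gz" or "w:bz2") keeps it uncompressed.

```python
import tarfile
from pathlib import Path


def package_xml(xml_paths, tar_path):
    """Write the given XML files into an uncompressed tar archive.

    Raises ValueError unless a file named submission.xml is among the inputs,
    following the first guideline above.
    """
    paths = [Path(p) for p in xml_paths]
    if not any(p.name == "submission.xml" for p in paths):
        raise ValueError("submission.xml must be included")
    with tarfile.open(tar_path, mode="w") as archive:  # "w" = no compression
        for p in paths:
            # Store each file under its bare name, without directory components.
            archive.add(str(p), arcname=p.name)
    return tar_path
```

Run data files are deliberately left out of this archive, since the guidelines ask for them to be sent separately from the XML.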

5.3 Transmitting Data to NCBI

You will have to transmit your run data files to NCBI. This cannot be done through the interactive submission tool. Run data files (SFF, SRF, etc.) can be quite large. It is NOT necessary to compress files transmitted to NCBI, but files compressed with either gzip or bzip2 can be processed.

5.3.1 FTP

The FTP service provided to established centers has long been the normal method for transferring trace data with NCBI. Users are encouraged to switch to the Aspera client for downloads, and to use the included program ascp (Aspera secure copy) for uploads.

5.3.1.1 Limitations using FTP

Traditionally NCBI has relied on FTP as the means for transferring large files. Bandwidth for transfers is typically 100 Mbps but slower for international transfers. NCBI asks that submitters not use FTP for transfers of files larger than 10 GB due to problems with complete transfer of very large files. Limitations of the FTP protocol and the transmission path between NCBI and file submitters make this requested size cap necessary.

5.3.1.2 Individual Submissions via FTP

The NCBI Trace Archives maintains a private FTP address available to individual submitters. After completing metadata for a submission, please write to sra/at/ncbi.nlm.nih.gov for the current address, which contains both the FTP address and login string. The unix/linux shell command will be

ftp://sra:<password>@ftp-private.ncbi.nlm.nih.gov/

Please observe the following rules when using this submission method:

Maximum 10 Gigabyte file size

Maximum 10 file limit per submission

Choose a unique filename that also will be easy for you to identify

This directory has special access rules. You can stat the directory (list the files), but you cannot read any file (or download a file). Once deposited, a file cannot be overwritten. The files are removed as soon as processed, or if they have remained too long on the server. It is your responsibility to complete the submission transaction in a reasonable amount of time so that the files you have deposited through this channel can be processed by the submission system.
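The size and count rules above can be checked before starting a transfer. The following Python sketch is illustrative only; the function name is an assumption, and "10 Gigabyte" is read here as 10 x 1024^3 bytes, which the text does not specify.

```python
MAX_FILES = 10                # "Maximum 10 file limit per submission"
MAX_BYTES = 10 * 1024 ** 3    # 10 GB, read as binary gigabytes (an assumption)


def check_individual_upload(sizes_by_name):
    """Return rule violations for a mapping of file name -> size in bytes."""
    problems = []
    if len(sizes_by_name) > MAX_FILES:
        problems.append("more than %d files" % MAX_FILES)
    for name, size in sorted(sizes_by_name.items()):
        if size > MAX_BYTES:
            problems.append("%s exceeds 10 GB" % name)
    return problems
```

The sizes can be gathered with os.path.getsize on each local file. Uniqueness of a filename on the server cannot be verified locally, so that rule is left to the submitter.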

5.3.1.3 Bulk Submissions via FTP

High-volume submissions should be uploaded to the dedicated FTP account for your center.

For example, a user working for the mycenter center will deposit SRA data into the ‘short_read’ directory of the FTP account’s login directory as follows:

ftp: ftp-private.ncbi.nlm.nih.gov
login: mycenter_trc
passwd: !jXYZZ3@ce


 > cd short_read
 > put myfiles.tar.gz
 > quit

Please double check that the transmitted file size agrees with the original file.
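One way to make this check from a script is with Python's standard ftplib module, whose FTP.size method asks the server for a file's size. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and whether a given NCBI FTP account answers the SIZE command is not stated in this guide.

```python
import os


def remote_size(ftp, remote_name):
    """Ask an already logged-in ftplib.FTP connection for a file's size in bytes."""
    # Switch to binary mode first; some servers refuse SIZE in ASCII mode.
    ftp.voidcmd("TYPE I")
    return ftp.size(remote_name)


def sizes_agree(local_path, reported_size):
    """True when the size reported by the server equals the local file size."""
    return reported_size is not None and os.path.getsize(local_path) == reported_size
```

A mismatch usually means the transfer was cut short, in which case the file should be sent again.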

5.3.1.4 FTP from Windows

It is possible to upload to NCBI FTP sites from Windows. Use Windows Explorer to access the individual FTP address as follows. Then drag and drop the submission files from your source directory into the destination directory that the Explorer tool has opened.

Image srasubmit_f2.jpg

You can also login using your center account, and utilize Windows Explorer to navigate and upload.

5.3.1.5 Troubleshooting FTP

If you are having trouble with your FTP connection to NCBI, try

1. Set passive mode rather than active mode.

2. Ask your sysadmin to increase the FTP buffer size to 32 MB.

3. Try another host, or another platform (Windows instead of Unix).

4. Try another FTP client:

   Unix: ncftp (http://www.NcFTP.com)

   Windows: filezilla (http://filezilla.sourceforge.net/)

If you still have trouble, please write us with the following details:

1. time of transfer (GMT or local time)

2. IP address of FTP client (the system you are transmitting from)

3. version of operating system software (Unix - uname -a, or cat /proc/version)

4. FTP account used

5. specific error messages (connection closed, etc)

5.3.2 Disk and Tape

Archive users can also request or submit data on disk or tape. The following are requested:

  • LTO4 (we can also read LTO3 and LTO2)

  • External HDD with USB2.0 or FireWire interface enclosure with WinNT (FAT32) partition type, so any Windows or Linux computer can read them.

  • NTFS, Ext3, or other large format drives. Please ensure they are delivered with an enclosure. We prefer a USB interface.

For return of sent media, please provide a waybill for shipping. If you are requesting a download by disk or tape, please send us the media first, along with a waybill for return shipping.

Please use the following shipping address:

Martin Shumway, Staff Scientist
DHHS/NIH/NLM/NCBI
45 Center Drive
Bldg. 45/Room 6A N 24
MSC 6510
Bethesda, MD 20892
shumwaym@ncbi.nlm.nih.gov
tel: 301.402.4041
fax: 301.402.9651

5.3.3 Aspera

5.3.3.1 The fasp Protocol

The FASP protocol from Aspera (www.asperasoft.com) uses UDP (User Datagram Protocol), eliminating the latency issues seen with TCP, and provides bandwidth up to 1 Gbps to transfer data. It has a restart capability if data transfer is interrupted midstream and is well behaved. If there is other data traffic on your network connections, ascp will slow transfers to avoid starving other protocols. We have seen effective throughput up to 600 Mbps for a single site.

NCBI is implementing Aspera for two use cases: occasional users and those who download files through the SRA webpage (Aspera Connect client), and bulk users who will be uploading or downloading large amounts of data (ascp).

5.3.3.2 Aspera Connect

Aspera Connect is software that allows download via a web plugin for popular browsers on Linux, Windows, and Mac as well as a command line tool that allows scripted data transfer. Aspera Connect is free and NCBI site users may use Aspera Connect to exchange data with NCBI.

Download and install the Aspera Connect software from the Aspera Connect page (under the download tab). Version 2.4.0 and later provide many performance improvements and can improve transfer rates; please ensure that the latest version of Aspera Connect is being used. The default configuration of the Aspera Connect plugin is not optimal. Please go to Preferences -> Network, choose 'Specify exact connection speeds' and enter 622 Mbps for both Downstream and Upstream speeds. Also please uncheck 'Enable queuing' under the General tab.

5.3.3.3 Aspera Upload for Individual Users

NCBI has opened an Aspera channel for individual users to upload their submission data files. To upload to the account, users must have Aspera Connect installed and use the ascp program from the command line. The upload command will look like:

ascp -i <key file> -QT -l600m -k1 <file(s) to transfer> asp-sra@upload.ncbi.nlm.nih.gov:<directory>

The full or relative pathname of the private key file must be given. Many institutional networks block UDP transfers. Please check with your network administrator to ensure UDP transfer will be enabled for the following IP range: 130.14.29.0/24. Please complete all submission metadata through the interactive submission tool before uploading data files. Completed data files will be transferred out of the upload directory. Typically within 24 hours any completed data files will have been linked and loaded. Users will need to receive a private key from NCBI by contacting sra/at/ncbi.nlm.nih.gov. Please provide the SRA accession of a completed submission waiting for data files when requesting access to the individual Aspera upload channel.

5.3.3.4 Initiating an Account for Aspera Bulk Transfers

Please set up a Center identity for your institution or lab if you do not already have one. All users that will be associated with that account will need to log in to SRA using their PDA account so that their account may be linked to the Center.

Your local firewall must permit UDP data transfer on ports 33001-33009 for the following IP range: 130.14.29.0/24 in both directions to allow the fasp traffic to pass and must allow ssh traffic outbound to NCBI.
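When reviewing firewall logs or rules, it can help to test whether a given address and port fall inside the range described above. This is a minimal Python sketch using the standard ipaddress module; the function and constant names are illustrative assumptions.

```python
import ipaddress

# Values taken from the paragraph above.
NCBI_ASPERA_NETWORK = ipaddress.ip_network("130.14.29.0/24")
FASP_UDP_PORTS = range(33001, 33010)  # 33001-33009 inclusive


def is_fasp_endpoint(address, port):
    """True when address lies in 130.14.29.0/24 and port lies in 33001-33009."""
    in_network = ipaddress.ip_address(address) in NCBI_ASPERA_NETWORK
    return in_network and port in FASP_UDP_PORTS
```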

5.3.4 Microsoft Windows Users

Download puttygen: http://the.earth.li/~sgtatham/putty/latest/x86/puttygen.exe

Run puttygen.exe to create an ssh key:

Image srasubmit_f3.jpg

Make sure that the “SSH-2 RSA” option under Parameters is selected, and that the “Number of bits in a generated key” is set to 1024. Then press “Generate” and move the mouse to generate a key.

Generating a key will result in something like this:

Image srasubmit_f4.jpg

Click “Save Private Key” to retain the private key. NOTE - leave “Key passphrase” and “Confirm passphrase” empty (otherwise, you will be prompted to enter the passphrase whenever you do an Aspera transaction).

Copy the text from the “Public Key for pasting into OpenSSH authorized_keys file” text box. The OpenSSH public key must look like the following example. Other formats can’t be used as the public key.

ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEAoQNz1WIxVOvdRL9fx
 … jVp9nc= rsa-key-20090113

5.3.5 Linux/UNIX Users

Puttygen - Download the PuTTY software for UNIX

http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

For questions concerning PuTTY installation on UNIX, please see the README file provided in the downloaded source.

To generate a putty private key:

../puttygen -O private -t rsa -b 1024 -o puttyprivate.ppk

To generate an open-ssh public key from the private key:

../puttygen puttyprivate.ppk -O public-openssh -o publicssh.pub

ssh-keygen - More recent versions of ascp can also use OpenSSH private keys. The utility ssh-keygen can generate a public and private key pair in a compatible format using the following command.

ssh-keygen -f sra-rsakey

If in doubt whether the version of ascp installed can handle OpenSSH format private keys, please use a PuTTY format private key.

In order for a submission center to gain access (i.e., to send files to and receive files from NCBI using Aspera Connect), the public ssh key must be provided to NCBI. This key should be emailed to sra@ncbi.nlm.nih.gov with the subject line “Aspera connect authorization request”.

SSH keys are used for establishing secure connections to remote computers.

5.3.5.1 Using ascp for File Transfers

The command line program ascp is a utility delivered with the Aspera Connect package.

You can run the ascp program with the following parameter settings:

-Q   adaptive flow control

-l   maximum bandwidth of request (try 200M-600M and adjust as needed)

-m   minimum bandwidth of request (try 0)

-r   recursive copy

-T   no encryption (speeds up transfers). The connection remains secure; however, the data being transferred is not encrypted.

-i   <private key file>

-k1  enables automatic resume of transfers

Try experimental transfers starting at 100 Mbps and working up to 400-500 Mbps.

Select the bandwidth setting that gives good performance with unattended operation.

Copy the file to:

ascp -i <private key file> -QTr -k1 <file(s) to transfer> -l200M asp-<center>@upload.ncbi.nlm.nih.gov:test/

where

<private key file>::= fully qualified path & file name where the generated private key was saved.

<file(s) to transfer>::= names of files to transfer (including path)

<center>::= name assigned to the submission center, provided by sra/at/ncbi.nlm.nih.gov if not already in existence.

200M::= tunable mbit/sec bandwidth
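For scripted transfers, the command can be assembled as an argument list so that paths containing spaces are passed safely. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, it uses only the options described in this section, and it groups the options ahead of the file list.

```python
def build_ascp_command(key_file, files, center, directory="test/", rate="200M"):
    """Return the ascp argument list for uploading files to a center account.

    Only options described in this section are used: -i, -Q, -T, -r, -k1, -l.
    """
    target = "asp-%s@upload.ncbi.nlm.nih.gov:%s" % (center, directory)
    options = ["-i", key_file, "-QTr", "-k1", "-l" + rate]
    return ["ascp"] + options + list(files) + [target]
```

The resulting list can be handed to subprocess.run(command, check=True) on a system where ascp is installed and on the search path.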

The ascp command on Microsoft Windows is located by default in

c:\program files\aspera\Aspera Connect\bin\ascp

The ascp program on Mac is located at aspera/bin/ascp

The ascp program on Linux is located at <install directory>/bin/ascp

It is possible to run ascp in an autonomous, unattended manner that does not require repeated login. Please send us the public key of a SSH key pair and we will add it to our authentication system.

5.3.5.2 Transmitting Data with ascp

Use the command line utility ascp to copy files directly to a remote host:

ascp -i <private key file> -QTr <file(s) to transfer> -l 300M asp-<center>@upload.ncbi.nlm.nih.gov:incoming/

where

<private key file>::= fully qualified path & file name where the generated private key was saved.

<file(s) to transfer>::= names of files to transfer (including path)

<center>::= name assigned to the submission center, provided by sra/at/ncbi.nlm.nih.gov if not already in existence.

300M::= tunable mbit/sec bandwidth

5.3.5.3 Administering Remote Files

Do not delete files in order to “make space”. The SRA is responsible for maintaining adequate space by removing files that have already been processed. If files are not being deleted it is likely because of a backlog in the SRA.

If a submission file needs to be replaced, wait until you have a replacement and then overwrite the file (do not delete it). Please DO replace zero length files or files that have been truncated. If a “junk” file has been transmitted by mistake, it can be removed.

NOTE - files that have not been attached to any submission may be deleted after a certain amount of time. It is recommended that you consult with the SRA Administrators for the current expiration policy.

You may establish a secure connection to the SRA by using putty.exe along with your private ssh key. For example:

putty.exe -i <private key file> asp-<center>@upload.ncbi.nlm.nih.gov

where

<private key file>::= fully qualified path & file name where the generated private key was saved.

<center>::= name assigned to the submission center, provided by sra/at/ncbi.nlm.nih.gov if not already in existence

Once connected, you may use the ‘ls’ command to view the directory. You will not be able to change directories (e.g., use of the ‘cd’ command is disabled). Valid ls commands include:

 ls -l test      #lists the content of test subdir in long format
 ls -l incoming  #lists the content of incoming subdir in long format
 ls -l           #lists the content of home directory in long format
 ls -lR          #lists the content of all entries in home directory

To remove a file, use the rm command. For example:

rm incoming/badfile

5.3.5.4 Debugging ascp Transfers

To make a test transfer using ascp, please try this command:

ascp -i <private key file> -QTr <file(s) to transfer> -l100M asp-<center>@upload.ncbi.nlm.nih.gov:test/

where

<private key file>::= fully qualified path & file name where the generated private key was saved.

<file(s) to transfer>::= names of files to transfer (including path)

<center>::= name assigned to the submission center, provided by sra/at/ncbi.nlm.nih.gov if not already in existence.

100M::= tunable mbit/sec bandwidth

Be sure that the local storage is fast enough to sustain this rate. We have seen problems with downloads when the target storage is on a slow network volume. If you wish, examine /var/log/messages on Unix for a fasp log, and send that to Aspera support.

Note that when a submitter uses a wild card for submissions, zero-length files matching the shell expansion are created in the destination directory. These placeholders can be present for a time before the actual transfer takes place. Therefore, some buffer time should be added to any process on the transmission side that is responsible for determining whether the transfer succeeded.
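The buffer time described above can be built into a checking script as a bounded polling loop. The Python sketch below is illustrative only: the function name is an assumption, and get_remote_size stands for whatever hypothetical callable the submitter uses to read the size of the file at the destination (for example, by parsing an ls -l listing).

```python
import time


def wait_until_complete(get_remote_size, expected_size, attempts=10,
                        delay_seconds=30.0, sleep=time.sleep):
    """Poll until the remote size equals expected_size.

    Returns True on success and False if the size never matched, so that a
    zero-length placeholder is not mistaken for a finished (or failed) transfer.
    """
    for attempt in range(attempts):
        if get_remote_size() == expected_size:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # buffer time before looking again
    return False
```

The attempt count and delay are placeholders to be tuned to the size of the files and the bandwidth actually achieved.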

A connection error like this one may be due to an expired license key or an incorrect private key:

ascp: session open failed.
>> ascp: (remote) failed to initiate session, consult log.
>> Ssh error: SSH connection failure: 130.14.29.99:22 Server reported
>> failure exit code 1

5.3.5.5 Caveats

Supplying a directory as a source will cause the creation of the corresponding sub-directory tree on the destination. To avoid this, ensure that you execute the ascp command while in the source directory and provide a list of files to be transferred.

5.3.5.6 Known Problems

Please be aware that the ':' (colon) character is not allowed in filenames by the ascp command; files with such names need to be renamed prior to transfer.
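Affected files can be found and given new names before a transfer is started. The following Python sketch is one possible approach, not an Aspera or NCBI utility; the function names and the choice of an underscore as the replacement character are assumptions.

```python
import os


def colon_free_name(path, replacement="_"):
    """Return path with ':' replaced in the file name only; directories are kept."""
    directory, name = os.path.split(path)
    return os.path.join(directory, name.replace(":", replacement))


def rename_plan(paths):
    """Map each path whose file name contains ':' to a proposed colon-free path."""
    return {p: colon_free_name(p) for p in paths if ":" in os.path.basename(p)}
```

Each entry of the plan can then be applied with os.rename(old, new), after confirming that the new name does not collide with an existing file.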

6 Tracking Submissions

The Sequence Read Archive Submissions page tracks submissions by SRA number and current status. There are two tabs: Completed Submissions, and Attention. Please look at the Attention tab to track any problems that might have arisen from submission. Please write to NCBI with any questions about why a submission has not completed.

7 Preparing Submissions

The best way to submit to the SRA is using the Interactive Submission Tool. You can prepare the metadata objects, and enter run filenames and checksums here. You must separately transmit data to NCBI using one of the means described above.
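A checksum for a large run file can be computed without loading the whole file into memory. The Python sketch below assumes MD5 is the checksum type wanted; this section does not name the algorithm, so please confirm it in the tool before relying on this. The function name is illustrative.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```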

7.1 Preparing Run Data

The SRA is intended as a repository of data output by the “primary analysis” phase of the sequencing platform: sequencing results in fasta form along with instrument data indicating probability of correctness for each basecall (qualities) and signal intensity measurements (intensities).

The Sequence Read Archive does NOT accept fasta only datasets due to the inability to evaluate the quality of such data.

All submitted data must be raw data received from the sequencing machine without any edits.

7.1.1 Roche/454

The SRA accepts deposits of sequencing read data from the 454 platform in the .sff format. These files should reflect the sequencing run setup. If the entire picotitre plate was used, then one .sff file per run should be submitted. If on the other hand the picotitre plate was divided into two or more regions, then a .sff file for each region should be submitted. If a .sff file contains more than one run, or more than one region in the run, please break up this file into constituent parts using the sfffile utility from the “Off Rig” software package provided by Roche.

Data Series | Number of Channels | Description
.sff        | 1                  | Flowgram (base call, phred quality score, flow value)

The read names found in the .sff file are meaningful and reflect the addressing scheme for the picotitre plate as well as a globally unique run id. Please do not rewrite this name as such addressing information will be lost. The sff file format is nearly optimal in terms of footprint, so there is little to be gained by further compressing them. Therefore, please provide .sff files uncompressed.

The sequencing data may have been produced by the 454 contract sequencing center (454MSC). Please ask 454MSC to provide .sff files for your project.

7.1.2 Illumina Genome Analyzer

7.1.2.1 Illumina native data other than qseq

Original versions of the Illumina pipeline produced the following text files. These files should be converted into SRF files for submission to SRA. Recent updates to our file handling pipeline prevent the use of this data type.

7.1.2.2 Illumina SRF

For Illumina pipeline versions 1.1 and 1.3, the conduit for primary data is the sequence read format (SRF). Users should download the Staden io_lib package in order to get the solexa2srf utility.

To produce a primary analysis SRF submission file for a lane’s worth of data, change the working directory to the run folder and do:

illumina2srf -R -P -N <run>:%l:%t: -n %x:%y \
  -o <center_name>_<run>_<lane>.srf s_<lane>_*_seq.txt

where

<center_name> is the short name of the sequencing center or other individual name,

<run> is the flowcell name for the run (for example 080117_EAS56_0068), and

<lane> is the desired lane.

To produce a primary analysis SRF submission file for a lane’s worth of paired-end data, change the working directory to the run folder and run:

illumina2srf -R -P -N <run>:%l:%t: -n %x:%y -2 <cycle> \
    -o <center_name>_<run>_<lane>.srf s_<lane>_*_seq.txt

where

<center_name> is the short name of the sequencing center,

<run> is the flowcell name for the run

<lane> is the desired lane, and

<cycle> indicates the cycle number that starts the second read.

Each flowcell contains 8 lanes, but not all lanes are used for production. Also, some lanes are devoted to other projects. Finally, the size of the SRF file produced by this process can be expected to be about 2 GB. For these reasons, it is desirable to produce one SRF file per lane. The SRF file format is nearly optimal in terms of footprint, so there is nothing to be gained by compressing the files further. Therefore, please provide .srf files uncompressed.
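Because one SRF file per lane is wanted, the command lines shown above can be generated per lane instead of typed by hand. The following is a minimal sketch that only assembles the command text from the flags given in this section; it does not run the converter. The function name and the center name `MYCENTER` are hypothetical.

```python
import shlex


def illumina2srf_command(center_name, run, lane, second_read_cycle=None):
    """Build the per-lane illumina2srf command line described in this section.

    The input glob is appended unquoted so that the shell expands it
    when the command is run from inside the run folder.
    """
    args = ["illumina2srf", "-R", "-P", "-N", f"{run}:%l:%t:", "-n", "%x:%y"]
    if second_read_cycle is not None:
        # Paired-end runs: cycle number that starts the second read.
        args += ["-2", str(second_read_cycle)]
    args += ["-o", f"{center_name}_{run}_{lane}.srf"]
    return shlex.join(args) + f" s_{lane}_*_seq.txt"


if __name__ == "__main__":
    # One command per lane of an 8-lane flowcell (example run name from the text).
    for lane in range(1, 9):
        print(illumina2srf_command("MYCENTER", "080117_EAS56_0068", lane))
```

Lanes that were not used for production, or that belong to other projects, would simply be left out of the loop.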

7.1.2.3 Illumina Text Formats including current qseq

In pipeline releases 1.4 and later, Illumina switched to text-only forms of data. There are a variety of these forms depending on the pipeline version, the point at which data is extracted from the pipeline (pre- or post-alignment), and whether certain features such as bar coding or paired ends are being used. Qseq files that are divided by tile should be concatenated in tile order into a single file per read per lane. For instance, a paired run should result in two qseq files per lane. Please contact SRA if you have any questions about your Illumina text file format.
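The tile-order concatenation described above can be sketched as follows. This is an illustration only. It assumes per-tile files are named `s_<lane>_<read>_<tile>_qseq.txt`, which is not stated in this handbook, and the output name `s_<lane>_<read>_qseq.txt` is likewise a hypothetical choice.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed per-tile naming: s_<lane>_<read>_<tile>_qseq.txt
QSEQ_NAME = re.compile(r"^s_(\d+)_(\d+)_(\d+)_qseq\.txt$")


def concatenate_qseq(run_dir, out_dir):
    """Write one qseq file per (lane, read), with tiles in ascending order."""
    groups = defaultdict(list)
    for path in Path(run_dir).iterdir():
        match = QSEQ_NAME.match(path.name)
        if match:
            lane, read, tile = (int(g) for g in match.groups())
            groups[(lane, read)].append((tile, path))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for (lane, read), tiles in sorted(groups.items()):
        target = out_dir / f"s_{lane}_{read}_qseq.txt"
        with open(target, "wb") as out:
            for _tile, path in sorted(tiles):  # numeric tile order
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        written.append(target)
    return written
```

For a paired run this produces two output files per lane, one for each read, as the text requires.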

7.1.3 Applied Biosystems SOLiD System

Primary analysis data from the SOLiD System is delivered in “color space”, without translation into base space. Quality scores and signal intensities are based on the color calls.

7.1.3.1 SOLiD Native Format

NCBI currently supports “SOLiD_native” format submission. There will be one .csfasta file and one _QV.qual file for each mate in a lane. When entering SOLiD_native metadata for a Run through the Interactive Tool, the ‘read’ data file is the .csfasta file; enter the information for its corresponding ‘quality’ file directly below it.

Sequencing data with minimal instrumentation output is appropriate for applications where the main goal is abundance measurement rather than reconstruction of original sequence.

Data Series    Number of Channels    Description
.csfasta       1                     Base calls per read in color space
_QV.qual       1                     Color space quality scores

For paired-end data, two files of each type will exist (F3 and R3).

7.1.3.2 SOLiD SRF Format

NCBI recommends submission of SOLiD data in SRF format. Please download the SRF conversion utility at http://solidsoftwaretools.com/gf/project/srf/ .

7.1.4 Helicos HeliScope

NCBI is now taking datasets from the HeliScope. Please write us at sra/at/ncbi.nlm.nih.gov for special instructions.

8 Preparing Metadata

A salient feature of the SRA is the distinction given to metadata. Rather than embedding these with every run record, sequence read metadata are organized into a collection of XML files that capture as much, or as little, information as the submitter cares to give. Many pieces of information can be provided in the form of links and tag-value pairs, eliminating the need to negotiate complicated data representation ontologies.

Submission Object    Description                                                          XML Schema specification
Study                XML file specifying sequencing study                                 SRA.study.xsd
Sample               XML file specifying the target of sequencing                         SRA.sample.xsd
Experiment           XML file specifying experimental organization and parameters         SRA.experiment.xsd
Run                  One or more XML descriptors linking run data to their experiments    SRA.run.xsd

Submissions can include any combination of these documents. The set of XML submission documents must be combined into a single file produced by the tar utility (Unix/Linux) or zip (Windows), and must include one submission XML document. Do not mix run data files (sff, srf, fastq, etc.) into this tar file.
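The packaging rule above can be sketched with Python's standard tarfile module. This is an illustration under two assumptions that are not stated in this handbook: that metadata documents carry an `.xml` extension, and that the submission document's root element is named `SUBMISSION` or `SUBMISSION_SET`. The function name is hypothetical.

```python
import tarfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumption: root element names used by the submission document.
SUBMISSION_ROOTS = ("SUBMISSION", "SUBMISSION_SET")


def build_metadata_tar(xml_paths, tar_path):
    """Bundle metadata XML files into one tar file, refusing run data files."""
    paths = [Path(p) for p in xml_paths]
    not_xml = [p.name for p in paths if p.suffix.lower() != ".xml"]
    if not_xml:
        raise ValueError(f"leave non-XML files out of the tar file: {not_xml}")

    submissions = [p for p in paths if ET.parse(p).getroot().tag in SUBMISSION_ROOTS]
    if len(submissions) != 1:
        raise ValueError(f"expected one submission document, found {len(submissions)}")

    with tarfile.open(tar_path, "w") as archive:
        for p in paths:
            # Store by bare file name so the archive has no directory prefix.
            archive.add(p, arcname=p.name)
    return tar_path
```

Run data files such as .sff or .srf are rejected by the extension check and are delivered separately.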

8.1 BioProject Registration

Whole genome sequencing projects should be registered with the BioProject resource at NCBI before submitting to SRA. Please access the BioProject Submission Form to submit.

8.2 Taxonomy Registration

Most single organism genome and transcriptome sequencing projects need a Taxon Id to help specify the sample being sequenced. Please consult the Entrez Taxonomy resource to see whether your organism is represented, and request an entry to be created if not. The taxon id is needed for submission preparation.

8.3 Reference Fields and Namespaces

All the XML files can take either names (aliases) or accessions to identify dependencies. These names need only be unique throughout the submission. Eventually, the SRA will replace these names with actual accessions while preserving the referential integrity of the records.

8.4 Required Fields

Each XML file has certain required fields. The XML schemas document these entries, and most are self-explanatory. Decisions that the submitter should make include:

  • Study type

  • Whether a BioProject id exists for the study

  • The center project name or id

  • Whether a Taxonomy id exists for the sample

  • Whether an anonymous id exists for the sample

  • Sequencing platform used (for example, 454 GS 20 or 454 GS FLX)

  • Library source, strategy, selection, layout if applicable, protocol

  • Spot or cluster layout (use of adapters, linkers, bar codes, etc)

  • Read processing selection

In addition, the submitter should think about the relationship between experiments and samples, runs and experiments, and whether to split any of these objects to represent distinct information.

Finally, submitters may consider providing ancillary information, including links, Entrez links, and attribute tag-value pairs. These can be created for any of the five record types.

8.5 Preparing Submission Files

Aspects of the submission process pertaining to the submission itself have been broken out into their own XML descriptor. Contact information, transaction requests, exceptions, and file manifests can be listed here. Contacts should be provided for questions or problems pertaining to the particular submission.

Submission ObjectDescriptionXML Schema specification
SubmissionXML file specifying submission sessionSRA.submission.xsd

A checksum should be computed for each run file delivered as part of the submission and entered into the submission.xml record. Please use the unix md5sum or equivalent utility. It is not necessary to provide checksums for the metadata xml files.
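Where md5sum is not available, an equivalent checksum can be computed with Python's standard hashlib module. A minimal sketch follows; the function name is hypothetical. The file is read in chunks so that multi-gigabyte run files do not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string is the same 32-character hexadecimal value that md5sum prints for the file, and is the value to enter in the submission.xml record.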

8.6 New Submission Protocol

Check your XML files for correctness with respect to the current published schema.

Check your XML files for completeness and referential integrity.

Verify checksums on run and analysis files.

Open FTP to the trace FTP site for your Center.

Change directory to ./short_read (FTP), or ./incoming (aspera)

Deposit the files using the ftp put or mput command (or ascp for aspera).

Confirm receipt of the submission in the SRA Tracking Page. Once processed, you will be able to download the submitted data through this page. Please let us know if you are under a tight publication deadline and we will try to accommodate your needs.

Please write to us at sra/at/ncbi.nlm.nih.gov with any questions about status or access.
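The referential integrity check in the protocol above can be approximated before upload. The following is a rough sketch, not a substitute for schema validation. It assumes that objects declare their names in an `alias` attribute and that references to them use a `refname` attribute; these attribute names are assumptions that should be confirmed against the current published schema.

```python
import xml.etree.ElementTree as ET


def unresolved_references(xml_documents):
    """Return reference names that match no alias in the given XML texts.

    Assumes 'alias' declares an object's name and 'refname' refers to one.
    """
    aliases, refnames = set(), set()
    for text in xml_documents:
        root = ET.fromstring(text)
        for element in root.iter():
            if "alias" in element.attrib:
                aliases.add(element.attrib["alias"])
            if "refname" in element.attrib:
                refnames.add(element.attrib["refname"])
    return sorted(refnames - aliases)
```

An empty result means every name used as a reference is declared somewhere in the submission. References made by accession to objects already in the archive are outside the scope of this check.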

9 Managing Existing Submissions

9.1 Update Submissions

Submitters can update their records through the interactive submission tool. It is recommended that you revisit your submission in order to annotate it with publication links and additional sample information. Once loaded, certain aspects of a run record cannot be updated (please write to NCBI if you wish to do this for some reason).

Bulk updates can be performed on the Project, Sample, and Experiment objects by submitting replacement XML files for the affected objects. An update XML document must identify the target schema (study, experiment, sample, run, etc) using the MODIFY action that calls out a file of replacement XML. The alias of the submission xml document must correspond to an existing submission that covers the objects that are to change.

9.2 Hold Until Publish (HUP)

An essential feature of the SRA is the ability to hold a submission until a manuscript reporting on the research is accepted or released by a journal:

Hold until date - This is appropriate for scheduled release of a publication

Release - The dataset can be released immediately to the public.

The hold can run until its term expires, or the submitter may send a Release message to NCBI indicating that the submission can be released to the public. A Release message can apply to the entire SRA submission, or to individual objects within it. Any dependent objects are implicitly released. For example, releasing a certain experiment has the effect of releasing all its runs as well.
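The implicit release of dependent objects can be pictured as a walk over a dependency map. The sketch below is purely illustrative and does not describe how NCBI implements releases. The mapping from an object to its dependents is supplied by the caller, and all names are hypothetical.

```python
def implicitly_released(target, dependents):
    """Return the target plus everything that depends on it, directly or not.

    `dependents` maps an object name to the names of its dependent objects,
    for example an experiment to its runs.
    """
    released, seen, stack = [], set(), [target]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        released.append(current)
        stack.extend(dependents.get(current, []))
    return released
```

With `{"exp1": ["run1", "run2"]}` as the map, releasing `exp1` yields `exp1`, `run1`, and `run2`, while releasing `run1` alone yields only `run1`.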

9.3 Versioning

SRA submissions are not explicitly versioned. Rather, a complete change history is stored for metadata, and any version of the metadata can be accessed. Content such as run and analysis data is never modified. If such content must be changed, the current records are deprecated (Withdrawn) and replacements are added.

9.4 Curation

From time to time NCBI needs to update metadata in order to correct mistakes, propagate changes in other resources (for example taxonomic changes), and edit information in order to comply with editing requirements, copyrights, and data release policies. These “curation” changes may occur without necessarily seeking the approval of the original submitter. Run and analysis data will never be changed in this way. Also, original titles, descriptions, and names will be preserved as much as possible.

9.5 Suppression

At the submitter’s request, a record (submission, study, experiment, sample, run) can be suppressed.

Suppression simply marks the record as deprecated. Suppressed records are never actually deleted (except for technical reasons including for example a loading error). Suppressed records can still be accessed by accession, but the accession will be marked as having been suppressed. Suppressed records are not indexed, and copies of them are removed from download facilities.

Please write to NCBI if you wish to request suppression of a record or dataset.

10 Tracking Submissions

The SRA provides several methods for monitoring progress and status of submitted data (submission telemetry).

  • Using the interactive submission tool to monitor progress of submissions and file transfers

  • Using the accessions report (centers only, aspera users only)

  • Using the SRA XML annotated with accessions (centers only, aspera users only)

  • Using Entrez SRA (released objects only). Note that it can take 1-2 business days before released objects are fully indexed in Entrez.

  • Download tree. Objects should be available for download by aspera or FTP within 4 hours of their release.

10.1 Interactive Submission Tool

The Interactive Submission Tool can be used to monitor progress of submissions. There are two views:

Submission View – This view presents the SRA submissions and their component metadata objects (last updated submission is the most recent listed). Color coded boxes indicate how many components are at which state (green is loaded).

Tracking View – This view lists the files that have been received but not yet processed. Files may persist in this view if they have not been identified as belonging to a run, have failed to load, or have a checksum that does not match the one specified in the submission metadata (due to transmission or other error).

Please see the SRA Quick Start Guide for more details.

10.2 Accessions Report

The Accessions Report is a list of SRA metadata objects and their status. This report is a tab delimited file called SRA_Accessions. The fields are defined as follows:

Accession – Accession (SRX, SRR, etc) of the object at NCBI.

Submission – Submission accession (SRA) associated with the object.

Status – Status of the object in the archive:

Live – The object is indexed and available for retrieval

Suppressed – The object has been removed from indexing but can still be retrieved. This state usually reflects objects that have been superseded by successor objects.

Unpublished – The object has not been published.

Withdrawn – The object has been expunged from the Archive. This state reflects rare situations where data was inappropriately released and copies in the Archive must be completely removed.

Updated – ISO date of the last update of the object.

Published – ISO date of the initial publication (release) of the object, and when it appeared on the Archive public site.

Received – ISO date at which the Archive received the data from the submitter.

Type – The object’s document type, currently {SUBMISSION, STUDY, SAMPLE, EXPERIMENT, RUN, ANALYSIS}

Center – Short name for the submitting center.

Visibility – Whether the object has been archived at the open SRA (public, no usage restrictions) or at the controlled access SRA (usage restrictions in place; the user must apply for access to the data). Note that visibility is orthogonal to the publication or embargo status of the data.

Alias – The submitter’s name for the object.

Md5sum – The MD5 checksum of the metadata object. This is computed in a canonical way, see below.
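Since the report is tab delimited, it can be read with Python's standard csv module. The sketch below assumes the file begins with a header row that uses the field names listed above; that layout is an assumption and should be checked against an actual copy of SRA_Accessions. The function names are hypothetical.

```python
import csv
from collections import Counter


def load_accessions(path):
    """Read the tab-delimited report into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def count_by_status(rows):
    """Count objects per value of the Status column."""
    return Counter(row["Status"] for row in rows)


def accessions_of_type(rows, object_type):
    """Return the accessions whose Type column equals object_type, e.g. 'RUN'."""
    return [row["Accession"] for row in rows if row["Type"] == object_type]
```

A center could use this, for example, to list its RUN accessions or to count how many objects are in each status.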

10.3 Metadata XML

The XML metadata for each submission in the dump are emitted in XML files annotated with any accessions assigned during the submission process. For the daily increment, only those submissions that are new for that day are dumped. Each month, a comprehensive dump of all the live metadata in the Archive is produced. Users could seed their own local database of SRA metadata from this dump. The XML has certain portions of the public record removed from the dump, including submitter contact information and data load directives, which are not important to users of the Archive.

The location of these public archive records is:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=faspftp_metadata&m=downloads&s=download_reports

In addition, each submitting center has the same data pertaining to their submissions (both public and released, and unreleased or suppressed) available from their aspera upload directory (aspera account holders only):

For open SRA

upload@ncbi.nlm.nih.gov/asp-mycenter/outgoing/Metadata

For protected SRA

gap-upload@ncbi.nlm.nih.gov/asp-mycenter/outgoing/Metadata

Use the ascp or ftp programs to retrieve the incremental or last comprehensive metadata package file.

The metadata dump can provide input to the “roundtrip” processing module for a submitter’s LIMS. Newly submitted documents can be downloaded to obtain the accessions assigned by NCBI. Documents can be downloaded and compared with internal state to make sure that the version at NCBI is current. Documents can be inspected to see that certain modification operations succeeded.

10.4 How to compare metadata versions

SRA metadata are not versioned in an explicit, public way. Rather, metadata are tagged when they change in a substantive way. A checksum is used to record the current content. One can tell whether the content has changed by comparing the checksum to a previously computed value.

Accession    Submission   Status        Md5sum
SRA000001    SRA000001    public        d703b0b98a686a84ff232b9967e3d55c
SRP000057    SRA000001    public        bdac682ff9dca87f158b3d327832ec66
SRR000289    SRA000001    public        2fc12909aff893cdc20c48c0aa875bdf
SRS000246    SRA000001    public        7bf577c49d282529bc5f35f63137e6c0
SRX000068    SRA000001    public        a474043e97911936fe49f06f7a301aa5
SRA000002    SRA000002    public        7810982f118198eaf207a351e1550aa9
SRP000058    SRA000002    public        4c5d2a1c8a7fca885a09d690e49a5d06
SRR000290    SRA000002    public        aed8942276489b55ab98e282725ee920
SRS000247    SRA000002    public        999486234ce2e7420e16f169c2a86578

The md5sum value is equivalent to putting an xmllint 'noblanks' version of the XML associated with an accession in a file by itself (without a line feed) and executing md5sum -b on that file. If there is no metadata difference, then no increment is generated for that center on that day.

On the 1st of every month, a complete metadata dump is created in addition to an incremental dump. The first metadata dump for a new center is both an incremental and a complete dump.
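The comparison against a previously computed value can be sketched as a difference between two snapshots, each mapping an accession to its Md5sum. This is an illustration only. How the snapshots are stored is left to the submitter, and the function name is hypothetical.

```python
def changed_accessions(previous, current):
    """Compare two {accession: md5sum} snapshots.

    Returns (changed, added): accessions whose checksum differs from the
    earlier snapshot, and accessions that were not in it at all.
    """
    changed = sorted(
        acc for acc, md5 in current.items()
        if acc in previous and previous[acc] != md5
    )
    added = sorted(set(current) - set(previous))
    return changed, added
```

Only the accessions reported as changed or added need their metadata XML re-fetched and compared with local state.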

Obtain a copy of the script used to get md5 values for each accession chunk in a metadata XML file:

wget ftp-trace.ncbi.nlm.nih.gov:/sra/utilities/getMetaMd5.pl

The usage is:

getMetaMd5.pl < meta xml file path (based on ending in .xml) >

OR

getMetaMd5.pl < file containing list of meta xml file paths >

OR

<list of meta xml file paths> | getMetaMd5.pl

So you can provide the path to a single .xml file, a file containing a list of paths to xml files, or pipe to it a list of paths to xml files.

The getMetaMd5.pl script runs on Linux and requires

  • perl to be at /usr/local/bin/perl (version 5.8.3 or higher),

  • xmllint in your executable path (libxml 20630 or higher),

  • xsltproc in your executable path (version 10102 or higher),

  • Digest::MD5, a perl library for calculating md5 sums, and,

  • parseMeta.xsl, to be in the same directory as the getMetaMd5.pl executable.

10.5 How to view files you have uploaded to your aspera account

To reach the NCBI servers through the limited shell, a submitter must use their secret key (the key usually used with ascp for file transfer).

For access from Unix, Linux, or macOS, the secret key must be in OpenSSH format. Use the ssh command as follows:

For open SRA account:

 ssh -i secretkey.openssh asp-YOURNAME@upload.ncbi.nlm.nih.gov

For protected SRA account:

 ssh -i secretkey.openssh asp-YOURNAME@gap-upload.ncbi.nlm.nih.gov

For similar access from Windows, the key must be in PuTTY format and the putty.exe command should be used, as follows:

For open SRA account:

 putty.exe -i secretkey.ppk asp-YOURNAME@upload.ncbi.nlm.nih.gov

For protected SRA account:

 putty.exe -i secretkey.ppk asp-YOURNAME@gap-upload.ncbi.nlm.nih.gov

This limited shell uses aspsh> as its prompt and allows only a few commands, such as ls, cp, mv, and rm. The cd command is not allowed, so you must pass the directory name to ls as an argument.

Examples:

aspsh> ls -l
total 240
drwxrwsr-x  2 5608 trace  4096 Dec  9  2008 analysis
drwxrwsr-x  2 5608 trace 65536 May  7 10:08 incoming
drwxrwsr-x  2 5608 trace  4096 Jul  1  2008 logs
drwxrwsr-x  3 5608 trace  4096 Apr 15  2009 outgoing
drwxrwsr-x  2 5608 trace  8192 Apr 27 20:04 test
drwxrwsr-x  2 5608 trace 12288 May  7 11:05 trash


aspsh> ls -l analysis
total 0


aspsh> ls -l incoming
total 15663023352
-rw-rw-r--  1 asp-bcm trace   16504539868 May  6 10:06 0083_20090930_2_SP_ANG_HSAP_NG_005sA_01003244491_4.srf
…

Copyright Notice: http://www.ncbi.nlm.nih.gov/books/about/copyright/
