NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

SRA Application Notes [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010-.

TCGA Submission Protocol (Deprecated)

Last Update: July 14, 2011.

Status: Inactive
Active Date: 2009-04-16
Inactive Date: 2012-05-18
Scope: NCBI dbGaP SRA

1 Overview

This document describes the submission protocol for raw sequencing data and primary reference genome alignments for the Cancer Genome Atlas Project (TCGA), an NIH Roadmap study sponsored by the NCI and NHGRI. TCGA sequencing and alignment data come from human clinical samples and are considered identifying. To implement research use guidelines and enforce patient privacy rights, users access these data through the dbGaP authorized access distribution mechanism. Submitters of data to TCGA need to follow similar security procedures by submitting through the protected SRA interface, which deposits data into the dbGaP system. Excerpts of de-identified metadata are exported to the public SRA and are available for search through the NCBI Entrez system.

1.3 Notices

Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government, and shall not be used for advertising or product endorsement purposes.

2 Data Scope

2.1 Study

The SRA study for TCGA is SRP000677. This study points to the current dbGaP study (phs000178). The current version of the study is v3, but this changes every few months to reflect updates to the phenotypes and sample membership visible to the users of the study.

2.2 Samples

SRA samples are prepared by dbGaP and exported to the public Entrez BioSamples resource in abbreviated form. Here is an example record whose SRA accession is SRS061071:

http://www.ncbi.nlm.nih.gov/biosample/80112

2.3 BAM Files

Along with raw sequencing data, it is now typical to submit primary reference alignments of sequence reads.

The TCGA has mandated submission of primary sequencing data in the Binary Alignment/Map (BAM) format. The payload of this file contains both the sequencing data (bases, quality scores, and read names produced by the instrument) and read placements with annotations about strand, alignment, and quality features. BAM files are sufficient to meet the submission needs of this project.

A requirement of BAM submission is that the reference genome be precisely specified. Please see the SRA Analysis Submission Guide for specific requirements of BAM files.

2.4 ArchiveBAM Submission (Future)

A replacement for BAM that is suitable for archiving of both raw sequencing data and primary read placements is under design. This will allow for consolidated submissions of sequencing and read placements within the SRA Run object, eliminating much of the complexity associated with BAM file submission. Introduction of this service is expected in 2011.

2.5 SRA Run Submission (Legacy)

Some BAM files use runs that are already archived in the SRA. To relate a read_group label to an existing archived SRA run, submitters should include a reference to the run in the analysis XML submission, for example:

<RUN accession="SRR018666" read_group_label="A"/>

3 Submission Modalities

3.1 Submitting Center

You must have a center designation in order to submit sequencing data. Current TCGA centers are:

center_name  Center
BCM          Baylor College of Medicine
BCCAGSC      BC Cancer Agency Michael Smith Genome Sciences Centre
BI           Broad Institute
HMS-RK       Harvard Medical School - Raju Kucherlapati Lab
JHU-USC      Johns Hopkins University – University of Southern California collaboration
UNC-LCCC     University of North Carolina at Chapel Hill - Lineberger Comprehensive Cancer Center
WUGSC        Washington University, Genome Sequencing Center

3.2 Protected SRA

You must upload to the center-specific protected host address, for example

gap-upload@ncbi.nlm.nih.gov:/asp-mycenter/protected

You must identify your submission as a protected submission, as follows:

<SUBMISSION …>

<ACTIONS>

..

<PROTECT/>
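Before uploading, it can be useful to confirm that a submission XML file actually carries the PROTECT action. The following is a minimal sketch in Python, not an NCBI tool. It only checks that a PROTECT element appears somewhere beneath ACTIONS; the `<ACTION>` wrapper and the attribute values in the example document are assumptions made for illustration, since the fragment above elides the full structure.

```python
import xml.etree.ElementTree as ET


def has_action(submission_xml: str, action: str) -> bool:
    """Return True if an element named `action` appears anywhere under ACTIONS."""
    root = ET.fromstring(submission_xml)
    # Search at any depth so the check does not depend on an assumed nesting.
    for actions in root.iter("ACTIONS"):
        if any(node.tag == action for node in actions.iter()):
            return True
    return False


# Hypothetical, minimal document used only to exercise the check.
EXAMPLE = """
<SUBMISSION alias="example" center_name="BI">
  <ACTIONS>
    <ACTION><PROTECT/></ACTION>
  </ACTIONS>
</SUBMISSION>
"""

if __name__ == "__main__":
    print(has_action(EXAMPLE, "PROTECT"))
```

The same helper can be pointed at any other action name that a submission is expected to contain.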

3.3 Aspera

You must use the Aspera utility. The FTP and HTTPS protocols are not appropriate for data of this magnitude and are not supported. You must use encryption when transmitting data to NCBI.

3.4 XML

Submission metadata must be rendered in SRA XML. Spreadsheets, tab files, and bare BAM files are not sufficient to complete the archiving process. There is no interactive submission tool available for protected SRA submissions.

3.5 Tracking

Submitters should track the progress of their SRA submissions at NCBI.

Entrez SRA is not yet aware of public analysis objects (SRZ accessions). However, you can track submission of analysis objects in one of three ways:

  • Using the interactive submission tool, which also highlights problems with submission metadata or files
  • Using the display of analysis objects released to SRA, including those accessible only through dbGaP, sorted by accession:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=table&f=analysis&m=data&s=analysis

  • Using the SRA Telemetry feature

SRA telemetry includes the SRA XML for each deposited object and a tab file showing the status of each object by its NCBI accession. Deposits of metadata objects, including analysis objects (SRZ), can be tracked for each depositor by downloading the SRA_Accessions tab file generated each day for each submitter.

For example,

Accession: SRZ000300
Submission: SRA025062
Status: Live
Updated: 2010-10-21T18:19:06Z
Published: 2010-10-21T18:19:06Z
Received: 2010-10-13T21:08:02Z
Type: ANALYSIS
Center: BI
Visibility: controlled_access
Alias (of the SRA Analysis object, NOT the BAM file): G1743.TCGA-06-0188-01A-01D-0373-08.bam
Md5sum (of the XML object, NOT the BAM file): 9324c67f9b47311e9973f50ec9ada3f7

A full description of this tab file can be found in the SRA Submission Guide http://www.ncbi.nlm.nih.gov/books/NBK47532/
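As a rough illustration of how the daily tab file might be used for tracking, the sketch below lists analysis objects for one center that are not yet Live. It assumes that the file is tab-delimited with a header row and that the column names match the field labels in the example above; consult the SRA Submission Guide for the authoritative layout. The second data row is invented for illustration.

```python
import csv
import io
from typing import Dict, List


def analyses_not_live(tab_text: str, center: str) -> List[Dict[str, str]]:
    """Return ANALYSIS rows for `center` whose Status is anything other than Live."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    return [
        row
        for row in reader
        if row["Type"] == "ANALYSIS"
        and row["Center"] == center
        and row["Status"].lower() != "live"
    ]


# Column names assumed from the example record; only a subset is shown.
HEADER = ["Accession", "Submission", "Status", "Type", "Center", "Visibility"]
ROWS = [
    ["SRZ000300", "SRA025062", "Live", "ANALYSIS", "BI", "controlled_access"],
    # Hypothetical row, invented for illustration only.
    ["SRZ999999", "SRA999999", "unpublished", "ANALYSIS", "BI", "controlled_access"],
]
SAMPLE_TAB = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)

if __name__ == "__main__":
    for row in analyses_not_live(SAMPLE_TAB, "BI"):
        print(row["Accession"], row["Status"])
```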

3.6 Exchange Area

For most of 2009-2010, the “exchange area”, or anteroom, was utilized as a mechanism for collaborative exchange of TCGA BAM files. This area was needed while NCBI constructed submission pathways for BAM files.

The TCGA Exchange area is being taken out of service to be replaced by regular authorized access distribution.

  • The ./asp-mycenter/exchange/TCGA directory has been removed
  • The ./asp-mycenter/exchange/TCGA_phs000178 directory is no longer writable. Please do not deposit any new files into this area. BAM files and SRA metadata tar packages should be deposited into the regular ./asp-mycenter/protected area.
  • Checksums (md5) should be included in the SRA Analysis metadata xml, rather than in a separate file.
  • NCBI will compute a BAM index file (bai) on receipt of the data.
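Because the md5 checksum now travels inside the SRA Analysis metadata XML, submitters need to compute it before writing that XML. One way to do this for large BAM files without loading them into memory is sketched below. The function name and chunk size are illustrative choices, not part of any NCBI specification.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal md5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte BAM files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string is the value to record as the file checksum in the analysis XML.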

3.7 Release and Publication

The TCGA operates on a rolling submission policy, meaning that each submission is released immediately. Please specify this in your submission XML:

<SUBMISSION …>

<ACTIONS>

..

<RELEASE/>

Release of project data is the responsibility of dbGaP. dbGaP follows a periodic release policy that corresponds to sample phenotype submission, quality control, and release.

4 Data Preparation

4.1 Study

Submitters do not create an SRA study for their submission. Rather, the SRA experiment is set to reference the existing SRA study. Here are the available studies:

SRA Study   dbGaP study   Genome project id   Title
SRP000677   phs000178     41443               The Cancer Genome Atlas (TCGA)

This binding can be expressed in XML as:

<STUDY_REF accession="SRP000677" />

4.2 Samples

Sample records are also created by NCBI for use by submitters. Each SRA analysis object and SRA experiment object references one or more sample records. This binding can be expressed in SRA experiment XML as:

<SAMPLE_REF accession="SRS096084" />

and in SRA analysis XML as:

<TARGET accession="SRS061581" sra_object_type="SAMPLE" />
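When many experiment and analysis documents are generated by script, the two bindings can be produced from a single sample accession so that they stay consistent. The following is a small sketch under that assumption. The function name and the accession pattern check are illustrative; the element and attribute names are taken from the fragments above.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict

# Loose sanity check: "SRS" followed by digits. This is an assumption, not a schema rule.
SRS_PATTERN = re.compile(r"^SRS\d+$")


def sample_bindings(accession: str) -> Dict[str, str]:
    """Return the experiment and analysis XML fragments that bind to one sample."""
    if not SRS_PATTERN.match(accession):
        raise ValueError(f"not an SRA sample accession: {accession!r}")
    experiment_ref = ET.Element("SAMPLE_REF", {"accession": accession})
    analysis_ref = ET.Element(
        "TARGET", {"accession": accession, "sra_object_type": "SAMPLE"}
    )
    return {
        "experiment": ET.tostring(experiment_ref, encoding="unicode"),
        "analysis": ET.tostring(analysis_ref, encoding="unicode"),
    }


if __name__ == "__main__":
    for kind, fragment in sample_bindings("SRS096084").items():
        print(kind, fragment)
```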

You can test the existence of a BioSample record by looking up the TCGA aliquot id or using the SRA Sample accession. For example, try

http://www.ncbi.nlm.nih.gov/biosample/?term=TCGA-06-0876-10A-01D-1003-01

http://www.ncbi.nlm.nih.gov/biosample?term=SRS096084

To find BioSample records in bulk, it is currently necessary to obtain from NCBI a lookup table of dbGaP sample names to BioSample accessions for each dbGaP study that is being submitted. Samples may already be loaded, or they may have been received but still await phenotype data and are therefore unreleased. Samples may also have been withdrawn from the study. To confirm that the sample names you have correspond to those tracked at dbGaP, and to ensure that the samples for which you intend to submit data are still active in the database, please write to NCBI for the latest sample lookup table that relates BioSample records (SRS) to TCGA aliquot barcodes (for example, TCGA-06-0876-10A-01D-1003-01).
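Once the lookup table has been obtained, a submitter might check a batch of aliquot barcodes against it before preparing metadata. The sketch below assumes a hypothetical table layout: tab-delimited, no header, with the aliquot barcode in the first column and the SRS accession in the second. The actual layout of the table supplied by NCBI may differ, and the accessions in the example are invented.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def resolve_samples(
    table_text: str, barcodes: Iterable[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Split `barcodes` into those found in the lookup table and those missing."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    # Assumed layout: column 0 is the aliquot barcode, column 1 the SRS accession.
    lookup = {row[0]: row[1] for row in reader if len(row) >= 2}
    wanted = list(barcodes)
    resolved = {code: lookup[code] for code in wanted if code in lookup}
    missing = [code for code in wanted if code not in lookup]
    return resolved, missing


# Hypothetical table contents; the barcode-to-accession pairings are invented.
EXAMPLE_TABLE = (
    "TCGA-06-0876-10A-01D-1003-01\tSRS000001\n"
    "TCGA-06-0188-01A-01D-0373-08\tSRS000002\n"
)

if __name__ == "__main__":
    found, absent = resolve_samples(
        EXAMPLE_TABLE,
        ["TCGA-06-0876-10A-01D-1003-01", "TCGA-00-0000-00A-00D-0000-00"],
    )
    print(found)
    print(absent)
```

Barcodes reported as missing are the ones to raise with NCBI before submission, since data for samples that dbGaP does not track cannot be accepted.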

TCGA will be migrating to UUIDs as sample names during 2011. dbGaP intends to participate in this migration. Use of BioSample ids (SRSs) will help make this conversion transparent to submitters, and for a time a lookup table based on both aliquot barcodes and UUIDs will be provided.

4.3 Metadata Preparation

For a submission to succeed, the SRA requires that predefined project (SRP) and sample (SRS) records already exist. A TCGA submission consists of one or more experiments (SRX), one or more runs (SRR), and one or more analysis objects (SRZ). The information content of these respective metadata objects is described in the SRA Analysis Submission Guide.

Do not combine XML metadata with BAM files. Please combine the XML metadata files into a tar file, for example

tar cvf mysubmission.xml.tar A.submission.xml A.experiment.xml A.run.xml A.analysis.xml
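Where packaging is scripted, the same tar file can be produced with a guard that keeps BAM or other data files out of the metadata package. The following is a sketch using the Python standard library; the function name and the rule of accepting only files ending in .xml are assumptions for illustration.

```python
import tarfile
from pathlib import Path
from typing import Iterable


def build_metadata_tar(tar_path: str, xml_paths: Iterable[str]) -> str:
    """Write an uncompressed tar of XML metadata files, refusing anything else."""
    paths = [Path(p) for p in xml_paths]
    rejected = [p.name for p in paths if p.suffix.lower() != ".xml"]
    if rejected:
        raise ValueError(f"only XML metadata belongs in the package: {rejected}")
    # Mode "w" writes a plain tar with no compression, like `tar cvf`.
    with tarfile.open(tar_path, "w") as archive:
        for p in paths:
            # Store each file under its base name, without directories.
            archive.add(str(p), arcname=p.name)
    return tar_path
```

A call such as build_metadata_tar("mysubmission.xml.tar", ["A.submission.xml", "A.experiment.xml", "A.run.xml", "A.analysis.xml"]) mirrors the tar command above.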

The SRA metadata will be made public. Consequently, do not include identifying information in the XML metadata. If the information is restricted to the library preparation and run conditions, this will not be an issue.

Public metadata are indexed, visible in Entrez, and dumped for bulk access. TCGA short read datasets will appear as normal deposits in every respect except that you cannot see or download the run or analysis genotyping data. Instead, a message will ask the user to apply to the relevant Data Use Committee to gain access.

4.4 Run Data Preparation

SRA run data are extracted from the BAM files delivered with the SRA analysis objects. The BAM file is also called out as the run data file. For details, please see the SRA Analysis Submission Guide.

4.5 Alignment Preparation

BAM files should follow the requirements of BAM file submission for NIH projects. For details, please see the SRA Analysis Submission Guide. BAM files should not be compressed or wrapped in another archive container.

4.6 Probes and Capture Arrays

Where appropriate, NCBI would like to define probe sets for capture arrays or techniques. These can be simply defined: a list of targets and their coordinates is sufficient. They can be provided as a spreadsheet or a BED file, and NCBI can create accessions in ProbeDB for these data and attach them to the submitted experiments.

5 Submission Protocol

1. Prepare the submission XML with the new PROTECT action. This tells the SRA that the data are intended for dbGaP.

2. Use your established ssh key pairs for transmission with NCBI. Key pairs provide more security. More than one key pair can be defined; you may wish to dedicate one to the transfer of protected data.

3. Transmit submission articles (XML and data files) to a special server dedicated to delivery of protected datasets:

ascp -l400m -Q files asp-XXX@gap-upload.ncbi.nlm.nih.gov:protected/

where asp-XXX is your center account of the form asp-<center_name>, for example asp-bi. Note that the -T option is NOT specified, so the data remain encrypted during transmission.

4. Metadata and data can be delivered asynchronously; whichever arrives first will wait in the protected area until the submission is complete.

5. Inspect the ./outgoing area for annotated XML for objects that have been processed. There will be some delay before the annotated XML files appear.

6. Files will be cleaned up automatically as they are processed and moved to dbGaP.
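For centers that script their transfers, the upload command in step 3 can be assembled programmatically, with a guard that keeps the -T option out. The sketch below only builds the argument list and does not run ascp. Several details are assumptions rather than part of this protocol: that the destination reads asp-<center_name>@gap-upload.ncbi.nlm.nih.gov:protected/, that the account name is the lowercased center_name (inferred from the asp-bi example), and that a dedicated private key is passed with ascp's -i option.

```python
from typing import List, Optional, Sequence

# Assumed upload host, as read from the command in step 3.
UPLOAD_HOST = "gap-upload.ncbi.nlm.nih.gov"


def build_ascp_command(
    center_name: str,
    files: Sequence[str],
    rate: str = "400m",
    key_path: Optional[str] = None,
) -> List[str]:
    """Build the ascp argument list for a protected upload, without running it."""
    if not files:
        raise ValueError("at least one file is required")
    command = ["ascp", f"-l{rate}", "-Q"]
    if key_path is not None:
        # -i selects a private key file; using a dedicated key is optional.
        command += ["-i", key_path]
    command += list(files)
    # Assumption: the account is asp- plus the lowercased center_name.
    command.append(f"asp-{center_name.lower()}@{UPLOAD_HOST}:protected/")
    # -T would disable encryption, so it must never be present.
    if "-T" in command:
        raise ValueError("-T must not be used for protected data")
    return command


if __name__ == "__main__":
    print(" ".join(build_ascp_command("BI", ["mysubmission.xml.tar"])))
```

The resulting list can be handed to subprocess.run once the account name and key location have been confirmed with NCBI.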

The submission process is the same as that for the open SRA but conducted in isolation. Currently BAM files are processed in the following manner:

1. Each BAM file has its md5 checksum computed. The BAM file name, its checksum, and the submitting center are compared to the stated name, checksum, and center_name provided with the SRA analysis XML.

2. An index file (.bai) is generated from the BAM file, which requires scanning the entire file.

3. The BAM header is dumped for internal use.

4. Each BAM file corresponds to a sample in the project. If this sample is not contained in the subject-sample mapping table in dbGaP, the entire submission is rejected.

5. Runs are NOT extracted from the BAM files at this time. SRRs are left in an unloaded state. A future version of the SRA will load these runs from their associated BAM files.

6. The BAM file as delivered, along with its index file, is added to the list of currently “loaded” analysis objects. At the next periodic release, dbGaP provides these files through the dbGaP authorized access interface.

6 Updates and Withdrawals

Updates of metadata can be handled through the normal SRA channel. Please see the SRA Submission Guide for details about how to update metadata through modify xml submissions.

There is no mechanism to automatically withdraw (suppress) objects in the SRA. Please write to the SRA helpdesk to request suppression of objects. This request is handled by a curator. Suppressed objects remain in the SRA database but are not indexed and are not returned in any query. Run and analysis objects that have been suppressed are not available for download from the dbGaP authorized access channel.

7 Example Submissions

Example protected SRA submissions can be found in: ftp://ftp.ncbi.nlm.nih.gov/sra/examples/SRA029111

The files can be downloaded using this command:

wget ftp://ftp.ncbi.nlm.nih.gov/sra/examples/SRA029111/*.xml

The following files give an example XML package (SRA029111a.xml.tar) containing four SRA documents to add to the database. A screenshot of the interactive submission tool following successful submission can be seen in the file SRA029111a.pdf.

SRA029111a.analysis.xml

SRA029111a.experiment.xml

SRA029111a.pdf

SRA029111a.run.xml

SRA029111a.submission.xml

SRA029111a.xml.tar

The following files give an example XML package (SRA029111m.xml.tar) containing four SRA documents that modify existing documents already added to the database.

SRA029111m.analysis.xml

SRA029111m.experiment.xml

SRA029111m.run.xml

SRA029111m.submission.xml

SRA029111m.xml.tar
