NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

SRA Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010-.

Analysis Submission Guide

Last Update: October 19, 2011.

1. Overview

This document reviews submission procedures and guidelines for SRA analysis objects, including

  • De novo assemblies (to be specified in a future version of this document)
  • Reference alignments
  • Sequence annotations (to be specified in a future version of this document)
  • Abundance measurements (to be specified in a future version of this document)

In keeping with developing NIH policy, this document also shows how to submit primary sequencing data as a part of the analysis object.

1.1. History

Guidelines for SRA analysis submission were developed in conjunction with two NIH roadmap initiatives: The Cancer Genome Atlas (TCGA), and the Human Microbiome Project (HMP). The TCGA established early requirements to allow submission of all needed primary data through the BAM file format. The HMP pioneered requirements for annotation of raw sequencing data from metagenome projects where assembly into higher constructs is difficult.

1.2. Goals

1. Meet the needs of users by providing a home somewhere in the data model for all desired properties.
2. Distinguish where in the data model each desired property should reside.
3. Define processing directives that might be important to interpreting the sequencing/alignment data and loading it into an archive database.
4. Eliminate dependence on spreadsheets and filenames to convey metadata.
5. Provide searchable metadata that can be used by query writers in the public database.
6. Provide query source for programmatic construction of component descriptions that users of protected data will see inside the dbGaP authorized access download interface.

1.3. Scope

In its current revision, this document describes metadata needs for BAM file submission. It does not describe the submission modalities, nor does it cover higher level or other analysis types. Some BAM files are submitted using preexisting SRA data; others will be submitted containing de novo sequencing data as part of their payload. This document does not describe archive requirements for the BAM file read placement records, which may have additional requirements in order to be loaded into the NCBI alignment database. These requirements need further development.

1.4. Revision History

Drafts A-E created 2010-09-14 to 2010-10-08. Document released with draft status 20 Oct 2010.

1.5. Related Documents

Elements of TCGA project requirements have been incorporated into this document [Tim Fennell. BAM File Format for TCGA Submissions. Draft v2, July 9, 2009.]

Submitters should also consult the established SRA submission documentation:

Quick Start Guide:

http://www.ncbi.nlm.nih.gov/books/NBK47529/

Aspera Transfer Guide:

http://www.ncbi.nlm.nih.gov/books/NBK242625/

Here is the released SRA XML Schema:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=xml_schemas

For details on the SAM/BAM specification please reference:

http://samtools.sourceforge.net/SAM1.pdf

A BAM file validator utility is available here:
http://picard.sourceforge.net/command-line-overview.shtml – ValidateSamFile

2. Data Model

| NCBI Object | Accession | Sequencer Production Unit | BAM Component |
|---|---|---|---|
| Submission envelope | SRA | n/a | n/a |
| Analysis | SRZ | n/a | BAM file |
| Study | SRP | n/a | n/a |
| Experiment | SRX | n/a | Library (LB) |
| Sample | SRS | n/a | Sample (SM) |
| Run | SRR | Lane/slide/plate | Read Group (RG) |
| Reference Sequence | NC_ and others | n/a | Sequence Dictionary (SQ) |
| Probe set | Pr | capture array | n/a |
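
The mapping between BAM header components and SRA objects can be illustrated with a minimal Python sketch. It reads the text form of a SAM/BAM header, collects the @RG (read group) lines, and groups read group IDs by their LB (library) and SM (sample) tags, which correspond to the run, experiment, and sample objects in the data model. The function names and the example header are hypothetical and are not part of any NCBI tool; the header tags themselves (ID, LB, SM, PU, PL) come from the SAM specification.

```python
from collections import defaultdict


def parse_read_groups(header_text):
    """Return one dict per @RG header line, keyed by two-letter SAM tag."""
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        fields = line.split("\t")[1:]
        # Each field has the form TAG:VALUE; split on the first colon only.
        read_groups.append(dict(f.split(":", 1) for f in fields))
    return read_groups


def group_by_sra_object(read_groups):
    """Group read group IDs (runs) by library (experiment) and by sample."""
    by_library = defaultdict(list)
    by_sample = defaultdict(list)
    for rg in read_groups:
        by_library[rg.get("LB")].append(rg["ID"])
        by_sample[rg.get("SM")].append(rg["ID"])
    return dict(by_library), dict(by_sample)


# Hypothetical header, for illustration only.
EXAMPLE_HEADER = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621",
    "@RG\tID:fc1.1\tPL:ILLUMINA\tPU:fc1.1\tLB:lib1\tSM:sampleA",
    "@RG\tID:fc1.2\tPL:ILLUMINA\tPU:fc1.2\tLB:lib1\tSM:sampleA",
    "@RG\tID:fc2.1\tPL:ILLUMINA\tPU:fc2.1\tLB:lib2\tSM:sampleA",
])

if __name__ == "__main__":
    libraries, samples = group_by_sra_object(parse_read_groups(EXAMPLE_HEADER))
    print("runs per library (experiment):", libraries)
    print("runs per sample:", samples)
```

In this sketch, three read groups would correspond to three runs, two libraries to two experiments, and one sample to one sample record.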

2.1. Submission Metadata

The submission metadata pertains to the submission “package” or “envelope” conveying the data to the archive.

Submitter id/alias – Submitter’s name or alias for the submission.

Submission date – ISO 8601 date for the date of transmission of the file to NCBI.

Submitter contact – name and email address of the submitter contact(s).

Center name – NCBI short name for the submitting center.
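
As an illustration only, the four submission fields above could be gathered in a small Python structure before being written out in whatever format the submission requires. The class and field names below are hypothetical and are not the SRA XML schema; the only behavior relied on is that `date.isoformat()` yields an ISO 8601 calendar date.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SubmissionMetadata:
    """Hypothetical container for the submission envelope fields."""
    alias: str
    center_name: str
    contacts: list  # list of (name, email) tuples
    submission_date: date = field(default_factory=date.today)

    def as_record(self):
        """Return a plain dict with the date rendered as ISO 8601 (YYYY-MM-DD)."""
        return {
            "alias": self.alias,
            "center_name": self.center_name,
            "submission_date": self.submission_date.isoformat(),
            "contacts": [{"name": n, "email": e} for n, e in self.contacts],
        }


if __name__ == "__main__":
    meta = SubmissionMetadata(
        alias="example_submission_001",          # hypothetical alias
        center_name="EXAMPLE_CENTER",            # hypothetical center short name
        contacts=[("Jane Doe", "jane.doe@example.org")],
        submission_date=date(2010, 10, 20),
    )
    print(meta.as_record())
```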

2.2. Analysis Metadata

Analysis alias – Submitter’s name or alias for the analysis object.

Analysis title – The title string that will be presented to users of the public archive when this record is retrieved in a search result. Please limit this string to 80 characters.

Analysis type – DE_NOVO_ASSEMBLY | REFERENCE_ALIGNMENT | SEQUENCE_ANNOTATION | ABUNDANCE_MEASUREMENT

Analysis Description – A free form description of the analysis product and the process by which it was produced.

Analysis date – ISO 8601 date when the analysis was completed and the BAM file written.

Analysis center – NCBI short name for the center that performed the analysis.

Analysis Files and Checksums – Each analysis file and its MD5 checksum.
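
A submitter might check a few of these fields locally before submission. The sketch below, in Python, computes the MD5 checksum of a file in chunks (so that large BAM files need not fit in memory) and checks the 80-character title limit and the analysis type vocabulary listed above. The function names are hypothetical; this is not an NCBI validation tool.

```python
import hashlib

# Analysis types as enumerated in this guide.
ANALYSIS_TYPES = {
    "DE_NOVO_ASSEMBLY",
    "REFERENCE_ALIGNMENT",
    "SEQUENCE_ANNOTATION",
    "ABUNDANCE_MEASUREMENT",
}
MAX_TITLE_LENGTH = 80


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_analysis(title, analysis_type):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if len(title) > MAX_TITLE_LENGTH:
        problems.append(
            f"title is {len(title)} characters; limit is {MAX_TITLE_LENGTH}"
        )
    if analysis_type not in ANALYSIS_TYPES:
        problems.append(f"unknown analysis type: {analysis_type!r}")
    return problems
```

For example, `md5_of_file("sample.bam")` (with a hypothetical file name) would return the checksum string to record alongside the file name.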

2.2.1. Reference Alignment Metadata

This section enumerates metadata components that are specific to reference alignment analysis objects.

Standard Assembly – Controlled name for the reference assembly or set of reference sequences used in the alignment. The following table shows a catalog of standard assemblies that are supported by NCBI. Other SRAs may define and support different assemblies. A set of cross referenced sequences may also be specified as the reference assembly.

| short_name | Description | source |
|---|---|---|
| GRCh37 | GRCh37 is the Genome Reference Consortium Human Reference 37 released 24-FEB-2009, and includes haploid and alternative loci sequences. This reference can also be specified in the NAME field (db="gencoll", accession="GCA_000001405.1") | http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/index.shtml |
| GRCh37-lite | GRCh37-lite is a subset of the full GRCh37 human genome assembly plus the human mitochondrial genome reference sequence (the "rCRS") from Mitomap.org. This set of sequences excludes all the alternate loci scaffolds of the full GRCh37 assembly, and has the pseudo-autosomal regions (PARs) on chromosome Y masked with Ns. This haploid representation of the genome is provided as a convenience for use in alignment pipelines that cannot handle the multiple placements expected in the PARs and in regions of the genome that are represented by the alternate loci. | http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/index.shtml; http://www.mitomap.org/MITOMAP |
| HG18 | The March 2006 human reference sequence (NCBI Build 36.1) was produced by the International Human Genome Sequencing Consortium and is distributed by UCSC. | http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg18 |
| NCBI36 | NCBI Build 36.3 released 24 March 2008. This build consists of a reference assembly for the whole genome, alternate assemblies for the whole genome produced by Celera and by JCVI, plus alternate assemblies for some parts of the genome. | ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/BUILD.36.3/ |
| NCBI36-HG18_Broad_variant | Broad Institute variant of Build 36/HG 18. | ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/BUILD.36.3/special_requests/assembly_variants/NCBI36-HG18_Broad_variant.README |
| NCBI36_BCCAGSC_variant | British Columbia Cancer Agency Genome Sequencing Center variant of Build 36/HG 18. | ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/BUILD.36.3/special_requests/assembly_variants/NCBI36_BCCAGSC_variant.README |
| NCBI36_BCM_variant | Baylor College of Medicine variant of Build 36/HG 18. | ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/BUILD.36.3/special_requests/assembly_variants/NCBI36_BCM_variant.README |
| NCBI36_WUGSC_variant | Washington University variant of Build 36/HG 18. | ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/BUILD.36.3/special_requests/assembly_variants/NCBI36_WUGSC_variant.README |

Custom assembly – It is possible to specify a list of contigs, including de novo assemblies of unmapped reads, that together comprise the reference sequence. More development is needed to define the business rules that would apply to this kind of reference specification.

Processing pipeline – The sequence of processes/tools/operations and their versions can be specified for the alignment process.

Processing directives – certain specific instructions to the data loading software, or properties that users of the data should be aware of:

  • alignment_includes_unaligned_reads - Whether unaligned reads are provided in the alignment, and what to do with them
  • alignment_marks_duplicate_reads - Whether duplicates are removed from the alignment
  • alignment_includes_failed_reads - Whether non-PF filtered reads have been included in the read groups
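
The three directives above describe properties that are also visible in the FLAG field of each SAM/BAM alignment record. The SAM specification defines bit 0x4 for an unmapped read, 0x200 for a read that fails platform or vendor quality checks, and 0x400 for a PCR or optical duplicate. The Python sketch below summarizes a collection of FLAG values under the assumption that a submitter wants to cross-check declared directives against file content; the function name is hypothetical, and this is not how the archive loader itself interprets the directives.

```python
# FLAG bits as defined in the SAM specification.
UNMAPPED = 0x4      # segment unmapped
FAILED_QC = 0x200   # not passing platform/vendor quality filters (non-PF)
DUPLICATE = 0x400   # PCR or optical duplicate


def summarize_flags(flags):
    """Report which directive-related properties occur in the given FLAG values."""
    flags = list(flags)
    return {
        "alignment_includes_unaligned_reads": any(f & UNMAPPED for f in flags),
        "alignment_marks_duplicate_reads": any(f & DUPLICATE for f in flags),
        "alignment_includes_failed_reads": any(f & FAILED_QC for f in flags),
    }


if __name__ == "__main__":
    # Hypothetical FLAG values: two mapped mates, one unmapped read,
    # and one mapped read marked as a duplicate.
    print(summarize_flags([99, 147, 77, 1123]))
```

Note that a file with no reads flagged 0x400 may mean either that duplicates were never marked or that they were removed; the directive is what tells users which is the case.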

2.3. Study Metadata

For open SRA submissions, the submitter must create or reference an SRA data producing study (SRP).

For protected SRA submissions, the submitter must reference an existing dbGaP authorized access study (phs). Reference can be made to the study handle with refcenter="NCBI". Submitters should NOT create these records.

2.4. Sample Metadata

For open SRA submissions, the submitter must create or reference an SRA sample or BioSample (SRS).

For protected SRA submissions, the submitter must reference an existing BioSample record (SRS). Reference can be made to the submitted sample name with refcenter set to the original repository short name. Submitters should NOT create these records.

Open SRA samples or BioSamples have diverse attributes and information content.

Protected SRA samples are exported from dbGaP and make visible a standard subset of attributes, including at the time of this writing:

Title – Brief yet unique headline returned with the record as part of a search result.

Identifiers – SRS accession, dbGaP sample accession

Organism – Target organism {human}

Original_repository – Namespace for sample set {TCGA}

Submitted_sample_id – Sample name {TCGA aliquot id}

Submitted_subject_id – Subject name {TCGA subject id, substring of the aliquot id}

Sex – {male, female, unknown}

Sample_type – Project specific sample type {TCGA: normal, primary tumor, etc}

Is_tumor – {0,1}

Histological_type – Sample diagnosis {TCGA: Serous Cystadenocarcinoma, etc}

Analyte_type – {DNA, RNA, etc}

Study_name – Short name for the parent study {TCGA}

Description – Free form text describing the sample.

Links – Includes link to parent dbGaP authorized access study homepage

An example of a TCGA record that has this information:

http://www.ncbi.nlm.nih.gov/biosample/limits?term=TCGA-13-0725-01A-01D-0359-05

2.5. Library Metadata

Each library mentioned in the BAM will map to a new or existing SRA experiment. The SRA experiment contains the following data:

Experiment title – The title string that will be presented to users of the public archive when this record is retrieved in a search result. Please limit this string to 80 characters.

Experiment description – Description of the library and its sequencing.

Library Name – Submitter’s name for the library.

Library Strategy – Controlled vocabulary of terms describing overall strategy of the library. Terms used by TCGA include {WGS, WXS, RNA-Seq}.

Library Source – Controlled vocabulary of terms describing starting material from the sample. Terms used by TCGA include {GENOMIC, TRANSCRIPTOMIC*}.

Library Selection method – Controlled vocabulary of terms describing selection or reduction method used in library construction. Terms used by TCGA include {Random, Hybrid Selection}.

Library Layout – Specification of the layout: fragment/paired, and if paired, the nominal insert size and standard deviation.

Library Protocol description – Description of the library construction protocol, or reference to a standard protocol.

Targeted loci* – Set of loci to be selected for sequencing {16S rRNA, exome} and associated probes.

Platform – Controlled vocabulary of platform type {Illumina, LS454, AB_SOLID, CompleteGenomics}

Instrument model – Controlled vocabulary of instrument models {Illumina Genome Analyzer II, etc}

Expected sequence length – Number of raw bases or color space calls expected for the read (includes both mate pairs and all technical portions).

Sequence processing software and version – Name and version of sequencing processing software used.

2.6. Run Metadata

Each read group will map to exactly one new or existing SRA run.

Run name – Production flowcell/slide/plate name

Run date – ISO 8601 date the run was produced

Run center – NCBI center short name where the run was produced (useful if different from the submitter).

Run file info – Information about the run data file(s). If BAM, then this is the BAM file name and its checksum.

Processing directives – certain specific instructions to the data loading software, encoded as tag-value attributes, including:

  • Actual raw sequence length, including both mate pairs and all technical portions.
  • Quality scoring system {phred, log-odds}
  • Quality basis character {! or @}
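
The quality basis character is the ASCII character that represents a quality score of zero: "!" is ASCII 33 and "@" is ASCII 64. The Python sketch below shows how these two directives would affect interpretation of a quality string. It assumes that "phred" means Q = -10 log10(p) and that "log-odds" means Q = -10 log10(p / (1 - p)), the usual definitions of those scales; the function names are hypothetical.

```python
def decode_qualities(quality_string, basis_char="!"):
    """Convert an ASCII quality string to integer scores using the basis character."""
    offset = ord(basis_char)  # "!" -> 33, "@" -> 64
    return [ord(c) - offset for c in quality_string]


def error_probability(score, system="phred"):
    """Convert a quality score to an estimated base-call error probability."""
    if system == "phred":
        # Q = -10 * log10(p)  =>  p = 10 ** (-Q / 10)
        return 10 ** (-score / 10)
    if system == "log-odds":
        # Q = -10 * log10(p / (1 - p))  =>  p = 1 / (1 + 10 ** (Q / 10))
        return 1 / (1 + 10 ** (score / 10))
    raise ValueError(f"unknown quality scoring system: {system!r}")


if __name__ == "__main__":
    print(decode_qualities("II5!", basis_char="!"))   # basis 33
    print(decode_qualities("hh@", basis_char="@"))    # basis 64
    print(error_probability(20, "phred"))
```

Decoding with the wrong basis character shifts every score by 31, which is why the directive needs to be stated explicitly rather than inferred.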