TSA Submission Guide

Requirements

  • Register your project in the BioProject database as a Transcriptome Shotgun Assembly project.
  • Register your library information in the BioSample database.
  • Raw reads should be submitted to SRA and the SRA run accession(s) (SRR) provided. Do not provide SRX accession numbers.
  • EST sequences should be submitted to dbEST and the accession range provided in the COMMENT section of the submission.
  • An Assembly Data structured comment must be provided. This information can be entered through the Submission Portal dialogs or created using the Structured Comment Template.
  • If a multi-step assembly was performed, a description of the assembly process should be provided in the COMMENT section.
  • If annotation is provided, the product names should follow the International Protein Nomenclature Guidelines.
  • Annotation must be biologically valid.
  • The keyword 'Targeted' and feature annotation should be included for all targeted subsets of transcriptome data. See Targeted vs. Non-targeted TSA Studies for more information.

Creating the TSA submission file

[1] The BioProject accession, BioSample accession(s), SRA run accession(s) and Assembly Structured Comment data are entered using the Submission Portal dialogs. See Requirements for the links to these databases.

[2] If submitting a Targeted subset of your data, see the additional requirements under Targeted vs. Non-targeted TSA Studies.

[3] All TSA submissions are made through the TSA Submission Portal.

[4] Submit either a fasta or .sqn file.

  • Preparing a fasta file for submission:

    • Sequences should be in fasta format.
    • Files have the suffix .fsa.
    • fasta defline component: [moltype=transcribed_RNA]
    • Each sequence has a definition line beginning with a unique identifier, e.g. contig001, contig002, etc. Use concise names that do not include length or coverage information. The unique identifier cannot exceed 50 characters. The unique identifier appears in the DEFINITION line in the flatfile view of the record.
    • Contigs should be >199 nt.
    • Remove any n's from the beginning or end of each sequence.
    • Do not include internal n's that represent a gap of unknown length. These sequences should be split at the gap.
    • Any n stretches greater than 14 nucleotides will need an assembly gap feature. The Submission Portal will provide a prompt to set up the assembly gap feature.
  • Preparing a .sqn file using tbl2asn for submission:

    • tbl2asn reads a template.sbt along with the sequence and table files, and outputs ASN.1 for submission to TSA through the portal.
    • Annotation may be included using a Feature table. See tbl2asn.
    • fasta defline components:

      • [moltype=transcribed_RNA]
      • [tech=TSA]
      • To add Source information, see tbl2asn Source table format.

      Sample command line:

          tbl2asn -t template.sbt -p. -Y comment -M t
      

      The validator output (*.val) should be reviewed before submitting. Submissions with validator errors that were not resolved beforehand may be stopped in the Submission Portal. See Submitting the file to TSA Submission Portal for more information.

      tbl2asn command line arguments
      -Y To import Assembly Description Comment
      -M t To run standard validator and additional TSA checks
      -j Allows the addition of source qualifiers that will be the same for each submission.

      Example: -j "[organism=Homo sapiens] [tissue-type=liver]"

      -w assembly.cmt To import Structured Comment Table*

      *This is optional, but can be helpful when there are multiple transcriptomes, because there will be less information to supply on the web form during submission. See Creating the Assembly Structured Comment Table for more information.
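The definition line rules above can be sketched in a few lines of Python. This is only an illustration, not part of tbl2asn or the Submission Portal: the function names `make_defline` and `make_deflines` are hypothetical, and the rule that identifiers contain no whitespace is an assumption added here (the guide itself only requires uniqueness and a 50-character limit). The `[keyword=Targeted]` modifier is the one described under Targeted vs. Non-targeted TSA Studies.

```python
MAX_ID_LENGTH = 50  # limit on the unique identifier stated in this guide


def make_defline(seq_id, moltype="transcribed_RNA", tech="TSA", targeted=False):
    """Return a fasta definition line with the bracketed modifiers from this guide."""
    # Assumption (not from the guide): identifiers are non-empty and contain no whitespace.
    if not seq_id or any(ch.isspace() for ch in seq_id):
        raise ValueError(f"identifier must be non-empty without whitespace: {seq_id!r}")
    if len(seq_id) > MAX_ID_LENGTH:
        raise ValueError(f"identifier exceeds {MAX_ID_LENGTH} characters: {seq_id!r}")
    parts = [f">{seq_id}", f"[moltype={moltype}]", f"[tech={tech}]"]
    if targeted:
        parts.append("[keyword=Targeted]")
    return " ".join(parts)


def make_deflines(seq_ids, **modifiers):
    """Build one definition line per identifier, rejecting duplicates."""
    seen = set()
    lines = []
    for seq_id in seq_ids:
        if seq_id in seen:
            raise ValueError(f"duplicate identifier: {seq_id!r}")
        seen.add(seq_id)
        lines.append(make_defline(seq_id, **modifiers))
    return lines


print(make_defline("contig001"))
# prints: >contig001 [moltype=transcribed_RNA] [tech=TSA]
```

For a fasta-only submission, the guide lists just `[moltype=transcribed_RNA]`; the `[tech=TSA]` modifier is listed for the tbl2asn route.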

Submitting the file to TSA Submission Portal

All files must be submitted via the Submission Portal.

When the file is uploaded it will undergo a series of validation checks. The following will stop your submission in the portal:

  • Sequences less than 200 bp
  • Sequences with univec hits that are for Next-Gen sequencing primers
  • Sequences that are more than 10% n's or have more than 14 n's in a row
  • Files that are incorrectly formatted or have biologically invalid annotation
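Some of the checks above can be approximated locally before uploading. The following is a minimal sketch, not the portal's actual validator: the helper names `read_fasta` and `check_sequence` are hypothetical, the thresholds are taken from the list above, and the leading/trailing n check comes from the fasta preparation rules earlier in this guide. It does not screen for UniVec hits and does not inspect annotation.

```python
import re

MIN_LENGTH = 200        # sequences shorter than this are stopped
MAX_N_FRACTION = 0.10   # more than 10% n's is stopped
MAX_N_RUN = 14          # more than 14 n's in a row is stopped


def read_fasta(text):
    """Parse fasta text into a list of (identifier, sequence) pairs."""
    records = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def check_sequence(seq):
    """Return a list of problems found in one sequence (empty if none)."""
    problems = []
    seq = seq.lower()
    if len(seq) < MIN_LENGTH:
        problems.append(f"length {len(seq)} is below {MIN_LENGTH}")
    if seq and seq.count("n") / len(seq) > MAX_N_FRACTION:
        problems.append("more than 10% n's")
    if re.search("n{%d,}" % (MAX_N_RUN + 1), seq):
        problems.append("run of more than 14 n's")
    # From the fasta preparation rules: no n's at the beginning or end.
    if seq.startswith("n") or seq.endswith("n"):
        problems.append("leading or trailing n's")
    return problems


if __name__ == "__main__":
    example = ">contig001 [moltype=transcribed_RNA]\n" + "acgt" * 50 + "\n"
    for seq_id, seq in read_fasta(example):
        print(seq_id, check_sequence(seq) or "ok")
```

A clean report from a script like this does not guarantee that the submission will pass, since the portal performs further checks.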

Submission statuses in the Submission Portal:

  • Queued: The submission is successful and waiting for review by TSA staff. If there are any issues the submitter will be contacted with a list of revisions and/or inquiries.
  • Error: The TSA staff has reported any error(s) to the submitter. The corrections need to be made and a new file uploaded using the Fix button.
  • Processing: The submission has been successfully completed and an accession number for the project has been assigned.
  • Processed: The project has been released to the database.

Targeted vs. Non-targeted TSA Studies

It is expected that submissions to TSA would comprise a large-scale comprehensive study of the complete transcriptome of an organism. However, some scientists do targeted studies of their transcriptome data and only want to submit this small subset. For targeted studies the regular submission process should be followed with the following requirements:

  • The keyword 'Targeted' should be added to the submission file. Using tbl2asn this can be done by including [keyword=Targeted] in the fasta definition line.
  • Annotation must be included showing the focus of the targeted study. This can be done with a gene, misc_feature, or RNA feature.
  • If coding regions are provided the product names should follow the International Protein Nomenclature Guidelines. If misc_features are provided then the /note should be in the following format: "similar to product_name".
  • Set the molecule type (moltype) to the appropriate RNA type: mRNA, rRNA, ncRNA, or transcribed RNA.

*SRA cannot release a subportion of your data to match your subset. The entire SRA dataset will be released upon release of your subset.

Assembly Gaps

Sequences with known gaps can be submitted to TSA provided the gap is annotated with an assembly_gap feature.

The required qualifiers for the assembly_gap feature are:

  • estimated_length
  • linkage_evidence
    • paired-ends: paired sequences from the two ends of a DNA fragment.
    • align-genus: alignment to a reference genome within the same genus.
    • align-xgenus: alignment to a reference genome within another genus.
    • align-trnscpt: alignment to a transcript from the same species.
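As an illustration of how such features might be generated, the sketch below locates n runs longer than 14 nucleotides and writes one assembly_gap entry per run with the two qualifiers listed above. Several things here are assumptions rather than statements from this guide: the output layout follows the tab-delimited five-column feature table format commonly used with tbl2asn, the length of each n run is taken as the estimated gap length, and the function names are hypothetical. Only the qualifiers named in this section are emitted.

```python
import re

MIN_GAP_RUN = 15  # runs longer than 14 n's need an assembly_gap feature per this guide

# Linkage evidence values listed in this guide.
LINKAGE_EVIDENCE = {"paired-ends", "align-genus", "align-xgenus", "align-trnscpt"}


def find_n_runs(seq, min_run=MIN_GAP_RUN):
    """Return 1-based inclusive (start, stop) pairs for each n run of at least min_run."""
    pattern = re.compile(r"[nN]{%d,}" % min_run)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(seq)]


def gap_feature_table(seq_id, seq, linkage_evidence="paired-ends"):
    """Build feature table text with one assembly_gap per long n run."""
    if linkage_evidence not in LINKAGE_EVIDENCE:
        raise ValueError(f"unknown linkage_evidence: {linkage_evidence!r}")
    runs = find_n_runs(seq)
    if not runs:
        return ""
    lines = [f">Feature {seq_id}"]
    for start, stop in runs:
        lines.append(f"{start}\t{stop}\tassembly_gap")
        # Assumption: the run length stands in for the estimated gap length.
        lines.append(f"\t\t\testimated_length\t{stop - start + 1}")
        lines.append(f"\t\t\tlinkage_evidence\t{linkage_evidence}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sequence = "a" * 10 + "n" * 20 + "c" * 10
    print(gap_feature_table("contig001", sequence), end="")
```

The validator output should still be reviewed after running tbl2asn, since it may flag qualifiers or values that this sketch does not cover.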

Updating TSA submissions

  • If you are updating a publication, send the TSA accession prefix and complete publication information in the text portion of an email to gb-admin@ncbi.nlm.nih.gov.
  • If you are updating any other information or adding additional sequence(s) to your assembly, do not create a new submission. Please contact gb-admin@ncbi.nlm.nih.gov for directions and include the following information with your request:
    • Description of your update.
    • TSA accession prefix or submission portal ID for your submission.

Creating the Assembly Structured Comment Table

The Assembly Structured Comment table is a single tab-delimited table that includes the tag-value pairs that are to be applied to all of the sequences in your submission. For TSA records the Assembly Method (with version and/or year if available) and Sequencing Technology must be included. Coverage and Assembly Name are optional.

The table to import is created using the Structured Comment page.

An example table:

StructuredCommentPrefix	Assembly
Assembly Method	Newbler 2.0
Coverage	220x
Sequencing Technology	454; Solexa
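A table like the example above could also be written by a short script. The sketch below is an assumption-laden illustration: the function name `structured_comment_table` is hypothetical, the prefix value simply mirrors the example, and the required tags are the two named in this section. Each row is one tag and one value separated by a single tab.

```python
REQUIRED_TAGS = ("Assembly Method", "Sequencing Technology")  # required per this guide


def structured_comment_table(fields, prefix="Assembly"):
    """Return tab-delimited structured comment text for the given tag-value pairs."""
    missing = [tag for tag in REQUIRED_TAGS if not fields.get(tag)]
    if missing:
        raise ValueError("missing required tag(s): " + ", ".join(missing))
    rows = [("StructuredCommentPrefix", prefix)]
    rows.extend(fields.items())
    return "".join(f"{tag}\t{value}\n" for tag, value in rows)


if __name__ == "__main__":
    table = structured_comment_table(
        {
            "Assembly Method": "Newbler 2.0",
            "Coverage": "220x",
            "Sequencing Technology": "454; Solexa",
        }
    )
    print(table, end="")
```

The resulting text could be saved under the file name passed to tbl2asn with -w, such as assembly.cmt in the argument list earlier in this guide.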

Last updated: 2018-06-27T14:06:12Z