Transcriptome Shotgun Assembly Sequence Database
What is the Transcriptome Shotgun Assembly (TSA) Database?
TSA is an archive of computationally assembled sequences from primary data such as ESTs, traces and Next Generation Sequencing Technologies. The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods instead of by traditional cloning and sequencing of cloned cDNAs. The primary sequence data used in the assemblies must have been experimentally determined by the same submitter. TSA sequence records differ from EST and GenBank records because there are no physical counterparts to the assemblies.
How Do TSA Sequence Records Differ from Other GenBank/EMBL/DDBJ Records?
The display of a TSA sequence is similar to other International Nucleotide Sequence Database Collaboration (INSDC) records, but includes the following:
- The label 'TSA:' at the beginning of each Definition Line.
- DBLINK: BioProject, BioSample (optional), Sequence Read Archive
- Keywords: TSA; Transcriptome Shotgun Assembly
- Assembly data
- A comment describing the assembly, if it came from a multi-step process.
Other Features and References are similar to those displayed in regular GenBank/EMBL/DDBJ records.
An example of a TSA submission is JU497302.
TSA sequence records are shared by all three INSDC databases and can be found using typical search methods in Entrez Nucleotide and Entrez Protein.
General Information
Nucleotide sequences must conform to the following standards:
- Submitted sequences must be assembled from data experimentally determined by the submitter.
- Sequences must be screened for vector contamination, and any vector/linker sequence must be removed. This includes the removal of NextGen sequencing primers.
- Sequences must be at least 200 bp long.
- Sequences should contain no more than 10% n's and no run of more than 14 consecutive n's.
- The raw reads should be submitted to SRA and the SRA run accession (SRR) provided. Do not include SRA or SRX accession numbers.
- If the submission is a single-step, unannotated assembly and the output is one or more BAM files, these should be submitted to SRA as a TSA project.
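The length and n-content limits above can be checked before submission. The following is a minimal sketch, not an official NCBI tool: the function name `check_sequence` and the constant names are hypothetical, and it covers only the three numeric limits (it does not screen for vector or primer contamination).

```python
import re

MIN_LENGTH = 200        # sequences cannot be shorter than 200 bp
MAX_N_FRACTION = 0.10   # no more than 10% n's overall
MAX_N_RUN = 14          # no more than 14 n's in a row


def check_sequence(seq):
    """Return a list of problems found in one nucleotide sequence."""
    problems = []
    seq = seq.upper()
    if len(seq) < MIN_LENGTH:
        problems.append("shorter than 200 bp")
    if seq and seq.count("N") / len(seq) > MAX_N_FRACTION:
        problems.append("more than 10% n's")
    # Length of the longest uninterrupted run of N characters (0 if none).
    longest_run = max((len(m.group()) for m in re.finditer(r"N+", seq)), default=0)
    if longest_run > MAX_N_RUN:
        problems.append("more than 14 n's in a row")
    return problems
```

An empty list means the sequence passes these three checks; any other result names the limit that was exceeded.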
Additional Requirements:
- Register your project in the BioProject database as a Transcriptome Shotgun Assembly project.
- An Assembly Data structured comment. Please see Creating the Structured Comment Table.
- Description of the assembly process if a multi-step assembly was performed.
- The library information for the primary data should be annotated on the Source Feature. Alternatively, submit the information to BioSample and provide the BioSample accession.
- If annotation is provided, the product names should follow the UniProt-Protein Naming Guidelines.
Creating the submission file
Submission Process:
- The submission file can be generated using Sequin or tbl2asn.
- The Sequin file(s) should be submitted using GenomesMacroSend. Select the TSA option on the submission form.
- After uploading, write to gb-admin@ncbi.nlm.nih.gov with the GDSub number from GenomesMacroSend and the release date (Immediate Release OR Mon/Dd/Yyyy).
Submission Tools:
Sequin
- Select "Use a Submission Wizard", then choose TSA.
- The wizard provides dialogs for entering the assembly data, assembly description, BioProject, and SRR accession.
- The wizard is not for large sets of sequences.
- One submission should not consist of multiple BioProjects.
tbl2asn
- tbl2asn reads a template.sbt along with the sequence and table files, and outputs ASN.1 for submission to GenBank.
FASTA defline components:
- [moltype=mRNA]
- [tech=TSA]
- [bioproject=PRJNAXXXX1]
- [SRA=SRRXXXXX1]
- [biosample=SAMNXXXXXXX1]
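The components listed above can be assembled into a FASTA definition line programmatically. This is a minimal sketch: the function name `tsa_defline` and the sequence ID `contig1` are hypothetical, the accessions are the placeholders from the list, and treating the BioSample modifier as optional is an assumption based on BioSample being marked optional in the record display.

```python
def tsa_defline(seq_id, bioproject, sra, biosample=None):
    """Build a FASTA defline carrying the TSA source modifiers."""
    modifiers = [
        "[moltype=mRNA]",
        "[tech=TSA]",
        f"[bioproject={bioproject}]",
        f"[SRA={sra}]",
    ]
    # Assumption: the BioSample modifier is added only when an accession is given.
    if biosample:
        modifiers.append(f"[biosample={biosample}]")
    return f">{seq_id} " + " ".join(modifiers)


if __name__ == "__main__":
    print(tsa_defline("contig1", "PRJNAXXXX1", "SRRXXXXX1", "SAMNXXXXXXX1"))
```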
Command-line arguments:
| -w assembly.cmt | Import assembly data. See Creating the Structured Comment Table for more information. |
| -Y | Import assembly comment |
| -M t | Run the standard validator plus additional TSA checks |
Sample command line:
tbl2asn -t template.sbt -p. -a s -w assembly.cmt -Y comment -M t
Creating the Structured Comment Table
The structured comment table is a single tab-delimited table containing the tag-value pairs that are applied to all of the sequences in your submission. For TSA records, the Assembly Method (with version and/or year if available) and the Sequencing Technology must be included. Coverage is optional.
If you are using tbl2asn, generate the table to import using the Structured Comment page.
- If you choose the save option, the table will automatically be saved as assembly.cmt. If you are saving multiple tables with different options, you will need to rename the file for each structured comment.
- If you use the open option, a table is generated in the browser window; copy it and save it to a file.
An example table:
| StructuredCommentPrefix | Assembly |
| Assembly Method | Newbler 2.0 |
| Coverage | 220x |
| Sequencing Technology | 454; Solexa |
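To illustrate the tab-delimited layout, the example table can be rendered as one tag-value pair per line. This is only a sketch of the file shape under the description given here; the Structured Comment page remains the documented way to generate the table, and the helper names `format_comment_table` and `missing_required` are hypothetical.

```python
# Tags that must be present for TSA records; Coverage is optional.
REQUIRED_TAGS = ("Assembly Method", "Sequencing Technology")


def format_comment_table(pairs):
    """Render (tag, value) pairs as tab-delimited lines, one pair per line."""
    return "".join(f"{tag}\t{value}\n" for tag, value in pairs)


def missing_required(pairs):
    """Return the required tags that are absent or have an empty value."""
    present = {tag for tag, value in pairs if value}
    return [tag for tag in REQUIRED_TAGS if tag not in present]


# Values copied from the example table.
example = [
    ("StructuredCommentPrefix", "Assembly"),
    ("Assembly Method", "Newbler 2.0"),
    ("Coverage", "220x"),
    ("Sequencing Technology", "454; Solexa"),
]

if __name__ == "__main__":
    print(format_comment_table(example), end="")
```

Checking `missing_required` before saving the text as assembly.cmt catches a table that lacks either of the two mandatory tags.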
Should not be submitted to TSA
- Assemblies from sequences not directly sequenced by the submitter.
- Clone-based assemblies. These should be submitted to GenBank.
- Sequences assembled by inserting Ns to represent the gaps.
- A single assembly from multiple organisms.