Whole Genome Shotgun Submissions
Whole Genome Shotgun Sequence Submissions
DDBJ/EMBL/GenBank accepts both complete and incomplete genomes. Whole Genome Shotgun (WGS) sequencing projects are incomplete genomes or incomplete chromosomes that are being sequenced by a whole genome shotgun strategy. WGS projects may be annotated, but annotation is not required.

The pieces of a WGS project are the contigs (overlapping reads), and they do not include any gaps. An AGP file can be submitted to indicate how the contig sequences are assembled together into scaffolds (contig sequences separated by gaps) and/or chromosomes. We must have the contig sequences without gaps as the basic units for all WGS projects.

WGS projects without annotation require at least two weeks to be processed. Projects with annotation require at least one month for processing. Please submit your project with enough lead time.

Below are detailed instructions for preparing a WGS submission. We recommend sending us a test file via GenomesMacroSend to see if there are problems before committing to generating the entire project.

See the list of WGS projects. The Annotation column indicates whether a project is annotated on the contigs (Y-c), on the scaffolds (Y-s) or is not annotated (-).
Introduction
Each WGS project is assigned a stable 4-letter WGS accession prefix, which does not change as the project is updated. In addition to the WGS accession prefix, the contig identifiers have a version number corresponding to a particular WGS project update. Finally, each individual contig within the assembly is assigned a unique accession number prefixed by the WGS accession prefix and version number. For instance, if a WGS project's assigned accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (The last six digits of this ID identify each individual contig.)

The nucleotide data from most WGS projects go into the BLAST wgs database, whereas proteins go into the BLAST nr database. Nucleotides from environmental projects are present in either the BLAST wgs or env_nt database: sequences that have been identified as belonging to a particular organism go into wgs, and sequences whose organism is not yet known go into env_nt. Similarly, the proteins from those projects are in the nr or env_nr BLAST database. See the Metagenome Submission Guide for information about how to submit the various elements of a metagenome project.

What To Do
WGS Project Info
The table below shows three examples of WGS projects that have both contigs and scaffolds. One is unannotated and the others have annotation on either the contigs or the scaffolds. When the contigs are annotated, that annotation is also displayed on the corresponding scaffold in Entrez. Annotated records are shown in the GenBank (Full) view. The accession number of each WGS project is included in the table:
How to Create a WGS Submission
Submissions to WGS can be created with tbl2asn, a command line program that automates parts of the submission process. tbl2asn reads a template along with sequence (*.fsa) and optional quality score (*.qvl) and annotation table (*.tbl) files, and outputs an ASN.1 file (*.sqn) for submission to GenBank. You can upload your submission files to us with GenomesMacroSend. Be sure to include in the comment box the Project ID that was assigned when you registered your project with the Genome Project database.

Other information about the biological source of the organism, including the organism name, should also be included in the definition line of the sequence. In addition to the organism name, include any other source modifiers that are known, e.g. [strain=yyy] [chromosome=nnn]. Note that there are no spaces surrounding the equals sign. A complete list of modifiers is available from the Sequin FAQ page. If there is annotation and the organism does not use the standard genetic code, include the correct genetic code in the defline. For example, include [gcode=11] for bacteria. The definition line must be on a single line with no line break. A sample definition line is:

Annotation can be included by creating a 5-column table in a .tbl file for each .fsa file. Go to the appropriate page for information about the format of the table and the desired annotation for eukaryotic or prokaryotic genomes. Three required fields are:
Note that the nucleotide SeqID appears in the DEFINITION line in the flatfile view of the record. Although the protein SeqIDs are not displayed in the final flatfile view, they are present in the ASN.1. See example *.fsa and *.tbl files for various situations, such as partial CDS or features on the minus strand.

Use tbl2asn to include the Phrap/Consed quality scores of a sequence. The scores must be in files named *.qvl that are in the same directory and have the same nucleotide SeqIDs as the corresponding *.fsa files.

If there is a published reference, it can be sent separately with the submission, to be added to the records by GenBank staff during WGS processing. tbl2asn does not include a release date in the output file, so include that information in your email message to us when you submit.

Using tbl2asn
In a specified directory, tbl2asn looks for .fsa files and any .tbl and .qvl files with the same basename, for example file1.fsa, file1.tbl and file1.qvl, and it builds ASN.1 records from them. The ASN.1 record for this example would be called file1.sqn. The results of the validation would be in a file named file1.val. Most validation errors must be fixed before the .sqn files can be submitted to GenBank; however, taxonomy-related errors can generally be ignored.

Upload the .sqn files to us with GenomesMacroSend. Be sure to include in the comment box the Project ID that was assigned when you registered your project with the Genome Project database.

Some tbl2asn options that are relevant to WGS submissions are:
Go to the tbl2asn page for more detailed information about tbl2asn, its command line arguments, and file formats.

Sample command:

If there are more than 1000 contigs, then we recommend that you combine the FASTA (and other) files, putting 10,000 or fewer sequences into each file. Combining multiple FASTA sequences into a single file can be useful for smaller projects too. The corresponding .tbl and .qvl files must have the information for all of the sequences that are in the .fsa file. Run tbl2asn with the "-a s" argument so that each definition line is recognized as the beginning of a new sequence. A single .sqn file will then be generated for the multiple sequences of each .fsa file.

Sample command:

See example .fsa and .tbl files.
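The batching recommendation above (10,000 or fewer sequences per .fsa file) can be sketched in Python. This is a minimal illustration, not an NCBI tool: the function names and the output naming scheme (contigs_001.fsa, contigs_002.fsa, ...) are assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def read_fasta_records(text: str) -> Iterator[str]:
    """Yield each FASTA record (definition line plus sequence lines) as one string."""
    record: List[str] = []
    for line in text.splitlines():
        # A new definition line closes the record collected so far.
        if line.startswith(">") and record:
            yield "\n".join(record) + "\n"
            record = []
        if line.strip():  # skip blank lines
            record.append(line)
    if record:
        yield "\n".join(record) + "\n"


def batch_records(records: Iterable[str], batch_size: int = 10_000) -> Iterator[List[str]]:
    """Group records into lists of at most batch_size records each."""
    batch: List[str] = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_batches(fasta_text: str, out_dir: str,
                  basename: str = "contigs", batch_size: int = 10_000) -> List[Path]:
    """Write one .fsa file per batch; the file naming is hypothetical."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    batches = batch_records(read_fasta_records(fasta_text), batch_size)
    for index, batch in enumerate(batches, start=1):
        path = target / f"{basename}_{index:03d}.fsa"
        path.write_text("".join(batch))
        written.append(path)
    return written
```

Any .tbl and .qvl files would need to be split along the same boundaries, so that each one covers exactly the sequences in the .fsa file that shares its basename.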
If there is information about how the contigs are assembled into scaffolds (supercontigs) or chromosomes, then submit an AGP file with that information. AGP files provide the ordering and orientation information to construct supercontigs or scaffolds from contigs, or to construct chromosomes from supercontigs and/or contigs. More information about genome assemblies is here. See this page for the AGP format.
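To make the idea of ordering and orientation concrete, the sketch below lays out gap-free contigs and the gaps between them along one scaffold and reports where each part falls. It is an illustration only: the class names, the contig SeqIDs and the 1-based inclusive coordinates are assumptions for the example, and the code does not write an AGP file. The actual column layout is defined on the AGP format page.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Contig:
    seq_id: str              # hypothetical contig SeqID
    length: int              # contig length in bases
    orientation: str = "+"   # "+" or "-" relative to the scaffold


@dataclass
class Gap:
    length: int              # estimated gap length in bases


Part = Union[Contig, Gap]


def layout_scaffold(parts: List[Part]) -> List[Tuple[int, int, Part]]:
    """Return (start, end, part) tuples in scaffold order.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    placed: List[Tuple[int, int, Part]] = []
    position = 1
    for part in parts:
        if part.length <= 0:
            raise ValueError("every contig and gap needs a positive length")
        end = position + part.length - 1
        placed.append((position, end, part))
        position = end + 1   # the next part starts right after this one
    return placed


# Two contigs separated by one gap; the second contig is reverse-oriented.
scaffold = [Contig("contig1", 1000), Gap(100), Contig("contig2", 500, "-")]
for start, end, part in layout_scaffold(scaffold):
    print(start, end, part)
```

The contigs themselves stay gap-free; only the layout records where the gaps fall, which matches the requirement that contig sequences without gaps are the basic units of a WGS project.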
Some specific requests are:
If the same version of a WGS project is being updated (for example, to add annotation), then the SeqIDs must be identical and the accession numbers must be included in the update, for both nucleotides and proteins. The correct format of the identifiers in such an update is:
gnl|WGS:XXXX|SeqID|gb|XXXX01xxxxxx
where XXXX is the accession prefix and XXXX01xxxxxx is the contig's accession number. We recommend that you send a test file to NCBI with details of your plans before generating a complicated update.
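The identifier pattern above, together with the accession scheme from the Introduction (4-letter prefix, two-digit version, six-digit contig number), can be assembled as in the following sketch. The function names are hypothetical, the uppercase-letters check on the prefix is an assumption, and only nucleotide contig identifiers are covered.

```python
import re

# Assumption for this sketch: the 4-letter prefix consists of uppercase letters.
PREFIX_RE = re.compile(r"^[A-Z]{4}$")


def contig_accession(prefix: str, version: int, contig_number: int) -> str:
    """Build a contig accession: prefix + 2-digit version + 6-digit contig number."""
    if not PREFIX_RE.match(prefix):
        raise ValueError("prefix must be four uppercase letters")
    if not 1 <= version <= 99:
        raise ValueError("version must fit in two digits")
    if not 1 <= contig_number <= 999_999:
        raise ValueError("contig number must fit in six digits")
    return f"{prefix}{version:02d}{contig_number:06d}"


def update_identifier(prefix: str, seq_id: str, accession: str) -> str:
    """Format an identifier as gnl|WGS:XXXX|SeqID|gb|XXXX01xxxxxx."""
    if not accession.startswith(prefix):
        raise ValueError("accession does not begin with the project prefix")
    return f"gnl|WGS:{prefix}|{seq_id}|gb|{accession}"


accession = contig_accession("XXXX", 1, 1)
print(accession)                                        # XXXX01000001
print(update_identifier("XXXX", "contig1", accession))  # gnl|WGS:XXXX|contig1|gb|XXXX01000001
```

Here contig1 stands in for whatever SeqID was used in the original submission; it must be reused unchanged in the update.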
If you need additional assistance in preparing WGS submissions, please contact genomes@ncbi.nlm.nih.gov.

Revised November 20, 2009