Whole Genome Shotgun Submissions
What is Whole Genome Shotgun (WGS)?
Whole Genome Shotgun (WGS) projects are genome assemblies of incomplete genomes or incomplete chromosomes of prokaryotes or eukaryotes that are generally being sequenced by a whole genome shotgun strategy. WGS projects may be annotated, but annotation is not required. NCBI has a Prokaryotic Genome Annotation Pipeline (PGAP) that may be requested at the time the genome files are submitted to GenBank. This pipeline generates a submission-ready annotated file that is posted back to the submitter for review, and which the submitter can edit prior to data release.
The public WGS projects can be found in the list of WGS projects.
Each WGS project is assigned a stable 4-letter WGS accession prefix, which does not change as the project is updated. The prefix is followed by a two-digit version number that corresponds to a particular update of the WGS project. Finally, each individual contig within the assembly is assigned a unique accession number that consists of the WGS accession prefix, the version number, and a contig number. For instance, if a WGS project's assigned accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (The last six digits of this ID identify each individual contig.) When more sequencing is done and the genome is reassembled, the contigs are submitted as the 02 version of the WGS project. No linkage or relationship is expected between the old and new contigs, and the new contigs are given new accession numbers beginning with XXXX02000001. The 01 contigs are suppressed when the 02 contigs are released.
Note: In January 2019 GenBank began assigning accessions with a stable 6-letter WGS accession prefix and a minimum of 9 digits, e.g., XXXXXX000000000, XXXXXX010000000, and XXXXXX010000001, for the WGS project, its first version, and its first sequence, respectively.
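The accession layout described above can be illustrated with a short parser. This is a minimal sketch, not an NCBI tool: the function name `parse_wgs_accession`, the `WgsAccession` record, and the regular expression are hypothetical, and the sketch assumes that the two digits after the prefix are the version and that all remaining digits are the contig number (at least six digits for 4-letter prefixes, at least seven for 6-letter prefixes).

```python
import re
from typing import NamedTuple


class WgsAccession(NamedTuple):
    prefix: str   # stable 4- or 6-letter project prefix
    version: int  # two-digit assembly version (0 for the project record)
    serial: int   # contig number within that version (0 for a master record)
    kind: str     # "project", "version", or "contig"


# Assumed layout: prefix letters, 2 version digits, then the contig digits.
_WGS_RE = re.compile(r"^([A-Z]{4}|[A-Z]{6})(\d{2})(\d+)$")


def parse_wgs_accession(accession: str) -> WgsAccession:
    """Split a WGS-style accession into prefix, version, and contig number."""
    match = _WGS_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS-style accession: {accession!r}")
    prefix, version_digits, serial_digits = match.groups()

    # 4-letter prefixes carry at least 6 contig digits (8 digits in total);
    # 6-letter prefixes carry at least 7 (the 9-digit minimum in the note).
    min_serial_digits = 6 if len(prefix) == 4 else 7
    if len(serial_digits) < min_serial_digits:
        raise ValueError(f"too few digits in accession: {accession!r}")

    version, serial = int(version_digits), int(serial_digits)
    if serial == 0:
        # All-zero contig digits identify the project (version 00)
        # or one assembly version of it (version 01, 02, ...).
        kind = "project" if version == 0 else "version"
    elif version == 0:
        raise ValueError(f"contig accession lacks a version: {accession!r}")
    else:
        kind = "contig"
    return WgsAccession(prefix, version, serial, kind)
```

For example, `parse_wgs_accession("XXXX02000001")` yields prefix `XXXX`, version 2, contig number 1, which matches the first contig of the reassembled 02 version in the text.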
In addition, each genome is part of a BioProject that describes the research effort, and is from a BioSample which presents details of the source of the DNA. Furthermore, each public genome is loaded into the Assembly database, where it is assigned an Assembly accession. When a genome is updated, the Assembly accession is incremented to the next version, but the BioProject and BioSample accessions remain the same.
Note: In January 2014 creation of strain-level taxids ended, and registration of a BioSample became a requirement for genomes.
The nucleotide data from all WGS projects have gone into the BLAST wgs database since the fall of 2011. Proteins from most WGS projects go into the BLAST nr database. Proteins from environmental projects are placed in either the BLAST nr or the env_nr database, depending on whether the sequence has been identified as coming from a particular organism (nr) or the organism is not yet known (env_nr).
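The routing rule in the previous paragraph can be written out as a small decision function. This is a simplified sketch with a hypothetical function name and parameters; it ignores the fact that only "most" projects have their proteins in nr, and it is not how NCBI implements database loading.

```python
def blast_database(molecule: str,
                   environmental: bool = False,
                   organism_identified: bool = True) -> str:
    """Return the BLAST database that a WGS sequence would be searched in."""
    if molecule == "nucleotide":
        # All WGS nucleotide data go into the wgs database.
        return "wgs"
    if molecule == "protein":
        # Environmental proteins without an identified organism go to env_nr;
        # everything else is assumed to go to nr.
        if environmental and not organism_identified:
            return "env_nr"
        return "nr"
    raise ValueError(f"unknown molecule type: {molecule!r}")
```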
See the Metagenome Submission Guide for information about how to submit the various elements of a metagenome project.
Information about the requirements for more complex assemblies, such as those with PARs or alternate loci, is in the Assembly Submission pages.
Some Examples
The table below shows a few examples of WGS projects:
- unannotated contigs and scaffolds
- annotated contigs with unannotated scaffolds
- unannotated contigs with annotated scaffolds.
The accession number of each WGS project is included in the table and links to the live record for viewing. For accession ACZS00000000, notice that the annotation on the contigs is also displayed on the corresponding scaffold. However, annotation that is submitted on a scaffold or chromosome CON record is not displayed on the underlying components, as seen in ABXC00000000. To see the annotation on large records, use the GenBank(full) Display setting and/or the Customize options to "show sequence".
Annotated Contigs | Annotated Scaffolds | No Annotation |
ACZS00000000 | ABXC00000000 | AAGU00000000 |
WGS contig | WGS contig | WGS contig |
Scaffold CON | Scaffold CON | Scaffold CON |
Nucleotide sequences must conform to the following standards:
- Submitted sequences must be assembled from data experimentally determined by the submitter.
- Sequences must be screened for vector contamination, and any vector or linker sequence must be removed. This includes the removal of NextGen sequencing primers.
- Sequences should be greater than 200 bp in length, unless they are part of multi-component scaffolds.
- Sequence gaps may be present and annotated with the assembly_gap feature; however, sequences cannot be randomly concatenated for submission. See the Gapped Genome Submissions page for more information about adding assembly_gap features.
- Sequences cannot begin or end with Ns.
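Two of the standards above, the length threshold and the ban on leading or trailing Ns, are easy to check before submission. The following is a minimal sketch with hypothetical helper names (`read_fasta`, `check_sequence`); it assumes plain FASTA input and does not replace NCBI's own validation, vector screening, or gap checks.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def check_sequence(seq: str, in_scaffold: bool = False,
                   min_length: int = 200) -> List[str]:
    """Return a list of problems found in one sequence (empty if none)."""
    problems = []
    # The length rule is waived for components of multi-component scaffolds.
    if not in_scaffold and len(seq) <= min_length:
        problems.append(
            f"length {len(seq)} bp is not greater than {min_length} bp")
    if seq[:1].upper() == "N":
        problems.append("sequence begins with N")
    if seq[-1:].upper() == "N":
        problems.append("sequence ends with N")
    return problems
```

A typical use would be to loop over `read_fasta(open("contigs.fsa"))` (the file name is a placeholder) and report every record for which `check_sequence` returns a non-empty list.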
WGS genomes without annotation, or with PGAP annotation, require at least two weeks to be processed. Genomes submitted with annotation require at least one month for processing. Please submit your genome assembly with enough lead time.
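As a planning aid, the minimum processing times above can be turned into a latest submission date. This is only a sketch: the category names (`"none"`, `"pgap"`, `"submitter"`) and the function are hypothetical, "one month" is approximated as 30 days, and the stated times are minimums, so real processing may take longer.

```python
from datetime import date, timedelta

# Minimum processing time per annotation category (assumed labels).
MIN_PROCESSING = {
    "none": timedelta(weeks=2),       # unannotated genome
    "pgap": timedelta(weeks=2),       # annotation requested from PGAP
    "submitter": timedelta(days=30),  # annotation supplied by the submitter
}


def latest_submission_date(target_release: date, annotation: str) -> date:
    """Return the last day to submit so the minimum lead time is met."""
    if annotation not in MIN_PROCESSING:
        raise ValueError(f"unknown annotation category: {annotation!r}")
    return target_release - MIN_PROCESSING[annotation]
```

For a target release of 31 March 2024, this gives 17 March 2024 for an unannotated or PGAP-annotated genome and 1 March 2024 for a genome submitted with annotation.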
Requirements:
- Each genome must belong to a BioProject. Genomes sequenced as part of the same research effort can belong to a single BioProject. Registering a new BioProject can be done during the WGS submission process for unannotated (or PGAP-annotated) genomes; however, genomes submitted with annotation will need to be pre-registered.
- Register the source information for each genome in the BioSample database. If the same sample is used for two different genome assemblies, then use the same BioSample for both. Registering a new BioSample can be done during the WGS submission process for unannotated (or PGAP-annotated) genomes; however, genomes submitted with annotation will need to be pre-registered to get a locus_tag prefix.
- Raw reads should be submitted to SRA.
- A Genome-Assembly-Data Structured Comment must be provided. It can be supplied in the submission web page during the genome submission. Alternatively, it can be created using the Structured Comment Template and then included in the genome file that is submitted. Additional information is in the WGS Submission Guide.
- If annotation is provided, the product names should follow the International Protein Nomenclature Guidelines.
- Annotation must be biologically valid (and error-free).
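The registration rules in the list above differ by annotation type, which can be summarized as a small lookup. This is an illustrative sketch; the function name, category labels, and returned keys are hypothetical and simply restate the requirements listed here.

```python
def registration_plan(annotation: str) -> dict:
    """Summarize when BioProject and BioSample must be registered.

    annotation: "none", "pgap", or "submitter" (assumed labels).
    """
    if annotation not in {"none", "pgap", "submitter"}:
        raise ValueError(f"unknown annotation category: {annotation!r}")

    # Genomes submitted with annotation need registration in advance,
    # which is also how they obtain a locus_tag prefix.
    must_preregister = annotation == "submitter"
    timing = ("before submission" if must_preregister
              else "before or during submission")
    return {
        "bioproject": timing,
        "biosample": timing,
        "locus_tag_prefix_needed": must_preregister,
    }
```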
How to Submit to WGS
Submission details for WGS and non-WGS prokaryotic and eukaryotic genomes can be found in the WGS Submission Guide.
WGS projects without annotation, or with PGAP annotation, require at least two weeks to be processed. Projects submitted with annotation require at least one month for processing. Please submit your project with enough lead time.
If you have a large annotated genome, we recommend sending us a test file first, so that any problems can be found before you commit to generating the entire project.
How to Update an Existing WGS Submission
What Should Not Be Submitted to WGS
- Assemblies from sequences not directly sequenced by the submitter.
- A single assembly from multiple organisms.
- Complete organellar and viral genomes. They should be submitted as regular GenBank records. See GenBank Submissions for more information on how to submit these types of sequences. If the organelle belongs to an already submitted WGS genome, then include the WGS accession and the BioProject and BioSample identifiers (PRJNAxxxxxx and SAMNxxxxxx, respectively) in the Comments box during submission.
Genome Resources
- About WGS
- WGS Browser
- Genome Submission Guide
- Genome Submission Portal
- Update Genome Records
- FAQ
- table2asn
- Submitting Multiple Haplotype Assemblies
- Create Submission Template
- Eukaryotic Annotation Guide
- Prokaryotic Annotation Guide
- Annotation Example Files
- Annotating Genomes with GFF3 or GTF files
- Validation Error Explanations for Genomes
- Discrepancy Report
- NCBI Prokaryotic Genome Annotation Pipeline
- AGP Format
- Metagenome Submission Guide
- Structured Comment
- BioProject
- BioSample