Whole Genome Shotgun Sequence Submissions

DDBJ/EMBL/GenBank accepts both complete and incomplete genomes. Whole Genome Shotgun (WGS) sequencing projects are incomplete genomes or incomplete chromosomes of prokaryotes or eukaryotes that are being sequenced by a whole genome shotgun strategy or a hybrid strategy. See the project info section for details on what is and is not suitable for submission as a WGS project.

WGS projects may be annotated, but annotation is not required. The pieces of a WGS project are the contigs (overlapping reads), and they do not include any gaps. An AGP file can be submitted to indicate how the contig sequences are assembled together into scaffolds (contig sequences separated by gaps) and/or chromosomes. We must have the contig sequences without gaps as the basic units for all WGS projects.

WGS projects without annotation require at least two weeks to be processed. Projects with annotation require at least one month for processing. Please submit your project with enough lead time.

See the submission instructions. General information about WGS projects is below. If you have a large annotated genome, we recommend sending us a test file via GenomesMacroSend so that any problems can be identified before you generate the entire project.

See the list of WGS projects, which is also available as a sortable display. The Annotation column indicates whether a project is annotated on the contigs (Y-c), annotated on the scaffolds (Y-s), or not annotated (-).

See the Assembly Basics page for more detailed information about genome assemblies.

Introduction

Each WGS project is assigned a stable 4-letter WGS accession prefix, which does not change as the project is updated. In addition to the WGS accession prefix, the contig identifiers have a 2-digit version number corresponding to a particular WGS project update. Finally, each individual contig within the assembly is assigned a unique accession number that begins with the WGS accession prefix and version number. For instance, if a WGS project's assigned accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (The last six digits of this ID identify each individual contig.) When additional sequencing is done and the genome is reassembled, the contigs are submitted as the 02 version of the WGS project. No linkage or relationship is expected between the old and new contigs, and the new contigs are given new accession numbers beginning with XXXX02000001. The 01 contigs are suppressed when the 02 contigs are released.
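
The accession layout described above can be illustrated with a short Python sketch. It is not an NCBI tool: the function names and the prefix ABCD are hypothetical, and the pattern assumes exactly the layout in the example (4 letters, a 2-digit version, and 6 digits identifying the contig).

```python
import re
from typing import NamedTuple

# Assumed layout, taken from the XXXX01000001 example:
# 4-letter prefix + 2-digit version + 6-digit contig number.
_WGS_PATTERN = re.compile(r"([A-Z]{4})(\d{2})(\d{6})")


class WgsAccession(NamedTuple):
    prefix: str   # stable 4-letter project prefix
    version: int  # assembly version (00 for the project accession itself)
    contig: int   # contig number within that version


def parse_wgs_accession(accession: str) -> WgsAccession:
    """Split an accession such as ABCD01000001 into its three parts."""
    match = _WGS_PATTERN.fullmatch(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS-style accession: {accession!r}")
    prefix, version, contig = match.groups()
    return WgsAccession(prefix, int(version), int(contig))


def format_wgs_accession(prefix: str, version: int, contig: int) -> str:
    """Rebuild the accession string from its parts, zero-padding the numbers."""
    return f"{prefix}{version:02d}{contig:06d}"


def is_project_accession(acc: WgsAccession) -> bool:
    """True for the project-level accession (XXXX00000000 in the example)."""
    return acc.version == 0 and acc.contig == 0
```

For example, parse_wgs_accession("ABCD02000001") yields the prefix ABCD, version 2, and contig 1, which is the first contig of a reassembled project.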

The nucleotide data from most WGS projects go into the BLAST wgs database, whereas proteins go into the BLAST nr database. Nucleotides from environmental projects are present in either the BLAST wgs or env_nt database, depending upon whether that sequence has been identified as a particular organism (wgs), or if the organism is not yet known (env_nt). Similarly, the proteins from those projects are in the nr or env_nr BLAST database.
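
The database placement described above can be summarized as a small decision function. This is only a sketch of the stated rule: the function and parameter names are hypothetical, and because the text says "most" WGS projects, exceptions should be expected.

```python
def blast_database(molecule: str, environmental: bool, organism_identified: bool) -> str:
    """Return the BLAST database name suggested by the rule in the text.

    molecule: "nucleotide" or "protein".
    environmental: True for sequences from an environmental project.
    organism_identified: True if the sequence has been identified as a
        particular organism (only relevant for environmental projects).
    """
    if molecule not in ("nucleotide", "protein"):
        raise ValueError(f"unknown molecule type: {molecule!r}")

    # Environmental sequences from a not-yet-known organism go to the env_* databases.
    if environmental and not organism_identified:
        return "env_nt" if molecule == "nucleotide" else "env_nr"

    # Everything else follows the general rule: nucleotides to wgs, proteins to nr.
    return "wgs" if molecule == "nucleotide" else "nr"
```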

See the Metagenome Submission Guide for information about how to submit the various elements of a metagenome project.


WGS Project Info

Updating a WGS Project

If the same version of a WGS project is being updated (for example, to add annotation), then the SeqIDs must be identical to those of the original submission, and the accession numbers must be included in the update for both nucleotides and proteins. The correct format of the identifiers in such an update is:

gnl|WGS:XXXX|SeqID|gb|XXXX01xxxxxx

where XXXX is the accession prefix and XXXX01xxxxxx is the contig's accession number. We recommend that you send a test file to NCBI with details of your plans before generating a complicated update.
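
As an illustration, the identifier format shown above could be assembled as follows. This is a minimal sketch rather than an official utility: the function name, the SeqID contig1, and the prefix ABCD are hypothetical, and the checks reflect only what the format implies (the accession starts with the project prefix, and the SeqID contains no "|" separator).

```python
def build_update_identifier(prefix: str, seq_id: str, accession: str) -> str:
    """Build an identifier of the form gnl|WGS:XXXX|SeqID|gb|XXXX01xxxxxx."""
    if not seq_id or "|" in seq_id:
        raise ValueError(f"SeqID must be non-empty and contain no '|': {seq_id!r}")
    if not accession.startswith(prefix):
        raise ValueError(
            f"accession {accession!r} does not begin with prefix {prefix!r}"
        )
    return f"gnl|WGS:{prefix}|{seq_id}|gb|{accession}"


# Hypothetical example values, for illustration only.
print(build_update_identifier("ABCD", "contig1", "ABCD01000001"))
```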
 

If you need additional assistance in preparing WGS submissions, please contact genomes@ncbi.nlm.nih.gov.

Revised January 17, 2012
