Whole Genome Shotgun Sequence Submissions

DDBJ/EMBL/GenBank accepts both complete and incomplete genomes. Whole Genome Shotgun (WGS) sequencing projects are incomplete genomes or incomplete chromosomes that are being sequenced by a whole genome shotgun strategy. WGS projects may be annotated, but annotation is not required. The pieces of a WGS project are the contigs (overlapping reads), and they do not include any gaps. An AGP file can be submitted to indicate how the contig sequences are assembled together into scaffolds (contig sequences separated by gaps) and/or chromosomes. We must have the contig sequences without gaps as the basic units for all WGS projects.

WGS projects without annotation require at least two weeks to be processed. Projects with annotation require at least one month for processing. Please submit your project with enough lead time.

Below are detailed instructions for preparing a WGS submission. We recommend sending us a test file via GenomesMacroSend to see if there are problems before committing to generating the entire project.

See the list of WGS projects. The Annotation column indicates whether a project is annotated on the contigs (Y-c), on the scaffolds (Y-s) or is not annotated (-).

Introduction

Each WGS project is assigned a stable 4-letter WGS accession prefix, which does not change as the project is updated. The prefix is followed by a two-digit version number that corresponds to a particular update of the WGS project. Finally, each individual contig within the assembly is assigned a unique accession number made up of the WGS accession prefix, the version number and a six-digit contig number. For instance, if a WGS project's assigned accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (The last six digits of this ID identify the individual contig.)
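
The numbering scheme described above can be sketched in a few lines of Python. This is only an illustration of the 12-character layout (4-letter prefix, 2-digit version, 6-digit contig number); the function names are hypothetical, and the check that the prefix is four uppercase letters is an assumption rather than a documented rule.

```python
import re


def wgs_accession(prefix, version, contig=0):
    """Build a WGS accession: 4-letter prefix + 2-digit version + 6-digit contig number."""
    # Assumption: prefixes are four uppercase ASCII letters.
    if not re.fullmatch(r"[A-Z]{4}", prefix):
        raise ValueError("prefix must be four uppercase letters")
    if not 0 <= version <= 99:
        raise ValueError("version must fit in two digits")
    if not 0 <= contig <= 999999:
        raise ValueError("contig number must fit in six digits")
    return f"{prefix}{version:02d}{contig:06d}"


def parse_wgs_accession(accession):
    """Split a WGS accession into (prefix, version, contig number)."""
    match = re.fullmatch(r"([A-Z]{4})(\d{2})(\d{6})", accession)
    if match is None:
        raise ValueError(f"not a 12-character WGS accession: {accession!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))
```

With these helpers, the project-level accession is version 0 with contig 0, and the first contig of the first version is version 1 with contig 1.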

The nucleotide data from most WGS projects go into the BLAST wgs database, whereas proteins go into the BLAST nr database. Nucleotides from environmental projects are present in either the BLAST wgs or env_nt database, depending upon whether that sequence has been identified as a particular organism (wgs), or if the organism is not yet known (env_nt). Similarly, the proteins from those projects are in the nr or env_nr BLAST database.

See the Metagenome Submission Guide for information about how to submit the various elements of a metagenome project.


What To Do

  • Register your project and a locus_tag prefix with the Genome Project database. See the annotation guides for eukaryotes or prokaryotes for more information about locus_tags. Include your GenomeProject ID in correspondence about your project and with your submissions.

  • Submit the contigs as the WGS project. WGS projects consist only of the contigs (overlapping reads) of a sequencing project, not supercontigs (assembled contigs separated by gaps). Supercontig/scaffold or chromosome assembly information can be sent to us in AGP format, which will allow us to make CON records that indicate how the pieces of the WGS submission are put together. You can upload your submission files to us with GenomesMacroSend. Be sure to include in the comment box the Project ID that was assigned when you registered your project with the Genome Project database.

  • Submit the unassembled sequences to the appropriate archive database, as this information is useful for the scientific community. The Trace Archive is for traces from Sanger-style sequencing, and the Sequence Read Archive (SRA) is for runs from next-generation (massively parallel) sequencing technology. For questions about submitting to these archives, contact the Trace or SRA HelpDesk via the "Write to the Help Desk" link at the bottom of the Trace and SRA home pages. Be sure to include in your submissions and correspondence the Project ID that was assigned when you registered your project with the Genome Project database.


    WGS Project Info

  • Submit complete organellar and viral genomes as regular GenBank records by emailing the submissions to GenBank Submissions.
  • Complete, annotated genomes should be submitted to GenBank as complete genomes. The most common complete genomes are those of bacteria and archaea. For GenBank, a complete genome is defined as a gap-free sequence that is annotated. For information about complete genomes, see the bacterial genome submission guidelines.
  • Complete genomes that lack annotation are processed as WGS projects. When annotation is added, the complete genome is given a new accession number and the WGS accession number is made secondary, so that Entrez searches for either number will retrieve the complete annotated genome.
  • Include specific source information, such as strain or isolate name, country where the sample was collected, specimen voucher, sex, and any other relevant information. See the tbl2asn page for information on how to include source qualifiers in a submission.
  • In general, submit only contigs that are longer than 200 bp. However, if you are submitting an AGP file with assembly information, then you may include shorter contigs that are part of scaffolds/chromosomes.
  • Include the quality scores, when possible.
  • Annotation can be included on the WGS contigs or on the scaffold or chromosome CON records that are generated from the information in the AGP file, whichever is most appropriate for the project. Annotation that is submitted on a WGS contig will be displayed in Entrez on the scaffold or chromosome that includes that contig. Similarly, if a scaffold has annotation and is a component of a chromosome CON record, then its annotation will be displayed in Entrez on the chromosome. However, annotation that is submitted on a scaffold or chromosome CON record is not displayed on the underlying components. Contact NCBI for information about annotating scaffolds or chromosomes.
  • The table below shows three examples of WGS projects that have both contigs and scaffolds. One is unannotated and the others have annotation on either the contigs or the scaffolds. When the contigs are annotated, that annotation is also displayed on the corresponding scaffold in Entrez. Annotated records are shown in GenBank (Full) view. The accession number of each WGS project is included in the table:

    Annotated Contigs   Annotated Scaffolds   No Annotation
    ACZS00000000        ABXC00000000          AAGU00000000
    WGS contig          WGS contig            WGS contig
    Scaffold CON        Scaffold CON          Scaffold CON

    How to Create a WGS Submission

    Submissions to WGS can be created with tbl2asn, a command line program that automates parts of the submission process. tbl2asn reads a template along with sequence (*.fsa) files and optional quality score (*.qvl) and annotation table (*.tbl) files, and outputs an ASN.1 file (*.sqn) for submission to GenBank. You can upload your submission files to us with GenomesMacroSend. Be sure to include in the comment box the Project ID that was assigned when you registered your project with the Genome Project database.

    File Format

    Sequence
    Nucleotide sequences of any size in FASTA format can be used as input with tbl2asn. FASTA format consists of a single definition line, beginning with a '>', followed by text and subsequent lines of sequence. [See below for information about having multiple sequences in a single file.] At a minimum, all definition lines must contain [tech=wgs] (to indicate that these sequences are whole genome shotgun sequences) and an identifier for the nucleotide sequence, called the SeqID. The SeqID must be unique for each sequence, and is important when updating a WGS project. It cannot begin with the word "assembly" as that causes errors in tbl2asn.

    Other information about the biological source of the organism, including the organism name, should also be included in the definition line of the sequence. In addition to the organism name, include other source modifiers that are known, e.g. [strain=yyy] [chromosome=nnn]. Note that there are no spaces surrounding the equal sign. A complete list of modifiers is available from the Sequin FAQ page.

    If there is annotation and the organism does not use the standard genetic code, include the correct genetic code in the defline. For example, include [gcode=11] for bacteria.

    The definition line must be on a single line with no line break.

    A sample definition line is
    >SeqID [organism=Mus musculus] [strain=BALB/c] [tech=wgs] [chromosome=2]
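
The definition-line rules above lend themselves to a small helper. The following is a minimal sketch, not part of tbl2asn: the function name `make_defline` is hypothetical, the case-insensitive test for a leading "assembly" is an assumption, and the order in which modifiers are written is simply the order in which they are supplied.

```python
def make_defline(seq_id, organism, modifiers=None):
    """Return a one-line FASTA definition line for a WGS contig.

    seq_id    -- nucleotide SeqID (must not begin with "assembly")
    organism  -- organism name, written as [organism=...]
    modifiers -- optional dict of further source modifiers, e.g. {"strain": "BALB/c"}
    """
    if not seq_id or any(ch.isspace() for ch in seq_id):
        raise ValueError("SeqID must be non-empty and contain no whitespace")
    # Assumption: the restriction is checked case-insensitively here.
    if seq_id.lower().startswith("assembly"):
        raise ValueError('SeqID cannot begin with the word "assembly"')

    fields = {"organism": organism}
    fields.update(modifiers or {})
    fields["tech"] = "wgs"  # required on every WGS definition line

    parts = []
    for key, value in fields.items():
        value = str(value).strip()
        if "\n" in value or "\r" in value or "]" in value:
            raise ValueError(f"modifier {key!r} must fit on one line without ']'")
        parts.append(f"[{key}={value}]")  # no spaces around the equal sign
    return ">" + seq_id + " " + " ".join(parts)
```

For example, `make_defline("contig1", "Mus musculus", {"strain": "BALB/c", "chromosome": "2"})` produces a line equivalent to the sample above, with the modifiers in a different order.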

    Annotation

    Annotation can be included by creating a 5-column table in a .tbl file for each .fsa file. Go to the appropriate page for information about the format of the table and the desired annotation for eukaryotic or prokaryotic genomes. Three required fields are:

    • locus_tag for genes
      • The locus_tag is the systematic name of the gene and is used for tracking individual genes. It therefore must be unique across all the genes in a project. If a gene's biological name is known, then it is included as the gene qualifier in the table.

    • protein_id for proteins
      • The protein_id is the SeqID of the protein (analogous to the nucleotide SeqID) and is used to track the protein. All of the SeqIDs, both nucleotide and protein, must be unique within a project. For WGS projects you can use type general protein_id's (format: gnl|dbname|SeqID) or local protein_id's (format: lcl|SeqID). Note that during our processing both forms of protein_id's are converted to type general id's in the format gnl|WGS:XXXX|SeqID, where XXXX is the project_ID. You can use your name or your lab name as the dbname, if you choose to create these as type general protein_id's.

    • product for proteins
      • The product is free text, chosen by the submitter. Protein names should be concise names, not descriptions or phrases. BLAST similarity results can be included as a note, or can be modified to be used as the product name. For example, if BLAST results indicate that the translation is similar to XYZ protein, then the product name could be "XYZ-like protein". If the protein is predicted and the product name is not known, use "hypothetical protein" as the product name.

    Note that the nucleotide SeqID appears in the DEFINITION line in the flatfile view of the record. Although the protein SeqIDs are not displayed in the final flatfile view, they are present in the ASN.1.
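
As an illustration of the identifier rules above, the sketch below formats protein_ids in the two accepted forms and reports identifiers that occur more than once in a project. The function names and the example dbname "SmithLab" are hypothetical; rejecting '|' inside a name is an assumption made to keep the identifiers unambiguous.

```python
from collections import Counter


def general_protein_id(dbname, seq_id):
    """Format a type general protein_id: gnl|dbname|SeqID."""
    for part in (dbname, seq_id):
        # Assumption: '|' is the field separator, so it is not allowed inside a field.
        if not part or "|" in part:
            raise ValueError(f"invalid identifier component: {part!r}")
    return f"gnl|{dbname}|{seq_id}"


def local_protein_id(seq_id):
    """Format a local protein_id: lcl|SeqID."""
    if not seq_id or "|" in seq_id:
        raise ValueError(f"invalid SeqID: {seq_id!r}")
    return f"lcl|{seq_id}"


def duplicated(identifiers):
    """Return the sorted identifiers that appear more than once."""
    counts = Counter(identifiers)
    return sorted(name for name, n in counts.items() if n > 1)


# Example with made-up locus_tags and SeqIDs.
locus_tags = ["ABC_0001", "ABC_0002", "ABC_0002"]
print(duplicated(locus_tags))
print(general_protein_id("SmithLab", "ABC_0001"))
```

Running `duplicated` over the combined nucleotide and protein SeqIDs of a project is one way to check the uniqueness requirement before running tbl2asn.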

    See example *.fsa and *.tbl files for various situations, such as partial CDS or features on the minus strand.

    Quality Scores

    Use tbl2asn to include the Phrap/Consed quality scores of a sequence. The scores must be in files named *.qvl that are in the same directory and have the same nucleotide SeqIDs as the corresponding *.fsa files.
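
A quick consistency check between a .fsa file and its .qvl file might look like the sketch below. It assumes that the .qvl file uses FASTA-style header lines beginning with '>' whose first token is the SeqID, as Phrap quality files do; the function names are hypothetical.

```python
def read_seq_ids(path):
    """Return the SeqIDs (first token of each '>' line) found in a file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                ids.append(fields[0] if fields else "")
    return ids


def compare_seq_ids(fsa_path, qvl_path):
    """Return (ids only in the .fsa file, ids only in the .qvl file)."""
    fsa_ids = set(read_seq_ids(fsa_path))
    qvl_ids = set(read_seq_ids(qvl_path))
    return sorted(fsa_ids - qvl_ids), sorted(qvl_ids - fsa_ids)
```

Two empty lists mean that every sequence has a matching set of quality scores and vice versa.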

    Template File
    The template file for tbl2asn is created with Sequin. On the starting Sequin page, choose "Start New Submission". Enter a manuscript title if desired. Enter the contact, authors and affiliation information, then return to the submission tab and use File->Export Submitter Info. Save the file as 'template.sbt'.

    If there is a published reference, it can be sent separately with the submission, to be added to the records by GenBank staff during WGS processing.

    tbl2asn does not include a release date in the output file, so include that information in your email message to us when you submit.

    Using tbl2asn

    In a specified directory, tbl2asn looks for .fsa files and any .tbl and .qvl files with the same basename, for example file1.fsa, file1.tbl and file1.qvl, and it builds ASN.1 records from them. The ASN.1 record for this example would be called file1.sqn. The results of the validation would be in a file named file1.val. Most validation errors must be fixed before the .sqn files can be submitted to GenBank; however, taxonomy-related errors can generally be ignored. Upload the .sqn files to us with GenomesMacroSend. Be sure to include in the comment box the Project ID that was assigned when you registered your project with the Genome Project database.
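
The basename matching described above can be previewed before running tbl2asn. The sketch below only lists which files belong together and what the output files would be called; it is an illustration with a hypothetical function name, not a reimplementation of tbl2asn's own file discovery.

```python
from pathlib import Path


def find_input_sets(directory):
    """Group each .fsa file with the .tbl and .qvl files that share its basename."""
    sets = {}
    for fsa in sorted(Path(directory).glob("*.fsa")):
        tbl = fsa.with_suffix(".tbl")
        qvl = fsa.with_suffix(".qvl")
        sets[fsa.stem] = {
            "fsa": fsa,
            "tbl": tbl if tbl.exists() else None,  # annotation table is optional
            "qvl": qvl if qvl.exists() else None,  # quality scores are optional
            "sqn": fsa.stem + ".sqn",              # expected ASN.1 output name
            "val": fsa.stem + ".val",              # expected validation output name
        }
    return sets
```

A .tbl or .qvl file without a matching .fsa file does not appear in the result, which makes misnamed files easy to spot.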

    Some tbl2asn options that are relevant to WGS submissions are:

    -p  Path to the directory. If the files are in the current directory, "-p ." should be used.
    -r  Path for the resulting .sqn file(s). If the -r argument is not used, the .sqn files will be saved in the source directory.
    -t  Specifies the template file (.sbt). If the .sbt file is in a different directory, the full path must be specified.
    -j  Allows the addition of source qualifiers that will be the same for each submission. Example: -j "[organism=Saccharomyces cerevisiae] [strain=S288C]".
    -V  Verification (combine any of the following letters):
          v : Validates the data records. The output is saved to files with a .val suffix.
          b : Generates GenBank flatfiles with a .gbf suffix (this format is just for viewing and is not acceptable for submission).
        Sample command line: -V v
    -y  Adds a COMMENT to each submission. Example: -y "Contigs larger than 2kb have been annotated, representing approx. 87% of the total genome".
    -Y  Like -y, but adds a COMMENT to each submission from a file.
    -Z  Runs the Discrepancy Report. Must supply an output file name. Recommended only for annotated genome submissions, complete or WGS. See the Discrepancy Report page for information about its output.

    Go to the tbl2asn page for more detailed information about tbl2asn, its command line arguments, and file formats.

    Sample command:

    tbl2asn -t template.sbt -p path_to_files -V v
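
When tbl2asn is called from a script, it can help to assemble the argument list programmatically. The sketch below uses only the options described on this page (-t, -p, -r, -a, -j, -y, -V and -Z); the function name and its parameters are hypothetical, and the command is built but not executed.

```python
import shlex


def tbl2asn_command(template, directory, validate=True, flatfile=False,
                    multi_fasta=False, source_qualifiers=None, comment=None,
                    output_dir=None, discrepancy_report=None):
    """Return the tbl2asn argument list for a WGS submission run."""
    args = ["tbl2asn", "-t", template, "-p", directory]
    if output_dir:
        args += ["-r", output_dir]
    if multi_fasta:
        args += ["-a", "s"]  # several sequences per .fsa file
    if source_qualifiers:
        args += ["-j", source_qualifiers]
    if comment:
        args += ["-y", comment]
    verification = ("v" if validate else "") + ("b" if flatfile else "")
    if verification:
        args += ["-V", verification]
    if discrepancy_report:
        args += ["-Z", discrepancy_report]
    return args


command = tbl2asn_command("template.sbt", "path_to_files")
print(shlex.join(command))  # shell-quoted form, suitable for logging
```

The list can be passed directly to `subprocess.run(command, check=True)` once tbl2asn is installed and on the PATH; passing a list rather than a single string avoids quoting problems with values such as the -j qualifiers.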

    Multiple Sequences in a Single .fsa File

    If there are more than 1,000 contigs, then we recommend that you combine the FASTA (and other) files, putting 10,000 or fewer sequences into each file. Combining multiple FASTA sequences into a single file can be useful for smaller projects too.

    The corresponding .tbl and .qvl files must have the information for all of the sequences that are in the .fsa file. Run tbl2asn with the "-a s" argument so that each definition line is recognized as the beginning of a new sequence. A single .sqn file will then be generated for the multiple sequences of each .fsa file.

    Sample command:

    tbl2asn -t template.sbt -p path_to_files -a s -V v

    See example .fsa and .tbl files.
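
One way to keep each combined file at 10,000 sequences or fewer is to split a stream of FASTA lines at record boundaries, as in the sketch below. The function name is hypothetical, and the code handles only the sequence file; the matching .tbl and .qvl files would need to be split at the same SeqIDs.

```python
def split_fasta_records(lines, max_per_file=10000):
    """Yield lists of lines, each holding at most max_per_file FASTA records."""
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; close the chunk if it is already full.
            if count == max_per_file:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk
```

Each yielded list can be written to its own .fsa file, for example with `handle.writelines(chunk)`, and then processed with the "-a s" argument.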

    AGP Files to Build Scaffolds and/or Chromosomes

    If you have information about how the contigs are assembled into scaffolds (supercontigs) or chromosomes, then submit an AGP file with that information. AGP files provide the ordering and orientation information to construct supercontigs or scaffolds from contigs, or to construct chromosomes from supercontigs and/or contigs. More information about genome assemblies is here. See this page for the AGP format.

    Some specific requests are:

  • Use "100" as the length and U as the component-type for gaps of unknown size, as that is the GenBank convention. These will appear as gap(unk100) in the flatfile view of the GenBank record.
  • Include the accession.version number as the component identifier, not just the accession number. If you do not know the accession numbers then use the SeqIDs of the contigs, from the .fsa files, and they will be converted to the accession.version numbers during processing at NCBI.
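
To illustrate the two requests above, the sketch below writes the AGP lines for one scaffold whose contigs are separated by gaps of unknown size. Only the gap length of 100, the component type U and the use of accession.version identifiers come from this page. The nine-column layout, the component type W for WGS contigs, and the gap_type and linkage values are assumptions based on a reading of the AGP format and should be checked against the AGP specification; the function name is hypothetical.

```python
def scaffold_agp_lines(scaffold_id, contigs, gap_type="fragment", linkage="yes"):
    """Return tab-delimited AGP lines for one scaffold.

    contigs -- list of (component_id, length, orientation) in scaffold order,
               where component_id is an accession.version or a contig SeqID.
    A gap of unknown size (type U, length 100) is placed between contigs.
    """
    lines = []
    position, part = 1, 1
    for index, (component_id, length, orientation) in enumerate(contigs):
        if index > 0:
            gap_end = position + 100 - 1
            # Assumed gap columns: length, gap_type, linkage, empty last column.
            gap = [scaffold_id, position, gap_end, part, "U", 100, gap_type, linkage, ""]
            lines.append("\t".join(str(field) for field in gap))
            position, part = gap_end + 1, part + 1
        end = position + length - 1
        # Assumed component columns: id, begin, end, orientation; W = WGS contig.
        row = [scaffold_id, position, end, part, "W", component_id, 1, length, orientation]
        lines.append("\t".join(str(field) for field in row))
        position, part = end + 1, part + 1
    return lines
```

Called with two contigs of 500 and 300 bases, the function returns three lines: the first contig at positions 1 to 500, a gap at 501 to 600, and the second contig at 601 to 900.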

    Updating a WGS Project

    If the same version of a WGS project is being updated, with annotation, for example, then the SeqIDs must be identical and the accession numbers must be included in the update, for both nucleotides and proteins. The correct format of the identifiers in such an update is:

    gnl|WGS:XXXX|SeqID|gb|XXXX01xxxxxx

    where XXXX is the accession prefix and XXXX01xxxxxx is the contig's accession number. We recommend that you send a test file to NCBI with details of your plans before generating a complicated update.
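
The update identifier can be assembled mechanically from its three parts, as in this minimal sketch. The function name is hypothetical, and the check that the accession starts with the prefix is an added safeguard rather than a documented requirement.

```python
def update_seq_id(prefix, seq_id, accession):
    """Format an identifier for updating an existing WGS record.

    prefix    -- the project's 4-letter WGS accession prefix
    seq_id    -- the SeqID used in the original submission
    accession -- the accession number assigned to that contig
    """
    # Safeguard (assumption): a contig accession should begin with the project prefix.
    if not accession.startswith(prefix):
        raise ValueError("accession does not belong to this WGS project")
    return f"gnl|WGS:{prefix}|{seq_id}|gb|{accession}"


print(update_seq_id("XXXX", "contig1", "XXXX01000001"))
```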
     

    If you need additional assistance in preparing WGS submissions, please contact genomes@ncbi.nlm.nih.gov.

    Revised November 20, 2009