Prokaryotic and Eukaryotic Genomes Submission Guide

Both WGS and non-WGS genomes, including gapless complete bacterial chromosomes, can be submitted via the Submission Portal. You will be asked to choose whether the genome being submitted is considered WGS or not. The differences for GenBank purposes are:

Each chromosome in a non-WGS genome is in a single piece and there are no extra sequences. A WGS genome may still have chromosomes in multiple pieces and/or unplaced sequences

non-WGS

  • Each chromosome is in a single sequence and there are no extra sequences
  • Each sequence in the genome must be assigned to a chromosome or plasmid or organelle
  • Plasmids and organelles can still be in multiple pieces.

WGS

  • One or more chromosomes are in multiple pieces and/or some sequences are not assembled into chromosomes

In both cases

  • There can still be gaps within the sequences; you will supply that information in the submission
  • Plasmids and organelles can still be in multiple pieces.
  • Internal sequences must be arranged in the correct order and orientation.
  • Sequences concatenated in unknown order are not allowed.

Table of Contents

Submit a single genome

This is the simplest submission route because you just fill in a web form in the Submission Portal and upload fasta (or sqn) files of the genome sequences. You will need to:

  • Provide the BioProject created for this research effort, e.g., during submission of the reads to SRA OR register a new BioProject during the genome submission.
  • Provide the BioSample created during submission of the reads to SRA OR register a BioSample during the genome submission
  • Assert whether this is a WGS or non-wgs genome assembly
  • Upload fasta sequences of the genome (or .sqn, file if the genome is annotated)
  • Upload optional AGP file(s) to assemble scaffolds (unplaced or unlocalized) and/or chromosomes from the submitted sequences. This is for WGS only. Remember that you can submit the gapped scaffolds themselves instead of submitting contigs plus an AGP file
  • Provide information in response to prompts during the genome submission (see the common metada section):
    • Genome Assembly Data and other information about this genome assembly
    • Gap Information (What the Ns represent)
    • Chromosome and plasmid assignments. Every sequence in a non-wgs genome must have a chromosome or plasmid assignment and every chromosome must be submitted as a single sequence.
    • Authors and a title (for fasta submissions)
    • Release date (immediately after processing OR a specific date. Release will be on that date or upon publication, whichever is first)
    • Optional request for annotation of prokaryotic genomes by PGAP

Submit a batch of genomes

This submission route allows you to submit as many as 400 WGS or non-wgs genomes in a single batch submission. In this route you choose Batch/multiple in the Genome Submission Portal , fill in the web form, upload a Genome Info file with genome metadata, and upload or preload fasta files (or sqn files if there is annotation) of the genome sequences. All the genomes within a batch must:

  • Be part of the same BioProject
  • Be either WGS or non-wgs, not a mix of both types
  • Have the same (initial) release date
  • Have the same gap/Ns information
  • Contain either fasta files or ASN ( .sqn ) files, not a mix of file types. We recommend submitting fasta files unless the submission needs to include annotation or the Genome-Assembly-Data structured comment
  • Have a single file for each genome, including any plasmid or organelle sequences
  • Have a separate file for each genome, not all the genomes together
  • Request PGAP annotation or not (only relevant for prokaryotic genomes)
  • Be just a single layer (= no AGP file(s))

You will need to:

  • Provide the BioProject created for this research effort, e.g., during submission of the reads to SRA OR register a new BioProject during the genome submission.
  • Provide the BioSamples that were preregistered, eg during submission of the reads to SRA OR register BioSamples during the genome submission
  • Include assignment (ie, chromosome, plasmid or organelle) information about the sequence in the fasta files (see the Additional requirements for batch submissions section)
  • Upload or preload fasta sequences (or sqn files for annotated genomes) of the genomes. Each genome is in a separate single file, uniquely named, but the files can be archived together
    • An option for batch submission is to preload the files of genome assemblies before beginning the submission, rather than uploading them in the browser during the submission. You can preload using Aspera, the FTP protocol or Filezilla. Detailed instructions for using the preload option for genome submissions are at How to preload files.
  • Upload a Genome Info table with information specific to each genome
  • Provide this information in response to prompts on the web pages during the genome submission (see the common metadata section):
    • Gap Information (What the Ns represent)
    • Authors and a title (for fasta submissions)
    • Release date (immediately after processing OR a specific date. Release will be on that date or upon publication, whichever is first)
    • Optional request for annotation of prokaryotic genomes by PGAP

Events

  1. Only if you will be submitting a genome with annotation and you have not yet registered a BioProject and BioSample for this genome, then you will register the genome sequencing project with the BioProject and BioSample databases so that a locus_tag prefix will be assigned to the BioProject:BioSample pair. If you have already registered a BioProject and BioSample for this genome, eg when submitting the reads to SRA, then a locus_tag prefix should have already been assigned. A file of the locus_tag prefix(es) for the BioSamples within a BioProject is linked to the BioProject submission. Write to genomes@ncbi.nlm.nih.gov if you did not receive a locus_tag prefix. Do not register a duplicate BioProject or BioSample for the same genome. Provide these preregistered BioProject and BioSample accessions in the genome submission. Remember that annotation is optional for genome submissions. If you are submitting a genome without annotation, even if you will be requesting PGAP annotation, then you'll create the BioSample (and BioProject, if necessary) during the genome submission. Genomes sequenced as part of the same research effort can belong to a single BioProject, so it's common to create a BioProject during the submission of one genome and then include that BioProject during the submission of additional genomes.
  2. Make the genome assembly data files.

    • Unannotated genomes just need fasta files
    • Annotated genomes need to make .sqn file submissions by running the command line program tbl2asn, and then fixing Errors and Fatals that are indicated in the .val and discrep files. Failure to do this will cause serious delays in processing.
  3. If you have higher-level assembly information, scaffolds and/or chromosomes, then generate an AGP file to build those objects from the wgs-contigs.

  4. If you are submitting a batch of genomes (maximum of 400 per batch), then create a Genome Info file.
  5. Submit via the new Genomes (was WGS) in the Submission Portal
  6. What happens after submission

Submission Files

Fasta files

Put the sequences file into fasta format
  • These files have the suffix .fsa.
  • Each sequence has a definition line beginning with a '>' and a unique identifier (SeqID), eg contig001, contig002. The SeqIDs must:
    • Be <50 characters
    • Can only include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
    • Be unique within a genome
  • Include in the definition line the organism and the relevant strain, breed, cultivar or isolate, if one exists for the sequenced organism. Any additional source qualifiers will be added from the registered BioSample to the genome during processing.
  • Remove any Ns from the beginning or end of each sequence.
  • Contigs should be >199nt, unless they are part of multi-component scaffolds in an AGP file
IMPORTANT: Additional requirements for batch submissions
  • All the sequences of single genome must be in one file
  • Sequences that are part of a plasmid, or an organellar chromosome, or specific nuclear chromosomes need to have that information included in the fasta definition line, in these formats:
    • [plasmid-name=pBR322]
    • [plasmid-name=unnamed] (when the plasmid name is not known. However, be sure that each plasmid has a unique name, eg unnamed1 and unnamed2. )
    • [location=mitochondrion]
    • [location=chloroplast]
    • [chromosome=2]
  • Follow the Plasmid and chromosome names rules
  • To indicate that a single gapped or ungapped sequence represents the chromosome, include [location=chromosome]. This must be true for at least one sequence when "option 1" (non-wgs genome) is selected.
  • Sequences that are a complete circular chromosome or plasmid need to have the circular topology and the completeness included. If a gapped or nongapped circular chromosome has a gap at the end or has not been circularized, please also include a note.
    • [topology=circular] [completeness=complete]
    • [topology=circular] [completeness=complete] [note=gap at end] [note=not circularized]
  • Here is an example of the definition line for the complete plasmid of a bacterial submission (all the text must be in a single line): >contig02 [organism=Clostridium difficile] [strain=ABDC] [plasmid-name=pABDC1] [topology=circular] [completeness=complete]
  • Here is an example of a gapped sequence that represents the chromosome of a bacterial genome that has not yet been circularized (all the text must be in a single line): >contig001 [organism=Clostridium difficile] [strain=ABDC] [location=chromosome] [topology=circular] [completeness=complete] [note=not circularized]

.sqn files

These are generally required only when the submitter wants to include annotation. Annotation is optional for GenBank genome submissions.

see details

Prepare a .sqn file for submission using tbl2asn. tbl2asn reads a template file along with the fasta sequence and annotation table files, and outputs an ASN (.sqn) file for submission to GenBank. Follow these three steps:

  1. Prepare data files

    • Prepare fasta files as above, with one file per genome.
    • Prepare these additional files:
      • a template file with submitter and publication information.
      • annotation files. These correspond to and have the same basenames as the .fsa files, but have the suffix .tbl. The .tbl files have a 5-column tab-delimited format, as described in the annotation instruction pages. Be sure to read the annotation requirements in the appropriate annotation guidelines:
      • New: annotation can be provided in GFF files. See Genome Annotation with GFF or GTF files for instructions with this beta product.
      • Genome-Assembly-Data Structured Comment. This information can be supplied in the submission web page (single submission) or the Genome Info table (batch submission), but if many genomes are being submitted it can be simpler to include this in the .sqn file itself. To do that, use the Genome-Assembly-Data Structured Comment Template to create the file and then have it included with -w genasm.cmt in the tbl2asn commandline, below.
      • quality scores of the sequences. These files correspond to and have the same basenames as the .fsa files, but have the suffix .qvl. The quality scores are optional, but desired.
  2. Run tbl2asn (version 22.9 or higher). Sample command line when the sequences are contigs (overlapping reads with no Ns representing gaps) is

    tbl2asn -p path_to_files -t template -M n -Z discrep
    

    If the sequences contain Ns that represent gaps, then run the appropriate tbl2asn command line with the -l and -a arguments, as described in the Gapped Genome Submission page. The command line for the most common situation (runs of 10 or more Ns represent a gap, and there are no gaps of completely unknown size, and the evidence for linkage across the gaps is "paired-ends") is:

    tbl2asn -p path_to_files -t template -M n -Z discrep -a r10k -l paired-ends
    

    For either case you can include the source information in the definition line of each contig, as described in the fasta defline components section, above. Alternatively, the organism and strain (or breed or isolate) can be included with -j in the tbl2asn command line. The additional source qualifiers will be obtained from the registered BioSample. However, chromosome, plasmid & organelle assignment information must be included in the fasta definition lines. In addition, if the submission is an annnotated prokaryotic genome, then include the genetic code with -j in the commandline, for example:

    tbl2asn -p path_to_files -t template -M n -Z discrep -j "[organism=Clostridium difficile ABDC] [strain=ABDC] [gcode=11]"
    
    Option Description
    -M n To fix some product names and run genome-specific functions and validator
    -j Allows the addition ofsource qualifiersthat are the same for every sequence in every fasta file being read. Examples:
    -j "[organism=Mus musculus] [tissue-type=liver]"
    -j "[organism=Escherichia coli] [strain=ABC1] [gcode=11]"
    -Z Discrepancy Report Output File
    -c s Add exception to short introns
    -V b Generate GenBank Flatfile
    -Y To import a file that is a text comment
    -w assembly.cmt To import Structured Comment Table. This is optional, but can be helpful when there are multiple genomes, because there will be less information to supply on the web form during submission. This file can be created at Structured Comment Template

    Table 1. Some relevant arguments for tbl2asn

  3. Check the output of the validation and discrepancy report and fix problems

    • Check the errorsummary.val file for the number, severity and type of errors that are present in the .val files. All Errors and Rejects need to be fixed. The presence of errors will slow processing. See the genome validation errors for guidance. Contact genomes@ncbi.nlm.nih.gov with any questions about the validation output. During processing there may be some questions about other aspects of the submission.
    • Check the file named 'discrep' for the results of the discrepancy report. Categories prefaced with FATAL are always unacceptable and must be fixed. Some of the categories are informational. Reports that are not flagged as fatal should be examined to determine if they represent annotation artifacts that need to be corrected or if they are acceptable due to the biology of the genome. See the discrepancy report examples and explanations for guidance. Write to genomes@ncbi.nlm.nih.gov and send the discrep file with questions about this report.
    • Make any necessary fixes to the input .fsa and/or .tbl files and run tbl2asn again.

AGP file (optional)

AGP files provide the ordering and orientation information to construct scaffolds from contigs, or to construct chromosomes from scaffolds and/or contigs. However, remember that we do accept the gapped scaffolds themselves as the basic sequences of the genome. If you choose to submit a multi-layer submission with and AGP file, then know that the AGP file defines these genome assemblies, so be sure to include all wgs-contigs that are considered to be part of the genome in the AGP file. However, if the sequences in the fasta (or .sqn) files are already the scaffolds or chromosomes, then do not make an AGP file.

see details

See this page for the AGP format.

There are 3 types of AGP files:

  • Unplaced scaffolds = scaffolds without chromosome or plasmid assignments
  • Chromosome = the objects built in the AGP file represent the nuclear or organellar chromosomes or plasmids
  • Unlocalized scaffolds = scaffolds that are known to belong to a nuclear or organellar chromosome or plasmid but are not part of the assembly to build those chromosomes or plasmids
  • NOTE: assignments will need to be provided for the objects being assembled in the Chromosome and Unlocalized AGP files, not the sequences in the fasta/sqn files

Some specific requests are:

  • Encode the type of object in the object names in column 1 like this:
    • scaffold01, scaffold02, etc for scaffolds
    • chr (for bacteria) OR chr1, chr2, etc for eukaryotic chromosomes
    • the plasmid names for plasmids (eg pBR322). If the name is not known, then use 'unnamed'.
    • MT for the mitochondrial genome. MT_scaf01, MT_scaf02, etc for mitochondrial scaffolds.
  • Use "100" as the length and U as the component-type for gaps of unknown size, as that is the GenBank convention. These will appear as gap(unk100) in the flatfile view of the GenBank record.
  • Use the same contig identifiers in column 6 (the component-id) that you used in the .fsa files. If the components have already been assigned accession numbers, then you need to use the accession.version numbers as the component identifiers; do not use just the accession number.
  • Use different identifiers for the object in column 1 and the component in column 6, even if the object consists of a single component.

You can validate the basic format of your AGP file at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_validate.cgi. In addition, the standalone commandline program, agp_validate is available by anonymous FTP to validate the AGP file more extensively yourself. The -help option details the arguments and command line format.

Genome Info table

The Genome Info table is required for batch submissions and is used to provide the Genome Assembly Data of each. Download the Genome Info file template. The instructions are on the first tab of this file and the template is on the second tab. Complete the second tab (Genome_Data), then save the worksheet as a Text (Tab-delimited) file -- (use 'File, Save as, Save as type: Text (Tab-delimited)' ).

see details

Each row in the template represents a genome. The required fields are:

  • Biosample accession OR sample_name
  • Assembly method
  • Assembly method version
  • Genome coverage
  • Sequencing technology
  • File name

Optional fields:

  • Assembly date
  • Assembly name
  • Reference genome
  • Update (update_for)
  • bacteria_available_from

Definitions of these fields are in the Genome Assembly Data section and also as comments in the template itself. Instructions:

  • If you created BioSamples previously, provide accessions in the form of SAMN# in the column biosample_accession. See the Example of Genome Info file using BioSample Accessions.
  • If you are creating samples during this genome submission, provide the names of samples that you just created in the column 'sample_name'. Please note that sample names must be unique within your entire account. See the Example of Genome Info file using BioSample Names. Do not include both sample_name and biosample_accession for a genome.
  • If you archived your files by tar utility, you must list file names that are contained in the archive not the name of the tarred file.
  • Provide exact file names (including extensions) in the filename column.
  • Supported extensions for compressed files: tar, tar.gz, gz, bz2 (do not use zip!).
  • File names must be unique.

Metadata required for all genome submissions

BioProject

The BioProject contains the description of the research effort, relevant grant(s), and has links to the public data for the proejct. Each genome must belong to a BioProject, and genomes sequenced as part of the same research effort can belong to a single BioProject. Use the same BioProject for the sequence reads and genome assembly made from those reads; do not create duplicate BioProjects. If a new BioProject is necessary for unannotated (or PGAP-annotated) genomes, then registering during the genome submission process is simplest. However, genomes submitted with annotation will need to be pre-registered so that a locus_tag prefix can be assigned to the BioProject/BioSample pair and used to identify each gene within that genome uniquely. A file of the locus_tag prefix(es) for the BioSamples within a BioProject is linked to the BioProject submission. Write to genomes@ncbi.nlm.nih.gov if you did not receive a locus_tag prefix after preregistering a BioSample for your BioProject.

BioSample

The BioSample contains the source information of the sample that was sequenced. Use the same BioSample for the sequence reads and genome assembly made from those reads; do not create duplicate BioSamples. Registering a new BioSample can be done during the genome submission process for unannotated (or PGAP-annotated) genomes; however, genomes submitted with annotation will need to be pre-registered to get a locus_tag prefix. Include the registered BioProject when you register the BioSample so that a locus_tag prefix is assigned to the pair. You'll find the locus_tag assignment(s) in a file linked to the BioProject submission.

Genome Assembly Data and other information about a genome assembly

  • Assembly method : Name of the assembly algorithm(s)
  • Assembly method version or date : version of the algorithm or date it was run
  • Genome coverage : The estimated base coverage across the genome, eg 12x.
  • Sequencing technology : sequencing platform(s) used
  • Assembly date : Optional. Year, month or day the assembly was made. Date formats: YYYY-MM-DD; YYYY-MM; YYYY
  • Assembly name : Optional and not usually relevant for prokaryotes. This is a short name suitable for display that does not include the organism name eg, LoxAfr_3.0 for a Loxodonta africana assembly, version 3.0
  • Full or Partial Genome in the sample : the answer is nearly always "yes, Full". Choose "no, partial" only if a subset of the sample was deliberately selected, eg just exomes or a single chromosome of a eukaryote or only the non-repetitive regions of the genome
  • Reference genome : If this is NOT a de novo assembly, you will need to provide the accession.version and/or the assembly name of the genome assembly that was used as the reference guide for this assembly
  • Update : accession of the genome being updated, when appropriate
  • bacteria_available_from : Optional. For prokaryotes provide a name and physical address (not email) of the lab or PI, or a culture collection identifier where scientists could obtain this bacterial culture

Gap Information: What the Ns represent

  • The minimum number of consecutive Ns that represents a gap (must be 10 or less). Be aware that the assembly statistics are always calculated using 10 or more Ns as a gap, regardless of the presence/absence of gaps in the final genome sequence.
  • The number of Ns that represents a gap of completely unknown length (usually 0; sometimes 100 or another value)
  • The evidence used to assert that the sequence on either side of the gap is linked (usually paired-ends)
  • This information is collected in the submission form for individual and batch submissions. Default answers are those that have been most commonly submitted. Be sure to select the correct answer when the defaults are incorrect for the genome(s) being submitted.

Chromosome and plasmid assignments

  • Indicate any sequences that are chromosomes or plasmids, or that belong to chromosomes or plasmids.
  • Follow the chromosome and plasmid names rules
  • Every sequence in a non-wgs genome must have a chromosome or plasmid assignment and every chromosome must be submitted as a single sequence.

Plasmid and chromosome names rules

Chromosome and plasmid names can only digits, dots, underscores, and ASCII characters in plain text in the standard English alphabet. In addition, there are rules specific for each.

Chromosome names
  • Can contain only digits, dots, underscores, and ASCII characters in plain text in the standard English alphabet.
  • Cannot include "chr" or "chromosome". However, linkage group names should include "LG" as part of the name.
  • Are limited to 33 characters
  • Cannot include these words: unknown, Un, Unk, 0 (= zero as the full name). This restriction is because each unplaced sequence should be separate, not concatenated with others, and without an assignment.
Plasmid names
  • Can contain only digits, dots, underscores, and ASCII characters in plain text in the standard English alphabet.
  • Should start with lower case 'p' UNLESS the plasmid name is not known. In that case use 'unnamed', or "unnamed1" & "unnamed2" for distinct unnamed plasmids.
  • Cannot include the word 'plasmid'
  • Are limited to 20 characters

Submit the genomes to the Genome Submission Portal

All files must be submitted via the Submission Portal. Choose "Single" or "Batch/multiple" genomes. Answer the questions and upload the necessary files Review the summary page and click the "Submit" button. The submission will be given a 'SUB' temporary identifier which you can use in correspondence before an accession number is assigned to the genome submission.

What happens next

Once we receive your genome submission, a few automated validations are run and a member of our staff conducts an initial review. If no significant issues are found, the genome will be assigned an accession number.

If there are problems

The submitted files will be marked in the submission portal as "Error" and you will receive an email with details of the problems. Errors found in the automated validations are automatically reported back to the submission portal and an email is automatically sent to the submitter with instructions on how to proceed. In addition, a member of our staff conducts an initial review of each submission and reviews several additional validations. The problems, including those described in the Fix problems section, could be:

  • Any Error-level errors and some Warning-level errors from the validation. For .sqn file submissions you would see these in the .val file(s) generated by tbl2asn.
  • Any FATAL or problem categories from the discrepancy report (in the discrep file from tbl2asn for .sqn submissions)
  • Sequence contamination in the genome sequences
  • Bad format of the AGP files or inconsistencies between the AGP and fasta/.sqn files.
  • If the genome size is not within 0.5-1.5x the average size of the genomes of that species that are already in GenBank. This test uses the genomes that are already in GenBank, so genomes could pass this test just because there are not yet enough genomes of that species in GenBank.
  • If ANI analysis of the genome indicates that the genome is misidentified. This test uses the genomes that are already in GenBank, so genomes could pass this test just because there are not yet enough genomes of that species in GenBank.

Once you have made the fixes, log back into the Genomes (WGS & Complete) submission portal](https://submit.ncbi.nlm.nih.gov/subs/genome), retrieve that submission by its 'SUB' identifier and click the "FIX" button of that submission. You'll be back in the original submission and will need to delete the files that are marked as having errors, and then upload new files in their place.

Once your submission is assigned an accession number it undergoes a thorough review by our staff. This review is critical because we are striving to present genome annotation in an accurate and consistent manner so that database users can make maximum use of the data. If we encounter problems during this review, we will contact you by email.

Submission statuses in the submission portal
  • Queued : the submission is waiting for initial review
  • Error : one or more genomes has errors in its files, so needs to be resubmitted. Use the same file name when resubmitting batch submissions.
  • Processing, and no accession number : all of the genomes have passed the initial automated validations and are waiting for additional review.
  • Processing, and accession number : genome accessions have been assigned and the genomes will be processed by NCBI staff. Genomes will remain at this status until they are released. We will contact you during processing if the submission has issues that require additional information.
  • Processed : the genome has been publicly released.

If you elected to hold your genome until a particular date (or publication, whichever is first), we ask that you provide us with the expected publication date and also notify us in a timely manner of the upcoming publication and the relevant citation details. This will allow us to coordinate the release of your genome with the appearance of the paper. Please provide at least two weeks' notice of any upcoming publication.

NOTE: As of January 2017, genomes will be released on their release date without additional communication, as is the normal GenBank policy. Be sure to request an extension of the release date if the genome is not yet published and you wish to continue to keep it confidential.

Requesting PGAP annotation of prokaryotic genomes

Requests for annotation by the Prokaryotic Genomes Annotation Pipeline is a step during submission of the genome to GenBank. Prepare a regular GenBank genome submission and request PGAP annotation during the submission process by clicking on the box "Annotate this prokaryotic genome in the NCBI Prokaryotic Annotation Pipeline before being released". The annotated genome will be posted back to the Submission Portal for your review. You may edit the file and resubmit that to GenBank; however, this is not required and is generally not recommended, as it will slow processing and may introduce problems that you would need to fix.

Support Center

Last updated: 2018-02-22T14:46:40Z