WGS Frequently Asked Questions

1. Can I submit annotation as a GenBank flatfile?

In general, we cannot accept annotation as a GenBank, EMBL or DDBJ flat file. To submit annotation, follow the instructions given for Prokaryotic annotation or Eukaryotic annotation. However, you can use the RAST conversion scripts to make the correct file for submission from a .gb file, although there may still be problems that need to be fixed to create a GenBank submission.

2. I want all of the WGS contigs in my assembly available to users. Should I put singleton WGS contigs into the AGP?

The AGP file defines the assembly, so typically we do want all of the WGS contigs in the AGP file. However, contigs that are not considered to be part of the assembly, perhaps because they are degenerate or duplicates, should not be included in the AGP file. In addition, remove from the submission any sequences that are <200 bp and are not part of multi-component scaffolds.
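The length rule above can be checked before submission. The following is a minimal sketch, assuming you already have a mapping of sequence names to lengths and a set of the sequences that belong to multi-component scaffolds; the function and variable names are hypothetical and not part of any NCBI tool.

```python
MIN_LENGTH = 200  # sequences shorter than this are candidates for removal

def sequences_to_remove(lengths, multi_component_members):
    """Return names of sequences that are <200 bp and not in a multi-component scaffold."""
    return sorted(
        name
        for name, length in lengths.items()
        if length < MIN_LENGTH and name not in multi_component_members
    )

# Hypothetical example data
lengths = {"contig1": 5000, "contig2": 150, "contig3": 180}
in_scaffolds = {"contig3"}  # contig3 is joined with other contigs in a scaffold
print(sequences_to_remove(lengths, in_scaffolds))
```

Here only contig2 is flagged: contig3 is also shorter than 200 bp, but it is kept because it is part of a multi-component scaffold.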

3. Can I submit an assembly and have it held back until I publish my paper?

Yes, you may submit your assembly and have it held until publication. You will select a release date, and your genome will be released on that day or when it is published, whichever is first. If needed, you can write to genomes@ncbi.nlm.nih.gov to request an extension of the release date.

4. I'm using second generation sequencing technology. Can I still submit an assembly?

Yes, you may submit assemblies using second- or third-generation sequencing technology. The process is analogous to using Sanger technology. The primary reads should be submitted to the Sequence Read Archive (SRA). The reads should be assembled into contigs and submitted as described in the submission instructions. These WGS contigs can be used to assemble higher order molecules and submitted either as gapped scaffold sequences or as contigs plus an AGP file, as described in the submission instructions.

5. Do I need to split the sequences at the Ns that were inserted by the assembler, e.g., Velvet or ABySS?

No, you no longer need to split properly assembled sequences. However, sequences concatenated in unknown order are not allowed. During the submission process you will be asked to indicate what any Ns in the genome sequence represent. The default answers are that 10 or more Ns in a row represent a gap and that "paired-ends" is the evidence that the sequences on either side of each gap are linked. If those answers are not correct, then you can provide the correct answers in the submission form. During processing those runs of Ns will be converted to assembly_gap features. Note that NCBI's Assembly resource counts runs of 10 or more Ns as a gap, regardless of whether they have been converted to a gap during processing of the genome.

The original submission format, of splitting the sequences at the runs of Ns into contigs and rebuilding the scaffolds with an AGP file, remains a submission option.
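To see which stretches of a sequence would be treated as gaps under the default answer (10 or more Ns in a row), you can scan for them before submitting. This is an illustrative sketch only; the function name is hypothetical and the actual conversion to assembly_gap features is done by NCBI during processing.

```python
import re

def find_gap_runs(sequence, min_run=10):
    """Return 1-based inclusive (start, end) positions of runs of at least min_run Ns."""
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]

# A run of 12 Ns is reported; the run of 3 Ns is below the threshold and ignored.
seq = "ACGT" + "N" * 12 + "ACGTNNNACGT"
print(find_gap_runs(seq))
```

Reviewing this list helps confirm that every reported run really is a gap with the linkage evidence you intend to declare in the submission form.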

6. What should I use for the gap sizes?

If you have estimates of the gap sizes, then use those values for the gaps. We prefer that you use 10 as the minimum gap size, to be more of a signal to database users. If you do not have an estimate of the gap size, then the preference is to use 100 as the value and the 'U' in column five of the AGP file, indicating that the gap size is unknown. For a Gapped submission, use 100 Ns and -a r#u (with the appropriate number in place of #; case 2 or 3 in the Gapped submission examples).
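For the AGP route, the gap-size convention above can be sketched as follows. The nine tab-separated columns follow the AGP gap-line layout as we understand it; the gap type "scaffold", linkage "yes", and the default "paired-ends" evidence are assumptions chosen for illustration, and the function name is hypothetical. Check the AGP specification for the values that apply to your assembly.

```python
def agp_gap_line(obj, start, part_number, length=None, evidence="paired-ends"):
    """Build one AGP gap line.

    length=None means the gap size is unknown: use 100 and 'U' in column five.
    Otherwise use the estimated length and 'N' in column five.
    """
    if length is None:
        component_type, length = "U", 100
    else:
        component_type = "N"
    end = start + length - 1
    fields = [obj, start, end, part_number, component_type,
              length, "scaffold", "yes", evidence]
    return "\t".join(str(field) for field in fields)

print(agp_gap_line("scaffold1", 5001, 2))              # unknown size
print(agp_gap_line("scaffold1", 5001, 2, length=350))  # estimated size
```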

If there is no annotation, then you can submit the fasta file and answer the questions about the Ns in the sequence. The default answers are that 10 or more Ns in a row represent a gap and that "paired-ends" is the evidence that the sequences on either side of each gap are linked. If those answers are not correct, then you can provide the correct answers in the submission form. During processing those runs of Ns will be converted to assembly_gap features. Note that NCBI's Assembly resource counts runs of 10 or more Ns as a gap, regardless of whether they have been converted to a gap during processing of the genome.

7. I concatenated the sequences into the correct order with the Ns between each sequence and annotated the pseudomolecule. Can I submit this annotated pseudomolecule?

Yes, now that gapped submissions are allowed; however, you will need to include the correct gap and linkage evidence for each run of Ns that represents a gap. If all the gaps have the same linkage evidence, then you can make the appropriate submissions simply with tbl2asn, as described.

8. I concatenated the sequences in a random order with Ns between each sequence and annotated this pseudomolecule. Can I submit the annotated pseudomolecule?

Since the annotated sequence does not correspond to a biological molecule, you need to split the pseudomolecule into the contig sequences and submit those as the pieces of a WGS project. You will need to map the annotation down to the contig level, but can use the offset in the .tbl file to avoid recalculating if desired, as shown here.
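If you prefer to recalculate the coordinates rather than use the .tbl offset, the arithmetic is straightforward. The sketch below does not use the offset mechanism; it only shows how a pseudomolecule position maps to a contig position, assuming you know where each contig starts and ends on the pseudomolecule. All names and coordinates are hypothetical.

```python
def to_contig_coords(position, contig_spans):
    """Map a 1-based pseudomolecule position to (contig_name, 1-based contig position).

    contig_spans is a list of (name, start, end) tuples giving each contig's
    1-based inclusive placement on the pseudomolecule.
    """
    for name, start, end in contig_spans:
        if start <= position <= end:
            return name, position - start + 1
    raise ValueError(f"position {position} falls in the Ns between contigs")

# Two contigs separated by 100 Ns (positions 1001-1100)
spans = [("contigA", 1, 1000), ("contigB", 1101, 3000)]
print(to_contig_coords(1250, spans))
```

A feature whose endpoints land on different contigs cannot be mapped this way and would need to be split or made partial.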

9. Can I annotate across gaps?

Protein translations are allowed to cross gaps of estimated size, but not those of unknown sizes. That is, introns can be in gaps of unknown size, but not exons. However, annotation across gaps is discouraged unless there is evidence that the translation on the other side of the gap is in the correct frame. In addition, if >50% of the translation is Xs (i.e. in the gap) then the CDS should be made partial at the gap, or split into two partial CDSs, as described for genes split across two contigs, depending upon the confidence of the translation on both sides of the gap.
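The >50% rule can be checked directly on a protein translation. This is a minimal sketch with a hypothetical function name; it only flags candidates, and deciding whether to make the CDS partial or split it still depends on your confidence in the translation on each side of the gap.

```python
def mostly_gap(translation, threshold=0.5):
    """Return True if more than `threshold` of the translation is X (i.e. in a gap)."""
    if not translation:
        return False
    x_fraction = translation.upper().count("X") / len(translation)
    return x_fraction > threshold

print(mostly_gap("MKXXXXXL"))  # 5 of 8 residues are X
print(mostly_gap("MKTLXXAV"))  # 2 of 8 residues are X
```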

10. Do I need to submit my genome assembly with annotation?

No, you can submit the genome without any annotation. However, you may request that a prokaryotic genome assembly be annotated by NCBI's Prokaryotic Genome Annotation Pipeline before its release into GenBank.

11. Does NCBI have an annotation pipeline that can be used to annotate my assembly?

You can request that NCBI annotate complete or incomplete prokaryotic genomes using our Prokaryotic Genome Annotation Pipeline during the submission process. The NCBI Eukaryotic Genome Annotation pipeline is not available as a GenBank submitter resource.

12. If I do have my own annotation, in what format should I provide this data?

Annotation must be in the 5-column feature table described in tbl2asn and the Eukaryotic and Prokaryotic annotation instructions. The 5-column feature table is saved as a file with the suffix .tbl, and that file is used in conjunction with the template, fasta, and optional quality score files to create the annotated genome file for submission to GenBank, as described on the tbl2asn page. The .sqn file(s) that is the output of running tbl2asn and the .tbl file (for eukaryotes) are submitted to GenBank.

However, the RAST conversion scripts are able to convert some flatfile formats into a GenBank submission.
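For orientation, the 5-column feature table mentioned above has a header line per sequence, feature lines with start, stop, and feature key, and qualifier lines indented by three tabs. The sketch below writes a small example; the sequence ID, coordinates, and locus_tag are made up for illustration, and the helper function is hypothetical rather than part of tbl2asn.

```python
def feature_table(seq_id, features):
    """Render features as 5-column feature table text.

    features is a list of (start, stop, key, qualifiers), where qualifiers
    is a list of (name, value) pairs.
    """
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        lines.append(f"{start}\t{stop}\t{key}")
        for name, value in qualifiers:
            # Qualifier lines leave the first three columns empty
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"

table = feature_table("contig1", [
    (1, 1050, "gene", [("locus_tag", "AAA_0001")]),
    (1, 1050, "CDS", [("product", "hypothetical protein")]),
])
print(table)
```

The resulting text would be saved with the .tbl suffix alongside the fasta file of the same base name.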

13. My genome assembly has contigs and scaffolds. Should I submit the annotation on the contigs or the scaffolds?

Eukaryotic genomes, which usually have thousands of contigs and hundreds or thousands of scaffolds, should be annotated at the scaffold level. Small genomes, e.g., prokaryotic, can be annotated at either level. However, processing of those small genomes will be quicker if the annotation is on the contigs.

14. Do I have to register a separate BioProject for each genome I am sequencing?

If multiple cultured genomes are part of the same research effort, then they can belong to the same BioProject. However, each culture must be registered as a separate BioSample.

15. How do I submit a prokaryotic or eukaryotic genome assembled from metagenomic reads (a MAG)?

Description: You have isolated DNA from an environmental or mixed sample and then assembled the sequences to create assemblies of individual prokaryotic or eukaryotic genomes. To the best of your ability, each assembly represents the genome from a single prokaryotic or eukaryotic organism reconstructed from the metagenomic mix. All of the available DNA is included (e.g., you have not intentionally removed noncoding regions or included only the sequences for a single gene).

Note that you should only use sequences that you have determined yourself. Do not include sequences you have only downloaded from a public repository. The raw reads should be submitted to the Sequence Read Archive (SRA) and the contigs made from overlapping reads can be submitted as the pieces of one or more WGS projects.

(1) You will need to register a BioProject for this research effort. You will use this one BioProject for all of the submissions associated with this study.

(2) Once you have the BioProject ID, PRJNAxxxxx, from step (1), register the physical metagenomic sample in the BioSample database at https://submit.ncbi.nlm.nih.gov/subs/biosample/. Select either the "Metagenome or environmental sample" or "Genome, metagenome or marker sequences (MIxS compliant)" package and provide the organism name "xxxx metagenome". Choose one of the metagenome names that is already present in the NCBI Taxonomy database, if at all possible. You will need the SAMNxxxxxxxx ID assigned to the physical BioSample when you register the organism BioSamples in step (4).

(3) Send us a list of the organism names you plan to use for the organism metagenomic assemblies so that our taxonomists can review them. The organism names should be taxonomically meaningful, at the lowest rank that is reliable (division, phylum, class, order, family, genus), and include a unique identifier (i.e., isolate ID). NCBI does not use unpublished ad hoc taxonomic names from other databases such as SILVA or GTDB. Please use the appropriate NCBI taxonomic nomenclature when using names from the GTDB web site. Here are some examples:

  • bacterium <identifier> [division]
  • Proteobacteria bacterium <identifier> [phylum]
  • Alphaproteobacteria bacterium <identifier> [class]
  • Caulobacterales bacterium <identifier> [order]
  • Caulobacteraceae bacterium <identifier> [family]
  • Caulobacter sp. <identifier> [genus]

We will return the names that are entered in NCBI's Taxonomy database for you to use in step (4).

(4) Use the names that we return to you in step (3) to create organism-specific BioSamples. When you create the organism-specific BioSamples:

  • choose the “MIMAG Metagenome-assembled Genome” package
  • include the BioProject ID PRJNAxxxxx you created in step (1)
  • include all of the source attributes that are in the physical metagenomic sample (e.g., geo_loc_name, collection-date, lat-lon, isolation-source, etc.)
  • include a unique isolate name
  • include sample_type=metagenomic assembly
  • add a custom attribute with:
    • column header=derived-from
    • column value=This BioSample is a metagenomic assembly obtained from the xxxx metagenome BioSample: SAMNxxxxxxxxx.

This last step is to provide a text link between the organism BioSample and its physical metagenomic BioSample (or BioSamples if you pooled more than one).

If you have several organism BioSamples, you can use a table to upload all of the BioSample information. From the BioSample registration page select "Download batch template". Choose the “MIMAG Metagenome-assembled Genome” package and select "download". Fill in this template and then upload it using the "Batch/Multiple BioSamples" option when you create a new BioSample submission. Alternatively, you can provide this information in the embedded table within the BioSample submission form.
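When there are many organism BioSamples, the rows of the batch table can be generated rather than typed by hand. The sketch below is illustrative only: the column names other than derived-from are assumptions and must be matched to the headers in the template you download, and the organism name, isolate ID, metagenome name, and location are placeholder example values. The SAMNxxxxxxxx placeholder stands for the ID of your physical metagenomic BioSample.

```python
import csv
import io

def derived_from_text(metagenome_name, physical_biosample):
    """Build the derived-from value linking an organism BioSample to its source sample."""
    return (
        "This BioSample is a metagenomic assembly obtained from the "
        f"{metagenome_name} BioSample: {physical_biosample}."
    )

def biosample_rows(organisms, metagenome_name, physical_biosample, shared_attributes):
    """Create one row per (organism, isolate), copying the shared source attributes."""
    rows = []
    for organism, isolate in organisms:
        row = {"organism": organism, "isolate": isolate,
               "sample_type": "metagenomic assembly"}
        row.update(shared_attributes)
        row["derived-from"] = derived_from_text(metagenome_name, physical_biosample)
        rows.append(row)
    return rows

def to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = biosample_rows(
    [("Caulobacter sp. bin_01", "bin_01"), ("Caulobacterales bacterium bin_02", "bin_02")],
    "marine metagenome",
    "SAMNxxxxxxxx",
    {"geo_loc_name": "USA"},
)
print(to_tsv(rows))
```

Copying the shared attributes from one dictionary keeps every organism BioSample consistent with the physical metagenomic sample it came from.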

(5) Once you have created the BioProject and the BioSamples, you are ready to submit the data using the genome submission portal. Submit each genome assembly as a separate row in a batch submission using the BioProject ID PRJNAxxxxx from step (1) and the BioSample ID SAMNxxxxxxxx for the individual organism from step (4).

(6) Are you planning to submit annotation? Annotation is not required. However, you may be interested to know that NCBI has a publicly available Prokaryotic Genome Annotation Pipeline (PGAP). You can find more information here: https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/#pgap. You will be given an option to request PGAP annotation during submission of the genome in the Submission Portal.

The Prokaryotic Genome Annotation Pipeline cannot be used for sequences that are only identified as metagenomic. However, it can be used if you have assembled draft genomes from a metagenomic source and have enough evidence to confidently assign organism names (i.e., for MAGs). We do not have a publicly available eukaryotic annotation pipeline.

16. Can I submit RAST annotation?

We are currently working on a prototype that will convert flatfile formats created by outside programs into a 5-column feature table. Part of the problem is that GenBank-type files from other sources often contain qualifiers that are not recognized by GenBank, so they can't be converted. Conversely, features or qualifiers that are required by GenBank may be missing. In addition, there may be errors such as internal Ns representing gaps, invalid translations or unacceptable protein names that need to be addressed.

We are working to make a simpler conversion system, but for now to convert the flatfile (.gb) file from RAST to a .sqn file for GenBank submission, get the scripts from the scripts directory on the NCBI ftp site: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/scripts/

  • gbf2tbl.pl
  • rast2sqn.sh
  • rastbatch.sh
  • tblfix.pl

In addition, provide the following:

usage:

  • ./rast2sqn.sh template flatfile locus_tag_prefix protein_id_prefix

for example:

input:

  • flatfile = TEST.gb
  • template file = template.sbt
  • locus_tag prefix = AAA
  • protein_id_prefix = xx

commandline:

  • ./rast2sqn.sh template.sbt TEST.gb AAA xx

output:

  • TEST.sqn
  • TEST.fsa
  • TEST.tbl
  • TEST.val = validation
  • errorsummary.val = summary of validation
  • TEST.dsc = discrepancy report
  • TEST.err = qualifiers that couldn't be converted
  • TEST.ecn = EC_numbers that are not found at ftp://ftp.expasy.org/databases/enzyme/enzyme.dat
  • TEST.fixedproducts = product names found by the discrepancy report typo, hypothetical protein, and American spelling categories that are automatically corrected

You will need to review the validation and discrepancy reports and make any necessary corrections to the .sqn file. Some of the product names will probably need to be improved. See the Prokaryotic Genome Guidelines for more information about NCBI protein naming conventions. Submit the .sqn file, as described.

Last updated: 2021-01-28T18:11:19Z