Prokaryotic Annotation Guide

Introduction

This guide explains how to submit a bacterial genome to GenBank, and how to update an existing bacterial genome submission using either Sequin, an NCBI software tool for submitting and updating GenBank entries, or tbl2asn, an NCBI command-line program for generating submission files.

Both programs combine a simple five-column tab-delimited table of feature locations and qualifiers with the DNA sequence (in FASTA format) and the submitter information to generate a file for submission to GenBank.

The format of this feature table allows different kinds of features (e.g., gene, coding region, tRNA, repeat_region) and qualifiers (e.g., /product, /note) to be indicated. The validator will check for errors such as internal stops in coding regions.
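As a rough illustration of the five-column layout, the Python sketch below writes a tiny table containing one gene and one CDS. The layout it assumes (a ">Feature" header line naming the sequence, feature rows that fill columns 1-3, and qualifier rows that leave columns 1-3 empty and fill columns 4-5) follows the commonly documented table format; check the detailed annotation instructions before relying on it. The helper names, coordinates, locus_tag and dbname are made-up values for illustration only.

```python
def feature_rows(start, stop, key, qualifiers):
    """Return table lines for one single-interval feature.

    Columns 1-3 hold start, stop and feature key; each qualifier row
    leaves columns 1-3 empty and puts name and value in columns 4-5.
    """
    lines = [f"{start}\t{stop}\t{key}"]
    for name, value in qualifiers:
        lines.append(f"\t\t\t{name}\t{value}")
    return lines


def build_table(seq_id, features):
    """Assemble a feature table for the sequence identified by seq_id."""
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        lines.extend(feature_rows(start, stop, key, qualifiers))
    return "\n".join(lines) + "\n"


# Illustrative values only: one gene with a matching CDS.
table = build_table(
    "HTE831",
    [
        (1, 1350, "gene", [("locus_tag", "OBB_0001")]),
        (1, 1350, "CDS", [
            ("product", "hypothetical protein"),
            ("protein_id", "gnl|SmithUCSD|OBB_0001"),
        ]),
    ],
)
print(table, end="")
```

Writing the returned string to a file would give a table whose SeqID matches the identifier on the FASTA definition line.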

Guidelines for eukaryotic genome submissions.

If you do not understand any of the instructions presented here or you have questions, please contact us by email at genomes@ncbi.nlm.nih.gov prior to creating your submission. This will save us both a lot of time.

Table of Contents

  1. Register your Project
  2. Prepare FASTA-formatted sequence
  3. Annotation
  4. Create your submission
  5. Submitting
  6. What happens next
  7. Updating existing genome submissions
  8. Examples

Register your Project

Please register your genome project and proposed locus_tag prefix on the BioProject registration page prior to preparing your submission to GenBank. Each project that is registered here is assigned a BioProjectID, which will appear on all entries associated with a particular genome project.

FASTA-formatted sequence

Nucleotide sequences must be in FASTA format. FASTA format consists of a single definition line, beginning with a '>' and followed by optional text, and subsequent lines of sequence. At minimum, all definition lines must contain an identifier for the sequence, called the SeqID. Other information about the biological source of the organism can also be encoded on the definition line of the sequence and is used to build the record.

A sample definition line is

>HTE831 [organism=Oceanobacillus iheyensis] [strain=HTE831]

Common source modifiers, e.g., [strain=HTE831], may be incorporated into the definition line. Many of these modifiers can also be entered on the template. See the section Create your submission below. An example of a FASTA-formatted sequence is shown in Figure 1.
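To make the structure of a definition line concrete, here is a minimal Python sketch that splits one into its SeqID and its bracketed source modifiers. The function name parse_defline and the regular expression are assumptions made for this example; they are not part of Sequin or tbl2asn.

```python
import re

# Matches one bracketed modifier such as [strain=HTE831].
MODIFIER = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_defline(line):
    """Split a FASTA definition line into (SeqID, {modifier: value})."""
    if not line.startswith(">"):
        raise ValueError("definition line must begin with '>'")
    body = line[1:].strip()
    seq_id = body.split()[0] if body else ""
    if not seq_id or seq_id.startswith("["):
        raise ValueError("definition line must start with a SeqID")
    modifiers = {k.strip(): v.strip() for k, v in MODIFIER.findall(body)}
    return seq_id, modifiers


seq_id, mods = parse_defline(
    ">HTE831 [organism=Oceanobacillus iheyensis] [strain=HTE831]"
)
print(seq_id, mods)
```

A check like this can catch a missing SeqID before the file is passed to either submission program.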

Annotation

While annotation is optional for incomplete WGS submissions, complete genome submissions must be annotated. You can annotate the genome yourself, following the instructions on these pages. Alternatively, you can request that your genome submission be annotated by NCBI's Prokaryotic Genomes Annotation Pipeline, which is available for genomes being submitted to GenBank.

New in May 2013: We have a new version of the annotation pipeline called "Prokaryotic Genome Annotation Pipeline"; its first public version will be 2.0. It uses GeneMarkS+ to take protein alignment data as input and incorporate that information into the gene prediction process. Just as with the original PGAAP, this new version will be available to GenBank submitters by request when they submit a prokaryotic genome to GenBank. Genomes submitted with an annotation request after May 21, 2013 will be annotated by Prokaryotic Genome Annotation Pipeline 2.0.

More details are available from the web page, http://www.ncbi.nlm.nih.gov/genome/annotation_prok/

To have NCBI annotate your genome, generate the files for genome submission to GenBank and request annotation in the Private Comments box during the submission of that genome. (There is no longer a separate annotation process.)

If you annotate the genome yourself, a small set of features is required as a minimum, but many additional features can be included. It is our hope that the annotation present on any genome will evolve over time as more is known about the biology. In reviewing bacterial genome annotation, NCBI strives to ensure that the annotation is consistent throughout the submission and when compared to other genome submissions. We also strive to present information that is an accurate representation of the known biology. To do this we need your help. Please pay careful attention to the annotation instructions presented here and review all your annotation before submitting your genome. Many genomes are annotated by automatic prediction programs, and since these programs do make mistakes, it is up to all of us to ensure that the information being presented is as accurate as possible. A summary of the required annotation is presented below; please also refer to our detailed annotation instructions for our annotation expectations.

Required Annotation

  1. Genes
    • locus_tag
  2. Coding regions of known proteins
    • product (protein) names
    • protein_id
  3. structural RNAs (tRNAs and ribosomal RNAs)

Gene features

A gene is defined as a region of biological interest for which a name has been assigned. Gene features are always a single interval, and their location should cover the intervals of all the relevant features such as promoters and operator binding sites. Gene names must follow the standard bacterial nomenclature rules of three lower case letters. Different loci are distinguished by a suffix of uppercase letters. Please refer to detailed annotation instructions for more information on genes.
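The naming rule above can be checked mechanically. The following Python sketch assumes the rule is exactly "three lowercase letters, optionally followed by uppercase letters"; names in the literature may not always fit this pattern, so treat a failure as a prompt to review the name rather than as proof of an error. The function name is hypothetical.

```python
import re

# Three lowercase letters, then an optional suffix of uppercase letters.
GENE_NAME = re.compile(r"[a-z]{3}[A-Z]*")


def is_standard_gene_name(name):
    """Return True if name follows the three-letter-plus-suffix convention."""
    return GENE_NAME.fullmatch(name) is not None


for candidate in ["abcD", "abc", "AbcD", "ab"]:
    print(candidate, is_standard_gene_name(candidate))
```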

locus_tag

The locus_tag is a systematic gene identifier that is assigned to each gene. The locus_tag must be unique for every gene of a genome. All components of a genome project (i.e., all chromosomes and plasmids) share the same unique locus_tag prefix to ensure that a locus_tag is specific for a particular genome project, which is why we require that the locus_tag prefix be registered. In addition, a gene may have a biological name, as assigned in the scientific literature and described above. For example, OBB_0001 is a systematic gene identifier, while abcD is the biological gene name. We recommend having the BioProject registration process auto-assign a locus_tag prefix, as prefixes are not meant to convey meaning.

The locus_tag prefix must be 3-12 alphanumeric characters, and the first character may not be a digit. Locus_tag prefixes are also case-sensitive. The locus_tag prefix is followed by an underscore and then an alphanumeric identification number that is unique within the given genome. Other than the single underscore used to separate the prefix from the identification number, no special characters can be used in the locus_tag. Locus_tags must only be used in combination with a gene feature. Read more about locus_tags and their intended usage. Please refer to detailed annotation instructions for how to incorporate locus_tags into your annotation table.
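The format rules for locus_tags lend themselves to an automated check. The Python sketch below encodes them as regular expressions and reports malformed tags, tags with an unexpected prefix, and duplicates. The function names and message wording are assumptions made for this example, not an NCBI tool.

```python
import re

# Prefix: 3-12 alphanumeric characters, first character not a digit.
PREFIX_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{2,11}")
# Full tag: prefix, one underscore, alphanumeric identification number.
LOCUS_TAG_RE = re.compile(r"([A-Za-z][A-Za-z0-9]{2,11})_([A-Za-z0-9]+)")


def is_valid_prefix(prefix):
    """Return True if prefix satisfies the length and character rules."""
    return PREFIX_RE.fullmatch(prefix) is not None


def check_locus_tags(tags, prefix):
    """Return a list of human-readable problems found in tags."""
    problems = []
    seen = set()
    for tag in tags:
        match = LOCUS_TAG_RE.fullmatch(tag)
        if match is None:
            problems.append(f"{tag}: malformed")
        elif match.group(1) != prefix:  # comparison is case-sensitive
            problems.append(f"{tag}: prefix is not {prefix}")
        if tag in seen:
            problems.append(f"{tag}: duplicate")
        seen.add(tag)
    return problems


print(check_locus_tags(["OBB_0001", "obb_0002", "OBB_0001", "OBB-3"], "OBB"))
```

Running a check of this kind over every gene feature before building the submission helps catch duplicated or mistyped identifiers early.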

The use of locus_tag is supported in Sequin version 4.35 or newer. If you have an older version of Sequin, please download the current version.

CDS (coding region) features

The CDS feature is used to define a protein coding region. All CDS features must have a product qualifier (protein name). Use a concise name, not a description or phrase. Alternatively, protein names may be denoted by the same symbol as the corresponding gene, but the symbol begins with a capital letter. In cases where the protein is not known use "hypothetical protein" as the product name. We recommend the use of "hypothetical protein" as this will allow the locus_tag identifier to be appended to the product name in BLAST and Entrez summary lines. Our detailed annotation instructions contain instructions and examples on naming your proteins as well as including additional CDS qualifiers such as EC_numbers, protein functions, descriptive and similarity notes.

protein_id

The submitter must assign an identification number to all proteins. NCBI uses this number to track proteins when sequences are updated. This number is indicated in the table by the CDS qualifier protein_id, and must have the format gnl|dbname|string, where dbname is a version of your lab name that you think will be unique (e.g., SmithUCSD), and string is the unique protein SeqID assigned by the submitter.

The protein_id is used for internal tracking in our database, so it is important that the complete protein_id (dbname + string) not be duplicated by a genome center. Note that when WGS submissions are processed, the dbname in the protein_id is automatically changed to 'WGS:XXXX', where XXXX is the project's accession number prefix. Please see detailed annotation instructions.
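A short Python sketch of how protein_ids in the gnl|dbname|string format might be generated and checked for duplicates follows. The restriction that neither part may contain a '|' or whitespace is an assumption made for this example, chosen to keep the three-part format unambiguous; the function names are hypothetical.

```python
from collections import Counter


def make_protein_id(dbname, string):
    """Build a protein_id of the form gnl|dbname|string."""
    for part in (dbname, string):
        # Assumed restriction: no empty parts, pipes, or whitespace.
        if not part or "|" in part or any(c.isspace() for c in part):
            raise ValueError(f"unusable protein_id component: {part!r}")
    return f"gnl|{dbname}|{string}"


def find_duplicates(protein_ids):
    """Return the protein_ids that occur more than once, sorted."""
    counts = Counter(protein_ids)
    return sorted(pid for pid, n in counts.items() if n > 1)


ids = [make_protein_id("SmithUCSD", tag)
       for tag in ["OBB_0001", "OBB_0002", "OBB_0001"]]
print(find_duplicates(ids))
```

Reusing each gene's locus_tag as the string portion, as done here, is one simple way to keep the identifiers unique within a genome.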

structural RNAs

rRNA, tRNA, misc_RNA, and ncRNA are features used to annotate the various structural RNA genes. All RNA features must include a corresponding gene feature with a locus_tag qualifier. Only ribosomal RNAs (rRNA) and transfer RNAs (tRNA) are required.

Additional optional annotation

Many types of optional annotation can be included. Please refer to detailed annotation instructions for the types of features available, their proper usage, and examples.

Create your submission

The submission file can be generated using Sequin or tbl2asn. tbl2asn is a simple command-line program that automates parts of the submission process, and it is very useful for projects that have multiple sequences (i.e., multiple chromosomes or plasmids). It is packaged with the Sequin archive, but the newest version is available by anonymous FTP. The main difference between the two is that Sequin is a menu-driven program with a graphical interface, while tbl2asn is a command-line program. Most people find Sequin easier for finished genomes, while tbl2asn is useful for unfinished WGS submissions that have many contigs.

See the WGS instructions for specific information on generating a WGS submission.

For both programs the sequence must be in a file or files in FASTA format, and the annotation must be in a file or files in the five column tab-delimited feature table format, as described above.

Sequin

If you choose to use Sequin to make your submission file, then follow the directions on the Sequin page. Check the detailed annotation instructions to ensure that you have included the annotation correctly. Be sure to validate and fix the errors. Run the Discrepancy Report if your submission has annotation, and fix any problematic annotation detected. Contact genomes@ncbi.nlm.nih.gov with questions about any errors or discrepancy report output.

tbl2asn

If you choose to use tbl2asn, then the basic instructions follow, but more detail is available on the tbl2asn page.

tbl2asn reads a template along with the sequence and table files, and outputs ASN.1 for submission to GenBank. tbl2asn requires that the sequence and annotation file have specific name conventions. The FASTA-formatted sequence file has ".fsa" as an extension, and the five column tab-delimited table file has ".tbl" as an extension. The base name of the .tbl file must be identical to that of the .fsa file for tbl2asn to recognize it and to include the annotation in the output ".sqn" file that it generates.

The template file is created on the submission template page.

The basic tbl2asn command is:

tbl2asn -t template_file -p path_to_files -M n -Z discrep -j "[gcode=11]"
  -t specifies the template file (including the path) [required]
  -p specifies the path for the table and sequence files ('-p .' is the current directory) [required]
  -j specifies the correct genetic code for translation of bacterial proteins [required]
  -M n performs some clean-ups and runs validation
  -Z discrep outputs the discrepancy report to a file named 'discrep'

Additional command line arguments can be seen on the tbl2asn page.
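If tbl2asn is driven from a script, the command above can be assembled as an argument list so that the bracketed -j value needs no shell quoting. This Python sketch uses only the options listed above; the template file name template.sbt is a placeholder, and the function name is hypothetical.

```python
import shlex


def tbl2asn_command(template, path, gcode=11):
    """Return the tbl2asn argument list for the basic command."""
    return [
        "tbl2asn",
        "-t", template,            # template file, including its path
        "-p", path,                # directory holding the .fsa/.tbl files
        "-M", "n",                 # clean-ups and validation
        "-Z", "discrep",           # discrepancy report output file
        "-j", f"[gcode={gcode}]",  # genetic code used for translation
    ]


cmd = tbl2asn_command("template.sbt", ".")
# Show a shell-safe rendering of the command for logging or review.
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) on a machine where tbl2asn is installed and on the PATH.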

In the directory specified by '-p', the program looks for corresponding pairs of *.fsa and *.tbl files, and builds ASN.1 records named *.sqn for these pairs. The results of the validation (error checking) will be in *.val files. Note that if there are no .tbl files in the directory, then tbl2asn will still generate .sqn files from the .fsa files that are present.
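Because a .tbl file is only picked up when its base name matches a .fsa file, a quick pairing check before running tbl2asn can catch naming slips. The Python sketch below classifies a list of file names; the function name and the category labels are assumptions made for this example.

```python
from pathlib import PurePath


def pair_inputs(filenames):
    """Group .fsa and .tbl files by whether their base names match."""
    fsa = {PurePath(n).stem for n in filenames if PurePath(n).suffix == ".fsa"}
    tbl = {PurePath(n).stem for n in filenames if PurePath(n).suffix == ".tbl"}
    return {
        "annotated": sorted(fsa & tbl),      # sequence with a matching table
        "sequence_only": sorted(fsa - tbl),  # would be built without annotation
        "orphan_tables": sorted(tbl - fsa),  # table with no matching sequence
    }


# File names are illustrative; in practice pass os.listdir(path).
report = pair_inputs(["chr1.fsa", "chr1.tbl", "plasmidA.fsa", "plasmid_A.tbl"])
print(report)
```

An entry under orphan_tables usually means a table was named differently from its sequence file and its annotation would be left out of the output.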

Check the errorsummary.val file for the number, severity and type of errors that are present in the .val files. All Errors and Rejects need to be fixed. The presence of errors will slow processing. Contact genomes@ncbi.nlm.nih.gov with any questions about the validation output.

Check the file named 'discrep' for the results of the discrepancy report. Categories prefaced with FATAL are always unacceptable and must be fixed. Some of the categories are informational. See the discrepancy report examples and explanations for guidance. Write to genomes@ncbi.nlm.nih.gov and send the discrep file with questions about this report.
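A small filter can pull out the categories that must be fixed. The Python sketch below assumes that such lines literally begin with the text FATAL; the exact layout of the discrep file is not specified here, and the sample report text and category names are invented for illustration.

```python
def fatal_categories(report_text):
    """Return the report lines whose text begins with 'FATAL'."""
    return [line.strip() for line in report_text.splitlines()
            if line.lstrip().startswith("FATAL")]


# Made-up report text for illustration; real category names will differ.
sample = """\
FATAL: EXAMPLE_CATEGORY_A: 2 features have a placeholder problem
EXAMPLE_CATEGORY_B: 5 features are informational only
FATAL: EXAMPLE_CATEGORY_C: 1 feature has a placeholder problem
"""
for line in fatal_categories(sample):
    print(line)
```

In practice the text would be read from the 'discrep' file, and an empty result would indicate that no FATAL categories remain.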

Make any necessary fixes to the input .fsa and/or .tbl files and run tbl2asn again. Alternatively, make the necessary fixes directly to the .sqn file by opening it in Sequin and editing the features there. Additional tips on using Sequin are found in the Sequin Guide.

In addition, NCBI offers a Genome Submission Check Tool to check your submission file before sending it to us. This tool performs another validation check on your submission files. It takes neighboring pairs of proteins and does a BLASTP analysis on them and notes which neighbors hit the same longer protein. Finding pairs of proteins that hit the same longer protein suggests that the pair may represent a single gene that has gained a frameshift or other mutation. We ask that you do your own analysis to decide whether the pair should remain two proteins, or be combined into a single pseudogene. This validation check also looks for and reports tRNAs and rRNAs that may have escaped your detection or were annotated on the wrong strand.

Once the errors have been fixed, the .sqn files can be submitted to GenBank. If either the Discrepancy Report or the Genome Submission Check Tool reports errors that you feel are not problems, please send the list of these errors along with an explanation of why they are acceptable.

Submitting

Genomes are generally submitted to us by FTP or with our Genomes Submission Tool. The Genomes Submission Tool lets you upload genome submission files to us directly without using your email system, which can sometimes lead to delivery or file corruption problems. If you are going to be submitting frequently, we can establish an FTP account for you here at NCBI; in this case, please send us an email requesting an FTP account and describing your project. Regardless of the method you use, we ask that you send us an email at genomes@ncbi.nlm.nih.gov whenever you submit a new genome, and include the registered BioProjectID and organism name in the message.

What happens next

Once we receive your genome submission, a member of our staff will conduct an initial review of it and will contact you by email. If we do not find any significant issues with your submission, you will be issued an accession number. Once your submission is assigned an accession number it undergoes a thorough review by our staff. This review is critical because we are striving to present genome annotation in an accurate and consistent manner so that database users can make maximum use of the data. If we encounter problems during this review, we will contact you by email.

Once we have completed our review of your submission, we will prepare it for release to the public database. You can choose to have your submission released immediately or kept confidential until a certain date or publication of the work, whichever comes first. If you wish your genome to be held until publication, we ask that you provide us with the expected publication date and also notify us in a timely manner of the upcoming publication and the relevant citation details. This will allow us to coordinate the release of your genome with the appearance of the paper. Please provide at least two weeks' notice of any upcoming publication.

Updating a genome

When a complete genome or chromosome is updated, the original protein annotation must be tracked to the update. To do this, proteins from the original submission that are present in the update must have the same identifiers that were used in the original submission, plus the accession numbers that were assigned when the submission was loaded into GenBank. These identifiers are included in the protein_id of the update in this format:

gnl|dbname|SeqID|gb|accession_number

where the dbname and SeqID are the values used in the original submission, and the accession number was assigned by GenBank.

When your genome is released, we will supply you with a table that has each protein SeqID and protein accession number, so that you can use those in future updates. If you did not receive this table and need to update your genome, contact us at genomes@ncbi.nlm.nih.gov prior to the preparation of your submission.
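Given a table of protein SeqIDs and accession numbers, the update identifiers can be generated mechanically. The Python sketch below appends |gb|accession_number to each original protein_id that has an entry in the table. Leaving proteins without an assigned accession in their original gnl|dbname|SeqID form is an assumption made for this example, and the accession values shown are placeholders, not real GenBank accessions.

```python
def update_protein_id(original_id, accession):
    """Return gnl|dbname|SeqID|gb|accession for an original protein_id."""
    parts = original_id.split("|")
    if len(parts) != 3 or parts[0] != "gnl" or not all(parts[1:]):
        raise ValueError(f"not in gnl|dbname|SeqID form: {original_id!r}")
    return f"{original_id}|gb|{accession}"


def apply_accessions(original_ids, seqid_to_accession):
    """Map each original protein_id to the identifier to use in the update."""
    updated = {}
    for pid in original_ids:
        seq_id = pid.split("|")[2]
        accession = seqid_to_accession.get(seq_id)
        # Assumption: proteins with no accession yet keep their original id.
        updated[pid] = update_protein_id(pid, accession) if accession else pid
    return updated


# Placeholder accession values for illustration only.
table = {"OBB_0001": "XXX00001", "OBB_0002": "XXX00002"}
ids = ["gnl|SmithUCSD|OBB_0001", "gnl|SmithUCSD|OBB_0002", "gnl|SmithUCSD|OBB_0003"]
for old, new in apply_accessions(ids, table).items():
    print(old, "->", new)
```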

See the WGS page for information about updating a WGS project.


Last updated: 2013-09-14T17:31:23-04:00