Import

Use Table Reader

WORK IN PROGRESS

Import 5 Column Feature Table

Import 5 Column Feature Table opens a dialog that allows you to import a five-column, tab-delimited feature table to add features (for example, coding regions, genes, structural RNAs, and regulatory features) to a sequence record. Instructions for constructing a five-column feature table are at the end of this document.

5 Column Dialog

In the dialog, the file format is preset. In the Filenames box, the name of the file to import can be entered as text, or the browse function can be used to find a specific file. Note: The browse feature defaults to looking for files with the extension .tbl, but that can be changed so that files with other extensions can be used. Recently Used Files shows a list of files that were previously used. After the correct file has been chosen, click Finish.

When attempting to import a feature table with any qualifiers that do not belong on the features, a window will pop up reporting the problem and what will happen to the text in the qualifiers that are incorrect. For example, the window below reports converting the qualifier from /gene to /note.

Converting Qualifier

Choose Close to close this window and move on to the next step.

If the Feature Table ID in the first line of the feature table file matches the SequenceID (SeqID) in the corresponding sequence (FASTA) file, the feature annotation will be added without any other issues.

However, if the table’s ID does not match the SeqID of the sequence file and there is more than one nucleotide sequence in the submission, a matching dialog will be opened for manual pairing of feature tables with sequences.

Match Feature Table Ids

Choose one feature table ID and one sequence ID and then click ‘Map Selected’. The pair will appear in the ‘Selected matches’ window and the table will be removed from the list of tables:

Selected Matches

If necessary, more tables can be matched to more sequences. However, this process must always occur one pair at a time. The nucleotide sequence is not removed from the Sequence ID list on the right panel.

When all are matched, choose OK. This will add the annotation to the records based on the pairings indicated. Next, check sequences to verify that the annotation is present and run Reports -> Validator reports and Reports -> Submitter Report to check for errors and the feature counts to verify that all features were added as expected.

If errors need to be fixed, they can be fixed with the editing functions in the program. If large-scale changes are needed, the Reports -> Feature Table function can be opened and the existing imported feature table can be exported from that menu.

Construction of a 5-column feature table

What is a Feature Table?

A five-column, tab-delimited feature table contains features, their nucleotide locations, and their qualifiers in a specific format that can be read by Genome Workbench to add features to a sequence.

Valid features and qualifiers are restricted to those approved by the International Nucleotide Sequence Database Collaboration.

The feature table format specifies the type, location, and additional descriptive information of each feature, allowing Genome Workbench to process and add the features based on the sequence to which they apply. It also allows Genome Workbench to translate CDS features into proteins, which are shown on the sequence record.

Structure of the Feature Table

The first line of a feature table should have this format: >Feature SequenceID (‘greater than’ symbol – the word ‘Feature’ – space – SequenceID used in the corresponding FASTA file to identify the sequence)

The SequenceID should be identical to that used to identify the sequence in the FASTA file. Subsequent lines of the table list nucleotide locations, features, and qualifiers. Each feature is on a separate line. Qualifiers describing that feature are on the line below. Columns are separated by tabs.

Line 1:
Column 1: Start nucleotide location of feature 
Column 2: Stop nucleotide location of feature 
Column 3: Feature name 
Line 2: 
Column 4: Qualifier name 
Column 5: Qualifier value
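The column layout above can be sketched in a few lines of Python. This is only an illustration of the layout, not part of Genome Workbench; the function names format_feature and format_table are hypothetical.

```python
def format_feature(kind, intervals, qualifiers=()):
    """Return the tab-delimited lines for one feature.

    kind       -- feature name for column 3 (for example "CDS")
    intervals  -- list of (start, stop) pairs; values may carry "<" or ">"
    qualifiers -- iterable of (name, value) pairs for columns 4 and 5
    """
    lines = []
    for i, (start, stop) in enumerate(intervals):
        cols = [str(start), str(stop)]
        if i == 0:
            cols.append(kind)  # the feature name appears only on the first interval
        lines.append("\t".join(cols))
    for name, value in qualifiers:
        # Qualifier lines leave columns 1-3 empty.
        lines.append("\t".join(["", "", "", name, value]))
    return lines


def format_table(seq_id, features):
    """Build a complete table for one sequence from (kind, intervals, qualifiers) tuples."""
    out = [f">Feature {seq_id}"]
    for kind, intervals, qualifiers in features:
        out.extend(format_feature(kind, intervals, qualifiers))
    return "\n".join(out) + "\n"


table = format_table("SEQ1", [
    ("gene", [(1, 750)], [("gene", "abc1")]),
    ("CDS", [(1, 750)], [("product", "ABC1")]),
])
print(table)
```

Writing the table programmatically avoids the most common formatting problem, which is spaces being used where tabs are required.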

Example of a 5 column Feature table

>Feature SEQ1
1   750 gene
            gene    abc1
1   750 CDS
            product ABC1
925 795 rRNA
            product 16S ribosomal RNA
930 955 tRNA
            product tRNA-Phe
1005    >1480   gene
            gene    def2
1005    1250    CDS
1370    >1480
            product DEF2
            note        similar to yeast defensin

This example illustrates several characteristics about the table’s format.

  • Features that are on the complementary strand, such as the 16S rRNA, are indicated by reversing the interval locations.
  • Locations of partial (incomplete) features are indicated with a ">" or "<" next to the number. In this example, the def2/DEF2 gene and CDS end downstream of the end of the 1480 nucleotide sequence. The ">" symbol indicates that they are 3' partial features.
  • If a feature contains multiple intervals, like the DEF2 CDS, each interval is listed on a separate line by its start and stop position before subsequent qualifier lines.
  • Gene features are always a single interval, and their location should cover the intervals of all the relevant features. For example, the gene def2 is a single interval even though its corresponding CDS has two intervals.
  • If the gene feature spans the intervals of the CDS or mRNA features for that gene, there is no need to include gene qualifiers on those features in the table, because they will be read from the overlapping gene feature.
  • A /note qualifier can be added to any feature using the qualifier note in the table. A note has been added to the DEF2 CDS.
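The conventions listed above (reversed coordinates for the complementary strand, "<" and ">" for partial features, and extra interval lines before the qualifiers) can be checked with a small reader. The following is a minimal sketch that assumes a single-sequence, tab-delimited table; the Feature class and parse_feature_table function are hypothetical names and do not reflect how Genome Workbench reads the file internally.

```python
from dataclasses import dataclass, field


def _pos(coord):
    """Strip a partial marker ("<" or ">") and return the integer position."""
    return int(coord.lstrip("<>"))


@dataclass
class Feature:
    kind: str
    intervals: list = field(default_factory=list)   # (start, stop) strings as written
    qualifiers: list = field(default_factory=list)  # (name, value) pairs

    @property
    def strand(self):
        # A start greater than the stop marks the complementary strand.
        start, stop = self.intervals[0]
        return "-" if _pos(start) > _pos(stop) else "+"

    @property
    def partial(self):
        # "<" or ">" on any coordinate marks a partial feature.
        return any(c.startswith(("<", ">")) for iv in self.intervals for c in iv)


def parse_feature_table(text):
    """Return (seq_id, features) for a table describing one sequence."""
    seq_id, features = None, []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">Feature"):
            seq_id = line.split()[1]
            continue
        cols = line.split("\t")
        cols += [""] * (5 - len(cols))        # pad short lines to five columns
        start, stop, kind, qual, value = cols[:5]
        if start and kind:                    # first line of a new feature
            features.append(Feature(kind, [(start, stop)]))
        elif start:                           # additional interval of the current feature
            features[-1].intervals.append((start, stop))
        elif qual:                            # qualifier line
            features[-1].qualifiers.append((qual, value))
    return seq_id, features
```

Running the reader on a table before import is a quick way to confirm that every qualifier line follows a feature line and that multi-interval features are grouped as intended.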

For information about the requirements and expectations of annotation as well as additional examples of feature annotation, please see:

GFF3 File

Importing a GFF3 file

To import a valid GFF3 file, use the Import -> GFF3 file button to add annotation to the sequence. When importing, the following box will appear:

Check Sequence Identifiers

This box will compare the sequence IDs in column 1 of the GFF table to the sequence IDs in the submission file to ensure that annotation is being added to the correct sequence. If the names are not correct, they can be adjusted in this dialog.
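The same comparison can be run ahead of time outside the program. The sketch below assumes the submission sequences are available as FASTA text; the function names are hypothetical and the check is only an approximation of what the dialog does.

```python
def fasta_ids(fasta_text):
    """SeqIDs: the text between '>' and the first whitespace on each definition line."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def gff3_seqids(gff_text):
    """Collect the distinct values of column 1, skipping comments and directives."""
    ids = set()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break                      # embedded sequence section, no more feature rows
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def unmatched_seqids(gff_text, fasta_text):
    """Return GFF3 seqids that have no counterpart in the FASTA file."""
    return sorted(gff3_seqids(gff_text) - fasta_ids(fasta_text))
```

Any identifier returned by unmatched_seqids is one that would need to be adjusted in the dialog.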

When the information is correct, press OK to move forward in the process to see the next dialog box. If the GFF3 file includes locus-tag information, the first value will be displayed in the box.

Locus Tag

If there are no locus-tags in the file, the locus-tag box will be empty.

Locus Tag Empty

Simply enter the registered locus-tag prefix in this box and the software will automatically create them when adding the annotation.

Be sure to select whether the submission is a prokaryote or eukaryote genome. If a eukaryotic genome, the software will make mRNA features as well as valid protein_ids and transcript_ids if these are not included in the GFF3 table. If a prokaryotic genome, the software will make protein_ids if not included in the GFF3 table.

General information about GFF3 files

A 9-column annotation file conforming to the GFF3 or GTF specifications can be used for genome annotation submission. The basic characteristics of the file formats are described at:

The GFF3 format is better described and allows for a richer annotation, but GTF will also work for many submissions. This documentation focuses on GFF3 formatting conventions, but GTF conventions to use for submission are similar. Several basic validators are available to verify that a GFF3 file is syntactically valid:

Note that these standalone validators will not detect all formatting and annotation issues, and the GenBank annotation submission software is tolerant of some common GFF3 formatting issues. Even so, the validators can be useful for initial testing, especially if an input file isn't working as expected.

GenBank-specific requirements

An additional set of rules, specific attributes (equivalent to INSDC qualifiers), and automatic processing are utilized for submission of annotated genomes to GenBank. These additions are:

Formatting requirements

[1] seqid in GFF3/GTF column 1 should match the corresponding FASTA or ASN.1 file that is being annotated. For assemblies already in GenBank, seqids will be matched to their corresponding accessions if they are the same as what was used in the original submission. [The seqid is the text between the '>' and the first space in the FASTA definition line; do not include the '>' in the GFF file.]

[2] contig, supercontig, chromosome and similar landmark features are not required and will be ignored

[3] multi-exon mRNA and other RNA features can be represented using either:

     [a] child exon features

     [b] child five_prime_UTR, CDS, and three_prime_UTR features

     [c] multiple RNA feature rows with the same ID

Furthermore, whereas the GFF3 specifications require that all rows of a multi-exon CDS feature use the same ID, some commonly used software deviates from this requirement. To allow for deviations from the specifications, for eukaryotes the GenBank software assumes that multiple CDS rows with the same Parent attribute represent parts of the same CDS feature. Multiple CDS features for the same gene need to be annotated by using a separate mRNA Parent feature for each, so there is always a 1:1 relationship of mRNA to CDS, as in the following schematic:

gene1            ================================    ID=gene1
mRNA1            ================================    ID=mRNA1;Parent=gene1
five_prime_UTR   ==                                  Parent=mRNA1
CDS1               ==....=====...........==          Parent=mRNA1 (3 rows)
three_prime_UTR                            ======    Parent=mRNA1
mRNA2            ================================    ID=mRNA2;Parent=gene1
exon             ====                                Parent=mRNA2
CDS2               ==....................==          Parent=mRNA2 (2 rows)
exon                                     ========    Parent=mRNA2
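The grouping rule in the schematic (CDS rows that share a Parent belong to one CDS feature) can be illustrated as follows. This is a sketch of the rule as described here, not the GenBank software's implementation; cds_by_parent is a hypothetical name, and the attribute parsing is deliberately simplified.

```python
from collections import defaultdict


def cds_by_parent(gff_text):
    """Group CDS rows by their Parent attribute.

    Returns a dict mapping each Parent ID to a sorted list of (start, end)
    intervals, so each key corresponds to one CDS feature.
    """
    groups = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Simplified column 9 parsing: key=value pairs separated by ";".
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        parent = attrs.get("Parent")
        if parent is None:
            continue                   # rows without a Parent are not grouped here
        groups[parent].append((int(cols[3]), int(cols[4])))
    return {parent: sorted(ivs) for parent, ivs in groups.items()}
```

With the 1:1 relationship shown above, the number of keys returned should equal the number of mRNA features that have coding regions. The CDS ID attribute is intentionally not consulted, which mirrors the tolerance for files that give each CDS row a different ID.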

Changes that occur during processing

[1] CDS features that don't include but are adjacent to a stop codon will be automatically extended 1-3 bp to include the stop codon. start_codon and stop_codon features are not required in either GFF3 or GTF.

[2] gene and mRNA features are useful but NOT required. If they are omitted, and only CDS features are provided, then gene and/or mRNA features will be created on-the-fly based on the corresponding CDS feature. mRNA features are auto-created for eukaryote genome annotation submissions. If your organism name is in the taxonomy database, the software will be able to determine whether your submission is prokaryotic or eukaryotic. If there is no valid taxonomy lookup, a popup will appear and you will need to select whether the submission is from a prokaryote or eukaryote.

[3] The partialness markup on gene, mRNA, and CDS features is computed automatically based on the completeness of the CDS feature at either end. There is no need to specify attributes in column 9, and any attributes that are sometimes used to specify partialness, such as start_range or end_range, will be ignored.
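To make the stop-codon extension in point [1] concrete, here is a minimal sketch. It rests on several assumptions that are not stated in this document: a single-interval CDS on the plus strand, a reading frame that begins at the first base of the CDS, and the standard genetic code. The function name extend_to_stop is hypothetical, and the actual software may apply different rules.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only (assumption)


def extend_to_stop(seq, start, end):
    """Return the CDS end after extending 1-3 bp to include an adjacent stop codon.

    seq        -- plus-strand nucleotide sequence
    start, end -- 1-based inclusive CDS coordinates on the plus strand
    The end is returned unchanged when no adjacent stop codon is found.
    """
    length = end - start + 1
    # A CDS that already ends in a stop codon is left alone.
    if length % 3 == 0 and seq[end - 3:end].upper() in STOP_CODONS:
        return end
    extra = 3 - (length % 3)           # 3 if in frame, otherwise 1 or 2
    new_end = end + extra
    if new_end > len(seq):
        return end                     # not enough sequence to complete a codon
    codon = seq[new_end - 3:new_end].upper()
    return new_end if codon in STOP_CODONS else end


seq = "ATGAAATAATGA"
print(extend_to_stop(seq, 1, 6))  # CDS ATGAAA is followed by TAA
```

The 1 bp and 2 bp cases arise when the annotated CDS stops partway through the stop codon, for example at position 7 or 8 in the sequence above.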

Attributes/Annotation features

[1] Many SO feature types are recognized in column 3 and converted to their INSDC equivalents. Commonly used types are:

  • gene
  • CDS
  • mRNA
  • exon
  • five_prime_UTR
  • three_prime_UTR
  • rRNA
  • tRNA
  • ncRNA
  • tmRNA
  • transcript
  • mobile_genetic_element
  • origin_of_replication
  • promoter
  • repeat_region

Some SO types may need to be changed before processing in order to be properly recognized:

 [a] All gene features should use "gene". More specific SO types like rRNA_gene, miRNA_gene, tRNA_gene, pseudogene, pseudogenic_tRNA, and others should be converted to use "gene" instead

 [b] Misc_RNA is sometimes used for a generic RNA feature type, but it is not a recognized SO term. Use "transcript" instead.

Feature types that aren't recognized will be automatically dropped and reported in the log file. Feature types that are always ignored (so not reported in the log file) are:

  • intron
  • protein
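The column 3 conversions described in [a] and [b] can be applied to a file before import. The sketch below covers only the types named in this section; the text says "and others", so the GENE_LIKE set is incomplete by design, and normalize_type is a hypothetical helper rather than part of any NCBI tool.

```python
# Types named in this section; the full set of gene-like SO terms is larger.
GENE_LIKE = {"rRNA_gene", "miRNA_gene", "tRNA_gene", "pseudogene", "pseudogenic_tRNA"}
ALWAYS_IGNORED = {"intron", "protein"}


def normalize_type(so_type):
    """Map a column 3 value to the form expected on import.

    Returns None for types that are always ignored, "gene" for the more
    specific gene types, "transcript" for Misc_RNA, and the input otherwise.
    """
    if so_type in ALWAYS_IGNORED:
        return None
    if so_type in GENE_LIKE:
        return "gene"
    if so_type.lower() == "misc_rna":
        return "transcript"
    return so_type


for t in ["tRNA_gene", "Misc_RNA", "intron", "CDS"]:
    print(t, "->", normalize_type(t))
```

When a pseudogene row is converted to "gene", the pseudogene status still has to be carried by an attribute in column 9, as described in the next point.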

[2] Pseudogenes should be flagged with pseudogene=<TYPE> qualifier in column 9 on the gene feature and optionally on any child features. Further details about the TYPE values allowed for the pseudogene qualifier are available at: Pseudogene Qualifier Vocabulary

[3] annotate with pseudo=true any genes that are 'broken' but are not thought to be pseudogenes. These are genes that do not encode the expected translation, for example because of internal stop codons. These are often caused by problems with the sequence and/or assembly.

[4] gene features require locus_tag qualifiers. GFF3 ID attributes are not used for the locus_tag qualifier, so if the ID is applicable as the locus_tag, it should be copied into that attribute with the appropriate formatting. The locus_tags can be provided either by:

     [a] Adding a locus_tag= attribute to column 9. This option should be used for annotation updates to keep the existing locus_tags where appropriate.

     [b] Indicating in the popup box that you have not included locus-tags. You will then be prompted to specify the prefix to use, and the software will assign locus_tags automatically.
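For option [a], locus_tags can be generated with a short script before import. The sketch below assumes the usual prefix-underscore-identifier form; the zero-padded sequential numbering, the prefix "ABC", and the function name assign_locus_tags are all illustrative assumptions, and the numbering the software produces under option [b] may differ.

```python
def assign_locus_tags(prefix, gene_ids, start=1, step=1, width=5):
    """Map each GFF3 gene ID to a locus_tag of the form PREFIX_NNNNN.

    gene_ids should be in the order the genes appear on the sequence.
    The numbering scheme (start, step, width) is an assumption.
    """
    return {
        gene_id: f"{prefix}_{start + i * step:0{width}d}"
        for i, gene_id in enumerate(gene_ids)
    }


tags = assign_locus_tags("ABC", ["gene1", "gene2", "gene3"])
for gene_id, tag in tags.items():
    print(f"{gene_id}\tlocus_tag={tag}")
```

Leaving gaps in the numbering (for example step=10) is a common way to leave room for genes added in later annotation updates.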

[5] mRNA and CDS features require transcript_id and protein_id qualifiers, respectively. Either provide both or provide neither: that is, follow both [a] and [b], or just [c]:

     [a] adding transcript_id= attributes to mRNA (and other RNA) features, using the format:

  • transcript_id=gnl|dbname|ID

Where dbname is either the locus_tag prefix, or WGS:XXXX (for assemblies that have already been assigned a WGS accession prefix). Further details are available in the eukaryotic annotation guidelines.

     [b] adding protein_id attributes to the CDS features, using one of these formats:

  • protein_id=gnl|dbname|ID
  • protein_id=gnl|dbname|ID|gb|accession

"gb|accession" is only applicable for annotation updates where tracking of proteins is desired or required. It is not required to reuse existing protein accessions if the same dbname and ID are provided. Further details are available in the eukaryotic annotation guidelines.

     [c] both transcript_id and protein_id can be omitted, and they will be generated automatically based on the IDs of the mRNA/CDS and gene locus_tag prefix. These qualifiers do not appear in the flatfile view, so if the GFF3 IDs are meant to be seen in that view, then they should be copied into a 'note' attribute with the appropriate formatting. However, annotation updates should include the generated protein_ids on CDS features described in point [5b] to allow protein accessions to be preserved appropriately.
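The identifier formats in [a] and [b] are simple string patterns, sketched below. The helper names, the local IDs, and the accession value are hypothetical and only illustrate the layout.

```python
def general_id(dbname, local_id):
    """Build a gnl|dbname|ID identifier for a transcript_id or protein_id."""
    return f"gnl|{dbname}|{local_id}"


def protein_id(dbname, local_id, accession=None):
    """Build a protein_id, appending |gb|accession when an accession is being tracked."""
    base = general_id(dbname, local_id)
    return f"{base}|gb|{accession}" if accession else base


# "ABC" stands in for a locus_tag prefix; the IDs and accession are made up.
print("transcript_id=" + general_id("ABC", "mrna.0001"))
print("protein_id=" + protein_id("ABC", "cds.0001"))
print("protein_id=" + protein_id("ABC", "cds.0001", accession="XXX00000.1"))
```

For an annotation update, the accession form is the one that lets existing protein accessions be carried forward.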

[6] GFF3 ID attributes are primarily used just for interpreting parent-child feature relationships.

  • They are not automatically used for the locus_tag qualifier, so if the ID is applicable as the locus_tag, it should be copied into that attribute with the appropriate formatting.
  • However, if no transcript_id, or protein_id qualifiers are present, then the GFF3 ID attribute will be used as the basis of those qualifiers, as described in point [5c]. These qualifiers do not appear in the flatfile view, so if the GFF3 IDs are meant to be seen in that view, then they should be copied into a 'note' attribute with the appropriate formatting.

[7] GFF3 Name attributes are ignored.

[8] product names are specified using a product= attribute on a CDS or RNA feature.

  • Names should conform to GenBank guidelines.
  • Multiple names can be specified by providing the primary name first, and additional names as a comma-separated list.
  • Commas that are intended to be part of a name should be encoded (%2C) according to the GFF3 specifications. However, literal commas should only be included when they are part of enzymatic names. Semi-colons generally should not be included in product names.
  • If a CDS feature does not specify a product name, it will be automatically named 'hypothetical protein'.
  • If an mRNA feature does not specify a product name, it will automatically inherit the name from the CDS.
  • Product names should be provided for tRNAs, rRNAs and ncRNAs in GFF3/GTF submission files.
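Because an unencoded comma separates multiple names, a product name that contains a literal comma must be encoded before it is written to column 9. The sketch below percent-encodes the characters that the GFF3 specification reserves in attribute values; the function names are hypothetical.

```python
def encode_value(value):
    """Percent-encode characters that are reserved inside a GFF3 attribute value."""
    # "%" must be handled first so that later escapes are not double-encoded.
    for char, code in (("%", "%25"), (";", "%3B"), ("=", "%3D"), ("&", "%26"), (",", "%2C")):
        value = value.replace(char, code)
    return value


def product_attribute(primary, *alternates):
    """Build a product= attribute with the primary name first, then any additional names."""
    names = (primary, *alternates)
    return "product=" + ",".join(encode_value(n) for n in names)


print(product_attribute("1,4-alpha-glucan branching enzyme"))
print(product_attribute("16S ribosomal RNA"))
```

The first call keeps the comma inside a single enzymatic name, whereas passing two arguments would produce two names separated by an unencoded comma.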

[9] Most INSDC qualifiers that can be used for submission in a conventional 5-column .tbl file will also work if provided as attributes in column 9 of a GFF3 input file. Multiple values for a qualifier should be provided as a comma-separated list. Commonly used attributes/qualifiers include:

     [a] attributes described above in more detail:

  • locus_tag=_ID (gene)
  • transcript_id=gnl|dbname|ID (RNA)
  • protein_id=gnl|dbname|ID|gb|accession (CDS)
  • product= (RNA, CDS)
  • pseudo=true (gene, RNA, CDS)
  • pseudogene= (gene, RNA, CDS)

     [b] Dbxref=DB:value (all feature types). See GenBank db_xref qualifier for the current list of allowed databases

     [c] ec_number=x.x.x.x (CDS features)

     [d] Note= (all feature types). Converted to INSDC /note (also known as a comment)

     [e] gene=Abc1 (gene). For the biological gene name (aka symbol)

     [f] gene_synonym=xyz (gene). Database names can be included as synonyms, even with no gene name

     [g] description= (gene). gene full name, displayed as /note in flatfile.

     [h] exception= (gene, RNA, CDS)

     [i] transl_except=(pos:%2Caa:) (CDS). Used to specify the location of translation exceptions on a CDS feature where a codon at a specific location on the genome should be translated as an alternative amino acid, such as Sec.

     [j] function (CDS)

     [k] experiment (RNA,CDS)

     [l] old_locus_tag (gene)

     [m] mobile_element. This has the mandatory qualifier of mobile_element_type, e.g., mobile_element_type=SINE:Alu

     [n] ncRNA_class, regulatory_class, recombination_class. These can also be represented with specific SO feature types in column 3, if they have equivalents in the INSDC class controlled vocabularies.
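To see how the attributes in this list would be read from column 9, the following sketch splits an attribute string into names and values, treating unencoded commas as separators between multiple values and decoding %2C afterwards. It follows the general GFF3 conventions; parse_column9 is a hypothetical name, and the database names and coordinates in the example are made up.

```python
from urllib.parse import unquote


def parse_column9(col9):
    """Parse a GFF3 attribute column into a dict mapping each name to a list of values."""
    attrs = {}
    for pair in col9.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue                     # tolerate a trailing ";"
        key, _, raw = pair.partition("=")
        # Split on literal commas first, then decode escapes such as %2C.
        values = [unquote(v) for v in raw.split(",")]
        attrs.setdefault(key, []).extend(values)
    return attrs


example = (
    "ID=cds1;Parent=mRNA1;product=1%2C4-alpha-glucan branching enzyme;"
    "Dbxref=DB1:123,DB2:456;transl_except=(pos:100..102%2Caa:Sec)"
)
for name, values in parse_column9(example).items():
    print(name, values)
```

Splitting before decoding is what keeps an encoded comma inside a single value, such as the one in the transl_except example.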

Annotation crossing gaps

A CDS can only cross a gap of unknown size in introns, not in the actual coding region. If the gap of unknown size is within an exon, you can split the CDS into two partial CDS features (and mRNAs in eukaryotes) that abut the gap, with a single gene over the whole locus. Alternatively, one of the partial CDS/mRNA features may be deleted if it is very short and there is little or no supporting evidence for it. If you have a single gene and two partial CDS/mRNA features, you should: (1) add a note to each CDS referencing the other half of the gene, and (2) add a note to the gene and CDS features stating, "gap found within coding sequence."

A CDS exon can cross a gap of estimated size; however, a CDS (or mRNA) should not cross a gap such that over 50% of the translation is X (i.e., is in the gap). This situation will generate an error. Again, the CDS/mRNA should either be partial up to the gap or split into two partial CDS/mRNA features on either side of the gap, depending upon your confidence in the translation on each side of the gap.

In addition, no feature should begin or end inside a gap. Instead, the feature should abut the gap and be partial. For more information about splitting CDS features, see either the eukaryotic annotation guidelines or the prokaryotic annotation guidelines.
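Two of the gap rules above are easy to test before submission: whether a feature begins or ends inside a gap, and whether more than 50% of a translation is X. The sketch below assumes 1-based inclusive coordinates and a protein string in which gap-derived residues appear as "X"; the function names are hypothetical and the checks are not a substitute for the validator.

```python
def ends_inside_gap(start, stop, gaps):
    """Return True if either end of a feature falls inside any gap.

    start, stop -- 1-based inclusive feature coordinates (either order)
    gaps        -- list of (gap_start, gap_stop) pairs, 1-based inclusive
    """
    ends = (min(start, stop), max(start, stop))
    return any(g0 <= pos <= g1 for pos in ends for g0, g1 in gaps)


def fraction_unknown(protein):
    """Fraction of residues in a translation that are X."""
    return protein.count("X") / len(protein) if protein else 0.0


def mostly_gap(protein):
    """True when over 50% of the translation is X, the condition that generates an error."""
    return fraction_unknown(protein) > 0.5


gaps = [(100, 150)]
print(ends_inside_gap(10, 120, gaps))   # ends inside the gap
print(ends_inside_gap(10, 99, gaps))    # abuts the gap
print(mostly_gap("MKXXXXX"))
```

A feature that fails the first check should be trimmed so that it abuts the gap and is marked partial, as described above.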

For more information please see the full documentation for NCBI Genome Workbench Editing Package.


Last updated: 2019-07-03T16:35:04Z