The DDBJ/EMBL/GenBank
Feature Table:
Definition
Version 7 Oct 2007
DNA Data Bank of Japan, Mishima, Japan.
EMBL Nucleotide Sequence Database, Cambridge, UK.
GenBank, NCBI, Bethesda, MD, USA.
2 Overview of the Feature Table format
2.2 Key aspects of this feature table design
3 Feature table components and format
3.2.3 Key groups and hierarchy
3.4.3 Examples of feature labels
5 Examples of sequence annotation
5.3 Artificial cloning vector (circular)
5.6 Immunoglobulin heavy chain
6 Limitations of this feature table design
7.1 Appendix I EMBL, GenBank and DDBJ entries
7.2 Appendix II Feature table: Backus-Naur form
7.3 Appendix III: Feature keys reference
7.4 Appendix IV: Summary of qualifiers for feature keys
7.4.2 Feature qualifiers – mapped to Feature keys
7.5 Appendix V: Controlled vocabularies
7.5.1 Nucleotide base codes (IUPAC)
7.5.2 Modified base abbreviations
7.5.3 Amino acid abbreviations
7.5.4 Modified and unusual Amino Acids
Nucleic acid
sequences provide the fundamental starting point for describing and
understanding the structure, function, and development of genetically diverse
organisms. The GenBank, EMBL, and DDBJ nucleic acid sequence data banks have
from their inception used tables of sites and features to describe the roles
and locations of higher order sequence domains and elements within the genome
of an organism.
In February, 1986, GenBank and EMBL began a collaborative effort (joined by
DDBJ in 1987) to devise a common feature table format and common standards for
annotation practice.
The overall goal of the feature table design is to provide an extensive vocabulary for describing features in a flexible framework for manipulating them. The Feature Table documentation represents the shared rules that allow the three databases to exchange data on a daily basis.
The range of features to be represented is diverse, including regions which:
· perform a biological function,
· affect or are the result of the expression of a biological function,
· interact with other molecules,
· affect replication of a sequence,
· affect or are the result of recombination of different sequences,
· are a recognizable repeated unit,
· have secondary or tertiary structure,
· exhibit variation, or have been revised or corrected.
The format design is based on a tabular approach and consists of the following items:
Feature key
a single word or abbreviation indicating functional group
Location
instructions for finding the feature
Qualifiers
auxiliary information about a feature
· Feature keys allow specific annotation of important sequence features.
· Related features can be easily specified and retrieved.
Feature keys are arranged hierarchically, allowing complex and compound features to be expressed. Both location operators and the feature keys show feature relationships even when the features are not contiguous. The hierarchy of feature keys allows broad categories of biological functionality, such as rRNAs, to be easily retrieved.
· Generic feature keys provide a means for entering new or undefined features.
A number of "generic" or miscellaneous
feature keys have been added to permit annotation of features that cannot be
adequately described by existing feature keys. These generic feature keys will
serve as an intermediate step in the identification and addition of new feature
keys. The syntax has been designed to allow the addition of new feature keys as
they are required.
· More complex locations (fuzzy and alternate ends, for example) can be specified.
Each end point of a feature may be specified as a
single point, an alternate set of possible end points, a base number beyond
which the end point lies, or a region which contains the end point.
· Features can be combined and manipulated in many different ways.
The location field can contain operators or functional descriptors specifying what must be done to the sequence to reproduce the feature. For example, a series of exons may be "join"ed into a full coding sequence.
· Standardized qualifiers provide precision and parsability of descriptive details.
A combination of standardized qualifiers and their controlled-vocabulary values
enables free-text descriptions to be avoided.
· The nature of supporting evidence for a feature can be explicitly indicated.
Features, such as open reading frames or sequences showing sequence similarity to consensus sequences, for which there is no direct experimental evidence can be annotated. Therefore, the feature table can incorporate contributions from researchers doing computational analysis of the sequence databases. However, all features that are supported by experimental data will be clearly marked as such.
· The table syntax has been designed to be machine parsable.
A consistent syntax allows machine extraction and manipulation of sequences coding for all features in the table.
The format and wording in the feature table use common biological research terminology whenever possible. For example, an item in the feature table such as:
Key Location/Qualifiers
CDS 23..400
/product="alcohol dehydrogenase"
/gene="adhI"
might be read as:
The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called 'alcohol dehydrogenase' and is coded for by a gene called “adhI”.
A more complex description:
Key Location/Qualifiers
CDS join(544..589,688..>1032)
/product="T-cell receptor beta-chain"
which might be read as:
This feature, which is a partial coding sequence, is formed by joining elements indicated to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
The following sections contain detailed explanations of the feature table design showing conventions for each component of the feature table, examples of how the format might be implemented, a description of the exact column placement of all the data items and examples of complete sequence entries that have been annotated using the new format. The last section of this document describes known limitations of the current feature table design.
Appendix I gives an example database entry for the DDBJ, GenBank and EMBL formats. Appendix II describes the format in Backus-Naur Form (BNF). Appendices III and IV provide reference manuals for the feature table keys and qualifiers, respectively. Appendix V includes controlled vocabularies such as nucleotide base codes, modified base abbreviations, genetic code tables etc.
This document defines the syntax and vocabulary of the feature table. The syntax is sufficiently flexible to allow expression of a single biological entity in numerous ways. In such cases, the annotation staffs at the databases will propose conventions for standard means of denoting the entities.
This feature table format is shared by GenBank, EMBL and DDBJ. Comments, corrections, and suggestions may be submitted to any of the database staffs. New format specifications will be added as needed.
Feature table components, including feature
keys, qualifiers, accession numbers, database name abbreviations, feature
labels, and location operators, are all named following the same conventions.
Component names may be no more than 20 characters long (feature keys: 15;
feature qualifiers: 20)
and must contain at least one letter. Case should not be regarded as
significant in comparing feature labels (“Prot1” and “pROT1” are the same). The
following characters are permitted to occur in feature table component names:
· Uppercase letters (A-Z)
· Lowercase letters (a-z)
· Numbers (0-9)
· Underscore (_)
· Hyphen (-)
· Single quotation mark or apostrophe (')
· Asterisk (*)
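The naming rules above can be checked mechanically. The following is a minimal sketch, not part of the definition itself: the function names and the `kind` argument are hypothetical, and the length limits are the ones stated in the text.

```python
import re

# Permitted characters, taken from the list above.
_ALLOWED = re.compile(r"[A-Za-z0-9_\-'*]+")

# Length limits stated in the text; the dictionary keys are illustrative names.
MAX_LENGTH = {"key": 15, "qualifier": 20, "other": 20}


def is_valid_component_name(name: str, kind: str = "other") -> bool:
    """Return True if `name` satisfies the naming rules for `kind`."""
    if not name or len(name) > MAX_LENGTH[kind]:
        return False
    if _ALLOWED.fullmatch(name) is None:
        return False
    # At least one letter is required.
    return any(ch.isalpha() for ch in name)


def same_label(a: str, b: str) -> bool:
    """Compare two feature labels without regard to case."""
    return a.lower() == b.lower()
```

Under these assumptions, `is_valid_component_name("-35_signal", "key")` is accepted, while a name made only of digits or containing a blank is rejected.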
Feature keys indicate (1) the biological nature of the annotated feature or (2) information about changes to or other versions of the sequence. The feature key permits a user to quickly find or retrieve similar features or features with related functions.
There is a defined list of allowable feature keys, which is shown in Appendix III. Each feature must contain a feature key.
The feature keys fall into families which are in some sense similar in function and which are annotated in a similar manner. A functional family may have a "generic" or miscellaneous key, recognizable by its 'misc_' prefix, that can be used for instances not covered by the other defined keys of that group.
The feature key groups are listed below with a short definition and an annotation example:
1. Difference and change features
Indicate ways in which a sequence should be changed to produce a different
"version":
misc_difference location
/replace="change_location"
2. Expression signal features
Indicate regions containing a signal that alters a biological function:
misc_signal location
3. Transcript features
Indicate products made by a region:
misc_RNA location
4. Binding features
Indicate that a sequence or nucleotide is covalently, non-covalently, or
otherwise bound to something else:
misc_binding location
/bound_moiety="bound molecule"
5. Repeat features
Indicate repetitive sequence elements:
repeat_region location
6. Recombination features
Indicate regions that have been either inserted or deleted by recombination:
misc_recomb location
7. Structure features
Indicate sequence for which there is secondary or
tertiary structural information:
misc_structure location
In addition to the functional groupings shown above, the feature keys can also be arranged in a hierarchical tree based on the degree of specificity or level of detail known about a feature. This hierarchy is shown in outline form in Appendix III where the most general level is the 'misc_feature' key and other keys are arranged in increasing level of detail. By using more general keys, features can be annotated even if their biological functions are insufficiently well characterized to assign them more specific keys.
Key            Description
CDS            Protein-coding sequence
RBS            Ribosome binding site
rep_origin     Origin of replication
protein_bind   Protein binding site on DNA
tRNA           Mature transfer RNA
See Appendix III for descriptions of all feature keys.
Qualifiers provide a general mechanism for supplying information about features in addition to that conveyed by the key and location.
Qualifiers take the form of a slash (/) followed by the qualifier name and, if applicable, an equal sign (=) and a value. Each qualifier should have a single value; if multiple values are necessary, these should be represented by iterating the same qualifier, e.g.:
Key Location/Qualifiers
CDS 1..1000
/codon=(seq:"cug",aa:Ser)
/codon=(seq:"tga",aa:Trp)
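The qualifier syntax above (a slash, a name, and an optional "=value") and the practice of iterating a qualifier for multiple values can be sketched as follows. The function names are hypothetical, and values are kept verbatim without interpreting quotation marks.

```python
from collections import defaultdict


def parse_qualifier(text: str):
    """Split '/name=value' or '/name' into (name, value-or-None)."""
    if not text.startswith("/"):
        raise ValueError(f"qualifier must start with '/': {text!r}")
    # Only the first '=' separates the name from the value.
    name, sep, value = text[1:].partition("=")
    return name, (value if sep else None)


def collect_qualifiers(lines):
    """Group qualifier values by name; a repeated qualifier yields several values."""
    grouped = defaultdict(list)
    for line in lines:
        name, value = parse_qualifier(line.strip())
        grouped[name].append(value)
    return dict(grouped)


qualifiers = collect_qualifiers([
    '/codon=(seq:"cug",aa:Ser)',
    '/codon=(seq:"tga",aa:Trp)',
])
```

Here `qualifiers["codon"]` holds both values in the order in which they were given.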
If the location descriptor does not need a continuation line, the first qualifier begins a new line in the feature location column. If the location descriptor requires a continuation line, the first qualifier may follow immediately after the location. Any necessary continuation lines begin in the same column. See Section 4 for a complete description of data item positions.
Since qualifiers convey many different types of information, there are several
value formats:
1. Free text
2. Controlled vocabulary or enumerated values
3. Citation or reference numbers
4. Sequences
5. Feature labels
Most qualifier values will be a descriptive text phrase which must be enclosed in double quotation marks. When the text occupies more than one line, a single set of quotation marks is required at the beginning and at the end of the text. The text itself may be composed of any printable characters (ASCII values 32-126 decimal). If double quotation marks are used within a free text string, each set (") must be 'escaped' by placing a second double quotation mark immediately before it (""). For example:
/note="This is an example of ""escaped"" quotation marks"
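The escaping rule for free-text values can be illustrated with a pair of helper functions. This is a sketch only; the function names are hypothetical and the check for printable ASCII characters is omitted.

```python
def quote_free_text(text: str) -> str:
    """Enclose text in double quotes, doubling any embedded double quotes."""
    return '"' + text.replace('"', '""') + '"'


def unquote_free_text(value: str) -> str:
    """Reverse quote_free_text: strip the outer quotes and undo the doubling."""
    if len(value) < 2 or value[0] != '"' or value[-1] != '"':
        raise ValueError("free-text value must be enclosed in double quotes")
    return value[1:-1].replace('""', '"')


quoted = quote_free_text('This is an example of "escaped" quotation marks')
```

Applied to the sentence above, `quote_free_text` reproduces the value shown in the `/note` example.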
Some qualifiers require values from a controlled vocabulary and are entered without quotation marks. For example, the '/direction' qualifier has only three values: 'left', 'right' or 'both'. Qualifier value controlled vocabularies, like feature table component names, must be treated as completely case insensitive: they may be entered and displayed in any combination of upper and lower case ('/direction=Left' '/direction=left' and '/direction=LEFT' are all legal and all convey the same meaning). The database staffs reserve the right to regularize the case of qualifier values in the interest of readability, unlike the case of feature labels where the databases will maintain the case as originally entered (see Section 3.4.2). Qualifier value controlled vocabularies will be maintained by the cooperating database staffs. Examples of controlled vocabularies can be found in Appendices IV and V. The database staff should be contacted for the current lists.
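Case-insensitive handling of a controlled vocabulary can be sketched as below, using the three '/direction' values named above. The function name and the choice of lower case as the regularized form are assumptions made for illustration.

```python
# The three values listed in the text for the '/direction' qualifier.
DIRECTION_VALUES = {"left", "right", "both"}


def normalize_direction(value: str) -> str:
    """Validate a '/direction' value and return it in a regularized case."""
    lowered = value.lower()
    if lowered not in DIRECTION_VALUES:
        raise ValueError(f"not a legal /direction value: {value!r}")
    return lowered
```

With this sketch, 'Left', 'left' and 'LEFT' all normalize to the same value.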
The citation or published reference number (as enumerated in the entry 'REFERENCE' or 'RN' data item) should be enclosed in square brackets (e.g., [3]) to distinguish it from other numbers.
Literal sequences of nucleotide bases in location descriptors, e.g., join(12..45,"atgcatt",988..1050), have been illegal since the implementation of version 2.1 of the Feature Table Definition Document (December 15, 1998).
Key Location/Qualifiers
source 1..1509
/organism="Mus musculus"
/strain="CD1"
/mol_type="genomic DNA"
promoter <1..9
/gene="ubc42"
mRNA join(10..567,789..1320)
/gene="ubc42"
CDS join(54..567,789..1254)
/gene="ubc42"
/product="ubiquitin conjugating enzyme"
/function="cell division control"
The /label= qualifier takes as its value a feature label. Feature labels follow the same naming conventions as other feature table components (e.g., keys and qualifiers). While feature labels are optional, attaching a label to a feature allows it to be referred to unambiguously. For example, the feature label can be used to refer unambiguously to a coding region that resides in a different entry from the exons that make it up.
The feature label identifies a feature item within an entry and, when combined with the entry's primary accession number and the name of the database from which it came, is a permanent internationally unique tag for that feature. There are, however, certain situations in which a "permanent" feature may "disappear" from the distributed version of the database and others in which it may be desirable to change a feature's label.
Each feature in a feature table may have a label which must be unique within that entry, but which may be the same as feature labels used in other entries. A feature can be given any label. However, labels containing meaningful abbreviations will be much more easily remembered than non-descriptive labels. Because letter case is not significant, two features within one entry cannot have labels that differ only in case: '16S_rRNA' and '16s_rRNA' could not both be used in the same entry.
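Because case is not significant, a uniqueness check on the labels of one entry has to compare them case-insensitively. A minimal sketch, with a hypothetical function name:

```python
def find_conflicting_labels(labels):
    """Return pairs of labels in one entry that differ at most in case."""
    first_seen = {}
    conflicts = []
    for label in labels:
        folded = label.lower()
        if folded in first_seen:
            conflicts.append((first_seen[folded], label))
        else:
            first_seen[folded] = label
    return conflicts


conflicts = find_conflicting_labels(["16S_rRNA", "adhI", "16s_rRNA"])
```

In this example the first and third labels are reported as a conflicting pair.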
The full feature name syntax is as follows:
Database name::primary accession number:feature label
References to a feature should use as much of the full feature name as required to unambiguously identify the feature.
Feature label        Description
adhI                 adhI gene coding for alcohol dehydrogenase
tfp35                tail fiber protein 35
3'-ltr               long terminal repeat
a1col_x51            prepro-alpha-1-collagen, exon 51
X10045:diff1         first conflict for the sequence of entry X10045
GB::K10675:catexA    feature with label catexA in entry K10675 of the
                     GenBank databank
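The full feature name syntax, including the shortened forms shown in the examples above, can be taken apart as follows. This is a sketch under the assumption that "::" and ":" occur only as separators; the class and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class FeatureName(NamedTuple):
    database: Optional[str]
    accession: Optional[str]
    label: str


def parse_feature_name(text: str) -> FeatureName:
    """Split 'database::accession:label', where the leading parts may be absent."""
    database = None
    rest = text
    if "::" in rest:
        database, rest = rest.split("::", 1)
    accession = None
    if ":" in rest:
        accession, rest = rest.split(":", 1)
    if not rest:
        raise ValueError(f"missing feature label in {text!r}")
    return FeatureName(database, accession, rest)


full = parse_feature_name("GB::K10675:catexA")
```

Parts that are not given are returned as `None`, which mirrors the rule that a reference needs only as much of the full name as is required to be unambiguous.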
The location indicates the region of the presented sequence which corresponds to a feature.
The location contains at least one sequence location descriptor and may contain one or more operators with one or more sequence location descriptors. Base numbers refer to the numbering in the entry. This numbering designates the first base (5' end) of the presented sequence as base 1.
Base locations beyond the range of the presented sequence may not be used in location descriptors, the only exception being location in a remote entry (see 3.5.2.1, e).
Location operators and descriptors are discussed in more detail below.
The location descriptor can be one of the following:
(a) a single base number
(b) a site between two indicated adjoining bases
(c) a single base chosen from within a specified range of bases (not allowed for new
entries)
(d) the base numbers delimiting a sequence span
(e) a remote entry identifier followed by a local location descriptor
(i.e., a-d)
A site between two adjoining nucleotides, such as an endonucleolytic cleavage site, is indicated by listing the two points separated by a caret (^). The permitted formats for this descriptor are n^n+1 (for example, 55^56) or, for circular molecules, n^1, where "n" is the full length of the molecule (i.e., 1000^1 for a circular molecule of length 1000).
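The two permitted forms of the between-bases descriptor can be checked as in the sketch below. The function name and its `length` and `circular` parameters are hypothetical; they stand for the sequence length of the entry and its topology.

```python
import re

_SITE = re.compile(r"([0-9]+)\^([0-9]+)")


def is_valid_site(location: str, length=None, circular=False) -> bool:
    """Check a 'between two bases' descriptor: n^n+1, or n^1 on a circular molecule."""
    match = _SITE.fullmatch(location)
    if match is None:
        return False
    n, m = int(match.group(1)), int(match.group(2))
    if n < 1:
        return False
    if m == n + 1:
        # Adjacent bases; both must lie within the sequence if its length is known.
        return length is None or m <= length
    # Otherwise only the wrap-around site of a circular molecule is allowed.
    return circular and length is not None and n == length and m == 1
```

With these assumptions, '55^56' is accepted, '55^57' is rejected, and '1000^1' is accepted only for a circular molecule of length 1000.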
A single base chosen from a range of bases is indicated by the first base number and the last base number of the range separated by a single period (e.g., '12.21' indicates a single base taken from between the indicated points). From October 2006 the usage of this descriptor is restricted: it is illegal to use "a single base from a range" (c) either on its own or in combination with the "sequence span" (d) descriptor for newly created entries. The existing entries where such descriptors exist are going to be retrofitted.
Sequence spans are indicated by the starting base number and the ending base number separated by two periods (e.g., '34..456'). The '<' and '>' symbols may be used with the starting and ending base numbers to indicate that an end point is beyond the specified base number. The starting and ending base positions can be represented as distinct base numbers ('34..456') or a site between two indicated adjoining bases.
A location in a remote entry (not the entry to which the feature table belongs) can be specified by giving the accession number and sequence version of the remote entry, followed by a colon ":", followed by a location descriptor which applies to that entry's sequence (e.g., J12345.1:1..15; see also the examples below).
The location operator is a prefix that specifies what must be done to the indicated sequence to find or construct the location corresponding to the feature. A list of operators is given below with their definitions and most common format.
complement(location)
Find the complement of the presented sequence in the span specified by "location" (i.e., read the complement of the presented strand in its 5'-to-3' direction)
join(location,location, ... location)
The indicated elements should be joined (placed end-to-end) to form one contiguous sequence
order(location,location, ... location)
The elements can be found in the specified order (5' to 3' direction), but nothing is implied about the reasonableness of joining them
Note : location operator "complement" can be used in combination with either "join" or "order" within the same location; combinations of "join" and "order" within the same location (nested operators) are illegal.
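The effect of "complement" and "join" on a sequence can be sketched with plain string operations. This is an illustration only: the helper names are hypothetical, the example sequence is made up, and only the four unambiguous bases are handled.

```python
# Base pairing for the unambiguous bases only (an assumption of this sketch).
_PAIRS = str.maketrans("acgtACGT", "tgcaTGCA")


def span(sequence: str, start: int, end: int) -> str:
    """Bases start..end of the presented sequence (1-based, inclusive)."""
    return sequence[start - 1:end]


def complement(fragment: str) -> str:
    """Complementary strand of the fragment, read in its own 5'-to-3' direction."""
    return fragment.translate(_PAIRS)[::-1]


def join(*fragments: str) -> str:
    """Place the fragments end-to-end."""
    return "".join(fragments)


seq = "aaacccgggtttacgtacgt"
outer = complement(join(span(seq, 1, 6), span(seq, 10, 14)))
inner = join(complement(span(seq, 10, 14)), complement(span(seq, 1, 6)))
```

Both expressions yield the same sequence: complementing a joined location gives the same result as joining the complemented elements in reverse order.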
The following is a list of common location descriptors with their meanings:
Location                 Description
467                      Points to a single base in the presented sequence
340..565                 Points to a continuous range of bases bounded by and
                         including the starting and ending bases
<345..500                Indicates that the exact lower boundary point of a feature
                         is unknown. The location begins at some base previous to
                         the first base specified (which need not be contained in
                         the presented sequence) and continues to and includes the
                         ending base
<1..888                  The feature starts before the first sequenced base and
                         continues to and includes base 888
1..>888                  The feature starts at the first sequenced base and
                         continues beyond base 888
102.110                  Indicates that the exact location is unknown but that it is
                         one of the bases between bases 102 and 110, inclusive
123^124                  Points to a site between bases 123 and 124
join(12..78,134..202)    Regions 12 to 78 and 134 to 202 should be joined to form
                         one contiguous sequence
complement(34..126)      Start at the base complementary to 126 and finish at the
                         base complementary to base 34 (the feature is on the strand
                         complementary to the presented strand)
complement(join(2691..4571,4918..5163))
                         Joins regions 2691 to 4571 and 4918 to 5163, then
                         complements the joined segments (the feature is on the
                         strand complementary to the presented strand)
join(complement(4918..5163),complement(2691..4571))
                         Complements regions 4918 to 5163 and 2691 to 4571, then
                         joins the complemented segments (the feature is on the
                         strand complementary to the presented strand)
J00194.1:100..202        Points to bases 100 to 202, inclusive, in the entry (in
                         this database) with primary accession number 'J00194'
join(1..100,J00194.1:100..202)
                         Joins region 1..100 of the existing entry with the region
                         100..202 of remote entry J00194
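The location forms listed above can be read into a nested structure with a small recursive parser. The sketch below is not the formal grammar of Appendix II: the dictionary layout and function names are hypothetical, '<' is recognized only before the first base number and '>' only before the last, and no check is made for illegal nesting of "join" and "order".

```python
import re

_DESCRIPTOR = re.compile(
    r"(?:(?P<remote>[A-Za-z][A-Za-z0-9_]*\.[0-9]+):)?"   # accession.version:
    r"(?P<start_fuzzy><?)(?P<start>[0-9]+)"
    r"(?:(?P<sep>\.\.|\.|\^)(?P<end_fuzzy>>?)(?P<end>[0-9]+))?"
)
_OPERATORS = ("complement", "join", "order")
_KINDS = {None: "base", "..": "span", ".": "one_of", "^": "site"}


def _split_top_level(text):
    """Split on commas that are not enclosed in parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_location(text):
    """Parse a location string into nested dictionaries."""
    text = text.strip()
    for op in _OPERATORS:
        if text.startswith(op + "(") and text.endswith(")"):
            inner = text[len(op) + 1:-1]
            parts = [parse_location(p) for p in _split_top_level(inner)]
            return {"op": op, "parts": parts}
    match = _DESCRIPTOR.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized location: {text!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    return {
        "op": _KINDS[match["sep"]],
        "remote": match["remote"],
        "start": start,
        "end": end,
        "start_open": match["start_fuzzy"] == "<",
        "end_open": match["end_fuzzy"] == ">",
    }


nested = parse_location("join(complement(4918..5163),complement(2691..4571))")
```

The result for an operator is a dictionary with the operator name and a list of parsed operands; a descriptor yields its kind, end points, and flags for the '<' and '>' symbols.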
The examples below show the preferred sequence annotations for a number of commonly occurring sequence types. These examples may not be appropriate in all cases but should be used as a guide whenever possible. This section describes the columnar format used to write this feature table in "flat-file" form for distributions of the database.
Feature table format example (EMBL):
     source          1..1859
                     /db_xref="taxon:3899"
                     /organism="Trifolium repens"
                     /tissue_type="leaves"
                     /clone_lib="lambda gt10"
                     /clone="TRE361"
                     /mol_type="genomic DNA"
     CDS             14..1495
                     /db_xref="MENDEL:11000"
                     /db_xref="SWISS-PROT:P26204"
                     /note="non-cyanogenic"
                     /EC_number="3.2.1.21"
                     /product="beta-glucosidase"
                     /protein_id="CAA40058.1"
                     /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSR.......
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
Feature table format example (GenBank):
     source          1..8959
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /mol_type="genomic DNA"
     gene            212..8668
                     /gene="NF1"
     CDS             212..8668
                     /gene="NF1"
                     /note="putative"
                     /codon_start=1
                     /product="GAP-related protein"
                     /protein_id="AAA59924.1"
                     /translation="MAAHRPVEWVQAVVSRFDEQLPIKTGQQNTHTKVSTE.......
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
Feature table format example (DDBJ):
     source          1..2136
                     /clone="pK28"
                     /organism="Rattus norvegicus"
                     /strain="Sprague-Dawley"
                     /tissue_type="kidney"
                     /mol_type="genomic DNA"
     mRNA            19..2128
     CDS             31..1212
                     /codon_start=1
                     /function="Dual specificity protein tyrosine/threonine
                     kinase"
                     /product="MAP kinase kinase"
                     /protein_id="BAA02603.1"
                     /translation="MPKKKPTPIQLNPAPDGSAVNGTSSAETNLEALQKKL.......
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
The feature table consists of a header line,
which contains the column titles for the table, and the
individual feature entries. Each feature entry is composed of a feature
descriptor line and qualifier and
continuation lines, if needed. The feature descriptor line contains the
feature's name, key, and location. If
the location cannot be contained on the first line of the feature descriptor,
it is continued on a continuation
line immediately following the descriptor line. If the feature requires further
attributes, feature qualifier
lines immediately follow the corresponding feature descriptor line (or its
continuation). Qualifier
information that cannot be contained on one line continues on the following
continuation lines as
necessary.
Thus, there are 4 types of feature table lines:
Line type            Content                  #/entry      #/feature
---------            -------                  -------      ---------
Header               Column titles            1*           N/A
Feature descriptor   Key and location         1 to many*   1
Feature qualifiers   Qualifiers and values    N/A          0 to many
Continuation lines   Feature descriptor or    0 to many    0 to many
                     qualifier continuation
The position of the data items within the feature descriptor line is as follows:
column position    data item
---------------    ---------
1-5                blank
6-20               feature key
21                 blank
22-80              location
Data on the qualifier and continuation lines begins in column position 22 (the first 21 columns contain blanks). The EMBL format differs from the GenBank and DDBJ formats in that every line includes a line type abbreviation in columns 1 and 2.
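The column positions given above are enough to split flat-file lines into features. The following sketch makes several simplifying assumptions: the header line has already been removed, columns 1-5 are ignored (so an EMBL line type abbreviation does not interfere), a continuation line that starts with a slash is taken to be a new qualifier, and continued qualifier values are rejoined with a single space. All names are hypothetical.

```python
def format_line(key: str, data: str) -> str:
    """Lay out one line: key in columns 6-20, data from column 22."""
    return " " * 5 + key.ljust(15) + " " + data


def parse_feature_table(lines):
    """Group fixed-column feature table lines into a list of features."""
    features = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        key = line[5:20].strip()    # columns 6-20
        data = line[21:80].strip()  # columns 22-80
        if key:
            # Feature descriptor line: key and (start of) location.
            features.append({"key": key, "location": data, "qualifiers": []})
        elif not features:
            raise ValueError("continuation line before any feature descriptor")
        elif data.startswith("/"):
            features[-1]["qualifiers"].append(data)
        elif features[-1]["qualifiers"]:
            # Continuation of the previous qualifier value.
            features[-1]["qualifiers"][-1] += " " + data
        else:
            # Continuation of the location.
            features[-1]["location"] += data
    return features


example = [
    format_line("CDS", "join(544..589,"),
    format_line("", "688..>1032)"),
    format_line("", '/product="T-cell receptor beta-chain"'),
]
features = parse_feature_table(example)
```

For the three example lines, the parser returns a single CDS feature whose location has been reassembled from the descriptor line and its continuation.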
Blanks (spaces) may, in general, be used within the feature location and qualifier values to make the construction more readable. The following rules should be observed:
· Names of feature table components may not contain blanks (see Section 3.1).
· Operator names may not be separated from the following open parenthesis (the beginning of the operand list) by blanks.
· Qualifiers may not be separated from the preceding slash or the following equals sign (if one) by blanks.
The examples below show the preferred sequence annotations for a number of commonly occurring sequence types. These examples may not be appropriate in all cases but should be used as a guide whenever possible.
source 1..1509
/organism="Mus musculus"
/strain="CD1"
/mol_type="genomic DNA"
promoter <1..9
/gene="ubc42"
mRNA join(10..567,789..1320)
/gene="ubc42"
CDS join(54..567,789..1254)
/gene="ubc42"
/product="ubiquitin conjugating enzyme"
/function="cell division control"
/translation="MVSSFLLAEYKNLIVNPSEHFKISVNEDNLTEGPPDTLY
QKIDTVLLSVISLLNEPNPDSPANVDAAKSYRKYLYKEDLESYPMEKSLDECS
AEDIEYFKNVPVNVLPVPSDDYEDEEMEDGTYILTYDDEDEEEDEEMDDE"
exon 10..567
/gene="ubc42"
/number=1
intron 568..788
/gene="ubc42"
/number=1
exon 789..1320
/gene="ubc42"
/number=2
polyA_signal 1310..1317
/gene="ubc42"
source 1..9430
/organism="Lactococcus sp."
/strain="MG1234"
/mol_type="genomic DNA"
operon 160..6865
/operon="gal"
-35_signal 160..165
/operon="gal"
/experiment="experimental evidence, no additional details
recorded"
-10_signal 179..184
/operon="gal"
/experiment="experimental evidence, no additional details
recorded"
CDS 405..1934
/operon="gal"
/gene="galA"
/product="galactose permease"
/function="galactose transporter"
/experiment="experimental evidence, no additional details
recorded"
CDS 2003..3001
/operon="gal"
/gene="galM"
/product="aldose 1-epimerase"
/EC_number="5.1.3.3"
/function="mutarotase"
CDS 3235..4537
/operon="gal"
/gene="galK"
/product="galactokinase"
/EC_number="2.7.1.6"
/experiment="experimental evidence, no additional details
recorded"
mRNA 189..6865
/operon="gal"
/experiment="experimental evidence, no additional details
recorded"
source 1..5300
/organism="Cloning vector pABC"
/lab_host="Escherichia coli"
/mol_type="other DNA"
/focus
source 1..5138
/organism="Escherichia coli"
/mol_type="other DNA"
/strain="K12"
source 5139..5247
/organism="Aequorea victoria"
/mol_type="other DNA"
/dev_stage="adult"
source 5248..5300
/organism="Escherichia coli"
/mol_type="other DNA"
/strain="K12"
CDS join(complement(<1..799),complement(5080..5120))
/gene="mob1"
/product="mobilization protein 1"
CDS complement(1697..2512)
/gene="Km"
/product="kanamycin resistance protein"
CDS 3037..3711
/gene="rep1"
/product="replication protein 1"
CDS complement(4170..4829)
/gene="Cm"
/product="chloramphenicol resistance protein"
CDS 5139..5247
/gene="GFP"
/product="green fluorescent protein"
source 1..2245
/organism="Escherichia coli"
/plasmid="Plasmid XYZ"
/strain="K12"
/mol_type="genomic DNA"
rep_origin 6
/direction=LEFT
/note="ori"
CDS join(complement(567..795),complement(21..349))
/gene="trbC"
/product="transfer protein C"
CDS 803..1344
/gene="traN"
/product="transfer protein N"
CDS 1559..1985
/gene="incA"
/product="incompatibility protein A"
CDS join(2004..2195,3..20)
/gene="finP"
/product="fertility inhibition protein P"
source 1..1011
/organism="Homo sapiens"
/clone="pha281u/1DO"
/mol_type="genomic DNA"
repeat_region 80..401
/rpt_type=DISPERSED
/rpt_family="Alu-J"
source 1..321
/organism="Mus musculus"
/strain="BALB/c"
/cell_line="hybridoma 1A4"
/rearranged
/mol_type="mRNA"
CDS <1..>321
/codon_start=1
/gene="VFM1-DFL16.1-JH4"
/product="immunoglobulin heavy chain"
V_region 1..277