Sequence Identifiers: A Historical Note
Why are there two types of sequence identification numbers (GI and VERSION), and what is the difference between them?


The two types of sequence identification numbers, GI and VERSION, have different formats and were implemented at different points in time.

  1. GI number (sometimes written in lower case, "gi") is simply a series of digits assigned consecutively to each sequence record processed by NCBI. The GI number bears no resemblance to the accession number of the sequence record.

    • nucleotide sequence GI number is shown in the VERSION field of the database record

    • protein sequence GI number is shown in the CDS/db_xref field of a nucleotide database record, and the VERSION field of a protein database record

  2. VERSION is made of the accession number of the database record followed by a dot and a version number (and is therefore sometimes referred to as the "accession.version")

    • nucleotide sequence version contains two letters followed by six digits, a dot, and a version number (or for older nucleotide sequence records, the format is one letter followed by five digits, a dot, and a version number)

    • protein sequence version contains three letters followed by five digits, a dot, and a version number
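The formats listed above can be checked mechanically. The following is a minimal Python sketch, assuming only the three patterns described in this note (two letters plus six digits, one letter plus five digits, and three letters plus five digits); the function name parse_accession_version and the sample identifiers are hypothetical and do not refer to real records. Any accession formats not described in this note are deliberately left out.

```python
import re

# Patterns mirror the formats described in the text; anything else is rejected.
_NUCLEOTIDE = re.compile(r"([A-Z]{2}\d{6}|[A-Z]\d{5})\.(\d+)")
_PROTEIN = re.compile(r"([A-Z]{3}\d{5})\.(\d+)")


def parse_accession_version(identifier):
    """Split an accession.version string into (kind, accession, version).

    Raises ValueError if the string matches none of the described formats.
    """
    for kind, pattern in (("nucleotide", _NUCLEOTIDE), ("protein", _PROTEIN)):
        match = pattern.fullmatch(identifier)
        if match:
            return kind, match.group(1), int(match.group(2))
    raise ValueError(f"unrecognized accession.version: {identifier!r}")


# Hypothetical identifiers, used only to exercise each format.
print(parse_accession_version("AF123456.2"))
print(parse_accession_version("U12345.1"))
print(parse_accession_version("AAA12345.1"))
```

Because the letter counts differ between the nucleotide and protein formats, the two patterns cannot both match the same string, so the order in which they are tried does not matter.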

The GI number has been used for many years by NCBI to track sequence histories in GenBank and the other sequence databases it maintains. The VERSION system of identifiers was adopted in February 1999 by the International Nucleotide Sequence Database Collaboration (GenBank, EMBL, and DDBJ). More details are given in the historical note, below.

The two systems of identifiers run in parallel to each other. That is, when any change is made to a sequence, it receives a new GI number AND an increase to its version number.

A Sequence Revision History tool is available to track the GI numbers, version numbers, and update dates of the sequences that have appeared in a specific GenBank record.

Historical Note:

The first type of sequence identification number was GI, which stands for "GenInfo Identifier." GenInfo was an early system used to access GenBank and related databases. A GI number was assigned to each nucleotide and protein sequence accessible through the NCBI search systems, and was a means of tracking changes to the sequence. However, GI numbers were not used uniformly across the collaborating databases (GenBank, EMBL, DDBJ). They instead served as an internal tracking system for the databases that chose to implement them. In addition, the gi number for a nucleotide sequence originally appeared in the "Comment" field of a record. There was no separate field for sequence identification numbers.

When the collaborating databases began to formalize the use of sequence identifiers, they created a new, separate field called NID (nucleotide identifier) in the database record, which contained the GI number of the nucleotide sequence. Similarly, the GI number for each protein sequence was named PID and placed above each amino acid translation in the field: FEATURES/CDS/db_xref="PID:gNNNNNN". Hence, there came to be two types of GI numbers: NID and PID. In December 1999, the use of the abbreviations "NID" and "PID" was discontinued. Both are now simply shown as "GI".

In February 1999, GenBank/EMBL/DDBJ implemented a new "accession.version" system of sequence identifiers that runs parallel to the gi number system. (See section 1.3.2 of the GenBank 111.0 release notes for details.)

Unlike the GI number system, in which sequence identification numbers were not necessarily consistent across the databases (e.g., GenBank and EMBL could each assign their own GI number to a sequence), the new system is designed to ensure consistency. It is also designed to show a relationship between a sequence identification number and the accession number of the record in which it is found. In contrast, GI numbers are assigned consecutively and bear no resemblance to the accession number. Finally, the new system allows the assignment of alphanumeric protein IDs to protein translations within nucleotide sequence records. The protein IDs contain three letters followed by five digits, a period, and a version number.

As of December 1999 (GenBank release 115.0):

  • the NID field and /db_xref="PID:xxxxxxx" qualifier have been removed, and both are now simply shown as "GI" numbers
  • the VERSION field of nucleotide records will continue to contain both an accession.version and a GI number for the nucleotide sequence
  • each amino acid translation will continue to be labeled with an accession.version sequence identifier (in the "/protein_id" field) and a GI number (in the "/db_xref=GI:xxxxxxx" qualifier), under the CDS feature of a GenBank record
  • the accession.version and GI systems of sequence identifiers will run in parallel to each other. Therefore, when any change is made to a sequence, it receives a new GI number AND an increase to its version number.
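The field layout described in the list above can be read with a few regular expressions. The following Python sketch assumes a VERSION line that holds an accession.version followed by "GI:" and digits, and CDS qualifiers written as /protein_id and /db_xref with optionally quoted values. The function names and all sample values are hypothetical, and the exact spacing and quoting in real flat files may differ from what is assumed here.

```python
import re

# Assumed layout: "VERSION", whitespace, accession.version, whitespace, GI:digits.
_VERSION_LINE = re.compile(r"VERSION\s+(\S+)\.(\d+)\s+GI:(\d+)\s*")
# Qualifier values may or may not be quoted, so quotes are optional here.
_PROTEIN_ID = re.compile(r'/protein_id="?([A-Z]{3}\d{5}\.\d+)"?')
_DB_XREF_GI = re.compile(r'/db_xref="?GI:(\d+)"?')


def parse_version_line(line):
    """Extract accession, version, and GI from a nucleotide VERSION line."""
    match = _VERSION_LINE.fullmatch(line)
    if match is None:
        raise ValueError(f"not a VERSION line with a GI number: {line!r}")
    return {
        "accession": match.group(1),
        "version": int(match.group(2)),
        "gi": int(match.group(3)),
    }


def parse_cds_ids(feature_text):
    """Extract the protein accession.version and GI from CDS qualifiers.

    A value is None when the corresponding qualifier is absent.
    """
    protein = _PROTEIN_ID.search(feature_text)
    gi = _DB_XREF_GI.search(feature_text)
    return {
        "protein_id": protein.group(1) if protein else None,
        "gi": int(gi.group(1)) if gi else None,
    }


# Hypothetical sample text, not taken from a real record.
print(parse_version_line("VERSION     AF123456.2  GI:1234567"))
print(parse_cds_ids('/protein_id="AAA12345.1"\n/db_xref="GI:7654321"'))
```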

For more information, see section 3.4.7 of the current GenBank release notes.

