NCBI

Announcing NCBI's plan for adopting v2.0 of the AGP specifications

NCBI is switching from the old version of the AGP specification (version 1.1) to the new version (version 2.0) because the latter can convey valuable information on the nature of the evidence linking sequences on either side of a gap. This change will affect users who obtain AGP files from the NCBI FTP site, as well as users who submit AGP files to GenBank as part of a genome assembly submission.

Timeline

2012 Feb 10  AGP files written in 2.0 format.
2012 Feb 10  AGP 2.0 files accepted in submissions.
2012 Jul 1   AGP 1.1 files no longer accepted.

AGP files produced by NCBI

The AGP files that NCBI posts on the GenBank genomes FTP site (ftp://ftp.ncbi.nih.gov/genbank/genomes/) and on the RefSeq genome FTP site (ftp://ftp.ncbi.nih.gov/genomes/) will be written in v2.0 format starting on 10th February 2012. These v2.0 AGPs can be identified by the header line: "##agp-version 2.0". AGP parsers will require slight modification to read v2.0 AGPs (see the list of changes from v1.1 to v2.0 in the AGP v2.0 specification). Existing AGP files already on the FTP site will not be updated; they will remain in v1.1 format.
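As an illustration of how a parser might tell the two formats apart, the sketch below looks for the "##agp-version" header line among the leading comment lines of a file. The function name detect_agp_version is hypothetical, and treating a missing header as "version unknown" is an assumption made for this example rather than a rule taken from the specification.

```python
def detect_agp_version(lines):
    """Return the version string from a '##agp-version' header, or None.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Only the leading comment block is inspected.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if not stripped.startswith("#"):
            break  # first data line reached; no more header comments
        if stripped.startswith("##agp-version"):
            parts = stripped.split()
            if len(parts) >= 2:
                return parts[1]
    return None  # assumption: no header means the version is unknown


if __name__ == "__main__":
    sample = [
        "##agp-version 2.0\n",
        "# any other comment line\n",
    ]
    print(detect_agp_version(sample))
```

A caller could use the returned value to choose between a v1.1 and a v2.0 code path before reading the data lines.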

AGP files submitted to NCBI

GenBank will accept AGP files in v2.0 format from 10th February 2012 onwards. GenBank will continue to accept AGP files in the old v1.1 format until 30th June 2012, but will convert them to v2.0 format. AGP files in v1.1 format will no longer be accepted from 1st July 2012 onwards.

Table 1. Mapping of AGP v1.1 gaps to AGP v2.0 gaps

Note: AGP v1.1 gaps can be mapped forward to AGP v2.0 gaps; however, ambiguities prevent AGP v2.0 gaps from being mapped back to AGP v1.1 gaps.

AGP v1.1 gap                 AGP v2.0 gap
Gap_type         Linkage     Gap_type         Linkage
fragment         no          scaffold         yes
fragment         yes         scaffold         yes
clone            yes         scaffold         yes
repeat           yes         repeat           yes
clone            no          contig           no
contig           no          contig           no
repeat           no          repeat           no
centromere       no          centromere       no
telomere         no          telomere         no
short_arm        no          short_arm        no
heterochromatin  no          heterochromatin  no
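The mapping in Table 1 can be expressed as a simple lookup, sketched below. The names V11_TO_V20, map_gap and v11_sources are hypothetical, and the sketch covers only the Gap_type and Linkage values listed in the table; it does not attempt to fill in any other v2.0 field. The reverse lookup shows why the conversion cannot be undone: several v1.1 gaps collapse onto the same v2.0 gap.

```python
# (v1.1 gap_type, v1.1 linkage) -> (v2.0 gap_type, v2.0 linkage), per Table 1
V11_TO_V20 = {
    ("fragment", "no"): ("scaffold", "yes"),
    ("fragment", "yes"): ("scaffold", "yes"),
    ("clone", "yes"): ("scaffold", "yes"),
    ("repeat", "yes"): ("repeat", "yes"),
    ("clone", "no"): ("contig", "no"),
    ("contig", "no"): ("contig", "no"),
    ("repeat", "no"): ("repeat", "no"),
    ("centromere", "no"): ("centromere", "no"),
    ("telomere", "no"): ("telomere", "no"),
    ("short_arm", "no"): ("short_arm", "no"),
    ("heterochromatin", "no"): ("heterochromatin", "no"),
}


def map_gap(gap_type, linkage):
    """Map a v1.1 (gap_type, linkage) pair forward to its v2.0 pair."""
    try:
        return V11_TO_V20[(gap_type, linkage)]
    except KeyError:
        raise ValueError(
            f"no Table 1 entry for gap_type={gap_type!r}, linkage={linkage!r}"
        ) from None


def v11_sources(v20_pair):
    """List every v1.1 pair that maps to the given v2.0 pair."""
    return sorted(src for src, dst in V11_TO_V20.items() if dst == v20_pair)


if __name__ == "__main__":
    print(map_gap("clone", "no"))
    # More than one source means the v2.0 gap cannot be mapped back uniquely.
    print(v11_sources(("scaffold", "yes")))
```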

Display of assembly gaps and linkage evidence by INSDC

The International Nucleotide Sequence Database Collaboration (INSDC) recently added a new feature type called "assembly_gap", and the associated qualifiers "gap_type" and "linkage_evidence" (see INSDC Feature Table Definitions). DDBJ, ENA and GenBank will use the "assembly_gap" feature to display information derived from version 2.0 AGP files in their flat-file views of sequence records.

Table 2. Mapping of AGP v2.0 gaps to INSDC features

AGP v2.0 gap                 INSDC Gap Qualifier
Gap_type         Linkage     /gap_type
scaffold         yes         "within scaffold"
repeat           yes         "repeat within scaffold"
contig           no          "between scaffolds"
repeat           no          "repeat between scaffolds"
centromere       no          "centromere"
telomere         no          "telomere"
short_arm        no          "short_arm"
heterochromatin  no          "heterochromatin"
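Table 2 can likewise be sketched as a lookup from an AGP v2.0 gap to the value of the /gap_type qualifier. The names V20_TO_INSDC_GAP_TYPE and gap_type_qualifier are hypothetical, and the qualifier string produced here is only an illustration of the usual /qualifier="value" form; it is not taken from any database's actual flat-file output.

```python
# (v2.0 gap_type, v2.0 linkage) -> INSDC /gap_type value, per Table 2
V20_TO_INSDC_GAP_TYPE = {
    ("scaffold", "yes"): "within scaffold",
    ("repeat", "yes"): "repeat within scaffold",
    ("contig", "no"): "between scaffolds",
    ("repeat", "no"): "repeat between scaffolds",
    ("centromere", "no"): "centromere",
    ("telomere", "no"): "telomere",
    ("short_arm", "no"): "short_arm",
    ("heterochromatin", "no"): "heterochromatin",
}


def gap_type_qualifier(gap_type, linkage):
    """Return an illustrative /gap_type qualifier for a v2.0 gap."""
    value = V20_TO_INSDC_GAP_TYPE.get((gap_type, linkage))
    if value is None:
        raise ValueError(
            f"no Table 2 entry for gap_type={gap_type!r}, linkage={linkage!r}"
        )
    return f'/gap_type="{value}"'


if __name__ == "__main__":
    print(gap_type_qualifier("scaffold", "yes"))
    print(gap_type_qualifier("repeat", "no"))
```

Note that the linkage value is needed to choose the qualifier: a "repeat" gap yields a different /gap_type depending on whether linkage is "yes" or "no".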

February 28, 2012