
AGP Specification 1.0

Valid until October 1, 2006

Below is the original AGP file format specification. An updated version of the specification is also available.

The AGP format is used to describe the assembly of an object. This object can be
a chromosome, a contig or a supercontig (scaffold). Each line (row)
of the AGP file describes a different piece of the object, and has the column entries
defined below. Extended comments follow. The format was initially developed
during the early assembly phase of the human genome by UCSC, EBI and NCBI.
Special thanks to UCSC for their nice web site (where I was able to obtain additional
information).

column content description
1 object This is the identifier for the object being assembled. This can be a chromosome, contig or supercontig (scaffold). If the object is a chromosome: the naming convention is the chromosome number preceded by the letters chr.
  Ex: chr1
If the object is a contig or supercontig, the identifier needs to be unique within the assembly.
2 object_beg The starting coordinate on the object in column 1.
3 object_end The ending coordinate on the object in column 1.
4 part_number The line count for the components that make up the
object described in column 1. All components (sequence
and gaps) are counted.
5 component_type The sequence status of the component. Current values:
 A=Active Finishing
 D=Draft HTG
 F=Finished HTG
 G=Whole Genome Finishing
 P=Pre Draft
 N=gap
 O=Other sequence
 W=WGS contig

The value in this field will determine the values of items in some of the remaining fields. Gap lines (N) have a different structure than sequence component lines.
6a component_id If column 5 not equal to N: This is a unique identifier for the sequence component contributing to the object described in column 1. If the components have been submitted to a public repository (GenBank/EMBL/DDBJ) this value should be the accession.version of the component. Otherwise it should be an identifier that is unique within the assembly.
6b gap_length If column 5 equal to N: This column represents the length of the
gap.
7a component_start If column 5 not equal to N: This column specifies the beginning
of the component sequence that contributes to the
object in column 1. (in component coordinates)
7b gap_type

If column 5 equal to N: This column specifies the gap type. Fundamentally, there are two types of gaps, captured and uncaptured. In some cases, uncaptured gaps are assigned biological value (e.g. centromere).
Accepted Values:
 Captured gaps:
  fragment: gap between two sequence contigs (also called a "sequence gap")
 Uncaptured gaps:
  split_finished: a specialized gap between two finished sequence contigs [OBSOLETE]
  clone: a gap between two clones that do not overlap
  contig: a gap between clone contigs (also called a "layout gap")
  centromere: a gap inserted for the centromere
  short_arm: a gap inserted at the start of an acrocentric chromosome
  heterochromatin: a gap inserted for an especially large region of
  heterochromatic sequence (may also include the centromere)
  telomere: a gap inserted for the telomere

8a component_end If column 5 not equal to N: This column specifies the end of the part of the component that contributes to the object in column 1. (in component coordinates)
8b linkage

If column 5 equal to N: This column indicates if there is evidence of linkage between the adjacent lines. Values:
  yes
  no

9a orientation

If column 5 not equal to N: This column specifies the orientation of the component relative to the object in column 1. Values:
  +
  -

If column 5 equal to N, this column is empty.
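
The column layout above can be sketched as a small Python parser. This is only an
illustration, not part of the specification: the names parse_agp_line, Component and Gap
are hypothetical, and the sketch assumes comments have already been removed from the line.

```python
from typing import NamedTuple, Union


class Component(NamedTuple):
    object_name: str      # column 1
    object_beg: int       # column 2
    object_end: int       # column 3
    part_number: int      # column 4
    component_type: str   # column 5 (A, D, F, G, P, O or W)
    component_id: str     # column 6a
    component_start: int  # column 7a
    component_end: int    # column 8a
    orientation: str      # column 9a (+ or -)


class Gap(NamedTuple):
    object_name: str      # column 1
    object_beg: int       # column 2
    object_end: int       # column 3
    part_number: int      # column 4
    component_type: str   # column 5 (always N)
    gap_length: int       # column 6b
    gap_type: str         # column 7b
    linkage: str          # column 8b (yes or no)


def parse_agp_line(line: str) -> Union[Component, Gap]:
    """Split one tab-delimited AGP 1.0 line into a Component or a Gap record."""
    cols = line.rstrip("\r\n").split("\t")
    name, beg, end, part, ctype = cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4]
    if ctype == "N":
        # Gap line: columns 6-8 hold gap_length, gap_type and linkage; column 9 is empty.
        return Gap(name, beg, end, part, ctype, int(cols[5]), cols[6], cols[7])
    # Sequence line: columns 6-9 hold component_id, start, end and orientation.
    return Component(name, beg, end, part, ctype, cols[5], int(cols[6]), int(cols[7]), cols[8])
```

For instance, parse_agp_line("chr1\t1001\t1100\t2\tN\t100\tfragment\tyes\t") would yield a Gap
record; the values in that line are invented for illustration.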

Extended comments:
1. Columns should be tab-delimited. Lines end with a newline (\n). There should be no
extra whitespace around the individual tokens.

2. All coordinates given in the file are 1-based and inclusive (not 0-based); that is, the
first base of an object is 1 (not 0).
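
Because both ends of a range are included, the length of a span is end - beg + 1. The
following Python sketch shows that arithmetic and one way to map AGP coordinates onto
Python's 0-based, half-open slices. The helper names span_length and to_slice are
hypothetical.

```python
def span_length(beg: int, end: int) -> int:
    """Number of bases covered by a 1-based inclusive range."""
    return end - beg + 1


def to_slice(beg: int, end: int) -> slice:
    """Convert a 1-based inclusive range to a 0-based half-open Python slice."""
    return slice(beg - 1, end)
```

With these helpers, "ACGTACGT"[to_slice(3, 5)] selects bases 3 through 5 of the string.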

3. Evidence of linkage. In general, evidence of linkage is provided by end pairs
(sometimes referred to as mate pairs), although other evidence could be used (such as
transcript alignments). In some cases, evidence of linkage may be indirect. For example,
given the following supercontig:
A---B---C----D

Where A, B, C, and D are components, there could be end pairs linking A and B
and end pairs linking A and C. There might be no pairs linking B and C, but
their linkage is implied because both are linked to A.

4. If the object is a contig or supercontig, the object should not end with a gap line.
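
A check for this rule might look like the Python sketch below. It takes the column 5
values of one object in file order. The function names are hypothetical, and the sketch
assumes that chromosome objects can be recognized by the chr naming convention described
for column 1.

```python
def ends_with_gap(component_types):
    """True if the last line of an object is a gap line (column 5 is N)."""
    return bool(component_types) and component_types[-1] == "N"


def violates_rule_4(object_id, component_types):
    """True if a contig or supercontig object ends with a gap line."""
    # Assumption: identifiers starting with "chr" are chromosomes, which this rule exempts.
    is_chromosome = object_id.startswith("chr")
    return (not is_chromosome) and ends_with_gap(component_types)
```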

5. Coordinates are all with respect to the plus strand, no matter the orientation of the
component.
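
One reading of this rule is that the component range is always cut from the component's
plus strand, and only then reverse-complemented when the orientation is -. The Python
sketch below follows that reading. The function name component_piece is hypothetical,
and only the bases A, C, G, T (in either case) are complemented.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def component_piece(sequence: str, start: int, end: int, orientation: str) -> str:
    """Return the part of a component that contributes to the object.

    start and end are 1-based inclusive plus-strand coordinates (columns 7a and 8a).
    """
    piece = sequence[start - 1:end]
    if orientation == "-":
        # Complement each base, then reverse the string.
        piece = piece.translate(_COMPLEMENT)[::-1]
    return piece
```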

6. object_beg should always be less than or equal to object_end.

7. component_start should always be less than or equal to component_end.
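
The two ordering rules above (comments 6 and 7), together with the 1-based convention
from comment 2, could be checked with a sketch like this one. The function name
check_coordinates and the message texts are hypothetical; the component arguments refer
to columns 7a and 8a and are omitted for gap lines.

```python
def check_coordinates(object_beg, object_end, component_start=None, component_end=None):
    """Return a list of rule violations; the list is empty if the line passes."""
    problems = []
    if object_beg < 1:
        problems.append("object_beg is less than 1")
    if object_beg > object_end:
        problems.append("object_beg is greater than object_end")
    # Gap lines carry no component coordinates, so this check is skipped for them.
    if component_start is not None and component_end is not None:
        if component_start > component_end:
            problems.append("component start is greater than component_end")
    return problems
```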

8. Any text after a # symbol is treated as a comment.
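
A reader of the file would typically drop comments before splitting a line into columns.
A minimal Python sketch follows; the function name strip_comment is hypothetical. It
removes trailing spaces and line endings but keeps tabs, so the empty last column of a
gap line survives.

```python
def strip_comment(line: str) -> str:
    """Remove a trailing # comment and the line ending from one AGP line."""
    content = line.split("#", 1)[0]
    # Strip spaces and newline characters only; a trailing tab marks an empty column.
    return content.rstrip(" \r\n")
```

A line that consists only of a comment becomes an empty string and can be skipped.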


Page last updated: October 3, 2006