Proposed AGP Specification v2.1
Introduction:
What it is: Describes the assembly of a larger sequence object from smaller objects. The large object can be a contig, a scaffold (supercontig), or a chromosome. Each line (row) of the AGP file describes a different piece of the object, and has the column entries defined below. Extended comments follow.
What it is not: neither a description of how sequence reads were assembled, nor a description of the alignments between components used to construct a larger object. Not all of the information in proprietary assembly files can be represented in the AGP format. It is also not for recording the spans of features like repeats or genes.
Changes from v2.0 to v2.1:
This version supersedes version 2.0 of the AGP file specification. The changes are:
- 'proximity_ligation' added to the set of accepted linkage evidence values
- 'pcr' added to the set of accepted linkage evidence values
- 'contamination' gap-type added
- definition of 'paired-ends' linkage evidence expanded to include 'mate-pairs' and molecular-barcode techniques
Definitions:
- Contig:
- a non-redundant sequence formed by joining, based on sequence overlap, one or more smaller sequences. The smaller sequences are typically sequences that have been submitted to the International Sequence Database Collaboration (GenBank/EMBL/DDBJ). There should be no gaps in a sequence contig (although there may be short runs of Ns due to ambiguous base calls).
- Scaffold (supercontig):
- a non-redundant sequence formed by joining one or more sequence contigs. The distinction is that no sequence overlap is required to construct the larger sequence. Additional information, such as clone end analysis, can support the relationship. There can be, and typically there are, gaps in a scaffold.
- Gap:
- a subregion of an object where there is no known sequence, generally represented as a run of the letter ‘N’.
- Component:
- a sequence used to construct a larger sequence.
File Format:
One feature of the AGP file is that column definitions change depending on whether the line is a component line or a gap line. Columns 1 through 5 have a single definition; each of columns 6 through 9 has two definitions, selected by the value in column 5.
column | content | description |
---|---|---|
1 | object | This is the identifier for the object being assembled. This can be a chromosome, scaffold or contig. If an accession.version identifier is not used to describe the object the naming convention is to precede chromosome numbers with ‘chr’ (e.g. chr1) and linkage group numbers with ‘LG’ (e.g. LG3). Contigs or scaffolds may have any identifier that is unique within the assembly. |
2 | object_beg | The starting coordinate of the component/gap on the object in column 1. This is a location in the object’s coordinate system, not the component’s. |
3 | object_end | The ending coordinate of the component/gap on the object in column 1. This is a location in the object’s coordinate system, not the component’s. |
4 | part_number | The line count for the components/gaps that make up the object described in column 1. |
5 | component_type | The sequencing status of the component. These typically correspond to keywords in the International Sequence Database (GenBank/EMBL/DDBJ) submission. Current acceptable values are: |
6a | component_id | If column 5 not equal to N or U: This is a unique identifier for the sequence component contributing to the object described in column 1. Ideally this will be a valid accession.version identifier as assigned by GenBank/EMBL/DDBJ. If the sequence has not yet been submitted to a public repository, a local identifier should be used. |
6b | gap_length | If column 5 equal to N or U: This column represents the length of the gap. N type gaps can be of any length. A length of 100 must be used for all U type gaps. |
7a | component_beg | If column 5 not equal to N or U: This column specifies the beginning of the part of the component sequence that contributes to the object in column 1 (in component coordinates). |
7b | gap_type | If column 5 equal to N or U: This column specifies the gap type. Accepted values: |
8a | component_end | If column 5 not equal to N or U: This column specifies the end of the part of the component that contributes to the object in column 1 (in component coordinates). |
8b | linkage | If column 5 equal to N or U: This column indicates if there is evidence of linkage between the adjacent lines. Values: yes, no. |
9a | orientation | If column 5 not equal to N or U: This column specifies the orientation of the component relative to the object in column 1. Values: +, -, ?, 0, na. By default, components with unknown orientation (?, 0 or na) are treated as if they had + orientation. |
9b | Linkage evidence | If column 5 equal to N or U: This specifies the type of evidence used to assert linkage (as indicated in column 8b). If there are multiple lines of evidence to support linkage, all can be listed using a ‘;’ delimiter (e.g. paired-ends;align_xgenus). Accepted values: |
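The way columns 6 through 9 change meaning with column 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Component`, `Gap` and `parse_agp_line` are hypothetical and not part of the specification, and no controlled vocabulary is checked.

```python
from dataclasses import dataclass
from typing import List, Union

GAP_COMPONENT_TYPES = {"N", "U"}  # column 5 values that mark a gap line


@dataclass
class Component:  # hypothetical record for a component line
    obj: str
    object_beg: int
    object_end: int
    part_number: int
    component_type: str
    component_id: str   # column 6a
    component_beg: int  # column 7a
    component_end: int  # column 8a
    orientation: str    # column 9a


@dataclass
class Gap:  # hypothetical record for a gap line
    obj: str
    object_beg: int
    object_end: int
    part_number: int
    component_type: str
    gap_length: int              # column 6b
    gap_type: str                # column 7b
    linkage: str                 # column 8b
    linkage_evidence: List[str]  # column 9b, split on ';'


def parse_agp_line(line: str) -> Union[Component, Gap]:
    """Split one tab-delimited AGP line and interpret columns 6-9 by column 5."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, got {len(cols)}")
    shared = (cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4])
    if cols[4] in GAP_COMPONENT_TYPES:
        return Gap(*shared, int(cols[5]), cols[6], cols[7], cols[8].split(";"))
    return Component(*shared, cols[5], int(cols[6]), int(cols[7]), cols[8])
```

Returning two distinct record types keeps the "a" and "b" column definitions from being mixed up later in a pipeline; a single record with optional fields would also work.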
Extended comments:
- Columns should be tab delimited. Lines end with a new line (\n). There should be no extra space around the individual tokens.
- All coordinates given in the file are 1-based and inclusive (not 0-based); that is, the first base of an object is 1, not 0.
- Evidence of linkage. In general, evidence of linkage is provided by paired-ends (sometimes referred to as mate pairs), although other evidence could be used. In some cases, evidence of linkage may be indirect. For example, given the scaffold A--B--C--D, where A, B, C and D are components, there could be paired-ends linking A and B and paired-ends linking A and C. There might be no paired-ends linking B and C, but their linkage is implied. Use paired-ends as the linkage evidence for the gaps between A/B and B/C.
- If the object is a sequence contig or scaffold, the object should not start or end with a gap line. A chromosome will frequently start or end with one or more biological gap types (e.g. telomere or short_arm).
- A gap of type scaffold will usually be flanked by components and not by other gap lines. Successive gap lines are generally discouraged, except where the gaps represent a biologically defined entity (such as centromere, heterochromatin, etc.).
- Coordinates of the object are all with respect to the plus strand, no matter the orientation of the component.
- object_beg (column 2) should always be less than or equal to object_end (column 3).
- component_beg (column 7) should always be less than or equal to component_end (column 8).
- Each object must start with a part_number of 1 (column 4) and an object_beg coordinate of 1 (column 2).
- Gap lengths must be positive. Negative gaps and gap lines with zero length are not valid.
- For gaps whose estimated size is negative, or whose size is unknown, use U as the component_type and 100 as the gap_length, since 100 is the GenBank/EMBL/DDBJ standard for gaps of unknown size.
- Gaps between sequence contigs in an HTGS_PHASE1 BAC clone will typically have a gap-type of ‘scaffold’, linkage ‘yes’ and evidence type ‘within_clone’. The component type should be U, and the gap size 100 (entered in columns 5 & 6b).
- Use an orientation of ‘+’ for the component of any unplaced scaffold composed of a single component (a singleton scaffold).
- The use of comment lines, starting with a # symbol, at the head of the file is encouraged. Comment lines must not appear within the body of the AGP. Useful information to include in such headers is:
  - agp-version pragma (e.g. ##agp-version 2.1)
  - organism name
  - assembly name
  - a description of any non-standard object identifiers
- Linkage evidence of type map should only be used when the map provides evidence of linkage between adjacent sequence contigs. It should not be used when a map has been used to order and orient scaffolds on a chromosome.
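The coordinate rules in the comments above (1-based inclusive ranges, part numbers starting at 1, and U with a length of 100 for gaps of unknown or negative size) can be illustrated with a small row builder. This is a sketch, not a reference implementation: `build_agp_rows` and the dictionary keys it reads are hypothetical names chosen for this example.

```python
UNKNOWN_GAP_LENGTH = 100  # length required for all U type gaps


def build_agp_rows(obj, parts):
    """Return tab-delimited AGP body lines for one object.

    `parts` is an ordered list of dicts (a hypothetical input shape):
      component: {"type", "id", "beg", "end", "orientation"}
      gap:       {"length" (None if unknown), "gap_type", "linkage", "evidence"}
    """
    rows = []
    pos = 1  # every object starts at object_beg 1
    for part_number, p in enumerate(parts, start=1):  # part_number starts at 1
        if "id" in p:
            length = p["end"] - p["beg"] + 1  # inclusive span on the component
            tail = [p["type"], p["id"], p["beg"], p["end"], p["orientation"]]
        else:
            length = p.get("length")
            if length is None or length <= 0:
                # unknown or negative size: U gap with the fixed length
                ctype, length = "U", UNKNOWN_GAP_LENGTH
            else:
                ctype = "N"
            tail = [ctype, length, p["gap_type"], p["linkage"], p["evidence"]]
        head = [obj, pos, pos + length - 1, part_number]
        rows.append("\t".join(str(v) for v in head + tail))
        pos += length
    return rows


# Usage: "W" is used here as an example component_type value.
parts = [
    {"type": "W", "id": "ctgA", "beg": 1, "end": 500, "orientation": "+"},
    {"length": None, "gap_type": "scaffold", "linkage": "yes", "evidence": "paired-ends"},
    {"type": "W", "id": "ctgB", "beg": 1, "end": 300, "orientation": "-"},
]
print("##agp-version 2.1")  # header comment lines go before the body only
print("\n".join(build_agp_rows("scaffold1", parts)))
```

Tracking a single running position makes the object ranges contiguous by construction, so object_beg and object_end never need to be supplied by the caller.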
Describing breaks and continuity:
Continuity is described by the combination of gap_type (column 7b) and linkage (column 8b), which together indicate how to build the object. The first version of this specification did not define how to use these columns, so usage has diverged. Below is a proposal for how this information should be encoded.
Gap_type | Linkage | Interpretation and description |
---|---|---|
Within-scaffold gaps: sequences on either side of the gap are in a single scaffold. | | |
scaffold | yes | Do not break scaffold. There is evidence linking sequence contigs on both sides of the gap. |
repeat | yes | Do not break scaffold. If an unresolvable repeat unit is spanned by linkage evidence, the linkage will be ‘yes’. |
contamination | yes | Do not break scaffold. Treated as linked to preserve the original scaffold but with linkage evidence 'unspecified'. |
Scaffold-breaking gaps: sequences on either side of the gap are in separate scaffolds. | | |
contig | no | Break scaffold. A contig gap indicates there is no evidence to link the adjacent sequence contigs. |
repeat | no | Break scaffold. If an unresolvable repeat unit is not spanned by linkage evidence, the linkage will be ‘no’. |
centromere/ short_arm/ heterochromatin/ telomere | no | Break scaffold. Gaps with these biological types are used for laying out scaffolds along a chromosome. |
Invalid gap/linkage combinations | | |
contig | yes | Invalid. If there is evidence of linkage between the adjacent sequence contigs, the gap type should be scaffold. |
scaffold | no | Invalid. If there is no evidence of linkage between the adjacent sequence contigs, the gap type should be contig. |
centromere/ short_arm/ heterochromatin/ telomere | yes | Invalid. It is invalid to use these biological types within a scaffold. |
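The table above maps directly onto a lookup keyed by (gap_type, linkage). The sketch below encodes only the rows listed in the table; the function name `interpret_gap` and the result labels are hypothetical, and any combination the table does not mention is reported as "not listed" rather than guessed.

```python
BIOLOGICAL_GAP_TYPES = ("centromere", "short_arm", "heterochromatin", "telomere")

# (gap_type, linkage) -> interpretation, one entry per row of the table
GAP_RULES = {
    ("scaffold", "yes"): "do not break",
    ("repeat", "yes"): "do not break",
    ("contamination", "yes"): "do not break",
    ("contig", "no"): "break",
    ("repeat", "no"): "break",
    ("contig", "yes"): "invalid",
    ("scaffold", "no"): "invalid",
}
for _gap_type in BIOLOGICAL_GAP_TYPES:
    GAP_RULES[(_gap_type, "no")] = "break"
    GAP_RULES[(_gap_type, "yes")] = "invalid"


def interpret_gap(gap_type, linkage):
    """Return 'do not break', 'break', 'invalid', or 'not listed'."""
    return GAP_RULES.get((gap_type, linkage), "not listed")
```

For example, `interpret_gap("repeat", "no")` returns `"break"`, while `interpret_gap("contig", "yes")` returns `"invalid"`.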
Describing scaffolds with unknown orientation:
Scaffolds can sometimes be positioned along a chromosome or linkage group without there being sufficient data to orient them. Such placed but unoriented scaffolds can be indicated, in an AGP that specifies how a chromosome or linkage group is assembled from scaffolds, by using ‘?’ in the orientation column (9a) (see the example “chromosome from scaffolds”).

It is not appropriate to use an orientation of ‘?’ in an AGP that specifies how a chromosome is assembled from components, except for components that are not scaffolded to other components (singletons). Using ‘?’ for all the components in a multi-component scaffold is misleading, because it implies that each component lies at the position indicated but could be in either orientation. In fact, depending on the orientation of the scaffold, the components of an unoriented multi-component scaffold either lie at the indicated position in the ‘+’ orientation (the default) or at a different position in the ‘-’ orientation.

The preferred method to indicate that scaffolds have been placed but their orientation is unknown is to provide two AGP files, the first building scaffolds from components and the second building chromosomes from scaffolds. The unknown orientation of a scaffold is then indicated in the chromosome-from-scaffold AGP file with a ‘?’.
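To see why a per-component ‘?’ is misleading, it helps to compute where a component lands on the chromosome under each scaffold orientation. The following sketch assumes 1-based inclusive coordinates as defined in this specification; `place_component` and its parameters are hypothetical names used only for illustration.

```python
FLIP = {"+": "-", "-": "+"}


def place_component(scaf_len, chr_beg, comp_beg, comp_end, comp_orient, scaf_orient):
    """Map a component's scaffold coordinates onto the chromosome.

    scaf_len:  length of the scaffold
    chr_beg:   chromosome coordinate where the scaffold starts
    comp_beg, comp_end: component span in scaffold coordinates
    Returns (chr_start, chr_end, orientation_on_chromosome).
    """
    if scaf_orient == "-":
        # Reversing the scaffold mirrors every position and flips the strand.
        beg = chr_beg + (scaf_len - comp_end)
        end = chr_beg + (scaf_len - comp_beg)
        return beg, end, FLIP.get(comp_orient, comp_orient)
    # '+' orientation, which is also the default for unknown orientation
    return chr_beg + comp_beg - 1, chr_beg + comp_end - 1, comp_orient


# A 900-base scaffold placed at chromosome position 1001, first component at 1..500:
print(place_component(900, 1001, 1, 500, "+", "+"))  # (1001, 1500, '+')
print(place_component(900, 1001, 1, 500, "+", "-"))  # (1401, 1900, '-')
```

The same component occupies a different chromosome range in each case, which a single line with orientation ‘?’ cannot express.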
Validation:
File structure needs to be validated in the following ways:
- Columns are tab delimited
- All columns of numeric data must contain positive integers
- Accession identifiers must be valid, and must include a version number
- Columns with controlled values must only use those values
- All columns must have some data
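A few of the structural checks above can be sketched for a single line as follows. This hypothetical `check_structure` helper covers only tab delimiting, empty columns, stray whitespace and positive integers; accession validation and controlled vocabularies are deliberately left out because they depend on external data and on value lists not reproduced here.

```python
def check_structure(line):
    """Return a list of error messages for one AGP body line (empty if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-delimited columns, found {len(cols)}"]

    errors = []
    for i, value in enumerate(cols, start=1):
        if value == "":
            errors.append(f"column {i} is empty")
        elif value != value.strip():
            errors.append(f"column {i} has surrounding whitespace")

    # Columns 2-4 are always numeric; column 6 is numeric on gap lines,
    # columns 7 and 8 are numeric on component lines.
    numeric = [2, 3, 4] + ([6] if cols[4] in ("N", "U") else [7, 8])
    for i in numeric:
        value = cols[i - 1]
        if not (value.isascii() and value.isdigit() and int(value) > 0):
            errors.append(f"column {i} must be a positive integer, got {value!r}")
    return errors
```

Returning messages instead of raising on the first problem lets a caller report every issue on a line in one pass.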
File content needs to be validated in the following ways:
- Each object must start with a part_number of 1 and an object_beg coordinate of 1.
- All object ranges must be sequential and non-overlapping
- object_beg must be less than or equal to object_end
- component_beg must be less than or equal to component_end
- The span specified for a component must be valid.
- The length of the span specified for the component (in columns 7 and 8) must match the length of the span specified for the object (in columns 2 and 3).
- If two components are adjacent with no gap line between them, the defined switch points should be consistent with an alignment of the two components.
- All gap lengths must be 1 base or longer.
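Several of the content checks above can be run in a single pass over the body lines. The sketch below makes two assumptions that go slightly beyond the text: it treats "sequential and non-overlapping" as meaning each line starts one base after the previous line ends, and it expects lines for the same object to be grouped together. `check_content` is a hypothetical name, and the alignment and component-validity checks are not attempted.

```python
def check_content(lines):
    """Return (line_number, message) tuples for content problems."""
    errors = []
    prev_obj, prev_end, prev_part = None, 0, 0
    for n, line in enumerate(lines, start=1):
        c = line.rstrip("\n").split("\t")
        obj, beg, end, part = c[0], int(c[1]), int(c[2]), int(c[3])
        if obj != prev_obj:
            # New object: the first line must have part_number 1 and object_beg 1.
            prev_obj, prev_end, prev_part = obj, 0, 0
        if part != prev_part + 1:
            errors.append((n, "part_number out of sequence"))
        if beg != prev_end + 1:
            errors.append((n, "object range does not follow previous line"))
        if beg > end:
            errors.append((n, "object_beg greater than object_end"))
        if c[4] in ("N", "U"):
            if int(c[5]) < 1:
                errors.append((n, "gap length shorter than 1 base"))
        else:
            comp_beg, comp_end = int(c[6]), int(c[7])
            if comp_beg > comp_end:
                errors.append((n, "component_beg greater than component_end"))
            if comp_end - comp_beg != end - beg:
                errors.append((n, "component span length differs from object span length"))
        prev_end, prev_part = end, part
    return errors
```

Resetting the running state whenever column 1 changes is what enforces the "start at 1" rule for every object without a separate check.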
Examples:
- Scaffold from component (WGS)
- Chromosome from scaffold (WGS)
- Chromosome from component (WGS)
- Chromosome from component (BAC)