Assembly Terminology

Below is a list of commonly used terms and definitions in the field of genomics.

Describing Assemblies

Alternate locus:
A sequence that provides an alternate representation of a locus found in a largely haploid assembly. These sequences don't represent a complete chromosome sequence. There is no hard limit on the size of an alternate locus, although currently all are less than 1 Mb. Previously these sequences have been referred to as "partial chromosomes", "alternate alleles", and "alternate haplotypes". However, these names are confusing because they contain terms that have biological implications. Diploid assemblies (which by definition are from a single individual) should not have alternate loci representations. Multiple scaffolds from different loci that are considered to be part of the same haplotype should be grouped into alternate locus groups (e.g. mouse 129/Sv group). Note: an alternate locus group was previously considered an alternate partial assembly.
Assembly:
a set of chromosomes, unlocalized and unplaced (random) sequences and alternate loci used to represent an organism's genome. Most current assemblies are a haploid representation of an organism's genome, although some loci may be represented more than once (see Alternate locus, above). This representation may be obtained from a single individual (e.g. chimp or mouse) or multiple individuals (e.g. human reference assembly). Except in the case of organisms which have been bred to homozygosity, the haploid assembly does not typically represent a single haplotype, but rather a mixture of haplotypes. As sequencing technology evolves, it is anticipated that diploid sequences representing an individual's genome will become available.
Assembly Units:
Collections of sequences used to define discrete parts of an assembly. For example, the Primary assembly is considered one assembly unit. Alternate loci grouped together by a common name (e.g. 129/Sv in mouse) would be considered a separate assembly unit. In many cases, the assembly units are what many people previously considered 'assemblies'.
Chromosome Assembly:
a relatively complete pseudo-molecule assembled from smaller sequences (components) that represent a biological chromosome. Relatively complete implies that some gaps may still be present in the assembly, but independent measures suggest that most of the sequence is represented by sequenced bases. Completeness is submitter-defined. Understanding completeness is important for determining whether we submit chromosome level ASN for that chromosome.
Diploid Assembly:
A genome assembly for which a Chromosome Assembly is available for both sets of an individual's chromosomes. A diploid genome assembly is expected to represent the genome of a single individual. Alternate loci are therefore not expected to be defined for this assembly, although unlocalized or unplaced sequences could still be part of it.
Genome Patch:
A contig sequence that is released outside of the full assembly release cycle. These sequences are meant to add information to the assembly without disrupting the stable coordinate system. There are two types of patches: FIX and NOVEL. FIX patches are released to correct an error in the assembly and will be removed when the new full assembly is released. NOVEL patches are sequences that were not in the last full assembly release and will be retained with the next full assembly release.
Haploid Assembly:
The collection of Chromosome assemblies, unlocalized and unplaced sequences and alternate loci that represent an organism's genome. Any locus may be represented 0, 1 or >1 times, but entire chromosomes are only represented 0 or 1 times.
PAR:
Pseudo-autosomal region. A region found on the X and Y chromosomes of mammals that allows recombination between the sex chromosomes. In human, the regions are defined on the X chromosome and the sequence from the X chromosome is copied onto the Y, but this is not a requirement for representing the PAR.
PATCH:
A genome patch is a scaffold sequence that is part of a minor genome release. These sequences either correct errors in the assembly (a FIX patch) or add additional alternate loci (a NOVEL patch). These sequences allow us to update the assembly information without disrupting the chromosome coordinate system. FIX patches will be removed at the next major assembly release as the changes will be rolled into the new assembly. NOVEL patches will be moved from the PATCHES assembly unit to a proper assembly unit.
Primary Assembly:
Relevant for haploid assemblies only. The primary assembly represents the collection of assembled chromosomes, unlocalized and unplaced sequences that, when combined, should represent a non-redundant haploid genome. This excludes any of the alternate locus groups.
Unlocalized Sequence:
A sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome.
Unplaced Sequence:
A sequence found in an assembly that is not associated with any chromosome.
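
The relationships among these terms can be sketched as a small data model. This is a minimal illustration only, not a description of any real database schema: the `Sequence` class, the role strings, the sequence names and the `ALT_NEW` unit name are all hypothetical. It shows how the primary assembly excludes alternate loci and patches, and how FIX and NOVEL patches are treated at the next major release.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sequence:
    name: str
    # Hypothetical role labels: "chromosome", "unlocalized", "unplaced",
    # "alt_locus", "fix_patch", "novel_patch".
    role: str
    chromosome: Optional[str] = None  # None for unplaced sequences
    unit: str = "Primary"             # assembly unit the sequence belongs to


def primary_assembly(seqs: List[Sequence]) -> List[Sequence]:
    """Chromosomes, unlocalized and unplaced sequences; no alternate loci or patches."""
    primary_roles = {"chromosome", "unlocalized", "unplaced"}
    return [s for s in seqs if s.unit == "Primary" and s.role in primary_roles]


def next_major_release(seqs: List[Sequence], novel_unit: str = "ALT_NEW") -> List[Sequence]:
    """Drop FIX patches (their changes are rolled into the new assembly) and
    move NOVEL patches out of the PATCHES unit into another assembly unit.
    The target unit name is an assumption made for this sketch."""
    released = []
    for s in seqs:
        if s.role == "fix_patch":
            continue
        if s.role == "novel_patch":
            released.append(Sequence(s.name, "alt_locus", s.chromosome, novel_unit))
        else:
            released.append(s)
    return released


# Made-up example sequences.
seqs = [
    Sequence("chr1", "chromosome", "1"),
    Sequence("chr1_unloc_1", "unlocalized", "1"),
    Sequence("unplaced_1", "unplaced"),
    Sequence("alt_1", "alt_locus", "1", unit="ALT_A"),
    Sequence("fix_1", "fix_patch", "1", unit="PATCHES"),
    Sequence("novel_1", "novel_patch", "1", unit="PATCHES"),
]

print([s.name for s in primary_assembly(seqs)])
print([(s.name, s.unit) for s in next_major_release(seqs)])
```

Note that the unlocalized sequence carries a chromosome assignment while the unplaced sequence does not, which mirrors the two definitions above.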

Building Assemblies

AGP File:
a file used to describe the instructions for building a contig, scaffold or chromosome sequence. This file specifies the order, orientation and switch points for each component. For more information on AGP files, see the AGP page.
Contig:
a contiguous sequence generated from determining the non-redundant path along an ordered set of component sequences. A contig should contain no gaps (Figure 1), but the terms contig and scaffold are often used interchangeably.
Component:
a low-level genomic sequence used to construct the genome; typically these are clone sequences, WGS sequences or PCR fragments. These sequences must be submitted to GenBank/EMBL/DDBJ (Figure 2).
Join:
the sequence overlap between two adjacent components in a contig. Figure 2 shows the different types of joins.
Scaffold:
an ordered and oriented set of contigs. A scaffold will contain gaps, but there is typically some evidence to support the contig order, orientation and gap size estimates.
Switch point:
the base at which the contig sequence stops being generated from one component sequence and switches to using the next component sequence. There must be at least one switch point between adjacent component sequences in a contig (Figure 2).
TPF:
short for Tiling Path File. A TPF provides the order of the component sequences used to build a contig, scaffold or chromosome. For more information, see the TPF Specification.
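
As a rough illustration of how an AGP file drives sequence construction, the sketch below builds objects from AGP-style rows. It assumes the nine-column, tab-delimited layout with 1-based inclusive coordinates, in which component type `N` or `U` marks a gap line, and it assumes the rows for each object are already in part order. The component names and sequences are made up, and `build_from_agp` is a hypothetical helper, not part of any NCBI tool.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def build_from_agp(agp_text: str, components: dict) -> dict:
    """Concatenate component slices and gaps into one sequence per object."""
    objects = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        obj, part_type = cols[0], cols[4]
        if part_type in ("N", "U"):
            # Gap line: column 6 holds the gap length.
            piece = "N" * int(cols[5])
        else:
            # Component line: columns 6-9 hold id, begin, end, orientation.
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            piece = components[comp_id][beg - 1:end]
            if orient == "-":
                piece = reverse_complement(piece)
        objects[obj] = objects.get(obj, "") + piece
    return objects


# Made-up components: the last 4 bases of compA overlap the first 4 of compB.
components = {"compA": "ACGTACGTAC", "compB": "GTACTTTTGG"}
rows = [
    # A gapless contig: compA is used through base 8 (the switch point),
    # then compB is used from its base 3 onward.
    ["ctg1", "1", "8", "1", "W", "compA", "1", "8", "+"],
    ["ctg1", "9", "16", "2", "W", "compB", "3", "10", "+"],
    # A scaffold: two pieces separated by a 5-base gap, the second reversed.
    ["scf1", "1", "10", "1", "W", "compA", "1", "10", "+"],
    ["scf1", "11", "15", "2", "N", "5", "scaffold", "yes", "paired-ends"],
    ["scf1", "16", "25", "3", "W", "compB", "1", "10", "-"],
]
agp_text = "\n".join("\t".join(r) for r in rows)

built = build_from_agp(agp_text, components)
print(built["ctg1"])
print(built["scf1"])
```

The contig rows carry the switch points implicitly: the component end coordinate on one line and the component begin coordinate on the next say where one component stops being used and the following one takes over.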

figure showing contig building

Figure 1. Graphical representation of building a contig. The short blue lines labelled Component# are the low level sequences used to build the contig. These will typically have GenBank accessions. The regions where adjacent components overlap are represented by the small vertical lines, the second of which is labelled 'join'. In order to generate the non-redundant sequence of the contig, the sequence of the first clone is used until the first switch point and then the sequence of the second clone is used. In this example, the first part of the second component is not used, so any sequence differences in this component will not be represented in the final contig sequence.
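
The effect described in the caption can be sketched for two overlapping components. This is an illustrative toy, not the method used by any assembly pipeline: `build_contig`, its parameters and the sequences are hypothetical, and positions are 1-based on the first component.

```python
def build_contig(first: str, second: str, switch_point: int, overlap_start: int) -> str:
    """Join two overlapping components into a non-redundant contig.

    switch_point:  last base of `first` that is used (1-based).
    overlap_start: position in `first` where base 1 of `second` aligns (1-based).
    """
    if not (overlap_start - 1 <= switch_point <= len(first)):
        raise ValueError("switch point must lie within or just before the overlap")
    # 0-based index in `second` of the base that follows the switch point.
    resume = switch_point - overlap_start + 1
    return first[:switch_point] + second[resume:]


first = "ACGTACGTAC"
second = "GTACTTTTGG"          # bases 1-4 match bases 7-10 of `first`
second_variant = "GAACTTTTGG"  # differs at base 2, before the switch point

contig = build_contig(first, second, switch_point=8, overlap_start=7)
contig_variant = build_contig(first, second_variant, switch_point=8, overlap_start=7)

# The variant base sits in the unused start of the second component,
# so both inputs yield the same contig sequence.
print(contig)
print(contig == contig_variant)
```

Moving the switch point earlier in the overlap would bring more of the second component into the contig, which is why the choice of switch point decides which component's version of the overlapping bases is represented.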

image of join types

Figure 2. Join types: the full dovetail is the preferred join type for contig building.