The basic problem of genome assembly stems from the fact that, while genomes themselves are quite large and contain long stretches of contiguous sequence, the current generation of genome sequencers (based on traditional Sanger sequencing) can only generate a few hundred to a little over a thousand bases of sequence at a time. Thus, a genome must be fragmented, sequenced in pieces and then re-assembled to obtain the full contiguous sequence. The sequence obtained from each piece is referred to as a sequencing read (read for short). Several thousand reads must be produced to reconstruct the sequence of a longer molecule. Both raw reads and assembled data (regardless of the method used) are typically available. Read information can be obtained via the Trace Archive. Assembled sequences are available via the International Nucleotide Sequence Database Collaboration (GenBank, EMBL and DDBJ).
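As a rough illustration of why several thousand reads are needed, the number of reads can be estimated from the target length, the read length and the desired depth of coverage. The sketch below is only back-of-the-envelope arithmetic; the function name `reads_needed` and the example values (a 150 Kb clone, 700-base reads, 8-fold coverage) are assumptions chosen for illustration, not figures from any particular project.

```python
import math


def reads_needed(target_length_bp: int, read_length_bp: int, coverage: float) -> int:
    """Reads required so that total read bases equal coverage * target length."""
    if target_length_bp <= 0 or read_length_bp <= 0 or coverage <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(coverage * target_length_bp / read_length_bp)


# Hypothetical example: one 150 Kb clone, 700-base reads, 8-fold coverage.
print(reads_needed(150_000, 700, 8))  # 1715
```

The same arithmetic applied to a whole genome rather than a single clone gives read counts several orders of magnitude larger.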
Below is a description of the three major approaches to genome assembly. In addition to descriptions of the generic approaches, some examples of assemblers are mentioned, although this document is not meant to provide an exhaustive list of genome assembly algorithms. A brief description of NCBI genome processing is also provided.
Figure 1. An example of a clone tiling path. All lines represent clones and their relative positions. The red lines represent a minimal tiling path through this region.
The hierarchical approach (often referred to as 'clone' based) relies on mapping a set of large insert clones (typically BAC or fosmid clones). Typically, numerous clones will cover any given location of the genome (depending upon the library depth and mapping method used). A minimal tiling path of clones (see Figure 1) is selected so that all sequence is covered while producing the least redundant sequence. Note that there can still be substantial overlap between clones; the amount of overlap will vary depending on how the library was constructed.
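Selecting a tiling path can be sketched as an interval-covering problem. The example below is a minimal sketch, assuming each clone has already been placed on a map as a (name, start, end) interval; the greedy rule it uses (always take the clone that reaches furthest) minimises the number of clones, which is only a proxy for minimising redundant sequence. The clone names and coordinates are hypothetical.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) in map coordinates, end exclusive


def minimal_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedy cover of [region_start, region_end) using the fewest clones."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[Clone] = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones starting at or before the covered position,
        # keep the one that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"no clone extends coverage past position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path


clones = [
    ("A", 0, 100), ("B", 50, 180), ("C", 80, 150),
    ("D", 140, 260), ("E", 170, 300), ("F", 250, 320),
]
print([name for name, _, _ in minimal_tiling_path(clones, 0, 320)])  # ['A', 'B', 'E', 'F']
```

In practice the choice of clones also depends on clone quality and mapping confidence, which this sketch ignores.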
In this case, the assembly of the sequencing reads has been reduced from a global problem (the entire genome) to a local problem (a single clone, typically 40 - 200 Kb). The genome sequence is then assembled by aligning sequences of adjacent clones and calculating a path through these alignments that will produce a non-redundant sequence. It should be noted that the clone sequence need not be finished in order to produce a genome assembly. Indeed, the first several human assemblies consisted of a mixture of finished and unfinished sequence. Typically, unfinished sequence is deposited to the High Throughput Genome Sequences (HTGS) division of GenBank. Once it is finished, it is moved to the regular divisions. In order to assess unfinished sequence and track quality metrics, a series of HTGS keywords was introduced. Assembled sequences can be submitted to the CON (contig) division of GenBank using an AGP file.
Examples of hierarchical assemblers:
References:
Lander ES, et al. Initial sequencing and analysis of the human genome. Nature 2001 Feb 15;409(6822): 860-921.
Kent WJ and Haussler D. Assembly of the working draft of the human genome with GigAssembler. Genome Res. 2001 Sep; 11(9): 1541-8.
Figure 2. Production of a WGS contig. These contigs contain no gaps, although the sequence may contain 'N's due to sequence ambiguity. WGS contigs obtain accessions similar to the ones shown in the figure, with the first 4 letters representing a project code, the first two numbers representing the assembly version, and the last 6 numbers providing unique identifiers for each contig.
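The accession layout described in the caption (4 letters, 2 digits, 6 digits) can be split into its parts with a simple pattern. This is a minimal sketch that handles only that layout; the names `WgsAccession` and `parse_wgs_accession` and the accession `ABCD01000042` are hypothetical, and other accession layouts are deliberately rejected.

```python
import re
from typing import NamedTuple


class WgsAccession(NamedTuple):
    project: str    # 4-letter project code
    version: int    # 2-digit assembly version
    contig_id: int  # 6-digit contig identifier


# Pattern for the layout described above only: 4 letters + 2 digits + 6 digits.
_WGS_RE = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession: str) -> WgsAccession:
    match = _WGS_RE.match(accession)
    if match is None:
        raise ValueError(f"not a 4+2+6 WGS contig accession: {accession!r}")
    project, version, contig_id = match.groups()
    return WgsAccession(project, int(version), int(contig_id))


print(parse_wgs_accession("ABCD01000042"))
# WgsAccession(project='ABCD', version=1, contig_id=42)
```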
The Whole Genome Assembly (WGA) approach dispenses with upfront mapping. The entire genome is fragmented and used to construct libraries of varying insert size. Typically there are libraries of some smaller size (2, 4 or 6 Kb), libraries of intermediate size (10 - 40 Kb) and libraries with large insert sequences (>100 Kb). The ends of these clones are sequenced (referred to as a sequence read). The reads from different ends of the same clone are referred to as mate-pairs.
The assembly approach is typically in two phases. In the initial phase WGS contigs (see Figure 2) are produced by calculating the sequence overlap between all possible reads. These contigs contain no gaps. WGS contigs can be submitted to the WGS division of GenBank.
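The overlap calculation between reads can be illustrated with exact suffix-prefix matching. This is a deliberately simplified sketch: it assumes error-free reads on the same strand and compares every ordered pair, whereas production assemblers tolerate sequencing errors, consider reverse complements and use indexing to avoid the all-against-all comparison. The function names and the toy reads are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def all_overlaps(reads: List[str], min_len: int = 3) -> Dict[Tuple[int, int], int]:
    """Overlap length for every ordered pair of distinct reads that overlap."""
    result: Dict[Tuple[int, int], int] = {}
    for i, j in permutations(range(len(reads)), 2):
        n = overlap(reads[i], reads[j], min_len)
        if n > 0:
            result[(i, j)] = n
    return result


reads = ["ATGCGTAC", "GTACGGA", "CGGATT"]
print(all_overlaps(reads))  # {(0, 1): 4, (1, 2): 4}
```

Here read 0 overlaps read 1 by 4 bases and read 1 overlaps read 2 by 4 bases, so the three reads would merge into a single gap-free contig.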
Figure 3. Construction of supercontigs. These sequences do contain captured gaps, meaning the sequence within the gap is not known but the gap is covered by a clone. Supercontigs may obtain GenBank accessions (consisting of a 2 letter 6 digit combination). They may also obtain RefSeq accession identifiers (typically NW or NT followed by _ and 6 or 9 digits).
The next phase of assembly involves calculating the relationship between WGS contigs in an effort to construct supercontigs (see Figure 3). Using the mate-pair information, the order, orientation and distance between contigs can be asserted based on such information as the library insert size and the insert size standard deviation. The relationship between two contigs can be determined using a single mate-pair, but the level of confidence in such a relationship is not great. A large number of mate-pair relationships is ideal, the optimal number depending on the clone depth used in generating the assembly. Typically, however, the number of mate-pairs supporting a linkage assertion is not reported. These supercontigs can be submitted to the CON (contig) division of GenBank using an AGP file.
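The way insert size and its standard deviation translate into a gap estimate can be sketched as follows. The sketch assumes contig A lies to the left of contig B, both in the same orientation, with the forward read of each mate-pair in A and its mate in B. Each mate-pair then implies a gap of insert size minus the bases the clone occupies in A and in B. Combining the per-pair estimates by inverse-variance weighting is an illustrative choice, not the method of any particular assembler, and all names and numbers are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MateLink:
    insert_mean: float  # mean insert size of the clone's library (bases)
    insert_sd: float    # insert size standard deviation of that library
    start_in_a: int     # 0-based start of the forward read in contig A
    end_in_b: int       # end (exclusive) of the mate read in contig B


def estimate_gap(len_a: int, links: List[MateLink]) -> Tuple[float, float]:
    """Return (gap estimate, standard deviation of the estimate)."""
    if not links:
        raise ValueError("at least one mate-pair link is required")
    weights = [1.0 / link.insert_sd ** 2 for link in links]
    # Per-pair gap: insert size minus the clone's bases inside A and inside B.
    gaps = [link.insert_mean - (len_a - link.start_in_a) - link.end_in_b
            for link in links]
    total = sum(weights)
    gap = sum(w * g for w, g in zip(weights, gaps)) / total
    # Treating pairs as independent, uncertainty shrinks as pairs are added.
    return gap, math.sqrt(1.0 / total)


links = [MateLink(4000, 400, 8000, 1500), MateLink(4000, 400, 9000, 2400)]
gap, sd = estimate_gap(10_000, links)
print(round(gap, 1), round(sd, 1))  # 550.0 282.8
```

With a single link the standard deviation of the estimate equals that of the library (400 in this example), which is one way to see why a linkage supported by only one mate-pair carries little confidence.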
Some examples of WGA algorithms:
There are two common methods for combining the two approaches. One of these involves performing limited clone mapping and low-coverage clone sequencing. These reads are supplemented by whole genome reads. The general approach involves doing an assembly of the clone-based reads and adding in the whole genome reads to generate an 'enriched BAC (e-BAC)'. It is important to note that the sequence represented in the e-BAC record may extend beyond the boundaries of the physical clone associated with the record, due to the incorporation of the whole genome reads. These e-BACs can then be assembled to produce a genome sequence.
Example of this hybrid approach:
References:
Havlak P. The Atlas genome assembly system. Genome Res. 2004 Apr;14(4): 721-32.
A different approach involves performing the whole genome assembly and then integrating the clone sequence into the assembly. This approach has largely been used for the mouse genome, although there can be complications integrating clone sequence that is not assembled in the WGA. In these cases, additional mapping is necessary to resolve any conflicts.
Each assembly approach has pluses and minuses. Regardless of the assembly method all assemblies are subject to some common issues:
References:
Istrail S, et al. Whole-genome shotgun assembly and comparison of human genome assemblies. Proc Natl Acad Sci USA. 2004 Feb 17;101(7): 1916-21.
Eichler EE. Widening the spectrum of human genetic variation. Nat Genet. 2006 Jan;38(1): 9-11.
Two genomes (human and mouse) are produced at NCBI in collaboration with international consortia (the International Human Genome Sequencing Consortium and the Mouse Genome Sequencing Consortium). For all other genomes, the reference assembly is produced by an outside entity and submitted to GenBank.
In order to provide genome annotation, NCBI must produce a Reference Sequence (RefSeq) representation of the genome assembly. While NCBI endeavors to represent the submitted genome faithfully, there is no guarantee that the RefSeq version will be exactly like the submitted version. The differences typically center around potentially contaminated regions.
In addition, NCBI will often provide alternate genome representations for a given organism if such assemblies are available. These assemblies can be complete or partial assemblies.