- Assembly Primer
A primer on genome assembly methods.
The basic problem of genome assembly stems from the fact that while genomes themselves are quite large and contain long stretches of contiguous sequence, the current generation of commonly used genome sequencers (based on traditional Sanger sequencing) can only generate a few hundred to a little over a thousand bases of sequence at a time. Thus, a genome must be fragmented, sequenced in bits and then re-assembled to obtain the full contiguous sequence. Each sequenced piece of DNA is referred to as a sequencing read (read for short). Several thousand to several million reads must be produced to reconstruct the sequence of a longer molecule. Both raw reads and assembled data (regardless of the method used) are typically available. Read information can be obtained via the Trace Archive. Assembled sequences are available via the International Nucleotide Sequence Database Collaboration (GenBank, EMBL and DDBJ), and mappings of reads to assembled sequences can be obtained via the Assembly Archive. It should be noted, however, that few projects have made use of the Assembly Archive.
Below is a description of four different approaches to genome assembly using traditional Sanger sequencing. See the 'Future Directions' section for information on the next generation sequencers and assemblies. In addition to descriptions of the generic approaches for genome assemblies, some examples of assemblers are mentioned, although this document is not meant to provide an exhaustive list of genome assembly algorithms. A brief description of NCBI genome processing is also provided.

Figure 1. An example of a clone tiling path. All lines represent clones and their relative positions. The red lines represent a minimal tiling path through this region.
The Hierarchical approach (often referred to as 'clone-based') relies on mapping a set of large insert clones (typically BAC or fosmid clones) using methods such as fingerprint analysis or identifying clones that contain markers localized by linkage mapping or radiation hybrid (RH) mapping. Typically, numerous clones will cover any given location of the genome (depending upon the library depth and mapping method used). A minimal tiling path of clones (see Figure 1) is selected in which all sequence is covered with the least amount of redundant sequence produced. Note that there can be substantial overlap between clones. The amount of overlap between clones will vary depending on how the library was constructed.
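The selection of a minimal tiling path can be illustrated with a small sketch. It assumes each clone has already been mapped to a start and end coordinate on the region, which real fingerprint or marker maps only approximate, and it uses a simple greedy interval cover; the clone names and coordinates are hypothetical, and production tiling-path tools weigh many additional factors.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) in map coordinates; hypothetical layout


def minimal_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedy interval cover: among clones that start at or before the
    position covered so far, always take the one that extends farthest."""
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Scan every clone that begins inside the already covered region.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"no clone covers position {covered}")
        path.append(best)
        covered = best[2]
    return path


# Hypothetical clones, coordinates in kb.
clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 220), ("D", 150, 300), ("E", 200, 310)]
path = minimal_tiling_path(clones, 0, 300)
print([name for name, _, _ in path])
```

In this toy example clones B and D are redundant: every position they cover is also covered by a clone that reaches further along the region.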
In this strategy, the assembly of the sequencing reads has been reduced from a global problem (the entire genome) to a local problem (a single clone, typically 40 - 200 Kb). First, each clone is fragmented and sequenced using a 'shotgun' approach. This involves randomly breaking up the larger clones and sequencing each fragment. Typically, each read is evaluated for quality and each base is assigned a 'quality score'. The most common software used for this is Phred, but the program Trace Tuner has recently been introduced as well. The base level accuracy of each read is important for evaluating alignments and generating assemblies.
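As a brief illustration of what a quality score encodes: Phred-style scores are defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The helper names below are illustrative only and are not part of Phred or Trace Tuner.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-style quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an estimated error probability to a Phred-style quality score."""
    return -10 * math.log10(p)


def expected_errors(qualities) -> float:
    """Expected number of miscalled bases in a read, given per-base scores."""
    return sum(phred_to_error_prob(q) for q in qualities)


for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_prob(q))} bases")
```

A score of 20 therefore corresponds to an estimated error rate of 1 in 100, and a score of 30 to 1 in 1000.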
The sequenced fragments can then be assembled to recreate the insert sequence of the clone. The most commonly used software for this problem is part of the Phred package and is called Phrap. Typically, the shotgun approach does not recover all of the insert sequence such that a few gaps will remain. Many factors influence how many gaps will remain; these include the depth of shotgun sequencing (greater coverage leads to fewer gaps), the organism being sequenced, and the repeat content of the clone. To get the complete insert of the clone, manual intervention (by people typically referred to as 'finishers') is required to close the gaps, typically using a PCR-based approach. It is not unusual for the finisher to manually evaluate the traces as part of the generation of the finished, consensus sequence for the clone insert.
The genome sequence is then assembled by aligning sequences of adjacent clones and calculating a path through these alignments that will produce a non-redundant sequence. Typically, evaluations of these alignments are guided by a map (often called a Tiling Path, or TPF). Examples of such programs are Gigassembler (Jim Kent, UCSC) and TPF Analyzer (Richa Agarwala, NCBI). It should be noted that the clone sequence need not be finished in order to produce a genome assembly. Indeed, the first several human assemblies consisted of a mixture of finished and unfinished sequence. Typically, unfinished sequence is deposited to the High Throughput Genome Sequences (HTGS) division of GenBank. Once it is finished it is moved to the regular divisions. In order to assess unfinished sequence and track quality metrics, a series of HTGS keywords was introduced. Assembled sequences can be submitted to the CON (contig) division of GenBank using an AGP file.
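The relationship between shotgun depth and remaining gaps can be sketched with the idealized Lander-Waterman model, which is an assumption introduced here and not something the text above derives. It treats reads as uniformly random and ignores repeats, cloning bias, and minimum overlap requirements, so real projects will leave more gaps than it predicts. The clone size and read length are hypothetical.

```python
import math


def coverage(n_reads: int, read_len: int, target_len: int) -> float:
    """Average depth of coverage: total bases sequenced / target length."""
    return n_reads * read_len / target_len


def expected_uncovered_fraction(depth: float) -> float:
    """Idealized fraction of bases covered by no read (Poisson approximation)."""
    return math.exp(-depth)


def expected_gaps(n_reads: int, depth: float) -> float:
    """Idealized expected number of gaps between contigs."""
    return n_reads * math.exp(-depth)


# Hypothetical 150 kb BAC insert sequenced with 600 bp reads.
for n_reads in (500, 1000, 2000):
    depth = coverage(n_reads, 600, 150_000)
    print(f"{n_reads} reads: {depth:.0f}x depth, "
          f"~{expected_gaps(n_reads, depth):.1f} gaps expected")
```

Even under these optimistic assumptions the expected gap count falls quickly with depth but never reaches exactly zero, which is consistent with the need for finishing described above.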
Figure 2. Production of a WGS contig. These contigs contain no gaps, although the sequence may contain 'N's due to sequence ambiguity. WGS contigs obtain accessions similar to the ones shown in the figure, with the first 4 letters representing a project code, the first two numbers representing the assembly version, and the last 6 numbers providing unique identifiers for each contig.
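The accession layout described in the caption (four letters, two digits, six digits) can be parsed with a short regular expression. This is a sketch of that layout only; the accession `ABCD01000001` is a made-up example, and the function name is hypothetical.

```python
import re

# Layout taken from the caption: 4-letter project code, 2-digit assembly
# version, 6-digit contig identifier.
WGS_CONTIG_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession: str) -> dict:
    """Split a WGS contig accession into its three components."""
    match = WGS_CONTIG_ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not a 4+2+6 WGS contig accession: {accession!r}")
    project, version, contig = match.groups()
    return {
        "project": project,
        "assembly_version": int(version),
        "contig_number": int(contig),
    }


print(parse_wgs_accession("ABCD01000001"))  # hypothetical accession
```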
The Whole Genome Shotgun (WGS) assembly approach, which is the dominant strategy in use today, dispenses with up-front mapping. The entire genome is fragmented and used to construct libraries of varying insert sizes. Typically there are libraries of some smaller size (2, 4 or 6 Kb), libraries of intermediate size (10 - 40 Kb) and libraries with large insert sequences (>100 Kb). The ends of these clones are sequenced, generating sequence reads. The reads from different ends of the same clone are referred to as mate-pairs.
The WGS assembly approach typically has three major phases, known as overlap, layout, and consensus. In the initial phase (overlap), the WGS algorithm calculates the sequence overlap between all available reads. In the layout step, the reads are arranged according to their pattern of overlap, producing a multiple alignment of the reads. In the consensus step, a contig is generated (see Figure 2) by calculating the consensus base at each position of the layout. These contigs contain no gaps. WGS contigs can be submitted to the WGS division of GenBank.
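The three phases can be illustrated with a deliberately small sketch. It assumes error-free reads from a single strand, exact suffix-prefix overlaps, and a greedy layout, none of which holds for real data; production assemblers use approximate alignment, handle reverse complements, and resolve repeats. The reads and the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Overlap phase: longest suffix of `a` that exactly matches a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def assemble(reads, min_len: int = 3):
    # Overlap: score every ordered pair of reads (quadratic, hence expensive).
    edges = sorted(
        ((overlap_len(a, b, min_len), i, j)
         for (i, a), (j, b) in permutations(enumerate(reads), 2)),
        reverse=True,
    )
    # Layout: greedily chain reads, best overlaps first, avoiding cycles.
    succ, pred, ovl = {}, {}, {}
    for k, i, j in edges:
        if k == 0:
            break
        if i in succ or j in pred:
            continue
        tail = j
        while tail in succ:
            tail = succ[tail]
        if tail == i:  # linking i -> j would close a cycle
            continue
        succ[i], pred[j], ovl[i] = j, i, k
    # Consensus: stack reads at their layout offsets, majority vote per column.
    contigs = []
    for start in range(len(reads)):
        if start in pred:
            continue
        columns, offset, node = [], 0, start
        while True:
            for p, base in enumerate(reads[node]):
                if offset + p == len(columns):
                    columns.append(Counter())
                columns[offset + p][base] += 1
            if node not in succ:
                break
            offset += len(reads[node]) - ovl[node]
            node = succ[node]
        contigs.append("".join(c.most_common(1)[0][0] for c in columns))
    return contigs


reads = ["CCTAAGT", "AGTCGA", "ATGGCCT"]  # hypothetical reads, given out of order
contigs = assemble(reads)
print(contigs)
```

Because the toy reads are error-free, the consensus vote is trivial here; with real reads each column would hold several possibly conflicting base calls, which is where per-base quality scores become important.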
Figure 3. Construction of supercontigs. These sequences do contain captured gaps, meaning the sequence within the gap is not known but the gap is covered by a clone. Supercontigs may obtain GenBank accessions (consisting of a 2-letter, 6-digit combination). They may also obtain RefSeq accession identifiers (typically NW or NT followed by _ and 6 or 9 digits).
After building contigs, a WGS assembler can use mate-pair information to order and orient the contigs and place them into larger structures called scaffolds (or supercontigs) (see Figure 3). The contigs within a scaffold are separated by gaps of unknown size, although the library insert sizes can be used to provide good estimates of these gap sizes. The relationship between two contigs can be determined using a single mate-pair, but the level of confidence in such a relationship is not great, and most WGS assemblers require at least two such links for each pair of contigs in a scaffold. Typically, the number of mate-pairs supporting a linkage assertion is not reported, however. These scaffolds can be submitted to the CON (contig) division of GenBank using an AGP file.
The first WGS assemblers, used for bacterial and viral genomes and for BAC clones, were Phrap (Phil Green), TIGR Assembler (Granger Sutton), and Cap3 (X. Huang). These assemblers were widely used during the 1990s. Some examples of recent WGS assemblers that have been applied successfully to large (mammalian-size) genomes are:
There is a method that combines the whole genome and hierarchical approaches. It involves supplementing limited clone mapping and low-coverage clone sequencing with whole genome sequencing. The clone-based reads are assembled first and the whole genome reads are then added to generate an 'enriched BAC (e-BAC)'. These e-BACs are then used to produce a genome assembly. It is important to note that the sequence represented in the e-BAC record may extend beyond the boundaries of the physical clone associated with the record due to the incorporation of the whole genome reads.
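The two-link rule and the gap estimate can be sketched as follows. The sketch assumes each read has already been assigned to exactly one contig and ignores orientation and insert-size variance, both of which a real scaffolder must handle; all read names, contig names, and distances are hypothetical.

```python
from collections import Counter


def supported_links(mate_pairs, read_to_contig, min_links: int = 2):
    """Count mate-pairs that join two different contigs and keep only the
    contig pairs supported by at least `min_links` of them."""
    counts = Counter()
    for left, right in mate_pairs:
        a, b = read_to_contig.get(left), read_to_contig.get(right)
        if a is None or b is None or a == b:
            continue  # unplaced read, or both mates fall in the same contig
        counts[tuple(sorted((a, b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}


def estimate_gap(insert_size: int, left_span: int, right_span: int) -> int:
    """Rough gap estimate: the insert size minus the part of the clone that
    lies inside each flanking contig (a simplifying assumption)."""
    return insert_size - left_span - right_span


read_to_contig = {
    "r1.f": "contig1", "r1.r": "contig2",
    "r2.f": "contig1", "r2.r": "contig2",
    "r3.f": "contig2", "r3.r": "contig3",
    "r4.f": "contig1", "r4.r": "contig1",
}
mate_pairs = [("r1.f", "r1.r"), ("r2.f", "r2.r"), ("r3.f", "r3.r"), ("r4.f", "r4.r")]

links = supported_links(mate_pairs, read_to_contig)
print(links)
print(estimate_gap(insert_size=4000, left_span=1500, right_span=1800))
```

Here contig1 and contig2 are joined by two mate-pairs and pass the threshold, while the single link between contig2 and contig3 is discarded.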
Example of this hybrid approach:
Another approach to assembly, which has become possible with the advent of increasing numbers of finished genomes, is comparative assembly, in which a reference genome is used to guide assembly. In this approach, rather than the overlap-layout-consensus of WGS algorithms, the assembler uses an alignment-consensus algorithm. The WGS reads are first aligned to the reference genome, which is assumed to be very similar to the newly sequenced genome. This alignment is then used directly to compute the consensus sequence of the new genome. Comparative assembly is much faster than the standard WGS algorithm because it avoids the very expensive overlap step. It also avoids scaffolding because the reference genome is presumed to have the same structure. This approach breaks down if the new genome is too divergent or in regions of large-scale structural variation. It has been very successful for assembly of multiple strains of many bacteria (such as Bacillus anthracis), and can produce better contiguity and coverage than a de novo WGS strategy.
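The alignment-consensus idea can be sketched in a few lines. The ungapped, fewest-mismatches placement below is a stand-in for a real aligner and is an assumption of this sketch, as are the function names, the mismatch threshold, and the toy sequences; insertions, deletions, and rearrangements are not handled.

```python
from collections import Counter


def best_placement(read: str, reference: str):
    """Alignment step: slide the read along the reference without gaps and
    return (position, mismatches) for the first best-scoring position."""
    best_pos, best_mm = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(r != g for r, g in zip(read, window))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def comparative_consensus(reads, reference: str, max_mismatches: int = 2) -> str:
    """Consensus step: pile reads onto the reference coordinates and take a
    majority vote per column; columns with no reads are reported as 'N'."""
    columns = [Counter() for _ in reference]
    for read in reads:
        pos, mm = best_placement(read, reference)
        if mm > max_mismatches:
            continue  # too divergent to place with confidence
        for i, base in enumerate(read):
            columns[pos + i][base] += 1
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)


reference = "ATGGCCTAAGTCGA"                   # hypothetical finished genome
reads = ["ATGGCCTG", "CCTGAGTC", "GAGTCGA"]    # reads from a strain with one substitution
print(comparative_consensus(reads, reference))
```

No read-versus-read overlaps are computed at all, which is the source of the speed advantage described above; the cost is that every read is interpreted only in the coordinate system of the reference.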
Example of this comparative approach:
Each assembly approach has pluses and minuses. Regardless of the assembly method all assemblies are subject to some common issues:
Two genomes (human and mouse) were produced at NCBI in collaboration with international consortia (the International Human Genome Sequencing Consortium and the Mouse Genome Sequencing Consortium). These genomes are now maintained by the Genome Reference Consortium. There is currently an ongoing effort to close gaps, curate problems and provide robust allelic representation when necessary. For all other genomes the reference assembly is produced by an outside entity and submitted to GenBank.
In order to provide genome annotation, NCBI must produce a Reference Sequence (RefSeq) representation of the genome assembly. While NCBI endeavors to represent the submitted genome faithfully, there is no guarantee that the RefSeq version will be exactly like the submitted version. The differences typically center around potentially contaminated regions.
In addition, NCBI will often provide alternate genome representations for a given organism if such assemblies are available. These assemblies can be complete or partial assemblies.
Sanger sequencing has been the dominant technology for over 30 years. However, many so-called 'Next Generation' platforms are generating sequence data at an astounding rate. These platforms are built on novel approaches and utilize different chemistry to produce sequence data. All of the platforms currently in use require an amplification step prior to sequencing. Development is still ongoing to understand how to best utilize these reads with respect to base level quality, alignment and assembly. Currently, these data are being submitted to the Short Read Archive (SRA).
Platforms in widespread use:
| Platform | Chemistry | Read Length | Paired-end Length |
|---|---|---|---|
| Roche (454) | Pyrosequencing | 230 - 400 bp | 3000 bp |
| Illumina (Solexa) | Sequencing by Synthesis | 40 bp | 200 bp |
| ABI SOLiD | Ligation based sequencing | 35 bp | 3000 bp |
Data for table kindly provided by Elaine Mardis (The Genome Center at Washington University in St. Louis) and are accurate as of May 20, 2008.
Additional technology is quickly being developed. Many of these newer technologies don't require an amplification step. Examples of these are:
May 30, 2008