A primer on genome assembly methods.
The basic problem of genome assembly stems from the fact that while genomes themselves are quite large and contain long stretches of contiguous sequence (on the order of millions of base pairs), the current generation of commonly used genome sequencers can only generate relatively short segments of sequence. Traditional approaches, based on Sanger sequencing, could produce reads of up to 1000 bp. Current generation sequencing technologies (e.g. Illumina, SOLiD and 454) produce shorter reads, although read length for all of these platforms is improving. Thus, a genome must be fragmented, sequenced in bits and then re-assembled to obtain the full contiguous sequence. Each sequenced piece of DNA is referred to as a sequencing read (read for short). Several thousand to several million reads must be produced to reconstruct the sequence of a longer molecule. Both raw reads and assembled data (regardless of the method used) are typically available. Read information for Sanger-based sequences can be obtained via the Trace Archive, and read information for next-generation sequences is available at the Sequence Read Archive (SRA). Assembled sequences are available via the International Nucleotide Sequence Database Collaboration (GenBank, EMBL and DDBJ), and mappings of reads to assembled sequences can be obtained via the Assembly Archive. The construction of higher order molecules (scaffolds and chromosomes) is described using an AGP file.
Below is a description of four different approaches to genome assembly. Most were developed using Sanger technology but many are adapting to the second generation platforms that now dominate the sequencing landscape. In addition to descriptions of the generic approaches for genome assemblies, some examples of assemblers will be mentioned, although this document is not meant to provide an exhaustive list of genome assembly algorithms.
Figure 1. An example of a clone tiling path. All lines represent clones and their relative positions. The red lines represent a minimal tiling path through this region.
The Hierarchical approach (often referred to as 'clone-based') relies on mapping a set of large insert clones (typically BAC or fosmid clones) using methods such as fingerprint analysis, or on identifying clones that contain markers localized by linkage mapping or radiation hybrid (RH) mapping. Typically, numerous clones will cover any given location of the genome (depending upon the library depth and mapping method used). A minimal tiling path of clones (see Figure 1) is selected in which all sequence is covered with the least amount of redundant sequence produced. Note that there can be substantial overlap between clones. The amount of overlap between clones will vary depending on how the library was constructed.
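The selection of a minimal tiling path can be illustrated with a small sketch. It assumes, purely for illustration, that each clone has already been reduced to a name plus start and end coordinates on the map; in practice clone positions come from fingerprint or marker data and are far less exact. The function name `minimal_tiling_path` and the tuple layout are hypothetical, and the greedy interval cover shown is one simple way to pick such a path, not necessarily what any production pipeline uses.

```python
from typing import List, Tuple

# Hypothetical representation: (clone name, map start, map end).
Clone = Tuple[str, int, int]


def minimal_tiling_path(clones: List[Clone], region_start: int,
                        region_end: int) -> List[Clone]:
    """Greedy interval cover: repeatedly choose the clone that starts at or
    before the covered position and extends furthest to the right."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[Clone] = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Scan every not-yet-seen clone that overlaps or abuts the covered part.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"no clone extends coverage past {covered}")
        path.append(best)
        covered = best[2]
    return path


# Made-up clones covering map positions 0 to 320.
example_clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 200),
                  ("D", 150, 300), ("E", 190, 320)]
print([name for name, _, _ in minimal_tiling_path(example_clones, 0, 320)])
```

Clones B and D are skipped because C and E reach further while still overlapping the clone chosen before them, which is the sense in which the path is minimal.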
In this strategy, the assembly of the sequencing reads has been reduced from a global problem (the entire genome) to a local problem (a single clone, typically 40 - 200 Kb). First, each clone is fragmented and sequenced using a 'shotgun' approach. This involves randomly breaking up the larger clones and sequencing each fragment. Typically, each read is evaluated for quality and each base is assigned a 'quality score'. The most common software used for this is Phred, but the program Trace Tuner has recently been introduced as well. The base level accuracy of each read is important for evaluating alignments and generating assemblies.
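To make the notion of a per-base quality score concrete, the sketch below uses the standard Phred definition, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The helper names are hypothetical and are not part of the Phred or Trace Tuner software.

```python
import math
from typing import Iterable


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)


def expected_errors(qualities: Iterable[float]) -> float:
    """Expected number of miscalled bases in a read, given its quality scores."""
    return sum(phred_to_error_probability(q) for q in qualities)


# A score of 20 corresponds to a 1 in 100 chance of error, 30 to 1 in 1000.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Summing the per-base error probabilities gives a rough expectation of how many bases in a read are wrong, which is one way an assembler can weigh a mismatch in an alignment.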
The sequenced fragments can then be assembled to recreate the insert sequence of the clone. The most commonly used software for this problem is part of the Phred package and is called Phrap. Typically, the shotgun approach does not recover all of the insert sequence, so a few gaps will remain. Many factors influence how many gaps will remain; these include the depth of shotgun sequencing (greater coverage leads to fewer gaps), the organism being sequenced, and the repeat content of the clone. To get the complete insert of the clone, manual intervention (by people typically referred to as 'finishers') is required to close the gaps, typically using a PCR-based approach. It is not unusual for the finisher to manually evaluate the traces as part of the generation of the finished, consensus sequence for the clone insert.
The genome sequence is then assembled by aligning sequences of adjacent clones and calculating a path through these alignments that will produce a non-redundant sequence. Typically, evaluations of these alignments are guided by a map (often called a Tiling Path, or TPF). Examples of such programs are Gigassembler (Jim Kent, UCSC) and TPF Analyzer (Richa Agarwala, NCBI). It should be noted that the clone sequence need not be finished in order to produce a genome assembly. Indeed, the first several human assemblies consisted of a mixture of finished and unfinished sequence. Typically, unfinished sequence is deposited to the High Throughput Genome Sequences (HTGS) division of GenBank. Once it is finished, it is moved to the regular divisions. In order to assess unfinished sequence and track quality metrics, a series of HTGS keywords was introduced. Assembled sequences can be submitted to the CON (contig) division of GenBank using an AGP file.
Figure 2. Production of a WGS contig. These contigs contain no gaps, although the sequence may contain 'N's due to sequence ambiguity. WGS contigs obtain accessions similar to the ones shown in the figure, with the first 4 letters representing a project code, the first two numbers representing the assembly version, and the last 6 numbers providing unique identifiers for each contig.
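The accession layout described in the Figure 2 caption (4 letters, 2 digits, 6 digits) can be split into its parts with a short sketch. The function name, the `WgsAccession` type and the accession `ABCD01000042` are made up for illustration; accessions that do not fit this exact layout are simply reported as not matching.

```python
import re
from typing import NamedTuple, Optional


class WgsAccession(NamedTuple):
    project: str            # 4-letter project code
    assembly_version: int   # 2-digit assembly version
    contig_id: int          # 6-digit contig identifier


# Layout taken from the caption: 4 letters, 2 digits, 6 digits.
_WGS_PATTERN = re.compile(r"([A-Z]{4})(\d{2})(\d{6})")


def parse_wgs_accession(accession: str) -> Optional[WgsAccession]:
    """Return the parts of a WGS contig accession, or None if it does not fit."""
    match = _WGS_PATTERN.fullmatch(accession)
    if match is None:
        return None
    project, version, contig = match.groups()
    return WgsAccession(project, int(version), int(contig))


# Hypothetical accession, not a real record.
print(parse_wgs_accession("ABCD01000042"))
```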
The Whole Genome Shotgun (WGS) assembly approach, which is the dominant strategy in use today, dispenses with up-front mapping. The entire genome is fragmented and used to construct libraries of varying insert sizes. Typically there are libraries of some smaller size (2, 4 or 6 Kb), libraries of intermediate size (10 - 40 Kb) and libraries with large insert sequences (>100 Kb). The ends of these clones are sequenced, generating sequence reads. The reads from different ends of the same clone are referred to as mate-pairs.
The original WGS assembly approach, developed using Sanger reads (which are relatively long with low throughput), typically has three major phases, known as overlap, layout, and consensus. In the initial phase (overlap), the WGS algorithm calculates the sequence overlap between all available reads. In the layout step, the reads are arranged according to their pattern of overlap, producing a multiple alignment of the reads. In the consensus step, a contig is generated (see Figure 2) by calculating the consensus base at each position of the layout. These contigs contain no gaps. WGS contigs can be submitted to the WGS division of GenBank.
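The three phases can be sketched on a toy scale as follows. This is a minimal illustration, assuming error-free reads from a single strand, exact suffix-prefix overlaps and a simple greedy layout; real assemblers use alignment-based overlaps that tolerate sequencing errors and handle reverse complements and repeats. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def overlap_length(a: str, b: str, min_len: int) -> int:
    """Longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def compute_overlaps(reads: List[str], min_len: int) -> Dict[Tuple[int, int], int]:
    """Overlap phase: compare every ordered pair of reads."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = overlap_length(a, b, min_len)
                if length:
                    overlaps[(i, j)] = length
    return overlaps


def layout(reads: List[str], overlaps: Dict[Tuple[int, int], int]) -> List[Tuple[int, int]]:
    """Layout phase: start at a read with no incoming overlap and greedily
    follow the longest outgoing overlap. Returns (read index, offset) pairs."""
    has_incoming = {j for (_, j) in overlaps}
    starts = [i for i in range(len(reads)) if i not in has_incoming]
    current = starts[0] if starts else 0
    placed, used, offset = [(current, 0)], {current}, 0
    while True:
        candidates = [(length, j) for (i, j), length in overlaps.items()
                      if i == current and j not in used]
        if not candidates:
            return placed
        length, nxt = max(candidates)
        offset += len(reads[current]) - length
        placed.append((nxt, offset))
        used.add(nxt)
        current = nxt


def consensus(reads: List[str], placed: List[Tuple[int, int]]) -> str:
    """Consensus phase: majority vote over the bases stacked in each column."""
    columns: Dict[int, Counter] = {}
    for idx, offset in placed:
        for k, base in enumerate(reads[idx]):
            columns.setdefault(offset + k, Counter())[base] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


reads = ["GCGTGCAA", "TGCAATCC", "ATGGCGTG"]  # made-up reads
overlaps = compute_overlaps(reads, min_len=3)
print(consensus(reads, layout(reads, overlaps)))
```

The all-against-all comparison in `compute_overlaps` is the step that becomes expensive as the number of reads grows.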
Figure 3. Construction of supercontigs. These sequences do contain captured gaps, meaning the sequence within the gap is not known but the gap is covered by a clone. Supercontigs may obtain GenBank accessions (consisting of 2 letters followed by 6 digits). They may also obtain RefSeq accession identifiers (typically NW or NT followed by _ and 6 or 9 digits).
After building contigs, a WGS assembler can use mate-pair information to order and orient the contigs and place them into larger structures called scaffolds (or supercontigs) (see Figure 3). The contigs within a scaffold are separated by gaps of unknown size, although the library insert sizes can be used to provide good estimates of these gap sizes. The relationship between two contigs can be determined using a single mate-pair, but the level of confidence in such a relationship is not great, and most WGS assemblers require at least two such links for each pair of contigs in a scaffold. Typically, however, the number of mate-pairs supporting a linkage assertion is not reported. These scaffolds can be submitted to the CON (contig) division of GenBank using an AGP file.
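The link-counting and gap-estimation idea can be sketched as below. The sketch assumes a deliberately simplified input: each mate-pair is already known to place one read on a 'left' contig and its mate on a 'right' contig in a consistent orientation, and is given as a tuple of (left contig, start of the read in the left contig, right contig, end of the mate in the right contig, library insert size). That tuple layout, the function name and the default of two links are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical layout: (left contig, read start in left contig,
#                       right contig, mate end in right contig, insert size)
MateLink = Tuple[str, int, str, int, int]


def link_contigs(mate_links: List[MateLink], contig_lengths: Dict[str, int],
                 min_links: int = 2) -> Dict[Tuple[str, str], Tuple[int, float]]:
    """Return {(left, right): (link count, mean gap estimate)} for contig
    pairs joined by at least min_links mate-pairs."""
    gap_estimates = defaultdict(list)
    for left, left_start, right, right_end, insert_size in mate_links:
        # Portion of the clone insert that lies inside the left contig.
        spanned_left = contig_lengths[left] - left_start
        # Whatever the insert does not spend inside either contig is gap.
        gap_estimates[(left, right)].append(insert_size - spanned_left - right_end)
    return {pair: (len(gaps), mean(gaps))
            for pair, gaps in gap_estimates.items() if len(gaps) >= min_links}


# Made-up contigs and mate-pairs.
lengths = {"c1": 5000, "c2": 8000, "c3": 3000}
links = [("c1", 4000, "c2", 500, 2000),
         ("c1", 4200, "c2", 900, 2000),
         ("c2", 7500, "c3", 300, 4000)]
print(link_contigs(links, lengths))
```

The c2 to c3 join is dropped because only one mate-pair supports it, mirroring the two-link threshold described above.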
The first WGS assemblers, used for bacterial and viral genomes and for BAC clones, were Phrap (Phil Green), TIGR Assembler (Granger Sutton), and Cap3 (X. Huang). These assemblers were widely used during the 1990s. Some examples of recent WGS assemblers that have been applied successfully to large (mammalian-size) genomes sequenced using Sanger technology are:
With the advent of second generation sequencers, most of which produce very large numbers of relatively short reads, new approaches to assembly had to be developed. Many of these new assemblers perform assembly using a structure known as a de Bruijn graph. This approach is attractive because it does not require all reads to be aligned to all other reads and it can compress redundant sequence. A review of assembly algorithms for next-generation sequencing data was published by Miller et al. (2010).
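A toy de Bruijn graph shows why no read-against-read alignment is needed and how redundant sequence collapses. In this sketch, which assumes error-free reads from one strand, each k-mer found in a read becomes an edge between its first and last (k-1) bases; identical k-mers from different reads map onto the same edge. The function names are hypothetical, and the path walk is a simplification that only checks for a single successor.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Duplicate k-mers from overlapping reads collapse into one edge.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unambiguous_path(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a sequence from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop rather than loop forever on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig


reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]  # made-up overlapping reads
graph = build_de_bruijn(reads, k=4)
print(sum(len(successors) for successors in graph.values()))  # distinct edges
print(walk_unambiguous_path(graph, "ATG"))
```

The three reads contain twelve 4-mers in total, but only eight distinct edges are stored, because the overlapping portions of the reads contribute the same k-mers.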
There is a method that combines the whole genome and hierarchical approaches. It involves supplementing limited clone mapping and low-coverage clone sequencing with whole genome sequencing. The clone-based reads are assembled first and the whole genome reads are then added to generate an 'enriched BAC (e-BAC)'. These e-BACs are then used to produce a genome assembly. It is important to note that the sequence represented in the e-BAC record may extend beyond the boundaries of the physical clone associated with the record, due to the incorporation of the whole genome reads.
Example of this hybrid approach:
Another approach to assembly, which has become possible with the advent of increasing numbers of finished genomes, is comparative assembly, in which a reference genome is used to guide assembly. In this approach, rather than the overlap-layout-consensus of WGS algorithms, the assembler uses an alignment-consensus algorithm. The WGS reads are first aligned to the reference genome, which is assumed to be very similar to the newly sequenced genome. This alignment is then used directly to compute the consensus sequence of the new genome. Comparative assembly is much faster than the standard WGS algorithm because it avoids the very expensive overlap step. It also avoids scaffolding because the reference genome is presumed to have the same structure. This approach breaks down if the new genome is too divergent or in regions of large-scale structural variation. It has been very successful for assembly of multiple strains of many bacteria (such as Bacillus anthracis), and can produce better contiguity and coverage than a de novo WGS strategy.
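The alignment-consensus idea can be sketched in a few lines. This is a toy illustration under strong assumptions: reads are placed on the reference without gaps at the offset with the fewest mismatches (found by brute force), all reads come from the same strand, and positions with no read coverage are reported as 'N'. Real comparative assemblers use proper read aligners; the function names here are hypothetical.

```python
from collections import Counter
from typing import List


def best_offset(reference: str, read: str) -> int:
    """Ungapped placement of read on reference with the fewest mismatches."""
    def mismatches(offset: int) -> int:
        window = reference[offset:offset + len(read)]
        return sum(1 for a, b in zip(window, read) if a != b)
    return min(range(len(reference) - len(read) + 1), key=mismatches)


def comparative_consensus(reference: str, reads: List[str]) -> str:
    """Align each read to the reference, then call the majority base per column."""
    columns = [Counter() for _ in reference]
    for read in reads:
        offset = best_offset(reference, read)
        for k, base in enumerate(read):
            columns[offset + k][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


# Made-up data: the new genome differs from the reference at one position.
reference = "ATGGCGTGCAATCC"
reads = ["ATGGCATG", "GCATGCAA", "TGCAATCC"]
print(comparative_consensus(reference, reads))
```

Each read is compared only with the reference, never with the other reads, which is where the speed advantage over the overlap step comes from. The sketch also makes the limitation visible: a read that is too divergent from the reference will be placed at the wrong offset.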
Example of this comparative approach:
Each assembly approach has pluses and minuses. Regardless of the assembly method all assemblies are subject to some common issues:
October 15, 2010