Assembling Genomic Sequences


Making a composite assembly:
NCBI Mouse Build 36 represents a highly polished product. It is composed largely of finished sequence. HTGS phase 3 sequence is assembled by hand into non-redundant contigs and the non-redundant sequence is used as an input to the assembly. A hand curated combined Tiling Path file was used to guide the assembly. All non-finished portions of the assembly have been hand curated to determine whether to include WGS or HTGS sequence, or to leave a gap.
The Y chromosome was built by hand in collaboration with the Washington University Genome Sequencing Center and the Page lab at the Whitehead Institute for Biomedical Research. Currently, only the short arm of the Y has reliable mapping data, so most of the contigs on the Y chromosome are unplaced.
Graphical representation of the sequence status of each chromosome: Build 36

NCBI Mouse Build 35 represents a fifth generation composite assembly. In this build, chromosomes 1,3,5,6,7,8,9,10, and 12-19 were automatically assembled. As in build 34, only HTGS phase 3, single fragment HTGS phase 2 and WGS contigs were used in the assembly. The HTGS phase 3 sequence is assembled by hand into non-redundant contigs and the non-redundant sequence is used as an input to the assembly. A hand curated combined Tiling Path file was used to guide the assembly.
Assembly instructions for chromosomes 2,4,11 and X were provided as AGP files by the Sanger Institute. The Y chromosome was built by hand in collaboration with the Washington University Genome Sequencing Center and the Page lab at the Whitehead Institute for Biomedical Research. Currently, only the short arm of the Y has reliable mapping data, so most of the contigs on the Y chromosome are unplaced.
Graphical representation of the sequence status of each chromosome: Build 35

NCBI Mouse Build 34 represents a fourth generation composite assembly. In this build, chromosomes 1,3,5,6,7,8,9,10, and 12-19 were automatically assembled. As in build 33, only HTGS phase 3, single fragment HTGS phase 2 and WGS contigs were used in the assembly. The HTGS phase 3 sequence is assembled by hand into non-redundant contigs and the non-redundant sequence is used as an input to the assembly. A hand curated combined Tiling Path file was used to guide the assembly.
Assembly instructions for chromosomes 2,4,11 and X were provided as AGP files by the Sanger Institute. The Y chromosome was built by hand in collaboration with the Washington University Genome Sequencing Center and the Page lab at the Whitehead Institute for Biomedical Research. Currently, only the short arm of the Y has reliable mapping data, so most of the contigs on the Y chromosome are unplaced.
Graphical representation of the sequence status of each chromosome: Build 34

NCBI Mouse Build 33 represents a third generation composite assembly. Problems with build 32 led us to reconsider attempting to combine HTGS phase 1 and WGS sequence. In this build, HTGS phase 3, single fragment HTGS phase 2 and WGS contigs were used in the assembly. The HTGS phase 3 sequence is assembled into non-redundant contigs by hand and this non-redundant sequence is used as the input to the assembly. In addition, a single Tiling Path File was used for all chromosomes. This was hand curated to take information from the MGSCv3 and clone based files. The only exception to this is Mmu11. The assembly instructions for this chromosome were provided by the Sanger Institute as this chromosome is essentially finished.
Graphical representation of the sequence status of each chromosome: Build 33

NCBI Mouse build 32 represents a second generation composite assembly. Chromosomes were assembled using slightly different algorithms depending upon the available mapping data. Chromosomes 2,4,5,7,11,15,18,19,X and Y were assembled using a clone based Tiling Path File. Whole genome shotgun sequence was used to fill gaps as appropriate. Chromosomes 1,3,6,8,9,10,12,13,14,16 and 17 were assembled using the MGSCv3 as a tiling path and integrating HTGS sequence (both finished and draft) as appropriate.
Graphical representation of the sequence status of each chromosome: Build 32

NCBI Mouse build 30 represents the first composite assembly for mouse. In this instance, a composite assembly refers to constructing a composite genome using the MGSCv3 Whole Genome Shotgun assembly and HTGS sequence. As this was the first attempt at a composite assembly, a very conservative approach was taken.

Graphical representation of the sequence status of each chromosome: Build 30

I. Preparing the input data:

  • The WGS contigs (CAAA01000000) and HTGS sequences were masked for repetitive sequence using RepeatMasker.
  • The overlaps between HTGS phase 3 clones are hand curated to produce non-redundant contigs that are used in the assembly "as is".
  • megaBLAST is used to perform an all vs. all sequence comparison.

II. Producing the assembly:
Sequence overlaps are assessed using the following criteria:

  • Overlaps between phase 3 clones must be at least 95 bp and 99% ID.
  • Overlap tails between phase 3 sequences must be less than 50 bp.
  • Overlaps between 2 WGS sequences, or a WGS sequence and an HTGS sequence, will be considered if the percent identity is greater than 98.5% and the overlap tails are less than 1 Kb.
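
The overlap criteria above can be expressed as a small filter. This is a minimal sketch, not the actual NCBI implementation: the `Overlap` record, its field names, and the function name are hypothetical, "99% ID" is read as "at least 99% identity", and 1 Kb is taken as 1000 bp.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    """Hypothetical record for one pairwise overlap."""
    kind_a: str              # "phase3" or "wgs" (labels are assumptions)
    kind_b: str
    length_bp: int           # length of the aligned region
    percent_identity: float  # identity within the aligned region
    tail_bp: int             # longest unaligned overhang (overlap tail)


def overlap_passes(ov: Overlap) -> bool:
    """Apply the thresholds listed in the text to one overlap."""
    kinds = (ov.kind_a, ov.kind_b)
    if kinds == ("phase3", "phase3"):
        # Phase 3 vs phase 3: at least 95 bp, at least 99% identity,
        # and tails shorter than 50 bp.
        return (ov.length_bp >= 95
                and ov.percent_identity >= 99.0
                and ov.tail_bp < 50)
    if "wgs" in kinds:
        # WGS vs WGS, or WGS vs HTGS: identity above 98.5%
        # and tails shorter than 1 Kb.
        return ov.percent_identity > 98.5 and ov.tail_bp < 1000
    return False
```

Overlaps that pass this filter are only candidates; the stitching step described next applies further checks.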

Once all sequence overlaps are collected, then the HTGS phase 3 clones are "stitched" into the MGSCv3. This is done in the following manner:

  • The best placement of the phase 3 sequence is analyzed.
  • At the best place, at least 2 WGS contigs must have an alignment to the phase 3 sequence.
  • The number of overlapping bases in the best place must be at least 20% of the length of the BAC.
  • The number of overlapping bases in the best place must be at least 5 times greater than the number of WGS bases not involved in sequence alignments at the best place.
Figure 1.1

A. This picture depicts an example of an acceptable alignment, allowing an HTGS phase 3 sequence to be stitched into the MGSCv3. The green lines represent WGS contigs, while the blue line represents a phase 3 sequence. The thin, horizontal black lines represent gaps in the MGSCv3 scaffold, while the thin, vertical lines represent regions of sequence alignment between the phase 3 sequence and the WGS contigs. The arrow depicts a region of an alignment "tail": the very end of this contig does not align to the phase 3 sequence, as one might expect it to. This can occur because the sequence quality tends to be lower at the ends of sequences. It can also occur due to a small assembly error in either sequence.

B. This picture shows an example of an unacceptable alignment. Note the large amount of WGS sequence that does not align to the phase 3 sequence. This region will be removed for manual inspection and addressed in the next mouse release.

Figure 1.1 shows examples of acceptable and unacceptable phase 3 to WGS alignments.
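
The three numeric stitching criteria listed before Figure 1.1 can be checked together. The following is a sketch under stated assumptions: the function and parameter names are hypothetical, and the counts are assumed to have been computed already for the best placement of the phase 3 sequence.

```python
def can_stitch(n_aligned_wgs_contigs: int,
               overlapping_bases: int,
               bac_length: int,
               unaligned_wgs_bases: int) -> bool:
    """Return True if a phase 3 sequence meets the stitching criteria.

    All arguments describe the best placement of the phase 3 sequence.
    """
    # At least 2 WGS contigs must align to the phase 3 sequence.
    enough_contigs = n_aligned_wgs_contigs >= 2
    # Overlapping bases must be at least 20% of the BAC length
    # (written with integers: overlap * 5 >= length).
    enough_overlap = 5 * overlapping_bases >= bac_length
    # Overlapping bases must be at least 5 times the WGS bases
    # that are not involved in any alignment at this placement.
    mostly_aligned = overlapping_bases >= 5 * unaligned_wgs_bases
    return enough_contigs and enough_overlap and mostly_aligned
```

In terms of Figure 1.1, panel A corresponds to a placement where all three checks pass, while panel B would fail the last check because of the large amount of unaligned WGS sequence.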

Apart from the "stitching" of phase 3 sequence into the assembly, some WGS contigs were found to overlap one another. As a consequence, some gaps were removed and some WGS contigs from chrUn were placed on a chromosome. Phase 3 sequences that did not pass the alignment criteria above are listed as "unplaced on a chromosome".

In addition, phase 3 sequence from non-C57BL/6J sources was assembled into non-redundant contigs. These contigs are also annotated, and chromosome coordinates are determined relative to the reference (C57BL/6J) chromosomes when possible. These contigs and their annotation are available by selecting the 'Strain' map from the Maps&Options menu in the MapViewer.















MGSCv3
Sequence Reads

Figure 1: Individual plasmids from genomic libraries are chosen. The average insert size of the clones in each library is defined (2 kb, 4 kb, etc.). Both ends of the plasmid are sequenced (the common sequencing primers T7 and SP6 are shown in the figure). The two sequenced ends of a particular plasmid are known as "mate pairs".

The Mouse Genome Sequencing Consortium (MGSC) has produced greater than 6-fold sequence coverage of the mouse genome using a Whole Genome Shotgun (WGS) approach. To perform WGS sequencing, the genome of a C57BL/6J female was used as a substrate to make several different libraries. The fragments for these libraries were selected for various sizes. For the mouse project, libraries with fragment sizes of 2, 4, 6, 10, 12 and 40 kb were produced. Individual clones were chosen from each library, and both ends of the clone were sequenced (see figure 1). Sequence reads from opposite ends of the same clone are often referred to as "mate pairs". In this manner, over 40 million individual sequence reads were generated. In addition, BAC end sequences (BES), generated by TIGR from approximately 450,000 clones from the RPCI-23 and RPCI-24 BAC libraries, were added. The library and pairing information for each sequence read was also retained for later use.
Caveats:

  • The same end of a clone can be sequenced multiple times.
  • Data tracking errors can produce mis-pairing. That is, two sequences are labeled as being from opposite ends of the same clone, when they are not.
  • Low quality sequence.
  • Some clones only have one end sequenced.
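
Two of these caveats (repeated sequencing of the same end, and clones with only one end sequenced) can be detected from the read bookkeeping alone. The sketch below assumes a hypothetical input of `(read_id, clone_id, end)` tuples, where `end` is a label such as "T7" or "SP6"; none of these names come from the MGSC data formats.

```python
from collections import defaultdict


def summarize_clone_ends(reads):
    """Group reads by clone and classify each clone.

    reads: iterable of (read_id, clone_id, end) tuples (assumed layout).
    Returns three lists of clone ids: clones with both ends sequenced
    (usable mate pairs), clones with only one end sequenced, and clones
    where at least one end was sequenced more than once.
    """
    by_clone = defaultdict(lambda: defaultdict(list))
    for read_id, clone_id, end in reads:
        by_clone[clone_id][end].append(read_id)

    mate_pairs, single_end, resequenced = [], [], []
    for clone_id, ends in by_clone.items():
        if any(len(ids) > 1 for ids in ends.values()):
            resequenced.append(clone_id)
        if len(ends) == 2:
            mate_pairs.append(clone_id)
        else:
            single_end.append(clone_id)
    return mate_pairs, single_end, resequenced
```

Mis-pairing from data tracking errors and low quality sequence cannot be found this way; they only become visible later, when the reads are placed in an assembly.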

Contig Assembly










Figure 2: The sequences generated from the WGS libraries and the BES were compared to one another. When sequence overlap was detected at the end of reads (as shown above, commonly referred to as a "dove-tail" overlap), the two reads can be merged into a single sequence.

To obtain an assembled genome, all of the sequence reads were compared to each other. If two reads shared significant sequence overlap, they were merged to form a WGS contig. Figure 2 shows a simple example of two reads overlapping to form a contig. The actual depth of the reads will be related to the number of ends sequenced and the size of the genome. For mouse, the reads were on average 6 fold deep. These contigs have been submitted to GenBank/EMBL/DDBJ and received accession numbers that look like CAAA01XXXXXX. CAAA is the accession prefix, 01 represents the version of the assembly, and the individual contigs are numbered starting at 1 (CAAA01000001). This assembly is referred to as version 3 because the MGSC did two assemblies previous to this one, but it is the first mouse WGS assembly submitted to the public repositories.
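
The accession layout described above (a four-letter prefix, a two-digit assembly version, and a six-digit contig number) can be split apart mechanically. This is an illustrative sketch: the function name is hypothetical and the pattern covers only the CAAA01XXXXXX form described here, not every accession format in the public databases.

```python
import re

# Four letters, two version digits, six contig digits (as described above).
WGS_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession: str):
    """Split an accession such as CAAA01000001 into its three parts."""
    match = WGS_ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not a WGS contig accession of this form: {accession!r}")
    prefix, version, contig_number = match.groups()
    return prefix, int(version), int(contig_number)
```

For example, `parse_wgs_accession("CAAA01000001")` yields the prefix `"CAAA"`, assembly version 1, and contig number 1.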
Caveats:

  • Overlap is not a true dove tail. This can happen because of repeats or because of low quality sequence at the end of the read. Often, a small tail in the overlap is allowed.
  • Repeats in the genome can complicate the determination of a true overlap.
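
The idea of a dove-tail overlap can be illustrated with a deliberately simplified merge. This sketch requires an exact match between the end of one read and the start of the next, which real assemblers do not: they tolerate mismatches, low quality ends, and small tails, as noted in the caveats. The function name and the `min_overlap` default are assumptions made for the example.

```python
def dovetail_merge(left: str, right: str, min_overlap: int = 20):
    """Merge two reads if a suffix of `left` equals a prefix of `right`.

    Tries the longest possible overlap first and returns the merged
    sequence, or None if no exact overlap of at least `min_overlap`
    bases exists.
    """
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            # Keep `left` whole and append the non-overlapping part of `right`.
            return left + right[k:]
    return None
```

With `min_overlap=4`, the reads `ACGTACGTTTGA` and `TTGACCA` share the four bases `TTGA` and merge into `ACGTACGTTTGACCA`. A short repeat that happens to sit at both read ends would be merged just as readily, which is why repeats complicate overlap detection.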


Building Supercontigs (scaffolds)


Figure 3: By using the mate pair information, contigs can be ordered and oriented relative to each other. In the example above, contig 1 and contig 2 share mate pairs from three different clones. This is strong evidence that the sequences contained in these contigs are close to one another. The distance between the two contigs can be estimated based on the average size of the library from which the clones were chosen. The merged supercontig sequence will contain the sequence of contig 1 and contig 2, separated by Ns. The number of Ns chosen will reflect the gap estimate between the two contigs.

The next stage of the assembly involves using mate-pair information to build supercontigs (sometimes referred to as scaffolds).
By knowing the location of a sequence read within the assembly, one can predict where its mate pair should lie. The assemblers used this mate-pair information to order and orient WGS contigs with respect to each other (Figure 3). An appropriate number of Ns (based on the plasmid size estimates) was inserted between linked WGS contigs to generate supercontigs.
It is important to sequence clones representing a variety of insert sizes. The smaller insert libraries (2-10 Kb) were useful for local assembly, while the large insert libraries (40 Kb and BES) were useful for long range linking.
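
The gap estimate and N padding can be sketched as follows. The arithmetic is an assumption made for illustration, not the method used by the MGSC assemblers: each linking clone is taken to span from the start of its read in contig 1 to the end of its mate in contig 2, so the gap is the library insert size minus the two portions that lie inside the contigs, averaged over all links. All names are hypothetical.

```python
def estimate_gap(len_contig1: int, links) -> int:
    """Estimate the gap between two linked contigs.

    links: list of (insert_size, start_in_contig1, end_in_contig2), where
    start_in_contig1 is the 0-based start of the read in contig 1 and
    end_in_contig2 is the end coordinate of its mate in contig 2.
    """
    estimates = [
        insert_size - (len_contig1 - start1) - end2
        for insert_size, start1, end2 in links
    ]
    # Average over all links; never report a negative gap.
    return max(0, round(sum(estimates) / len(estimates)))


def build_supercontig(contig1: str, contig2: str, links) -> str:
    """Join two contigs with a run of Ns reflecting the gap estimate."""
    gap = estimate_gap(len(contig1), links)
    return contig1 + "N" * gap + contig2
```

Because the estimate depends directly on the assumed insert size, a library with a wide size distribution gives a correspondingly uncertain gap.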
Caveats:

  • Mispairing (as described above) can lead to inappropriate linking. Contigs joined by a single clone should be viewed with caution.
  • Duplication in the genome can also complicate linking. A distribution of the expected number of links can be produced based on the clone coverage of the genome. Contigs that share too many links are often discarded.
  • If the size selection of the libraries shows a wide distribution, estimation of gap sizes will be difficult.

Placing the assembly on the genome
The final part of the MGSCv3 assembly placed the supercontigs onto specific chromosomes using the WIBR Genetic Map. Genetic markers that have unambiguous genotypes were located within the supercontigs. After the supercontigs had been placed on the chromosome, approximately 50 Mb of finished BAC sequence was integrated into the final MGSCv3 assembly.
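
One simple way to picture marker-based placement is shown below. This is a hypothetical sketch, not the MGSC procedure: it assumes each supercontig comes with a list of `(chromosome, genetic_position)` marker hits, places a supercontig only if all of its markers agree on one chromosome, and orders supercontigs by the median genetic position of their markers.

```python
from statistics import median


def place_supercontigs(marker_hits):
    """Assign supercontigs to chromosomes using genetic marker hits.

    marker_hits: {supercontig_id: [(chromosome, genetic_position), ...]}
    Returns (order, unplaced), where order maps each chromosome to its
    supercontig ids sorted along the genetic map, and unplaced lists
    supercontigs with no markers or with markers on several chromosomes.
    """
    placed = {}
    unplaced = []
    for supercontig, hits in marker_hits.items():
        chromosomes = {chrom for chrom, _ in hits}
        if len(chromosomes) != 1:
            unplaced.append(supercontig)
            continue
        chrom = chromosomes.pop()
        position = median(pos for _, pos in hits)
        placed.setdefault(chrom, []).append((position, supercontig))
    order = {
        chrom: [sc for _, sc in sorted(items)]
        for chrom, items in placed.items()
    }
    return order, unplaced
```

Supercontigs returned as unplaced would correspond to sequence that cannot be assigned to a chromosome from the genetic map alone.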

Annotating the assembled genomic sequence
In order to annotate the MGSCv3, NCBI has made reference sequences (RefSeqs) from the supercontigs. These have been given accession numbers of the type NW_XXXXXX. The sequences were put through the NCBI annotation pipeline, and features such as gene models, STSs, and variation were added.

WGS assemblers
Arachne
Phusion

References
Weber JL, Myers EW. Human whole-genome shotgun sequencing. Genome Res 1997; 7(5):401-409.
Venter JC et al. The sequence of the human genome. Science 2001; 291(5507):1304-51.
Batzoglou S et al. ARACHNE: a whole-genome shotgun assembler. Genome Res 2002; 12(1):177-89.
Waterston et al. Initial sequencing and comparative analysis of the mouse genome. Nature 2002; 420(6915):520-62.
Mullikin JC, Ning Z. The phusion assembler. Genome Res 2003; 13(1):81-90.

 



Finished BACs
In addition to annotating the MGSCv3 assembly, NCBI constructed contigs from HTGS phase 3 (finished sequence). The construction of these contigs was straightforward and involved identifying dove-tail sequence overlap between finished BAC clones. These contigs were made into RefSeqs and given accession numbers of the type NT_XXXXXX. These sequences were then annotated using the NCBI annotation pipeline.
In build 27, NT_XXXXXX contigs can be composed of BAC clones that were derived from different strains. Initially, this was done to increase coverage since there was so little finished sequence. In future releases, NT_XXXXXX contigs will be restricted to a single strain. In addition, future releases will begin integrating finished and draft sequence as well as WGS contigs.

This page last updated: November 27, 2006