Assembling Genomic Sequences

Making a composite assembly:

NCBI Mouse Build 36 represents a highly polished product. It is composed largely of finished sequence (a graph of sequence composition is available). HTGS phase 3 sequence is assembled by hand into non-redundant contigs, and the non-redundant sequence is used as an input to the assembly. A hand-curated combined Tiling Path file was used to guide the assembly. All non-finished portions of the assembly have been hand curated to determine whether to include WGS sequence, include HTGS sequence, or leave a gap. The Y chromosome was built by hand in collaboration with the Washington University Genome Sequencing Center and the Page lab at the Whitehead Institute for Biomedical Research. Currently, only the short arm of the Y has reliable mapping data, so most of the contigs on the Y chromosome are unplaced.

Graphical representation of the sequence status of each chromosome: Build 36

NCBI Mouse Build 35 represents a fifth generation composite assembly. In this build, chromosomes 1, 3, 5, 6, 7, 8, 9, 10, and 12-19 were automatically assembled. As in build 34, only HTGS phase 3, single fragment HTGS phase 2 and WGS contigs were used in the assembly. The HTGS phase 3 sequence is assembled by hand into non-redundant contigs, and the non-redundant sequence is used as an input to the assembly. A hand-curated combined Tiling Path file was used to guide the assembly. Assembly instructions for chromosomes 2, 4, 11 and X were provided as AGP files by the Sanger Institute. The Y chromosome was built by hand in collaboration with the Washington University Genome Sequencing Center and the Page lab at the Whitehead Institute for Biomedical Research. Currently, only the short arm of the Y has reliable mapping data, so most of the contigs on the Y chromosome are unplaced.

Graphical representation of the sequence status of each chromosome: Build 35

NCBI Mouse Build 34 represents a fourth generation composite assembly. In this build, chromosomes 1, 3, 5, 6, 7, 8, 9, 10, and 12-19 were automatically assembled. As in build 33, only HTGS phase 3, single fragment HTGS phase 2 and WGS contigs were used in the assembly. The HTGS phase 3 sequence is assembled by hand into non-redundant contigs, and the non-redundant sequence is used as an input to the assembly. A hand-curated combined Tiling Path file was used to guide the assembly. Assembly instructions for chromosomes 2, 4, 11 and X were provided as AGP files by the Sanger Institute. The Y chromosome was built by hand in collaboration with the Washington University Genome Sequencing Center and the Page lab at the Whitehead Institute for Biomedical Research. Currently, only the short arm of the Y has reliable mapping data, so most of the contigs on the Y chromosome are unplaced.

Graphical representation of the sequence status of each chromosome: Build 34

NCBI Mouse Build 33 represents a third generation composite assembly. Problems with build 32 led us to reconsider attempting to combine HTGS phase 1 and WGS sequence. In this build, HTGS phase 3, single fragment HTGS phase 2 and WGS contigs were used in the assembly. The HTGS phase 3 sequence is assembled into non-redundant contigs by hand, and this non-redundant sequence is used as the input to the assembly. In addition, a single Tiling Path File was used for all chromosomes. This file was hand curated to take information from the MGSCv3 and clone-based files. The only exception is Mmu11. The assembly instructions for this chromosome were provided by the Sanger Institute, as this chromosome is essentially finished.

Graphical representation of the sequence status of each chromosome: Build 33

NCBI Mouse Build 32 represents a second generation composite assembly. Chromosomes were assembled using slightly different algorithms depending upon available mapping data. Chromosomes 2, 4, 5, 7, 11, 15, 18, 19, X and Y were assembled using a clone-based Tiling Path File. Whole genome shotgun sequence was used to fill gaps as appropriate. Chromosomes 1, 3, 6, 8, 9, 10, 12, 13, 14, 16 and 17 were assembled using the MGSCv3 as a tiling path and integrating HTGS sequence (both finished and draft) as appropriate.

Graphical representation of the sequence status of each chromosome: Build 32

NCBI Mouse Build 30 represents the first composite assembly for mouse. In this instance, a composite assembly refers to constructing a composite genome using the MGSCv3 Whole Genome Shotgun assembly and HTGS sequence. As this was the first attempt at a composite assembly, a very conservative approach was taken.

Graphical representation of the sequence status of each chromosome: Build 30
I. Preparing the input data:
II. Producing the assembly:
Once all sequence overlaps are collected, the HTGS phase 3 clones are "stitched" into the MGSCv3. This is done in the following manner:
Figure 1.1 shows examples of acceptable and unacceptable phase 3 to WGS alignments. In addition to "stitching" phase 3 sequence into the assembly, some WGS contigs were found to overlap. As a consequence, some gaps were removed and some WGS contigs from chrUn were placed on a chromosome. Phase 3 sequence that did not pass the alignment criteria above is listed as "unplaced on a chromosome". In addition, phase 3 sequence from non-C57BL/6J sources was assembled into non-redundant contigs. These contigs are also annotated, and chromosome coordinates are determined relative to the reference (C57BL/6J) chromosomes when possible. These contigs and their annotation are available by selecting the 'Strain' map from the Maps&Options menu in the MapViewer.
MGSCv3 Sequence Reads
The Mouse Genome Sequencing Consortium (MGSC)
has produced greater than 6-fold sequence coverage of the mouse genome
using a Whole Genome Shotgun (WGS)
approach. To perform WGS sequencing, the genome of a C57BL/6J female was
used as a substrate to make several different libraries. The fragments
for these libraries were selected for various sizes. For the mouse project,
libraries with fragment sizes of 2, 4, 6, 10, 12 and 40 kb were produced.
Individual clones were chosen from each library, and both ends of the
clone were sequenced (see figure 1). Sequence reads from opposite ends
of the same clone are often referred to as "mate pairs". In
this manner, over 40 million individual sequence reads were generated.
In addition, BAC end sequences
(BES), generated by TIGR
from approximately 450,000 clones from the RPCI-23 and RPCI-24 BAC libraries,
were added. The library and pairing information for each sequence read
was also retained for later use.
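
As a rough illustration of what retaining library and pairing information can look like, the sketch below groups end reads into mate pairs by the clone they came from. The `Read` record, its field names, and the `pair_mates` helper are hypothetical and chosen only for this example; they do not describe the MGSC's actual data format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical record for one end read (field names are assumptions)."""
    read_id: str
    clone_id: str    # clone the read was sequenced from
    library_kb: int  # nominal fragment size of the source library, in kb
    end: str         # "F" or "R": which end of the clone was sequenced


def pair_mates(reads):
    """Group reads by clone and return (forward, reverse) mate pairs."""
    by_clone = defaultdict(dict)
    for read in reads:
        by_clone[read.clone_id][read.end] = read
    # Keep only clones for which both ends were sequenced.
    return [
        (ends["F"], ends["R"])
        for ends in by_clone.values()
        if "F" in ends and "R" in ends
    ]


reads = [
    Read("r1", "clone1", 4, "F"),
    Read("r2", "clone1", 4, "R"),
    Read("r3", "clone2", 40, "F"),  # mate missing, so no pair is formed
]
pairs = pair_mates(reads)
```

Keeping the library size next to each pair is what would later let an assembler estimate how far apart the two mates should lie.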
Contig Assembly
To obtain an assembled genome, all of the sequence reads were compared
to each other. If two reads shared significant sequence overlap, they
were merged to form a WGS
contig. Figure 2 shows a simple example of two reads overlapping to form
a contig. The actual depth of the reads will be related to the number
of ends sequenced and the size of the genome. For mouse, the reads were
on average 6-fold deep. These contigs have been submitted to GenBank/EMBL/DDBJ
and received accession numbers
of the form CAAA01XXXXXX. CAAA is the accession prefix, 01 is the
version of the assembly, and the individual contigs
are numbered starting at 1 (CAAA01000001). This assembly is referred to
as version 3 because the MGSC produced two earlier assemblies,
but it is the first mouse WGS assembly submitted to the public repositories.
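
The accession layout described above can be taken apart with a short parser. This is a minimal sketch that assumes exactly the shape shown in the example (a four-letter prefix, a two-digit assembly version, and a six-digit contig number); the `WgsAccession` type and `parse_wgs_accession` function are names invented for illustration.

```python
import re
from typing import NamedTuple


class WgsAccession(NamedTuple):
    prefix: str         # e.g. "CAAA"
    version: int        # assembly version, e.g. 1 for "01"
    contig_number: int  # contig serial number, starting at 1


# Assumed layout, based on the CAAA01XXXXXX example:
# 4 letters + 2 digits + 6 digits.
_WGS_PATTERN = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession: str) -> WgsAccession:
    """Split a WGS contig accession into prefix, version and contig number."""
    match = _WGS_PATTERN.match(accession)
    if match is None:
        raise ValueError(f"not a WGS contig accession: {accession!r}")
    prefix, version, number = match.groups()
    return WgsAccession(prefix, int(version), int(number))


first_contig = parse_wgs_accession("CAAA01000001")
```

Here `first_contig` holds the prefix `CAAA`, version 1 and contig number 1, matching the first contig named in the text.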
The next stage of the assembly involves using mate-pair information to
build supercontigs (sometimes referred to as scaffolds).
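
The read-merging step described under Contig Assembly can be illustrated with a toy example. The sketch below joins two reads when the end of one exactly matches the start of the other. It is a simplification for illustration, not the MGSC's algorithm: real assemblers also tolerate sequencing errors and consider reverse-complement overlaps, and the `min_overlap` threshold here is an arbitrary assumption.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_overlap` bases exists.
    """
    longest_possible = min(len(left), len(right))
    for n in range(longest_possible, min_overlap - 1, -1):
        if left[-n:] == right[:n]:
            return n
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two overlapping reads into one contig, or return None."""
    n = overlap_length(left, right, min_overlap)
    if n == 0:
        return None
    # Keep `left` whole and append the part of `right` past the overlap.
    return left + right[n:]


contig = merge_reads("ACGTTGCA", "TGCAGGTC", min_overlap=4)
print(contig)  # ACGTTGCAGGTC
```

The two example reads share the four bases `TGCA`, so they merge into a single 12-base contig; reads with no sufficient overlap are left unmerged.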
Placing the assembly on the genome
Annotating the assembled genomic sequence
Finished BACs
In addition to annotating the MGSCv3 assembly, NCBI constructed contigs from HTGS phase 3 (finished sequence). The construction of these contigs was straightforward and involved identifying dovetail sequence overlap between finished BAC clones. These contigs were made into RefSeqs and given accession numbers of the type NT_XXXXXX. These sequences were then annotated using the NCBI annotation pipeline. In build 27, NT_XXXXXX contigs can be composed of BAC clones that were derived from different strains. Initially, this was done to increase coverage, since there was so little finished sequence. In future releases, NT_XXXXXX contigs will be restricted to a single strain. In addition, future releases will begin integrating finished and draft sequence as well as WGS contigs.