Bioinformatics. Aug 15, 2009; 25(16): 2071–2073.
Published online Jun 10, 2009. doi:  10.1093/bioinformatics/btp356
PMCID: PMC2723005

Reordering contigs of draft genomes using the Mauve Aligner

Abstract

Summary: Mauve Contig Mover provides a new method for proposing the relative order of contigs that make up a draft genome based on comparison to a complete or draft reference genome. A novel application of the Mauve aligner and viewer provides an automated reordering algorithm coupled with a powerful drill-down display allowing detailed exploration of results.

Availability: The software is available for download at http://gel.ahabs.wisc.edu/mauve.

Contact: rissman@wisc.edu

Supplementary information: Supplementary data are available at Bioinformatics online and http://gel.ahabs.wisc.edu

1 INTRODUCTION

New high-throughput technologies have greatly reduced the cost of genome sequencing, leading to an abundance of draft-quality genome sequences that may be composed of hundreds or thousands of contigs. Ordering and orienting these contigs into larger units (scaffolds or supercontigs) facilitates genome closure and comparative analyses. Contigs can be ordered based on additional data, such as the presence of discontinuous portions of the same sequencing template (clone or fragment) in two contigs, but this type of information is not available for all projects. However, even without additional data, contig order can be predicted by comparison with a reference genome that is expected to have conserved genome organization.

We present a new method for comparative contig ordering based on iterative genome alignment using Mauve. The reference used may be draft quality itself, or may have divergent genetic content. The Mauve aligner has been used extensively for microbial genome comparisons because it effectively identifies and aligns homologous regions even if genomes have undergone rearrangements, large insertions or deletions, and substantial sequence divergence. Mauve Contig Mover (MCM) provides advantages over methods that rely on matches in limited regions near the ends of contigs, require anchors at both ends of contigs, force users to exclude lineage-specific sequences at contig boundaries, or are unable to resolve which, if any, copies of repeated sequences are consistent with more extensive collinearity (Darling et al., 2004; Richter et al., 2007; van Hijum et al., 2005). An interactive full-genome alignment display shows the relative order of the contigs as well as potential gaps in sequence coverage and regions of possible rearrangement or misassembly. After reordering, Mauve is a useful platform for further detailed comparative sequence analysis that is often the motivation for the sequencing effort itself.

2 METHODS

The Mauve aligner filters and sorts internally identified matches into locally collinear blocks (LCBs). Each LCB represents a region of homologous sequence without rearrangement among the input genomes. Each LCB must be separated from the next by rearrangement in at least one genome (Darling et al., 2004). Contig boundaries (edges) represent potentially artificial LCB edges. Therefore, finding the contig order that minimizes the number of LCBs caused by contig edges is equivalent to finding a likely contig order.
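A minimal sketch of this objective follows. It is not the MCM implementation: it assumes each contig has already been reduced to a list of signed reference segment ids (a hypothetical simplification of real LCBs, numbered from 1, negative when reverse-complemented), and it simply counts how many collinear blocks a proposed contig order and orientation would produce.

```python
def layout(contigs, order):
    """Concatenate contigs in the given order; `order` is a list of (name, flipped)."""
    seq = []
    for name, flipped in order:
        segs = contigs[name]
        if flipped:
            # Reverse-complementing a contig reverses its segments and flips their signs.
            segs = [-s for s in reversed(segs)]
        seq.extend(segs)
    return seq


def count_blocks(signed_segments):
    """Count maximal collinear runs in a signed sequence of reference segment ids."""
    if not signed_segments:
        return 0
    blocks = 1
    for prev, cur in zip(signed_segments, signed_segments[1:]):
        # +i followed by +(i+1), or -(i+1) followed by -i, continues the same block.
        if cur != prev + 1:
            blocks += 1
    return blocks


# Toy draft genome: three contigs, the third on the opposite strand.
contigs = {"c1": [3, 4], "c2": [1, 2], "c3": [-6, -5]}

as_assembled = [("c1", False), ("c2", False), ("c3", False)]
proposed = [("c2", False), ("c1", False), ("c3", True)]

print(count_blocks(layout(contigs, as_assembled)))  # blocks in the original order
print(count_blocks(layout(contigs, proposed)))      # blocks after reordering
```

In this toy case the proposed order removes every block boundary that was caused only by a contig edge, which is the quantity the text describes minimizing.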

Using the Mauve alignment LCBs, the reordering process occurs in three steps: placing contigs with no apparent conflict in ordering information, placing contigs with conflicting information into intermediary anchor positions, and finally matching LCB ends that extend to contig boundaries. Each step occurs in at most O(n²) time, where n is the number of LCBs, plus the time required for alignment (Darling et al., 2004). Mauve assumes contigs are in the correct order when filtering matches, so as the order is optimized, alignment results change. Therefore, results are refined through iterative alignment until no further ordering is possible.
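The iterate-until-stable control flow can be sketched as a fixed-point loop. Everything below is an illustrative assumption: `refine` and `bubble_pass` are hypothetical names, and the toy step (one pass of adjacent swaps by a made-up reference anchor coordinate) merely stands in for a real align-then-reorder iteration, which this sketch does not attempt to reproduce.

```python
def refine(order, step, max_iters=100):
    """Apply `step` repeatedly until the order stops changing.

    Returns the list of orders seen, one per iteration, mirroring the idea
    of keeping one result per reordering iteration.
    """
    history = [list(order)]
    for _ in range(max_iters):
        new_order = step(history[-1])
        if new_order == history[-1]:
            break  # no further reordering is possible
        history.append(new_order)
    return history


def bubble_pass(anchor):
    """Build a toy step: one left-to-right pass of adjacent swaps by anchor position."""
    def step(order):
        out = list(order)
        for i in range(len(out) - 1):
            if anchor[out[i]] > anchor[out[i + 1]]:
                out[i], out[i + 1] = out[i + 1], out[i]
        return out
    return step


# Hypothetical anchor coordinates of four contigs on the reference.
anchor = {"a": 400, "b": 300, "c": 200, "d": 100}
history = refine(["a", "b", "c", "d"], bubble_pass(anchor))

for i, order in enumerate(history):
    print(i, order)
```

Each pass improves the order only partially, so several iterations are needed before the result stabilizes, which is the behavior the paragraph above attributes to repeated realignment.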

MCM outputs a series of Mauve alignments, each representing an iteration of the reordering. In addition to the standard Mauve output, the reorder process produces a FASTA file containing the contigs in their new order and orientation, as well as a list of ordered contigs including name and coordinate location. The standard Mauve visualization can be applied in novel ways to analyze contig order. For example, we have used it to identify potential misassemblies in contigs, and to evaluate the presence or absence of genes split by contig boundaries or by rearrangements. If FASTA files representing the orders produced by other programs are created, Mauve can also be used to compare results, as in Supplementary Figure 1. Furthermore, annotations from GenBank format input can still be viewed after reordering.
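As a rough illustration of what a reordered, reoriented FASTA file involves, the sketch below reverse-complements flipped contigs and emits records in a given order. The function names, the in-memory record format, and the " reversed" header suffix are assumptions made for this example; they do not describe MCM's actual output format.

```python
# Translation table for complementing unambiguous DNA bases (and N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def reordered_fasta(records, order, width=60):
    """Render contigs as FASTA text in a new order and orientation.

    records: dict mapping contig name -> sequence
    order:   list of (name, flipped) pairs
    """
    lines = []
    for name, flipped in order:
        seq = records[name]
        if flipped:
            seq = reverse_complement(seq)
        # Hypothetical header convention: mark flipped contigs in the description.
        lines.append(f">{name} reversed" if flipped else f">{name}")
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


records = {"c1": "AAAC", "c2": "GGT"}
print(reordered_fasta(records, [("c2", True), ("c1", False)]), end="")
```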

3 RESULTS

We have used MCM to order contigs for a variety of bacterial genome projects based on comparison to the single best reference sequence available, and show some of our results in Table 1. These projects include draft genomes assembled from Sanger sequencing as well as from the short-read technologies developed by 454 and Illumina, with the most fragmented example involving a 5 Mb genome with more than 1000 contigs. The draft and reference genome combinations selected include comparisons of genomes from different species, the same species, different strains and different assemblies of the same genome. Many of the draft genomes available through the Enteropathogen Resource Integration Center (Glasner et al., 2008a) and the ASAP database (Glasner et al., 2006) have been ordered using MCM. Examples are available as Supplementary Material on our web site. Supplementary Figure 1 shows a Yersinia pestis strain FV1 draft genome (Touchman et al., 2007) reordered and aligned to the complete Y. pestis CO92 reference genome. MCM was able to order 356 out of 400 contigs (4,211,103 out of 4,472,646 bp), reducing the alignment from 359 to 11 LCBs. Supplementary Figure 1 also shows the utility of the Mauve Viewer for comparing different suggested contig orders.
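The Y. pestis FV1 figures quoted above can be turned into proportions with a few lines of arithmetic. The counts are taken from the text; the variable and function names are only illustrative.

```python
# Counts reported in the text for the Y. pestis FV1 reorder.
contigs_ordered, contigs_total = 356, 400
bp_ordered, bp_total = 4_211_103, 4_472_646
lcbs_before, lcbs_after = 359, 11


def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


contig_pct = percent(contigs_ordered, contigs_total)
bp_pct = percent(bp_ordered, bp_total)
lcb_fold = lcbs_before / lcbs_after

print(f"contigs ordered: {contig_pct:.1f}%")
print(f"bases ordered:   {bp_pct:.1f}%")
print(f"LCB reduction:   {lcb_fold:.1f}-fold")
```

The fraction of bases placed is higher than the fraction of contigs placed, which suggests that the unplaced contigs are shorter than average.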

Table 1. Summary of results of Mauve Contig Mover reorders

We urge caution in interpretation of contig order predicted using MCM or any other algorithm. Many true bacterial genome rearrangements occur at repetitive sequences, which pose challenges for both genome assembly and alignment. Ordering contigs based on comparative analyses can mask true rearrangements anchored in repeats at these contig breaks. Annotations of complete and partial repeats can be viewed in the Mauve alignment display, providing a means of identifying such regions. Conversely, misassemblies can appear as false rearrangements. Mauve clearly displays these regions, and PCR primers may be designed that allow verification of the rearrangement or proof of the misassembly. The Mauve Viewer also allows exploration of alternate positions for contigs with multiple LCBs. Comparing the number of LCBs between reference and draft with the number between other genomes expected to show similar levels of rearrangement can provide an estimate of this effect. Table 2 summarizes reorders without modeling these effects, showing accuracy between 90.4% and 99.4%. Generally, closer alignments will provide more accurate reorders covering more of the draft. Because MCM maximizes collinearity among the genome sequences under comparison, it produces alignments that are easily visualized and provides an excellent platform for analysis and finishing of draft genomes.

Table 2. Overview of percents ordered and correctly ordered

Supplementary Material

[Supplementary Data]

ACKNOWLEDGEMENTS

The authors thank Eric Cabot and Michael Cox (UW-Madison), John Battista (LSU), and the UW Biotechnology Center for sequence data and analyses of radiation-resistant E. coli strains (NIH grant #GM067085).

Funding: NIH-NIGMS Award #GM62994 and NSF Award #0412599 (to N.T.P.); Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN266200400040C for the Enteropathogen Resource Integration Center.

Conflict of Interest: none declared.

REFERENCES

  • Blattner FR, et al. The complete genome sequence of Escherichia coli K-12. Science. 1997;277:1453–1474.
  • Darling AC, et al. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004;14:1394–1403.
  • Deng W, et al. Genome sequence of Yersinia pestis KIM. J. Bacteriol. 2002;184:4601–4611.
  • Glasner JD, et al. ASAP: a resource for annotating, curating, comparing, and disseminating genomic data. Nucleic Acids Res. 2006;34:D41–D45.
  • Glasner JD, et al. Enteropathogen Resource Integration Center (ERIC): bioinformatics support for research on biodefense-relevant enterobacteria. Nucleic Acids Res. 2008a;36:D519–D523.
  • Glasner JD, et al. Niche-specificity and the variable fraction of the Pectobacterium Pan-Genome. Mol. Plant-Microbe Interact. 2008b;21:1549–1560.
  • Hayashi T, et al. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res. 2001;8:11–22.
  • Parkhill J, et al. Genome sequence of Yersinia pestis the causative agent of plague. Nature. 2001;413:523–527.
  • Perna NT, et al. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature. 2001;409:529–533.
  • Richter DC, et al. OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics. 2007;23:1573–1579.
  • Toth IK, et al. Genome sequence of the enterobacterial phytopathogen Erwinia carotovora subsp. atroseptica and characterization of virulence factors. Proc. Natl Acad. Sci. USA. 2004;101:11105–11110.
  • Touchman JW, et al. A North American Yersinia pestis draft genome sequence: SNPs and phylogenetic analysis. PLoS ONE. 2007;2:e220.
  • van Hijum SA, et al. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005;33:W560–W566.

Articles from Bioinformatics are provided here courtesy of Oxford University Press