![]() | ![]() |
Formats:
|
||||||||||||
Copyright © 2003 Oxford University Press Projector: automatic contig mapping for gap closure purposes Molecular Genetics, University of Groningen, Groningen Biomolecular Sciences and Biotechnology Institute, PO Box 14, 9750 AA Haren, The Netherlands *To whom correspondence should be addressed. Tel: +31 50 3632408; Fax: +31 50 363 2348; Email: s.a.f.t.van.hijum/at/biol.rug.nl Received July 2, 2003; Revised July 24, 2003; Accepted September 25, 2003. This article has been cited by other articles in PMC.Abstract Projector was designed for automatic positioning of contigs from an unfinished prokaryotic genome onto a template genome of a closely related strain or species. Projector mapped 84 contigs of Lactococcus lactis MG1363 (corresponding to 81% of the assembly nucleotides) against the genome of L.lactis IL1403. Ninety three percent of subsequent gap closure PCRs were successful. Moreover, a significant improvement in the N50 and N80 values (describing the assembly quality) was observed after the use of Projector. Because increasing numbers of bacterial genomes are being sequenced, Projector provides an efficient method to close a significant number of remaining gaps in the late stages of a genome sequencing project. INTRODUCTION Genome sequences emerge in ever-increasing numbers. As of June 2003, 112 finished and 128 unfinished microbial genomes have been reported in the NCBI genome database (http://www.ncbi.nlm.nih.gov/). These numbers exclude genomes that were sequenced by private companies. A consortium consisting of the Microbiology Department of University College (Cork, Ireland), the Institute of Food Research (Norwich, UK) and the Molecular Genetics Department of the University of Groningen (Groningen, The Netherlands) is sequencing the genome of the food-grade lactic acid bacterium Lactococcus lactis subsp. cremoris MG1363 by a random shotgun sequencing strategy (1,2). This methodology implies that under-represented clones require more sequence runs, which in turn leads to ‘over-sequencing’ of over-represented clones. In many cases clones overlap, yielding enlarged sequences (contigs) that are scattered over different parts of the genome. In order to position these contigs with respect to each other the scaffolding technique can be used. Smaller contigs are linked according to their position on larger DNA fragments (large insert genomic library, for instance in phage and cosmid banks or bacterial artificial chromosomes) of which only the ends are sequenced. Other types of linkage information might be obtained by using mapping by microarray analysis (3), restriction fingerprints of large insert genomic clones (4) or the so-called ‘slalom libraries’ (5). The gaps between these linked contigs are closed by subsequent PCR strategies. When these methods are exhausted, physical gaps between contigs or scaffolds remain. These are closed by, for example, inverse PCR, anchored PCR or multiplex PCR (6). The above-mentioned procedure results in an average coverage of a genome of about 8-fold, making sequencing of a genome rather expensive and time consuming. A significant reduction in time and costs of a sequencing project could be realized by using primer pairs in specific PCRs. This would require prior knowledge about the position and orientation of the contigs on the genome. The fact that relatedness between organisms frequently results in structural conservation of genomic regions (7,8) and in conservation within genes might aid in the positioning of contigs. Because quite a number of genomes have been sequenced and even more genomes are in an unfinished stage, the chances increase of matching an unfinished genome sequence with a finished genome sequence that is comparable in structure and nucleotide sequence. PGAAS (9) was developed to identify contigs that end in the same open reading frame (ORF). Its use is limited because (i) in many cases contigs end with repetitive elements rendering this method inadequate and (ii) in only a very few cases do two contigs end in the same ORF. The genome alignment tool MUMmer (10,11) may also be used to obtain additional contig linkage information. Because of the algorithm used, in some cases contigs are positioned at several places on the genome, requiring contig mappings to be inspected manually. The Projector software was developed to provide an automated method to efficiently use information from increasing numbers of finished genomes for ‘smart’ gap closure by PCR. The Projector software uses a finished genome sequence (template genome) of comparable nucleotide sequence and structure as a template to position and orient contigs of an unfinished genome sequence (target genome). The positioned contigs allow prediction of sequence gap locations and sizes. Projector was tested by positioning Lactococcus lactis MG1363 contigs (target) on the genome of L.lactis IL1403 (template). Over 90% success rate was obtained in PCRs for 83 gap-closing primer pairs suggested by Projector. Projector also successfully mapped significant numbers of contigs from unfinished genome sequences from various other bacteria. We show that by using a more similar template genome, the probability that contigs of the target genome are successfully mapped increases significantly. MATERIALS AND METHODS System requirements Projector runs on a Linux platform and only requires a locally installed Blastall program (ftp://ftp.ncbi.nih.gov/blast/). The version of Blastall used in this study was 2.2.3. Projector consists of eight sub-programs written in Pascal and compiled by FreePascal 1.0.6 (http://www.freepascal.org/) under Red Hat Linux release 7.2 (http://www.redhat.com). The sub-programs are linked by a shell script. All relevant Projector settings can be entered in this shell script. The system on which Projector was tested was a dual Intel Pentium III 667 Mhz with 2 Gb RAM memory. Projector had modest computer requirements for all runs: a typical mapping run took ~7–10 min using at most 30 Mb of physical RAM memory. Blast searches took ~6 min and the Projector sub-programs ~30 s to run. In general, the files used by the Projector sub-programs are either of the FASTA type or of comma-delimited text file format. By using these file types, the output of each step can be checked manually if need be by using, for example, Microsoft Excel. Mapping procedure A typical Projector run is presented in Figure Figure1.1
PCR conditions PCRs were performed with the Extensor Hi-Fidelity PCR Enzyme Mix (AB-Gene, Epsom, UK) according to the manufacturer’s instructions using an iCycler Thermal Cycler (Bio-Rad Laboratories, Hercules, CA). Lactococcus lactis MG1363 chromosomal DNA was used as template (end concentration 6 ng µl–1). Genome comparisons All ORFs of a template genome were compared against the contigs of a target genome using the Blastn program. The number of ORFs giving a hit below the selected cutoff e-value (1 × e–50) divided by the total number of ORFs in the template genome results in an arbitrary value describing the similarity of the template genome to the target genome. This arbitrary value also depends on the completeness of the target genome contigs. Software availability Projector is available for educational and research purposes by non-profit institutions at http://molgen.biol.rug.nl/molgen/research/molgensoftware.php. Sources of genome sequence data Unfinished genomic sequences were obtained from the US DOE Joint Genome Institute (http://www.jgi.doe.gov). Finished genome sequences were obtained from the NCBI genome database (http://www.ncbi.nlm.nih.gov/). RESULTS Mapping of several target genomes Prior to testing Projector on several freely available target genomes, the software was tested by using L.lactis IL1403 ORFs as both target (each ORF is a contig) and template sets. All IL1403 ORFs that were not discarded, were mapped at their proper positions on the IL1403 genome. Projector was tested on seven target (unfinished) genome sequences (Table 2). Many target contigs are relatively small, resulting in high numbers of contigs being discarded based on their size (results not shown). Except in the case of Clostridium thermocellum, more than 70% of the nucleotides in the organism assembly were in mapped contigs. The significant numbers of contigs that result in ‘no hit’ (Table 2) indicate that many target contigs are strain specific. The median predicted gap sizes range from 500 bp to >7 kb. The percentage of predicted gaps that might be closed in one PCR (~15 kb maximum) ranges from 60 to almost 90. In the case of L.lactis MG1363 and L.lactis SK11 target genomes, mapping on an IL1403 template genome from which repetitive elements had been removed (repeat-masked) resulted in a reduction in the percentage of nucleotides mapped by five and four, respectively. The number of contigs that were mapped at overlapping positions decreased significantly when repeat-masked sequences were used (results not shown).
Correlating genomic relatedness with mapping success Comparing the genomic content of different bacteria is an important but difficult task and many software packages have been developed to do so (13–16). These software packages give a lot of information on genomic relatedness, but it is difficult to generate a single quality value for a genomic comparison. In order to obtain a rough estimate of the relatedness of two genomes, a simple methodology, similar to that used in Projector, was applied. For all target genomes used in this study, an estimate was made of the percentage of template ORFs present in the respective target contigs. A similarity of 70% was calculated between the target L.lactis MG1363 contigs and the ORFs of the template L.lactis IL1403, which corresponds to the earlier reported similarity of 70–98% (17). A correlation between similarity values obtained for all bacteria (except C.thermocellum) and the percentage of nucleotides present in mapped contigs (Table 2) is given in Figure Figure2.2 Mapping of L.lactis MG1363 and gap closing PCR results Figure Figure33
Projector was used to map L.lactis MG1363 contigs on a repeat-masked IL1403 genome (Table 2, sixth entry). Contigs <1500 bp were omitted from the mapping procedure to prevent incorrect mappings due to repetitive and/or low-quality sequences. From the 83 PCRs that were performed using gap-closing PCR primer pairs suggested by Projector and GenomePrimer, 77 PCR products were obtained (93% success), with sizes ranging from 300 bp to 15 kb (the primer sequences, projected gap sizes, PCR product sizes and the first round sequencing results are listed in Supplementary Material, Table S3). From these 77 PCR products, 31 had sizes that were different from the projected gap sizes. The reasons underlying these deviations are outlined above. From the first sequencing round performed on the 77 PCR products, 18 contigs ‘fit’ in 15 (partially) sequenced gaps (Supplementary Material, Table S3). The N50 and N80 values are often used to assess the quality of a sequence assembly. The N50 (N80) value describes a contig size at which 50% (80%) of the assembled nucleotides are in contigs of sizes larger than this value. Before the use of Projector, the N50 and N80 values were 39 and 14 kb, respectively. Because sequencing of the gap-closing PCR products is underway, the assembly after use of Projector is not yet finished. By conservatively only taking into account the gap-closing PCR products that could be sequenced from both sides (61 products; Supplementary Material, Table S3), an estimate of resulting contig sizes in the (provisional) assembly after the use of Projector was made. The contig sizes after gap closure can be estimated from: (i) the original contig sizes; (ii) the sizes of the gap-closing PCR products; (iii) their overlaps with flanking contigs (yielding an estimate for the actual gap size). From this provisional assembly the (likely underestimated) N50 and N80 values were 349 and 44 kb, respectively. DISCUSSION The Projector software was developed to automatically and reliably link contigs by using information of already sequenced genomes. Because many sequencing projects are confidential, Projector was developed as a ‘stand-alone’ software package, as opposed to software that use web interfaces. From Table 2 it is clear that the number of contigs in unfinished genomes can amount to up to at least 300. This large number of contigs per genome stresses the need for automated gap closing software to avoid having to position contigs manually. Using Projector has several advantages: (i) the ‘hands-on’ time is limited to ~10 min per mapping as opposed to many hours of manual labor; (ii) mapping procedures can be highly standardized; (iii) optimization of mappings (getting rid of erroneously mapped contigs) is very easy; (iv) errors in manual mappings are avoided. In the present study, contigs <1500 bp were considered to be ‘low quality’ and were discarded. With the underlying thought that less potential information is discarded, a contig cut-off size of 500 bp was tried in mapping procedures of L.lactis MG1363. In the resulting map, many contigs were mapped at approximately the same positions due to that these contigs contain repetitive sequences. When performing mappings, there will be a trade-off between how many contigs one wants to use for the mapping procedure and the quality (with few mappings that overlap and small gap sizes) of the resulting map. With the assumption that a number of the remaining contig sequences will ‘fit’ within gaps that are closed, the contig cut-off size has to be balanced with the percentage of gaps that can be closed in a single PCR. This assumption was in part validated by the results of the first sequencing round, where 18 remaining contigs ‘fit’ in 15 gaps (Supplementary Material, Table S3). Projector uses the ORFs of a template genome to order contigs of a target genome. Although this methodology proved to work nicely for several prokaryotic genomes, it is not suitable for eukaryotic genomes. This is due to two major differences between prokaryotic genomes and eukaryotic genomes: (i) prokaryotic genomes consist of one circular chromosome whereas eukaryotic organisms contain several linear chromosomes; (ii) prokaryotic genomes are more ‘coding dense’ than eukaryotic genomes, e.g. often >95% of bacterial chromosomes code for proteins while the human genome contains at most 2% coding DNA. Prokaryotic genomes contain repetitive elements or highly homologous sequences such as prophages, insertion elements and gene duplications (18). Up to 25% of the coding regions in the Bacillus subtilis genome consists of duplications (19). Omitting repetitive elements from a template DNA sequence (repeat-masking) results in less ambiguous maps in the case of L.lactis MG1363 and L.lactis SK11 (results not shown). When using the PGAAS and MUMmer software packages, it is highly recommended to repeat-mask sequences before performing the actual mapping. In the case of a small contig that ends with a repetitive element, it might be advantageous to perform the mapping on non-repeat-masked sequences. In this particular case, a successful mapping only occurs when the repetitive element is taken into account. Although it is advisable to remove repetitive elements, it is possible for Projector to perform a mapping on sequences that are not repeat-masked. The software will automatically select the contig map size that is closest to the original contig size. Only when repetitive elements that also appear elsewhere and with comparable spacing on the template genome are present on either side of a contig, it could be incorrectly mapped. The MUMmer software package was developed as a genome comparison tool. It positions small DNA fragments (MUMs) taken from contigs on a template genome and extends them until extension is no longer possible, due to divergence in target and template sequences. Because small fragments are extended, the mapping is accurate but in some cases ambiguous. Furthermore, repetitive elements have to be filtered from the sequence data prior to using MUMmer. The recently released scaffolding software BAMBUS (http://www.tigr.org/software/bambus/) allows obtaining additional contig linking information by using MUMmer. BAMBUS does not have the possibility to automatically suggest gap closure primer pairs. In addition, in some cases scattering of contigs occurs, which results in ambiguous mappings. If that is the case, the correct contig linkage information has to be obtained manually, which is difficult because of lack of information on the mapped size of the contig on the template genome compared to the actual contig size. The aim of PGAAS is to aid in closing gaps between contigs that end within the same ORF (9). The methodology used in PGAAS differs from that used in Projector because PGAAS attempts to position two ends from two contigs on one ORF. In Projector, the two ends within one contig must map on a template sequence with certain spacing: the mapped contig size must resemble the actual contig size as much as possible. As the authors of PGAAS state, it is imperative that the genome does not contain gene duplications or orthologous genes as otherwise two contigs could be erroneously linked. In the final stages of an assembly it is often observed that a large number of contigs end with repetitive sequences (from, for example, phages, rRNAs or IS elements; this was also the case for the L.lactis MG1363 contigs), meaning that these ends cannot be used by the PGAAS methodology. Construction of a large insert genomic bank of the AT-rich organism L.lactis MG1363 proved to be problematic. In the case where it is difficult to obtain additional linkage information by using large insert banks, Projector might prove to be very efficient. It was successful in projecting 84 contigs (consisting of 81% of the assembled nucleotides) from L.lactis MG1363 contigs present in the end phase of the genome sequencing project. The mapped contigs on the L.lactis IL1403 template genome were accurately positioned because 93% of the gap-bridging PCRs with primers suggested by Projector and GenomePrimer were successful. Moreover, a marked improvement in the N50 and N80 values was observed after the use of Projector. Even if large insert genome banks were successfully used in a genome sequencing project, it is very probable that at a minimum of time and cost involved, a significant amount of the remaining gaps could be closed by using Projector. Mappings of Xylella oleander and Xylella almond contigs on Xylella fastidosa (over 90% of the nucleotides were in mapped contigs with a median gap size of <1 kb; Table 2) were even more successful compared to those of L.lactis MG1363 (81% of the nucleotides in mapped contigs with a median gap size of ~3 kb; Table 2). The correlation between mapping success and genomic relatedness (Fig. (Fig.2)2 Supplementary Material is available at NAR Online. [Supplementary Material]
REFERENCES 1. Fraser C.M. and Fleischmann,R.D. (1997) Strategies for whole microbial genome sequencing and analysis. Electrophoresis, 18, 1207–1216. [PubMed] 2. Staden R., Judge,D.P. and Bonfield,J.K. (2001) Sequence assembly and finishing methods. Methods Biochem. Anal., 43, 303–322. [PubMed] 3. Stjepandic D., Weinel,C., Hilbert,H., Koo,H.L., Diehl,F., Nelson,K.E., Tummler,B. and Hoheisel,J.D. (2002) The genome structure of Pseudomonas putida: high-resolution mapping and microarray analysis. Environ. Microbiol., 4, 819–823. [PubMed] 4. Crowe M.L., Rana,D., Fraser,F., Bancroft,I. and Trick,M. (2002) BACFinder: genomic localisation of large insert genomic clones based on restriction fingerprinting. Nucleic Acids Res., 30, e118. [PubMed] 5. Zabarovska V.I., Gizatullin,R.Z., Al Amin,A.N., Podowski,R., Protopopov,A.I., Lofdahl,S., Wahlestedt,C., Winberg,G., Kashuba,V.I., Ernberg,I. et al. (2002) A new approach to genome mapping and sequencing: slalom libraries. Nucleic Acids Res., 30, e6. [PubMed] 6. Tettelin H., Radune,D., Kasif,S., Khouri,H. and Salzberg,S.L. (1999) Optimized multiplex PCR: efficiently closing a whole-genome shotgun sequencing project. Genomics, 62, 500–507. [PubMed] 7. Romling U., Grothues,D., Heuer,T. and Tummler,B. (1992) Physical genome analysis of bacteria. Electrophoresis, 13, 626–631. [PubMed] 8. Eisen J.A., Heidelberg,J.F., White,O. and Salzberg,S.L. (2000) Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol., 1, RESEARCH0011.1–011.9. [PubMed] 9. Yu Z., Li,T., Zhao,J. and Luo,J. (2002) PGAAS: a prokaryotic genome assembly assistant system. Bioinformatics, 18, 661–665. [PubMed] 10. Delcher A.L., Phillippy,A., Carlton,J. and Salzberg,S.L. (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res., 30, 2478–2483. [PubMed] 11. Delcher A.L., Kasif,S., Fleischmann,R.D., Peterson,J., White,O. and Salzberg,S.L. (1999) Alignment of whole genomes. Nucleic Acids Res., 27, 2369–2376. [PubMed] 12. van Hijum S.A.F.T., de Jong,A., Buist,G., Kok,J. and Kuipers,O.P. (2003) Unifrag and GenomePrimer: selection of primers for genome-wide production of unique amplicons. Bioinformatics, 19, 1580–1582. [PubMed] 13. Thomas J.W. and Touchman,J.W. (2002) Vertebrate genome sequencing: building a backbone for comparative genomics. Trends Genet., 18, 104–108. [PubMed] 14. Couronne O., Poliakov,A., Bray,N., Ishkhanov,T., Ryaboy,D., Rubin,E., Pachter,L. and Dubchak,I. (2003) Strategies and tools for whole-genome alignments. Genome Res., 13, 73–80. [PubMed] 15. Bray N., Dubchak,I. and Pachter,L. (2003) AVID: a global alignment program. Genome Res., 13, 97–102. [PubMed] 16. Hohl M., Kurtz,S. and Ohlebusch,E. (2002) Efficient multiple genome alignment. Bioinformatics, 18, S312–S320. [PubMed] 17. Campo N., Dias,M.J., Daveran-Mingot,M.L., Ritzenthaler,P. and Le Bourgeois,P. (2002) Genome plasticity in Lactococcus lactis. Antonie Van Leeuwenhoek, 82, 123–132. [PubMed] 18. Yanai I., Camacho,C.J. and DeLisi,C. (2000) Predictions of gene family distributions in microbial genomes: evolution by gene duplication and modification. Phys. Rev. Lett., 85, 2641–2644. [PubMed] 19. Kunst F., Ogasawara,N., Moszer,I., Albertini,A.M., Alloni,G., Azevedo,V., Bertero,M.G., Bessieres,P., Bolotin,A., Borchert,S. et al. (1997) The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature, 390, 249–256. [PubMed] |
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||
Electrophoresis. 1997 Aug; 18(8):1207-16.
[Electrophoresis. 1997]Methods Biochem Anal. 2001; 43():303-22.
[Methods Biochem Anal. 2001]Environ Microbiol. 2002 Dec; 4(12):819-23.
[Environ Microbiol. 2002]Nucleic Acids Res. 2002 Nov 1; 30(21):e118.
[Nucleic Acids Res. 2002]Nucleic Acids Res. 2002 Jan 15; 30(2):E6.
[Nucleic Acids Res. 2002]Genomics. 1999 Dec 15; 62(3):500-7.
[Genomics. 1999]Electrophoresis. 1992 Sep-Oct; 13(9-10):626-31.
[Electrophoresis. 1992]Genome Biol. 2000; 1(6):RESEARCH0011.
[Genome Biol. 2000]Bioinformatics. 2002 May; 18(5):661-5.
[Bioinformatics. 2002]Nucleic Acids Res. 2002 Jun 1; 30(11):2478-83.
[Nucleic Acids Res. 2002]Nucleic Acids Res. 1999 Jun 1; 27(11):2369-76.
[Nucleic Acids Res. 1999]Bioinformatics. 2003 Aug 12; 19(12):1580-2.
[Bioinformatics. 2003]Trends Genet. 2002 Feb; 18(2):104-8.
[Trends Genet. 2002]Bioinformatics. 2002; 18 Suppl 1():S312-20.
[Bioinformatics. 2002]Antonie Van Leeuwenhoek. 2002 Aug; 82(1-4):123-32.
[Antonie Van Leeuwenhoek. 2002]Phys Rev Lett. 2000 Sep 18; 85(12):2641-4.
[Phys Rev Lett. 2000]Nature. 1997 Nov 20; 390(6657):249-56.
[Nature. 1997]Bioinformatics. 2002 May; 18(5):661-5.
[Bioinformatics. 2002]