Copyright © 2006, American Society of Plant Biologists

TOPAAS, a Tomato and Potato Assembly Assistance System for Selection and Finishing of Bacterial Artificial Chromosomes

Centre for Biosystems Genomics, 6700 AB Wageningen, The Netherlands (S.A.P., J.C.v.H., T.H., M.J.v.S.); Department of Bioscience, Cluster Greenomics, Plant Research International, 6708 PB Wageningen, The Netherlands (S.A.P., J.C.v.H., T.H., M.J.v.S., M.H.C.A.-H., R.M.K.-L.); and Keygene N.V., 6700 AE Wageningen, The Netherlands (T.P.J., D.W., K.J.)

*Corresponding author; e-mail sander.peters/at/wur.nl; fax 31–317–418094.

Received September 13, 2005; Revised December 16, 2005; Accepted January 6, 2006.

Abstract

We have developed the software package Tomato and Potato Assembly Assistance System (TOPAAS), which automates the assembly and scaffolding of contig sequences for low-coverage sequencing projects. The order of contigs predicted by TOPAAS is based on read pair information; alignments between genomic, expressed sequence tags, and bacterial artificial chromosome (BAC) end sequences; and annotated genes. The contig scaffold is used by TOPAAS for automated design of nonredundant sequence gap-flanking PCR primers. We show that TOPAAS builds reliable scaffolds for tomato (Solanum lycopersicum) and potato (Solanum tuberosum) BAC contigs that were assembled from shotgun sequences covering the target at 6- to 8-fold coverage. More than 90% of the gaps are closed by sequence PCR, based on the predicted ordering information. TOPAAS also assists the selection of large genomic insert clones from BAC libraries for walking. For this, tomato BACs are screened by automated BLAST analysis and in parallel, high-density nonselective amplified fragment length polymorphism fingerprinting is used for constructing a high-resolution BAC physical map.
BLAST and amplified fragment length polymorphism analysis are then used together to determine the precise overlap. Assembly onto the seed BAC consensus confirms the BACs are properly selected for having an extremely short overlap and largest extending insert. This method will be particularly applicable where related or syntenic genomes are sequenced, as shown here for the Solanaceae, and potentially useful for the monocots, Brassicaceae, and Leguminosae.

An established strategy to determine the sequence content of target genomes involves large insert clones that are physically mapped into contigs spanning the target of interest, and which are used for shotgun library construction and high-throughput sequencing. Many aspects of the clone-by-clone whole-genome sequencing strategy have been addressed in the literature, and although much progress has been made in developing this strategy, key steps are the subject of continued evaluation and improvement. Here we present results on the Centre for Biosystems Genomics initiative to sequence tomato chromosome 6 of Solanum lycopersicum cv Heinz 1706 by a clone-by-clone sequencing approach and to establish a resistance gene homolog profiling for the potato (Solanum tuberosum) genome. In this paper we particularly focus on selecting bacterial artificial chromosomes (BACs) for walking and finishing. The condition of having large insert clones available was fulfilled by Budimann et al. (2000), who constructed a HindIII BAC library for cultivated tomato cv Heinz 1706, covering the target with approximately 15 genome equivalents, and more recently by an MboI and an EcoRI BAC library made available by the United States' part of the International Solanaceae Project (SOL) (Mueller et al., 2005b). A key step in clone-by-clone whole-genome sequencing is determining a reliable minimal tiling path. This strategy depends on the availability of a high quality physical map.
An established approach for map construction involves DNA fingerprinting. With fingerprinting, overlapping clones are identified by determining a pattern of shared bands produced from restriction enzyme analysis, which is indicative for the physical overlap. Owing to its simplicity and low initial costs, often agarose separation and staining is used for detection of bands. A combinatorial comparison of fingerprints through automated physical map assembly software, e.g. FingerPrinted Contigs (FPC), is applied for map construction (Soderlund et al., 1997, 2000). However, low resolution separation, errors in detection and size estimation of separated fragments, uncalibrated FPC parameter settings for size tolerance, and inaccurate probability cutoff scores, cause false negative scoring results, creating gaps in the physical map and resulting in a higher amount of singletons, and false positives creating chimeric contigs (for review, see Meyers et al., 2004). Compared to agarose separation, amplified fragment length polymorphism (AFLP) fingerprinting is a high-resolution separation technique, and this allows for more precise fragment size estimation. Typically 50 to 100 restriction fragments in the range from 50 to 500 nucleotides can be detected (Vos et al., 1995). Budimann et al. (2000) have proposed a sequence-tagged connector (STC) framework for more precise selection of minimally overlapping tomato BAC clones to support whole-genome sequencing of the tomato genome. The selection strategy originally proposed by Venter et al. (1996) involves a fingerprint analysis and BAC end sequencing, which is used in combination with genetically anchored seed BACs that are completely sequenced. Recently a large number of tomato BAC end sequences have been made available by the Solanaceae Genome Network (SGN) for the sequencing community, and these developments make it possible to pursue the STC approach using high-density fingerprints. 
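As an illustration of how overlap is inferred from shared bands, the sketch below counts comigrating fragments between two fingerprints under a size tolerance. It is a simplified greedy matcher written for this text, not the scoring used by FPC or by the AFLP analysis in this study; the function name and the tolerance value are assumptions chosen for the example.

```python
def count_shared_fragments(sizes_a, sizes_b, tolerance=0.5):
    """Count fragments of fingerprint A that comigrate with a fragment of B.

    sizes_a, sizes_b: fragment sizes in nucleotides for two clones.
    tolerance: maximum size difference (assumed value) for two bands
               to be scored as the same fragment.
    Each band in B is matched at most once.
    """
    a = sorted(sizes_a)
    b = sorted(sizes_b)
    shared = 0
    j = 0
    for size in a:
        # Skip bands in B that are too small to match the current band of A.
        while j < len(b) and b[j] < size - tolerance:
            j += 1
        if j < len(b) and abs(b[j] - size) <= tolerance:
            shared += 1
            j += 1  # consume the matched band
    return shared


if __name__ == "__main__":
    clone_1 = [50, 120, 200, 310]
    clone_2 = [121, 199, 455]
    print(count_shared_fragments(clone_1, clone_2, tolerance=1))
```

A tighter tolerance, as permitted by high-resolution separation, lowers the chance that unrelated fragments are scored as shared, which is the source of the false positives discussed above.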
Upon selection of fingerprinted BACs, determining the sequence content is the next important step in rebuilding the genomic content of targets. The method most commonly used for genomic DNA sequencing is shotgunning. The sample DNA is randomly sheared into small fragments and cloned into appropriate sequencing vectors. With double-barreled shotgun sequencing, small insert clones are sequenced from both insert ends, producing read pairs or mates. The aim is to cover the target of interest and to reduce the number of sequence gaps between contigs by producing a sufficient amount of sequences from which a reliable consensus can be determined upon assembly. Theoretically, following Poisson distribution rules, the probability for bases not being sequenced leaving sequence gaps reduces with an increase of coverage, as outlined by Lander and Waterman (1988), although cloning bias causes a nonrandom distribution leading to nonsequenced areas regardless of coverage. Uncovered areas are usually rescued by PCR, using custom-designed primers and templates spanning the sequence gap. For tomato and potato BAC sequencing we focus on 6-fold coverage, aiming for a limited and balanced demand of resources. However, low coverage will leave assemblies more incomplete and will demand a dedicated input for the assembly finishing phase. While sequencing and computer technology have facilitated the automated processing and assembly of large amounts of shotgun sequence data, the finishing of contig sequences is a time-consuming process, and needs expert knowledge to evaluate base calls, design primers for gap closure, and untangle complex sequences that obstruct a proper assembly. To compensate for the human input required to finish low-covered BACs, we aim to automate local assembly verification, contig linking, and gap closure. Several tools for contig linking and gap closure have been presented in the past. 
Among those is the prokaryotic genome assembly assistance system, which was developed to automate contig ordering and gap closure for prokaryotic cyanobacterial genome assembly by finding possible links for Synechococcus contigs with known protein sequences coming from closely related Synechocystis sp. (Yu et al., 2002), using local sequence homology-based searches with BLASTX (Altschul et al., 1990). Finding contig links by BLASTX homology searches depends on gene distribution in the target genome. For tomato, the regions near the centromere have the lowest gene density with 15 to 17 kb per gene, while the euchromatin has a gene density of approximately 7 kb per gene. Analysis of sequenced tomato BACs reveals a gene density with an average of 10 kb per gene (Van der Hoeven et al., 2002). Bacterial genomes in general do not contain introns and have a higher gene density compared to eukaryotic plant genomes. Therefore, finding corresponding putative functions on sequences from higher eukaryotic plant origin for gapped assemblies will be more difficult. Additional linkage information might be obtained through comparative genomics. Solanaceae members like tomato and potato share a conserved colinearity between their genomes (Bonierbale et al., 1988). Genomic sequence information from Solanaceae is, however, scarce. From studies to analyze gene content and organization though, a large collection of single-pass expressed sequence tags (ESTs) from tomato cDNA has become available (Van der Hoeven et al., 2002) and this opens the possibility for genome-wide comparative studies. In addition to existing database information, a powerful data source for contig scaffolding, and one inherent to the double-barreled shotgun sequencing approach, is the assembly position of a sequence read as constrained by the assembly position and direction of its mate. This information can be used both to position contigs relative to each other and to solve local assembly problems.
Reconstruction of target sequences is often complicated by repeats, resulting in collapsed assemblies. To resolve these phenomena, a tool that reports on violation of direction and size constraints will help to determine contig quality. We report here the development of a Tomato and Potato Assembly Assistance System (TOPAAS) that uses homology-based searches, comparative alignments, read pair information, and high-density AFLP fingerprint data to link contigs, verify assemblies, and select minimal overlapping BACs.

RESULTS

Dataflow and Output

The main purpose of TOPAAS is to automate key steps in the clone-by-clone sequencing approach. Its tasks are to find contig link information for gapped assemblies resulting from low-coverage sequencing, to analyze the assembly integrity, and to assist the selection of overlapping BAC clones for a subsequent sequence walk. To that end we have built a system that extracts read pair information, carries out homology-based searches, and analyzes this information according to user-defined settings. A schematic representation of the TOPAAS pipeline and dataflow is shown in Figure 1.
TOPAAS provides a web front end in PHP for uploading assembly data and contig sequences, setting alignment constraints and average insert sizes for shotgun libraries. Homology-based alignments can be uploaded manually or provided by TOPAAS via two automated BLASTs. TOPAAS aligns contigs against the nonredundant sequence database from the National Center for Biotechnology Information (NCBI) and against the BAC end sequence database from SGN. The system also carries out a MUMmer (Delcher et al., 2002) or a BLAT (Kent, 2002) alignment against Solanaceae ESTs. Together with the homology-based alignment results, read pair positions and directions are parsed into MySQL tables comprising the TOPAAS database (for an overview of the TOPAAS table scheme, see Supplemental Fig. 1). The actual link analysis is started from a web front end and is carried out by the ContigLinker that queries the TOPAAS database. First the system retrieves and filters hits on cutoff for percentage identity or e-value score. We separated the filtering step from the alignment program filtering options to enable linkage analysis using variable cutoff scores without the need to perform additional homology searches. Next TOPAAS matches identical database accession numbers from EST and BLASTX hits. Subsequently, the system outputs a linkage analysis on the fly rather than storing the analysis. TOPAAS tracks down read pairs both within and between contigs. Violations against direction and spacing constraints point toward possible local assembly problems, and inconsistent read pairs are reported to the editor for extraction and reassembly. Via the web interface primer design constraints can be manipulated and the system will output unique primer pair combinations for sequence gap closing purposes (Supplemental Fig. 2). 
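The separation between stored alignment hits and a later, adjustable filtering step can be sketched as follows. This is a minimal in-memory illustration, not the ContigLinker code itself: TOPAAS queries MySQL tables, whereas the function name, the tuple layout of a hit, and the example accessions below are assumptions made for this sketch. The default cutoff of 1e-5 follows the expected value threshold given in Materials and Methods.

```python
from collections import defaultdict
from itertools import combinations


def link_contigs_by_accession(hits, max_evalue=1e-5):
    """Link contigs whose filtered hits share a database accession.

    hits: iterable of (contig_id, accession, evalue) tuples, standing in
          for rows that would be retrieved from the hit tables.
    max_evalue: cutoff applied at analysis time, so the same stored hits
                can be re-analyzed with a different cutoff without
                repeating the homology search.
    Returns a dict mapping a sorted contig pair to its shared accessions.
    """
    contigs_by_accession = defaultdict(set)
    for contig, accession, evalue in hits:
        if evalue <= max_evalue:
            contigs_by_accession[accession].add(contig)

    links = defaultdict(list)
    for accession, contigs in contigs_by_accession.items():
        # Every pair of contigs hitting the same accession is a candidate link.
        for pair in combinations(sorted(contigs), 2):
            links[pair].append(accession)
    return {pair: sorted(accessions) for pair, accessions in links.items()}


if __name__ == "__main__":
    stored_hits = [
        ("contig1", "ACC_A", 1e-20),
        ("contig2", "ACC_A", 1e-8),
        ("contig3", "ACC_A", 0.01),   # removed by the default cutoff
        ("contig2", "ACC_B", 1e-30),
        ("contig3", "ACC_B", 1e-6),
    ]
    print(link_contigs_by_accession(stored_hits))
    print(link_contigs_by_accession(stored_hits, max_evalue=0.1))
```

The sketch ignores where on a contig a hit falls; a fuller version would restrict candidate links to hits located at contig ends.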
The automated BLASTN analysis of contigs against the BAC end sequence database is used for high-throughput screening and rapid preselection of candidate BACs having a sequence overlap with seed BACs. The single-pass BAC end sequences are reassembled onto the seed BAC consensus. Base pair inconsistencies are edited to exclude high quality base call mismatches and the position of a nearby cloning site upstream of the BAC end sequence start position is verified. When meeting constraints, corresponding BACs are then selected for further analysis with high-density AFLP fingerprinting. The reassembly of BAC ends and AFLP fingerprinting analysis is carried out independently from TOPAAS.

Selection of Tomato BACs for Sequence Walking

Sequence Homology-Based Searches

To examine whether an STC approach with nonselective AFLP fingerprinting can support tomato BAC walking, we selected P250I21 and P046G10 from an initial set of tomato seed BACs for sequencing. P250I21 is assembled to full closure, whereas the assembly of P046G10 is gapped (Table I). Different lines of evidence indicate these BACs originate from tomato chromosome 6. Fluorescent in situ hybridization analysis shows P250I21 and P046G10 are located on the short and the long arm of chromosome 6, respectively (for an overview, see http://sgn.cornell.edu/cgi-bin/cview/map.pl?map_id=13). Furthermore, the chromosome 6 known functional gene Mi marker, which has been used as a probe in an overgo plating analysis, shows plausible associations to P250I21. In addition, P112G05 has been associated to the Mi marker and has been assigned to a chromosome 6 FPC contig (for details, see http://www.genome.arizona.edu/fpc/WebAGCoL/tomato/WebFPC/ and http://www.sgn.cornell.edu/cgi-bin/search/direct_search.pl?search=bacs). No FPC data is available for P250I21. However, AFLP mapping shows both BACs coassemble (see also Fig. 5).
We first searched the contig sequences of P250I21 and P046G10 with TOPAAS against the BAC end database from SGN, containing 75,000 to 126,000 BAC end sequences from a HindIII and an MboI library depending on the time of screening. The raw BLASTN output was converted into html format to provide for a complete overview of hits (Fig. 3
High-Density Nonselective AFLP Fingerprinting of Tomato BACs

To investigate the relation between BACs over a larger extent we analyzed AFLP EcoRI/MseI + 0/+ 0 fingerprints by determining the number of comigrating fragments between BACs (Fig. 4).

The nature of the overlap is further investigated by shotgun-sequencing P073H07 and P103N18 and assembly onto the consensus of P250I21 and P046G10, respectively. Both P073H07 and P103N18 align without base inconsistencies, and the overlap start position is similar to that determined by BLAST. Furthermore the BAC end assembly positions and directions are in agreement with the mapping results (Supplemental Fig. 4). From these results we conclude that P073H07 is the optimal BAC for walking in terms of minimal overlap and largest extending insert. At the time of screening the same held true for BAC P103N18. Over time the sequencing community will be provided in total with some 400,000 BAC end sequences obtained from three different libraries (Mueller et al., 2005b). It is likely we will find new BAC candidates with even more favorable features for walking as BAC end sequence data accumulate. This is illustrated by candidate BAC P008K02, which we found later on in the screening process. This BAC has a larger extending insert, but also a larger overlapping portion with seed BAC P046G10 (Supplemental Fig. 5).

Linking of Tomato and Potato Contigs

To analyze the quality of the contig links predicted by TOPAAS, we have constructed an assembly data set from three potato BACs, which were pulled from two different libraries (Rouppe van der Voort et al., 1999; Huang et al., 2005) and two tomato BACs. A total of 21 potato contigs with 18 sequence gaps was obtained for three potato BACs and comprised a contig length of approximately 335 kb. For two tomato BACs P073H07 and P103N18 we obtained 24 contigs with a length of 235 kb.
The type, number of links, and references to EST and BLAST matches between tomato and potato contigs was determined by TOPAAS as shown in Table I and Supplemental Table I. All potato BAC contigs have been linked, of which 13 out of 21 contigs are linked by read pairs. For tomato BACs, 17 contigs have been linked. For five contig pairs, 18 gap-spanning EST alignments have been found. P073H07 and RH123P09 have one contig pair, each linked by ESTs from both potato and tomato. One contig pair from RH123P09 has been linked with 12 ESTs from both tomato and potato (see Fig. 2 Subsequently, primers designed by TOPAAS on contig ends were used for PCR analysis on BAC template DNA in combinations according to the contig order predicted by TOPAAS. Figure 6
DISCUSSION

Selection of BAC Clones for Sequence Walk

We presented here a software package, TOPAAS, that automates key steps in the selection and finishing of BAC clones. A combination of nonselective AFLP fingerprinting, BLASTN analysis, and assembly of BAC ends supports an accurate physical mapping. The BLASTN search is used for high-throughput screening of BACs and rapid preselection. The selection can be used without laborious screening techniques such as the STS approach (Blake et al., 1996; Marra et al., 1997) or having to fingerprint an entire BAC library. The BAC clones we have screened for building the Mi contig are repetitive for Mi homologous sequences and contain transposable elements, the latter being well represented in plant genomes. Repetitive domains can confound the binning by scoring false overlaps and this also poses a problem for assembly, ordering, and bridging of contigs. By filtering the BLASTN hits, verifying for nearby upstream cloning sites within 50 bp from the start of the overlap on the seed BAC consensus, and manual inspection and curation of base call discrepancies, the screening is made robust enough to discriminate for true BAC end overlaps. An alternative approach to circumvent potential problems caused by alignment to repetitive regions is discussed hereafter. Alternatively, MegaBlast might be used for screening contigs against BAC ends. MegaBlast is faster than BLASTN and allows for a percentage identity cutoff rather than an expected value cutoff. Since e-values depend on the length of the BAC ends and the size of the referenced database, relatively short BAC end sequences with a perfect match might be missed when filtering with a cutoff e-value of 0.0. We have also included the option to screen BAC contig sequences with MegaBlast. The screening presented here works very efficiently.
From a total of 75,000 to 126,000 BACs we have identified four and seven candidates for P250I21 and P046G10, respectively, prior to fingerprinting. The fingerprinting and BLASTN analyses work complementarily in the physical mapping process. With the BAC end sequence homology search we are able to pinpoint the exact start position and direction of the overlap, and the AFLP fingerprinting is used to determine the relationship between overlapping BACs over a larger domain. Whereas the BLASTN hits disclose information on minimal overlap sizes, the multiple BAC comparisons through nonselective AFLP fingerprinting provide vital information for identifying BACs with the largest extending insert. For BAC P073H07, two comigrating fragments with seed BACs P250I21 have been scored (Fig. 4 The approach we have taken does not depend on the full closure of a seed BAC. The results for P046G10 show that minimal overlapping BACs can be scored for as well, even when having gapped assemblies, provided the contig ends adjacent to the T7 and SP6 region are identified. Theoretically with this approach it should be possible to identify BACs for walking having only a few hundred base pair overlap. This will depend on the distribution of restriction sites in the tomato genome and the number of BAC clones available to cover the genome. Recently also BAC end sequences from an MboI library have been made available and will be complemented by the United States' part of the SOL initiative with additional sequences coming from an EcoRI library. The use of multiple libraries produced with different restriction enzymes will increase the likelihood of finding BACs with even shorter overlap sizes. The mapping for BACs in AFLP contigs Mi and P103 has revealed some striking differences compared to FPC mapping results. Six BACs coassemble into contig Mi (Fig. 5 Other important aspects are cost and labor involved. Recently we have screened 21 seeds from a HindIII library against 350,000 BAC ends. 
The screening yields 186 BACs from the HindIII library, 126 BACs from the EcoRI library, and 75 BACs from the MboI library (data not shown). Thus on average 18 candidate overlapping BACs have been identified per seed BAC. We can now roughly estimate the total number of BACs to be fingerprinted using the STC approach, and compare this with the classical FPC method. If we follow Batzoglou et al. (1999), the HindIII library with depth d = 15 and an average BAC insert length of λ = 117.5 kb (Budimann et al., 2000) would yield a minimal tiling path with redundant sequencing of 13%. The percentage of redundant sequence will however be closer to 7.1% as a best possible obtainable result, since two additional libraries are available. We estimate the euchromatic part of chromosome 6 with length L to be 20 Mb (http://www.sgn.cornell.edu/help/about/tomato_project_overview.pl). The portion π of chromosome 6 covered by 21 seeds from the HindIII library is approximately 2.5 Mb, and yields an average gap length, expressed in BAC insert lengths, of ω = λ(L − π)/π ≈ 7λ (approximately 819 kb). The number of bidirectional walking steps (κ) to cover 90% of chromosome 6 is roughly equal to the initial mean gap size, and up to 2κ when covering 98% (Batzoglou et al., 1999). If we consider parallel walking starting from 21 seeds, ignore possible cloning bias and repeat sequences that mask overlaps, and assume all BACs are sequenced at both ends, in total some 2,500 to 5,000 BACs would have to be fingerprinted. A classical map-first, sequence-second approach like FPC would involve some 350,000 to 400,000 BACs to be fingerprinted.

BAC Finishing

TOPAAS assists the assembly, scaffolding, and finishing of BAC contigs. Read pairs are used commonly for finishing assemblies, and this linking approach has also contributed extensively to the positioning of tomato and potato contigs in this study.
The likelihood of finding sequence gap-spanning read pairs depends on the insert sizes used for constructing the shotgun library and the coverage with which the target is sequenced. Approximately 15% of the contigs could not be ordered with gap-spanning mate pairs. This is partly due to the low coverage with which BACs have been sequenced. We have included homology-based searches to increase the chance of finding leads that link contig ends. From the links predicted, approximately 70% belonged to a read pair link type, whereas the remaining 30% were equally divided over BLASTX and EST link types. Multiple factors contribute to the success of the homology-based linking approach. We show here that alignments to single-pass ESTs can successfully be used for tomato and potato contig linking. For many plant genomes extensive amounts of ESTs have been produced, and in combination with genomic sequences the approach is feasible for many sequence projects including those from monocots, Brassicaceae, and Leguminosae (http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html). The closing efficiency will improve when using unigenes, since the spanning distance in general is larger compared to single-pass EST sequences. Building high quality unigenes, however, requires base calling, accurate preclustering, and assembly. Reliable linkage by bridging unigenes will thus depend on the consistency and the overall quality of the build. Some 31,000 unigenes for S. lycopersicum and 25,000 for S. tuberosum have been assembled (Mueller et al., 2005a), each set containing some 38% singletons (http://www.sgn.cornell.edu/search/direct_search.pl?search=unigenes). We have used both unigene sets for alignment against tomato BAC P073H07; however the screening did not yield additional linkages. MUMmer has been used as the matching algorithm. Its suffix tree-based method is computationally relatively inexpensive and is very fast.
MUMmer can perform a translated alignment, which is preferable for more distantly related genomes. However, it is memory intensive and was originally designed for global rather than local alignments (Delcher et al., 2002; Kurtz et al., 2004). Tools like BLAT are specifically designed for EST-genome alignments. BLAT is also fast but differs from MUMmer in that it uses a hash array. It is very accurate for highly related genomes, but its nucleotide alignment strategy starts to break down when the base identity is below 90%. This makes it less suitable for alignments between more distantly related species. BLAT can work in translated mode but has limitations for protein alignments with respect to indels (Kent, 2002). We have provided TOPAAS with the option to screen BAC contig sequences with both BLAT and MUMmer. Both BLAST and EST bridging sequences were checked manually for homology against known Solanum repeats. In one instance we found a contig pair linked by a BLAST hit against a repetitive element. The contig pair also shared a bridging read pair, making an aberrant linkage unlikely. Neither BLAST nor TOPAAS is specifically designed to deal with repetitive sequences. Although not used in this study, we have recently included an automated screen in the assembly phase against The Institute for Genomic Research Solanaceae Repeat Database (http://www.tigr.org/tdb/e2k1/plant.repeats) with RepeatMasker to circumvent potential problems (http://www.repeatmasker.org/RMDownload.html). In a Staden environment RepeatMasker is interfaced by PREGAP4 (Bonfield et al., 1995) and it tags repeats accordingly. Upon assembly, consensus sequences are extracted in which repeats are masked, which prevents them from producing false overlaps in homology-based alignments and EST alignments. Ordering contig ends with BLASTX depends on the gene distribution in the tomato and potato genome.
In this study we have finished BACs containing inserts of the euchromatic part of tomato chromosome 6. The genes are not evenly distributed in the tomato and potato genome (Van der Hoeven et al., 2002), and the likelihood of linking contigs in regions with few genes, e.g. in the heterochromatic parts of the genome, will be lower compared to the euchromatic domains of the genome. In addition, information on Solanaceae (putative) protein sequences is scarce, and finding relationships depends on the availability of more distantly related (putative) protein sequences. The results show four out of six contig-bridging BLASTX alignments having a homology against non-Solanaceous protein sequences. Furthermore, coding regions in higher eukaryotes like tomato and potato contain introns, and this further decreases the chance to find contig ends matching the same protein sequence. We have included comparative alignments between tomato and potato ESTs and genomic sequences in the link analysis. The alignments between genomic and EST sequences show both species-specific and tomato-potato alignments that provide useful linking leads. Even more linking information could be obtained by comparative alignments to non-Solanaceae ESTs. A computational comparison of some 120,000 ESTs against tomato BACs from tomato cv Heinz 1706 and the Arabidopsis genome revealed 70% of the tomato unigenes having identifiable homologs in the Arabidopsis genome. Furthermore a comparison of gene repertoires indicates a set of highly conserved genes (17%) is shared between Arabidopsis, S. esculentum, and Medicago truncatula (Van der Hoeven et al., 2002). Therefore, alignments between, for example, full-length cDNAs or At-ESTs coming from studies to verify transcription units within the Arabidopsis genome (Yamada et al., 2003) and tomato and potato genomic sequences seem a promising possibility.
Yet, caution should be taken when using sources from more distantly related species in comparative studies. Where genome rearrangements have occurred in evolution between species, changes on a microsynteny level might lead to inaccurate projection and false ordering information. Nevertheless, the chances of finding ordering leads based on comparative alignments will surely increase with the rapidly expanding number of genome sequences and EST data sets from closely related species. We will continue to explore data sets and new linking approaches for the BAC finishing process. In this respect we are currently investigating whether matching AFLP gel fingerprints to in silico AFLP fingerprints can be used effectively for automated scaffolding purposes.

The TOPAAS software is available for nonprofit, academic, and personal use. Please contact http://www.cbsg.nl for nonexclusive commercial licenses. The software can be downloaded from http://www.appliedbioinformatics.wur.nl.

MATERIALS AND METHODS

Sequencing and PCR Analysis

BAC DNA was isolated with the Qiagen large construct kit, sized by hydro shearing, fractionated by gel electrophoresis, and 2-kb sized fragments were cloned into the dephosphorylated EcoRV site of pBlueScriptSK (Stratagene) or pGEM-TEasy (Promega). Shotgun templates were prepared from XL2 transformants (Stratagene) and sequenced using the ABI PRISM Big Dye Terminator Cycle Sequencing Ready reaction kit with FS AmpliTaq DNA polymerase (Perkin Elmer) or the DYEnamic ET Terminator Cycle Sequencing kit (Amersham). For gap closure, PCR products were amplified with custom-made primers using a regular PCR protocol. Typically a 10-μL PCR reaction contained 1 μL 5 μM forward and 1 μL 5 μM reverse custom primer, 1 μL 2.5 mM dNTPs, 2 μL 25 mM MgCl2, 2 μL 10× sequence buffer (200 mM Tris-HCl pH 9.0, 5 mM MgCl2), 0.2 μL 5 units/μL Goldstar (Eurogentec) polymerase, and 1 μL 10 μg/μL BAC template DNA.
PCR products were analyzed on agarose gel, purified using QIAquick gel extraction kit (Qiagen) as described by the manufacturer, and diluted into 30 μL. Sequence PCR was carried out in 10 μL reaction mixture with 2 μL Amerdye (Amersham), 1 μL sequence primer, 2 μL sequence buffer (200 mM Tris-HCl pH 9.0, 10 mM MgCl2), and 5 μL template DNA. Sequence PCRs were analyzed on a 3730 XL DNA analyzer (Applied Biosystems).

Assembly

Using the PREGAP4 interface of the Staden package 2004, raw trace data was processed into assembly ready sequences. Sequences were base called by the PHRED base caller (Ewing and Green, 1998; Ewing et al., 1998). Clipping was performed to remove sequencing vector, cloning vector, and bad quality sequences. Processed sequences were subsequently assembled with GAP4, with a sequence percentage mismatch threshold of 8%, and parsed into the GAP4 assembly database. The GAP4 contig editor interface was used for editing and finishing. Consensus calculations with a quality cutoff score of 40 were performed from within GAP4 using a probabilistic consensus algorithm based on the expected error rates output by PHRED.

Software Dependencies

To manage the sequence, assembly, and scaffolding data we developed TOPAAS with components that are available as open-source components or with an academic user license. In particular we use MySQL as a database management system (http://www.mysql.com/downloads). Perl (http://www.perl.org) and PHP (http://www.phpmyadmin.net) are used for scripting purposes, and Apache (http://www.apache.org) is used for web hosting. Graphical output relies on the use of the graphics draw library (http://www.sunfreeware.com, or http://www.boutell.com/gd). The core program for primer design is built upon Primer3 (http://www-genome.wi.mit.edu/genome_software/other/primer3.html), though additional scripting has been used to adapt Primer3 to automated primer design for sequence gap closure.
The software also includes scripts to build a local database of contig sequences, against which primer sequences are checked for redundancy using BLASTN. To find matching putative functions that can be attributed to contig sequences we rely on BLASTX hits. We have adopted the prokaryotic genome assembly assistance system approach, but we use our own implementation to screen for identical accession IDs. We have extensively revised the table structure so that storage of data sets for multiple projects is supported. The software does not include a local BLAST facility or the environment needed to run BLAST; these should be set up by the user (for details, see http://www.ncbi.nih.nlm.gov/BLAST). For multiple alignment viewing of BLASTX matches we rely on Mview (http://mathbio.nimr.mrc.ac.uk/~nbrown). Base calling is carried out using PHRED (http://www.phrap.org). GAP4 assemblies were carried out using the Staden package 2004 (http://staden.sourceforge.net). The MUMmer package was used for sequence alignments between contig sequences and ESTs (http://www.tigr.org/software/mummer; http://mummer.sourceforge.net). Alternatively, BLAT (http://www.cse.ucsc.edu/~kent/) can be used for EST alignments. The software is implemented on a UNIX platform and tested on a SUN V440 server running Solaris 2.9.

Data Manipulation

Consensus sequences of contig ends were cured with the GAP4 assembly viewer using a PHRED quality threshold of 40 over a length of 1 kb for both ends of a contig. Assembly information was extracted from the GAP4 assembly database and parsed into the ContigLink database with TOPAAS. Subsequently, read pairs were evaluated with respect to the direction and size constraints that follow from the shotgun library properties. Bridging read pairs are considered valid when positioned on different contig ends, pointing toward each other with respect to their sequencing direction, and meeting size constraints.
For gap-flanking read pairs we calculate the sequence-spanning distance, excluding the size of the gap itself. The left distance, dleft, is taken from position 1 at the 5′ end of the first mate to the end position of the contig it is assembled in, running in the same direction as the sequencing direction of the first mate. The right distance, dright, is taken from the start position of the second contig to the 5′ end coordinate of the second mate, running opposite to the sequencing direction of the second mate. The total spanning distance is calculated as dtot = dleft + dright. The size constraint dtot for read pairs can be set to a value related to the average insert size used to construct a shotgun library. In this study dtot is set to 2.5 kb.

To align tomato (Solanum lycopersicum) and potato (Solanum tuberosum) EST sequences to contig sequences, we use an extension of the MUMmer package, designated NUCmer, using mummer2 as the matching algorithm. Consensus sequences in multi-FASTA format from assembled contigs are used as a reference, and multi-FASTA formatted potato and tomato EST sequences derived from NCBI are used as a query data set. An EST is considered contig bridging when it aligns to different contig end sequences, with its domains aligned in consecutive order, and with a minimal sequence identity threshold of 90% for each aligned domain.

To find related putative gene functions, contig sequences were queried against the nonredundant sequence database from NCBI with BLASTX. A link is considered valid when contigs hit protein sequences with the same accession ID. A threshold for the expected value was set to 1 × 10^-5 to avoid low similarity matches.

Primers are automatically designed on contig end sequences, using Primer3 as a core primer design program. The maximum distance of primer positions to contig ends is set to 500 bp.
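As an illustration of the read-pair criteria described earlier in this section (dtot = dleft + dright, with dtot set to 2.5 kb), the following minimal sketch shows one way such a check could be written. It is not the TOPAAS implementation: the names `Read`, `distance_to_contig_end`, `spanning_distance`, and `is_bridging_pair` are hypothetical, coordinates are assumed to be 1-based with inclusive counting, and the 2.5-kb setting is assumed to act as an upper bound. Because contig orientation is still free during scaffolding, the sketch models "pointing toward each other" by measuring each read toward the contig end it faces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """One mate of a read pair placed in an assembled contig (hypothetical model)."""
    contig: str       # identifier of the contig the read is assembled in
    contig_len: int   # length of that contig in bp
    five_prime: int   # 1-based contig coordinate of the read's 5' end
    forward: bool     # True if the read runs toward increasing coordinates


def distance_to_contig_end(read: Read) -> int:
    """Bases from the read's 5' end to the contig end it points at (inclusive)."""
    if read.forward:
        # Read faces the right-hand end of its contig.
        return read.contig_len - read.five_prime + 1
    # Read faces the left-hand end (position 1) of its contig.
    return read.five_prime


def spanning_distance(first: Read, second: Read) -> int:
    """dtot = dleft + dright; the unknown gap size is deliberately excluded."""
    return distance_to_contig_end(first) + distance_to_contig_end(second)


def is_bridging_pair(first: Read, second: Read, max_dtot: int = 2500) -> bool:
    """Accept a pair only if its mates lie on different contigs and dtot fits.

    Treating max_dtot as an upper bound is an assumption of this sketch.
    """
    if first.contig == second.contig:
        return False
    return spanning_distance(first, second) <= max_dtot
```

For example, a forward read whose 5′ end lies at position 9,201 of a 10,000-bp contig contributes 800 bp, and a reverse read whose 5′ end lies at position 1,200 of a second contig contributes 1,200 bp, giving a spanning distance of 2,000 bp, which passes the 2.5-kb constraint in this sketch.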
Additional custom scripting is applied to prefer primer sequences pointing outward with respect to the contig end positions and positioned nearest to a contig end. An automated redundancy check is performed by aligning each primer sequence against the consensus sequence of the contigs using BLASTN. The expected value threshold for reporting primers as redundant was set to 0.1. Possible mispriming that could give rise to ambiguous PCR results is reported by the program and described in terms of position, number of aligned bases, and alternative melting temperature.

To identify minimally overlapping BAC clones for walking, we use tomato BAC end sequences from the SOL Genomics Network available at ftp://ftp.sgn.cornell.edu/tomato_genome, and perform a BLASTN analysis against assembled tomato contigs. Position and direction of overlap were verified, and candidate BAC clones were preselected using an expected value threshold of 0.0. For candidates meeting these constraints, the corresponding ABI traces were subsequently assembled onto the BAC contig sequences in which the BLAST hit was found and verified at the nucleotide level for integrity. Assembled BAC end sequences that show high quality base call differences compared to contig consensus sequences, or whose assembly starts more than 50 bp downstream from a candidate HindIII or MboI cloning site, are rejected. Remaining candidate BAC clones are further analyzed by fingerprint analysis.

AFLP Fingerprinting and BAC Insert Sizes

BAC DNA was isolated by a standard alkaline lysis method (Sambrook et al., 1989), and EcoRI/MseI, HindIII/MseI, and PstI/MseI AFLP templates were prepared as described by Vos et al. (1995). Five microliters of the restriction ligation mix was diluted 10-fold in 10 mM Tris-HCl pH 7.5, 0.1 mM EDTA buffer. A nonselective amplification with [γ-33]ATP-labeled EcoRI + 0 and MseI + 0 primers was performed in a total volume of 20 μL (Vos et al., 1995).
Typically, 35 cycles were performed, each consisting of a 30-s DNA denaturing step at 94°C, a 1-min annealing step at 56°C, and a 1-min extension step at 72°C. For the HindIII/MseI and PstI/MseI templates, the [γ-33]ATP-labeled HindIII + 0 and PstI + 0 primers, respectively, were used in combination with the MseI + 0 primer. All amplification reactions were performed in a PE-9700 thermocycler (Perkin Elmer). After the amplification step, electrophoretic gel analysis of the reaction mix was carried out (Vos et al., 1995) and the fingerprint patterns were visualized using a Fuji BAS-2000 phosphoimaging analysis system (Fuji Photo Film). Band sizes were calculated relative to a 10-bp size ladder with AFLP-Quantar fingerprint analysis software, and comigrating bands were scored by visual inspection. AFLP-Quantar fingerprint analysis software (http://www.keygene.com/technologies/technologies_keymaps.htm) is distributed by KeyGene and is not part of TOPAAS.

For insert size determination, BAC DNA was prepared by a standard alkaline lysis method (Sambrook et al., 1989) from a 3-mL overnight culture. BAC DNA was digested to completion with NotI (New England Biolabs) and separated by field inversion gel electrophoresis (Bio-Rad FIGE MAPPER) on a 1% agarose gel in 0.5× Tris-borate/EDTA, with a linear run time, forward (3–30 s) and reverse (1–10 s), for 14 h at 160 V, along with a mid-range PFGE marker I (New England Biolabs).

Supplemental Data
Acknowledgments

We thank Joyce van Eck for providing us with the MboI and EcoRI library from tomato cv Heinz 1706, and Andy Pereira and Roeland van Ham for reading the manuscript and for advice.

Notes

1 This work was supported by the research program of the Centre of BioSystems Genomics, which is part of the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research.

The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Sander A. Peters (sander.peters/at/wur.nl).

[W] The online version of this article contains Web-only data.

References