Nucleic Acids Res. Jun 1, 2002; 30(11): 2460–2468.
PMCID: PMC117194

An efficient strategy for large-scale high-throughput transposon-mediated sequencing of cDNA clones

Abstract

We describe an efficient high-throughput method for accurate DNA sequencing of entire cDNA clones. Developed as part of our involvement in the Mammalian Gene Collection full-length cDNA sequencing initiative, the method has been used and refined in our laboratory since September 2000. The method is amenable to large-scale projects, and we have used it to generate >7 Mb of accurate sequence from 3695 candidate full-length cDNAs. Sequencing is accomplished through the insertion of Mu transposon into cDNAs, followed by sequencing reactions primed with Mu-specific sequencing primers. Transposon insertion reactions are not performed with individual cDNAs but rather on pools of up to 96 clones. This pooling strategy reduces the number of transposon insertion sequencing libraries that would otherwise be required, reducing the costs and enhancing the efficiency of the transposon library construction procedure. Sequences generated using transposon-specific sequencing primers are assembled to yield the full-length cDNA sequence, with sequence editing and other sequence finishing activities performed as required to resolve sequence ambiguities. Although analysis of the many thousands (22 785) of sequenced Mu transposon insertion events revealed a weak sequence preference for Mu insertion, we observed insertion of the Mu transposon into 1015 of the possible 1024 5mer candidate insertion sites.

INTRODUCTION

Current limitations in detailed knowledge of gene structures and the nucleic acid elements controlling mRNA transcription and processing have presented substantial challenges to the derivation of sensitive and specific automated gene predictions from whole-genome sequence data. Instead, robust gene predictions currently require empirical gene sequence information. The recognition of this need has provided impetus for sequencing cDNAs. Expressed sequence tags (ESTs), generated from the ends of cDNA clones, have long provided both crucial experimental gene sequence information and relatively inexpensive access to the more abundantly expressed genes. While of continuing long-term value, particularly to efforts aimed at gene identification, EST data have limitations. For example, EST data sample only the ends of cDNA clones, are error prone, and tend to have been generated from libraries of normalized cDNA clones that, for technical reasons, tend to be incomplete at the 5′ end of the transcript. Certain of these limitations apply also to synthetic cDNA sequences derived from automatic assemblies of ESTs, with the additional concern that these, which may be composed of ESTs generated from different tissues or developmental states, may not reflect bona fide transcripts.

Comprehensive sets of accurate, full-length cDNA sequences would address many of the current limitations of EST data and fill gaps in current knowledge of gene structure and identity. Of particular value are full-length cDNAs for those organisms having also substantial amounts of genomic sequence data. Comparison of these different kinds of sequence data will provide much needed knowledge of gene structure and facilitate refinement of automated gene prediction algorithms. Recognizing the need for gene sequence data derived from complete transcripts, several programs to generate the complete sequences of full-length cDNAs have been established. As a participant in one of these, known as the Mammalian Gene Collection (1), our laboratory has developed an efficient, large-scale high-throughput method for accurate sequencing of cDNA clones.

Previously described approaches to complete cDNA sequencing include primer ‘walking’ (2), concatenated cDNA sequencing (CCS) (3) and strategies involving transposons (4). In designing our method we considered the relative merits of each of these. The primer walking approach, in which oligonucleotide primers are employed to completely sequence the cDNA insert, has the obvious advantage of avoiding the sequencing of vector DNA. The process is serial however, with each primer walk dependent on a successful prior sequencing reaction. For longer cDNA clones the potential number of sequential steps seemed to us daunting, particularly in light of our goal of a scalable method and the difficulties associated with scaling up serial processes. Also an issue was the expense associated with the large numbers of oligonucleotide primers required for projects aimed at sequencing many thousands of cDNAs.

The CCS strategy shares certain similarities with shotgun sequencing (5) of large-insert bacterial clones. CCS starts with the isolation of cDNA inserts from vector DNA by restriction digestion, followed by agarose gel electrophoresis. Restriction fragments purified from agarose gels are then ligated together to produce long concatemers that can be tens of kilobases in length. The concatemers are fragmented using a physical disruption method (e.g. shearing) and fragments of the desired size are selected on agarose gels. Fragments are then cloned and sequenced and the individual sequence reads are assembled to yield the final sequence of the cDNA inserts. As in the primer walking approach, the CCS strategy avoids repeated sequencing of the vector. An additional advantage is the parallel nature of the sequencing, in which multiple clones are sequenced simultaneously. The primary disadvantages of the approach are associated with the restriction digestion and concatenation steps. While enzymes with infrequently occurring recognition sequences (in the vector) are used to liberate cDNA inserts from vector DNA, these may occasionally cleave within the insert, complicating or possibly even obviating assembly of the final sequence of the clone. Also potentially an issue are restriction fragments that co-migrate on the preparative agarose gels with the vector DNA. These fragments will not be represented in the concatemer and hence will also be absent in the resulting sequence assembly.

Transposon-mediated sequencing approaches (reviewed in 6) have become increasingly popular, with reagents for a number of different transposon systems readily available. Transposons offer considerable advantages for cDNA sequencing. Chief among these is simplicity of use. Transposon reagents, including transposase, are mixed with purified DNA from the target cDNA clones. Following re-introduction into bacterial cells, transposon insertions into the clone are selected using an antibiotic resistance marker carried by the transposon. Sequencing is achieved using primers specific for the ends of the transposon. Some studies report mapping the transposon insertions relative to the vector, in this way avoiding sequencing vector DNA (7).

Although straightforward, transposon-mediated sequencing has some disadvantages. First, transposons must insert into their cDNA targets in an approximately random fashion if they are to be of general utility in large-scale approaches. While most transposon vendors claim their products do not have preferred insertion sites, these claims tend not to be substantiated by sufficient data and require confirmation. A second disadvantage concerns transposon insertions into vector DNA. While these can be identified and sequencing of them avoided, doing so is cumbersome. For example, the locations of the transposon insertions are first mapped relative to the vector DNA and then non-vector transposon insertions are selected for sequencing. A third disadvantage is that the existing protocols involve creating sequencing templates by inserting transposons into individual clones, with a separate transposon reaction required for each clone. For large efforts aimed at sequencing tens of thousands of cDNAs this could involve assembling a similar number of transposition reactions, a large task we viewed as undesirable.

Our sequencing approach addresses these issues. We show in this report that Mu transposon insertion events are almost random. Further, although the transposon exhibits a very weak preference for an insertion target sequence, this does not negatively impact the use of Mu transposon for cDNA sequencing. The accompanying paper by Shevchenko et al. (8) reports similar results for the TN5 transposon. In our method, selection against insertions in the vector origin of replication and antibiotic resistance-encoding gene(s) substantially reduces the number of vector insertions sequenced. Finally, we have developed a scheme in which transposons are introduced to pools of up to 96 cDNA clones, reducing the number and costs of transposon reactions that would otherwise be required.

MATERIALS AND METHODS

Culturing and DNA purification of plasmid clones

cDNA clones, arrayed in 384-well plates as bacterial glycerol stocks, were inoculated into individual wells in 96-well, square well growth blocks (Beckman Coulter) containing 1.2 ml 2× YT medium (Becton Dickinson) and appropriate antibiotic, either 10.0 µg/ml chloramphenicol (Sigma) for pOTB7 vector (G. Rubin) or 100 µg/ml ampicillin (Sigma) for pCMV-SPORT6 vector (Invitrogen). The blocks were sealed with a sheet of AirPore™ tape (Qiagen), and incubated in a New Brunswick Scientific shaking-incubator fitted with custom holders for 20 h at 37°C with agitation at 290 r.p.m. Following growth, cell pellets were collected by centrifugation for 20 min at 1400 g and the media decanted. Draining of residual media was achieved by inverting the blocks over paper towelling. The blocks were sealed with foil tape and stored at –80°C. Culturing of cDNA clones carrying the Mu transposon was performed in the same manner with the exception that growth media was supplemented with 10 µg/ml kanamycin (EM Science). The use of two antibiotics provided positive selection of cDNA clones containing the transposon and simultaneous negative selection for clones in which the transposon had inserted into the vector origin of replication or antibiotic resistance gene, thereby reducing the frequency of sequences primed from within vector DNA. Plasmid DNA purification was performed using a 96-well alkaline lysis-based protocol derived for purification of BAC DNA (J.Schein, T.Kucaba, M.Sekhon, D.Smailus, R.Waterston and M.Marra, submitted for publication). Plasmid DNA was resuspended in 140 µl of sterile, deionized water and stored at –20°C. DNA concentrations were determined using a 96-well spectrophotometer (PowerWave SelectX, Bio-Tek Instruments).

Restriction digest and insert size estimation of cDNA clones

DNA samples were subjected to restriction digestion with EcoRI and XhoI to liberate the insert from the vector. Reactions were performed in 96-well cycle plates (Robbins Scientific), 384 samples at a time. For each sample, the restriction digest reaction contained 16.5 µl of sterile, deionized water, 2 µl of 10× React 2 buffer (Invitrogen), 0.25 µl of (2.5 U) each EcoRI and XhoI (Invitrogen) and 1 µl of the purified plasmid DNA. The cycle plates were sealed with foil tape and incubated for 2 h at 37°C in an air incubator. Following incubation, 5 µl of 5× loading dye [0.21% (w/v) bromophenol blue (Fisher Biotech), 12.5% (w/v) Ficoll Type 400 (Sigma), in water] was added to each well. One microliter of the mixture from each well was used for analysis by agarose gel electrophoresis. Agarose gels were prepared and electrophoresis performed as described elsewhere (J.Schein, T.Kucaba, M.Sekhon, D. Smailus, R.Waterston and M.Marra, submitted for publication) with the modifications noted here: gels were cast with 0.7% agarose; four 121-lane combs were placed in each 23 cm width × 42.5 cm length casting tray, allowing 384 samples to be loaded onto a single cast gel; 3 ng of 1 kb Plus Ladder (Invitrogen) were loaded into every fifth well as DNA size marker. Electrophoresis was performed for 2 h at 4.2 V/cm. Analysis of the digitized gel images for purposes of restriction fragment identification and size determination was performed interactively as described (9) using Image 3.10 (10) (F.Wobus, D.Platt, R.Durbin, J.Atwood, S.Kelley and J.Mullikin, unpublished results). Image software was developed at, and is available from, the Sanger Institute (http://www.sanger.ac.uk/Software/Image).

Pooling cDNA clones

DNA concentration and clone insert length data were employed to calculate the volume of each cDNA clone to be added into the pool. Longer cDNA clones require more recovered transposon insertion events than smaller clones in order to achieve the same sequence coverage along the length of the clone. The representation of each clone in the pool was therefore calculated to be proportional to its length, relative to the lengths of the other clones in the pool. This was achieved using an algorithm that considered both the relative size of each cDNA clone in the pool and the concentration of the purified clone DNA. For purposes of pipetting accuracy, the minimum volume permitted for any clone was 0.5 µl. First, the ratio of DNA concentration (ng/µl) to length (bp) was calculated for each clone, where the size was calculated as the sum of the clone insert size and the effective transposon target size of the vector (700 bp). The volumes of DNA to add into the pool were calculated as follows. (i) Let x = the highest (ng/µl)/bp ratio in the pooling set and let a = the size of this clone. (ii) For the clone represented by x, add the minimum dispensable volume (0.5 µl) to the pool. This clone becomes the reference point for calculations for all other clones in the pooling set. (iii) For each other clone in the pooling set, let y = the (ng/µl)/bp ratio, let b = the size of the clone and let v = the volume of DNA added to the pool such that v = 0.5(x/y)(b/a). The term 0.5(x/y) in this calculation describes the volume required to equalize the molar amount of DNA with respect to the reference clone, while the term (b/a) adjusts this volume to account for the relative difference in size. The required volumes of DNA for all cDNAs in the pool were combined into a single 1.5 ml microcentrifuge tube.
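The pooling calculation in steps (i) to (iii) can be sketched as follows. This is a minimal illustration, not the laboratory's software: the function name, argument layout and example clones are hypothetical, and only the formula v = 0.5(x/y)(b/a), the 0.5 µl reference volume and the 700 bp vector target size come from the text.

```python
def pool_volumes(clones, min_volume_ul=0.5, vector_target_bp=700):
    """Return the volume (µl) of each clone to add to the pool.

    clones maps a clone name to (DNA concentration in ng/µl, insert size in bp).
    The names and data layout are hypothetical; the formula follows the text.
    """
    # Clone size = insert size + effective transposon target size of the vector.
    sizes = {name: insert_bp + vector_target_bp
             for name, (_, insert_bp) in clones.items()}
    # (ng/µl)/bp ratio for each clone.
    ratios = {name: conc / sizes[name] for name, (conc, _) in clones.items()}
    # (i) The clone with the highest ratio is the reference: x and a.
    ref = max(ratios, key=ratios.get)
    x, a = ratios[ref], sizes[ref]
    # (ii)/(iii) v = 0.5 (x/y)(b/a); for the reference clone this is 0.5 µl.
    return {name: min_volume_ul * (x / ratios[name]) * (sizes[name] / a)
            for name in clones}


# Hypothetical two-clone pool: 1300 bp and 3300 bp inserts.
volumes = pool_volumes({"cloneA": (100.0, 1300), "cloneB": (50.0, 3300)})
print(volumes)
```

In this example cloneA is the reference and receives 0.5 µl, while cloneB receives about 4 µl. Relative to cloneA that is four times the DNA mass and, because cloneB is twice as long, twice the number of molecules, so representation is proportional to clone length.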

Construction of transposon insertion libraries from pooled cDNAs

Transposon libraries from the pooled cDNAs were constructed using the Template Generation System kit (Finnzymes). Each transposition reaction mixture contained 4 µl of 5× reaction buffer, 5 ng of Entranceposon (KanR), 0.22 µg of MuA transposase, 1 µg of total DNA from the pooled cDNAs and sufficient sterile, deionized water to bring the total reaction volume to 20 µl. Reaction components were combined in a 0.2 ml thermocycler tube (Applied Biosystems). Transposition reaction conditions were as per manufacturer’s instructions. When the reaction was complete the solution was diluted to 200 µl with sterile, deionized water. An aliquot of 1 µl of the resulting transposon library pool was used to transform 50 µl of ELECTROMAX™ DH10B T1 Phage Resistant electrocompetent cells (Invitrogen) using an electroporator (E. coli Pulser, Bio Rad). The cells were resuspended in 1 ml of SOC medium (11) and incubated at 37°C for 1 h. Following incubation, 330 µl of the resuspended cells were plated onto each of three, 22 cm square plates (Genetix) containing 300 ml of 2× YT agar and 10 µg/ml chloramphenicol (or 100 µg/ml ampicillin for pCMV-SPORT6 vectors) and 10 µg/ml kanamycin. The plates were incubated at 37°C for 16 h or until colonies had grown to a sufficient size for picking. Using a QPix colony-picking robot (Genetix), colonies were picked into individual wells of 384-well microtiter plates (Genetix), each well containing 80 µl of 2× YT medium supplemented with 7.5% glycerol, 10 µg/ml chloramphenicol (or 100 µg/ml ampicillin for pCMV-SPORT6 vectors) and 10 µg/ml kanamycin. The microtiter plates were incubated at 37°C for 16 h and then stored at –80°C. A sufficient number of colonies were picked to achieve, on average, 11 sequencing reads per kb per clone in the pool. The required number of sequencing reads was calculated as (11 reads/kb)[sum of clone sizes (kb) in the pool]. 
Since two sequencing reads are generated from each transposon (one read from each end of the transposon) the number of clones needed is equal to half the number of required reads.
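As a worked example of this calculation, the sketch below computes the number of colonies to pick for a pool. The function name and the example pool are hypothetical, and we assume here that each clone size includes the 700 bp effective vector target, as in the pooling calculation.

```python
import math


def colonies_to_pick(insert_sizes_bp, reads_per_kb=11, vector_target_bp=700):
    """Colonies needed to reach the target read depth for one pool.

    Assumption: clone size = insert size + effective vector target (700 bp).
    """
    total_kb = sum(size + vector_target_bp for size in insert_sizes_bp) / 1000
    required_reads = reads_per_kb * total_kb
    # Two reads per transposon-bearing colony, one from each transposon end.
    return math.ceil(required_reads / 2)


# Hypothetical pool of 96 clones, each with a 1900 bp insert.
print(colonies_to_pick([1900] * 96))  # 1373
```

For this pool the target is 249.6 kb, which requires about 2746 reads and therefore 1373 colonies.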

DNA sequencing

An aliquot of the purified plasmid DNA was diluted one in five for use in sequencing reactions. Sequencing reactions were performed in 384-well cycle plates (Applied Biosystems) in a 4 µl total volume. Each reaction contained 2 µl of the diluted DNA (80–100 ng), 0.26 µl of sequencing primer (at 5 pmol/µl), 0.43 µl of 5× reaction buffer (400 mM Tris pH 9.0, 10 mM MgCl2), 0.54 µl of BigDye v.2.0 terminator premix (Applied Biosystems) and 0.77 µl of sterile, deionized water. Single-pass end sequence reads of the cDNA clones were performed using universal primers –21 M13 Forward (5′-TGTAAAACGACGGCCAGT-3′) and M13 Reverse (5′-CAGGAAACAGCTATGAC-3′) as well as an oligo(dT)23N (N = A, G, C) primer. Primers specific for the ends of the Mu transposon (5′-GAATTCTCTAGATGATCAGCGGC-3′ and 5′-CGAACTTTATTCGGTCGAAAAGG-3′) were used to generate sequence reads from cDNA clones containing transposon insertions. Thermal cycling of the reactions was performed on an MJ Research PTC-225 thermal cycler, with ramp speed set at 1°C/s. Cycling parameters for end reads were: 30 cycles of 96°C for 10 s, 52°C for 5 s for –21 M13 Forward primer [or 43°C for M13 Reverse, 41°C for oligo(dT)23N], 60°C for 3 min, followed by incubation at 4°C. Cycling parameters for transposon-generated sequencing reads were: 95°C for 2 min, followed by 30 cycles of 96°C for 10 s, 56°C for 5 s, 60°C for 3 min, followed by incubation at 4°C. Reaction products were precipitated by addition of 6 µl isopropanol (for a final isopropanol concentration of 60%) and incubation for 15 min at room temperature (protected from light), followed by centrifugation at 2750 g for 30 min. Isopropanol was decanted by inverting and vigorously shaking the plates to remove the liquid from the wells. Reaction pellets were washed by addition of 40 µl of 70% ethanol, immediately decanting the liquid by inverting the plates and then centrifuging the inverted plates over paper towelling for 1 min at 700 g to remove residual ethanol. 
The dried samples were stored in plastic bags at –20°C. Samples were resuspended in 8 µl of autoclaved, deionized water prior to loading on an ABI Prism 3700 DNA Analyzer.

Data processing and sequence assembly

DNA sequence chromatograms were processed using the PHRED software (12,13). Scripts automatically moved sequence data into a MySQL database. Sequence reads yielding fewer than 20 bases of PHRED quality 20 or greater were excluded from sequence assemblies. DNA sequence assemblies were performed using PHRAP (14) with default parameters. PHRED/PHRAP scripts were modified to handle our read naming conventions. Analysis of consensus sequences and sequence finishing were carried out using a combination of Consed (14) and in-house web tools.
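The read exclusion rule can be illustrated with a short sketch. It assumes that a "PHRED 20 quality base" means a base call with quality value of 20 or more, and that the per-base quality values of a read are available as a list of integers. The function name is hypothetical.

```python
def passes_quality_filter(quality_values, min_quality=20, min_bases=20):
    """True if the read has at least min_bases bases of quality >= min_quality."""
    good_bases = sum(1 for q in quality_values if q >= min_quality)
    return good_bases >= min_bases


# A read with 20 good bases is kept; one with only 19 is excluded.
print(passes_quality_filter([30] * 20 + [10] * 100))  # True
print(passes_quality_filter([30] * 19 + [10] * 100))  # False
```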

Identification of transposon insertion sites was carried out using CROSS_MATCH software (http://www.phrap.org) and the known Mu transposon sequence supplied by Finnzymes. The results of CROSS_MATCH were used to detect the actual site of transposon insertion within assemblies. Software was designed to analyze each contig and automatically detect the site of transposon insertion where the 5 bp target site duplication occurs. Where the software was unable to detect the exact location of the transposon insertion site (due to lower quality reads), the location was estimated to be 16 bp from the beginning of the sequence read. The software was also used to detect the sequences flanking this site. In calculating the consensus insertion sequence, insertion sites that had been estimated were not utilized. For these analyses the poly(A) tail of the cDNA sequences was not considered.

The Monte Carlo simulation of the random insertion model considered a 1242 clone data set. We randomized the positions of each of the insertions, preserving the number of insertions for each clone. We then calculated the binomial probabilities for each bin. This process was replicated 100 000 times and the average for each bin represented the expected probability.
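One way such a simulation could be structured is sketched below. It is an illustration under stated assumptions rather than the code used in this study: insertions are placed uniformly at random within each clone, the number of bins per clone is the rounded square root of the number of insertions, and the binomial P-value is an exact two-sided value (the sum of the probabilities of all outcomes no more likely than the observed count). All function names are hypothetical, and the replicate count is kept small for speed.

```python
import math
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def binomial_p_value(k, n, n_bins):
    """Exact two-sided binomial P-value for k of n insertions in one bin."""
    p = 1 / n_bins
    pmf = [math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def expected_significant_bins(clones, replicates=200, alpha=0.05, seed=0):
    """Mean number of bins with P <= alpha when insertions are random.

    clones is a list of (clone length in bp, number of insertions); the
    number of insertions per clone is preserved in every replicate.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(replicates):
        for length, n in clones:
            n_bins = max(1, round(math.sqrt(n)))
            counts = [0] * n_bins
            for _ in range(n):
                counts[rng.randrange(length) * n_bins // length] += 1
            total += sum(binomial_p_value(c, n, n_bins) <= alpha for c in counts)
    return total / replicates


# Hypothetical data set: five 2 kb clones with 16 insertions each (20 bins).
print(expected_significant_bins([(2000, 16)] * 5))
```

The returned value is the number of bins expected to reach P ≤ alpha by chance alone, which is the quantity against which the observed number of such bins is compared.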

RESULTS

Our process for obtaining complete cDNA sequences is depicted in Figure 1. To date, 7.06 Mb of finished cDNA sequence from 3695 cDNA clones have been produced. The average cDNA size is 1.9 kb, and on average each clone received 37.5 sequence reads, or 19.6 sequence reads per kilobase (Table 1). The sequence of every clone is contiguous and accurate, with each base in every clone of at least PHRED (12,13) quality 30 (corresponding to an error probability that does not exceed one in 1000 bases) with an overall error probability for each clone sequence that is less than one error per 50 000 bases. The majority (82%) of these clones are in the pOTB7 vector (G. Rubin, personal communication) and all clones have been sequenced as part of the Mammalian Gene Collection, sponsored by the National Cancer Institute (1).
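The relationship between PHRED quality values and error probability used above is Q = -10 log10(P), so quality 30 corresponds to P = 0.001. The sketch below shows the conversion and one plausible way to express a per-clone error rate, namely the mean of the per-base error probabilities. That averaging step is our assumption for illustration; the text does not spell out how PHRED/PHRAP derives the per-clone figure.

```python
def phred_to_error_probability(q):
    """Convert a PHRED quality value to a per-base error probability."""
    return 10 ** (-q / 10)


def mean_error_rate(quality_values):
    """Expected errors per base, assuming independent per-base errors."""
    return sum(phred_to_error_probability(q) for q in quality_values) / len(quality_values)


print(phred_to_error_probability(30))              # approximately 0.001
# A hypothetical 1900 bp consensus with every base at quality 50:
print(mean_error_rate([50] * 1900) < 1 / 50_000)   # True
```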

Figure 1
Overview of the cDNA sequencing process. Additional details are included in the text.
Table 1.
A summary of clones completed at different stages in the sequencing pipeline

Construction and sequencing of pools

One of the key features contributing to the efficiency of our approach is the construction of pools of up to 96 cDNA clones prior to introduction of transposons. The main advantage of the pooling step is the reduction by almost two orders of magnitude in the number of transposon libraries that would otherwise be required, simplifying this part of the procedure and reducing the costs. Care is taken in constructing the pool, with proportional representation of each clone in the pool achieved by consideration of its length and DNA concentration relative to the other clones in the pool. Single pass sequence reads (i.e. ESTs) from both ends of every cDNA clone are generated to permit identification of the cDNA after sequence assembly. For 1% of the cDNAs that we have processed, end sequence reads [EST and oligo(dT) reads; Materials and Methods] are sufficient to complete the clone (Table 1).

We generate restriction size data and DNA concentration information (data not shown) for each clone that enters into the sequencing queue, and use these data to automatically calculate (Materials and Methods) the volumes of DNA that must be combined to produce the ‘primary pool’. Mu transposon reagents are added to the pooled cDNAs (Materials and Methods) and aliquots of the pooled DNAs are then introduced into bacterial cells by electroporation. Bacterial cells, receiving a cDNA clone with a transposon inserted into it, are selected using an antibiotic marker carried by the transposon. Simultaneously, we select against transposon insertions into the antibiotic resistance gene present in vector DNA. Insertions into this antibiotic resistance gene are effectively ‘lethal’ to the bacterial cell. Because these cells tend not to propagate and the antibiotic resistance gene is relatively large, the lethality results in a dramatic reduction in the number of vector sequences generated from transposons inserted into vector DNA. Similarly lethal are transposon insertions into the cDNA vector origin of replication. Hence, the pOTB7 vector targets that are lethal under our selection conditions reduce the length of the vector capable of housing a transposon insertion to ~700 bp, which is only 39% of the original size of the pOTB7 vector DNA (Fig. 2). This serves to reduce the number of vector bases sequenced, enhancing the efficiency and cost-effectiveness of our approach.

Figure 2
Analysis of transposon insertions in the pOTB7 vector. A set of sequences (5552) were analyzed and the relative positions of transposon insertions within the vector were mapped. Of the sequences in this set, 22% (1233) were observed to initiate ...

For generation of sequencing templates from transposon-bearing clones, approximately 5.5 colonies per kilobase of cDNA clone insert (plus the effective vector target size of 700 bp) were picked robotically and used to construct primary pool glycerol stocks. From these, purified DNAs were produced for sequencing. These DNAs were referred to as the sequencing library, with each DNA representing one of the original clones added to the pool, now bearing a transposon. Two sequence reads were generated from each transposon-bearing DNA, one read from each end of the transposon. Systematic sequencing from different DNAs was performed until approximately 11 sequence reads per kilobase of target DNA (the length of which is calculated by summing the portion of the vector DNA and the cDNAs in the pool) had been generated. On average, after assembly with PHRAP (14), this level of sequence coverage resulted in completion of 58% of the cDNAs in the primary pools (Table 1), with these cDNA sequences requiring no additional effort to meet our sequence quality target.

Assessment of sequence assemblies

Assemblies were assessed automatically to identify those that represented complete or near complete cDNA sequences. To be considered complete, clone sequences had to meet four criteria. These were: (i) detection of the restriction sites used for cDNA cloning; (ii) verification that the length of the sequence assembly and the length of the clone as determined by restriction enzyme analysis were within 10%; (iii) verification that the cumulative error probability of the cDNA sequence did not exceed one error per 50 000 bases (as determined by PHRED/PHRAP); and (iv) the presence in the assembly of end sequence reads derived from both ends of the cDNA. This last criterion served to unambiguously identify the clone. Nearly complete sequences were defined as those requiring no further sequence data. Sequences in this category were typically ones in which sequence coverage was adequate but the PHRAP assemblies were erroneous, resulting usually in incorrect trimming at the cDNA sequence termini. These errors were corrected manually.
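The four completeness criteria can be expressed as a simple check, sketched below. The record layout and field names are hypothetical, and we assume that the 10% length tolerance is measured relative to the restriction-derived clone length, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    """Hypothetical summary of one clone's sequence assembly."""
    cloning_sites_found: bool     # (i) restriction sites used for cloning detected
    assembly_length_bp: int       # length of the sequence assembly
    restriction_length_bp: int    # clone length from restriction analysis
    errors_per_base: float        # cumulative error estimate from PHRED/PHRAP
    has_5prime_end_read: bool     # end read from one end of the cDNA present
    has_3prime_end_read: bool     # end read from the other end present


def is_complete(a, length_tolerance=0.10, max_error_rate=1 / 50_000):
    """Apply criteria (i) to (iv) to one assembly."""
    length_ok = (abs(a.assembly_length_bp - a.restriction_length_bp)
                 <= length_tolerance * a.restriction_length_bp)     # (ii)
    return (a.cloning_sites_found                                   # (i)
            and length_ok
            and a.errors_per_base <= max_error_rate                 # (iii)
            and a.has_5prime_end_read and a.has_3prime_end_read)    # (iv)


print(is_complete(Assembly(True, 1950, 1900, 1e-5, True, True)))  # True
print(is_complete(Assembly(True, 2200, 1900, 1e-5, True, True)))  # False
```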

Construction of secondary pools

cDNA clones which remained incomplete after the first round of systematic sequencing were used to construct a secondary pool. For each secondary pool a sequencing library was constructed, employing the same methods used to construct the primary pools. A second round of sequencing followed until the depth of sequence coverage was 11 sequence reads per kilobase of pooled clones. After this round of sequencing, the sequence reads from both libraries were assembled and automatically processed to identify completed clone sequences. After this second round of sequencing and sequence assembly, on average an additional 27% of the clones in the primary pools were completed (Table 1).

Additional sequence data for clones remaining unfinished after sequence assembly of reads from the secondary pool were collected using strategies aimed at resolution of specific sequence discrepancies. These strategies included automated targeted sequence read selection with Autofinish (15) and manual primer selection and sequence editing using Consed (14). Fourteen percent of the clones processed were finished using directed strategies (Table 1). In 63% of these cases, a single round (i.e. a single additional targeted sequence read derived from a ‘custom’ primer) of directed sequencing was sufficient to complete the clone. The remainder of the clones required multiple iterations of directed sequencing.

Assessment of pool construction

The aim of pooling was to produce a mixture of clones normalized with respect to DNA concentration and clone length, the latter estimated by restriction enzyme analysis (Materials and Methods). To measure the efficacy of pool construction, we examined the correlation between the number of sequence reads required to complete cDNAs and the lengths of their completed sequence assemblies. We also examined the correlation between cDNA sequence assembly length and the length as determined by restriction enzyme analysis. A positive correlation between assembly size and reads accumulated per clone would indicate that the pooling methodology was functioning as intended, and that both restriction size and DNA concentration data were appropriately considered in pool construction. A negative correlation would indicate otherwise. For example, pooling clones based only on equalizing molar ratios and not correcting for cDNA length would produce a negative correlation.

We observed a reasonably strong correlation (r = 0.7) between the number of sequence reads accumulated per clone and the size of the sequence assemblies (Materials and Methods; Fig. 3). A small number of clones (~2%) appeared to be outliers in this analysis. We sought to explain these apparent outliers, considering it likely that either the restriction enzyme length data or the DNA concentration data were incorrect. We were unable to verify the original DNA concentration measurements for these clones but were able to confirm that the restriction data were accurate. We confirmed this was generally the case by comparing the restriction data with the clone sizes determined by the finished sequence for all 3695 finished clones. In this analysis, 86% of the comparisons were within 5% of the sequence assembly size.

Figure 3
Correlation between sequence assembly size and sequence reads in the assembly. There is a positive correlation between the number of reads obtained and the length of the clone. The pooling calculations employed therefore appropriately consider variations ...

Mu transposon insertion events are approximately random

To study the distribution of Mu transposon insertion events, we examined 21 469 insertions into 1242 cDNA clones, representing in total 2.8 Mb of completed sequence data. First, each cDNA sequence was divided into a number of equal sized bins, determined by calculating the square root of the number of transposon insertions observed in the cDNA (16,17; Materials and Methods). Next, we used a binomial test to determine whether the frequency of insertion events in each bin was consistent with a random insertion model across the entire clone. We found that of the 4760 bins studied (median bin size 503 bp), 541 (11.4%) had an insertion profile which did not support a random insertion pattern (P ≤ 0.05; Fig. 4). Using a Monte Carlo simulation (Materials and Methods) we determined that with the bin size distribution employed and the number of transposon insertions studied, 382 bins would fall outside the cutoff probability of P ≤ 0.05, even if the transposon insertion pattern was random. Therefore, we observed 159 (3.3%) more bins with P ≤ 0.05 than would be expected with a completely random transposon insertion profile. To further investigate the distribution of transposon insertion events, we performed a chi-squared analysis of the distribution of sequence read start-sites (16; Materials and Methods). Of the 1242 cDNAs, which together contained the 21 469 sequence read start sites, 119 (9.6%) had a sequence read start-site profile which rejected the hypothesis of random insertion (P ≤ 0.05). Both these tests suggest that the insertion behavior of the Mu transposon only weakly deviates from random.
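The binning and per-bin binomial test can be sketched as follows for a single clone. This is an illustration under stated assumptions, not the analysis code used here: the number of bins is the rounded square root of the number of insertions, each bin is expected to receive an equal share of insertions under the random model, and the P-value is an exact two-sided value (the summed probability of all outcomes no more likely than the observed count). Function names are hypothetical.

```python
import math


def binomial_two_sided_p(k, n, p):
    """Exact two-sided binomial P-value for observing k successes in n trials."""
    pmf = [math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def bin_p_values(insertion_positions, clone_length_bp):
    """P-value per bin for one clone, given 0-based insertion positions."""
    n = len(insertion_positions)
    n_bins = max(1, round(math.sqrt(n)))
    counts = [0] * n_bins
    for pos in insertion_positions:
        counts[min(pos * n_bins // clone_length_bp, n_bins - 1)] += 1
    return [binomial_two_sided_p(c, n, 1 / n_bins) for c in counts]


# Hypothetical 1600 bp clone with 16 insertions, so 4 bins of 400 bp.
evenly_spread = list(range(50, 1600, 100))
clustered = [10] * 16
print(bin_p_values(evenly_spread, 1600))  # every bin holds 4 insertions
print(bin_p_values(clustered, 1600))      # every bin deviates from random
```

In the evenly spread case no bin departs from the expectation of four insertions, so no P-value falls below 0.05. In the clustered case all four bins do, including the three empty ones.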

Figure 4
Mu transposon insertion deviates only slightly from random. The insertions of Mu into 1242 cDNA clones were analyzed using the binomial test and assigned to bins (Materials and Methods). The resulting P-values reflect the likelihood that the observed ...

The Mu transposon exhibits a weak preference for insertion sites

We examined the 40 bases surrounding each of 22 785 transposon insertions in 1242 clones. This revealed a weak consensus sequence for Mu transposon insertion. These data are summarized in Table 2 and Figure 5. The overall consensus indicates that regions of higher GC content represent preferential substrates for Mu transposon insertion. More specifically, at the 5′ end of the 5 bp insertion site, we observe a preference for pyrimidines, especially C. At the 3′ end of the sequence we observe a preference for purines, particularly G. The center position in the sequence tends to be occupied by a C or G. These findings substantiate those of previous studies but with a much larger set of insertion data (18).

Figure 5
Base composition surrounding the Mu transposon insertion site. The frequency of each base flanking 22 785 Mu-insertion events was cataloged (see Table 2). From those data, the relative compositions of GC and AT (A) and pyrimidine (Py) and purine (Pu) ...
Table 2.
Frequency of bases immediately flanking insertion of the Mu transposon

We next examined in greater detail the frequency at which individual insertion sites were used. All of the possible 1024 (4⁵) 5mer sequences are found at least once within the set of cDNA sequences analyzed. Of these, 1015 were occupied (‘hit’) at least once by a transposon. Interestingly, the remaining nine 5mer sites were not hit by a transposon (Fig. 6). From the insertion site frequency distribution (Fig. 6), it is apparent that the consensus sequences we derived are used, but at a low frequency. Furthermore, the vast majority (94%) of the 1015 sites containing an insertion tend to be ‘hit’ at a frequency of <0.3% of the 22 785 insertion events examined. We calculated the frequency of all 1024 5mers found in the cDNA sequences analyzed (Fig. 6). From this, it is apparent that there were many more possible sites for Mu transposon insertion than were actually occupied. This is true even in the cases of the most frequently occupied insertion sites. Furthermore, there is a positive correlation (r = 0.44) between the frequency of 5mers used by the transposon and the frequency of these 5mers in the cDNA sequences. The observation that many different insertion sites house the overwhelming majority of insertion events explains the effectiveness of the Mu transposon in our sequencing scheme and substantiates our claim that the insertion pattern of Mu transposon deviates only slightly from that predicted using a random model.
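The 5mer comparison can be sketched as follows: enumerate all 4⁵ = 1024 possible 5mers, count how often each occurs in the cDNA sequences and how often each was used as an insertion site, then correlate the two. The names below are hypothetical and the Pearson coefficient is an assumption, since the text reports only r; this is an illustration rather than the code used in the study.

```python
import math
from collections import Counter
from itertools import product

# Every possible 5mer over the DNA alphabet: 4**5 = 1024 sequences.
ALL_5MERS = ["".join(p) for p in product("ACGT", repeat=5)]


def kmer_counts(sequences, k=5):
    """Count every overlapping k-mer in a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def compare_usage(cdna_sequences, insertion_sites):
    """Return (number of 5mers hit at least once, r between availability and use)."""
    available = kmer_counts(cdna_sequences)
    used = Counter(insertion_sites)
    hit = sum(1 for kmer in ALL_5MERS if used[kmer] > 0)
    r = pearson_r([available[k] for k in ALL_5MERS], [used[k] for k in ALL_5MERS])
    return hit, r
```

Here `insertion_sites` is assumed to be a list with one 5 bp target sequence per insertion event, so `Counter(insertion_sites)` gives the per-5mer usage counts.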

Figure 6
Comparison of the frequency of 5mers occurring in the sequences of 1242 cDNAs with the frequency of 5mers utilized in 22 785 transposon insertion events (see Table 2).

DISCUSSION

In this report we described a transposon-mediated approach for large-scale high-throughput sequencing of cDNA clones. Developed with a view towards future increases in scale, the method has been used in our laboratory over the last 14 months to generate >7 Mb of high quality sequence data from 3695 cDNA clones. Key to our approach has been the construction of pools containing up to 96 clones. This has obviated the need to perform an independent transposition reaction for every cDNA sequenced.

We validated our approach using commercially available Mu transposon reagents (Materials and Methods). Selection for transposons inserted into the clone and against transposons inserted into the majority of the vector DNA resulted in reduction in the number of sequence reads derived from vector DNA, enhancing the efficiency of the approach.

Assembly of sequence reads mid-way through the sequencing process, followed by the identification of completed and nearly completed clone sequences and the subsequent construction of a secondary pool, has also enhanced efficiency. This increase in efficiency results from the omission in the secondary pool of clones completed in the first round of systematic sequencing. Hence, additional reads obtained from clones in the secondary pool are not expended on clones already completed, but limited to those clones remaining unfinished at the mid-way point in the process.

Further increases in the efficiency of our cDNA sequencing method are possible. These could be achieved by further reducing the number of sequence reads derived from vector DNA. Two possibilities exist. One of these is to promote the design and use of cDNA vectors that are composed primarily of ‘lethal sites’, so that a transposon insertion into vector DNA would yield a construct that would not propagate and, hence, not be sequenced. Another possibility is to employ ‘Gateway’ technology, as described in the accompanying manuscript by Shevchenko et al. (8). Either of these approaches could be used in cases where new libraries in ‘lethal’ vectors were being constructed or where libraries had been constructed in ‘Gateway’ compatible vectors. Certainly we would advocate that cDNA library makers consider these and other options when constructing libraries to fuel sequencing programs.

We performed an extensive analysis of the sequences into which Mu had inserted and from this were able to verify that there was a preferred Mu insertion sequence, as described previously (18). Our result was based on the analysis of many more insertions than the previous study (22 785 versus ~200) and resulted in a refinement of the previously proposed consensus sequence. The preference Mu exhibits for this site is weak and in fact the pattern of Mu insertions deviates only slightly from that predicted by a random model. This does not negatively impact the use of Mu transposon in our cDNA sequencing activity and is supported by the observation that the majority (85%) of the cDNAs were completed without directed sequencing but were adequately covered by reads derived systematically from transposon insertions.

ACKNOWLEDGEMENTS

We thank Eric Green for his helpful comments on the manuscript and Bob Waterston for insightful discussion on constructing cDNA pools. We acknowledge Andrew Heaford of the Whitehead Institute/MIT Center for Genome Research for advice on sequencing reactions. We are indebted to the staff of the British Columbia Cancer Agency Genome Sciences Centre for superb technical and administrative assistance. M.A.M. is a Scholar of the Michael Smith Foundation for Health Research.

REFERENCES

1. Strausberg,R.L., Feingold,E.A., Klausner,R.D. and Collins,F.S. (1999) The mammalian gene collection. Science, 286, 455–457.
2. Wiemann,S., Weil,B., Wellenreuther,R., Gassenhuber,J., Glassl,S., Ansorge,W., Bocher,M., Blocker,H., Bauersachs,S., Blum,H. et al. (2001) Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs. Genome Res., 11, 422–435.
3. Andersson,B., Lu,J., Shen,Y., Wentland,M.A. and Gibbs,R.A. (1997) Simultaneous shotgun sequencing of multiple cDNA clones. DNA Seq., 7, 63–70.
4. Devine,S.E., Chissoe,S.L., Eby,Y., Wilson,R.K. and Boeke,J.D. (1997) A transposon-based strategy for sequencing repetitive DNA in eukaryotic genomes. Genome Res., 7, 551–563.
5. Anderson,S. (1981) Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids Res., 9, 3015–3027.
6. Kimmel,B.E., Palazzolo,M.J., Martin,C.H., Boeke,J.D. and Devine,S.E. (1997) Transposon-mediated DNA sequencing. In Birren,B., Green,E.D., Klapholz,S., Myers,R.M. and Roskams,J. (eds), Genome Analysis: A Laboratory Manual. Analyzing DNA. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, Vol. 1, pp. 455–532.
7. Strathmann,M., Hamilton,B.A., Mayeda,C.A., Simon,M.I., Meyerowitz,E.M. and Palazzolo,M.J. (1991) Transposon-facilitated DNA sequencing. Proc. Natl Acad. Sci. USA, 88, 1247–1250.
8. Shevchenko,Y., Bouffard,G.G., Butterfield,Y.S.N., Blakesley,R.W., Hartley,J.L., Young,A.C., Marra,M.A., Jones,S.J.M., Touchman,J.W. and Green,E.D. (2002) Systematic sequencing of cDNA clones using the transposon Tn5. Nucleic Acids Res., 30, 2469–2477.
9. Marra,M.A., Kucaba,T.A., Dietrich,N.L., Green,E.D., Brownstein,B., Wilson,R.K., McDonald,K.M., Hillier,L.W., McPherson,J.D. and Waterston,R.H. (1997) High throughput fingerprint analysis of large-insert clones. Genome Res., 7, 1072–1084.
10. Sulston,J., Mallett,F., Staden,R., Durbin,R., Horsnell,T. and Coulson,A. (1988) Software for genome mapping by fingerprinting techniques. Comput. Appl. Biosci., 4, 125–132.
11. Sambrook,J. and Russell,D.W. (2001) Molecular Cloning: A Laboratory Manual, 3rd Edn. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
12. Ewing,B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res., 8, 175–185.
13. Ewing,B. and Green,P. (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res., 8, 186–194.
14. Gordon,D., Abajian,C. and Green,P. (1998) Consed: a graphical tool for sequence finishing. Genome Res., 8, 195–202.
15. Gordon,D., Desmarais,C. and Green,P. (2001) Automated finishing with autofinish. Genome Res., 11, 614–625.
16. Chissoe,S.L., Marra,M.A., Hillier,L., Brinkman,R., Wilson,R.K. and Waterston,R.H. (1997) Representation of cloned genomic sequences in two sequencing vectors: correlation of DNA sequence and subclone distribution. Nucleic Acids Res., 25, 2960–2966.
17. Sokal,R.R. and Rohlf,F.J. (1995) Biometry: The Principles and Practice of Statistics in Biological Research, 3rd Edn. Freeman, New York.
18. Haapa,S., Taira,S., Heikkinen,E. and Savilahti,H. (1999) An efficient and accurate integration of mini-Mu transposons in vitro: a general methodology for functional genetic analysis and molecular biology applications. Nucleic Acids Res., 27, 2777–2784.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press