Nucleic Acids Res. 2007 Aug; 35(16): 5284–5293.
Published online 2007 Aug 7. doi:  10.1093/nar/gkm597
PMCID: PMC2018620

A survey of bacterial insertion sequences using IScan


Bacterial insertion sequences (ISs) are the simplest kinds of bacterial mobile DNA. Evolutionary studies need consistent IS annotation across many different genomes. We have developed an open-source software package, IScan, to identify bacterial ISs and their sequence elements—inverted and target direct repeats—in multiple genomes using multiple flexible search parameters. We applied IScan to 438 completely sequenced bacterial genomes and 20 IS families. The resulting data show that ISs within a genome are extremely similar, with a mean synonymous divergence of Ks = 0.033. Our analysis substantially extends previously available information, and suggests that most ISs have entered bacterial genomes recently. By implication, their population persistence may depend on horizontal transfer. We also used IScan's ability to analyze the statistical significance of sequence similarity among many IS inverted repeats. Although the inverted repeats of insertion sequences are evolutionarily highly flexible parts of ISs, we show that this ability can be used to enrich a dataset for ISs that are likely to be functional. Applied to the thousands of genomes that will soon be available, IScan could be used for many purposes, such as mapping the evolutionary history and horizontal transfer patterns of different ISs.


Transposable elements occur in many bacterial genomes. We thus cannot fully understand bacterial genome evolution unless we understand how such mobile DNA is maintained, and how it spreads among bacterial genomes. Because transposable elements also contribute to an important public health threat, the spreading of drug-resistance genes among pathogenic bacteria, such understanding may ultimately also shed light on the epidemiology of drug-resistant pathogens.

Insertion sequences (ISs) are among the simplest kinds of bacterial mobile DNA. They range in size from 600 to more than 3000 bp and fall into 20 major families (1–3). Most ISs consist of short inverted repeat sequences that flank one or more open reading frames (ORFs), which encode the transposase proteins necessary for their transposition. Some but not all ISs transpose into specific target sites. ISs typically generate a direct repeat of their target site after transposition. Transposition is often associated with an increase in IS copy number in a genome. In eukaryotes, many non-functional copies of transposable elements can be passively proliferated using transposase from functional copies (4–6). In contrast, bacterial IS transposition is often tightly regulated, occurs at a very low level, and is often restricted to cis activity, where a transposase promotes only the transposition of the IS from which it is expressed (7). Exceptions exist, for example in the form of very short miniature inverted repeat elements that may only proliferate passively (8–10).

In the near future a flood of bacterial genome data will become available. Such data will see many uses in studying IS families, including the identification of functionally important sequences from hundreds of family members, and the reconstruction of the evolutionary history of individual ISs. Efforts like these require a comparison of ISs across (many) different genomes. Such a comparison is hindered by existing IS annotations which may differ greatly among genomes, because they have been produced by different research groups using different tools. In addition, existing annotations provide limited information about sequence elements such as inverted repeats, or about the structure of ISs where the transposase is encoded by more than one open reading frame. With these limitations in mind, we have developed IScan, a software tool that allows a user to identify ISs and their associated direct and inverted repeats automatically, flexibly and in multiple genomes, using a curated reference IS from a database such as ISfinder (11). The consistent annotation provided by IScan will greatly aid evolutionary studies.

In two analyses that address two different classes of questions, we applied IScan to 438 completely sequenced bacterial genomes and all 20 major IS families. The first set of analyses addresses the biological question: Why is mobile DNA maintained in bacterial genomes? Mobile DNA might be a very effective parasite, a prototypical example of selfish DNA (12,13), or it might confer benefits to its host. [For example, mobile DNA can mobilize genes for transfer between bacterial strains or species (14)]. Despite its long history, this question has not been completely resolved.

To find out whether mobile DNA persists because it benefits a host, one needs to understand the dynamics of mobile DNA on evolutionary time scales. Laboratory evolution experiments (15–21) are of limited use here. The reason is that the rates at which ISs transpose, are transferred horizontally, and can cause recombinational and other instabilities are so small (22,23) that even long laboratory evolution experiments may detect IS copy number and position variation, but may not be sufficient to determine whether ISs have net deleterious or beneficial effects.

A different approach to understanding the evolutionary dynamics of ISs focuses on the number and distribution of ISs in bacterial populations or closely related bacterial strains (20,24–28). Most pertinent studies were carried out before large-scale genome sequence data became available, and are thus very limited. In a recent paper, we overcame some of the limitations of pre-genome work by analyzing the distribution of five major IS families in 202 complete genomes (29). This analysis suggested that ISs within a genome have very low nucleotide diversity, cause their host to go extinct on evolutionary time scales, and can only be sustained by horizontal transfer. In other words, ISs are likely to be detrimental to their host in the long run. However, this earlier analysis was also hampered by our reliance on available genome sequence annotations to identify ISs. We here overcome this limitation by using IScan to study the distribution and sequence similarity of ISs in more than twice as many genomes and four times as many IS families as in earlier work.

The second of our two applications of IScan addresses a methodological rather than a biological question: Is it possible to distinguish functional from non-functional (especially truncated) ISs computationally—without time-consuming experiments—and for hundreds or thousands of ISs? We suggest an approach based on the similarity of IS inverted repeats. IScan is ideal for this approach, because it can calculate various statistical significance measures for inverted repeat similarity. We show that our approach, while certainly not allowing for perfect discrimination, may enrich a dataset for ISs that are likely to be functional.


Details on information produced by IScan

Via the procedure outlined in the Results, IScan produces a file in FastA format. This file contains the following parts for each IS:

IS section

The section contains a unique identifier, the accession number and version of the sequence in which the IS was found, start and end coordinates on the queried DNA molecule, length, P-value of the inverted repeat alignments and nucleotide sequence.

ORF section

The section contains one entry for each ORF query (and, therefore, BLAST hit), start and end coordinates on the queried DNA molecule, length, strand and nucleotide sequence of the BLAST hit.

BLAST hit section

One entry for each ORF query (and, therefore, BLAST hit), containing the start and end coordinates of the BLAST hit on the queried DNA molecule, the number and proportion of amino acid identities, the expect (E) value and the amino acid sequence of the BLAST hit.

Inverted repeat section

One entry for each of the upstream and downstream inverted repeats, containing the start and end coordinates on the queried DNA molecule; repeat unit length; alignment score; number of matches, mismatches and gaps; P-value of the inverted repeat alignment; and nucleotide alignment for the inverted repeat.

Direct repeat section

One entry for each of the upstream and downstream direct repeats, containing the start and end coordinates on the queried DNA molecule; repeat unit length; alignment score; number of matches, mismatches and gaps; and nucleotide alignment for the direct repeat.

A survey of ISs using IScan

We used IScan to search for ISs belonging to the 20 major IS families listed in Table 1 (1–3) in 438 curated bacterial genomes (consisting of 790 sequenced DNA molecules) available from GenBank (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/). The curated query ISs we used were obtained from the IS repository ISfinder (http://www-is.biotoul.fr) (11). We retained BLAST hits to IS ORFs with an E-value of E ⩽ 1 and at least 35% amino acid identity to the query sequence. For ISs with more than one ORF, we used the parameter h = 50 (Figure 1) to assign ORFs to the same IS. For identification of target direct repeats we used curated data on the length of direct repeats from ISfinder [(2), Table 1, (11)] to define kR and kL for each IS family analyzed. To identify inverted and direct repeats we set mL = mR = 0, and nL = nR = 1.1 × (total IS length – length of IS coding region), where the total and coding region lengths are again derived from information curated for each IS family's reference sequence (11). Inverted and direct repeat alignments were performed with the same scoring matrix of 1 for matches, −2 for mismatches and −5 for gaps and gap extensions.
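The hit filter and the flanking-window length described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not IScan's own code (IScan is implemented in perl): the names BlastHit, keep_hit and flank_window are hypothetical, and rounding the window length to the nearest integer is an assumption, because the text does not say how fractional lengths are handled.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """Hypothetical container for one tblastn hit to a query ORF."""
    evalue: float
    identity: float  # fraction of identical amino acids, between 0 and 1


def keep_hit(hit, max_evalue=1.0, min_identity=0.35):
    """Retain hits with E <= 1 and at least 35% amino acid identity."""
    return hit.evalue <= max_evalue and hit.identity >= min_identity


def flank_window(total_is_length, coding_length, factor=1.1):
    """nL = nR = 1.1 x (total IS length - length of IS coding region).

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(factor * (total_is_length - coding_length))


# Scoring scheme used for inverted and direct repeat alignments.
SCORES = {"match": 1, "mismatch": -2, "gap": -5}

if __name__ == "__main__":
    # Made-up reference IS: 1000 bp in total, 800 bp of coding sequence.
    print(flank_window(1000, 800))
    print(keep_hit(BlastHit(evalue=0.5, identity=0.40)))
```

For a made-up reference IS of 1000 bp with 800 bp of coding sequence, the sketch searches 220 nucleotides on either side of the coding region for inverted repeats.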

Figure 1.
Illustration of the different parameters IScan uses to identify ISs. The thin horizontal line at the bottom of the panel indicates the queried DNA sequence (genome). Thick arrows indicate matches to IS ORFs. Open bars indicate inverted repeats (middle) ...
Table 1.
Number of ISs in different families found for the analysis carried out here

Analysis of alignment scores for inverted repeat P-values

We determined various P-values (PLR, P3, PL, PR, see Results) that indicate whether the candidate inverted repeats of an IS are statistically significantly similar. We did so by aligning 10^5 randomly chosen sequence fragment pairs of length nL + mL + 1 = nR + mR + 1 from the DNA molecule in which the IS was found. Specifically, for PLR, two randomly chosen fragments are aligned against each other, for P3, two randomly chosen fragments are aligned against the left reference inverted repeat and for PL (PR), one randomly chosen fragment is aligned against the left (right) reference inverted repeat. The fraction of these alignments whose score is greater (indicating greater similarity) than the alignment score of the candidate inverted repeat corresponds to the desired P-value. For PLR we used Smith–Waterman local alignment, for P3, PL and PR we used clustalw, which implements a global dynamic programming alignment algorithm (30).
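The randomization behind PLR can be illustrated with a short Python sketch. It is a simplified stand-in rather than IScan's implementation: the function names are hypothetical, the local alignment is a plain Smith–Waterman recursion with a linear gap penalty (match 1, mismatch −2, gap −5), and random fragments are drawn uniformly and compared as-is, which are assumptions of this sketch. The default number of randomizations is kept small for speed; the analysis in the text uses 10^5.

```python
import random


def smith_waterman_score(a, b, match=1, mismatch=-2, gap=-5):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def random_fragment(molecule, length, rng):
    """Draw one fragment of the given length from a uniformly random position."""
    start = rng.randrange(len(molecule) - length + 1)
    return molecule[start:start + length]


def empirical_p_lr(candidate_score, molecule, length, n_random=1000, seed=0):
    """Fraction of random fragment pairs that score higher than the candidate."""
    rng = random.Random(seed)
    greater = 0
    for _ in range(n_random):
        x = random_fragment(molecule, length, rng)
        y = random_fragment(molecule, length, rng)
        if smith_waterman_score(x, y) > candidate_score:
            greater += 1
    return greater / n_random
```

The same counting scheme would apply to P3, PL and PR, with one side of each random alignment replaced by the reference inverted repeat and the local alignment replaced by a global one.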

Determination of Ka and Ks for pairs of ISs within the same family

To estimate synonymous and non-synonymous divergence among IS coding regions, we used a previously published tool (31). Briefly, the tool uses information from both the DNA and amino acid sequences, and proceeds in three steps. First, it pre-screens related gene pairs using BLASTP (32) and the Needleman and Wunsch dynamic programming alignment algorithm [Thompson et al. (30)]. Then, it eliminates gene pairs with fewer than 50 alignable amino acid residues and with <50% amino acid identity from further analysis. In the third step, the tool calculates the number of substitutions per synonymous site (Ks) and the number of substitutions per non-synonymous site (Ka) using the maximum likelihood models of Muse and Gaut (33) and Goldman and Yang (34) for the remaining pairs. It uses a simple heuristic test (35) to determine whether a gene pair has been saturated with synonymous substitutions.

For ISs with overlapping ORFs, we merged, for reasons of computational tractability, the overlapping ORFs into one ORF for the calculation of Ka and Ks. (The short overlapping regions are subject to different evolutionary constraints than the non-overlapping regions). Specifically, we calculated the number of nucleotides that overlap in the two ORFs, and eliminated from a sequence containing both ORFs the segment containing the overlap and any additional nucleotides upstream or downstream of the overlapping segment required to retain the reading frames of the two ORFs. On average, IS ORFs were shortened by four nucleotides through this procedure.


IScan, a tool to identify ISs

IScan identifies transposase sequences, inverted repeats and candidate target direct repeats of ISs in complete genomes. IScan is a free open source package developed on a Linux platform and implemented in perl. It is available from the website: http://www.bioc.uzh.ch/wagner/publications-software.html. IScan uses a curated reference or query IS (which, in our case, is a representative member of a major IS family; Table 1) to identify other ISs in one or more completely sequenced genomes, or any other DNA molecules. This query sequence contains (i) the amino acid sequences encoded by one or more transposase ORFs, and (ii) the nucleotide sequence of the upstream (IRL) and downstream inverted repeat (IRR). We note that ISs with two or more transposase ORFs frequently express a single functional transposase through a translational frameshifting mechanism. The extent and length of required sequence similarity to the reference IS are user-specifiable, such that arbitrarily weakly similar ISs or short IS fragments can be identified if needed. IScan identifies ISs in three major steps.

  1. Identification of transposase ORFs. IScan identifies the ORFs of an IS through a tblastn search [using WUBLAST, (32), http://blast.wustl.edu/] which matches the query amino acid sequence(s) to the translation products of the genomic sequence in all six possible reading frames. For ISs consisting of more than one ORF, hits to different ORFs of the query are identified as belonging to the same IS if the ORFs fall within a user-specified distance of each other (Figure 1).
  2. Identification of (candidate) inverted repeats. IScan applies a user-specifiable alignment algorithm, such as the Miller–Myers version (36) of the Smith–Waterman local alignment algorithm (37) to (a) a window of DNA comprising nL nucleotides upstream of the upstream-most ORF's start, and mL nucleotides downstream of the upstream-most ORF's start (thus comprising a total of nL+mL + 1 nucleotides; Figure 1), and (b) the reverse complement of a window of DNA mR nucleotides upstream of the downstream-most ORF's end, and nR nucleotides downstream of the downstream-most ORF's end. If the IS has only one ORF, the upstream-most and downstream-most ORFs are the same ORF. The local alignment of the upstream and downstream windows is used to identify the candidate inverted repeats of this IS. The parameters nL, mL, nR and mR are user-specifiable parameters, as are the match, mismatch and gap penalties used by the alignment algorithm.
  3. Identification of (candidate) direct repeats. Many ISs generate short direct repeats upon transposition into a target site. IScan first identifies a window of kL nucleotides upstream of the upstream inverted repeat, and a window of kR nucleotides downstream of the downstream inverted repeat (as identified in step 2; Figure 1). Alignment of these two windows then yields candidate direct repeats.
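The three steps above can be sketched in Python as follows. This is a hedged illustration, not IScan's perl code: OrfHit and all function names are hypothetical, coordinates are assumed to be 0-based and inclusive, "distance" between ORF hits is assumed to mean the gap between the end of one hit and the start of the next, and strand handling is omitted.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


@dataclass
class OrfHit:
    start: int  # 0-based position of the first nucleotide of the hit
    end: int    # 0-based position of the last nucleotide of the hit


def group_hits(hits, h):
    """Step 1: hits lying within h nucleotides of each other form one candidate IS."""
    groups = []
    for hit in sorted(hits, key=lambda x: x.start):
        if groups and hit.start - max(g.end for g in groups[-1]) <= h:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


def inverted_repeat_windows(genome, group, nL, mL, nR, mR):
    """Step 2: upstream window and reverse-complemented downstream window.

    Each window spans n + m + 1 nucleotides unless it is cut off at the
    end of the molecule.
    """
    first = min(hit.start for hit in group)
    last = max(hit.end for hit in group)
    left = genome[max(0, first - nL):first + mL + 1]
    right = genome[max(0, last - mR):last + nR + 1]
    return left, reverse_complement(right)


def direct_repeat_windows(genome, irl_start, irr_end, kL, kR):
    """Step 3: kL nucleotides upstream of IRL and kR downstream of IRR."""
    upstream = genome[max(0, irl_start - kL):irl_start]
    downstream = genome[irr_end + 1:irr_end + 1 + kR]
    return upstream, downstream
```

In a full pipeline, the two windows from step 2 would be passed to a local alignment routine to locate the candidate inverted repeats, and the two windows from step 3 would be aligned to find candidate direct repeats.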

A patchy distribution of ISs among bacterial genomes

We applied IScan to the complete DNA sequences (chromosomes and plasmids) of 438 bacterial genomes, to identify all candidate ISs in the 20 major IS families whose ORFs had at least 35% amino acid sequence identity to a family prototype sequence (Table 1) over the length of the prototype sequence. This approach yielded 2091 ISs. 95% (1987) of them occurred on bacterial chromosomes, and the remainder occurred on plasmids. The length distribution of the ISs we identified is shown in Figure 2a. In the literature, the lower end of the typical length range for all IS families considered is 540 bp (1,11). Only 1.2% of the ISs we identified were shorter than 540 bp. The conspicuous peaks in the length distribution come from individual highly abundant ISs, such as IS1, with a highly abundant member of length 695 bp that causes the highest peak in Figure 2a.

Figure 2.
(a) Length distribution of ISs with more than 35% amino acid sequence identity relative to a curated reference sequence. (b) Distribution of the number of IS copies per genome for four abundant IS families studied here. Note the logarithmic scale on the ...

Among all ISs we identified, the most abundant IS families are IS1, IS3, IS481 and IS5 (Table 1). The distribution of any one IS family is extremely patchy and highly skewed: The vast majority of genomes contain no member of the family; most genomes that contain one member of the family contain only one member; and typically only very few genomes contain a large number of members. We illustrate this distribution in Figure 2b (note the logarithmic scale on the y-axis). The numbers of IS copies within a genome show no strong statistical associations among different IS families. Specifically, the copy numbers of only 10 among 136 possible IS family pairs show a statistically significant (Bonferroni-corrected P = 0.05) Spearman rank order correlation coefficient. All but three of these statistically significant associations vanish, however, if one eliminates genomes from the analysis in which one or both ISs have no copies. The remaining significant associations involve IS1 and IS3 (Spearman's r = 0.93; n = 16), IS110 and IS4 (r = 0.87; n = 13), as well as IS110 and IS3 (Spearman's r = 0.83; n = 15). However, the genomes that account for these correlations are from the extremely closely related species of Escherichia coli and Shigella. This indicates that a common evolutionary history rather than similar host preferences is responsible for the co-occurrence of these ISs. It also illustrates that the shared evolutionary history of many sequenced genomes introduces a bias into the data that has to be taken into account when testing certain hypotheses about IS evolution.
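The copy-number comparison described above can be sketched in Python. The sketch assumes a standard Spearman coefficient (Pearson correlation of average ranks) and a plain Bonferroni division of the significance level by the number of family pairs; all function names are hypothetical and the significance test for the coefficient itself is left out.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank order correlation of two copy-number vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def drop_empty(x, y):
    """Keep only genomes in which both IS families have at least one copy."""
    pairs = [(a, b) for a, b in zip(x, y) if a > 0 and b > 0]
    return [a for a, _ in pairs], [b for _, b in pairs]


# Per-test significance level after Bonferroni correction over 136 pairs.
ALPHA_PER_PAIR = 0.05 / 136
```

Applying drop_empty before spearman mirrors the second analysis in the text, in which genomes lacking one or both ISs are excluded.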

High similarity of ISs within a genome

We next examined the sequence divergence of ISs within and among genomes. Beyond simple nucleotide divergence (Figure 3a), we determined Ka, the fraction of amino acid replacement substitutions at amino acid replacement sites, and Ks, the fraction of synonymous substitutions at synonymous sites, an indicator of the synonymous divergence within an IS's transposase-coding genes. Synonymous sites are generally under weaker selection than amino acid replacement sites. In addition, because of the low expression level of transposases, the synonymous sites we study are not subject to selection for translation efficiency. These two observations render Ks a better (albeit crude) indicator of the IS's age than Ka (38). To estimate Ka and Ks, we used GenomeHistory, a software tool that estimates Ka and Ks using a maximum likelihood method (31).

Figure 3.
Within-genome distribution of (a) overall nucleotide divergence (in fraction of pairwise nucleotide differences), (b) non-synonymous divergence per non-synonymous site Ka (38), (c) synonymous divergence at synonymous sites Ks (38), of IS pairs in the ...

We first focused on the sequence divergence of ISs within a family and within genomes. It is very low. Figure 3b shows a histogram of the distribution of amino acid sequence divergence Ka for pairs of ISs within the same genome. Note the logarithmic scale on the y-axis, indicating a very large number of sequences at low divergence. The mean (median) Ka is 0.012 (0.0019), and its 90th percentile is Ka = 0.0067, even though our approach would readily detect ISs with amino acid sequence divergence up to Ka greater than 1. Of the sequence pairs within a genome, 33.3% are completely identical in their amino acid sequence. Synonymous divergence Ks is similarly low (Figure 3c). Excluding IS pairs with saturated synonymous divergence (1.01% of all pairs), the mean synonymous divergence is Ks = 0.033 with its 90th percentile at Ks = 0.013. More than 60% of all IS pairs within a genome are identical at their synonymous sites, such that the median Ks = 0.

Figure 4 shows the mean and SEs of Ks and Ka separately for each IS family (see also Table 2). While amino acid divergence is relatively homogeneous among families, synonymous divergence Ks varies to a greater extent, and it is particularly high for IS5. Most variable is the ratio Ka/Ks, which is normally taken as an indicator of selective constraint on a protein. The mean ratio is Ka/Ks = 0.39 and it varies between Ka/Ks = 0.13 (IS66) and Ka/Ks = 0.86 (IS4). A high ratio Ka/Ks might be taken to indicate the presence of many inactive ISs whose coding regions are pseudogenes and thus evolve effectively neutrally (Ka = Ks). However, this interpretation would be misleading, because ISs with high Ka/Ks generally show extremely low intragenomic synonymous and non-synonymous divergence. For example, for IS4, where the mean Ka/Ks = 0.86, the mean values of Ka and Ks are an extremely low Ka = 6.3 × 10^−3 and Ks = 7 × 10^−3. Similarly low divergences also hold for other ISs with high Ka/Ks ratios (Table 2). Such small values mean that virtually all IS pairs of a family within a genome differ by one or very few nucleotides. At such low divergence, the interpretation of the ratio Ka/Ks as an indicator of selective constraint is not appropriate, because there may be a large amount of stochastic variation in the number of synonymous and non-synonymous nucleotide changes. In this regard, it is also worth mentioning that IS5, which is an extreme outlier with its high mean synonymous divergence of Ks = 0.58, shows a Ka/Ks = 0.22, close to the lower end of the observed range across families.

Table 2.
Mean and SEs for Ka, Ks and Ka/Ks within genomes for those IS families where there exist genomes that contain more than one IS of the same family
Figure 4.
Means and SEs of (a) within-genome Ka and (b) within-genome Ks for those IS families where more than one family member occurred in at least one genome.

Not unexpectedly, the sequence divergence among pairs of ISs in different genomes is substantially greater. At a mean Ks = 0.3, synonymous divergence of ISs among genomes is almost 10 times greater than within genomes. Ten times more IS pairs in different genomes have saturated synonymous divergence (10.5% as opposed to 1.01% within genomes). The mean non-synonymous divergence Ka = 0.064 (90th percentile Ka = 0.29) is a factor of five higher among genomes than within genomes. It is nonetheless very low compared to the maximum Ka = 1 our approach could have revealed. This suggests that the IS families we study are well defined on the sequence level.

The presence of closely related ISs of the same family in different genomes indicates the importance of horizontal gene transfer in their spreading. For instance, we find a large number (1108) of ISs with identical transposase coding regions in different organisms. Many of these pairs stem from species closely related in evolutionary history or life style, such as ISs in the closely associated genera Escherichia/Shigella/Salmonella and Staphylococcus/Enterococcus. However, some of these pairs involve more distantly related organisms, such as the psychrophilic (cold-loving) arctic bacterium Desulfotalea psychrophila on one hand, and the human E. coli and Haemophilus ducreyi on the other hand, which all share identical IS1 elements.

Inverted repeats are flexible in sequence but provide a signal to enrich for functional ISs

Very few among the many known ISs have been subject to experimental tests for their ability to transpose. For evolutionary studies, it is useful to distinguish such functional from non-functional ISs. Given the flood of new ISs that bacterial genome sequencing is producing, time-consuming experimental approaches are not suitable to make this distinction. We thus suggest a computational strategy that may enrich for non-truncated and likely functional ISs.

In principle, two strategies are conceivable to identify putatively functional ISs computationally. The first takes advantage of coding sequence similarity of a candidate IS to a reference IS. The limitation of this strategy is that functional transposases may be very divergent from any one reference sequence. An alternative strategy focuses on the second major sequence feature of ISs, their inverted repeats. Inverted repeats with significant sequence similarity may indicate functionality of an IS, or at least that an IS has not been truncated.

We used several different approaches to estimate the ‘quality’ of an IS's inverted repeats, as indicated by their similarity to each other and to a reference sequence. (These approaches are also implemented in IScan and are available to IScan users.) The first approach involves a local dynamic programming alignment (39) of the sequences immediately upstream and downstream of the coding sequences that contain the inverted repeats. We compared the score of this alignment with that of a large number (10^5) of alignments of sequence fragment pairs of the same length but chosen from random positions within the same DNA molecule. This allows us to estimate the probability PLR (‘left-right’, Figure 5) of observing an alignment score as high as that observed between putative IRL and IRR by chance alone. Figure 6a shows a histogram of the distribution of these P-values together with an indication of the significance threshold P = 0.05, as well as the significance threshold P = 0.05/2091 = 2.3 × 10^−5. This lower value corresponds to a Bonferroni-corrected P = 0.05. It takes into account that we carry out multiple independent tests, but it is excessively conservative. Although a large fraction (55%) of the P-values we determined are significant at P = 0.05, not one of them is smaller than P = 2.3 × 10^−5.

Figure 5.
Illustration of the different alignment strategies pursued to assess statistical significance of inverted repeat alignments. The candidate IS is the sequence match produced by IScan to a reference IS from a given family. See text for details.
Figure 6.
Distribution of five test statistics (see text) indicating inverted repeat quality for all ISs studied here. (a) PLR; (b) P3; (c) PL × PR. The value 0.000023 is the Bonferroni-corrected P-value of 0.05.

In the second approach, we aligned the left inverted repeat of the ‘reference’ sequence we used to identify ISs of a given family with both the left and right (candidate) inverted repeats of the ISs we identified (Figure 5). For each IS, we evaluated the statistical significance P3 (‘3-way’) of each alignment score by the same randomization approach as above, using a large number (10^5) of random DNA fragment pairs from the same molecule, and aligning them to the left inverted repeat of the reference sequence. In this analysis, 79.4% (1660) of ISs showed P3 < 0.05, and two showed P3 < 2.3 × 10^−5 (Figure 6b).

In a third analysis, we aligned the putative IRL of each IS candidate with the IRL of the reference IS that we used to identify the IS in the first place. We then carried out the same randomization approach as above to identify the likelihood PL (‘left’) to observe such an alignment by chance alone. We repeated this approach for the putative IRR to obtain PR (‘right’). To obtain a joint significance score for both IRL and IRR of an IS, we simply calculated the product PL × PR. Figure 6c shows a histogram of (PL × PR). In this analysis, 90.3% (1890) of PL × PR values are smaller than 0.05, and 55.1% (1152) of the values are smaller than the Bonferroni-corrected P = 2.3 × 10^−5. Thus, this alignment strategy significantly enriches for ISs with highly similar inverted repeat units. To be sure, this does not demonstrate that ISs with highly similar repeat units are more likely to be functional. The following analysis suggests, however, that this is the case.
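The thresholding used in these three analyses amounts to a few lines of arithmetic, sketched here in Python. The function names are hypothetical, and the example P-values in the usage section are made up for illustration only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def joint_score(p_left, p_right):
    """Joint score for both inverted repeats: the product PL x PR."""
    return p_left * p_right


def fraction_below(scores, threshold):
    """Fraction of scores that fall strictly below the threshold."""
    return sum(s < threshold for s in scores) / len(scores)


if __name__ == "__main__":
    threshold = bonferroni_threshold(0.05, 2091)
    # Made-up (PL, PR) pairs for four hypothetical ISs.
    pairs = [(0.001, 0.002), (0.2, 0.1), (0.00001, 0.5), (0.9, 0.8)]
    scores = [joint_score(pl, pr) for pl, pr in pairs]
    print(threshold, fraction_below(scores, threshold))
```

One consequence of estimating each P-value from 10^5 randomizations is that a single P-value can only take the values 0, 10^−5, 2 × 10^−5 and so on. A single P-value therefore falls below the corrected threshold only if at most two random alignments outscore the candidate, whereas the product of two such P-values reaches much smaller values more easily.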

ISs that have a family member with identical DNA sequence in the same genome are more likely to be functional than ISs for which this does not hold. The reason is that the two identical family members have most likely arisen through transposition, because gene duplication, the other prominent process that could account for the two IS copies, is much less frequent than transposition (29). In addition, bacterial IS transposase activity is usually tightly regulated, and often restricted to the IS from which transposase is expressed, such that passive transposition of a defective IS with the aid of an intact ‘helper’ IS may not occur often (1,7). This means that many ISs in our dataset with an identical IS in the same genome will be functional. If the alignment strategy we pursued above enriched for functional ISs, we would predict that the PL × PR values would be significantly lower for ISs with an identical family member in the same genome, than for other ISs. This is exactly what we observe. For example, for ISs with an identical IS in the same genome, the mean (PL × PR) = 0.019, whereas for other ISs, the mean (PL × PR) = 0.046. This difference is highly statistically significant as assessed by either a Mann–Whitney U-test (n1 = 1478; n2 = 613; P = 1.2 × 10^−15) or a t-test (P < 10^−17).
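A comparison of this kind can be sketched with a self-contained Mann–Whitney U-test in Python. The sketch assumes the usual normal approximation with a tie correction and no continuity correction, which may differ from the procedure used for the values reported above; the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided P from the normal approximation).

    Raises ZeroDivisionError if all pooled values are identical.
    """
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    sd = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    z = (u1 - n1 * n2 / 2) / sd
    return u1, math.erfc(abs(z) / math.sqrt(2))
```

Here x would hold the PL × PR values of ISs with an identical family member in the same genome, and y the values of all other ISs.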

In sum, although the inverted repeat units of an IS have limited sequence similarity, it is possible to enrich a data set for likely functional ISs by considering ISs with low PL × PR values. We note parenthetically that we also carried out a sequence similarity analysis among transposition target ‘direct’ repeat units generated by those ISs that are known to generate long direct repeats. This analysis showed that the direct target repeat units have too limited sequence similarity to be useful for this or other purposes (data not shown).


We have developed IScan, a publicly available tool to identify IS coding regions, and associated sequence elements (direct/inverted repeats). IScan is able to identify ISs with an arbitrary number of ORFs, including ISs with ORFs encoded on both strands. IS annotation in existing genomes may be highly heterogeneous, because different researchers may use different annotation methods. A tool like IScan thus allows the user to create consistent IS annotation with multiple user-specified parameters (repeat length, sequence similarity to a reference family member, etc.) across multiple genomes. This consistency and flexibility is essential for detailed analyses of IS evolution across multiple genomes.

Using IScan, we have surveyed 438 bacterial genomes for members of 20 IS families, and studied the similarity in their coding sequences as well as their inverted repeats. Recent other surveys of IS families focused on different aspects of IS biology and studied fewer or different genomes. Specifically, an intriguing analysis (40) studied ISs in 262 genomes and focused on the question: What determines IS copy numbers in a genome? (Briefly, the answer is genome size.) A short review by Siguier and collaborators (2) surveys different IS families and examples of IS evolution based on individual case studies. Yet another recent analysis focuses on archaeal ISs (8). In contrast to these papers, our work focuses on the sequence divergence of coding regions and inverted repeats in the largest number of genomes analyzed to date. Earlier work on the molecular evolution of ISs dates to the pre-genome era, restricted itself to narrow categories of ISs, or relied on previously available (and heterogeneous) genome sequence annotation for a smaller number of IS families (25,29). Our current analysis overcomes these limitations by analyzing members of all 20 IS families in an unprecedented (>400) number of genomes.

We find that the IS families we analyzed are well defined at the transposase sequence level. Specifically, although our approach would have admitted ISs with as little as 35% amino acid sequence identity to curated reference sequences, the mean amino acid divergence is lower than 7%, and more than 90% of all ISs have more than 70% amino acid identity to the reference. The different IS families show a skewed and patchy distribution, where most genomes carry no members of any given IS family, and a very small number of genomes carry many members [Figure 2b, see also (40)]. This distribution resembles a similarly skewed distribution of IS occurrence at a much finer taxonomic scale, namely among 71 E. coli strains (25), where many strains carry few or no IS copies. At least two explanations might account for this skewed distribution. One involves selection against genomes with high IS copy numbers (see also below); another is an unidentified transposition immunity mechanism that suppresses IS copy number. These causes are not mutually exclusive and might operate jointly.
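The identity thresholds discussed above can be illustrated with a minimal Python sketch. It assumes that pairwise alignments to the reference transposase are already available as gapped strings, and that identity is computed over ungapped alignment columns only. Both are assumptions made for illustration, since the exact definition used by IScan is not restated here. The function names `percent_identity` and `fraction_above` are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: the two strings are rows of one pairwise alignment and
    therefore have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def fraction_above(identities, threshold):
    """Fraction of identity values strictly greater than the threshold."""
    identities = list(identities)
    if not identities:
        return 0.0
    return sum(1 for value in identities if value > threshold) / len(identities)


if __name__ == "__main__":
    # Toy values, not data from the survey.
    identities = [36.0, 72.5, 95.0, 99.1]
    admitted = [v for v in identities if v >= 35.0]  # permissive admission cutoff
    print(len(admitted), fraction_above(admitted, 70.0))
```

A summary of this kind makes it easy to see whether a permissive admission cutoff (here 35%) is actually approached by the hits, or whether, as reported above, most hits lie far above it.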

A strong or highly significant statistical association between the per-genome copy numbers of different IS families might indicate that some bacteria are more susceptible to ‘infection’ by ISs in general. However, we do not find strong evidence of such an association for any pair of ISs beyond what would be expected from the shared evolutionary history of many bacterial species. Conversely, some ISs might be more successful than others, in that they can more easily ‘infect’ a larger number of genomes. Different ISs clearly show very different abundances. However, a thorough recent analysis suggests that among many possible factors determining IS abundance, genome size has by far the most important influence (40).
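One simple way to quantify an association between the copy numbers of two IS families across genomes is a rank correlation. The sketch below computes Spearman's coefficient in plain Python. It is an illustration only: the choice of Spearman's coefficient is an assumption, not necessarily the statistic used in this survey, and the sketch does not correct for the shared evolutionary history of the genomes, which the text above identifies as a confounder.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values receiving their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def spearman(copies_a, copies_b):
    """Spearman rank correlation of two per-genome copy number lists."""
    if len(copies_a) != len(copies_b) or len(copies_a) < 2:
        raise ValueError("need two lists of equal length with at least 2 genomes")
    return pearson(average_ranks(copies_a), average_ranks(copies_b))


if __name__ == "__main__":
    # Toy copy numbers for two hypothetical IS families in six genomes.
    family_a = [0, 0, 1, 3, 0, 12]
    family_b = [0, 2, 0, 1, 0, 7]
    print(round(spearman(family_a, family_b), 3))
```

Because most genomes carry zero copies of any given family, ties are frequent, which is why the ranking step averages tied ranks.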

One biologically significant finding of our survey is the extreme sequence homogeneity of ISs within genomes. This substantially extends earlier, more limited work (25,29) and demonstrates a consistent pattern across IS families and vast taxonomic scales. This high sequence homogeneity stands in stark contrast to the greater sequence diversity among duplicate genes, another prominent class of repetitive DNA (29). Our earlier work suggests that gene conversion is unlikely to be the sole cause of this high homogeneity, because common signatures of gene conversion are absent within IS families (29). Instead, this high homogeneity is readily explained by the rapid spreading of ISs within a genome. Consistent with this hypothesis is the observation that transposition and excision rates are very high on the time scale at which DNA sequences change. The high sequence homogeneity of ISs might be explained by the following evolutionary scenario. After an IS enters a genome, its copy number expands rapidly through transposition (hence the low sequence diversity). Eventually, the IS becomes extinct again from the lineage, mostly due to natural selection, but perhaps aided by excision events. Some time thereafter, it may become reintroduced through horizontal gene transfer. Several other scenarios are not consistent with the data (29): If ISs did not go periodically extinct, they would show higher divergence within a genome; if they were not reintroduced by horizontal transfer, bacterial genomes would be devoid of them; and if the net effect of natural selection were an increase in IS copy number, then ISs should be much more diverse within a genome, because they would remain part of the genome indefinitely.

The only requirement for this evolutionary scenario is the frequent horizontal transfer of ISs. This is not a problematic requirement, as horizontal gene transfer may account for more than 10% of a bacterial genome's gene content (14,41). Its likely importance has been noted in an earlier, sequence-limited study on IS evolution in enteric bacteria (24). In addition, the mere fact that highly similar ISs of the same family occur in multiple distantly related genomes speaks to the importance of horizontal gene transfer for IS maintenance.

Among the methodological questions that a tool like IScan can address is whether rapid, computational, and automatic identification of functional or non-truncated ISs is possible. Aside from transposase-coding regions, ISs are typically associated with two sequence elements (direct target repeats and inverted repeats) that might be usable in such an identification. Direct target repeats, however, are of limited use for this purpose. High similarity of direct repeat units might indicate whether the transposition event that produced them occurred recently or a long time ago. However, this criterion only tells us whether the IS in question transposed or inserted successfully, not that it could again do so. In addition, target repeats for members of one IS family are very short, rendering their unambiguous identification difficult. Furthermore, their sequence similarity among different IS family members is highly limited (data not shown).

The second class of potential diagnostic sequence features is inverted repeats, except in the IS families ISCR and IS91, which do not harbor such repeats. Even though a truncated IS may be only slightly shorter than its intact counterparts, one would expect that truncation is often associated with deletion of one inverted repeat unit. Also, it is reasonable to expect that the inverted repeat units of a functional IS show greater sequence similarity to each other than DNA fragments of the same length randomly sampled from the same genome. We implemented a statistical test in IScan that relies on a large number n of such random fragments to ask whether IS inverted repeats show such significant similarity. We applied this test (n = 10^5) to all the ISs we identified in the 438 genomes. When sequence similarity of inverted repeats is assessed through comparison with a reference IS (Figure 6c), 90.3% of ISs have inverted repeats with significant similarity at P = 0.05. Of these, one would expect a fraction of 0.05 to be false positives, so that the expected fraction of ISs with significantly similar inverted repeats is 0.95 × 90.3% = 85.8% (1794 ISs). The more stringent Bonferroni-corrected threshold P = 2.3 × 10^-5 would yield 55.1% (1152) ISs with significantly similar inverted repeats. The inverted repeat P-values are significantly greater for ISs with an identical member of the same IS family in the same genome, many of which would be functional. Taken together, these observations suggest that, on one hand, inverted repeats may be highly flexible and cannot always be unambiguously identified. On the other hand, identification of ISs with highly similar inverted repeats may allow enrichment of a dataset with functional ISs, which may facilitate subsequent analyses. The high sequence similarity of ISs within genomes thus has not only biological implications. It also aids in defining a heuristic criterion, perhaps the only one, to identify functional ISs based on sequence data alone.
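The logic of a random-fragment test of this kind can be sketched in Python as follows. This is not IScan's implementation. As simplifying assumptions, similarity is measured as ungapped identity between the left repeat and the reverse complement of the right repeat (an alignment score could be substituted), random fragments are drawn uniformly from a single strand of the genome, and the P-value uses the add-one Monte Carlo estimator (hits + 1)/(n + 1). All function names are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def empirical_p_value(left_repeat, right_repeat, genome, n=1000, rng=None):
    """Monte Carlo P-value for the similarity of two inverted repeat units.

    Compares the observed similarity with that of n pairs of random genome
    fragments of the same length as the repeat units.
    """
    rng = rng or random.Random()
    length = len(left_repeat)
    if length == 0 or len(right_repeat) != length:
        raise ValueError("repeat units must be non-empty and of equal length")
    if len(genome) < length:
        raise ValueError("genome is shorter than the repeat units")
    observed = identity(left_repeat, reverse_complement(right_repeat))
    max_start = len(genome) - length
    hits = 0
    for _ in range(n):
        i = rng.randint(0, max_start)
        j = rng.randint(0, max_start)
        frag_a = genome[i:i + length]
        frag_b = genome[j:j + length]
        # Count random pairs that are at least as similar as the observed pair.
        if identity(frag_a, reverse_complement(frag_b)) >= observed:
            hits += 1
    return (hits + 1) / (n + 1)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    rng = random.Random(1)
    genome = "".join(rng.choice("ACGT") for _ in range(5000))  # toy genome
    left = "ACGTTGCAAGGCTTAACCGGTTAGC"
    right = reverse_complement(left)  # a perfect inverted repeat
    print(empirical_p_value(left, right, genome, n=2000, rng=random.Random(0)))
```

Under this estimator the smallest attainable P-value is 1/(n + 1), so a Bonferroni-corrected threshold on the order of 10^-5 can only be reached if n is on the order of 10^5 or larger.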

In closing, we note that the applications we illustrated here are only two among many uses that IScan might find. These uses will only increase as more genomes become available, and will include mapping of horizontal transfer histories, as well as transposition sequences within a genome.


Support through SNF grant 315200-116814 and through the Santa Fe Institute is gratefully acknowledged. Funding to pay the Open Access publication charges for this article was provided by The University of Zurich.

Conflict of interest statement. None declared.


1. Mahillon J, Chandler M. Insertion sequences. Microbiol. Mol. Biol. Rev. 1998;62:725–774.
2. Siguier P, Filee J, Chandler M. Insertion sequences in prokaryotic genomes. Curr. Opin. Microbiol. 2006;9:526–531.
3. Toleman MA, Bennett PM, Walsh TR. ISCR elements: Novel gene-capturing systems of the 21st century? Microbiol. Mol. Biol. Rev. 2006;70:296–316.
4. Lohe AR, Moriyama EN, Lidholm DA, Hartl DL. Horizontal transmission, vertical inactivation, and stochastic loss of mariner-like transposable elements. Mol. Biol. Evol. 1995;12:62–72.
5. Capy P, Langin T, Bigot Y, Brunet F, Daboussi MJ, Periquet G, David JR, Hartl DL. Horizontal transmission versus ancient origin - mariner in the witness box. Genetica. 1994;93:161–170.
6. Lampe DJ, Witherspoon DJ, Soto-Adames FN, Robertson HM. Recent horizontal transfer of mellifera subfamily Mariner transposons into insect lineages representing four different orders shows that selection acts only during horizontal transfer. Mol. Biol. Evol. 2003;20:554–562.
7. Nagy Z, Chandler M. Regulation of transposition in bacteria. Res. Microbiol. 2004;155:387–398.
8. Filee J, Siguier P, Chandler M. Insertion sequence diversity in Archaea. Microbiol. Mol. Biol. Rev. 2007;71:121–157.
9. Brugger K, Redder P, She QX, Confalonieri F, Zivanovic Y, Garrett RA. Mobile elements in archaeal genomes. FEMS Microbiol. Lett. 2002;206:131–141.
10. Buisine N, Tang CM, Chalmers R. Transposon-like Correia elements: structure, distribution and genetic exchange between pathogenic Neisseria sp. FEBS Lett. 2002;522:52–58.
11. Siguier P, Perochon J, Lestrade L, Mahillon J, Chandler M. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. (Database Issue) 2006;34:D34–D36.
12. Orgel LE, Crick FHC. Selfish DNA: the ultimate parasite. Nature. 1980;284:604–607.
13. Doolittle WF, Sapienza C. Selfish genes, the phenotype paradigm, and genome evolution. Nature. 1980;284:601–603.
14. Bushman F. Lateral DNA Transfer: Mechanisms and Consequences. Cold Spring Harbor, NY, USA: Cold Spring Harbor Laboratory Press; 2002.
15. Condit R, Stewart F, Levin B. The population biology of bacterial transposons - a priori conditions for maintenance as parasitic DNA. Am. Naturalist. 1988;132:129–147.
16. Cooper VS, Schneider M, Blot M, Lenski RE. Mechanisms causing rapid and parallel losses of ribose catabolism in evolving populations of Escherichia coli. J. Bacteriol. 2001;183:2834–2841.
17. Schneider D, Lenski RE. Dynamics of insertion sequence elements during experimental evolution of bacteria. Res. Microbiol. 2004;155:319–327.
18. Schneider D, Duperchy E, Coursange E, Lenski RE, Blot M. Long-term experimental evolution in Escherichia coli. IX. Characterization of insertion sequence-mediated mutations and rearrangements. Genetics. 2000;156:477–488.
19. Treves DS, Manning S, Adams J. Repeated evolution of an acetate-crossfeeding polymorphism in long-term populations of Escherichia coli. Mol. Biol. Evol. 1998;15:789–797.
20. Naas T, Blot M, Fitch WM, Arber W. Insertion sequence-related genetic variation in resting Escherichia coli K-12. Genetics. 1994;136:721–730.
21. Dunham MJ, Badrane H, Ferea T, Adams J, Brown PO, Rosenzweig F, Botstein D. Characteristic genome rearrangements in experimental evolution of Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA. 2002;99:16144–16149.
22. Kleckner N. In: Mobile DNA. Berg D, Howe M, editors. Washington, DC: American Society for Microbiology Press; 1989. pp. 211–226.
23. Shen MM, Raleigh EA, Kleckner N. Physical analysis of Tn10 and IS10-promoted transpositions and rearrangements. Genetics. 1987;116:359–369.
24. Lawrence JG, Ochman H, Hartl DL. The evolution of insertion sequences within enteric bacteria. Genetics. 1992;131:9–20.
25. Sawyer SA, Dykhuizen DE, DuBose RF, Green L, Mutangadura-Mhlanga T, Wolczyk DF, Hartl DL. Distribution and abundance of insertion sequences among natural isolates of Escherichia coli. Genetics. 1987;115:51–63.
26. Hall BG, Parker LL, Betts PW, DuBose RF, Sawyer SA, Hartl DL. IS103, a new insertion element in Escherichia coli: Characterization and distribution in natural populations. Genetics. 1989;121:423–431.
27. Bisercic M, Ochman H. Natural populations of Escherichia coli and Salmonella typhimurium harbor the same classes of insertion sequences. Genetics. 1993;133:449–454.
28. Ajioka J, Hartl D. In: Mobile DNA. Berg D, Howe M, editors. Washington, DC: American Society for Microbiology Press; 1989. pp. 185–210.
29. Wagner A. Periodic extinctions of transposable elements in bacterial lineages: evidence from intragenomic variation in multiple genomes. Mol. Biol. Evol. 2006;23:723–733.
30. Thompson JD, Higgins DG, Gibson TJ. Clustal-W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting; position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680.
31. Conant GC, Wagner A. GenomeHistory: a software tool and its applications to fully sequenced genomes. Nucleic Acids Res. 2002;30:1–10.
32. Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ. Gapped Blast and Psi-Blast: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
33. Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 1994;11:715–724.
34. Goldman N, Yang ZH. Codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 1994;11:725–736.
35. Conant GC, Wagner A. Asymmetric sequence divergence of duplicate genes. Genome Res. 2003;13:2052–2058.
36. Myers E, Miller W. Optimal alignments in linear space. Comput. Appl. Biosci. 1988;4:11–17.
37. Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197.
38. Li W-H. Molecular Evolution. Sunderland, MA: Sinauer; 1997. Chapter 7.
39. Smith TF, Waterman MS, Fitch WM. Comparative biosequence metrics. J. Mol. Evol. 1981;18:38–46.
40. Touchon M, Rocha EPC. Causes of insertion sequences abundance in prokaryotic genomes. Mol. Biol. Evol. 2007;24:969–981.
41. Ochman H, Lawrence J, Groisman E. Lateral gene transfer and the nature of bacterial innovation. Nature. 2000;405:299–304.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press