Proc Natl Acad Sci U S A. May 15, 2007; 104(20): 8472–8477.
Published online May 7, 2007. doi:  10.1073/pnas.0702412104
PMCID: PMC1895974
Microbiology

Simple sequence repeats in prokaryotic genomes

Abstract

Simple sequence repeats (SSRs) in DNA sequences are composed of tandem iterations of short oligonucleotides and may have functional and/or structural properties that distinguish them from general DNA sequences. They are variable in length because of slip-strand mutations and may also affect local structure of the DNA molecule or the encoded proteins. Long SSRs (LSSRs) are common in eukaryotes but rare in most prokaryotes. In pathogens, SSRs can enhance antigenic variance of the pathogen population in a strategy that counteracts the host immune response. We analyze representations of SSRs in >300 prokaryotic genomes and report significant differences among different prokaryotes as well as among different types of SSRs. LSSRs composed of short oligonucleotides (1–4 bp length, designated LSSR1–4) are often found in host-adapted pathogens with reduced genomes that are not known to readily survive in a natural environment outside the host. In contrast, LSSRs composed of longer oligonucleotides (5–11 bp length, designated LSSR5–11) are found mostly in nonpathogens and opportunistic pathogens with large genomes. Comparisons among SSRs of different lengths suggest that LSSR1–4 are likely maintained by selection. This is consistent with the established role of some LSSR1–4 in enhancing antigenic variance. By contrast, abundance of LSSR5–11 in some genomes may reflect the SSRs' general tendency to expand rather than their specific role in the organisms' physiology. Differences among genomes in terms of SSR representations and their possible interpretations are discussed.

Keywords: comparative genomics, phase variation, slip-strand mutations, tandem repeats, microsatellites

Simple sequence repeats (SSRs) in both prokaryotes and eukaryotes represent hypermutable loci subject to reversible changes in the SSR length (1–6). Some pathogens use SSRs in a strategy that counteracts the host immune response by increasing the antigenic variance of the pathogen population (4, 5, 7). In this scenario, SSRs located in protein coding regions or in upstream regulatory regions can reversibly deactivate or alter genes involved in interactions with the host (4, 8). Some SSRs may also affect local structure of the DNA molecule (9–12). Trinucleotide and hexanucleotide repeats in genes translate into amino acid runs and alternating patterns, which may play special roles in protein structure (13, 14) and are enriched in human proteins associated with genetic diseases (15). The best studied cases of SSR expansion relate to triplet repeats that can cause genetic disorders in humans. Such repeats may be located in both protein-coding and regulatory regions and can alter the structure of the encoded proteins or the DNA molecule when they expand beyond a certain length (16).

Long SSRs tend to be dramatically overrepresented (i.e., found significantly more often than expected by chance) in eukaryotic genomes (2, 17, 18). In prokaryotes, long SSRs are generally less common and may be subject to negative selection (19). Significant differences in SSR representations exist even among closely related species (20), suggesting that the SSR abundance may change relatively rapidly during evolution.

Assessments of SSR representations generally rely on stochastic models used as a null hypothesis. Previous analyses used a homogeneous Bernoulli model (19, 21). In this work, we analyze SSR representations in >300 prokaryotic genomes using more realistic stochastic models of varying complexity. Our results indicate large differences among prokaryotes in terms of SSR representations and point to possible functional differences among SSRs of different types.

Results

Comparison of Long SSR Representations Among Prokaryotic Genomes.

SSR representations in most prokaryotic genomes show few deviations from expectations based on random models except for the suppression of mononucleotide SSRs exceeding a length of 8 bp [Fig. 1 and supporting information (SI) Fig. 4], which is common among prokaryotes (19, 20). SI Table 5 displays counts Nk* of LSSRs for k between 1 and 11 bp and for 378 complete prokaryotic chromosomes (plasmids and megaplasmids are not included). Data for selected species are shown in Table 1. The largest numbers of LSSRs are found in the closely related cyanobacteria Nostoc and Anabaena, followed by Burkholderia species, Frankia, Streptomyces, Methanosarcina, Xanthomonas, and Polaromonas, all with >100 LSSRs. The LSSR counts appear unrelated to taxonomical or phylogenetic relationships beyond the level of genus. The absence of correlations with phylogeny suggests that long SSRs can spread through a genome relatively quickly during evolution. Firmicutes (except Mollicutes) are the only well represented group that always has low LSSR counts. Interestingly, long heptameric repeats (k = 7) are far more common than other types of repeats. Long tri-, hexa- and nonanucleotide repeats are often located in genes, whereas long SSRs of oligonucleotides whose lengths are not multiples of three are generally found in intergenic regions. These SSRs may cause frameshift mutations and are probably selected against in protein coding genes.

Fig. 1.
Mono- and dinucleotide SSRs in the E. coli K12 genome. The plots show the counts Nk(l) for mononucleotide (k = 1, Upper) and dinucleotide (k = 2, Lower) SSRs in the genomic DNA sequence (filled circles) and in random sequences generated by six different ...
Table 1.
Counts of LSSRs in selected genomes

Table 2 shows correlations among counts of different SSR types across different genomes. The counts of LSSR5–11 (k ≥ 5) correlate well among each other. Weaker correlations are also observed among LSSR1–3 (k ≤ 3) but not between LSSR5–11 and LSSR1–3 counts. We include tetranucleotide LSSRs in the LSSR1–4 group, although they exhibit no consistently strong correlations with either LSSR1–3 or LSSR5–11. The lack of correlations between LSSR1–4 and LSSR5–11 indicates that the two LSSR types tend to occur in different organisms. Tables 3 and 4 display lists of prokaryotes with high counts of LSSR1–4 and LSSR5–11, respectively. Both collections include species from diverse taxa. Few prokaryotic genomes contain many LSSR1–4, whereas LSSR5–11 are more common. The two groups are distinct with respect to their genome sizes, G+C content, and pathogenic lifestyle. Most genomes with seven or more LSSR1–4 (Table 3) are small in size, generally ≈2 Mb or less. By contrast, the smallest genome among the 33 with at least 60 LSSR5–11 (Table 4) is 2.5 Mb in size (Xylella fastidiosa) and most are >4 Mb. Genomes with many LSSR5–11 tend to have high G+C content (mostly >60%, Table 4) whereas genomes with LSSR1–4 generally have low G+C content (mostly <40%). Most genomes in Table 3 belong to host-adapted pathogens, including several but not all Mycoplasma species (20), which are not known to survive readily outside the host. The other three are mesophilic archaeal methanogens Methanosarcina mazei, Methanosarcina barkeri, and Methanococcoides burtonii. The two Methanosarcina genomes are unusual in possessing high counts of both LSSR1–4 and LSSR5–11. No thermophiles feature among the genomes with high LSSR counts.

Table 2.
Correlations among counts of LSSRs of different types
Table 3.
Genomes with high counts of LSSR1–4
Table 4.
Genomes with high counts of LSSR5–11

We used Fisher's exact test to evaluate statistical significance of the differences between the collections of genomes with high LSSR1–4 and LSSR5–11 counts (Tables 3 and 4, respectively) in terms of genome size (>4 Mb versus <3 Mb), G+C content (>55% versus <45%), and ability to grow outside the host. To reduce bias resulting from some genera being represented by multiple species, we counted only one species per genus in each collection. The differences were statistically significant with respect to genome size (P = 0.006) and dependence on a host (P = 0.017) but not with respect to G+C content (P = 0.11). When counting all species, the relevant probabilities are $10^{-5}$ for genome size, 0.001 for dependence on a host, and 0.0008 for G+C content. The statistical analysis is described in detail in the SI Text.
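To illustrate the kind of calculation involved, the following is a minimal sketch of a two-sided Fisher's exact test on a 2×2 contingency table, written with the Python standard library only. The function name, the two-sided convention (summing all tables no more probable than the observed one), and the example counts are assumptions made for illustration; the actual tables and the sidedness used in this work are described in the SI Text and are not reproduced here.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows could be, for example, the two genome collections and columns
    the two genome-size classes (hypothetical layout).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables that are no more likely than
    # the observed one (small tolerance for floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts, NOT taken from Tables 3 and 4:
# collection A: 5 small genomes, 0 large; collection B: 0 small, 5 large.
p_value = fisher_exact_two_sided(5, 0, 0, 5)
print(p_value)
```

Counting only one species per genus, as described above, would simply reduce the cell counts passed to the function.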

SSRs in Selected Genomes.

Lawsonia intracellularis.

This obligate intracellular parasite of domestic animals features three long mononucleotide SSRs, four long dinucleotide SSRs, and 27 long trinucleotide SSRs (Fig. 2 and SI Table 6). The N2(l) plot (dinucleotide SSRs) shows a peak around the length 15 bp, preceded by low SSR counts ≈10 bp in length. Likewise, the counts of trinucleotide SSRs decrease in agreement with the random models up to the length 15, followed by a sudden increase at length 16 (Fig. 2). These bimodal distributions resemble SSRs in some Mycoplasma species (20) and suggest that the LSSRs in L. intracellularis may be maintained by selection. By contrast, the genomes of nonpathogenic Desulfovibrio vulgaris and Desulfovibrio desulfuricans, phylogenetically close to L. intracellularis (22), contain no LSSR1–4 and few LSSR5–11 (SI Table 5 and SI Fig. 4). The mono- and dinucleotide LSSRs are exclusively intergenic, whereas trinucleotide SSRs are mostly in genes. The LSSRs in L. intracellularis are located near genes of diverse functions including many hypothetical genes (SI Table 6).

Fig. 2.
Mono- (Top), di- (Middle), and tri- (Bottom) nucleotide SSRs in L. intracellularis. See Fig. 1 legend.

Burkholderia species.

Burkholderia are found in a wide range of environmental niches and include host-adapted pathogens (B. mallei), opportunistic pathogens (B. pseudomallei) as well as nonpathogens such as some strains of B. thailandensis (23–25). Their genomes generally comprise two or three chromosomes of ≈4, 3, and 1 Mb in length, respectively. Our data confirm the abundance of LSSRs previously reported in B. mallei (25). In fact, all Burkholderia genomes contain LSSRs, but the counts vary from a moderate 53 in B. cenocepacia to nearly 700 in the two B. pseudomallei strains, which is the most among the >300 prokaryotic genomes analyzed. As in most genomes with many LSSR5–11, the heptanucleotide SSRs are the most abundant, except in B. xenovorans and Burkholderia sp. 383. Interestingly, the counts of LSSRs are generally higher in chromosome 2 than in chromosome 1 (Tables 1 and 4), possibly reflecting less stringent selective constraints in the secondary chromosomes. Hexa- and nonanucleotide LSSRs show comparable counts in genes and intergenic regions whereas LSSRs of oligonucleotides whose length is not divisible by 3 are mostly confined to intergenic regions, possibly because of selection against frameshift mutations. Surprisingly, the two strains of B. pseudomallei, K96243 and 1710b, have similar overall counts of SSRs but differ in their distribution between genes and intergenic regions with many more intergenic SSRs in the strain K96243. We believe that this discrepancy could be due to differences in gene annotations and may not have biological roots. The LSSR5–11 in the Burkholderia genomes include repeats of many different oligonucleotides, and it is unlikely that the SSRs arose by amplification of a single or few seed SSRs. The LSSR5–11 tend to be G+C rich in parallel with the high overall genomic G+C content ≈68% (SI Table 7).

Methanosarcina species.

Methanosarcina species are strictly anaerobic, mesophilic, archaeal methanogens. All three Methanosarcina genomes feature multiple LSSRs, mostly LSSR5–11 with particularly high counts of 10-mer and 11-mer LSSR (SI Table 5). M. mazei and M. barkeri also have multiple LSSR1–4. Moreover, some of the tri- and tetranucleotide LSSRs in both M. mazei and M. barkeri genomes are very long, exceeding 50 bp in length. All trinucleotide LSSRs in both genomes are of the type (TAA)n or the inverted complement (TTA)n except one (AAG)7 repeat in M. mazei. Likewise, the tetranucleotide LSSRs are mostly repeats of tetranucleotides AAAT/ATTT, and some AATC, TAGA, AACT, AAGT, and AATG (SI Table 8). Several LSSR in M. mazei and M. barkeri exceed 50 bp in length and appear unlikely to be generated solely by mutational drift (Fig. 2). However, the SSRs that modulate gene expression are typically located in upstream regulatory regions or in genes where they cause frameshift mutations (1, 4), whereas the very long LSSR1–4 in Methanosarcina are located between convergent genes, downstream of genes, and some trinucleotide SSRs are in genes but these do not cause frameshifts (SI Table 8). Such locations argue against a direct role of these SSRs in gene regulation although they may have an indirect effect, e.g., by affecting properties of the DNA molecule or the encoded protein.

Discussion

Diversity of SSR Representations in Prokaryotic Genomes.

Representations of SSRs in prokaryotic genomes have been assessed in several studies. Field and Wills (19) analyzed SSR occurrences in several complete genomes and summarized the trends in prokaryotes as overrepresentation of short SSRs (up to the length of 7–8 bp for mononucleotide SSRs, k = 1) and active selection against long SSRs except where they promote reversible mutations affecting specific genes (typically those encoding surface antigens in pathogens) (4, 7). These results were confirmed by others (2, 5). By contrast, we found that the perceived overrepresentation of short SSRs can mostly be explained by models that take into account the nearest neighbor associations, which likely result from mutational biases and/or selective constraints unrelated to the SSRs (26) (SI Fig. 4). Some Mycoplasma genomes exhibit overrepresentations of mononucleotide SSRs of lengths 4–7 bp (20), but this is not common in other prokaryotic genomes.

SSR representations in most prokaryotic genomes exhibit few deviations from random models. One general exception is a sharp decline in mononucleotide SSRs beyond the length of 8 bp, which is common among prokaryotes and applies to both genes and intergenic regions (20). For example, the only other deviation in the Escherichia coli K12 genome relates to two very long octanucleotide SSRs of exactly 52 bp in length, but both are located in prophage regions and are probably not native to the E. coli genome. However, prokaryotic genomes vary significantly in terms of LSSR content and some feature many LSSRs unlikely to occur by chance. Specifically, 50% of the 378 prokaryotic chromosomes analyzed contain <7 LSSRs (SI Table 5), whereas the Anabaena variabilis ATCC 29413 chromosome contains 502 long SSRs, and >700 are present in the B. pseudomallei genome (both chromosomes). Note that the general scarcity of LSSRs does not mean that SSRs are underrepresented (i.e., less frequent than expected) because the cutoff is set such that no LSSRs are expected to be found. Abundance of LSSRs in some genomes is not related to taxonomy or phylogeny and differs significantly even among closely related species. Although shorter SSRs are also variable in length and may play roles in physiology and/or evolution (3, 4, 8, 20, 27), our approach, centering on LSSRs, is suitable for comparisons of SSR representations among different genomes.

Differences Between LSSR5–11 and LSSR1–4.

Based on our data, we hypothesize that LSSR1–4 and LSSR5–11 may arise by different mechanisms. Several pieces of evidence support this hypothesis. First, the LSSR counts Nk* correlate well across different genomes among LSSR5–11. Weaker correlations also occur among the LSSR1–4 counts but no significant correlations are observed between the two classes of LSSRs (Table 2). Second, the two classes of LSSRs tend to occur in different types of organisms. Multiple LSSR1–4 are rare and mostly found in Mycoplasma and several other host-adapted pathogens with reduced genomes (mostly <2 Mb) and low G+C content whereas LSSR5–11 occur in a diverse collection of pathogens and environmental organisms with large genomes and mostly high G+C content (Tables 3 and 4). Moreover, the Nk(l) plots for LSSR1–4 and LSSR5–11 often have different shapes. For LSSR1–4 (k ≤ 4), the plots are often discontinuous and/or bimodal, initially following the random models or dropping below the expected counts but featuring a separate peak or a flat tail of longer than expected SSRs (Fig. 2, ref. 20, and SI Fig. 4). The discontinuity suggests that most SSRs in the separate peak may be functionally relevant and maintained by selection, whereas shorter SSRs are either unaffected by selection or subject to negative selection. By contrast, the Nk(l) plots for LSSR5–11 (k ≥ 5) tend to gradually deviate from the expected counts and feature convex tails or linear tails of a lower slope (Fig. 3 and SI Fig. 4). This observation is consistent with a general tendency of the SSRs to expand when they reach some critical length. The LSSR5–11 counts start to deviate from the random models at lengths just exceeding 2k (double the length of the repeated oligonucleotide), suggesting that two full tandem copies of an oligonucleotide are sufficient for the SSR to expand. It is possible that most LSSR5–11 are generated by mutational drift in absence of negative selection. Slip-strand mutations may lead to both expansion and contraction of an SSR, and the shape of the Nk(l) plots for LSSR5–11 is consistent with a model where SSRs expand more frequently than they contract. Along these lines, many SSR-containing DNA fragments did expand during PCR amplification, although the expansion was sequence-dependent and did not apply to all SSRs (28). Interestingly, A+T-rich SSRs were generally more likely to expand in these PCR experiments, seemingly contradicting our observation that LSSR5–11 are more common in G+C-rich genomes. However, large genomes tend to be G+C rich, and the weak correlation between LSSR5–11 counts and G+C content may arise as an artifact of correlations of both with the genome size.

Fig. 3.
Heptanucleotide SSRs in B. pseudomallei K96243 chromosome 2 (Upper) and in Anabaena variabilis (Lower). See Fig. 1 legend. SSRs exceeding a length of 50 bp are reported at l = 50 bp.

Mutations resulting in SSR expansion or contraction can be introduced during various cellular processes affecting the DNA, including replication, recombination and different repair mechanisms (29). There is little known about differences in precise functioning of these processes among different prokaryotes and we can only speculate as to why the LSSR5–11 expand in some genomes but not in most. Presumably, two prerequisites have to exist to facilitate the LSSR5–11 expansion: (i) a mutational bias promoting expansion of the LSSR5–11 and (ii) a lack of strong negative selection against the LSSR5–11. The latter is consistent with LSSR5–11 not being found in small genomes where the constraints against expansion may be stronger. However, some large genomes (e.g., Myxococcus xanthus) also lack LSSRs. The differences in LSSR5–11 representations may reflect differences in replication and repair machineries among different prokaryotes.

LSSR1–4 in Pathogens.

There are several well documented examples where LSSR1–4 influence gene activity by reversible mutations (27, 30–33). In pathogens, such SSRs can help counteract the host immune response and often affect families of genes encoding surface antigens (4, 5). The fact that LSSR1–4 are often found in host-adapted pathogens, which depend on the ability to avoid the host immune response to a larger degree than opportunistic pathogens, is consistent with a possible role of these SSRs in pathogen-host interactions. However, perhaps contrary to intuitive expectations, and even in pathogens where such regulation is known or thought to take place, many or even most LSSRs are not located proximal to genes encoding surface antigens (1, 8, 20) (see also SI Table 6). The effect of some SSRs on surface antigens might be indirect and facilitated by actions of other proteins (8) or possibly by alterations of DNA or protein structural properties.

Benefits of SSRs in pathogens depend on the degree to which the pathogen is exposed to the immune system of the host and on availability of other strategies to avoid the host immune response. Hence, it is not surprising that many LSSR1–4 are found in Mycoplasma, which are mostly believed to be extracellular and therefore exposed to the host immune system, although some Mycoplasma can enter the host cells (27, 30, 31, 33) (Tables 3 and SI Table 9). Note that several Mycoplasma (e.g., M. penetrans, M. mobile, M. pneumoniae) have few or no LSSR and the differences in SSR representations among Mycoplasma could relate to differences in how they interact with the host (20). In a consistent manner, Mycobacterium leprae features 16 LSSR1–4, whereas all other mycobacteria have no more than one (SI Table 5). M. leprae is a host-adapted pathogen that has not been successfully cultivated outside a host, and its genome size and G+C content are reduced compared with closely related mycobacteria (34). Interestingly, L. intracellularis, which has the second-highest count of LSSR1–4 among the genomes analyzed in this work, is intracellular. In contrast, the genomes of other obligate intracellular pathogens, such as Chlamydia or Rickettsia, contain virtually no LSSRs of any kind (SI Table 5). Likewise, obligate intracellular endosymbionts with reduced genomes do not contain LSSRs. L. intracellularis causes proliferative enteropathy in the infected animals. The bacteria reside in the host cells after colonization but little is known about the early stages of the infection, including colonization, cell adhesion, and cell entry (35). We speculate that differences in the early stages of infection between L. intracellularis and other intracellular pathogens may require effective defense mechanisms in L. intracellularis facilitated by the LSSR1–4. Unlike L. intracellularis, chlamydiae undergo a developmental cycle involving two distinct cell types: reticular bodies and elementary bodies. 
The reticular bodies are found strictly in vacuoles in the host cells. Outside the host cells, chlamydiae persist as metabolically dormant and physically resilient elementary bodies (36). Perhaps the chlamydial developmental cycle and lack of metabolic activity of the extracellular elementary bodies renders the increased antigenic variance facilitated by LSSR1–4 less important.

High Counts of Heptanucleotide LSSRs.

In most prokaryotes with LSSRs, heptanucleotide LSSRs are significantly more abundant than other LSSR types (Table 1 and SI Table 5). It is feasible that the structural characteristics of DNA polymerases and/or their interactions with the DNA may promote polymerase slippage specifically in heptanucleotide repeats and to a lesser extent in hexa-, octa-, and nonanucleotide repeats. Unfortunately, relevant experimental data are scarce. Analogies can be drawn from protein–DNA interactions involving distantly related enzymes. For example, during DNA repair by human polymerase β, 6 base pairs of template-primer can tether into a flexible single-stranded DNA gap, which covers ≈6–7 bp (37). Likewise, E. coli DNA endonuclease VIII interacts with the 7- to 9-bp central region of DNA (38). It is possible that the preferred 7-bp length of oligonucleotides involved in long SSRs is related to the length of the DNA segment that interacts with the active site of the polymerase.

Materials and Methods

DNA Sequences.

Annotated sequences of complete prokaryotic genomes were downloaded from the National Center for Biotechnology Information ftp server ftp://ftp.ncbi.nih.gov/genomes/Bacteria/. Each replicon (chromosome or plasmid) was analyzed separately. We relied on the existing annotation (the “CDS” features) in differentiating protein-coding and noncoding regions.

Simple Sequence Repeats.

SSRs consist of tandem iterations of an oligonucleotide in a DNA sequence. We measure the length of an SSR in nucleotides (bp) rather than the number of repetitive units, which allows accounting for partial copies and facilitates comparisons among SSRs of different lengths.

Definition.

An SSR of length $l$ composed of iterations of a $k$-mer starts at the position $i$ in a sequence of nucleotides if $x_j = x_{j+k}$ for all $j \ge i$, $j \le i + l - k - 1$ and simultaneously $x_{i-1} \ne x_{i-1+k}$ and $x_{i+l-k} \ne x_{i+l}$.

This definition can be applied to all SSRs of length $l \ge k$. Repeats of a longer oligonucleotide that also qualify as repeats of a shorter oligonucleotide are only counted as the shorter oligonucleotide SSR. We analyze the SSR counts $N$ in a given genome as a function of $k$ and $l$, and we refer to the SSR counts as $N_k(l)$.
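The definition can be turned into a short scanning procedure. The sketch below is an illustration only and is not the program used in this work; the function names are hypothetical, positions are 0-based, and only runs with at least one matching pair (length greater than k) are reported. A run is a maximal stretch in which each nucleotide equals the one k positions downstream, and a run that also has a shorter period is left to be counted under that shorter period.

```python
from collections import Counter

def has_period(s, d):
    """True if every character of s equals the one d positions later."""
    return all(s[t] == s[t + d] for t in range(len(s) - d))

def find_ssrs(seq, max_k=11):
    """Return (start, length, k) for each maximal tandem repeat in seq."""
    n = len(seq)
    found = []
    for k in range(1, max_k + 1):
        j = 0
        while j + k < n:
            if seq[j] != seq[j + k]:
                j += 1
                continue
            start = j
            # Extend while the period-k condition holds.
            while j + k < n and seq[j] == seq[j + k]:
                j += 1
            run = seq[start:j + k]
            # Skip runs that already qualify as repeats of a shorter unit.
            if not any(has_period(run, d) for d in range(1, k)):
                found.append((start, len(run), k))
    return found

def count_ssrs(seq, max_k=11):
    """Tabulate counts keyed by (k, l), i.e. the function N_k(l)."""
    counts = Counter()
    for _start, length, k in find_ssrs(seq, max_k):
        counts[(k, length)] += 1
    return counts

example = "GAAAAATCACACACG"
print(find_ssrs(example, max_k=2))   # [(1, 5, 1), (7, 7, 2)]
```

In the example, the run AAAAA is reported once as a mononucleotide SSR of length 5 and is not reported again as a dinucleotide repeat, while CACACAC is a dinucleotide SSR of length 7 that includes a partial final copy.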

Statistical Assessments of SSR Representations.

We employ two different approaches in assessing over- or underrepresentation of SSRs in a DNA sequence: (i) We use multiple stochastic models of varying complexity, which provide an expected range of counts serving as a null hypothesis (20). (ii) The functions Nk(l) are expected to decrease exponentially under homogeneous models. Hence, deviations from exponential dependence may signal over- or underrepresentation of SSRs of the type k and a particular range of lengths l.
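The exponential dependence can be seen in the simplest case. As an illustration under assumptions that are ours rather than part of the analysis (a homogeneous Bernoulli model with base frequencies $p_a$, mononucleotide SSRs only, and end effects ignored), a run of base $a$ of exactly length $l$ requires $l$ copies of $a$ flanked on both sides by a different base:

```latex
\begin{align}
  E[N_1(l)] &\approx G \sum_{a \in \{A,C,G,T\}} (1 - p_a)^2 \, p_a^{\,l}, \\
  \text{and for } p_a = \tfrac{1}{4}: \quad
  E[N_1(l)] &\approx G \cdot 4 \cdot \tfrac{9}{16} \cdot 4^{-l}
            = \tfrac{9}{4}\, G \, 4^{-l}.
\end{align}
```

Here $G$ is the sequence length. With uniform base frequencies, $\log E[N_1(l)]$ is linear in $l$ with slope $-\log 4$, so a plot of observed counts on a logarithmic scale that bends away from a straight line points to over- or underrepresentation in that range of lengths. For a hypothetical $G = 4 \times 10^6$ bp, the expected count is $9 \times 10^6 \cdot 4^{-l}$, which stays at or above 1 up to $l = 11$ and falls below 1 from $l = 12$.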

Random Sequence Models.

Homogeneous Bernoulli or Markov models are often used in analyses of DNA sequences whereas real DNA sequences are intrinsically inhomogeneous (39–41). We use a combination of 11 previously described models (20) that reproduce different properties of the DNA sequence (SI Table 9). Heterogeneous models were constructed by dividing the original sequence into segments corresponding to individual genes and intergenic regions, generating a random sequence corresponding to each segment with a homogeneous model (Bernoulli, Markov, or periodic Markov), and finally reassembling the segments into a contiguous randomized genome. This procedure reproduces sequence heterogeneity at the scale of individual genes and, depending on the models used, nearest-neighbor associations, codon frequencies, and/or the periodic character of protein-coding sequences. The expected SSR counts for each model were estimated from simulations. Ten random sequences were generated by each model, the counts were averaged over the 10 simulations, and the results with different models provide a range of expected counts Nk(l) (see Figs. 1–3 and SI Fig. 4). A program to generate random sequences by the 11 models is available for download at www.cmbl.uga.edu/software.html.
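The segment-wise construction can be sketched as follows. This is a simplified illustration and not the program referred to above: the function names and the `boundaries` argument are hypothetical, only a Bernoulli and a first-order Markov generator are shown (the periodic Markov and codon-based models are omitted), and each generator is parameterized from the segment it replaces.

```python
import random
from collections import defaultdict

def bernoulli_segment(segment, rng):
    """Random sequence drawn with the segment's own base frequencies."""
    return "".join(rng.choices(segment, k=len(segment)))

def markov_segment(segment, rng):
    """First-order Markov sequence using the segment's observed transitions."""
    if len(segment) < 2:
        return segment
    successors = defaultdict(list)
    for a, b in zip(segment, segment[1:]):
        successors[a].append(b)
    out = [segment[0]]
    while len(out) < len(segment):
        nxt = successors.get(out[-1])
        # A base seen only at the segment end has no recorded successor;
        # fall back to the segment's base frequencies in that case.
        out.append(rng.choice(nxt) if nxt else rng.choice(segment))
    return "".join(out)

def randomize_genome(seq, boundaries, model, seed=0):
    """Randomize each segment separately and reassemble the pieces.

    boundaries: sorted cut positions (e.g. gene starts and ends) that
    split seq into consecutive genes and intergenic regions.
    """
    rng = random.Random(seed)
    cuts = [0] + list(boundaries) + [len(seq)]
    parts = [model(seq[a:b], rng) for a, b in zip(cuts, cuts[1:])]
    return "".join(parts)

toy = "AAAAAAAAAA" + "GCGCGCGCGC"
shuffled = randomize_genome(toy, [10], bernoulli_segment, seed=1)
print(shuffled)
```

Because each segment is randomized on its own, local composition is preserved: in the toy example the first ten positions remain A and the last ten contain only G and C. Expected SSR counts would then be obtained by counting SSRs in several such randomized sequences and averaging.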

Definition of Long SSRs.

The $N_k(l)$ representation of SSR counts is impractical for comparisons of hundreds of different genomes. To simplify the representation, we only report counts of LSSRs unlikely to occur by chance. This reduces the $N_k(l)$ representation to $k$ numbers $N_k^*$ for each genome, which signify the counts of SSRs of the type $k$ that exceed a given cutoff $L_k$. The cutoff $L_k$ is derived from the “m1c1” random model (SI Table 9), which reproduces the dinucleotide frequencies for each intergenic region and codon frequencies and nearest-neighbor associations for each gene of the genome. First, we find the largest length $l_k^{(0)}$ such that the expected SSR count based on the m1c1 model satisfies $N_k^{\mathrm{m1c1}}(l_k^{(0)}) \ge 1$. The cutoff is set as $L_k = l_k^{(0)} + 4$. The increase by 4 bp is arbitrary, and it is based on our observation that longer SSRs are rare in most genomes. The $N_k^*$ representation is suitable for comparisons among different genomes while taking into account specific characteristics of each genome. Pattern Locator (42) was used in the analysis of LSSR distribution with respect to annotated genes.
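The cutoff rule can be written out in a few lines. The sketch below assumes that the expected counts from the random model are already available as a mapping from SSR length to expected count for one repeat type k; the function names and all numbers in the example are hypothetical.

```python
def lssr_cutoff(expected_counts, margin=4):
    """Cutoff L_k: the largest length with expected count >= 1, plus margin.

    expected_counts: dict mapping SSR length l (bp) to the expected count
    under the null model for a single repeat type k.
    """
    eligible = [l for l, n in expected_counts.items() if n >= 1]
    if not eligible:
        raise ValueError("no length has an expected count of at least 1")
    return max(eligible) + margin

def count_lssrs(observed_lengths, cutoff):
    """N_k*: the number of observed SSRs whose length exceeds the cutoff."""
    return sum(1 for l in observed_lengths if l > cutoff)

# Hypothetical expected counts for one repeat type in one genome.
expected = {6: 120.0, 7: 30.5, 8: 7.2, 9: 1.4, 10: 0.3}
cutoff = lssr_cutoff(expected)            # 9 + 4 = 13
n_star = count_lssrs([5, 9, 13, 14, 20], cutoff)
print(cutoff, n_star)                     # 13 2
```

Because the cutoff is computed from each genome's own null model, genomes with different base composition or size receive different cutoffs, which is what makes the resulting counts comparable.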

Acknowledgments

We thank Ms. Ishla Seager for help in the initial stages of this project and Drs. Larry Shimkets and Mark Schell for comments on the manuscript and stimulating discussions.

Abbreviations

SSR
simple sequence repeat
LSSR
“long” SSR (see Methods for definition)
LSSR1–4
LSSR composed of iterations of 1-mer to 4-mer
LSSR5–11
LSSR composed of iterations of 5-mer to 11-mer.

Footnotes

The authors declare no conflict of interest.

Mrázek, J., Kypr, J. (1994) Miami Biotechnol Short Rep 5:39.

This article contains supporting information online at www.pnas.org/cgi/content/full/0702412104/DC1.

References

1. Karlin S, Mrázek J, Campbell AM. Nucleic Acids Res. 1996;24:4263–4272.
2. Kashi Y, King DG. Trends Genet. 2006;22:253–259.
3. Metzgar D, Thomas E, Davis C, Field D, Wills C. Mol Microbiol. 2001;39:183–190.
4. Moxon ER, Rainey PB, Nowak MA, Lenski RE. Curr Biol. 1994;4:24–33.
5. Rocha EP. Genome Res. 2003;13:1123–1132.
6. Tautz D, Schlötterer C. Curr Opin Genet Dev. 1994;4:832–837.
7. Groisman EA, Casadesus J. Mol Microbiol. 2005;56:1–7.
8. Rocha EP, Blanchard A. Nucleic Acids Res. 2002;30:2031–2042.
9. Htun H, Dahlberg JE. Science. 1989;243:1571–1576.
10. Nordheim A, Rich A. Proc Natl Acad Sci USA. 1983;80:1821–1825.
11. Shafer RH, Smirnov I. Biopolymers. 2000;56:209–227.
12. van Holde K, Zlatanova J. BioEssays. 1994;16:59–68.
13. Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN. FEBS J. 2005;272:5129–5148.
14. Perutz MF, Pope BJ, Owen D, Wanker EE, Scherzinger E. Proc Natl Acad Sci USA. 2002;99:5596–5600.
15. Karlin S, Brocchieri L, Bergman A, Mrázek J, Gentles AJ. Proc Natl Acad Sci USA. 2002;99:333–338.
16. Timchenko LT, Caskey CT. FASEB J. 1996;10:1589–1597.
17. Matula M, Kypr J. J Biomol Struct Dyn. 1999;17:275–280.
18. Tóth G, Gáspári Z, Jurka J. Genome Res. 2000;10:967–981.
19. Field D, Wills C. Proc Natl Acad Sci USA. 1998;95:1647–1652.
20. Mrázek J. Mol Biol Evol. 2006;23:1370–1385.
21. Gur-Arie R, Cohen CJ, Eitan Y, Shelef L, Hallerman EM, Kashi Y. Genome Res. 2000;10:62–71.
22. Gebhart CJ, Barns SM, McOrist S, Lin GF, Lawson GH. Int J Syst Bacteriol. 1993;43:533–538.
23. Holden MT, Titball RW, Peacock SJ, Cerdeno-Tarraga AM, Atkins T, Crossman LC, Pitt T, Churcher C, Mungall K, Bentley SD, et al. Proc Natl Acad Sci USA. 2004;101:14240–14245.
24. Kim HS, Schell MA, Yu Y, Ulrich RL, Sarria SH, Nierman WC, DeShazer D. BMC Genomics. 2005;6:174.
25. Nierman WC, DeShazer D, Kim HS, Tettelin H, Nelson KE, Feldblyum T, Ulrich RL, Ronning CM, Brinkac LM, Daugherty SC, et al. Proc Natl Acad Sci USA. 2004;101:14246–14251.
26. Karlin S, Mrázek J, Campbell AM. J Bacteriol. 1997;179:3899–3913.
27. Willems R, Paul A, van der Heide HG, ter Avest AR, Mooi FR. EMBO J. 1990;9:2803–2809.
28. Vondrušková J, Pařízková N, Kypr J. Nucleosides Nucleotides Nucleic Acids. 2007;26:65–82.
29. Bichara M, Wagner J, Lambert IB. Mutat Res. 2006;598:144–163.
30. Glew MD, Baseggio N, Markham PF, Browning GF, Walker ID. Infect Immun. 1998;66:5833–5841.
31. Hood DW, Deadman ME, Jennings MP, Bisercic M, Fleischmann RD, Venter JC, Moxon ER. Proc Natl Acad Sci USA. 1996;93:11121–11125.
32. Stern A, Brown M, Nickel P, Meyer TF. Cell. 1986;47:61–71.
33. Wassenaar TM, Wagenaar JA, Rigter A, Fearnley C, Newell DG, Duim B. FEMS Microbiol Lett. 2002;212:77–85.
34. Cole ST, Eiglmeier K, Parkhill J, James KD, Thomson NR, Wheeler PR, Honore N, Garnier T, Churcher C, Harris D, et al. Nature. 2001;409:1007–1011.
35. Smith DG, Lawson GH. Vet Microbiol. 2001;82:331–345.
36. Rockey DD, Matsumoto A. In: Prokaryotic Development. Brun YV, Shimkets LJ, editors. Washington, DC: Am Soc Microbiol Press; 1999. pp. 403–425.
37. Pelletier H, Sawaya MR, Wolfle W, Wilson SH, Kraut J. Biochemistry. 1996;35:12742–12761.
38. Zharkov DO, Golan G, Gilboa R, Fernandes AS, Gerchman SE, Kycia JH, Rieger RA, Grollman AP, Shoham G. EMBO J. 2002;21:789–800.
39. Fickett JW, Torney DC, Wolf DR. Genomics. 1992;13:1056–1064.
40. Karlin S, Blaisdell BE, Sapolsky RJ, Cardon L, Burge C. Nucleic Acids Res. 1993;21:703–711.
41. Larhammar D, Chatzidimitriou-Dreismann CA. Nucleic Acids Res. 1993;21:5167–5170.
42. Mrázek J, Xie S. Bioinformatics. 2006;22:3099–3100.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences