Genome Res. 2002 Jul; 12(7): 1075–1079.
PMCID: PMC186626

Genome Size Reduction through Illegitimate Recombination Counteracts Genome Expansion in Arabidopsis


Genome size varies greatly across angiosperms. It is well documented that, in addition to polyploidization, retrotransposon amplification has been a major cause of genome expansion. The lack of evidence for counterbalancing mechanisms that curtail unlimited genome growth has made many of us wonder whether angiosperms have a “one-way ticket to genomic obesity.” We have therefore investigated an angiosperm with a well-characterized and notably small genome, Arabidopsis thaliana, for evidence of genomic DNA loss. Our results indicate that illegitimate recombination is the driving force behind genome size decrease in Arabidopsis, removing at least fivefold more DNA than unequal homologous recombination. The presence of highly degraded retroelements also suggests that retrotransposon amplification has not been confined to the last 4 million years, as is indicated by the dating of intact retroelements.

Flowering plants (angiosperms) vary enormously in genome size, from <50 Mb in some members of the Cruciferae to >85,000 Mb in some Liliaceae (Bennett and Leitch 1995). The mechanisms that account for dramatic expansion of angiosperm genomes have been documented, primarily polyploidization and retrotransposon amplification (SanMiguel et al. 1996, 1998; Wendel 2000); however, counterbalancing modes of genome contraction have not been convincingly shown. In the absence of an equally comprehensive and aggressive mechanism for genome size decrease, the question remains whether angiosperms have a “one-way ticket to genome obesity” (Bennetzen and Kellogg 1997). We have addressed this fundamental issue in the genome size debate by studying the structure and evolution of long terminal repeat (LTR) retrotransposons in Arabidopsis.

LTR retrotransposons constitute a large part of the repetitive DNA fraction in plant species. They are characterized by LTRs that vary in size from a few hundred base pairs (bp) to several kilobases and terminate in short inverted repeats, usually 5′-TG-3′ and 5′-CA-3′ (Kumar and Bennetzen 1999). The well-defined structure of LTR retrotransposons, their prevalence and dispersion in the genome, their acknowledged role in genome size expansion, and the fact that individual elements have little or no selective significance make LTR retrotransposons suitable elements for studying genome evolution (Petrov 2001). The prevalence and distribution of LTR retrotransposons have been the subject of several studies, including in Arabidopsis (Marín and Lloréns 2000; Terol et al. 2001). These studies, however, are generally based on the analysis of intact elements of relatively recent origin and provide no information on the long-term fate of these sequences. In our study, LTR-retrotransposon families were established on the basis of homology of the LTRs rather than the open reading frames. An important advantage of this approach is that not only complete elements but also solo LTRs and elements that have undergone a variety of deletions can be identified. It is precisely the structure of this latter group that provides the most important clues regarding plant genome evolution.


We have analyzed a total of 291 LTR-retrotransposon elements belonging to 12 families (four copia, six gypsy, two unknown). The retroelements are distributed over the five Arabidopsis chromosomes and show the typical pericentromeric clustering previously observed for LTR retrotransposons (Lin et al. 1999; Mayer et al. 1999), indicating that the 291 elements form a representative sample. The 12 families were originally identified in two bacterial artificial chromosome (BAC) clones that were randomly chosen from a selection of annotated Arabidopsis BACs that contained putative LTR retroelements. The LTRs of these elements were then used as query sequences in BLAST searches against the Arabidopsis genomic sequence (http://www.arabidopsis.org). Incomplete elements were taken into account only if they retained at least one of the LTR-retrotransposon characteristics such as a primer-binding site (PBS), a polypurine tract (PPT), or a target-site duplication (Kumar and Bennetzen 1999). Thus, many severely deleted LTR retrotransposons that we detected were not further studied because their highly fragmentary structure made it impossible to determine the nature of specific rearrangements that they had undergone. Of the 291 studied elements, 87 (29.9%) were found to be “complete”; that is, they contain two LTRs flanked by a 5-bp target-site duplication and separated by an internal region containing a PBS and PPT (Fig. 1A). By use of the dating strategy described by SanMiguel et al. (1998), but applying the synonymous substitution rate of 1.5 × 10⁻⁸ mutations per site per year determined for the Chs and Adh genes in the Brassicaceae (Koch et al. 2000), we estimated that these retrotransposons all inserted in the Arabidopsis genome during the last 4 million years, most within the last 2 million years (data not shown). These estimates are based on the assumption that LTRs evolve at approximately the same rate as coding regions and on our observations that conversion does not frequently occur in these elements, as evidenced by the even distribution of sequence variation (SanMiguel et al. 1998). It would thus appear that, similar to maize (SanMiguel et al. 1998), the Arabidopsis genome has undergone a surge of retrotransposon amplification in recent times.
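The dating logic can be illustrated with a minimal Python sketch. It assumes the commonly used relation T = K / (2r), where K is the divergence between the two LTRs of one element and r is the substitution rate; the factor 2 reflects that both LTRs, identical at insertion, accumulate substitutions independently. The Jukes-Cantor correction and the function names are assumptions made for illustration; the text does not state which distance correction was applied.

```python
import math

# Rate quoted in the text (Koch et al. 2000): substitutions per site per year.
SUBSTITUTION_RATE = 1.5e-8


def jukes_cantor_distance(ltr5, ltr3):
    """Jukes-Cantor distance between two aligned LTRs (assumed correction).

    Columns containing gaps or ambiguous bases are ignored.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("LTRs must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(ltr5.upper(), ltr3.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("divergence too high for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_age_years(ltr5, ltr3, rate=SUBSTITUTION_RATE):
    """Estimate time since insertion as T = K / (2 * rate)."""
    return jukes_cantor_distance(ltr5, ltr3) / (2.0 * rate)
```

Under these assumptions, an upper bound of 4 million years corresponds to an LTR divergence of about 2 × 1.5 × 10⁻⁸ × 4 × 10⁶ = 0.12 substitutions per site.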

Figure 1
Unequal intrastrand recombination between LTR retrotransposons. (A) Structure of a complete element, with a direct repeat (DR) of flanking target-site DNA, two long terminal repeats (LTRs), a primer-binding site (PBS), and polypurine tract (PPT) needed ...

In contrast with maize, which contains mainly intact retroelements and rare solo LTRs (SanMiguel et al. 1996; W. Ramakrishna and J.L. Bennetzen, unpubl.), the ratio of solo LTRs to intact elements in Arabidopsis is ∼1:1. Solo LTRs can be derived from unequal intrastrand recombination between the 5′ and 3′ LTRs of a single element (Fig. 1B). Barley, on the other hand, which has a genome size twice that of maize, contains 16-fold more LTRs than internal retroelement domains for the BARE-1 element, and this excess of LTRs has been ascribed to an abundance of solo LTRs (Vicient et al. 1999; Shirasu et al. 2000). Although intraelement recombination can never neutralize the genome expansion driven by LTR-retrotransposon amplification because a solo LTR is retained, it can play a role in attenuating genome growth (Bennetzen and Kellogg 1997; Vicient et al. 1999). Unequal intrastrand homologous recombination between LTRs of different elements belonging to the same family can result in a net loss of DNA (Fig. 1C–E). Six examples of this were found in our study (Table 1), four of which resulted in clearly recognizable recombinant products in which an LTR was flanked by both a PBS and PPT (Fig. 1C). An apparently intact element lacked the 5-bp target-site duplication and was therefore expected to be the product of homologous recombination between two 5′ LTRs, two 3′ LTRs, or the internal regions of two family members (Fig. 1D). Similarly, a solo LTR that lacked the target-site duplication was assigned as a recombinant element (Fig. 1E). As observed in numerous studies, including our own, LTR retrotransposons in Arabidopsis are particularly abundant in pericentromeric regions, which are largely devoid of genes (Lin et al. 1999; Mayer et al. 1999). Inter-element deletions therefore are unlikely to have a negative effect on the overall fitness of an individual.

Table 1
Structure of LTR Retroelements Identified in Arabidopsis

In addition to intact elements and solo LTRs, 98 truncated elements (33.7% of the total) were identified (Table 1). They include (1) elements in which the two LTRs are still recognizable but have undergone deletions at either their 3′ or 5′ end (8.3%), (2) elements in which the 5′ LTR together with part of the internal sequence has been deleted (16.5%), and (3) elements in which the 3′ LTR together with part of the internal sequence has been deleted (8.9%). The remaining LTRs of elements belonging to the latter two groups may have undergone further deletions and have been included in the analysis only if their identity could be established unequivocally. Therefore, the true percentage of LTR-retrotransposon remnants is much higher than 33.7%. The discovery of small deletions as a major mode for genome size determination in Arabidopsis parallels results obtained by Petrov and coworkers in Drosophila, a species with a DNA content similar to that of Arabidopsis. On the basis of the rate of insertions and deletions in a non-LTR retrotransposon, Petrov and Hartl (1998) calculated that pseudogenes lose ∼50% of their DNA in 14 million years through spontaneous deletions. Deletions have also been shown to be a frequent event in transposable elements in maize (Masson et al. 1987; Marillonnet and Wessler 1998) and to feature in LTR retrotransposons in wheat (Wicker et al. 2001), which have genome sizes that are 20 and 120 times larger than the Arabidopsis genome, respectively. The results in Arabidopsis, maize, and wheat indicate that deletions that are independent of homologous recombination (equal or unequal) represent a key mechanism for DNA elimination in plants.

In an attempt to shed light on the molecular mechanism(s) that gave rise to the deletions, we compared the internal regions of retroelements belonging to 3 of the 12 families. The three families contained 37% of the retroelements for which internal regions could be analyzed and were assumed to be a representative sample. The comparisons included 33, 5, and 5 elements of the three families. We analyzed the breakpoints of a total of 59, 8, and 6 deletions, respectively, ranging in size from 10 to 3766 bp. Deletions that were shared between elements within a family were assumed to have a common descent and were considered only once. It should be noted that although some of the deletions encompassed >500 bp, only four affected the structural characteristics such as LTRs, PBS, and PPT used in our assessment of the intactness of the LTR retroelements and thus led to the classification of the corresponding elements as “incomplete.” Of the 59, 8, and 6 deletions, 46 (78%), 4 (50%), and 6 (100%), respectively, were flanked by short repeats of 2 to 13 bp, some of which were imperfect. Taking into account the base composition in the internal regions of the LTR retroelements and the distribution of sequences homologous to the short flanking repeats, the association of the repeats with the deletions was highly significant for each of the families (Table 2). Similarly, an analysis of six tandem duplications present in the region under investigation also showed a highly significant association of the duplications with short repeats (Table 2). The importance of short repeats in deletion and duplication formation is well documented in bacteria and yeast. Classical homologous recombination requires homologous sequences of at least 20 bp in bacteria (Ehrlich 1989) and 50 to 100 bp in yeast (Sugawara and Haber 1992), whereas shorter repeats engage solely in illegitimate recombination. The high frequency of short repeats associated with the deletions in Arabidopsis retroelements indicates that genome expansion through retrotransposon amplification can be counterbalanced by a gradual removal of the elements through illegitimate recombination. Unfortunately, our data set does not allow us to ascertain whether illegitimate recombination takes place by errors in DNA replication, by double-strand break repair (Gorbunova and Levy 1999), or by some unknown mechanism.
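To make concrete what "flanked by short repeats" means at a deletion breakpoint, the following Python sketch recovers the perfect direct repeat that spans a deletion junction. It is an illustration only: the function name and the coordinate convention (the deleted segment is `reference[start:end]` of an undeleted family member) are assumptions, and imperfect repeats, which the text notes were also observed, are not handled.

```python
def flanking_repeat(reference, start, end):
    """Return the perfect direct repeat associated with a deletion.

    The deletion removes reference[start:end]. When a deletion arises between
    two copies of a short repeat, one copy is lost together with the
    intervening DNA, so the exact breakpoint is ambiguous within the repeat.
    The repeat is recovered by extending the match between the sequence at
    the left breakpoint and the sequence at the right breakpoint in both
    directions.
    """
    if not (0 <= start < end <= len(reference)):
        raise ValueError("expected 0 <= start < end <= len(reference)")

    # Extend to the right: bases at the start of the deletion that are
    # repeated immediately after its end.
    right = 0
    while (end + right < len(reference)
           and reference[start + right] == reference[end + right]):
        right += 1

    # Extend to the left: bases just before the deletion that are repeated
    # at the end of the deleted segment.
    left = 0
    while (start - left - 1 >= 0
           and reference[start - left - 1] == reference[end - left - 1]):
        left += 1

    return reference[start - left:start + right]
```

Because the search extends in both directions, the same repeat is returned regardless of where within the repeat an alignment program happened to place the gap. An empty string means that no perfect repeat flanks the deletion.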

Table 2
Statistical Significance of the Association of Short Repeats with Deletions and Tandem Duplications

The formation of deletions during double-strand break repair was recently investigated in Arabidopsis and tobacco, two species that vary 20-fold in their DNA content (Gorbunova and Levy 1997; Kirik et al. 2000). It was shown that strand rejoining after a break frequently occurs at short repeats and results in the deletion of a few base pairs to several kilobases of DNA. The average deletion size was significantly smaller in tobacco than Arabidopsis (920 bp vs. 1341 bp) (Kirik et al. 2000). A negative correlation between genome size and rate of DNA loss was also postulated by Petrov and coworkers. They estimated that the rate of DNA loss in Drosophila was ∼40-fold higher than in the 11-fold larger genome of Laupala and 60-fold higher than in, on average, 18-fold larger mammalian genomes (Petrov and Hartl 1998; Petrov et al. 2000). The observation that short repeats are often associated with deletions in Drosophila (Petrov and Hartl 1998) indicates to us that illegitimate recombination is also a major determinant of DNA loss in Drosophila. Although we can conclude that DNA is effectively removed from small genome organisms through illegitimate recombination, no information is available on the driving force behind the differential loss of DNA in small and large genomes. Genome size is clearly the result of a balance between amplification and loss of DNA. However, it remains to be seen whether organisms have an active role in determining the ratio of DNA gain to loss or whether this ratio is the result of evolutionary forces acting on the nongenic DNA.

The insight that elimination of LTR retrotransposons takes place through illegitimate recombination forces us to reassess our earlier suggestion that the genome size of Arabidopsis has increased considerably over the past 4 million years through retrotransposon insertion. Considering the large number of retrotransposon remnants, it now seems likely that the apparent absence of elements older than a few million years is simply a reflection of their gradual degradation over time. As our data set does not allow conclusions to be drawn on the relative rate of DNA removal and amplification, it is now an open question whether the Arabidopsis genome size has increased, decreased, or remained constant over recent times.

We included only clearly recognizable elements in our study. Sequences of a few tens to hundreds of base pairs with homology with retroelement LTRs, however, were identified and bear further witness to the fact that genomes use mechanisms other than unequal homologous recombination to remove repetitive elements. Moreover, our data indicate that deletion through illegitimate recombination is more important than unequal homologous recombination events in eliminating DNA in Arabidopsis. We analyzed deletions only in relatively intact retrotransposons because they could be precisely defined; however, clearly identifiable retroelements make up <10% of the Arabidopsis genome (The Arabidopsis Genome Initiative 2000). Although retrotransposons served as a particularly clear indicator for genome size contraction, we have no reason to doubt that illegitimate recombination will also remove DNA from the other 90% of the Arabidopsis nuclear genome. Selection against gene loss will attenuate deletions in the 44% of the genome that is genic (25,000 genes with an average length of 2 kb) (The Arabidopsis Genome Initiative 2000), but illegitimate recombination is likely to proceed unimpeded in the noncoding DNA. Furthermore, in contrast with unequal homologous recombination, which requires the presence of closely linked direct repeats and ends when only one LTR unit remains, multiple independent illegitimate recombination events can and do occur in any region, eventually removing all unselected sequence. We predict that illegitimate recombination removes at least fivefold more DNA than unequal homologous recombination because illegitimate recombination can act on at least 5 times more of the genome than can unequal recombination between LTRs, and because we saw many more severely deleted LTR retrotransposons than we did solo LTRs. Observations that the Arabidopsis genome is composed of numerous duplicated segments with subsequent genic deletions (Blanc et al. 2000; Ku et al. 2000) are fully compatible with our model of genome contraction via illegitimate recombination.
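The figures quoted in this paragraph can be checked with a short calculation. The Python sketch below uses only numbers stated in the text; the final ratio reflects one possible reading of the "at least 5 times more of the genome" statement (nongenic DNA compared with the <10% occupied by identifiable retroelements) and is an assumption, not the authors' stated derivation.

```python
# Values quoted in the text (The Arabidopsis Genome Initiative 2000).
GENE_COUNT = 25_000
MEAN_GENE_LENGTH_KB = 2
GENIC_FRACTION = 0.44
RETROELEMENT_FRACTION_MAX = 0.10  # "clearly identifiable retroelements make up <10%"

# Total genic DNA in megabases.
genic_mb = GENE_COUNT * MEAN_GENE_LENGTH_KB / 1000

# Genome size implied by the genic share.
implied_genome_mb = genic_mb / GENIC_FRACTION

# Assumed reading: nongenic DNA available to illegitimate recombination,
# relative to the retroelement fraction available to LTR-LTR recombination.
nongenic_fraction = 1 - GENIC_FRACTION
target_ratio = nongenic_fraction / RETROELEMENT_FRACTION_MAX

print(f"genic DNA: {genic_mb:.0f} Mb")
print(f"implied genome size: {implied_genome_mb:.0f} Mb")
print(f"nongenic / retroelement fraction: at least {target_ratio:.1f}")
```

Because the retroelement fraction is an upper bound, the ratio computed this way is a lower bound.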


Identification and Alignment of LTR Retroelements

The programs Repeat and Gap from the Wisconsin Package Version 10.1, Genetics Computer Group were used for the initial identification and alignment of LTRs belonging to the same retroelement on Arabidopsis BAC clones K11J14 (AP000411) and T24G23 (AC006268). Putative LTR retroelements were scrutinized manually for the presence of a TG/CA inverted repeat in the LTRs, a PBS, a PPT, and a target-site duplication. LTRs of confirmed elements were used as query sequences in BLASTN (NCBI BLAST 2.0) searches against the Arabidopsis thaliana database (http://www.arabidopsis.org) to identify additional family members. LTRs and internal regions of elements belonging to the same family were aligned using CLUSTALX (Thompson et al. 1997). If needed, sequence alignments were edited manually using JalView (M. Clamp, EBI).
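The manual checks described above can be expressed as simple predicates. The Python sketch below is illustrative only and is not the procedure used in the study, which relied on manual inspection; the function names and the way flanking sequence is passed in are assumptions. It tests two of the listed hallmarks: the TG/CA termini of an LTR and the 5-bp target-site duplication flanking an element.

```python
def has_tg_ca_termini(ltr):
    """True if the LTR starts with 5'-TG-3' and ends with 5'-CA-3'."""
    ltr = ltr.upper()
    return len(ltr) >= 4 and ltr.startswith("TG") and ltr.endswith("CA")


def has_target_site_duplication(left_flank, right_flank, tsd_length=5):
    """True if the bases immediately before and after the element match.

    left_flank is the genomic sequence ending at the element's 5' boundary;
    right_flank is the genomic sequence starting at its 3' boundary.
    """
    if len(left_flank) < tsd_length or len(right_flank) < tsd_length:
        return False
    return left_flank[-tsd_length:].upper() == right_flank[:tsd_length].upper()
```

An element or solo LTR that passes the first check but fails the second would be a candidate product of recombination between different family members, as discussed for Figure 1D and 1E.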

Statistical Analysis

Randomization tests involving matching of flanking sequences (Test 1) and matching of individual bases in the flanking sequences (Test 2) were performed to determine the statistical significance of the association of short repeats with deletions and tandem duplications. In each run of Randomization Test 1, one flanking sequence of each deletion or duplication was held fixed while sequences of the same length were sampled at random from all available sequences of the appropriate LTR-retrotransposon family. Deletions or duplications for which the match between the randomly sampled sequence and the fixed sequence was at least as close as that of the two actual flanking sequences were counted. One thousand separate randomizations were run for deletions in each LTR-retrotransposon family for the complete set of LTR retrotransposons and for tandem duplications involving Family 3. The probability (P) is that of obtaining by chance at least the observed number of matches under the null hypothesis that sequences flanking each deletion or duplication are unrelated (one-tailed t test using a logarithmic transformation). Randomization Test 2 was essentially similar but involved matching of individual bases in the flanking sequences rather than the complete sequences.
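A rough Python sketch of Randomization Test 1 follows. It is a simplified illustration under stated assumptions, not the authors' implementation: closeness of match is scored as the number of identical positions, a random window is drawn by first choosing a family sequence and then an offset (which is not uniform over all windows), and the function names are hypothetical. The sketch returns only the per-run counts; the one-tailed t test on log-transformed values mentioned in the text is not reproduced.

```python
import random


def match_score(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def randomization_test_1(deletions, family_sequences, runs=1000, seed=0):
    """Per-run counts of deletions matched at least as well by chance.

    deletions: list of (fixed_flank, partner_flank) pairs of equal length,
        the two sequences flanking each deletion or duplication.
    family_sequences: sequences of the same LTR-retrotransposon family from
        which random windows are drawn.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(runs):
        hits = 0
        for fixed, partner in deletions:
            length = len(fixed)
            candidates = [s for s in family_sequences if len(s) >= length]
            if not candidates:
                raise ValueError("no family sequence is long enough")
            source = rng.choice(candidates)
            offset = rng.randrange(len(source) - length + 1)
            sampled = source[offset:offset + length]
            # Count the deletion if the random window matches the fixed flank
            # at least as closely as the real partner flank does.
            if match_score(fixed, sampled) >= match_score(fixed, partner):
                hits += 1
        counts.append(hits)
    return counts
```

If the short repeats were unrelated to the deletions, random windows would match the fixed flank as well as the real partner reasonably often, giving high counts; consistently low counts across the runs indicate a non-random association.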


http://www.arabidopsis.org; Arabidopsis information resource.


K.M. Devos acknowledges funding from the Biotechnology and Biological Sciences Research Council (BBSRC) through a David Phillips Research Fellowship and ISIS International Fellowship, and J.L. Bennetzen thanks NSF for supporting this research (Grant 9975793).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.


E-MAIL Katrien.devos@bbsrc.ac.uk; FAX 44 1603 450 023/24.

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.132102.


  • Bennett MD, Leitch IJ. Nuclear DNA amounts in Angiosperms. Ann Bot. 1995;76:113–176.
  • Bennetzen JL, Kellogg EA. Do plants have a one-way ticket to genomic obesity? Plant Cell. 1997;9:1509–1514.
  • Blanc G, Barakat A, Guyot R, Cooke R, Delseny M. Extensive duplication and reshuffling in the Arabidopsis genome. Plant Cell. 2000;12:1093–1101.
  • Ehrlich SD. Illegitimate recombination in bacteria. In: Berg DE, Howe MM, editors. Mobile DNA. Washington, D.C.: American Society for Microbiology; 1989. pp. 799–832.
  • Gorbunova V, Levy AA. Non-homologous DNA end-joining in plant cells is associated with deletions and filler DNA insertions. Nucleic Acids Res. 1997;25:4650–4657.
  • Gorbunova V, Levy AA. How plants make ends meet: DNA double-strand break repair. Trends Plant Sci. 1999;4:263–269.
  • Kirik A, Salomon S, Puchta H. Species-specific double-strand break repair and genome evolution in plants. EMBO J. 2000;19:5562–5566.
  • Koch MA, Haubold B, Mitchell-Olds T. Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Mol Biol Evol. 2000;17:1483–1498.
  • Ku H-M, Vision T, Liu J, Tanksley SD. Comparing sequenced segments of the tomato and Arabidopsis genomes: Large-scale duplication followed by selective gene loss creates a network of synteny. Proc Natl Acad Sci. 2000;97:9121–9126.
  • Kumar A, Bennetzen JL. Plant retrotransposons. Annu Rev Genet. 1999;33:479–532.
  • Lin XY, Kaul SS, Rounsley S, Shea TP, Benito MI, Town CD, Fujii CY, Mason T, Bowman CL, Barnstead M, et al. Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature. 1999;402:761–768.
  • Marillonnet S, Wessler SR. Extreme structural heterogeneity among the members of a maize retrotransposon family. Genetics. 1998;150:1245–1256.
  • Marín I, Lloréns C. Ty3/Gypsy retrotransposons: Description of new Arabidopsis thaliana elements and evolutionary perspectives derived from comparative genomic data. Mol Biol Evol. 2000;17:1040–1049.
  • Masson P, Surosky R, Kingsbury JA, Fedoroff NV. Genetic and molecular analysis of the Spm-dependent a-m2 alleles of the maize a locus. Genetics. 1987;117:117–137.
  • Mayer K, Schuller C, Wambutt R, Murphy G, Volckaert G, Pohl T, Düsterhöft A, Stiekema W, Entian K-D, Terryn N, et al. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature. 1999;402:769–777.
  • Petrov DA. Evolution of genome size: New approaches to an old problem. Trends Genet. 2001;17:23–28.
  • Petrov DA, Hartl DL. High rate of DNA loss in the Drosophila melanogaster and Drosophila virilis species group. Mol Biol Evol. 1998;15:293–302.
  • Petrov DA, Sangster TA, Johnston JS, Hartl DL, Shaw KL. Evidence for DNA loss as a determinant of genome size. Science. 2000;287:1060–1062.
  • SanMiguel P, Tikhonov A, Jin Y-K, Motchoulskaia N, Zakharov D, Melake Berhan A, Springer PS, Edwards KJ, Avramova Z, et al. Nested retrotransposons in the intergenic regions of the maize genome. Science. 1996;274:765–768.
  • SanMiguel P, Gaut BS, Tikhonov A, Nakajima Y, Bennetzen JL. The paleontology of intergene retrotransposons of maize: Dating the strata. Nat Genet. 1998;20:43–45.
  • Shirasu K, Schulman AH, Lahaye T, Schulze-Lefert P. A contiguous 66-kb barley DNA sequence provides evidence for reversible genome expansion. Genome Res. 2000;10:908–915.
  • Sugawara N, Haber JE. Characterization of double-strand break-induced recombination: Homology requirements and single-stranded DNA formation. Mol Cell Biol. 1992;12:563–575.
  • Terol J, Castillo MC, Bargues M, Pérez-Alonso M, de Frutos R. Structural and evolutionary analysis of the copia-like elements in the Arabidopsis thaliana genome. Mol Biol Evol. 2001;18:882–892.
  • The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815.
  • Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. The ClustalX windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 1997;25:4876–4882.
  • Vicient CM, Suoniemi A, Anamthawat-Jónsson K, Tanskanen J, Beharav A, Nevo E, Schulman AH. Retrotransposon BARE-1 and its role in genome evolution in the genus Hordeum. Plant Cell. 1999;11:1769–1784.
  • Wendel JF. Genome evolution in polyploids. Plant Mol Biol. 2000;42:225–249.
  • Wicker T, Stein N, Albar L, Feuillet C, Schlagenhauf E, Keller B. Analysis of a contiguous 211 kb sequence in diploid wheat (Triticum monococcum L.) reveals multiple mechanisms of genome evolution. Plant J. 2001;26:307–316.

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press