Proc Natl Acad Sci U S A. Mar 4, 1997; 94(5): 1872–1877.

Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons


It is commonly accepted that the reverse-transcribed cellular RNA molecules, called retroposons, integrate at staggered breaks in mammalian chromosomes. However, unlike what was previously thought, most of the staggered breaks are not generated by random nicking. One of the two nicks involved is primarily associated with the 5′-TTAAAA hexanucleotide and its variants derived by a single base substitution, particularly A → G and T → C. It is probably generated in the antisense strand between the consensus bases 3′-AA and TTTT complementary to 5′-TTAAAA. The sense strand is nicked at variable distances from the TTAAAA consensus site toward the 3′ end, preferably within 15–16 base pairs. The base composition near the second nicking site is also nonrandom at positions preceding the nick. On the basis of the observed sequence patterns it is proposed that integration of mammalian retroposons is mediated by an enzyme with endonucleolytic activity. The best candidate for such an enzyme may be the reverse transcriptase encoded by the L1 non-long-terminal-repeat retrotransposon, which contains a recently reported domain homologous to the apurinic/apyrimidinic (AP) endonuclease family [Martin, F., Olivares, M., Lopez, M. C. & Alonso, C. (1996) Trends Biochem. Sci. 21, 283–285; Feng, Q., Moran, J. V., Kazazian, H. H. & Boeke, J. D. (1996) Cell 87, 905–916] and shows nicking in vitro with preference for targets similar to the 5′-TTAAAA/3′-AATTTT consensus sequence [Feng, Q., Moran, J. V., Kazazian, H. H. & Boeke, J. D. (1996) Cell 87, 905–916]. A model for integration of mammalian retroposons based on the presented data is discussed.

The reverse-transcribed cellular RNA molecules, or retroposons, are commonly integrated in mammalian chromosomes (1–3). They are usually flanked by short direct repeats resulting from integration at staggered chromosomal breaks (4). The flanking repeats vary in length and, although rich in adenine (5, 6), they did not seem to contain any specific sequence signals that would indicate enzymatic involvement in retroposon integration (reviewed in ref. 1). This led to a widespread opinion that their origin is attributable to randomly generated staggered breaks. However, sequence analysis of the flanking regions from human and rodent retroposons presented in this paper shows that the staggered breaks are associated with specific, albeit short, sequence signals, consistent with the involvement of an enzyme with endonucleolytic activity. This evidence implies a specific mechanism of retroposon integration in mammals.


The analysis was performed on sequences flanking human Alu (7) and rodent ID (BC1-like) retroposons (8). GenBank coordinates of the majority of unique Alu elements were obtained from Repbase (ftp http://ncbi.nlm.nih.gov/repository/repbase), and coordinates of ID and additional Alu sequences were directly identified in GenBank (release 97.0) by screening of the corresponding consensus sequences against human and rodent portions of the GenBank database, using the censor program (9). All Alu and ID elements immediately flanked by other repetitive elements were discarded to avoid systematic biases in base compositions of their flanking regions.

Pairs of sequences approximately 50 bp long, flanking each full-length Alu and ID element, were extracted and aligned against each other using an implementation of the Smith–Waterman algorithm (10). Based on the alignment, only identical subsequences at both ends of complete Alu or ID sequences were selected as potential flanking repeats. If the 5′ subsequences in any of the homologous pairs were not immediately adjacent to the 5′ ends of the Alu or ID sequences, the pairs were eliminated from the set by manual editing using the sequence editor mase (11). The finally selected fragments, at least 10 bp long, were considered to be representative flanking repeats of 344 human Alu and 56 rodent ID elements. The remaining 356 Alu and 49 ID flanking repeats 4–9 bp long were put in a separate set for comparative studies, as described in Results.
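The selection rule described above (identical subsequences, with the 5′ copy immediately adjacent to the 5′ end of the element) can be illustrated with a minimal Python sketch. It is not the procedure used in this study: it replaces the Smith–Waterman alignment and manual editing with a plain exact-match search, and the function name, the `min_len` parameter, and the example sequences are hypothetical.

```python
def longest_adjacent_repeat(upstream, downstream, min_len=4):
    """Find the longest suffix of `upstream` that also occurs in `downstream`.

    `upstream` is the flank ending at the 5' end of the element, so any
    suffix of it is by construction immediately adjacent to the element.
    `downstream` is the flank starting at the 3' end of the element.
    Returns (repeat, start index in downstream), or None if no suffix of
    at least `min_len` bases is found. Exact matching only (assumption).
    """
    upstream = upstream.upper()
    downstream = downstream.upper()
    best = None
    for k in range(min_len, len(upstream) + 1):
        candidate = upstream[-k:]
        pos = downstream.find(candidate)
        if pos == -1:
            # Every longer suffix contains this one, so none can match.
            break
        best = (candidate, pos)
    return best
```

Calling `longest_adjacent_repeat(up, down)` on a pair of roughly 50-bp flanks returns the candidate flanking repeat and its offset in the 3′ flank; candidates would then be sorted into the long (at least 10 bp) and short (4–9 bp) sets by length.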

The 5′ flanking repeats over 9 bp long were adjusted to the left so that they all started at the same position, and the 3′ repeats were adjusted to the right so that they all ended at the same position. A nonadjusted version of the same set was also preserved for comparative purposes. No alignment was attempted to maximize sequence similarity. The adjusted flanking repeats were further extended by an additional 15 bp away from the retroposon insertion site, wherever sequence data were available. The extensions are referred to as 5′ and 3′ adjacent sequences.
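The left adjustment of the 5′ flanking repeats amounts to numbering positions from a common origin: the first base of each repeat is position 1, and the adjacent sequence is numbered −1, −2, ... moving away from the repeat (there is no position 0). The following Python sketch tallies base occurrences under that numbering. The function name and the record layout (a pair of adjacent sequence and repeat) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def position_counts_5prime(records, adjacent_len=15):
    """Tally bases per position for left-adjusted 5' flanking repeats.

    `records` is an iterable of (adjacent, repeat) string pairs, where
    `adjacent` is the sequence immediately upstream of the repeat.
    Returns a dict mapping position -> Counter of bases, with repeat
    positions 1, 2, ... and adjacent positions -1, -2, ...
    """
    counts = defaultdict(Counter)
    for adjacent, repeat in records:
        # Keep at most `adjacent_len` bases closest to the repeat.
        adjacent = adjacent.upper()[-adjacent_len:]
        for offset, base in enumerate(reversed(adjacent), start=1):
            counts[-offset][base] += 1
        for pos, base in enumerate(repeat.upper(), start=1):
            counts[pos][base] += 1
    return counts
```

A right-adjusted tally for the 3′ repeats would mirror this, numbering from the last base of each repeat instead of the first.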

The second set containing flanking sequences shorter than 10 bp was left unadjusted, since many flanking repeats, particularly on the short end of the distribution, could not be confirmed with any certainty and were likely to represent coincidental matches. Instead, the entire 30-bp regions preceding 5′ ends of Alu and ID retroposons were compared with the analogous regions of the unadjusted set containing longer flanks, in terms of base occurrences at individual positions, as described in the following sections.

Two additional random sets of nonredundant sequence fragments 30 bp long were selected from human sequences deposited in GenBank. One contained 356 human sequence fragments with base composition similar to that of the flanking regions of Alu elements, and another included 913 sequences 30 bp long, chosen irrespectively of their base composition. Both sets were used to illustrate a degree of background fluctuations in base occurrences.

The flanking repeats and the adjacent sequences have been deposited in the repbase/publ directory and are available electronically (see the electronic address above).


Flanking Repeats over 9 bp Long.

The base occurrences at individual positions of 5′ flanking repeats 10 bp long or more, and of the 5′ adjacent regions in Alu and ID retroposons, are listed in Table 1, and analogous data for 3′ flanking repeats and their 3′ adjacent regions are listed in Table 2.

Table 1
Base occurrences at different positions of flanking repeats and of the 5′ adjacent regions
Table 2
Base occurrences at different positions of flanking repeats and of the 3′ adjacent regions

Qualitatively, the most striking pattern emerging from Table 1 is a previously unreported relative abundance of T in the 5′ adjacent regions (columns 3 and 9), particularly at positions −2 and −1 immediately preceding the 5′ flanking repeats of both Alu and ID retroposed elements. In contrast, the 5′ flanking repeats starting at position 1 begin with a very high content of A, particularly at positions 1–4, as observed before (5, 6). This suggests that the preferred 5′ nicking site is either between 5′ TT and AAAA or between 3′ AA and TTTT in the complementary strand. This consensus pattern has been verified by a χ² test that compares observed base occurrences at individual positions listed in Table 1 with the expected ones based on the overall base composition of the 5′ flanking repeats and of the adjacent regions listed in the middle of Table 1, as explained in the legend of Fig. 1. The χ² values are plotted in Fig. 1a and are compared against the reference value 16.27 indicated by a broken horizontal line, which corresponds to P = 0.001. The χ² values at positions −2 through +4 discussed above are significant at the 0.001 level in both Alu and ID retroposons.

Figure 1
χ² values for individual positions of Alu and ID flanking repeats and of adjacent regions. $\chi^2 = \sum_{i=1}^{4} (O_i - E_i)^2 / E_i$; $E_i = (\text{Total}) \times (\text{Composition}_i)$, where individual base occurrences ($O_i$), total numbers ...
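The per-position statistic in the legend of Fig. 1 can be written out as a short Python sketch. The four terms of the sum correspond to the four bases, and 16.27 is the reference value quoted in the text for P = 0.001 (three degrees of freedom). The function name and the example counts are hypothetical; they are not taken from Table 1.

```python
# Reference value from the text: chi-square at P = 0.001, 3 degrees of freedom.
CRITICAL_P001 = 16.27


def chi_square_position(observed, composition):
    """Chi-square for one sequence position.

    `observed` maps each base to its occurrence count O_i at the position.
    `composition` maps each base to its overall fraction (Composition_i),
    so that E_i = Total * Composition_i.
    """
    total = sum(observed.get(base, 0) for base in "ACGT")
    chi2 = 0.0
    for base in "ACGT":
        expected = total * composition[base]
        chi2 += (observed.get(base, 0) - expected) ** 2 / expected
    return chi2


# Hypothetical counts at a strongly T-enriched position.
example = chi_square_position(
    {"A": 10, "C": 10, "G": 10, "T": 70},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
)
is_significant = example > CRITICAL_P001
```

In this made-up example each expected count is 25, so the three depleted bases contribute 9 each and T contributes 81, giving 108, far above the reference line.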

Table 3 shows overall frequencies of different types of hexanucleotides located at statistically significant positions −2 through +4 (see Table 1 and Fig. 1a), in both Alu and ID sequences. The most abundant among them is TTAAAA and its six variants differing by one transition-type mutation (TTAAGA, TTAGAA, TTGAAA, TTAAAG, CTAAAA, and TCAAAA). They represent over 41.5% of all hexamers from Table 3. Over 20% is contributed by hexanucleotides differing from the above TTAAAA consensus and its six variants by an additional base substitution. In general, as the similarity to TTAAAA and its six variants goes down so does the occurrence of different hexanucleotides at the nicking site. The most diverse are hexanucleotides that occur once or twice, and they represent a little over one-third of the total number of hexamers from Table 3.

Table 3
Overall frequency of hexanucleotides around primary nicking sites associated with 400 Alu and ID sequences
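The grouping of hexanucleotides used above (the TTAAAA consensus, its six single-transition variants, and hexamers one further substitution away) can be reproduced with a small Python sketch. The function names and class labels are hypothetical, and the rule for the third class (Hamming distance 1 from the consensus or any of its six variants) is one reading of the text.

```python
CONSENSUS = "TTAAAA"
# Transitions: purine <-> purine, pyrimidine <-> pyrimidine.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def transition_variants(hexamer):
    """All sequences differing from `hexamer` by exactly one transition."""
    return [
        hexamer[:i] + TRANSITION[base] + hexamer[i + 1:]
        for i, base in enumerate(hexamer)
    ]


def hamming(a, b):
    """Number of mismatched positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def classify_hexamer(hexamer):
    """Assign a hexamer to one of four illustrative similarity classes."""
    hexamer = hexamer.upper()
    variants = transition_variants(CONSENSUS)
    if hexamer == CONSENSUS:
        return "consensus"
    if hexamer in variants:
        return "transition variant"
    if any(hamming(hexamer, ref) == 1 for ref in [CONSENSUS] + variants):
        return "one additional substitution"
    return "other"
```

`transition_variants("TTAAAA")` yields exactly the six variants listed in the text, one per position of the hexamer.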

As illustrated in Table 2 and Fig. 1b, the 3′ ends of flanking repeats also show nonrandom base occurrences at positions −4, −3, and −2 preceding the other nicking site. In particular, all three positions are pyrimidine-enriched, specifically T-enriched, in Alu and ID flanking repeats. The minimum consensus sequence shared by the 3′ ends of flanking repeats in both families of retroposed elements is 5′-TYTN-3′, where Y denotes pyrimidine; T, thymine; and N, any base. The base distribution at position −1 and at the following two positions in the 3′ adjacent regions may also be nonrandom, but it is significant at the 0.001 level only in Alu repeats.

It must be emphasized here that the consensus sequences only summarize the base occurrences from Tables 1 and 2 and should not be viewed as the entire representation of the underlying nonrandomness. For example, the 5′ TYTN consensus described above does not reflect the relative deficiency of A at position −2 (see Table 2) or differences in the relative proportions of T and C at positions −2, −3, and −4. These and similar factors must be included in future comparative analyses of the sequence signals associated with retroposons.
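For readers who want to screen sequences against the TYTN consensus, a minimal Python sketch of degenerate-code matching follows. As stressed above, a yes/no match of this kind discards the position-specific base proportions, so it is a coarse filter only. The function names are hypothetical, and only the degenerate codes needed here are included.

```python
# Subset of the standard nucleotide ambiguity codes used in the text.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT",    # pyrimidine
    "R": "AG",    # purine
    "N": "ACGT",  # any base
}


def matches_consensus(seq, consensus):
    """True if `seq` fits `consensus` position by position."""
    seq = seq.upper()
    if len(seq) != len(consensus):
        return False
    return all(base in AMBIGUITY[code] for base, code in zip(seq, consensus))


def ends_with_3prime_signal(repeat, consensus="TYTN"):
    """Check the last bases of a flanking repeat (positions -4 to -1)."""
    return matches_consensus(repeat[-len(consensus):], consensus)
```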

Flanking Repeats 4–9 bp Long.

The distinction between flanking repeats at least 10 bp long and the shorter ones is somewhat arbitrary, but it quite naturally follows the length distribution of Alu and ID flanking repeats as illustrated in Table 4. The distribution is clearly bimodal in the case of ID elements, which have numerous flanking repeats under 8 bp and over 9 bp long, but none exactly 8 or 9 bp long. The distinction between short and long flanking repeats is less sharp in the case of Alu, but it follows a similar pattern with a shallow minimum around the same length range.

Table 4
Length distribution of Alu and ID flanking repeats

The frequencies of flanking repeats over 9 bp long grow steadily with length and reach a maximum at 15 bp in ID and 16 bp in Alu elements. After that they abruptly decline, indicating some restrictions on flanking repeats over 15–16 bp long. Analogous length limitations over 15 bp were previously reported in non-SINE retroposons (5) (SINE, short interspersed element).

The abundance of flanking repeats under 7 bp may be artifactual, since many flanks may be incomplete, or even entirely missing, at either end of the inserted element. This is compounded by the growing chance of a random match between oligonucleotides as their length goes down. It is estimated from random simulations (data not shown) that ≈56% of 4-bp “flanking repeats” may come from coincidental matches.
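The effect of length on coincidental matches can be illustrated with a small Monte Carlo sketch in Python. It does not reproduce the simulations referred to above (their design is not described here) and is not expected to return the ≈56% figure, which is a fraction of observed repeats rather than a raw match probability. The assumptions are loud ones: uniform, independent base composition (real flanks are A-rich, which raises the rate), 50-bp flanks, and a hypothetical function name and parameters.

```python
import random


def coincidental_match_rate(k, window=50, trials=2000, seed=0):
    """Estimate how often a random k-mer finds a chance match.

    For each trial, two unrelated random flanks of `window` bases are
    drawn. The last `k` bases of the first flank (the would-be 5' repeat
    adjacent to the element) are searched for anywhere in the second.
    Returns the fraction of trials with at least one exact match.
    """
    rng = random.Random(seed)  # fixed seed for repeatable estimates
    hits = 0
    for _ in range(trials):
        upstream = "".join(rng.choice("ACGT") for _ in range(window))
        downstream = "".join(rng.choice("ACGT") for _ in range(window))
        if upstream[-k:] in downstream:
            hits += 1
    return hits / trials
```

Comparing `coincidental_match_rate(4)` with `coincidental_match_rate(10)` shows the trend stated in the text: chance matches are common for 4-bp candidates and become very rare at 10 bp.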

If at least some flanking repeats merely appear to be shorter because their matching copies are either truncated or missing, then the complete copies with the TTAAAA-like nicking signals may still be abundant in the regions preceding or following the retroposons. Fig. 2 compares two groups of 30-bp-long sequence regions immediately preceding 5′ ends of complete Alu elements, using a χ² test as described for Fig. 1. The first group of 30-bp-long segments includes 344 long flanking repeats discussed in the previous section, and the second one includes 356 shorter flanking repeats. In spite of their various lengths, the 5′ ends of flanking repeats were not adjusted in either group, which contributes to some “blurring” of the χ² distribution relative to that in Fig. 1a. Nevertheless, the χ² values go well above the random background in both groups, at positions −10 to −17 corresponding to the TTAAAA consensus signal. Smaller, but significant, χ² values can also be seen around positions immediately preceding the 5′ ends of Alu retroposons and corresponding to the 3′ signal shown in Fig. 1b. This clearly indicates that many presumed short flanking repeats indeed represent fragments of longer flanks indistinguishable from those described in the previous section.

Figure 2
χ² values for individual positions of 30-bp regions preceding 5′ ends of Alu retroposons. The χ² values were calculated and presented as explained in the legend of Fig. 1, except that flanking repeats and the adjacent ...

Flanking Repeats of Other Retroposed Pseudogenes.

Patterns observed for Alu and ID flanking repeats appear to be shared with B1 and B2 elements from rodents (12). Furthermore, 5′ ends of flanking repeats of recently retroposed rat and human L1 elements (refs. 13–15; see also compilation in ref. 16), and two of the three de novo retroposed mRNA pseudogenes in HeLa cells (17) begin within the 5′ TTAAAA sequence or its most common variants listed in Table 3. As illustrated in Fig. 3, similar patterns can be observed in a large variety of processed pseudogenes from mammals. Furthermore, a single example of an apoferritin pseudogene from frog, flanked by perfect direct repeats, was reported (18); it begins with AAAA and is preceded by tc, thus producing a 5′ tcAAAA pattern at positions −2 through +4. This pattern differs from the 5′ TTAAAA consensus signal by a single T → C substitution.

Figure 3
Examples of direct repeats flanking diverse processed pseudogenes from GenBank 97.0. GenBank accession numbers are listed before each sequence. The flanking repeats are indicated in uppercase letters and the adjacent sequences in lowercase. The omitted ...

The above examples indicate that at least one of the two patterns observed in Alu and ID flanking repeats can be found in most processed pseudogenes not only in mammals but also in amphibians, although more sequence data will be needed to substantiate the latter. This suggests a universal mechanism of retroposition in mammals and beyond.


The nonrandom distribution of bases near both ends of Alu and ID flanking repeats provides a strong argument in support of enzymatic involvement in the generation of the staggered nicks prior to retroposition. Enzymatic nicking is an important first step leading to reverse transcription and integration of R2 retroposed elements in insects (19). It is proposed here that enzymatic nicking also takes place prior to the integration of mammalian retroposons and is guided by short sequence signals described above. Of the two, the 5′ consensus signal (5′-TTAAAA/3′-AATTTT) appears to be more robust statistically and is a likely target for the initial nicking prior to the reverse transcription. The position of the second nick may be determined not only by the sequence signal alone but also by its distance from the first nicking site which, in turn, may depend on the distance between the active sites in the hypothetical nicking enzyme. It can be speculated further that if the best signal is not found within the preferred distance, then a weaker signal is chosen. Conversely, if a strong nicking signal appears closer or further away than the preferred 15- to 16-bp distance it can be accommodated by the nicking enzyme within certain limits. This simple model can explain the variable lengths of flanking repeats and the relative weakness of the sequence patterns associated with the 3′ ends of flanking repeats.

The existing model for RNA-mediated integration of R2 elements (20) does not account for the presence of flanking repeats. In a variant of this model, presented in Fig. 4, it is proposed that the antisense nick originally initiates RNA-dependent DNA polymerization (Fig. 4b), which is followed by the formation of the second nick toward the 3′ end of the sense strand. One possibility is that the formation of the second nick is associated with the ligation of the 5′ end of RNA to the exposed 3′ end of the sense strand. This could help to stabilize the transition from reverse transcription to DNA-dependent DNA polymerization and ligation of the antisense strand (Fig. 4c). However, alternative mechanisms involving specific proteins are also possible. The final stage includes elimination of the RNA and synthesis of the second DNA strand (Fig. 4d).

Figure 4
Model for retroposon integration in mammals. (a) Enzymatic nicking in the presence of RNA indicated by a vertical black arrow. (b) Synthesis of cDNA, indicated by a dotted line, and formation of the second nick, indicated by a black arrow pointed down. ...

While this paper was in review, I became aware of new evidence indicating that the reverse transcriptase encoded by the human L1 non-LTR (long terminal repeat) retrotransposon contains a domain homologous to the apurinic/apyrimidinic (AP) endonuclease family (16, 20). It is also capable of nicking DNA in vitro, primarily between runs of pyrimidines and purines in a very A+T-rich region, as shown in experiments with pBS plasmid DNA (16). Furthermore, the authors (16) compiled a list of additional L1 elements with identifiable flanking repeats and concluded that nicking signals in their flanking regions share common patterns with those identified by in vitro cleavage experiments. These conclusions complement independent observations made in this paper that the nicking signals associated with L1 elements follow the 5′-TTAAAA/3′-AATTTT pattern established for Alu and ID elements, which is similar but not identical to the consensus proposed by the authors (16). This adds significant weight to the previous hypothesis, based on different premises, that Alu and other mammalian retroposons parasitize the L1 retroposition machinery (21).

Similarity between AP endonucleases and the RNase H domains of certain reverse transcriptases from insects has also been reported (22), which indicates that the same mechanism for phosphodiester bond cleavage could be used in different steps involved in retroposon integration.

As indicated in Results, some 5′ hexamers from Table 3 do not seem to follow the 5′-TTAAAA consensus or any other significant sequence pattern. They represent about one-third of the studied sample. This leaves the possibility that significant numbers of retroposons still integrate at random chromosomal breaks as envisioned some time ago (23) and substantiated by recent experiments (24, 25). The majority, however, appear to be associated with specific targets, which may explain both frequent clustering and head-to-tail orientation of retroposons (7).

Generation of staggered breaks by an endogenous enzyme at nonrandom targets, combined with homologous recombination (26), can potentially lead to improved targeting of extrachromosomal DNA to predetermined chromosomal sites in mammals.


I thank Dr. Thomas Eickbush for helpful suggestions on retroposon integration mechanisms, Drs. Jef Boeke and Haig Kazazian and their coworkers for providing me with their manuscripts prior to submission, Paul Klonowski for skillful computer assistance, and Jolanta Walichiewicz for help with the preparation of this manuscript. This work was supported by the U.S. Department of Energy, Grant DE-FG0395ER62139.


1. Weiner A M, Deininger P L, Efstratiadis A. Annu Rev Biochem. 1986;55:631–661.
2. Deininger P L. In: Mobile DNA. Berg D E, Howe M M, editors. Washington, DC: Am. Soc. Microbiol.; 1989. pp. 619–636.
3. Deininger P L, Batzer M A. In: Evolutionary Biology. Hecht M K, MacIntyre R J, Clegg M T, editors. Vol. 27. New York: Plenum; 1993. pp. 157–196.
4. Van Arsdell S W, Denison R A, Bernstein L B, Weiner A M. Cell. 1981;26:11–17.
5. Moos M, Gallwitz D. EMBO J. 1983;2:757–761.
6. Daniels G R, Deininger P L. Nucleic Acids Res. 1985;13:8939–8954.
7. Jurka J. In: Molecular Biology Intelligence Unit: The Impact of Short Interspersed Elements (SINEs) on the Host Genome. Maraia R, editor. Austin, TX: Landes; 1995. pp. 25–41.
8. Deininger P L, Batzer M A. In: Molecular Biology Intelligence Unit: The Impact of Short Interspersed Elements (SINEs) on the Host Genome. Maraia R, editor. Austin, TX: Landes; 1995. pp. 43–60.
9. Jurka J, Klonowski P, Dagman V, Pelton P. Comput Chem. 1996;20:119–122.
10. Smith T F, Waterman M J. J Mol Biol. 1981;147:195–197.
11. Faulkner D V, Jurka J. Trends Biochem Sci. 1988;13:321–322.
12. Jurka J, Klonowski P. J Mol Evol. 1996;43:685–689.
13. Furano A V, Somerville C C, Tsichlis P N, D’Ambrosio E. Nucleic Acids Res. 1986;14:3717–3727.
14. Woods-Samuels P, Wong C, Mathias S L, Scott A F, Kazazian H H, Jr, Antonorakis S. Genomics. 1989;4:290–296.
15. Miki Y, Nishisho I, Horii A, Miyoshi Y, Utsunomiya J, Kinzler K W, Vogelstein B, Nakamura Y. Cancer Res. 1992;52:643–645.
16. Feng Q, Moran J V, Kazazian H H, Boeke J D. Cell. 1996;87:905–916.
17. Maestre J, Tchenio T, Dhellin O, Heidmann T. EMBO J. 1995;24:6333–6338.
18. Dickey L F, Sreedharan S, Theil E C, Didsbury J R, Wang Y-H, Kaufman R E. J Biol Chem. 1987;262:7901–7907.
19. Luan D D, Korman M H, Jakubczak J L, Eickbush T H. Cell. 1993;72:595–605.
20. Martin F, Olivares M, Lopez M C, Alonso C. Trends Biochem Sci. 1996;21:283–285.
21. Smit A F A, Toth G, Riggs A D, Jurka J. J Mol Biol. 1995;246:401–417.
22. Barzilay G, Hickson I D. BioEssays. 1995;17:713–719.
23. Hutchison C A, Hardies S C, Loeb D D, Shehee W R, Edgell M H. In: Mobile DNA. Berg D E, Howe M M, editors. Washington, DC: Am. Soc. Microbiol.; 1989. pp. 593–617.
24. Teng S-C, Kim B, Gabriel A. Nature (London) 1996;383:641–644.
25. Moore J K, Haber J E. Nature (London) 1996;383:644–646.
26. Rouet P, Smih F, Jasin M. Proc Natl Acad Sci USA. 1994;91:6064–6068.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences