Overview of our transposon insertion–discovery pipeline. A, The timeline for speciation of humans and chimpanzees is compared with the timeline for the generation of transposon insertions. Common insertions occurred before the two species diverged and are fixed in both species. “Species-specific” insertions are differentially present in the two species and occurred mostly during the past ∼6 million years. MYA = million years ago. B, Our strategy for identifying new transposon insertions in humans and chimpanzees. Recently mobilized transposons are flanked by target-site duplications (TSDs) and are precisely absent from one of the two genomes. One of the two copies of the TSD is found within the indel. Thus, the transposon plus one TSD copy equals the “fill.” C, Our computational pipeline. The five sequential steps for discovering species-specific transposon insertions in humans and chimpanzees are depicted.
The draft chimpanzee-genome (build panTro1) and human-genome (build hg17) sequences were obtained from the University of California, Santa Cruz (UCSC) Genome Browser (Kent et al. 2002). BAC clone sequences for the chimpanzee genome were obtained from GenBank (National Center for Biotechnology Information [NCBI]). BLAST programs also were obtained from NCBI. RepeatMasker was obtained from Arian Smit (Institute for Systems Biology). RepBase version 10.02 and the consensus sequence for the L1-Hs element were obtained from Jerzy Jurka (Jurka 2000). Full-length consensus sequences for L1-PA2, L1-PA3, L1-PA4, and L1-PA5 were obtained from GenBank (Boissinot et al. 2000). Custom MySQL databases and Perl scripts were generated as necessary. All analyses were performed locally on Sun Fire V40z or Dell PowerEdge 2500 servers running Linux operating systems.
Our computational pipeline began with the identification of all indels in humans versus chimpanzees, using genomic alignments that were generated with BLASTZ.
Next, indels containing transposons were identified using RepeatMasker (A. Smit, unpublished material) and RepBase version 10.02 (Jurka 2000). RepBase libraries for humans and chimpanzees were modified to include full-length L1-PA2, L1-PA3, L1-PA4, and L1-PA5 consensus sequences (Boissinot et al. 2000). TSDs were identified by applying a Smith-Waterman local alignment algorithm to the regions flanking each indel junction. The algorithm was restricted so that the optimum alignment had to be located within 5 bp of the indel junction. Aligned sequences shorter than 4 bp or having an identity of <90% were not scored as TSDs. A probability scoring system was developed to determine the likelihood that a given indel was caused by a single transposon insertion plus its TSD. This score was obtained by adding together the fractions of the indel that were accounted for by the transposon, its TSD, and a poly(A) tail (if present). A score of 1.0 indicated that the gap was fully accounted for by the transposon and associated sequences. We empirically determined that a lower cutoff of 0.85 provided accurate results while eliminating few, if any, true positives. SVA elements initially were annotated poorly by RepeatMasker, which often split a single SVA element into 2–3 segments and thus counted most elements more than once. We developed a new method to reassemble these segments into a single element, where appropriate.
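The TSD filter and the indel score described above can be illustrated with a minimal Python sketch. It is not the original Perl implementation. The function names, the use of base-pair counts as inputs, and the treatment of the 0.85 cutoff as inclusive are assumptions made for illustration; only the thresholds (4 bp, 90% identity, 0.85) come from the text. The sketch also assumes that the transposon, TSD, and poly(A) annotations do not overlap, so that their fractions can simply be summed.

```python
# Thresholds taken from the text.
MIN_TSD_LENGTH = 4        # alignments shorter than 4 bp are not scored as TSDs
MIN_TSD_IDENTITY = 0.90   # alignments with identity < 90% are not scored as TSDs
SCORE_CUTOFF = 0.85       # lower cutoff for accepting an indel (assumed inclusive)


def passes_tsd_filter(aligned_length: int, matches: int) -> bool:
    """Return True if a flank alignment qualifies as a TSD.

    aligned_length: length of the local alignment in bp (hypothetical input).
    matches: number of identical positions in that alignment.
    """
    if aligned_length < MIN_TSD_LENGTH:
        return False
    return matches / aligned_length >= MIN_TSD_IDENTITY


def insertion_score(indel_length: int, transposon_bp: int,
                    tsd_bp: int = 0, poly_a_bp: int = 0) -> float:
    """Fraction of the indel explained by transposon + TSD + poly(A) tail.

    Assumes the three annotations do not overlap one another.
    """
    if indel_length <= 0:
        raise ValueError("indel_length must be positive")
    return (transposon_bp + tsd_bp + poly_a_bp) / indel_length


def is_candidate_insertion(indel_length: int, transposon_bp: int,
                           tsd_bp: int = 0, poly_a_bp: int = 0) -> bool:
    """Apply the 0.85 lower cutoff to the indel score."""
    score = insertion_score(indel_length, transposon_bp, tsd_bp, poly_a_bp)
    return score >= SCORE_CUTOFF
```

For example, a 320-bp indel containing a 300-bp transposon, a 15-bp TSD, and a 5-bp poly(A) tail is fully accounted for, whereas a 1,000-bp indel with only 825 bp explained falls below the cutoff.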