# Estimating the age of retrotransposon subfamilies using maximum likelihood

ELIZABETH MARCHANI,^{a,*} JINCHUAN XING,^{b} DAVID J. WITHERSPOON,^{b} LYNN B. JORDE,^{b} and ALAN R. ROGERS^{c}

^{a}Division of Medical Genetics, University of Washington, BOX 357720, Seattle, WA 98195

^{b}Eccles Institute of Human Genetics, University of Utah, Salt Lake City, UT 84112

^{c}Department of Anthropology, University of Utah, Salt Lake City, UT 84112

^{*}Corresponding author: Elizabeth Marchani, PhD, Tel: 206-543-1471, Fax: 206-616-1973, Email: em27@u.washington.edu

## Abstract

We present a maximum likelihood model to estimate the age of retrotransposon subfamilies. This method is designed around a master gene model that assumes a constant retrotransposition rate. The statistical properties of this model and an *ad hoc* estimation procedure are compared using two simulated data sets. We also test whether each estimation procedure is robust to violation of the master gene model. According to our results, both estimation procedures are accurate under the master gene model. While both methods tend to overestimate ages under the intermediate model, the maximum likelihood estimate is significantly less inflated than the *ad hoc* estimate. We estimate the ages of two subfamilies of human-specific LINE-1 insertions using both estimation procedures. By calculating confidence intervals around the maximum likelihood estimate, our model can both provide an estimate of retrotransposon subfamily age and describe the range of subfamily ages consistent with the data.

**Keywords:** LINE-1, maximum likelihood, subfamily age, retrotransposon

## Introduction

Retrotransposons are genomic sequences capable of producing duplicates that insert into a new position within the host genome [1]. Retrotransposons disrupt host genetic structure as they duplicate themselves by inducing transductions, duplications, and deletions [2–4]. Retrotransposons can promote genetic instability, influence gene expression, and affect the process of double-strand breaks and DNA repair [3,5–7]. It is even thought that this genome shuffling could create the fertility barrier necessary for speciation to occur [8]. Retrotransposons thus act as powerful mutagens in the genome of their hosts.

These mobile elements not only contain information about themselves, but also about the history of their hosts. Retrotransposons accumulate mutations over time as their frequency distribution within the host population changes through the process of genetic drift. The known mechanics of retrotransposition make these elements especially well-suited as genetic markers. The ancestral state of retrotransposon insertions is always known to be the empty (no insertion present) allele and nearly all insertion sites are free of homoplasy [4,9–11]. Polymorphic subfamilies of retrotransposons are thought to have arisen within the last few million years, and therefore, their distribution and diversity reflect relatively recent history [12]. The biology of polymorphic mobile elements thus provides researchers mutational events describing a known period of evolutionary history.

Although retrotransposons are powerful generators of genomic variation, the number of active elements and rate of retrotransposition are not well understood. Under the strict master gene model, a single element, the “master gene,” generates all daughter elements within a subfamily [13]. Only one master gene is active at a time, eventually being replaced by another. Under the strict transposon model, all elements are capable of retrotransposition. Intermediate models assert that a few elements descended from a master gene are themselves capable of retrotransposition [14,15]. In this case, multiple master genes may coexist within a single subfamily. Identifying which model best describes available data has been difficult. Brookfield and Johnson [15] have shown that intermediate models can produce phylogenies that mimic those created by the master gene model as long as the number of retrotranspositionally-active elements is few and the rate at which elements are removed from the host genome is low. However, Cordaux and others [16] have shown that phylogenetic networks, rather than trees requiring bifurcating relationships, can be used to identify the number of active elements within a subfamily of *Alu* insertions. A reliable estimate of subfamily age is necessary to estimate reliable insertion rates and may help describe the underlying biological process of retrotransposition.

Established methods for estimating the age of subfamilies yield relative measures [17], require estimates of insertion rates [18], were developed for multispecies comparisons [19,20], or are restricted to recent subfamilies with polymorphic insertion frequencies [21]. These restrictions leave many researchers either to estimate subfamily age as slightly before the age of the most divergent element [22] or simply to estimate the ages of individual elements instead of the subfamily itself [23,24].

Here we evaluate two approaches that use sequence data to estimate the age of retrotransposon subfamilies. We introduce a maximum likelihood estimation procedure that incorporates individual retrotransposon sequences as well as the process by which those retrotransposons were ascertained. The second approach, which we call *ad hoc* estimation, uses the average sequence diversity among retrotransposons within a subfamily, as described elsewhere [25]. We describe the statistical properties of these methods by comparing their performance under the strict master gene model and an intermediate model using computer simulation. We then estimate the age of the Pre-Ta and Ta-1 subfamilies of human LINE-1 (L1) retrotransposons by applying both methods to published data. The Pre-Ta and Ta-1 subfamilies are included in our applied analysis because they contain several hundred members, have 3′ UTR sequences available, and the Pre-Ta subfamily is believed to be older than the Ta-1 subfamily based on sequence analysis and differing insertion frequency distributions [26,27]. This will allow a second test of accuracy, namely whether the estimates will show that the Pre-Ta subfamily is older than the Ta-1 subfamily.

## Results

### The age estimation procedures

For our likelihood estimates, we assume that the master gene(s) generate daughter elements at a constant rate. Each daughter element begins as an exact replicate of the master gene, accumulating mutations at a neutral rate. Each of these new mutations is a novel event that occurred after insertion as described by the infinite sites model [28]. The age *T* of a subfamily is measured backwards in mutational time in units of 1/*u*, where *u* is the mutation rate per base pair (bp) per year. In this way, each element *i* inserted *t_{i}* units of time ago is expected to differ from the master gene by an average of *t_{i}* mutations per nucleotide of sequence. Elements within the subfamily are inserted at a uniform rate across the interval [0, *T*], and so the expected value of *t_{i}* for a randomly chosen element is *T*/2. This value is both the expected number of mutations per base pair of sequence on a random element and the expected age of that random element in mutational time. Multiplying this time estimate by two approximates the age of the subfamily. This defines the *ad hoc* estimation method previously described [25].
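The *ad hoc* calculation can be sketched in a few lines. This is a minimal illustration, assuming the divergence is pooled across elements (total substitutions over total base pairs) before doubling; the function name `ad_hoc_age` and the toy inputs are hypothetical and not taken from the authors' scripts.

```python
def ad_hoc_age(substitutions, lengths):
    """Ad hoc subfamily age: twice the pooled per-bp sequence divergence."""
    total_subs = sum(substitutions)
    total_bp = sum(lengths)
    if total_bp <= 0:
        raise ValueError("need at least one base pair of sequence")
    # Mean divergence estimates T/2, so doubling it approximates T.
    return 2.0 * total_subs / total_bp


# Toy input: three hypothetical elements (not real data).
print(ad_hoc_age([1, 0, 2], [200, 210, 190]))  # 2 * 3 / 600 = 0.01
```

The result is in units of mutational time, so it still has to be divided by a mutation rate to be read in years.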

The new model begins with a data set of sequences for all *n* elements belonging to the subfamily identified within a single haploid genome. Each element *i* has a length of *k_{i}* base pairs with *x_{i}* substitutions relative to the consensus or ancestral sequence. The distribution of substitution events observed per base pair within element *i* is Poisson with mean *t_{i}*, where *t_{i}* is the insertion date of element *i*.

The likelihood of an estimate of *T* is a function of both the ascertainment process and the mutational process that generates sequence diversity. As we assume that new elements are equally likely to hit the haploid genome at any point in time between *T* and the present, the probability density of a retrotransposition event can be written as 1/*T*. The number of mutations hitting the sequence of the *i*th element is a Poisson random variable with mean *k_{i}t_{i}*. Conditional on *k_{i}t_{i}*, the likelihood of the *i*th element is

$$\Pr(x_i \mid k_i t_i) = \frac{(k_i t_i)^{x_i}\, e^{-k_i t_i}}{x_i!} \tag{1}$$

The probability of element *i* appearing in the data set is found by integrating this probability over the possible insertion dates. The likelihood of *T* given the data is therefore equal to

$$L(T) = \prod_{i=1}^{n} \frac{1}{T} \int_{0}^{T} \frac{(k_i t)^{x_i}\, e^{-k_i t}}{x_i!}\, dt \tag{2}$$

The maximum likelihood estimate $\hat{T}$ is found by setting the derivative of the log-likelihood equal to zero and solving for *T*. The derivative of the log-likelihood of *T* is equal to

$$\frac{d \ln L(T)}{dT} = \sum_{i=1}^{n} \frac{(k_i T)^{x_i}\, e^{-k_i T}}{\int_{0}^{T} (k_i t)^{x_i}\, e^{-k_i t}\, dt} - \frac{n}{T} \tag{3}$$

The sampling variance of $\hat{T}$ is estimated by the negative reciprocal of the second derivative of the log-likelihood evaluated at *T* = $\hat{T}$. $\hat{T}$ can be interpreted in years by dividing the estimate by a sequence mutation rate per bp per year. We used two estimates of this sequence mutation rate to interpret our results. The first DNA sequence mutation rate (0.105% per million years) is derived from estimates of pseudogene sequence divergence between human and chimpanzee populations, assuming that the two populations split 6 million years ago (MYA) from a shared ancestral population with an effective population size of 10^{4} [29]. The second DNA sequence mutation rate (0.25% per million years) is derived from sequence divergence between human-specific and orangutan-specific L1 subfamilies [22].
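The estimation procedure described in this section can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' Matlab implementation: it assumes a uniform insertion density 1/*T* and Poisson substitution counts with mean *k_{i}t*, uses the identity that for integer *x* the integral of the Poisson probability over [0, *T*] equals P(Poisson(*kT*) > *x*)/*k*, maximizes the log-likelihood by golden-section search (which assumes a single peak inside the bracket), and approximates the second derivative by finite differences. All function names, the search bracket, and the toy data are hypothetical.

```python
import math


def log_tail(x, mean):
    """log P(Poisson(mean) > x), summing the upper tail to avoid cancellation."""
    j = x + 1
    log_first = -mean + j * math.log(mean) - math.lgamma(j + 1)
    term = total = 1.0  # terms are kept relative to the first tail term
    while term > 1e-17 * total:
        j += 1
        term *= mean / j
        total += term
    return log_first + math.log(total)


def log_likelihood(T, subs, lengths):
    """Sum over elements of log[(1/T) * integral_0^T Poisson(x; k*t) dt]."""
    # For integer x the integral equals P(Poisson(k*T) > x) / k.
    return sum(log_tail(x, k * T) - math.log(k) - math.log(T)
               for x, k in zip(subs, lengths))


def ml_age(subs, lengths, lo=1e-6, hi=0.1, tol=1e-9):
    """Golden-section search for the T that maximizes the log-likelihood."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = log_likelihood(c, subs, lengths), log_likelihood(d, subs, lengths)
    while b - a > tol:
        if fc > fd:  # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = log_likelihood(c, subs, lengths)
        else:        # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = log_likelihood(d, subs, lengths)
    return (a + b) / 2


def sampling_variance(T_hat, subs, lengths):
    """Negative reciprocal of a finite-difference second derivative of log L."""
    h = 1e-3 * T_hat
    f = lambda T: log_likelihood(T, subs, lengths)
    second = (f(T_hat + h) - 2 * f(T_hat) + f(T_hat - h)) / h ** 2
    return -1.0 / second


# Toy data: eight hypothetical 200-bp elements (not real L1 data).
subs = [0, 1, 2, 1, 3, 0, 2, 1]
lengths = [200] * 8
T_hat = ml_age(subs, lengths)
print(T_hat, sampling_variance(T_hat, subs, lengths))
```

The default upper bracket of 0.1 is a convenience for short sequences with low divergence; it would need widening for older subfamilies, and very large values of *kT* can overflow the tail sum in this simple form.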

### Simulation

Although the master gene model is a fair approximation for the amplification dynamics of L1 retrotransposons, there are notable exceptions: some mutations hit the L1 master gene(s) and eventually lead to subfamily-specific mutations, while multiple L1 elements within polymorphic subfamilies are full-length and capable of retrotransposition [14]. The performance of the new model and the *ad hoc* estimate are evaluated using computer simulation under the master gene model and under an intermediate model that allows for multiple active elements.

The first simulated data set tests the performance of the maximum likelihood estimation procedure and *ad hoc* estimation under the master gene model. A master gene is inserted at time *T* proportional to 6 MYA (approximating the human-chimpanzee divergence) and spawns a number *n* of daughter elements in a haploid genome. Each daughter element *i* accumulates mutations under the Poisson distribution with mean *k_{i}t_{i}* conditional on its insertion date *t_{i}*. Both *n* and the distribution of *k_{i}* are set equal to those observed in the Ta-1 data set described elsewhere [2]. We generated 10^{4} data sets by sampling from the Poisson distribution and the distribution of *k_{i}* just described. Each of these simulated data sets is used to estimate *T*.
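One replicate of the master gene simulation can be sketched as follows. This is a hedged illustration: the insertion dates are drawn uniformly on [0, *T*] and the substitution counts from a Poisson distribution with mean *k_{i}t_{i}*, as described above, but the mutation rate used to express 6 MYA in mutational time, the constant element length, and the function names are all assumptions made for the example.

```python
import math
import random


def poisson(mean, rng):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_master_gene(T, lengths, rng):
    """One data set: uniform insertion dates, Poisson substitution counts."""
    dates = [rng.uniform(0.0, T) for _ in lengths]
    subs = [poisson(k * t, rng) for k, t in zip(lengths, dates)]
    return dates, subs


rng = random.Random(1)
T_true = 6.0 * 0.0025    # 6 MY at an assumed rate of 0.25% per MY
lengths = [886] * 191    # placeholder lengths, not the real Ta-1 k_i values
dates, subs = simulate_master_gene(T_true, lengths, rng)
print(2.0 * sum(subs) / sum(lengths))  # ad hoc estimate for this replicate
```

Repeating the last three lines many times and collecting the estimates gives the kind of sampling distribution summarized in the simulation results.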

The second simulated data set represents an intermediate model of retrotransposition. Data sets are generated as described under the master gene model, except that 20% of generated elements are allowed to have spawned not from the master gene but from an older daughter element. These “granddaughter” elements represent the product of retrotranspositionally-active daughters of the original master gene. Each granddaughter inherits the mutations already accumulated by its parental element while still accumulating additional mutations as it matures.
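The intermediate model can be sketched by extending a master gene simulation with granddaughters. The details below are one possible reading of the description and are assumptions: each element becomes a granddaughter with probability 0.2 when an older element exists, its parent is chosen uniformly among older elements, and the inherited mutations are drawn as Poisson with mean *k* times the time the parent existed before the granddaughter was copied. Function names and parameter values are hypothetical.

```python
import math
import random


def poisson(mean, rng):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_intermediate(T, lengths, rng, p_grand=0.2):
    """One data set in which some elements descend from older daughters."""
    dates = [rng.uniform(0.0, T) for _ in lengths]
    subs = []
    for k, t in zip(lengths, dates):
        own = poisson(k * t, rng)            # mutations since own insertion
        older = [d for d in dates if d > t]  # candidate parental elements
        inherited = 0
        if older and rng.random() < p_grand:
            parent_date = rng.choice(older)
            # Mutations the parent carried when the granddaughter was copied.
            inherited = poisson(k * (parent_date - t), rng)
        subs.append(own + inherited)
    return dates, subs


rng = random.Random(7)
T_true = 0.015           # assumed true age in mutational time
lengths = [886] * 191    # placeholder lengths, not the real Ta-1 k_i values
estimates = []
for _ in range(40):
    _, subs = simulate_intermediate(T_true, lengths, rng)
    estimates.append(2.0 * sum(subs) / sum(lengths))
print(sum(estimates) / len(estimates) / T_true)  # ratio above 1 means inflation
```

Under these assumptions a granddaughter carries on average the divergence of its older parent, which is why age estimates drift upward as the share of granddaughters grows.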

Table 1 summarizes the simulation results, reporting characteristics of the distribution of subfamily age estimates by simulation model. These distributions are illustrated in Figure 1. The null hypotheses that the new maximum likelihood model and the *ad hoc* estimate produce estimates of *T* equal to the true value cannot be rejected by the simulated data sets (P > 0.05). Under the strict master gene model, the average maximum likelihood estimate ($\hat{T}$ = 5.9967 MYA) and average *ad hoc* estimate ($\hat{T}$ = 6.0032 MYA) are accurate and do not significantly differ (P > 0.05). Neither method is particularly biased, with relative biases less than 0.1% of their estimate of *T*.

Under our intermediate model simulation, the average maximum likelihood estimate ($\hat{T}$ = 6.4689 MYA) and average *ad hoc* estimate ($\hat{T}$ = 6.5908 MYA) are clearly inflated, but neither is able to reject the true value of 6 MYA (P > 0.05). This upward bias is shown in Figure 1, as the age distributions estimated from the simulations based on the intermediate model are shifted to the right, relative to the distribution of age estimates from the simulations based on the master gene model. The distribution of maximum likelihood estimates is less shifted than the *ad hoc* distribution, producing an average maximum likelihood estimate of *T* that is significantly less than the average *ad hoc* estimate (P < 0.025). While their relative biases are comparable, the new maximum likelihood estimate is slightly less biased than the *ad hoc* estimate (relative bias = 7.25% and 8.96%, respectively).
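The relative biases quoted above can be checked with a short calculation. The sketch assumes that relative bias is the difference between the mean estimate and the true value, divided by the mean estimate; that definition is an inference from the wording "of their estimate", and the helper name `relative_bias` is hypothetical.

```python
def relative_bias(mean_estimate, true_value):
    """Bias expressed as a fraction of the mean estimate (assumed definition)."""
    return (mean_estimate - true_value) / mean_estimate


# Mean estimates under the intermediate model, in MYA, with a true age of 6 MYA.
for label, estimate in [("ML", 6.4689), ("ad hoc", 6.5908)]:
    print(label, round(100 * relative_bias(estimate, 6.0), 2))
# prints "ML 7.25" and "ad hoc 8.96"
```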

### Application

L1s are non-LTR (Long Terminal Repeat) retrotransposons that have been actively inserting into the mammalian genome for 150 million years and number ~0.5 million copies in the human genome [3,30,31]. Nearly all L1 elements have been silenced by 5′ truncation, inversions, and point mutations [3,26,32]. Although several full-length L1s are capable of generating daughter elements, the majority of new insertions are believed to be generated by a small subset of “hot” L1s [14], and so follow an intermediate model of retrotransposition. Pre-Ta is the oldest of the human polymorphic L1 subfamilies. The 362 unique Pre-Ta elements identified in the haploid human genome have an average age of 2.34 million years [26]. Analysis of ~208bp of sequence from these 362 Pre-Ta elements produces a maximum likelihood estimate of *T* equal to 0.0108 units of mutational time (95% CI: 1.0765 × 10^{−2}, 1.0835 × 10^{−2}). A total of 404 substitutions are observed in 72,872bp of sequence, equaling a sequence divergence among Pre-Ta elements of 0.55%. This sequence divergence yields an *ad hoc* estimate of *T* equal to 0.0110 units of mutational time.

Ta-1 is the youngest subfamily of human polymorphic L1s, with elements averaging an age of 1.71 million years [2]. Analysis of ~886bp of sequence from the 191 Ta-1 elements ascertained in the haploid human genome database [2] leads to a maximum likelihood estimate of *T* equal to 0.0050 units of mutational time (95% CI: 4.9715 × 10^{−3}, 5.0285 × 10^{−3}). A total of 402 substitutions are observed in 154,384bp of sequence. This indicates a sequence divergence of 0.26% within the Ta-1 subfamily, leading to an *ad hoc* estimate of *T* equal to 0.0052 units of mutational time.

In order to interpret $\hat{T}$, it can be converted from mutational time to years by dividing it by an appropriate DNA sequence mutation rate. If we apply a mutation rate of 0.105% per million years, as estimated from the sequence divergence between human and chimpanzee pseudogenes [29], the age of the Pre-Ta subfamily is estimated to be 10.29 MYA (95% CI: 10.25, 10.32 MYA) using the maximum likelihood estimate, or 10.48 MYA using the *ad hoc* estimate. This same mutation rate estimates the age of the Ta-1 subfamily to be 4.79 MYA (95% CI: 4.76, 4.82 MYA) using the maximum likelihood estimate or 4.95 MYA using the *ad hoc* estimate. If instead we apply an L1-specific DNA sequence mutation rate of 0.25% per million years, as estimated from the sequence divergence between human-specific and orangutan-specific L1 subfamilies [22], the age of the Pre-Ta subfamily is estimated to be 4.32 MYA (95% CI: 4.31, 4.33 MYA) using the new maximum likelihood estimate, or 4.40 MYA using the *ad hoc* estimate. This L1-specific mutation rate estimates the age of the Ta-1 subfamily to be 2.01 MYA (95% CI: 2.00, 2.02 MYA) using the maximum likelihood estimate compared to the *ad hoc* estimate of 2.08 MYA. In every case, the maximum likelihood estimate of *T* is significantly less than the *ad hoc* estimate of *T* (P < 0.025).
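The conversion from mutational time to years is a single division, sketched below. The function name is hypothetical, and the inputs are the rounded estimates quoted in the text, so results for Ta-1 may differ in the last digit from values computed with unrounded estimates.

```python
def mutational_time_to_mya(T, rate_percent_per_my):
    """Convert an age in mutational time to millions of years."""
    return T / (rate_percent_per_my / 100.0)


# Pre-Ta maximum likelihood estimate under the two mutation rates.
print(round(mutational_time_to_mya(0.0108, 0.105), 2))  # 10.29
print(round(mutational_time_to_mya(0.0108, 0.25), 2))   # 4.32
```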

## Discussion

Our simulation study indicates that the maximum likelihood model and the *ad hoc* procedure both closely predict the true value of *T*, though they are biased upwards as the number of retrotranspositionally-active elements in the subfamily increases. Despite this slight bias, both estimates failed to reject the true value of *T* under both the master gene model and the intermediate model. This suggests both methods are robust to moderate violation of the master gene model. However, the average maximum likelihood estimate is significantly less than the average *ad hoc* estimate under the intermediate model.

When applied to real data collected for Pre-Ta and Ta-1 subfamilies of human L1s, the Pre-Ta subfamily was reliably estimated to be approximately twice the age of the Ta-1 subfamily. Although the maximum likelihood and *ad hoc* estimates of *T* were quite similar, the maximum likelihood estimate is significantly less than the *ad hoc* estimate (P < 0.025). This is consistent with what we observed in our simulation study under the intermediate model, although the differences in the applied case are less extreme than in the simulation study.

As demonstrated in the simulation study, both our maximum likelihood and *ad hoc* methods inflate estimates of *T* as the proportion of active elements increases. As it is known that L1 subfamilies do not strictly follow the master gene model, it is likely our estimates of *T* are inflated. It is difficult to determine the exact amount of bias in our estimates in this applied example. However, we do know that ~26% of Pre-Ta L1s are approximately full length, while ~31% of Ta elements (including the Ta-0 and Ta-1 subfamilies) are approximately full length [2,26]. Using Brouha et al.’s [14] observation that ~7% of full length L1 elements are “hot” L1s, we can estimate that approximately 2% of all Pre-Ta and Ta-1 elements account for the majority of retrotransposition within their subfamilies. The results of our simulation study therefore suggest that our estimates of L1 subfamily ages are likely to be minimally biased.
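The figure of approximately 2% follows from multiplying the share of full-length elements by the share of full-length elements that are “hot”. A worked version, assuming the ~7% figure applies equally to both subfamilies:

$$0.07 \times 0.26 \approx 0.018 \ \text{(Pre-Ta)}, \qquad 0.07 \times 0.31 \approx 0.022 \ \text{(Ta)}$$

Both products are close to 2%.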

The conversion of the estimated values into units of millions of years highlighted an important difference between the mutation rate estimated from pseudogenes and the rate estimated from L1 sequence data. Using the lower pseudogene mutation rate, the Pre-Ta subfamily is estimated to have emerged 10.29 MYA using our maximum likelihood approach. As both subfamilies are known to be human-specific polymorphisms, this date in excess of 6 MYA does not seem reasonable. If instead the L1-specific mutation rate is used, the maximum likelihood estimates of *T* suggest that the Pre-Ta subfamily emerged with the *Australopithecines*, while the Ta-1 subfamily arose at the dawn of the genus *Homo* [33]. Violation of the assumptions of the master gene model, such as a variable L1 insertion rate or multiple “hot” L1s, could alter sequence diversity within L1 subfamilies, and therefore cause the L1-specific DNA sequence mutation rate to differ from the pseudogene mutation rate.

In the search for the age of retrotransposon subfamilies, this paper has introduced a maximum likelihood estimation procedure and compared its statistical properties to those of the *ad hoc* procedure. The two methods produce similar estimates and accurately estimate the age of a subfamily when that subfamily mimics the master gene model of retrotransposition. This work suggests that the *ad hoc* method can be used to easily obtain the age of a retrotransposon subfamily, while the new maximum likelihood method may be used to estimate confidence intervals around such an age. Significant differences between the *ad hoc* and maximum likelihood estimates of *T* suggest violation of the master gene model and may implicate an intermediate model of retrotransposition.

While the new maximum likelihood estimation procedure may perform well for polymorphic subfamilies of human L1s, not all families of mobile elements follow the master gene model as closely. Our simulation study results suggest that as the deviation from the master gene model increases, so do both the maximum likelihood and *ad hoc* estimates of *T*. Care should be taken when interpreting results from our method when it is applied to subfamilies of retrotransposons known to strongly deviate from the master gene model. One approach may be to use our model to independently analyze subfamilies, or clusters of elements within a subfamily, that appear to descend from a very few master genes, as we did for the two active subfamilies of human L1s.

Future development of this model could relax some of its underlying assumptions. The model has been designed to analyze active subfamilies, though it could be extended to allow for the study of inactive subfamilies. However, this extension would require knowing the time at which the subfamily became inactive, which may not be estimable with certainty. The assumptions of the maximum likelihood model could be modified to incorporate a variable number of active master genes within a given subfamily, or to allow fluctuation in retrotransposition rate over time. The results of our simulation study suggest that both our maximum likelihood estimate and the *ad hoc* estimate will be inflated in the presence of multiple active master genes, while variation in retrotransposition rate over time will likely bias estimates of subfamily age toward time periods with high retrotransposition rates. Until further development, the model in its current form provides insights into retrotransposon biology and may be applied to active retrotransposon subfamilies believed to approximately follow the master gene model of retrotransposition.

## Materials and Methods

### Simulation

We assess the accuracy of the new model and *ad hoc* estimation using 95% confidence intervals and estimates of bias. Our empirical 95% confidence intervals represent the central 95% of $\hat{T}$ values estimated from simulated data under each condition. If the empirical 95% confidence interval excludes *T* = 6 MYA, then we can reject the null hypothesis that our estimate is equal to the true value of *T*. We calculate a 95% confidence interval about the mean maximum likelihood estimate of *T* as

$$\hat{T} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \tag{4}$$

where *σ* = the sample standard deviation and *n* = the number of sequences analyzed. This calculated 95% confidence interval is used to test the null hypothesis that the mean maximum likelihood estimate is greater than or equal to the *ad hoc* estimate. If the *ad hoc* estimate of *T* is too large to be captured by the maximum likelihood 95% confidence interval, we are able to reject the null hypothesis at the 0.025 level. The bias of an estimate is calculated as the square root of the difference between the mean squared error and the variance of the estimate. Bias then describes the degree to which the estimate is shifted away from the true value.
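The summary statistics described in this section can be sketched as follows. This is an illustration under assumptions: the empirical interval uses a simple index cut on the sorted estimates, the interval about the mean uses a normal critical value of 1.96, and the variance inside the bias calculation is the population variance so that the bias equals the absolute difference between the mean estimate and the true value. Function names and the toy numbers are hypothetical.

```python
import math
import statistics


def empirical_interval(estimates, level=0.95):
    """Central `level` share of the sorted estimates (simple index cut)."""
    ordered = sorted(estimates)
    cut = int(round((1.0 - level) / 2.0 * len(ordered)))
    return ordered[cut], ordered[len(ordered) - 1 - cut]


def mean_interval(estimates, z=1.96):
    """Normal-approximation interval about the mean of the estimates."""
    centre = statistics.mean(estimates)
    half = z * statistics.stdev(estimates) / math.sqrt(len(estimates))
    return centre - half, centre + half


def bias(estimates, true_value):
    """Square root of (mean squared error minus variance)."""
    mse = statistics.mean([(e - true_value) ** 2 for e in estimates])
    variance = statistics.pvariance(estimates)
    return math.sqrt(max(mse - variance, 0.0))


# Toy estimates in MYA (not simulation output), with a true age of 6 MYA.
toy = [5.8, 6.1, 6.4, 6.6, 6.9]
print(round(bias(toy, 6.0), 2))  # 0.36
print(mean_interval(toy))
```

Because the square root discards the sign, the direction of the bias has to be read from the mean estimate itself.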

### Application

Sequences of the 3′ UTR belonging to human-specific Pre-Ta and Ta L1 elements were collected from the literature [2,26]. Ta-1 elements were identified from this data set using subfamily-defining mutations [32]. Clustered substitutions, inversions, or other mutations not resulting from single base misincorporation were eliminated from the analysis [34]. We analyzed ~208bp of sequence from each of 362 Pre-Ta L1s ascertained in the haploid human genome [26]. For comparison, we also analyzed ~886bp of sequence from each of 191 Ta-1 L1s ascertained in the haploid human genome [2]. The sequences were then compared to their consensus and the number of substitutions was recorded. These values were then evaluated using the new maximum likelihood estimator to find an estimate of *T*. The total number of substitutions observed and base pairs analyzed were used to calculate sequence divergence within each subfamily. This was then used to calculate the *ad hoc* estimate [25]. 95% confidence intervals were constructed as given in equation 4.

Scripts to perform these analyses were written using Matlab and are available from the authors upon request.

## Acknowledgments

We thank Henry Harpending for his helpful comments during the preparation of this manuscript. This research was supported by NIH grant GM-59290 and NSF grant BCS-0218370.

## Footnotes

**Publisher's Disclaimer:** This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

## References
