![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||||||
Copyright Benachenhou et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Evolutionary Conservation of Orthoretroviral Long Terminal Repeats (LTRs) and ab initio Detection of Single LTRs in Genomic Data 1Section of Virology, Department of Medical Sciences, Uppsala University, Uppsala, Sweden 2Unit of Physiology, Department of Neuroscience, Uppsala University, Uppsala, Sweden 3Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki and Laboratory of Computer and Information Science, Helsinki University of Technology, Helsinki, Finland Niyaz Ahmed, Editor University of Hyderabad, India * E-mail: Jonas.Blomberg/at/medsci.uu.se Conceived and designed the experiments: FB PJ MO GS VB PS SK JB. Performed the experiments: FB. Analyzed the data: FB PJ MO GS VB PS SK JB. Contributed reagents/materials/analysis tools: JB. Wrote the paper: FB JB. ¤Current address: Department of Molecular Biology and Microbiology, Tufts University School of Medicine, Boston, Massachusetts, United States of America Received December 4, 2008; Accepted March 10, 2009. This article has been cited by other articles in PMC.Abstract Background Retroviral LTRs, paired or single, influence the transcription of both retroviral and non-retroviral genomic sequences. Vertebrate genomes contain many thousand endogenous retroviruses (ERVs) and their LTRs. Single LTRs are difficult to detect from genomic sequences without recourse to repetitiveness or presence in a proviral structure. Understanding of LTR structure increases understanding of LTR function, and of functional genomics. Here we develop models of orthoretroviral LTRs useful for detection in genomes and for structural analysis. Principal Findings Although mutated, ERV LTRs are more numerous and diverse than exogenous retroviral (XRV) LTRs. Hidden Markov models (HMMs), and alignments based on them, were created for HML- (human MMTV-like), general-beta-, gamma- and lentiretroviruslike LTRs, plus a general-vertebrate LTR model. Training sets were XRV LTRs and RepBase LTR consensuses. The HML HMM was most sensitive and detected 87% of the HML LTRs in human chromosome 19 at 96% specificity. By combining all HMMs with a low cutoff, for screening, 71% of all LTRs found by RepeatMasker in chromosome 19 were found. HMM consensus sequences had a conserved modular LTR structure. Target site duplications (TG-CA), TATA (occasionally absent), an AATAAA box and a T-rich region were prominent features. Most of the conservation was located in, or adjacent to, R and U5, with evidence for stem loops. Several of the long HML LTRs contained long ORFs inserted after the second A rich module. HMM consensus alignment allowed comparison of functional features like transcriptional start sites (sense and antisense) between XRVs and ERVs. Conclusion The modular conserved and redundant orthoretroviral LTR structure with three A-rich regions is reminiscent of structurally relaxed Giardia promoters. The five HMMs provided a novel broad range, repeat-independent, ab initio LTR detection, with prospects for greater generalisation, and insight into LTR structure, which may aid development of LTR-targeted pharmaceuticals. Introduction Retroviral long terminal repeats (LTRs) are elaborate structures with important functions. They are the hallmark of the LTR retrotransposons which constitute from several percent to over half of many genomes. Understanding LTR structure and its evolution is a major problem in retrovirology and in genomics. The retroviruses are grouped in seven genera: alpha-, beta-, gamma-, delta-, epsilon-, lenti- and spumaretrovirus [1]. Of these, the first six are classified as orthoretroviruses. Endogenous proviruses resembling the beta, gamma and spuma genera are common in vertebrates such as humans and mice. Alpharetroviruses have been found only in birds. Epsilonretroviruses have been predominantly detected in fish and amphibians. Endogenous lentiviral sequences were recently found in rabbits [2], whereas deltaretroviruses so far have no known endogenous counterparts. LTRs contain regulatory sequences such as promoters, polyadenylation signals/sites and enhancers, and can therefore dramatically influence the RNA expression of both retroviral and nonretroviral sequences [3]. Vertebrate genomes contain thousands of endogenous retroviruses (ERVs) and, naturally, also thousands of LTRs. Single LTRs frequently form by excision of internal proviral structure after homologous recombination between the two proviral LTRs, identical at the time of provirus integration. Detection of single LTRs is a bioinformatical challenge and are difficult to align due to variable structures. Even more challenging is to find the common structure of LTRs and to understand their function. Part of the difficulty is that ERVs may be several 100 million years old [4] and have accumulated many deleterious mutations. LTR recognition can be aided by the presence of target site duplications, and a few landmarks like AATAAA and TATA, together with promoter recognition algorithms [5], [6], [7], [8], but these features alone are too imprecise and prone to false positivity to allow genome wide searches. Paired LTRs in proviruses are less ambiguous. In fact, presence of two LTRs (formed identical during integration), separated by a characteristic distance, is one of the best ways to detect proviruses. The LTR_STRUC program [9], which is based on this method, is successful in identifying known families and discovering new families of LTR retrotransposons. Similar techniques are LTR_FINDER [10], a combined use of profile HMMs and other retrotransposon characteristics [11], [12], LTR_par [13] and LTRharvest [14]. However, in the human genome, most LTRs are present as single LTRs; they are about ten times more numerous [4] than proviral LTRs but cannot be detected by LTR_STRUC. A third means of LTR detection is based on their repetitiveness, i.e. occurrence several times in the genome. Their subsequent identification as retroviral LTRs rests on their occasional presence in proviral structures. The identification of the latter entails manual intervention and is not self evident. RepeatMasker (unpublished; see http://repeatmasker.org) searches for repeats, including single LTRs. RepeatMasker works against RepBase [15] which is a repeat database. The main drawbacks of RepeatMasker are that it is unable to detect low copy number LTRs, that it gives no information about their structure and that LTRs may be mislabelled as other kinds of repeats. Several other repeat-detecting new algorithms have been published, e.g. RECON [16], RepeatScout [17], PILER [18], and the use of multiple genomic alignments [19]. RetroTector©, a program package developed in our group [20] can detect endogenous retroviruses (ERVs) in genomic material in a repeat-independent way, and reconstruct LTRs and proteins. It uses the first two methods for detecting LTRs and is successful in identifying paired LTRs but less so in the case of single LTRs. During the work on RetroTector©, several features of LTR structure were identified, and incorporated into the LTRID program module. However, the sensitivity and specificity of these features were not enough for genome-wide analyses. We therefore had to find other pattern-recognition algorithms for single LTR detection. They are introduced in this paper. LTR structure is highly variable. LTRs vary widely in length from a few hundred base pairs to over one thousand base pairs, and in structure. Among the vertebrate retroviruses, the longest LTRs occur in the betaretroviruses, e.g. MMTV and the HML elements, and the spumaretroviruses. LTRs comprise three regions, the U3 region which contains enhancer and promoter elements, the R region and the U5 region. Some of the long LTRs, like the MMTV LTRs, contain open reading frames (ORFs) which distort the basic LTR structure. The R region starts at the transcription start site (TSS) and ends at the polyadenylation site. LTRs usually contain a TATA-box located in the U3 region which however is missing or is distorted in some retroviruses, e.g. ERV9 [21]. Instead, ERV9 has two other promoter elements: One located upstream of the transcription start site, a GC/GT-box binding to the Sp1 transactivator protein, and an initiator motif (Inr) located at the transcription start site. The presence of a GC/GT-box (in inverted form) is also crucial for the promoter activity of HERV-H LTRs while their TATA-box may be dispensable [22], [23]. LTRs always have a polyadenylation signal, usually in the R region. This signal is most often the sequence AATAAA but ATTAAA and AGTAAA also occur. It is normally located 10–30 nt upstream of the polyadenylation site [24], [25]. However, in the case of the betaretroviruslike (http://www.ncbi.nlm.nih.gov/ICTVdb/Ictv/index.htm) human MMTV-like group 4 (HML4) [4], [26], the mRNAs were found to be polyadenylated 69 or 84 nt downstream of the AATAAA motif [27]. [27] found evidence for a stem-loop structure in R, which reduces the distance between the AATAAA motif and the polyadenylation site. Similar findings had previously been made for the deltaretrovirus HTLV-1 where the Rex responsive element (RexRE) forms a stem-loop structure [28]. In HTLV-1, the interaction between the RexRE and the Rex protein is essential for the regulation of expression of viral genes. Deltaretroviruses like HTLV have an AATAAA motif which comes before the TATA-box. The structural basis behind this anomaly is uncertain but folding back of U3 onto R may be the explanation (J Blomberg, unpublished) [28]. The R region is important for both transcription initiation and regulation of gene expression in other retroviruses as well. Examples are HIV [29], MLV and related retroviruses [30], [31], [32] and MMTV [33]. Surprisingly, transcripts promoted by HML2 LTRs were found to initiate close to the AATAAA motif which may therefore also function as a TATA-box [34] (Fig S1). Recent reports of antisense retroviral transcripts, promoted from the 3′LTR, in gammaretrovirus- and betaretroviruslike HERVs, as well as in HIV and HTLV [35], [36], [37], [38], [39], [40], [41], [42], [43], emphasise the need for a deeper understanding of LTR structure and function. In the present study, we first improved the detection of single LTRs and second, learned more about their conserved structure. To do this, we used a mathematical tool, Hidden Markov Models (HMMs), Viterbi alignments [44] based on the HMMs, and proprietary heuristic algorithms to facilitate the modelling process and to remove false-positive sequences detected by the HMMs. The generality of LTR detection was assessed in several ways. One was the use of retroviral sequences from evolutionarily distant host genomes. For example, bony fishes, represented by zebrafish, diverged from other vertebrates around 400 million years ago [45], [46]. Birds, represented by chicken, diverged from other vertebrates around 300 million years ago [45], [46]. Marsupials, represented by opossum, diverged from placental mammals around 130 million years ago [47]. Although horisontal retroviral transfers of retroviruses between vertebrates have been common, see e.g. [1], these long periods of genetic separation in general correspond to a large difference of the ERVs of these species. The HMMs revealed LTR structural features common to several retroviral genera. Heterologous cross-genus HMM runs revealed the nucleotides responsible for the generalised LTR detection capability of some of the HMMs, which gave an insight into basic LTR structure, and, maybe, into basic LTR function. Although a completely general LTR detection was not achieved, several HMMs could detect LTRs in widely differing host species, showing that ab initio LTR detection was possible in a subset of LTRs. Results HMMs are widely used in pattern recognition, e. g. in speech recognition [44]. The two key features of the model building is the training of the model whereby the HMM is taught which group of sequences it should recognise and the evaluation of the model whereby the HMM is tested on sequences belonging to the group of interest but not part of the training set. Five HMMs were built for five groupings of vertebrate orthoretroviruses: Betaretroviral HERVs or HMLs, general betaretroviral ERVs, gammaretroviral ERVs, lentiviruses and general vertebrate orthoretroviruses. The sequences of the training and evaluation sets were mainly RepBase consensus sequences [15], see Text S1, S2, S3, S4, S5. We first explored the efficiency of the HMMs in detecting single LTRs in genomes, and then analyzed the HMM models to reveal conserved structures in LTRs. Testing HMMs for LTR detection in the human genome As can be seen from Table 1, the average score of the training set varied from 45 for the most specialised lenti HMM to 5.5 for the broadest model, the general HMM. Each HMM was used to align the training set as a way to visualise the HMM. To test the detection capability of the model, human chromosome 19 (63 million base pairs) was screened. The chromosome was screened using a sliding window typically 1200 nt long with an overlap of 600 nt (the length of an average LTR). The log of the probability of each chunk was computed with the forward algorithm given the HMM. The scoring of sequences was done using the log odds ratio [48], which is the logarithm of the ratio between the probability given the HMM and the probability given a null model. The null model was identical to the HMM except that its match states had emission probabilities 0.25 for each base.
Preliminary runs indicated the presence of false positive hits coming from very CT-rich chunks. To reject them, a routine calculating the CT-content of each chunk was added. The highest CT-content in a sliding 100 nt window was determined. If it was higher than 80%, the chunk was considered as false positive because approximately 98% of the gamma- and betaretroviruslike LTRs had a CT-content less than 80%, corresponding to two standard deviations above the mean for a normal distribution. The statistics were performed on the training sets for betaretrovirus- and gammaretroviruslike LTRs. The result of the screening was compared with the RepeatMasker output for chromosome 19 of the human genome version Hg15 downloaded 2005/07/30. A cross-correlation table between the LTR HMMs and RepeatMasker is shown in Table 2 and 3. The third, fourth and fifth columns are: The number of true positives (TP), i. e. positive by HMM and positive by RepeatMasker. The number of additional positives (AP), i. e. positive by HMM and negative by RepeatMasker. The number of false negatives (FN), negative by HMM but positive by RepeatMasker. The sensitivity is defined by the ratio TP/(TP+FN) and the specificity by the ratio TP/(TP+AP). This sensitivity and specificity are probably not the true ones. First of all, most detected retroviral groups were part of the HMM training sets meaning that detection may in part be due to overfitting. Furthermore, some retroviral groups in the studied genomes are heavily mutated, which makes it unrealistic to detect all members of such LTR groups. Last, RepeatMasker is not a perfect method. Thus, the calculated sensitivity and specificity only give an indication of the true values. To further evaluate the method, evaluation sets containing LTRs from opossum and other groups were used (see below).
The betaretroviruslike HERVs consist of HML1 to HML10 [26]. The specialised HML HMM (“Hml” in Table 2 and 3) detected 87% of the HML LTRs detected by RepeatMasker in chromosome 19, with 96% specificity. The RepBase name was mapped to the corresponding HML group according to [4], [49]. The corresponding figures for the broader beta HMM (“Beta” in Table 2 and 3) were 32% and 92%, respectively. These results are dependent on the choice of the scoring threshold (see Table 2 and 3). Therefore, we show the sensitivity and specificity at two different thresholds for each model (Table 2 and 3). The thresholds were chosen so that the number of additional positives (found by the HMM but not by RepeatMasker) was roughly the same among the models (in Table 2 a few tens of them in chromosome 19 and in Table 3 around 1000), making a comparison between them easier. At the lower threshold 2 instead of 7 the sensitivity of the beta HMM increased to 68% at the price of a lower specificity: 28% (Table 3). To check for the generality of the HML HMM, jackknifing was performed on the HMM training set and as can be seen in Table 4, the HMMs in the “HML HMM family” could detect most missing groups (excluded as part of the jackknifing scheme) in chromosome 19 with percent detection (of the missing groups alone) ranging from 6% to 88% and specificities around 90%, the exception being HML5.
The set of gammaretroviruslike HERVs can be divided into four main groups [1], [4]: i. The “HERV-E group”, also containing HERV-T and ERV3 [50]. This group is related to the exogenous retrovirus MLV and its relatives (the traditional gammaretroviruses) [1]. ii. The “HERV-I group”, also containing HERV-ADP. iii. The “ERV9 group”, also containing HERV-W, HUERSP3, MER41, MER66 and a few other groups [1], [51]. iv. The “HERV-H group”, also containing HERV-F [1], [52], [53]. At high stringency the gamma HMM detected 39% of the chromosome 19 LTRs belonging to the aforementioned groups and detected by RepeatMasker (see Table 2). The specificity was 71% corresponding to 159 additional positives. 139 of these additional positives were ERV1 elements according to the RepBase nomenclature [15]. The ERV1 elements are gammaretroviruslike [4]. Thus, these additional positives should not be considered as false positives. By decreasing the threshold from 5 to −1, the sensitivity and specificity changed to 64% and 38%, respectively (see Table 3). The generality of the gamma HMM was tested with the jackknifing technique and was found to be somewhat less than the HML HMM, see Table 5. There was no correlation between the age of ERV1 and ERV2 LTRs, as measured by the divergence between the 5′ LTR and the 3′ LTR (in chains containing both LTRs), and their detectability. Rather, the detectability depended on the LTR species, indicating that some orthoretroviral LTRs did not conform to the HMMs.
At a threshold of 4, the general LTR HMM, “general” in Table 2 and 3, had a sensitivity of 7% and a specificity of 54% compared to the RepeatMasker hits of HMLs, gamma- and spuma- like HERVs on human chromosome 19, while the corresponding figures at the lower threshold 1 were 27% and 21% respectively (Table 3). The lenti HMM had as expected (because there are no known lentivirus ERVs in humans) low sensitivities at both thresholds (see Table 2 and 3) but did find 12 HML LTRs and 55 gammaretroviruslike LTRs at threshold −2. To study the usefulness of HMMs as a general LTR screening tool, all models were combined (“Combined” in Table 2 and 3). At the higher thresholds, each of them equal to the threshold in Table 2, the sensitivity and specificity were 53% and 74% respectively while at the lower thresholds (Table 3) they were 71% and 20%, respectively. In some models, “gamma”, “beta” and “general”, the nature of additional positives, found by the HMM but not by RepeatMasker, was investigated by using BLAT at the UCSC genome browser site. The ones that occurred as repeats, with at least 10 occurrences in the human genome, had a RepeatMasker hit such as a LINE or SINE elements [15], indicating that the more generalised HMMs occasionally detected these repeats. The execution time on a computer with a 2.40 GHz dual processor and 2 GB RAM was about 2 hours per 10 million base pairs for an HMM with 100 match states. Comparison of HMM detection in random to actual genomic sequence Testing HMMs for LTR detection in the opossum genome and other groups The detection capability of all five models was tested on LTRs obtained from a RetroTector© run on the opossum genome (monDom4). This marsupial is separated from the human lineage by more than 100 million years, thus presenting a critical test on the detection range of the HMMs. The selected LTRs belonged to either the beta- or gammaretroviruslike chains according to the classification of RetroTector©. This machine-made grouping must however be regarded as provisional. Anyway, the current classification of retroviruses is largely based on those of the mouse. Vertebrate retroviruses are highly diverse. The mouse is relatively distant both from humans (another eutherian) and opossum (a marsupial). The nomenclature will need revision. Table 6 and 7 show the sensitivities of all five HMMs on the set of beta and gamma opossum LTRs and ten other sets of LTRs: beta exogenous, HML consensus, alphabeta chicken [1], alpha exogenous, lenti, delta, spuma , epsilon, gamma exogenous and HERV gamma consensus. The results are shown in Table 6 and 7 for two different thresholds, the same as in the RepeatMasker comparison (Table 2 and 3).
Depending on the threshold, the gamma HMM detected 9.1% and 22% of the gamma opossum LTRs, and the beta HMM detected 9.0% and 18% of the beta opossum LTRs (Table 6 and 7). The general HMM had a similar detection capability for both groups. A dependency on LTR length was obvious. A higher frequency of opossum genome LTR detection was obtained with LTRs of similar length as those of the training set. Among 500–600 nt long opossum gamma LTRs, 66% were detected by the gamma HMM (Excel S1). A less dramatic effect of opossum LTR length was observed with the beta HMM. Around 60% of the opossum beta LTRs of 500–800 nt were detected by the beta HMM (Excel S1). The low scores of the opossum evaluation set depend on the presence of LTRs with aberrant LTR length. HIV and other lentiviruses were most of them detected by both the beta and the general HMMs. Alpha RSV and most deltaretroviral HTLV:s were detected by the beta and general HMMs, but in this case they were part of the training set (see Table 7). Thus, the HMMs had a degree of generality in detection of diverse LTRs which they were not trained for. Conserved structures of betaretroviruslike LTRs The training set of the beta HMM with 110 match states, “beta” in Table 1 was Viterbi aligned and a sequence logo [54] was created from the alignment after the insert states had been removed. The most salient feature is the AATAAA motif at pos. 52–57 of the sequence logo (see Fig 1 A sequence logo for the more restricted HML HMM with 170 match states is shown in Fig 2a
As described by Colgan et al. [24], polyadenylation is dependent not only on the AATAAA motif but also on a T-, GT- or CT-rich area usually located 20–40 bases downstream of the polyadenylation site. This area is well conserved in the alignment (pos. 138–156). Conserved LTR structures of gammaretroviruslike HERVs A 109 nt consensus sequence was derived from the gamma LTR HMM. Transcription factor binding sites were sought with the MOTIF program. The consensus contains non-contiguous match states but this was taken into account by looking back at the Viterbi alignment of the training set. Only MOTIF matches which were contiguous were taken into account. The Viterbi alignment was visualised as a sequence logo (see Fig 3a
Conserved LTR structures of lentiviruses The lenti HMM scored highly with all lentiviral LTRs. The structure of the lentivirus LTRs is qualitatively similar to that of the gammaretroviruslike LTRs. The most notable features of the 190 nt consensus (see weblogo Fig 4a
Structures revealed by the general LTR HMM The general LTR HMM with 130 match states, “general” in Table 1, generated the weblogo in Fig 5
Combining Viterbi alignments of five different HMMs into a “SuperViterbi” alignment During the work with the Viterbi alignments for each HMM, evidence for a common orthoretroviral LTR structure emerged. Insert states were not randomly distributed. They tended to occur in contiguous stretches at certain relative positions in the Viterbi alignments. This, and the previously mentioned conservation of characteristic contiguous nucleotide stretches, sometimes appreciable as “motifs” in the match states, made it possible to align them in a combined Viterbi alignment. Initially, an HMM was trained on the five HMM consensuses. A preliminary “SuperViterbi” alignment resulted (Fig S7). This alignment was then manually adjusted, taking into account the distribution of insert states, nucleotide composition and the presence of known motifs (Fig 6
The most conspicuous feature was the almost universal occurrence of a TG at the 5′ end, and a frequent occurrence of CA at the 3′end. These inverted repeats are characteristic of vertebrate retroviral LTRs [29]. Their inclusion in the Viterbi alignments serves as a quality control. The other modules are discussed under the respective LTR region below. As seen in Fig 6 1. The initial, almost universal, “TG” module was followed by 0-237 insert states. The long insert state in some lentiviruses is mainly due to the nef ORF. 2. The “TG-adjacent” module contained two short portions. The first contained combinations of “GG” and “GA”, the second of “CC” and “CT”, and 0 to 1793 nt insert states. 3. The “first A-rich” module consisted of stretches of up to four A:s, followed by a region of “C”, “TG” and “C”. This may correspond to the noncanonical TATA which sometimes occurs upstream of the canonical TATAA [7] in gammaretroviruses. In lentiviruses, the TATA box, is present in this module. It was placed there by the HMM alignment of HMM consensus sequences (Fig S7). On inspection, this is reasonable because of sequence similarity to the modules of the other four HMM consensuses. It ends with 10-335 insert state stretches. 4. The “second A-rich” module contained a more or less clear TATA box in all except the “beta” and “lenti” HMM consensuses. The A-rich stretch is followed by G and C repeats, and ends with 1-4 T and 18-1656 insert states. Especially long insert state stretches occurred in the HML HMMs, and contained long (>97 aa) ORFs in the HML2, HML3, HML4 and HML8 LTR consensus sequences, illustrating that long ORFs are especially common in betaretroviruslike ERV LTRs (which include sag of MMTV; data not shown). Except for one of the HML3 ORFs, which encompasses also a part of the intermediate module, all observed HML ORFs start or end in the intersection of the second A-rich and intermediate modules. These ORFs are further discussed below. 5. The “intermediate” module was “CT” rich and ended with 16-67 insert states. It was not present in the beta and lenti HMM consensuses. 6. The third A-rich module, situated in R-U5, encompassed the AATAAA box. It started with “AATAAA” in different embodiments, directly followed by a CT/GT rich region which contains the polyadenylation site, and ends with 0-430 insert states. Match states could be mapped onto stem loops in this region (see above). 7. Finally, the “CA” module started with combinations of “GG” and “CC” and in three of five cases ended with “CA”, the canonical target site duplication. In fact, “CA” was present in nearly all of the LTRs on which the HMMs were based. However, the HMMs did not always pick it up as a majority consensus, as can be seen in the web logos. Fit of exogenous retroviral LTRs in the respective HMMs The structure and function of alpharetroviral (ALV), betaretroviral (MMTV, MPMV and JSRV), gammaretroviral (MLV) and lentiviral (HIV) LTRs are especially well known [60]. The degree of fit of well characterised exogenous retroviral LTRs, occurring in the last three of the four genera, to the HMMs (which were largely based on ERV LTRs) was therefore of interest. The exogenous MMTV (AF033807) and MPMV (AF033815) LTRs fitted to some extent (scores of -4.0 and -7.9, respectively) in the beta HMM, while JSRV (AF357971) fitted well (score 10.3). The exogenous MLV-like sequences (MLV J01998, GaLV M26927 and FLV M18247) fitted (with scores 8.2, 13 and 14, respectively) in the gamma HMM in the second A-rich module: These HMMs thus encompassed many endo- and exogenous beta- and gammaretroviruslike sequences, even if their generality for opossum counterparts with aberrant LTR length seemed relatively weak. The lenti HMM detected all lentiviral LTRs, including Visna (score 25), EIAV (score 15) and the rather aberrant RELIK (score 12) [2] LTRs. RELIK was not part of the training set for lenti LTRs. This attests to the fidelity of the lenti HMM for lentiviral sequences. Its HMM consensus was mainly based on nef-pruned primate lentiviral LTRs, and did not yield “second A-rich” and “intermediate” modules (nr 4 and 5). The pruning, necessary to achieve convergence during HMM training, may have contributed to this lack of two modules. However, the first and third A-rich modules including a short CT rich stretch, were obvious. Evidence for a redundant basic LTR design The three A-rich modules have common features. The A-dense islands are 5–15 nt long, surrounded by “CC”, “CT” and “GT” rich stretches. The consensus TATA and AATAAA sequences are often strikingly similar. In the lentiviral HMM consensus they are conserved within “CATATAAAG” and “CAATAAAG” contiguous match state stretches, respectively, differing only by an inserted T. Lentiviral LTRs have only two A-rich modules. Their TATA consensus maps to the first A-rich module, and they lack the second A-rich module. It has been noted by others [8] that gammaretrovirus LTRs have a TATA-like sequence upstream of the “regular” TATA box. Another sign of redundancy is that AATAAA functions as TATA in HML2 [34]. A conserved structural feature is a stem loop with a UG-rich crown, corresponding to tar in HIV-1, just after the transcription start site (TSS), see Fig 2 Mapping of sense and antisense transcripts onto the HMM LTR consensuses At least three retroviruses are now known to produce antisense transcripts [35], [36], [37], [38], [39], [40], [41], [42], [43]. Some of these initiate within the 3′ LTR. We therefore mapped the antisense TSS onto the SuperViterbi alignment. These sites were variable in position (Fig 6 Cross-genus recognition by the HMMs When the HMMs were tested against LTRs from retroviral genera which they were not trained for, the general pattern was that the TG, second and third A-rich, the T-rich and CA modules most frequently contained match states with conserved nucleotides (data not shown). The degree of crossreactivity of the HMMs tended to follow the degree of relatedness of the Pol sequences of the corresponding proviruses [1]. The gamma HMM was the most cross-reactive (Table 7). It detected 83% of the HML consensus LTRs, 4/8 of the lenti- and 5/8 of the deltaretroviral sequences. The beta HMM detected 49% of the gamma HERVs (“ERV1” RepBase consensus sequences), and 3/7 of the spumaretroviruses (“ERVL” or “ERV3” in the RepBase notation). The aberrant primary structure of deltaretroviral LTRs [60] could be a reason why deltaretroviral LTRs gave weak cross-genus scores and weak cross-genus Viterbi alignments. However, the general structural pattern of cross-genus recognition is a further sign of the generality of the HMMs. Position of known and putative ORFs in the combined Viterbi alignment The occurrence of open reading frames longer than 100 amino acids in LTRs of 500–1000 nt is at the fringe of likelihood. A likelihood fringe of one standard deviation was calculated in a simulated set of random sequences of different length (100 random sequences for each length in increments of 100, from 200 nt to 1500 nt). Open reading frames outside of the likelihood fringe occurred in some of the LTRs (Fig 8
Projecting transcription factor binding sites (TFBSs) onto the Combined Viterbi alignment TFBSs are best known for ASLV, MLV, HIV and MMTV. They were often hard to identify in the SuperViterbi alignment. However, a CCAAT just before the second A-rich domain was conserved in the gamma HMM. TFBSs are adaptive stuctures which change due to requirements for tissue specificity and pathogenicity. When LTRs of entire genera were Viterbi aligned, they were often not conserved and occurred as insert states. However, the lenti HMM had conserved specificity protein 1 (SP1) TFBSs. Discussion Detection ab initio of single LTRs in genome databases is an important bioinformatical goal. In this work, a wide variety of vertebrate retroviral LTRs was investigated. However, orthoretroviruses, comprising the alpha-, beta-, gamma-, delta-, epsilon- and lentiretroviral genera, were the main target of our efforts. Nevertheless, epsilonretroviruses generally got low scores. Neither were attempts to include spuma-, erranti- and pseudoretroviral LTRs in training and evaluation sets successful. However, the training and evaluation LTR sets were gathered from evolutionarily very diverse hosts. Without the knowledge that they all were LTRs from vertebrate proviruses it would have been impossible to demonstrate a common structure in them. In this sense, the common structures demonstrated by the Viterbi alignments were a significant achievement. Using them as a base, it might be possible to extend the generality of LTR detection further. We used HMMs to detect and align LTRs from mainly beta-, gamma- and lentiretroviruslike RVs. The results show that the problem of detecting and aligning single LTRs can be solved for at least certain kinds of LTRs. The outcome of runs against a variety of vertebrate ERV LTRs, and the results of jackknifing, indicate a considerable generality in the detected LTR structure. There were two drawbacks with the LTR detection. First, our HMMs sometimes did not find groups not present in the training set. For example, a beta HMM not trained for HML5 LTRs would not detect them. It seems that LTRs are too variable in structure to allow a complete generalisation with the chosen technique. It is also expected that postintegrational mutations will distort some of the structures. Second, the computation speed was quite low on a standard computer. Common to all HMMs was that random sequence had a much lower positivity rate than actual genomic sequence (which was used to calculate specificity). Thus, there seems to exist a basic “LTR-like” character of vertebrate genomic sequence that is not present in purely random sequence. The reason could be that genomes were largely built from transposons more or less related to LTRs, or that LTR trained HMMs have a propensity to react with common genomic stuctures like promoters and enhancers. A similar difference between random and genomic sequence was found during evaluation of promoter recognition algorithms see e.g. [61]. The most conserved structures, common for all five HMMs, are TG, the AATAAA motif, the T-rich element and CA. The same tendency was seen in the cross-genus runs. The TATA-box is well conserved in gamma and lenti LTRs but less so in beta LTRs. The Viterbi alignments were compared with Clustal alignments (not shown). The main difference between them was that the conserved motifs were displaced in Clustal alignments and as a result it was almost impossible to construct a reliable consensus sequence. Clustal alignments are also more sensitive to the number and choice of sequences. For example, if many post-integrationally disrupted sequences are present, or sequences with a variable amount of inserts, they can destroy the alignment. To a degree, HMMs seem better at extracting information from such sequences. We found evidence for a stem-loop in the RU5 region of the human beta (HML) LTRs, reducing the unusually large distance (up to 60 nucleotides) between the poly(A) signal and the T-rich segment, which contains the polyadenylation site. Since the AATAAA motif can function as a TATA-box for HML2 [34] we speculate that this stem-loop plays a role in transcriptional activity as well. Although TATA normally acts at the DNA level and AATAAA at the RNA level, they seem to have a functional flexibility where the former may substitute for the latter. Stem-loops in the beginning of R were predicted also in gamma and lentiviruses, in accordance with previous models [29], [31]. The Viterbi consensus of the five HMMs did not generally include the loop stems. Although the stem-loop secondary structure may be conserved, the corresponding primary structure, on which the HMMs work, is not as conserved. The existence of retroviral antisense transcripts has been proven for at least three viruses: HIV-1, HTLV-1 and HERV-H [35], [36], [43]. So far, there is no consensus regarding their start sites. We therefore plotted some of the known start sites in the 3′ LTR in the SuperViterbi LTR model. Both HIV and HTLV antisense transcripts were reported to start close or relatively close to the respective sense TSS. However, antisense HERV-H starts at a different LTR location than the HERV-H sense transcript. Sense and antisense ORFs were found in consensus sequences of LTRs from certain betaretroviruslike ERVs; HML2, HML3, HML4 and HML8. HML LTRs are 500–1000 nt long, longer than most other LTRs, but shorter than the MMTV LTR, which is 1300 nt, and contains the sag ORF. Nef ORFs occur in some primate lentiviral LTRs. Otherwise we did not find ORFs of the same length in LTRs of other orthoretroviruses. The HML group consensus sequences are probably close to the proviral sequence as it was just after integration. The translated amino acid sequences of the ORFs (see Excel S2) did not start with a methionine, which would have been expected. The sequences did not have closely matching known proteins in GenBank. Neither were they similar to each other. This could be a sign of independent acquisition by the respective retroviruses. When the long HML8 ORFs were used in a search against RefSeq cDNA sequences, several interesting transcripts were found: i. The Xist sense-antisense pair which regulates epigenetic X chromosome silencing [62], [63] and ii. a transcript overlapping the APOBEC3B gene in antisense were found (Fig S8). However, the HML8 sequence probably did not encode a functional protein in these transcripts because there was no long ORF where the HML8 ORF fitted in. This fits with the reported loss of ORF and accumulation of transposon repeats during the evolution of XIST [62]. Moreover the HML8 ORF was broken by a stop mutation in both transcripts. Incidentally, the transcript which overlaps the APOBEC3B gene on chromosome 22 starts in the HML8/MER11C LTR very close to the start site found in HML2 (Fig S1). Although out of scope for his paper, we cannot resist mentioning that it should be considered whether HML8 can somehow participate in epigenetic silencing at least at these two loci. Maybe a retroviral promoter (HML8) controls the expression of the antiretroviral gene APOBEC3B. Apart from this, the putative HML ORFs deserve a further study. In the well studied HIV and MLV cases, it was possible to project known enhancer and promoter sequences onto the combined Viterbi alignment. Conserved lentiviral landmarks were the SP1 transcription factor binding sequences. The gammaretrovirus HMM homogeneously detected most gammaretroviruslike sequences, and probably gives an accurate representation of conserved gammaretrovirus LTR structure. Preceding the second A-rich stretch was a partially conserved CCAAT (CAAT box). Otherwise, transcription factor binding sites in U3 were not well conserved in the HMM consensus sequences. With the experience from orthoretrovirus-directed HMMs, it should be possible to tackle LTRs of other retroviruses, and LTR transposons such as the Meta- and Errantiviridae, i.e. Ty1/copia and Ty3/gypsy elements, respectively. They did not score highly with the present HMMs, but it is likely that more or less specialised HMMs could be built from them also. Although at least two A-rich domains occur in the Viterbi alignments of all five HMMs, their exact sequence is variable. To some extent they can substitute for each other. Indeed, TATA and AATAAA box consensuses tend to be similar. This variable and redundant structure and function is reminiscent of the relaxed stuctural requirements for promoter structure in some protists, like Giardia [64], where A-rich domains can serve as bidirectional promoters in a flexible fashion. LTRs seem to retain some of this flexibility, and thus may reveal an original basic RNA polymerase II promoter organisation. Current knowledge of LTR structure and function is becoming more detailed, but also less definite than previously appreciated [60]. For example, bidirectionality of transcription, and variability of transcriptional start sites, even in ordinary sense transcripts, have recently been reported [35], [36], [37], [38], [39], [40], [41], [42], [43]. The suggestive evidence for a modular LTR structure presented here, and previously by others [6], [7], may provide a basis for a more fundamental understanding of LTR function, and for construction of more general LTR-recognition algorithms. A better understanding of LTR structure will shed light on changes in retroviral tissue specificity and pathogenicity [60], on how bidirectional promoters and open reading frames [65] can be accommodated in their structure and on how LTRs can serve as alternative promoters in vertebrate genomes [3]. In addition, LTRs can spawn highly variable individual-specific minisatellite sequences [66], [67], [68], [69]. Knowledge of LTR structure could improve the understanding of this process. In summary, the HMMs were able to detect single LTRs in the human genome, without reference to cognate complete proviruses or repetitiveness. The sensitivity and specificity were high for some of them. They also provided alignments encompassing LTRs of most of beta-, gamma- and lentiretroviruslike LTRs, presenting further insight into the common LTR structure of these genera. This can have implications for gene therapy, LTR-based antivirals and for the understanding of both retroviral and vertebrate genomic evolution. Materials and Methods LTR sequences for training, testing and evaluation The training sets consisted mainly of endogenous LTR consensuses retrieved from RepBase [15]: 23 HML LTRs, 69 gammaretroviruslike human ERV1 LTRs selected according to [4] from a set of 198 human ERV1 sequences and 138 betaretroviruslike ERV2 mouse consensus sequences. Primate lentiviral LTRs were obtained from “HIV Databases”: http://www.hiv.lanl.gov/. LTRs from genomewide RetroTector© analyses of the opossum genome version monDom4 and the chicken genome version galGal3 were used mainly in the evaluation sets. LTRs from proviruses with two LTRs, scoring higher than 500 in RetroTector©, were used. This insured that the LTRs were authentic, with retained structure. Some reference exogenous and endogenous LTRs from all seven genera, downloaded from GenBank were also collected. Genome sequences were downloaded from the UCSC ftp site. The training, test and evaluation sets are described in Table 1, 4, 5, 6 and 7 for each HMM. The detection capability of the HMMs in large genome sequences was tested against human chromosome 19 of the Hg15 assembly. HMM algorithms Algorithms such as Clustal have difficulties in aligning sequences that differ much from each other and lack easily detectable structure. Since LTRs are of this kind [3], we decided to work with HMMs. HMMs are probabilistic models which precisely can represent the degree and sequence of interdependence of a series of states. As described below, we trained several hundred HMMs, separately for each retroviral genus, or more general ones, using different training sets and training parameters. Each HMM was evaluated on sets of independent LTRs. The HMMs were used to screen genome sequences with the ultimate goal to find previously undetected single LTRs. This is demanding, however, as in many cases there will be no way to verify such newly detected single LTRs. Beside the detection aspect, the HMM models can be analyzed to gain insight into LTR structure. We used a so-called profile HMM, an architecture introduced by Krogh et al [70]. Profile HMMs can construct and represent nucleotide alignments. As any HMM, they consist of interconnected states. A profile HMM has a number of modules, corresponding to the conserved columns in the alignment. Each module contains a match state, an insertion state to allow for insertions and a delete state to handle gaps. The three basic tasks for HMMs [44] are: 1. Given a family of sequences, to construct the HMM which best represents the family. This is a statistical estimation problem for which the standard method used is a maximum likelihood estimate computed by an algorithm known as the Baum-Welch algorithm [48]. 2. To calculate the probability that a given HMM will generate a given sequence. This is done by a dynamic programming algorithm called the forward algorithm. 3. To align one or several sequences against a given HMM or equivalently, determine the highest probability path(s) through the HMM. This is solved by the Viterbi algorithm and the resulting alignment is called a Viterbi alignment. Building HMMs specific for LTRs There are two steps in building an HMM, initialisation and training. The parameters of the HMM can be initialised by using a single sequence or a pre-existing alignment. The main step is training: A set of sequences is used to optimise the parameters of the HMM. One difficulty with training an HMM is to design a good training set. If there are many sequences that are very closely related to each other, the HMM will become overspecialised to this group. One way of circumventing this pitfall is to replace the group of closely related sequences with their consensus. Another way is sequence weighting of which there are several variants. In this work the maximum entropy weighting method [71] and manual weighting were used together with the first method. The most serious difficulty with an HMM model is overfitting, which means that it represents the training sequences well but fails to generalise to related sequences not present in the training set. To reduce overfitting, a maximum a posteriori (MAP) estimation algorithm was used besides the maximum likelihood based Baum-Welch algorithm. The MAP estimation algorithm in this work is from Brand [72]. It uses a so called entropic prior which has a parameter z that controls the order (if positive) or disorder (if negative) of the HMM model. In general the somewhat disordered or noisy HMMs had higher generalising capacity. Many HMMs (an HMM “family”) with different number of match states and different z were constructed for the gammaretroviruslike genus of LTRs. An epsilonretroviral LTR sequence closely related to the gamma genus, WDSV (accession number AF033822), was chosen to initialise the HMMs. The training set contained 72 sequences: 69 Repbase human ERV1 LTR consensuses and three exogenous gamma LTRs. Similarly, two HMM families designed to cover mainly the beta genus were constructed. One HMM family was specialised to the HML group of betaretroviruslike LTRs. Its training set contained 23 human ERV2 or HML LTR consensuses. The other HMM family was broader in scope: Its training set consisted of LTRs from five groups (see Table 1): 1) The same 23 HML LTR consensuses. 2) 138 mouse ERV2 LTR consensuses. 3) Two intermediate beta LTR consensuses constructed from a RetroTector© analysis of the chicken genome. 4) Three alpha exogenous LTRs and one endogenous alpha LTR also from the chicken genome. 5) 8 delta exogenous LTRs. The initialising sequence were in both cases a BLV LTR which belongs to the delta genus (accession number K02120). Because of the varying number of sequences in each group manual weighting was performed. A purely lentiviral HMM family, with training set consisting of 32 primate lentiviral LTRs plus five lentiviral LTRs from other mammals, was built in the same way. Three of the 32 primate LTRs were CLUSTAL [73] consensuses. Manual weighting was done due to the overrepresentation of primate LTRs and it was found necessary to remove the long ORFs present in the primate LTRs (nef) to achieve convergence. One general HMM family was also built. The initialising sequence was a BLV LTR as before and its training set was a combination of the gamma and the broader beta training sets (see Table 1). The initialisation gave an HMM with number of match states equal to the length of the initialisation sequence, around 500 for a typical LTR. Since the probable number of conserved sites is much lower than this, model surgery [48] was applied during the training to reduce the number of match states. Another reason for reducing the number of match states is to save computation time which increases linearly with the number of match states. Finally, by reducing the number of match states, the number of free parameters in the HMM is decreased, which counteracts overfitting. The choice of initiation sequence was not so critical as long as it was not too short. To choose the best model among the HMMs with varying number of match states and different z-values, test sets were used and the model giving the highest score for this test set was selected. In some cases (see Table 1) the test set was just a family of sequences not contained in the training set and in other cases “jackknifing” was performed (see Table 4 and 5), i. e. removing one family at the time from the training set and calculating the score of the family removed. It was found that the score of the test set increased linearly with the number of match states in the HMM model until it reached a plateau. The “best” HMM model was chosen at the end of the linear regime or beginning of the plateau. Generally the selected models had 100–200 match states and negative z-values. Software Several bioinformatical tools were used in this work: RetroTector© [20], as described above. Mega version 3.1 [74] and ClustalX [75] were used for alignments and phylogenetic analyses. Bioedit [76] for viewing alignments. Mfold [77] for prediction of nucleic acid secondary structures. MOTIF [78] for finding transcription factor binding sites. The RepeatMasker output downloaded from the UCSC genome ftp site (unpublished; see http://repeatmasker.org). Weblogo was used for constructing sequence logos [54]. To implement HMMs, we used programs written in C. These were mainly modules implementing the three basic tasks for HMMs as described above. To these, a few modules were added: A module for Viterbi alignments, a module for weighting sequences using the maximum entropy method, a module for adjusting the length of HMMs and a module for regularising HMMs based on [72]. These programs were applied to the problem of LTR detection and characterisation. Figure S1 Start sites of some HML transcripts, including one from the ABOBEC3B locus on chromosome 22. (2.38 MB TIF) Click here for additional data file.(2.2M, tif) Figure S2 Conformational analysis of the HML4 and HML5 RU5 regions. The MFOLD package was used. (2.90 MB TIF) Click here for additional data file.(2.7M, tif) Figure S3 Conformational analysis of the HMM generated HML consensus. The MFOLD program was used. (2.17 MB TIF) Click here for additional data file.(2.0M, tif) Figure S4 Conformational analysis of RU5 of Murine Leukemia virus (Accession nr MLMCG). The MFOLD program was used. (2.51 MB TIF) Click here for additional data file.(2.3M, tif) Figure S5 Conformational analysis of the gamma HMM generated LTR consensus sequence. MFOLD was used. (2.10 MB TIF) Click here for additional data file.(2.0M, tif) Figure S6 Conformational analysis of RU5 of HIV-1 LAV LTR. MFOLD was used. (2.84 MB TIF) Click here for additional data file.(2.7M, tif) Figure S7 The Viterbi alignment of five HMM consensus sequences. This HMM-generated “superviterbi” alignment was manually adjusted with regard to insert state clustering and presented in figure 6 (3.25 MB TIF) Click here for additional data file.(3.0M, tif) Figure S8 BLAT result of a search with HML8 LTR ORF in the hg18 assembly. An antisense HML8 LTR was found in the APOBEC3B locus on chromosome 22. (9.34 MB TIF) Click here for additional data file.(8.9M, tif) Excel S1 The figures illustrate a great dependence on LTR length when the beta and gamma HMMs are applied to opossum LTRs. (0.66 MB XLS) Click here for additional data file.(642K, xls) Excel S2 Sequences of the ORFs of human MMTV-like LTRs, and their positions in them. (0.16 MB XLS) Click here for additional data file.(152K, xls) Footnotes Competing Interests: The authors have declared that no competing interests exist. Funding: Swedish Research Council (grant K2004-32X-14252-03A). Local funds at Uppsala Academic Hospital. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. References 1. Jern P, Sperber GO, Blomberg J. Use of endogenous retroviral sequences (ERVs) and structural markers for retroviral phylogenetic inference and taxonomy. Retrovirology. 2005;2:50. [PubMed] 2. Katzourakis A, Tristem M, Pybus OG, Gifford RJ. Discovery and analysis of the first endogenous lentivirus. Proc Natl Acad Sci U S A. 2007;104:6261–6265. [PubMed] 3. Leib-Mosch C, Seifarth W, Schön U. Influence of Human Endogenous Retroviruses on Cellular Gene Expression. Retroviruses and Primate Genome Evolution: Landes Bioscience. 2005. 4. Mager DL, Medstrand P. London, UK: Nature publishing group; 2003. Retroviral repeat sequences. Nature encyclopedia of the human genome. 5. Quandt K, Frech K, Karas H, Wingender E, Werner T. MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 1995;23:4878–4884. [PubMed] 6. Blusch JH, Haltmeier M, Frech K, Sander I, Leib-Mosch C, et al. Identification of endogenous retroviral sequences based on modular organization: proviral structure at the SSAV1 locus. Genomics. 1997;43:52–61. [PubMed] 7. Frech K, Brack-Werner R, Werner T. Common modular structure of lentivirus LTRs. Virology. 1996;224:256–267. [PubMed] 8. Frech K, Danescu-Mayer J, Werner T. A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J Mol Biol. 1997;270:674–687. [PubMed] 9. McCarthy EM, McDonald JF. LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics. 2003;19:362–367. [PubMed] 10. Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007;35:W265–268. [PubMed] 11. Rho M, Choi JH, Kim S, Lynch M, Tang H. De novo identification of LTR retrotransposons in eukaryotic genomes. BMC Genomics. 2007;8:90. [PubMed] 12. Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, et al. Combined evidence annotation of transposable elements in genome sequences. PLoS Comput Biol. 2005;1:166–175. [PubMed] 13. Kalyanaraman A, Aluru S. Efficient algorithms and software for detection of full-length LTR retrotransposons. J Bioinform Comput Biol. 2006;4:197–216. [PubMed] 14. Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9:18. [PubMed] 15. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005;110:462–467. [PubMed] 16. Bao Z, Eddy SR. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002;12:1269–1276. [PubMed] 17. Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics. 2005;21(Suppl 1):i351–358. [PubMed] 18. Edgar RC, Myers EW. PILER: identification and classification of genomic repeats. Bioinformatics. 2005;21(Suppl 1):i152–158. [PubMed] 19. Caspi A, Pachter L. Identification of transposable elements using multiple alignments of related genomes. Genome Res. 2006;16:260–270. [PubMed] 20. Sperber GO, Airola T, Jern P, Blomberg J. Automated recognition of retroviral sequences in genomic data RetroTector(C). Nucleic Acids Res. 2007. 21. La Mantia G, Majello B, Di Cristofano A, Strazzullo M, Minchiotti G, et al. Identification of regulatory elements within the minimal promoter region of the human endogenous ERV9 proviruses: accurate transcription initiation is controlled by an Inr-like element. Nucleic Acids Res. 1992;20:4129–4136. [PubMed] 22. Sjottem E, Anderssen S, Johansen T. The promoter activity of long terminal repeats of the HERV-H family of human retrovirus-like elements is critically dependent on Sp1 family proteins interacting with a GC/GT box located immediately 3′ to the TATA box. J Virol. 1996;70:188–198. [PubMed] 23. Anderssen S, Sjottem E, Svineng G, Johansen T. Comparative analyses of LTRs of the ERV-H family of primate-specific retrovirus-like elements isolated from marmoset, African green monkey, and man. Virology. 1997;234:14–30. [PubMed] 24. Colgan DF, Manley JL. Mechanism and regulation of mRNA polyadenylation. Genes Dev. 1997;11:2755–2766. [PubMed] 25. Guntaka RV. Transcription termination and polyadenylation in retroviruses. Microbiol Rev. 1993;57:511–521. [PubMed] 26. Andersson ML, Lindeskog M, Medstrand P, Westley B, May F, et al. Diversity of human endogenous retrovirus class II-like sequences. J Gen Virol. 1999;80 (Pt 1):255–260. [PubMed] 27. Baust C, Seifarth W, Germaier H, Hehlmann R, Leib-Mosch C. HERV-K-T47D-Related long terminal repeats mediate polyadenylation of cellular transcripts. Genomics. 2000;66:98–103. [PubMed] 28. Ahmed YF, Gilmartin GM, Hanly SM, Nevins JR, Greene WC. The HTLV-I Rex response element mediates a novel form of mRNA polyadenylation. Cell. 1991;64:727–737. [PubMed] 29. Coffin J, Hughes S, Varmus H. Retroviruses: Cold Spring Harbor Laboratory Press; 1997. Tat, the Trans-activator of HIV. pp. 225–230. 30. Trubetskoy AM, Okenquist SA, Lenz J. R region sequences in the long terminal repeat of a murine retrovirus specifically increase expression of unspliced RNAs. J Virol. 1999;73:3477–3483. [PubMed] 31. Cupelli L, Okenquist SA, Trubetskoy A, Lenz J. The secondary structure of the R region of a murine leukemia virus is important for stimulation of long terminal repeat-driven gene expression. J Virol. 1998;72:7807–7814. [PubMed] 32. Cupelli LA, Lenz J. Transcriptional initiation and postinitiation effects of murine leukemia virus long terminal repeat R-region sequences. J Virol. 1991;65:6961–6968. [PubMed] 33. Pierce J, Fee BE, Toohey MG, Peterson DO. A mouse mammary tumor virus promoter element near the transcription initiation site. J Virol. 1993;67:415–424. [PubMed] 34. Kovalskaya E, Buzdin A, Gogvadze E, Vinogradova T, Sverdlov E. Functional human endogenous retroviral LTR transcription start sites are located between the R and U5 regions. Virology. 2006;346:373–378. [PubMed] 35. Briquet S, Vaquero C. Immunolocalization studies of an antisense protein in HIV-1-infected cells and viral particles. Virology. 2002;292:177–184. [PubMed] 36. Cavanagh MH, Landry S, Audet B, Arpin-Andre C, Hivin P, et al. HTLV-I antisense transcripts initiating in the 3′LTR are alternatively spliced and polyadenylated. Retrovirology. 2006;3:15. [PubMed] 37. Dunn CA, Romanish MT, Gutierrez LE, van de Lagemaat LN, Mager DL. Transcription of two human genes from a bidirectional endogenous retrovirus promoter. Gene. 2006;366:335–342. [PubMed] 38. Leupin O, Attanasio C, Marguerat S, Tapernoux M, Antonarakis SE, et al. Transcriptional activation by bidirectional RNA polymerase II elongation over a silent promoter. EMBO Rep. 2005;6:956–960. [PubMed] 39. Ludwig LB, Ambrus JL, Jr, Krawczyk KA, Sharma S, Brooks S, et al. Human Immunodeficiency Virus-Type 1 LTR DNA contains an intrinsic gene producing antisense RNA and protein products. Retrovirology. 2006;3:80. [PubMed] 40. Zhang Z, Kim E, Martineau D. Functional characterization of a piscine retroviral promoter. J Gen Virol. 1999;80 (Pt 12):3065–3072. [PubMed] 41. Domansky AN, Kopantzev EP, Snezhkov EV, Lebedev YB, Leib-Mosch C, et al. Solitary HERV-K LTRs possess bi-directional promoter activity and contain a negative regulatory element in the U5 region. FEBS Lett. 2000;472:191–195. [PubMed] 42. Christy RJ, Huang RC. Functional analysis of the long terminal repeats of intracisternal A-particle genes: sequences within the U3 region determine both the efficiency and direction of promoter activity. Mol Cell Biol. 1988;8:1093–1102. [PubMed] 43. Feuchter A, Mager D. Functional heterogeneity of a large family of human LTR-like promoters and enhancers. Nucleic Acids Res. 1990;18:1261–1270. [PubMed] 44. Rabiner L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989;77:257–286. 45. Barton N, Briggs D, Eisen J, Goldstein D, Patel N. Evolution: Cold Spring Harbor Laboratory Press; 2007. 46. Futuyma D. Sunderland, MA 01375: Sinauer; 2005. Evolution. p. 603. 47. Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001;294:2348–2351. [PubMed] 48. Durbin R, Eddy SR, Krogh A, Mitchison GJ. Cambridge, UK: Cambridge University Press; 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. 49. Blikstad V, Benachenhou F, Sperber GO, Blomberg J. Endogenous retroviruses : Evolution of human endogenous retroviral sequences: a conceptual account. Cell Mol Life Sci. 2008;65:3348–3365. [PubMed] 50. Andersson AC, Yun Z, Sperber GO, Larsson E, Blomberg J. ERV3 and related sequences in humans: structure and RNA expression. J Virol. 2005;79:9270–9284. [PubMed] 51. Oja M, Sperber GO, Blomberg J, Kaski S. Self-organizing map-based discovery and visualization of human endogenous retroviral sequence groups. Int J Neural Syst. 2005;15:163–179. [PubMed] 52. Jern P, Sperber GO, Ahlsen G, Blomberg J. Sequence variability, gene structure, and expression of full-length human endogenous retrovirus H. J Virol. 2005;79:6325–6337. [PubMed] 53. Jern P, Sperber GO, Blomberg J. Definition and variation of human endogenous retrovirus H. Virology. 2004;327:93–110. [PubMed] 54. Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18:6097–6100. [PubMed] 55. Seifarth W, Baust C, Murr A, Skladny H, Krieg-Schneider F, et al. Proviral structure, chromosomal location, and expression of HERV-K-T47D, a novel human endogenous retrovirus derived from T47D particles. J Virol. 1998;72:8384–8391. [PubMed] 56. Mager DL. Polyadenylation function and sequence variability of the long terminal repeats of the human endogenous retrovirus-like family RTVL-H. Virology. 1989;173:591–599. [PubMed] 57. Chen HR, Barker WC. Nucleotide Sequences of the retroviral long terminal repeats and their adjacent regions. Nucleic Acids Research. 1984;12:1767–1778. [PubMed] 58. Chang NT, Yang WK, Huang HC, Yeh KW, Wu CW. The transcriptional activity of HERV-I LTR is negatively regulated by its cis-elements and wild type p53 tumor suppressor protein. J Biomed Sci. 2006 59. Bucher P. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol. 1990;212:563–578. [PubMed] 60. Coffin JM, Hughes SH, Varmus HE, editors. New York: Cold Spring Harbor Laboratory Press; 1997. Retroviruses. 61. Karlin S. Statistical signals in bioinformatics. Proc Natl Acad Sci U S A. 2005;102:13355–13362. [PubMed] 62. Elisaphenko EA, Kolesnikov NN, Shevchenko AI, Rogozin IB, Nesterova TB, et al. A dual origin of the Xist gene from a protein-coding gene and a set of transposable elements. PLoS ONE. 2008;3:e2521. [PubMed] 63. Nesterova TB, Popova BC, Cobb BS, Norton S, Senner CE, et al. Dicer regulates Xist promoter methylation in ES cells indirectly through transcriptional control of Dnmt3a. Epigenetics Chromatin. 2008;1:2. [PubMed] 64. Teodorovic S, Walls CD, Elmendorf HG. Bidirectional transcription is an inherent feature of Giardia lamblia promoters and contributes to an abundance of sterile antisense transcripts throughout the genome. Nucleic Acids Res. 2007;35:2544–2553. [PubMed] 65. Coffin J, Hughes S, Varmus H. Retroviruses: Cold Spring Harbor Laboratory Press; 1997. Appendix 2. pp. 776–777. 66. Armour JA, Wong Z, Wilson V, Royle NJ, Jeffreys AJ. Sequences flanking the repeat arrays of human minisatellites: association with tandem and dispersed repeat elements. Nucleic Acids Res. 1989;17:4925–4935. [PubMed] 67. Kelly RG. Similar origins of two mouse minisatellites within transposon-like LTRs. Genomics. 1994;24:509–515. [PubMed] 68. Tek AL, Song J, Macas J, Jiang J. Sobo, a recently amplified satellite repeat of potato, and its implications for the origin of tandemly repeated sequences. Genetics. 2005;170:1231–1238. [PubMed] 69. Wolgamot G, Miller AD. Replication of Mus dunni endogenous retrovirus depends on promoter activation followed by enhancer multimerization. J Virol. 1999;73:9803–9809. [PubMed] 70. Krogh A, Brown M, Mian IS, Sjolander K, Haussler D. Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol. 1994;235:1501–1531. [PubMed] 71. Krogh A, Mitchison G. Maximum entropy weighting of aligned sequences of proteins or DNA. Proc Int Conf Intell Syst Mol Biol. 1995;3:215–221. [PubMed] 72. Brand M. Structure Learning in Conditional Probability Models via an Entropic Prior and Parameter Extinction. Neural Comp. 1999;11:1155–1182. 73. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. [PubMed] 74. Kumar S, Tamura K, Nei M. MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Brief Bioinform. 2004;5:150–163. [PubMed] 75. Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 1997;25:4876–4882. [PubMed] 76. Hall T. BioEdit: A user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser. 1999;41:95–98. 77. Zuker M, Mathews D, Turner D. Algorithms and thermodynamics for RNA secondary structure prediction: A practical guide in RNA biochemistry and biotechnology. In: Barciszewski JaCBFC., editor. Kluwer Academic Publishers; 1999. 78. Heinemeyer T, Chen X, Karas H, Kel AE, Kel OV, et al. Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms. Nucleic Acids Res. 1999;27:318–322. [PubMed] 79. Lavrentieva I, Khil P, Vinogradova T, Akhmedov A, Lapuk A, et al. Subfamilies and nearest-neighbour dendrogram for the LTRs of human endogenous retroviruses HERV-K mapped on human chromosome 19: physical neighbourhood does not correlate with identity level. Hum Genet. 1998;102:107–116. [PubMed] |
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||||||
Retrovirology. 2005 Aug 10; 2():50.
[Retrovirology. 2005]Proc Natl Acad Sci U S A. 2007 Apr 10; 104(15):6261-5.
[Proc Natl Acad Sci U S A. 2007]Nucleic Acids Res. 1995 Dec 11; 23(23):4878-84.
[Nucleic Acids Res. 1995]Genomics. 1997 Jul 1; 43(1):52-61.
[Genomics. 1997]Virology. 1996 Oct 1; 224(1):256-67.
[Virology. 1996]J Mol Biol. 1997 Aug 1; 270(5):674-87.
[J Mol Biol. 1997]Bioinformatics. 2003 Feb 12; 19(3):362-7.
[Bioinformatics. 2003]Nucleic Acids Res. 1992 Aug 25; 20(16):4129-36.
[Nucleic Acids Res. 1992]J Virol. 1996 Jan; 70(1):188-98.
[J Virol. 1996]Virology. 1997 Jul 21; 234(1):14-30.
[Virology. 1997]Genes Dev. 1997 Nov 1; 11(21):2755-66.
[Genes Dev. 1997]Microbiol Rev. 1993 Sep; 57(3):511-21.
[Microbiol Rev. 1993]Science. 2001 Dec 14; 294(5550):2348-51.
[Science. 2001]Retrovirology. 2005 Aug 10; 2():50.
[Retrovirology. 2005]Cytogenet Genome Res. 2005; 110(1-4):462-7.
[Cytogenet Genome Res. 2005]J Gen Virol. 1999 Jan; 80 ( Pt 1)():255-60.
[J Gen Virol. 1999]Cell Mol Life Sci. 2008 Nov; 65(21):3348-65.
[Cell Mol Life Sci. 2008]Retrovirology. 2005 Aug 10; 2():50.
[Retrovirology. 2005]J Virol. 2005 Jul; 79(14):9270-84.
[J Virol. 2005]Int J Neural Syst. 2005 Jun; 15(3):163-79.
[Int J Neural Syst. 2005]J Virol. 2005 May; 79(10):6325-37.
[J Virol. 2005]Virology. 2004 Sep 15; 327(1):93-110.
[Virology. 2004]Cytogenet Genome Res. 2005; 110(1-4):462-7.
[Cytogenet Genome Res. 2005]Retrovirology. 2005 Aug 10; 2():50.
[Retrovirology. 2005]Retrovirology. 2005 Aug 10; 2():50.
[Retrovirology. 2005]Nucleic Acids Res. 1990 Oct 25; 18(20):6097-100.
[Nucleic Acids Res. 1990]Virology. 2006 Mar 15; 346(2):373-8.
[Virology. 2006]J Virol. 1998 Oct; 72(10):8384-91.
[J Virol. 1998]Genomics. 2000 May 15; 66(1):98-103.
[Genomics. 2000]J Virol. 1998 Oct; 72(10):8384-91.
[J Virol. 1998]Genes Dev. 1997 Nov 1; 11(21):2755-66.
[Genes Dev. 1997]Genes Dev. 1997 Nov 1; 11(21):2755-66.
[Genes Dev. 1997]Virology. 1989 Dec; 173(2):591-9.
[Virology. 1989]Nucleic Acids Res. 1984 Feb 24; 12(4):1767-78.
[Nucleic Acids Res. 1984]Nucleic Acids Res. 1990 Mar 11; 18(5):1261-70.
[Nucleic Acids Res. 1990]Nucleic Acids Res. 1992 Aug 25; 20(16):4129-36.
[Nucleic Acids Res. 1992]Nucleic Acids Res. 1990 Mar 11; 18(5):1261-70.
[Nucleic Acids Res. 1990]Virology. 1989 Dec; 173(2):591-9.
[Virology. 1989]Nucleic Acids Res. 1984 Feb 24; 12(4):1767-78.
[Nucleic Acids Res. 1984]J Virol. 1998 Oct; 72(10):7807-14.
[J Virol. 1998]Virology. 1996 Oct 1; 224(1):256-67.
[Virology. 1996]Hum Genet. 1998 Jan; 102(1):107-16.
[Hum Genet. 1998]Virology. 2006 Mar 15; 346(2):373-8.
[Virology. 2006]Genomics. 2000 May 15; 66(1):98-103.
[Genomics. 2000]J Virol. 1998 Oct; 72(10):8384-91.
[J Virol. 1998]Virology. 1997 Jul 21; 234(1):14-30.
[Virology. 1997]Virology. 1996 Oct 1; 224(1):256-67.
[Virology. 1996]Proc Natl Acad Sci U S A. 2007 Apr 10; 104(15):6261-5.
[Proc Natl Acad Sci U S A. 2007]J Mol Biol. 1997 Aug 1; 270(5):674-87.
[J Mol Biol. 1997]Virology. 2006 Mar 15; 346(2):373-8.
[Virology. 2006]Virology. 2002 Jan 20; 292(2):177-84.
[Virology. 2002]Retrovirology. 2006 Mar 2; 3():15.
[Retrovirology. 2006]Gene. 2006 Feb 1; 366(2):335-42.
[Gene. 2006]EMBO Rep. 2005 Oct; 6(10):956-60.
[EMBO Rep. 2005]Retrovirology. 2006 Nov 8; 3():80.
[Retrovirology. 2006]Retrovirology. 2005 Aug 10; 2():50.
[Retrovirology. 2005]Proc Natl Acad Sci U S A. 2005 Sep 20; 102(38):13355-62.
[Proc Natl Acad Sci U S A. 2005]Virology. 2006 Mar 15; 346(2):373-8.
[Virology. 2006]J Virol. 1998 Oct; 72(10):7807-14.
[J Virol. 1998]Virology. 2002 Jan 20; 292(2):177-84.
[Virology. 2002]Retrovirology. 2006 Mar 2; 3():15.
[Retrovirology. 2006]Nucleic Acids Res. 1990 Mar 11; 18(5):1261-70.
[Nucleic Acids Res. 1990]PLoS One. 2008 Jun 25; 3(6):e2521.
[PLoS One. 2008]Epigenetics Chromatin. 2008 Oct 27; 1(1):2.
[Epigenetics Chromatin. 2008]Nucleic Acids Res. 2007; 35(8):2544-53.
[Nucleic Acids Res. 2007]Virology. 2002 Jan 20; 292(2):177-84.
[Virology. 2002]Retrovirology. 2006 Mar 2; 3():15.
[Retrovirology. 2006]Gene. 2006 Feb 1; 366(2):335-42.
[Gene. 2006]EMBO Rep. 2005 Oct; 6(10):956-60.
[EMBO Rep. 2005]Retrovirology. 2006 Nov 8; 3():80.
[Retrovirology. 2006]Nucleic Acids Res. 1989 Jul 11; 17(13):4925-35.
[Nucleic Acids Res. 1989]Genomics. 1994 Dec; 24(3):509-15.
[Genomics. 1994]Genetics. 2005 Jul; 170(3):1231-8.
[Genetics. 2005]J Virol. 1999 Dec; 73(12):9803-9.
[J Virol. 1999]Cytogenet Genome Res. 2005; 110(1-4):462-7.
[Cytogenet Genome Res. 2005]J Mol Biol. 1994 Feb 4; 235(5):1501-31.
[J Mol Biol. 1994]Proc Int Conf Intell Syst Mol Biol. 1995; 3():215-21.
[Proc Int Conf Intell Syst Mol Biol. 1995]Nucleic Acids Res. 1994 Nov 11; 22(22):4673-80.
[Nucleic Acids Res. 1994]Brief Bioinform. 2004 Jun; 5(2):150-63.
[Brief Bioinform. 2004]Nucleic Acids Res. 1997 Dec 15; 25(24):4876-82.
[Nucleic Acids Res. 1997]Nucleic Acids Res. 1999 Jan 1; 27(1):318-22.
[Nucleic Acids Res. 1999]Nucleic Acids Res. 1990 Oct 25; 18(20):6097-100.
[Nucleic Acids Res. 1990]