Copyright © The Author 2005. Published by Oxford University Press. All rights reserved.

Scoring schemes of palindrome clusters for more sensitive prediction of replication origins in herpesviruses

1 Department of Mathematics, National University of Singapore, Singapore
2 Department of Statistics and Applied Probability, National University of Singapore, Singapore
3 Bioinformatics Program, University of Texas at El Paso, El Paso, Texas 79968, USA
4 Department of Mathematical Sciences, University of Texas at El Paso, El Paso, Texas 79968, USA

* To whom correspondence should be addressed. Tel: +65 6847 1653; Fax: +65 6779 5452; Email: matchewd/at/nus.edu.sg

Received May 30, 2005; Revised June 27, 2005; Accepted August 15, 2005.

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oupjournals.org

Abstract

Many empirical studies show that there are unusual clusters of palindromes, closely spaced direct and inverted repeats around the replication origins of herpesviruses. In this paper, we introduce two new scoring schemes to quantify the spatial abundance of palindromes in a genomic sequence. Based on these scoring schemes, a computational method to predict the locations of replication origins is developed.
When our predictions are compared with 39 known or annotated replication origins in 19 herpesviruses, close to 80% of the replication origins lie within 2% of the genome length of a predicted location. A list of predicted locations of replication origins in all the known herpesviruses with complete genome sequences is reported.

INTRODUCTION

Early studies (1,2) have reported that the nucleotide sequences around replication origins of certain herpesviruses have complex repetitive structures of closely spaced direct and inverted repeats. A palindrome is a special case of inverted repeats where a segment of nucleotide bases is immediately followed by its reverse complement. A high concentration of palindromes around replication origins has been found in these herpesviruses.

Herpesviruses utilize two different types of replication origins during lytic and latent infections. For each type of origin, the count and locations in the genome vary from one kind of herpesvirus to another. Most herpesviruses have one to two copies of latent and lytic origins. The presence of palindromes around replication origins is prevalent in both latent and lytic types (1–5).

As the central step in the reproduction of herpesviruses, viral DNA replication has been the target for a number of anti-herpesvirus drugs (e.g. acyclovir). Understanding the molecular mechanisms involved in DNA replication is of great importance in further developing strategies to control the growth and spread of viruses (6–8). Since replication origins are regarded as major sites for regulating genome replication, labor-intensive laboratory procedures have been used to search for replication origins (9–11). With the increasing availability of genomic DNA sequence data, one way to save time and resources would be to scan the viral genome sequence for the expected sequence features with a computer program before an experimental search for replication origins is launched.

Masse et al. (3) first used this computational approach to predict the replication origin oriLyt on the human cytomegalovirus (HCMV) and then confirmed it by experimentation. In that computational analysis, one of the sequence features scanned for in the genome sequence is a high concentration of palindromes of length 10 or above clustering within a window of 1000 bases. A palindrome reads exactly the same from the 5′ end to the 3′ end on both strands of DNA (see Figure 1).
Palindromes play important roles as protein-binding sites in DNA replication processes [(12), Chapter 1]. The local 2-fold symmetry created by the palindrome provides a binding site for DNA-binding proteins, which are often dimeric in structure. Such double binding markedly increases the strength and specificity of the binding interaction [(13), Chapter 8]. The high concentration of palindromes around replication origins is generally attributed to the fact that the initiation of DNA replication typically requires the binding of an assembly of enzymes to these DNA sequences. Helicase is one such enzyme: it binds to the initiation site, locally unwinds the DNA helical structure, and pulls apart the two complementary strands. This explanation is consistent with the observation of AT-rich regions, believed to facilitate the unwinding, in replication origin domains of the genome (5).

Leung et al. (14) describe how an evaluation criterion, based on the scan statistics (15,16), is developed for assessing palindrome clusters by modeling the occurrences of palindromes in the genome as points randomly sampled from the unit interval according to the uniform distribution. By identifying windows on the genome sequence containing statistically significant clusters of palindromes, the scan statistics, in principle, provide a method to predict likely locations of replication origins. This criterion, however, essentially assesses a window of the genome only by the count of palindromes contained in it, regardless of the actual palindrome lengths. As a result, it misses some replication origins which contain one extremely long palindrome rather than a cluster of moderately long ones.

In the present paper, we propose two new schemes for evaluating palindrome clusters and use the rankings of these evaluation criteria to predict the replication origins in the herpesviruses.
By checking against known replication origins reported either in published literature or in GenBank annotations, we assess the accuracy of the new prediction schemes. These assessments demonstrate that there is a substantial improvement over the original scan statistics criterion.

In the Methods section, we describe the main steps of the prediction method and three scoring schemes. The first scoring scheme, called the palindrome count scheme (PCS), is essentially the scan statistics method first described by Leung et al. (14), and further discussed in the articles of Leung and Yamashita (17), and Leung et al. (4). Two new scoring schemes, namely the palindrome length scheme (PLS) and the base-pair weighted scheme (BWS), are introduced as measures of palindrome clusters. In the Results and Discussion section, we report the results of applying these scoring schemes to predict the locations of replication origins for 39 fully sequenced herpesviruses, and compare the prediction accuracies in terms of sensitivity and positive predictive value. A few concluding remarks are given in the final section.

METHODS

We propose a computational method to identify regions of a genome which harbor unusual clusters of palindromes. This, in turn, becomes the basis of our method to predict replication origins for the herpesviruses.

Table 1 presents the viruses to be analyzed. The data set comprises all complete genome sequences of the herpesvirus family downloaded from GenBank at the NCBI web site in April 2005. For each virus, we list its abbreviation, accession number, sequence length and the relative frequencies of the four nucleotide bases in the genome (see Table 1).
Our method for predicting replication origins consists of four basic steps: (i) locate palindromes at or above a prescribed length; (ii) choose a scoring scheme for palindromes; (iii) compute a score for each window of the genome according to the chosen scoring scheme; and (iv) select regions with high scores.

Step (i): Locating palindromes at or above a prescribed length

As very short palindromes occur frequently by chance, a parameter, L, needs to be chosen such that palindromes of length below 2L are not considered in the analysis. Leung et al. (4) propose a procedure, based on bench-marking with the well-studied HCMV, for the choice of L. This choice takes into account the length of the sequence, as well as the base frequencies in the genome. Using this criterion, L is chosen to be 6 for the BoHV1, BoHV5, CeHV1, HSV1, HSV2 and SHV1 sequences and 5 for the other sequences.

Once the minimal palindrome length has been chosen, the sequences are run through the palindrome program, which is part of EMBOSS [European Molecular Biology Open Software Suite, (18)], to extract the palindrome positions and lengths. Each of these palindromes will be assigned a score according to a scoring scheme chosen in the next step. Note that although it is possible for one palindrome to contain a shorter one (e.g. the length 12 palindrome ACCGTGCACGGT contains the length 10 palindrome CCGTGCACGG), EMBOSS automatically discards the shorter redundant palindrome and reports only the longest one.

Step (ii): Choosing a scoring scheme for palindromes

Three schemes for scoring palindromes are described. In all of them, any palindrome of length less than 2L always gets a score of 0.
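The palindrome search of Step (i) can be illustrated with a short, self-contained sketch. It is not the EMBOSS palindrome program and does not reproduce its options or output format; the function name `find_palindromes` and the 0-based `(start, length)` output are assumptions made only for illustration. The sketch looks for perfect, even-length palindromes (a segment immediately followed by its reverse complement) and, by extending each center as far as possible, reports only the longest palindrome at that center.

```python
# Illustrative stand-in for Step (i); NOT the EMBOSS palindrome program.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def find_palindromes(seq, L):
    """Return (start, length) for each maximal perfect palindrome of length >= 2L.

    start is a 0-based index (an assumption of this sketch). A palindrome is an
    even-length segment whose right half is the reverse complement of its left half.
    """
    seq = seq.upper()
    n = len(seq)
    found = []
    # c is the index of the first base of the right half of a candidate.
    for c in range(1, n):
        arm = 0
        # Extend outwards while the left base pairs with the right base.
        while (c - 1 - arm >= 0 and c + arm < n
               and COMPLEMENT.get(seq[c - 1 - arm]) == seq[c + arm]):
            arm += 1
        if arm >= L:
            found.append((c - arm, 2 * arm))
    return found

# The example from the text: only the length-12 palindrome is reported,
# not the length-10 palindrome nested inside it.
print(find_palindromes("ACCGTGCACGGT", 5))
```

Because each center is extended to its maximal arm length, a shorter palindrome sharing the same center is never reported separately, which mirrors the behavior described for the nested example above.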
We give a simple example of calculating the BWS0 score. In the Markov model with order m = 0, the letters in the sequence are independent of each other. A palindrome containing respectively $n_A$, $n_C$, $n_G$, $n_T$ of A, C, G and T occurs with probability $p_A^{n_A}\, p_C^{n_C}\, p_G^{n_G}\, p_T^{n_T}$, where $p_A$, $p_C$, $p_G$, $p_T$ denote the frequencies of the four bases in the genome.

Step (iii): Computing the window score

The score of a window in the genome is simply the total of the scores of all the palindromes occurring in this window. A palindrome is considered to be in the window if its left-center is. By trying out a variety of window lengths with the method, we have found that it is best to choose the window length w at 0.5% of the genome length, rounded down to the nearest hundred bases for convenience. Also, we let consecutive windows overlap by half their lengths. That is, the first window spans the first through the wth bases, the second from the (w/2 + 1)th through the (3w/2)th bases, and so on.

Step (iv): Selecting regions with significant palindrome clusters

For the PCS, regions that harbor statistically significant clusters of palindromes are identified using the scan statistics criterion as described in Leung et al. (14). As the criteria for statistical significance for PLS and BWS have not yet been established, we use a non-parametric approach where a fixed number of top scoring windows are chosen as the predicted locations of replication origins. It is well known that herpesviruses have multiple replication origins. However, there does not appear to be any obvious rule to determine the number of top scoring windows that one should take. Based on sensitivity and positive predictive value considerations (defined below), we find that using the top 3–5 ranked windows for prediction works well for the herpesviruses.

RESULTS AND DISCUSSION

Scan statistics method versus the new scoring schemes

To compare and contrast the two new scoring schemes with the scan statistics method, now called PCS, the sliding window plots for HCMV and HSV1 using the PCS, PLS and BWS0 scoring schemes are displayed in Figure 2.
Table 2 shows the top 3 scoring windows for each of the 39 viruses under both the PLS and BWS schemes. The numbers in the table indicate the middle positions of the windows. In cases where two or more high scoring windows are close to one another, only one of them is picked to represent the region that gave the high scores. We adopt the practice that when a certain high scoring window is chosen, the neighboring 8 windows both to the left and to the right of it will not be considered subsequently. Rows that are shaded indicate that the particular viruses have known replication origins either from literature or from annotation. Underlined entries denote the middle positions of the windows which are within 2 map units (a map unit, abbreviated mu, is 1% of the genome length) of known replication origins. Shaded rows without any underlined entries show that the computational method fails to predict the known origins of replication. Finally, rows that are not shaded denote those viruses whose origins of replication are not known, as far as we know. Table 3 lists the regions with significant clusters of palindromes as found by the PCS scheme.
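The window scoring and the selection of top scoring windows described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function names, the 0-based coordinates, the per-palindrome `scores` list supplied by the caller, and the decision to ignore a final partial window are all choices made for this sketch. The example uses a score of 1 per palindrome, which corresponds to counting palindromes in each window.

```python
def window_length(genome_length):
    """0.5% of the genome length, rounded down to the nearest hundred bases."""
    w = genome_length // 20000 * 100
    if w == 0:
        raise ValueError("genome too short for a window of at least 100 bases")
    return w

def window_scores(palindromes, scores, genome_length, w):
    """Sum palindrome scores in windows of length w that overlap by half.

    palindromes: list of (start, length) with 0-based start (assumed format).
    scores: one number per palindrome, computed by whichever scheme is chosen.
    A palindrome belongs to a window if its left-center base lies in it.
    Returns a list of (window_start, total_score).
    """
    step = w // 2
    starts = list(range(0, max(genome_length - w, 0) + 1, step))
    totals = [0.0] * len(starts)
    for (start, length), s in zip(palindromes, scores):
        left_center = start + length // 2 - 1
        for i, ws in enumerate(starts):
            if ws <= left_center < ws + w:
                totals[i] += s
    return list(zip(starts, totals))

def top_windows(scored, k=3, exclude=8):
    """Pick the k highest scoring windows; once a window is chosen, the
    `exclude` neighboring windows on each side are not considered again."""
    order = sorted(range(len(scored)), key=lambda i: scored[i][1], reverse=True)
    chosen, blocked = [], set()
    for i in order:
        if i in blocked:
            continue
        chosen.append(i)
        blocked.update(range(i - exclude, i + exclude + 1))
        if len(chosen) == k:
            break
    return [scored[i] for i in chosen]

# Toy example with hypothetical palindrome positions, each scored 1.
pals = [(0, 12), (600, 10), (1500, 10)]
scored = window_scores(pals, [1, 1, 1], genome_length=4000, w=1000)
print(top_windows(scored, k=2, exclude=1))
```

Keeping the scoring scheme outside `window_scores` means the same windowing and ranking code can be reused for any per-palindrome score.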
Prediction accuracy

We next examine the correspondence between the locations of these high scoring windows and those of the known replication origins. From GenBank sequence entries, annotations and literature, we are able to compile a list of 39 known replication origins for some of the viruses in our dataset. Table 4 shows the distance between each known origin and the nearest significant palindrome cluster for PCS, or the nearest high scoring window for PLS and BWS1, if the center of the cluster or window is within 2 mu of the origin. Otherwise a ‘—’ is entered. The distance is calculated from the mid-point of the window to the mid-point of the closest replication origin. Table 4 shows that both PLS and BWS substantially improve the prediction accuracy of replication origins. For the PLS and BWS, we have used the top 3 scoring windows for each virus to construct this table.
Prediction accuracy of the different schemes can be quantified by two commonly accepted measures: sensitivity and positive predictive value (PPV). In our context, sensitivity is the percentage of known origins that are close to the regions suggested by the prediction, and positive predictive value is the percentage of identified regions that are close to the known origins.

[Figure 3]
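The two accuracy measures can be made concrete with a small sketch. It assumes, as in the comparison above, that "close" means the mid-points are within 2 map units (2% of the genome length). The function names are hypothetical, the positions in the example are invented, and the calculation is shown for a single genome, whereas a full evaluation would pool origins and predictions across all viruses.

```python
def is_close(origin_mid, window_mid, genome_length, max_mu=2.0):
    """True if two mid-points are within max_mu map units (1 mu = 1% of genome)."""
    return abs(origin_mid - window_mid) <= max_mu / 100.0 * genome_length

def sensitivity_ppv(origins, predictions, genome_length, max_mu=2.0):
    """Return (sensitivity, PPV) in percent for one genome.

    origins: mid-points of known replication origins.
    predictions: mid-points of predicted windows or clusters.
    """
    hit_origins = sum(
        any(is_close(o, p, genome_length, max_mu) for p in predictions)
        for o in origins
    )
    hit_predictions = sum(
        any(is_close(o, p, genome_length, max_mu) for o in origins)
        for p in predictions
    )
    sensitivity = 100.0 * hit_origins / len(origins) if origins else float("nan")
    ppv = 100.0 * hit_predictions / len(predictions) if predictions else float("nan")
    return sensitivity, ppv

# Invented positions on a hypothetical 100 000 base genome (2 mu = 2000 bases).
sens, ppv = sensitivity_ppv([1000, 50000], [1500, 30000, 80000], 100000)
print(sens, ppv)
```

Taking more top scoring windows can only keep or raise the sensitivity, but each extra window that is not near a known origin lowers the PPV, which is the trade-off behind choosing how many windows to report.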
Difference between PLS and BWS

Note that both PLS and BWS take the length of the palindromes into account, as longer palindromes have lower probability of occurrence than shorter ones. Moreover, the BWS takes into account the base and word frequencies which affect the probability of occurrence of the palindrome. Consider, for example, the BWS0 score
In essence, the BWS includes more information about the sequence in its prediction and so we expect it to give better prediction accuracy. Our results show that this is indeed true. When we use 3 or more top ranking windows, the BWS performs better than the PLS in terms of both sensitivity and positive predictive value. Suspecting that the probability of occurrence of palindromes might not be well estimated on the basis of global base and word frequencies, we also try calculating palindrome probabilities using the base and word frequencies of the local window rather than those of the entire genome.

[Figure 4]
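The effect of switching from global to local frequencies can be illustrated for the order 0 model, in which bases are independent and the probability of a given word is the product of its base frequencies. The sketch below is only an illustration of that probability calculation; it does not reproduce the BWS score itself, and the function names and the toy sequences are assumptions introduced here.

```python
import math
from collections import Counter

def base_frequencies(seq):
    """Relative frequencies of A, C, G, T in seq (other symbols are ignored)."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

def log_prob_order0(word, freqs):
    """Natural log of the probability of `word` when bases are independent.

    Raises ValueError if a base in `word` has frequency 0 in `freqs`.
    """
    return sum(math.log(freqs[b]) for b in word.upper())

# Toy comparison: the same palindrome is less surprising in a GC-rich region.
palindrome = "GCGC"
global_freqs = base_frequencies("ACGT" * 10)   # uniform base composition
local_freqs = base_frequencies("GGGGCCCCAT")   # hypothetical GC-rich window
print(log_prob_order0(palindrome, global_freqs))
print(log_prob_order0(palindrome, local_freqs))
```

In this toy case the palindrome has a higher probability under the local frequencies than under the global ones, so any score that rewards improbable palindromes would rate it lower when local composition is used.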
Further improvement of the algorithm

While our results show that using PLS and BWS with the ranking approach clearly outperforms the PCS, we note that the PCS is the only scheme for which a rigorous statistical significance criterion, based on the probability distribution of the scan statistics, is currently available. The probability distributions of the maximal window scores with PLS and BWS have yet to be established. We have some preliminary results on approximating the distribution of the window score under PLS by a compound Poisson distribution. The compound Poisson distribution is motivated from a marked Poisson process point of view. The occurrence of a palindrome of length 2L and above is modeled by a Poisson process (4), and the actual length of this palindrome is modeled by a geometric distribution.

On closer examination of the known replication origins in this set of genome sequences, we notice that some of the origins missed by this prediction algorithm are actually rather long approximate palindromes. They are missed because we choose to consider only perfect palindromes. For example, in HSV2, allowing just one error would have let us pick up a 136 base long approximate palindrome centered at 62 930, which is where the reported replication origin is located. If we include these approximate palindromes in our consideration, the sensitivity can be further increased.

CONCLUDING REMARKS

As mentioned in the Introduction, palindromes are merely one type of sequence feature known to be associated with replication origins. Other frequently observed characteristics around replication origins include clustering of closely spaced direct and inverted repeats, as well as high AT content. We have examined each of these other types of sequence features and found that none of them, when used alone on our data set, reaches the level of prediction accuracy offered by the BWS. However, it is likely that the prediction accuracy can be further improved by appropriately incorporating them in the prediction scheme. In fact, several replication origins in BoHV4, EHV4 and HSV2 which are not identified by any of PCS, PLS or BWS can be easily detected by the high local AT content around them. Exactly how the different sequence features should be combined to produce the optimal prediction results is the subject of an ongoing investigation.

While it is encouraging to see that close to 80% of replication origins can be predicted using a palindrome-based scoring scheme like BWS, we have also noted that the positive predictive value is rather low whenever the corresponding sensitivity exceeds 50%. This means that a substantial percentage of the high scoring windows do not correspond to confirmed replication origins. On closer examination of these high scoring windows which are not replication origins, some of them turn out to be regulatory sequences such as transcription factor binding sites. So far, we have not made use of palindromes to predict regulatory sites, but this would be an important area to explore.

Our prediction scheme is geared towards herpesviruses and still needs to be tested on other DNA viruses. A few other methods have been proposed for the prediction of replication origins in bacterial, archaeal and yeast genomes (20–23). These methods, which are based on DNA asymmetry, flanking sequence similarity and z-curves, might be adapted to work on viral DNA as well.

Finally, we note that these endeavors to accurately predict replication origins have motivated several interesting and challenging mathematical problems about random letter sequences and probability distributions of patterns on them. We are now dealing with palindromes only, but there will be a stream of similar problems about direct and inverted repeats that call for efforts from mathematical scientists.
Acknowledgments

We would like to thank the editor and two anonymous reviewers for helpful comments and suggestions. Kwok Pui Choi was supported by BMRC grant BMRC01/1/21/19/140 and National University of Singapore ARF Research grant R-146-000-068-112; and Ming-Ying Leung by NIH grants 5S06-GM08012-34 and RCMI 2G13-RR008124. Funding to pay the Open Access publication charges for this article was provided by NIH grant 5S06-GM08012-34.

Conflict of interest statement. None declared.

REFERENCES

1. Weller S.K., Spadaro A., Schaffer J.E., Murray A.W., Maxam A.M., Schaffer P.A. Cloning, sequencing, and functional analysis of oriL, a herpes simplex virus type 1 origin of DNA synthesis. Mol. Cell. Biol. 1985;5:930–942.
2. Reisman D., Yates J., Sugden B. A putative origin of replication of plasmids derived from Epstein–Barr virus is composed of two cis-acting components. Mol. Cell. Biol. 1985;5:1822–1832.
3. Masse M.J., Karlin S., Schachtel G.A., Mocarski E.S. Human cytomegalovirus origin of DNA replication (oriLyt) resides within a highly complex repetitive region. Proc. Natl Acad. Sci. USA. 1992;89:5246–5250.
4. Leung M.Y., Choi K.P., Xia A., Chen L.H.Y. Nonrandom clusters of palindromes in herpesvirus genomes. J. Computat. Biol. 2005;12:331–354.
5. Lin C.L., Li H., Wang Y., Zhu F.X., Kudchodkar S., Yuan Y. Kaposi's sarcoma-associated Herpesvirus lytic origin (ori-Lyt)-dependent DNA replication: identification of the ori-Lyt and association of K8 bZip protein with the origin. J. Virol. 2003;77:5578–5588.
6. Delecluse H.J., Hammerschmidt W. The genetic approach to the Epstein–Barr virus: from basic virology to gene therapy. J. Clin. Pathol. Mol. Pathol. 2000;53:270–279.
7. Hartline C.B., Harden E.A., Williams-Aziz S.L., Kushner N.L., Brideau R.J., Kern E.R. Inhibition of herpesvirus replication by a series of 4-oxo-dihydroquinolines with viral polymerase activity. Antiviral Res. 2005;65:97–105.
8. Villarreal E.C. Current and potential therapies for the treatment of herpesvirus infections. Prog. Drug Res. 2003;60:263–307.
9. Zhu Y., Huang L., Anders D.G. Human cytomegalovirus oriLyt sequence requirements. J. Virol. 1998;72:4989–4996.
10. Newton C.S., Theis J.F. DNA replication joins the revolution: whole genome views of DNA replication in budding yeast. BioEssays. 2002;24:300–304.
11. Deng H., Chu J.T., Park N., Sun R. Identification of cis sequences required for lytic DNA replication and packaging of murine gammaherpesvirus 68. J. Virol. 2004;78:9123–9131.
12. Kornberg A., Baker T.A. DNA Replication, 2nd edn. New York: W. Freeman; 1992.
13. Creighton T.E. Proteins. New York: W.H. Freeman; 1993.
14. Leung M.Y., Schachtel G.A., Yu H.S. Scan statistics and DNA sequence analysis: the search for an origin of replication in a virus. Nonlinear World. 1994;1:445–471.
15. Glaz J. Approximations and bounds for the distribution of the scan statistics. J. Am. Statist. Assoc. 1989;84:560–566.
16. Dembo A., Karlin S. Poisson approximations for r-scan processes. Ann. Appl. Probab. 1992;2:329–357.
17. Leung M.Y., Yamashita T.E. Applications of the scan statistic in DNA sequence analysis. In: Glaz J., Balakrishnan N., editors. Scan Statistics and Applications. Boston: Birkhauser Publishers; 1999. pp. 269–286.
18. Rice P., Longden I., Bleasby A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genetics. 2000;16:276–277.
19. Durbin R., Eddy S., Krogh A., Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1998.
20. Breier A.M., Chatterji S., Cozzarelli N.R. Prediction of Saccharomyces cerevisiae replication origins. Genome Biol. 2004;5:R22.
21. Salzberg S.L., Salzberg A.J., Kerlavage A.R., Tomb J-F. Skewed oligomers and origins of replication. Gene. 1998;217:57–67.
22. Mackiewicz P., Zakrzewska-Czerwinska J., Zawilak A., Dudek M.R., Cebrat S. Where does bacterial replication start? Rules for predicting the oriC region. Nucleic Acids Res. 2004;16:3781–3791.
23. Zhang R., Zhang C.T. Identification of replication origins in archaeal genomes based on the Z-curve method. Archaea. 2004;1