![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||||||||
Copyright © 2008 The Author(s) Pseudocounts for transcription factor binding sites 1Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8562, 2Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Tokyo 135-0064, 3Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639 and 4Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency (JST), 5-3 Yonbancho, Chiyoda-ku, Tokyo 1002-0081, Japan *To whom correspondence should be addressed. Tel: Phone: +81 3 5449 5131; Fax: +81 3 5449 5133; Email: knakai/at/ims.u-tokyo.ac.jp Received September 24, 2008; Revised November 12, 2008; Accepted December 5, 2008. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract To represent the sequence specificity of transcription factors, the position weight matrix (PWM) is widely used. In most cases, each element is defined as a log likelihood ratio of a base appearing at a certain position, which is estimated from a finite number of known binding sites. To avoid bias due to this small sample size, a certain numeric value, called a pseudocount, is usually allocated for each position, and its fraction according to the background base composition is added to each element. So far, there has been no consensus on the optimal pseudocount value. In this study, we simulated the sampling process by artificially generating binding sites based on observed nucleotide frequencies in a public PWM database, and then the generated matrix with an added pseudocount value was compared to the original frequency matrix using various measures. Although the results were somewhat different between measures, in many cases, we could find an optimal pseudocount value for each matrix. These optimal values are independent of the sample size and are clearly correlated with the entropy of the original matrices, meaning that larger pseudocount vales are preferable for less conserved binding sites. As a simple representative, we suggest the value of 0.8 for practical uses. INTRODUCTION Transcription factors regulate gene expression by binding to cis-regulatory elements in DNA. These binding sites typically exhibit characteristic sequence patterns, called ‘motifs’. Experimental discovery of binding sites, however, remains non-trivial. So, although not all motifs are known and motif models are immature, motif matching is widely used to predict binding sites in DNA sequences. The position weight matrix (PWM) is a common way of representing a motif (1). A PWM has a numeric score for each of the four nucleotide types (rows) at each position in the motif (columns). It can be scanned across a DNA sequence to predict binding sites, as follows. If the PWM has width w, each w-long window of DNA (w-mer) receives a score obtained by summing the scores of the nucleotides at each position. If the w-mer score exceeds some threshold, it is predicted as a binding site. A PWM is usually created from a position frequency matrix (PFM) via a position probability matrix (PPM). The PFM is constructed from a sample of binding sites, by counting the number of nucleotides of each type (row) at each position (column). The PPM is just a probabilistically normalized form of the PFM, so that each column sums to 1. Finally, the PWM is obtained by logarithmic transformation of the PPM divided by the nucleotides’ background probabilities. Thus, PWM scores are log likelihood ratios. Motif matrices are often visualized as sequence logos (e.g. Figure 1
Starting from a small sample of binding sites, we will often get zeros in the PFM, leading to zeros in the PPM and negative infinities in the PWM: this is frequently regarded as undesirable (3). The reasoning is that nucleotides that can occur in the motif will often be absent from a small sample by chance, and assigning them a probability of zero is too harsh. It is important to distinguish this motivation from the merely technical issue that it can be troublesome to deal with negative infinity using computers. In any case, it is common practice to add so-called pseudocounts to the PFM, in order to avoid zero probabilities. Pseudocounts can also be motivated by a Bayesian approach: if we assume a Dirichlet prior distribution for nucleotide frequencies, then the posterior mean estimator is equivalent to adding pseudocounts to the observed counts (3). In spite of the ubiquity of pseudocounts, there is no standard way of choosing them. Many types of pseudocounts are used, 0.01 (4), 1 (5), 1.5 (6), 2 (7), 4 (8) and the square root of the number of binding sites in the sample (9). (In this article, when we speak of a pseudocount of, e.g. 1.5, we mean that 1.5/4 is added to each of the four counts in a column.) Frith et al. chose pseudocounts by fitting a Dirichlet prior to motifs in the TRANSFAC database (6). However, the actual effect of changing pseudocount values has not been studied systematically: detailed evaluations have been made only for protein motifs (10–12). Therefore, more practical analysis is required for DNA motifs. Here, we evaluate pseudocount choices using known transcription factor binding sites. The main approach in this study is comparison between an original PPM, which we regard as representing the real sequence specificity of each transcription factor, and a sampled PPM with pseudocounts. The sampled motif matrix is created by stochastic generation from the original PPM. The results show how different pseudocount choices affect the similarity of the original and sampled PPMs. Our results provide guidance on choosing pseudocounts for motif matching. MATERIALS AND METHODS JASPAR dataset We obtained PFMs from the JASPAR database (13). JASPAR provides non-redundant experimentally defined transcription factor binding site motifs for multicellular eukaryotes. JASPAR CORE 2008 has 138 motifs, and provides the original sample of binding sites for 124 of these. We removed matrices that have no binding sites. Additionally, the samples ‘MA0030.sites’ and ‘MA0031.sites’ are identical. Therefore we removed these two. Finally we obtained 122 PFMs. Sampled PFM from original PFM To generate sequences from a JASPAR motif, we made a PPM from the PFM using the formula: pa,i = ca,i/m. pa,i is the probability of nucleotide a in position i; ca,i is the observed count of nucleotide a in position i; and m is the number of sequences. Using these probabilities, we generated sequences randomly as virtual transcription factor binding sites. In this study, we generated sets of 10, 20, 30, 40 and 50 sequences. Pseudocount addition For each sampled PFM, we made a sampled PPM by adding pseudocounts:
Comparison procedures For comparing original PPMs and sampled PPMs, we used seven methods, in two categories. One category is matrix-based comparison: comparison of corresponding elements of matrices. The other category is sequence-based comparison, which involves enumeration of all possible w-mers. The probability s that a PPM generates a w-mer is
The probabilities of generating each w-mer from both the original and generated PPMs were used in the comparison function. Thus, matrix-based comparison tests how accurately the original matrix is recovered, whereas sequence-based comparison tests how accurately the matrix's function is recovered. Enumeration of all w-mers for large motifs requires prohibitive calculation time. Therefore we removed motif positions with higher entropy. The upper limit for motif length is 10. If the original motif was longer than 10, we used only the 10 consecutive positions with minimum entropy. The formula for entropy h is
Comparison functions Euclidean distance (ED) was used to compare motifs. The formula for matrix-based comparison is
Cosine distance (COS) was also used to compare motifs. The formula for matrix-based comparison is
We tried 61 pseudocounts for ED, COS and TVD. Due to heavy calculation time, we tried only 10 pseudocount values between 0.01 and 10.0 for SRC.
RESULTS AND DISCUSSION Differences between the seven comparison methods As an example, we show the results for a typical matrix, MA0001 (transcription factor AGL3 in Arabidopsis thaliana), with 30 generated sequences (Figure 1 Effect of sample size Figure 2
Existence of an optimal pseudocount Table 1 indicates how often there is an optimal pseudocount value (giving minimum distance) other than 0.01 or 10.0, for each comparison method and sample size. These are pooled results for all 122 JASPAR motifs. When using SRC or TVD, the smallest distance is usually achieved with a pseudocount of 0.01 (the lowest that we tested), so smaller pseudocounts are better, and there is no optimum within the range that we examined. When using ED or COS, on the other hand, there is nearly always an optimal pseudocount within this range.
Dependence of optimal pseudocount on sample size and entropy Figure 3
Figure 3
Confirmation using real sequences The results described above were obtained using virtual binding site samples generated from an original PPM. We repeated our tests by, instead, sampling with replacement from the real binding site sequences upon which the JASPAR matrices are based. These results are essentially the same as those above (not shown). CONCLUSIONS Although optimal pseudocount values depend on the comparison method, we can draw some general conclusions. First, all comparison methods indicate that pseudocounts much above 1 are a poor choice. This argues against using the square root of the sample size (9). It also argues for using values below 4, corresponding to Dirichlet priors with an anti-peaked shape, as suggested previously (6). (The value 4—more generally, the alphabet size—is special because it corresponds to an uninformed prior.) The results using ED and COS suggest that values close to 1 are optimal, although the optimal pseudocount depends on the entropy of the original motif (which is unknown in practice). This supports the choice of Hughes et al. to use a pseudocount of 1 (5). For large sample sizes, however, pseudocounts much less than one are only marginally worse. Several previous studies have examined methods to compare motif matrices, although not with the aim of optimizing pseudocounts (14–16): they have generally concluded that ED is a good measure for their purposes. On the other hand, the results using SRC and TVD support the use of pseudocounts much less than 1, e.g. Ref. (4). It could be argued that SRC is the most relevant measure for motif scanning, since one typically makes predictions for all w-mers that score above some threshold, i.e. all w-mers above some rank. To maximize the correlation between the predicted ranks and the true ranks, it seems best to use a very low pseudocount. Thus, the widespread belief that zeros should be avoided in motif matrices may be incorrect, at least for DNA motifs. In summary, depending on the comparison method, optimal pseudocounts for transcription factor binding motifs are either around 1, or very low. It is noteworthy that our main conclusions (data size independency and correlation with entropy) for the cases when optimal values exist have been obtained by a totally different approach on a different system also (17). This agreement between our relatively empirical and intuitive approach and their theoretical approach should strengthen the validity of our conclusions. Supplementary Data Supplementary Data are available at NAR Online. FUNDING This work was supported in part by Global COE Program (Center of Education and Research for Advanced Genome-Based Medicine), MEXT, Japan. Funding for open access charge: Japan Science and Technology Agency (JST). Conflict of interest statement. None declared. [Supplementary Data]
ACKNOWLEDGEMENTS The authors would like to thank Riu Yamashita and Nicolas Sierro for helpful discussion and critical remarks. They also thank Stephen F. Altschul for letting us read their manuscript before submission. Computational time was provided by the Super Computer System, Human Genome Center, Institute of Medical Science, The University of Tokyo. REFERENCES 1. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 2004;5:276–287. [PubMed] 2. Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18:6097–6100. [PubMed] 3. Durbin R, Eddy SR, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1998. 4. Chen QK, Hertz GZ, Stormo GD. MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 1995;11:563–566. [PubMed] 5. Hughes JD, Estep PW, Tavazoie S, Church GM. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 2000;296:1205–1214. [PubMed] 6. Frith MC, Fu Y, Yu L, Chen JF, Hansen U, Weng Z. Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res. 2004;32:1372–1381. [PubMed] 7. Mironov AA, Koonin EV, Roytberg MA, Gelfand MS. Computer analysis of transcription regulatory patterns in completely sequenced bacterial genomes. Nucleic Acids Res. 1999;27:2981–2989. [PubMed] 8. Berg OG, von Hippel PH. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 1987;193:723–750. [PubMed] 9. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993;262:208–214. [PubMed] 10. Tatusov RL, Altschul SF, Koonin EV. Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc. Natl Acad. Sci. USA. 1994;91:12091–12095. [PubMed] 11. Karplus M, Sali A. Theoretical studies of protein folding and unfolding. Curr. Opin. Struct. Biol. 1995;5:58–73. [PubMed] 12. Henikoff JG, Henikoff S. Using substitution probabilities to improve position-specific scoring matrices. Comput. Appl. Biosci. 1996;12:135–143. [PubMed] 13. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004;32:D91–D94. [PubMed] 14. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8:R24. [PubMed] 15. Mahony S, Auron PE, Benos PV. DNA familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Comput. Biol. 2007;3:e61. [PubMed] 16. Pape UJ, Rahmann S, Vingron M. Natural similarity measures between position frequency matrices with an application to clustering. Bioinformatics. 2008;24:350–357. [PubMed] 17. Altschul SF, Gertz EM, Agarwala R, Schaffer AA, Yu Yi-K. PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res. 2009 doi:10.1093/nar/gkn981. |
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||||||||
Nat Rev Genet. 2004 Apr; 5(4):276-87.
[Nat Rev Genet. 2004]Nucleic Acids Res. 1990 Oct 25; 18(20):6097-100.
[Nucleic Acids Res. 1990]Comput Appl Biosci. 1995 Oct; 11(5):563-6.
[Comput Appl Biosci. 1995]J Mol Biol. 2000 Mar 10; 296(5):1205-14.
[J Mol Biol. 2000]Nucleic Acids Res. 2004; 32(4):1372-81.
[Nucleic Acids Res. 2004]Nucleic Acids Res. 1999 Jul 15; 27(14):2981-9.
[Nucleic Acids Res. 1999]J Mol Biol. 1987 Feb 20; 193(4):723-50.
[J Mol Biol. 1987]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D91-4.
[Nucleic Acids Res. 2004]Nucleic Acids Res. 2004; 32(4):1372-81.
[Nucleic Acids Res. 2004]Science. 1993 Oct 8; 262(5131):208-14.
[Science. 1993]Nucleic Acids Res. 2004; 32(4):1372-81.
[Nucleic Acids Res. 2004]J Mol Biol. 2000 Mar 10; 296(5):1205-14.
[J Mol Biol. 2000]Genome Biol. 2007; 8(2):R24.
[Genome Biol. 2007]PLoS Comput Biol. 2007 Mar 30; 3(3):e61.
[PLoS Comput Biol. 2007]