Statistical detection of cooperative transcription factors with similarity adjustment

1Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestr. 73 and 2Mathematics and Computer Science, Free University of Berlin, Takustr. 9, 14195 Berlin, Germany

*To whom correspondence should be addressed.

Associate Editor: Trey Ideker

Received September 29, 2008; Revised February 9, 2009; Accepted March 10, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Motivation: Statistical assessment of cis-regulatory modules (CRMs) is a crucial task in computational biology. Usually, one concludes from exceptional co-occurrences of DNA motifs that the corresponding transcription factors (TFs) are cooperative. However, similar DNA motifs tend to co-occur in random sequences due to the high probability of overlapping occurrences. Therefore, it is important to consider the similarity of DNA motifs in the statistical assessment.

Results: Based on previous work, we propose to adjust the window size for co-occurrence detection. Using the derived approximation, one obtains different window sizes for different sets of DNA motifs depending on their similarities. This ensures that the probabilities of co-occurrence in random sequences are equal. Applying the approach to selected similar and dissimilar DNA motifs from human TFs shows the necessity of adjustment and confirms the accuracy of the approximation by comparison to simulated data. Furthermore, it becomes clear that approaches ignoring similarities strongly underestimate P-values for cooperativity of TFs with similar DNA motifs. In addition, the approach is extended to deal with overlapping windows. We derive Chen–Stein error bounds for the approximation. Comparing the error bounds for similar and dissimilar DNA motifs shows that the approximation for similar DNA motifs yields large bounds. Hence, one has to be careful when using overlapping windows. Based on the error bounds, one can precompute the approximation errors and select an appropriate overlap scheme before running the analysis.

Availability: Software to perform the calculation for pairs of position frequency matrices (PFMs) is available at http://mosta.molgen.mpg.de, together with C++ source code for download.

Contact: utz.pape/at/molgen.mpg.de

1 INTRODUCTION

An important goal in computational biology is to decipher the transcriptional regulation of genes. Interactions between nearby transcription factors (TFs) initiate or inhibit transcription of a gene (Arnone and Davidson, 1997; Fickett, 1996; Yuh et al., 1998). TFs mainly bind to DNA upstream of genes by recognizing TF-specific sequences which can be summarized by a DNA motif. TFs which combinatorially regulate genes are called cooperative. Such TFs are assumed to have exceptionally many DNA motif occurrences in proximity to each other. Thus, a significant number of co-occurrences of the corresponding DNA motifs can be used to assess the strength of cooperativity. The set of DNA motif occurrences upstream of a gene is called a cis-regulatory module (CRM; Berman et al., 2002).

A CRM is a sequence region with dense clusters of DNA motif occurrences, as demonstrated experimentally (Clyde et al., 2003; Harbison et al., 2004) and computationally (Lifanov et al., 2003; Wagner, 1999). In general, CRMs can be divided into homotypic CRMs, which are bound by the same TF, and heterotypic CRMs, which are bound by different TFs (Brown et al., 2002; Wagner, 1997). Homotypic CRMs are often detected using a scoring function (Papatsenko et al., 2002; Wagner, 1999), e.g. FLYENHANCER (Markstein et al., 2002), SCORE (Rebeiz et al., 2002) and CLUSTER (Lifanov et al., 2003).
Common programs to find heterotypic CRMs are ClusterDraw (Papatsenko, 2007), ModuleSearcher (Aerts et al., 2003), MCAST (Bailey and Noble, 2003), eCISANALYST (Berman et al., 2004), Cister (Frith et al., 2001), Cluster-Buster (Frith et al., 2003) and TargetExplorer (Sosinsky et al., 2003). CRMs can be detected using ab initio discovery of new DNA motifs (e.g. Gupta and Liu, 2005; Zhou and Wong, 2004) or based on known DNA motifs. We assume that the DNA motifs are known. Many approaches have been proposed that integrate different kinds of data to improve CRM prediction (Manke et al., 2005; Pilpel et al., 2001; Yu et al., 2006). Since the main characteristic of CRMs is their high local density of DNA motif occurrences, one essential data source is always the DNA sequence annotated with DNA motif occurrences. Here, we focus on DNA motifs represented by position frequency matrices (PFMs; Stormo, 2000). Other approaches compute the cooperative binding energy of multiple sites of TFs (Frith et al., 2004; GuhaThakurta and Stormo, 2001) using thermodynamic models. Based on the PFM representation, GuhaThakurta (2006) classifies the approaches to find CRMs into hidden Markov models (Crowley et al., 1997; Frith et al., 2001) and occurrence-based approaches. We further divide the occurrence-based approaches into two categories (Fig. 1).
In Pape and Vingron (2008), we propose a fast and accurate approximation for the significance calculation of CRMs circumventing the position independence assumption, incorporating similarity between PFMs, and incorporating the complementary strand. We define a CRM to be a sequence region, which we call a window, of defined length where all DNA motifs of a given set have at least one occurrence. This is called the co-occurrence event. Thus, we assume that TFs only interact if their motifs occur within the window size. Although long-range interactions are reported, especially in higher organisms (e.g. Yoshida et al., 1999), it is impossible to predict such interactions on the sequence level due to high stochastic noise. In fact, the larger the window the higher the probability for the co-occurrence event to be in a random sequence. Hence, the length of the window has to be small to get statistically significant CRMs. Using TransCompel (Matys et al., 2006) to get a first idea of a good choice for the window size shows that 98% of the 375 known vertebrate composite elements have a distance of less than 100 bp (Klein and Vingron, 2007). We compute the probability of a CRM which is the probability of the co-occurrence event in a random sequence given a window length. Considering the overlap probabilities between the occurrences of the TF binding sites, we capture the (self-)similarities of the PFMs and most of the dependencies introduced by the complementary strand. In this article, we extend the approach such that one can compute the length of the window for a specific set of DNA motifs by defining the probability of the co-occurrence event as parameter. We focus on pairs of DNA motifs. Intuitively, the results show that for similar PFMs the length of the window is smaller than for dissimilar PFMs given the same probability. Due to this computation, one can adjust the window size based on the similarity of the PFMs. 
Hence, by using different window sizes for sets of PFMs with different degrees of similarity, one can obtain equal co-occurrence probabilities for all sets. Therefore, follow-up analyses do not have to consider the similarity between PFMs anymore. Otherwise, similar PFMs would yield more co-occurrence events than dissimilar PFMs just due to their similarity. This would generally bias statistics based on the number of co-occurrence events. Hence, window size adjustment by considering the similarity of PFMs is necessary. We provide strong evidence for this by comparing our approach with an approach ignoring similarities based on simulated data. Furthermore, one is interested in whether specific TFs are generally involved in the same CRMs. We call this cooperativity of TFs. In Pape and Vingron (2008), we also show how to compute the significance of cooperativity. The sequence is divided into equal-sized non-overlapping windows covering the whole sequence (Fig. 2).
In the next section, we first show that the approach can generally be extended to sets of PFMs. Afterwards, we focus on pairs of PFMs for simplicity. There, we derive formulae for the window length and explicitly state the Chen–Stein error bounds. Furthermore, we introduce the independence approach ignoring similarities and describe the dataset of human TFs and how the PFMs are selected. Section 3 applies the formulae for window length and the Chen–Stein error bounds to selected pairs of TFs and compares the new approach with the independence approach based on simulated data.

2 METHODS

We assume that each TF is given by a PFM. For each position $j$ of a sequence, we have an indicator random variable $Y_j(A)$ which is 1 if the summed score at this position reaches the threshold. We denote the random variables for the complementary strand by a prime, e.g. $Y'_j(A)$. The threshold can be controlled by the type I error $\alpha_A := P(Y_j(A) = 1) = P(Y'_j(A) = 1)$ in a random sequence. The model for the random sequence is assumed to be an i.i.d. sequence defined by the GC content. We assume this simple background model, since it causes the distribution of hits on both strands to be equal.

As stated before, a CRM is a window of given length $w$ with at least one hit of TF $A$ and one hit of TF $B$. We split up the calculation of this co-occurrence event into three parts. Let $N_w(A) = \sum_{j=1}^{w} (Y_j(A) + Y'_j(A))$ denote the random variable for the number of hits of TF $A$ in a random sequence of length $w$, where we allow hits overlapping the boundary of the window. Now, we can state the probability $p(w)$ of a CRM in a given window of length $w$ by $p(w) := P(N_w(A) > 0, N_w(B) > 0)$. Calculation using the inclusion–exclusion formula results in

\[ p(w) = 1 - P(N_w(A) = 0) - P(N_w(B) = 0) + P(N_w(A) = 0, N_w(B) = 0). \qquad (1) \]
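The inclusion–exclusion step for a pair of motifs can be illustrated with a minimal Python sketch. It assumes the three zero-count probabilities have already been computed by some other means; the function name and its arguments are hypothetical and not part of the published software.

```python
def cooccurrence_probability(p_no_a: float, p_no_b: float, p_no_ab: float) -> float:
    """Return P(N_w(A) > 0, N_w(B) > 0) from the zero-count probabilities.

    p_no_a  : P(N_w(A) = 0), no hit of motif A in the window
    p_no_b  : P(N_w(B) = 0), no hit of motif B in the window
    p_no_ab : P(N_w(A) = 0, N_w(B) = 0), no hit of either motif
    All three inputs are assumed to be supplied by the caller.
    """
    for p in (p_no_a, p_no_b, p_no_ab):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # The joint zero-count event is contained in each single zero-count event.
    if p_no_ab > min(p_no_a, p_no_b):
        raise ValueError("joint zero-count probability cannot exceed a marginal one")
    # Inclusion-exclusion on the complements of the two 'at least one hit' events.
    return 1.0 - p_no_a - p_no_b + p_no_ab
```

If the two zero-count events were independent, `p_no_ab` would equal `p_no_a * p_no_b`; a larger value reflects positive dependence between the two events.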
2.1 Sets of PFMs

So far, we derived formulae to compute the co-occurrence probability for pairs of PFMs. Here, we briefly extend the approach to deal with a set $\mathcal{S}$ of PFMs with size $|\mathcal{S}|$. Equation (1) reduces the calculation of the co-occurrence probability to computing the probabilities of the (joint) events of zero counts of the PFMs. For a set of TFs, we apply the inclusion–exclusion formula to the count variables of all PFMs:

\[ P(N_w(T) > 0 \ \text{for all } T \in \mathcal{S}) = \sum_{R \subseteq \mathcal{S}} (-1)^{|R|} \, P(N_w(T) = 0 \ \text{for all } T \in R), \]

where the sum runs over all elements $R$ of the power set of $\mathcal{S}$ and the term for the empty set equals 1. The calculation of these probabilities is straightforward using the technique described in Pape and Vingron (2008); the formulae are given in Pape (2008).

2.2 Calculate window size

From now on, we only consider pairs of PFMs, although extension to sets of PFMs is possible. In practice, the probability for the co-occurrence event is given as a parameter and the window size has to be computed. In this case, we have to find the root of $f(w) := p(w) - p^{*}$, where $p^{*}$ denotes the given co-occurrence probability.
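The search for a window size that matches a given co-occurrence probability can be sketched as follows. This is a minimal illustration that uses plain bisection on integer window sizes rather than the Newton iteration or Taylor expansion reported in the Results, and it assumes that the co-occurrence probability is non-decreasing in the window size. The function names, the `w_max` cap and the toy probability function (which treats all positions and both motifs as independent) are assumptions made for the example.

```python
from typing import Callable


def window_size_for_probability(p_of_w: Callable[[int], float],
                                target: float,
                                w_max: int = 100_000) -> int:
    """Smallest integer w in [1, w_max] with p_of_w(w) >= target.

    Assumes p_of_w is non-decreasing in w (larger windows make the
    co-occurrence event more likely).
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    if p_of_w(w_max) < target:
        raise ValueError("target is not reached within w_max")
    lo, hi = 1, w_max
    while lo < hi:
        mid = (lo + hi) // 2
        if p_of_w(mid) >= target:
            hi = mid          # mid is large enough, try smaller windows
        else:
            lo = mid + 1      # mid is too small
    return lo


def toy_cooccurrence_probability(w: int,
                                 alpha_a: float = 0.001,
                                 alpha_b: float = 0.001) -> float:
    """Toy p(w): all 2w strand positions and both motifs are independent."""
    q_a = (1.0 - alpha_a) ** (2 * w)   # no hit of A on either strand
    q_b = (1.0 - alpha_b) ** (2 * w)   # no hit of B on either strand
    return (1.0 - q_a) * (1.0 - q_b)


if __name__ == "__main__":
    w = window_size_for_probability(toy_cooccurrence_probability, 0.01)
    print(w, toy_cooccurrence_probability(w))
```

Any other monotone `p_of_w`, for example one that accounts for overlap probabilities between similar motifs, can be passed in place of the toy function.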
2.3 P-value for cooperativity

Previously, we have shown how to compute the co-occurrence probability $p(w)$ in a given window. To compute cooperativity, we suggest decomposing the sequence into non-overlapping windows of equal size and counting the number $x$ of CRMs (windows with the co-occurrence event). We define for each window $i$ a Bernoulli random variable $W_i$ which is 1 if the corresponding window contains a co-occurrence event and 0 otherwise. Denoting the number of windows by $m = n/w$ for a sequence of length $n$, we define $W := \sum_{i=1}^{m} W_i$. The number $W$ of windows with co-occurrence events is distributed as Poisson($\lambda$) with $\lambda = p(w) \cdot m$ if $p(w) \to 0$ and $m \to \infty$.

2.4 Bounds for overlapping windows

Considering overlapping windows requires the step size $s$ as an additional parameter, and the number $m$ of windows becomes $m = (n - w)/s + 1$. We assume that $n$, $s$, $w$ are chosen such that $m$, $n$, $s$, $w$ are positive integers. Obviously, overlapping windows are dependent on each other. In this case, we can still use a Binomial or Poisson distribution, but the dependencies lead to an error in the approximation. Using the Chen–Stein method (Chen, 1975), the error can be quantified. The quantification is done in terms of the total variation distance. Let $U$ and $V$ be any two random processes with values in the same space $E$; then the total variation distance between their distributions [denoted by $\mathcal{L}(\cdot)$] is

\[ d_{TV}(\mathcal{L}(U), \mathcal{L}(V)) := \sup_{A \subseteq E} |P(U \in A) - P(V \in A)|. \]

We are interested in $d_{TV}(\mathcal{L}(W), \mathrm{Po}(\lambda))$. Let $I := \{i : 0 < i \le m\}$ denote the index set of the Bernoulli variables. The main idea is to define for each Bernoulli variable $W_i$ a neighborhood set $B_i \subseteq I$ of random variables which have strong dependencies with $W_i$. We also require $i \in B_i$. In our case, there are only local dependencies since only overlapping windows are dependent on each other. Therefore, we capture all dependencies in the sets $B_i$, which means that for each window $i$ the set $B_i$ contains the index $i$ and the indices of overlapping windows to the left and to the right. Hence, we obtain the bound derived from Theorem 1 in Arratia et al. (1990) using an improved bound (Barbour et al., 1992):

\[ d_{TV}(\mathcal{L}(W), \mathrm{Po}(\lambda)) \le \lambda^{-1} (1 - e^{-\lambda}) (b_1 + b_2) \]

with

\[ b_1 := \sum_{i \in I} \sum_{j \in B_i} E[W_i] \, E[W_j], \qquad b_2 := \sum_{i \in I} \sum_{j \in B_i \setminus \{i\}} E[W_i W_j]. \]
The second bound $b_2$ is more complicated to calculate because it contains the second moment. Since we consider Bernoulli variables, the second moment is the probability that both variables are equal to one: $E[W_i W_{i+k}] = P(W_i = 1, W_{i+k} = 1)$. Considering only two PFMs $A$ and $B$, we can write this probability in terms of the count random variables by decomposing it into four disjoint events, as illustrated in Figure 3.
Denoting the size of each non-overlapping part by d=k·s while the overlapping part has a length of v=w − d, we obtain for the second moment:
To compute the bound, we observe that $E[W_i W_{i+k}]$ is independent of $i$ since all $W_i$ are identically distributed and have the same pairwise dependencies. Therefore, we simplify notation by defining $\zeta_k := E[W_i W_{i+k}]$. For the same reason, we also obtain $\zeta_k = E[W_i W_{i-k}]$. Using the further definition $\zeta = \sum_{k=1}^{r-1} \zeta_k$, we obtain for bound $b_2$, applying the same logic as above:
2.5 Alternative independence approach

To assess the necessity of incorporating dependencies into the calculation, we compare the results with an approach ignoring dependencies. For the probability of no hits, we obtain
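An independence-style calculation can be illustrated with a minimal sketch. It rests on an assumption made here for illustration, not on a formula taken from the text: all $2w$ indicator variables of a motif (both strands) are treated as independent, so that the probability of no hit is $(1 - \alpha)^{2w}$, and the two motifs are treated as independent of each other. The function names are hypothetical.

```python
def p_no_hit_independent(alpha: float, w: int) -> float:
    """P(no hit in a window of length w) if all 2w strand positions are independent.

    alpha is the per-position type I error of the motif; independence of
    positions and strands is an illustrative assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if w < 0:
        raise ValueError("w must be non-negative")
    return (1.0 - alpha) ** (2 * w)


def cooccurrence_independent(alpha_a: float, alpha_b: float, w: int) -> float:
    """Co-occurrence probability when motifs A and B are also independent."""
    q_a = p_no_hit_independent(alpha_a, w)
    q_b = p_no_hit_independent(alpha_b, w)
    # Inclusion-exclusion with a factorized joint zero-count probability.
    return 1.0 - q_a - q_b + q_a * q_b
```

Under these assumptions the result depends only on the two type I errors and the window length, so two motif pairs with equal type I errors receive the same co-occurrence probability regardless of how similar their PFMs are.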
2.6 Data

The PFM set used here is the vertebrate_non_redundant_minFP set from the TRANSFAC database (v. 11.3) (Matys et al., 2003). Since, despite its name, the set contains more than one PFM for some TFs (214 PFMs in total), we select only the first PFM per TF and obtain a set of 142 PFMs, one per TF. However, the remaining similarities between PFMs in this set are not negligible. To show this, we measure the similarity between all pairs of PFMs by the limiting covariance (Pape et al., 2008b). Then, we select the pair of PFMs with the highest similarity (0.0002): S8 (V$S8_01) and CHX10 (V$CHX10_01). We use this pair for our analysis. To assess the influence of similarity, we also select a very dissimilar pair of PFMs. Given S8, the most dissimilar PFM is HIC (V$HIC1_02) with a similarity of −0.000004. The similarity between CHX10 and HIC is higher, with a value of −0.000003. Hence, we define a pair of similar PFMs, S8 and CHX10, and two pairs of dissimilar PFMs, S8 and HIC as well as CHX10 and HIC (Fig. 4).

All analyses regarding PFMs are performed based on a balanced type I error (α) in a sequence of length 500 controlled at a level of 10% [see Pape et al. (2006) for details]. In a step called regularization, we add pseudo-counts to the position-specific distributions of the PFM according to the information content of the position (Rahmann, 2003). Simulated sequences are generated i.i.d. with 50% GC content.

3 RESULTS

In this section, we analyze the influence of the similarity between PFMs on the co-occurrence probabilities. First, we determine the window size for each pair such that the co-occurrence probability is 1%. Next, we confirm the approximated window size by a simulation. Based on these results, we compare the approximated cooperativity distributions for all pairs with the corresponding empirical distributions and the results from the independence approach.
Finally, we apply the approach to overlapping windows and report the accuracy of the approximation.

3.1 Co-occurrence probability

First, we apply the formulae for the window size given a co-occurrence probability of P=0.01 to all pairs of PFMs. The pair of similar PFMs S8:CHX10 yields a window size of 54 bp for both Newton iteration and Taylor expansion. Computing the co-occurrence probability for the window size of 54 bp yields exactly 0.01. Hence, both approximations are very accurate. For the same co-occurrence probability, the most dissimilar pair S8:HIC yields a window size of 297 bp using Newton iteration and 281 bp using Taylor expansion. The corresponding co-occurrence probabilities are 0.01 and 0.009. Hence, the Newton iteration is slightly more accurate than the Taylor expansion. The dissimilar pair CHX:HIC yields a window size of 266 bp using Newton iteration and a slightly smaller window of 252 bp using Taylor expansion. Again, the window size derived from the Newton iteration is exact in that it leads to a co-occurrence probability of 0.01, while the Taylor expansion yields 0.009.

In comparison to the similar pair, one obtains an ~5-fold larger window size for the dissimilar pairs. Since similar PFMs tend to have overlapping hits, their probability of co-occurrence, which includes overlapping hits, is high. Therefore, an occurrence of one PFM increases the probability of an occurrence of the other PFM. In contrast, dissimilar PFMs cannot overlap. Thus, the presence of one PFM decreases the probability of an (overlapping) occurrence of the other PFM. Due to the big difference in the window sizes, it is very important to consider the similarity between PFMs. The presented approach shows that one can simply adjust the window size. Hence, one would use a window size of 54 bp for the similar pair and of 297 bp and 266 bp, respectively, for the dissimilar pairs. Then, all pairs have almost equal co-occurrence probabilities.
We verify this prediction by a simulation study. After annotating 100 random sequences, each of length 1 000 000 bp, with the corresponding PFMs, we count the number of co-occurrence events given the above window sizes. The histograms for all three pairs are shown in Figure 5.
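A simulation of this kind can be sketched in a few lines of Python. The sketch below only covers generating an i.i.d. background sequence with a given GC content and counting non-overlapping windows that contain hits of both motifs; scanning the sequence with PFMs is left out. The function names are hypothetical, hit positions are assumed to be 0-based start positions, and a hit is assigned to the window that contains its start, which is one way to allow hits overlapping the window boundary.

```python
import random
from typing import Iterable, Optional


def random_sequence(length: int, gc_content: float = 0.5,
                    seed: Optional[int] = None) -> str:
    """Draw an i.i.d. DNA sequence in which C and G together have mass gc_content."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must lie in [0, 1]")
    rng = random.Random(seed)
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    # Weights follow the order of the alphabet string "ACGT".
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def count_cooccurrence_windows(hits_a: Iterable[int], hits_b: Iterable[int],
                               seq_len: int, w: int) -> int:
    """Count non-overlapping windows of length w containing hits of both motifs."""
    if w <= 0 or seq_len < w:
        raise ValueError("need 0 < w <= seq_len")
    m = seq_len // w  # number of complete windows; a shorter tail is ignored
    windows_a = {pos // w for pos in hits_a if 0 <= pos < m * w}
    windows_b = {pos // w for pos in hits_b if 0 <= pos < m * w}
    return len(windows_a & windows_b)


if __name__ == "__main__":
    seq = random_sequence(1_000, gc_content=0.5, seed=1)
    print(seq[:60])
    # Hand-made hit positions, only to show the counting step.
    print(count_cooccurrence_windows([3, 60, 120], [50, 65, 200], seq_len=216, w=54))
```

Repeating the counting step over many simulated sequences gives an empirical distribution of the number of co-occurrence windows.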
In contrast, applying the window size of one of the dissimilar pairs (e.g. 297 bp) to the similar pair would yield a co-occurrence probability of around 0.04 (retrieved by simulation). Hence, adjusting the window size reduces an almost 3- to 4-fold difference between co-occurrence probabilities to quite comparable values. As we will see next, such small differences already have a strong influence on the cooperativity P-values.

3.2 Cooperativity

Based on the co-occurrence probabilities and the window sizes, one can compute P-values for cooperativity. This is done by counting the number of windows with a co-occurrence event. The P-value is the probability of at least as many co-occurrence events as observed. A simulation with 10 000 sequences of length 100 000 bp is used as reference. In each sequence, we count the number of co-occurrence events. The frequencies of the counts form the empirical distribution (Figure 6).
The center panels of Figure 6 … The dissimilar pair CHX:HIC is compared in the right panels of Figure 6.

In summary, we can state that the independence approach works for dissimilar pairs of PFMs, while it cannot be used for similar pairs. In contrast, the new approach incorporates the similarity and returns accurate approximations for all pairs of PFMs independent of the shared similarity. Furthermore, overlapping windows lead to high approximation errors, so overlapping windows should be used carefully. However, using the new approach, one can compute the approximation error before performing the analysis. Based on this, one can ensure that the overlapping scheme can yield significant P-values, at least theoretically. Here, the analysis is done for sequences of length 100 000 bp. The Chen–Stein bounds implicitly depend on the sequence length because the number of windows is considered. Therefore, we also analyze the bounds for smaller sequences in the next section.

3.3 Overlapping windows for small sequences

Assuming a sequence length of 1000 bp, we compute Chen–Stein error bounds for the cooperativity P-values. Using 54 bp long windows which overlap by 10% yields an error bound of 0.04 for the similar pair S8:CHX10. Hence, it will still be difficult to obtain significant results since one cannot obtain P-values less than 0.04. In general, similar PFMs have a high approximation error for overlapping windows since overlapping occurrences induce high dependencies between two windows. In contrast, the dissimilar pairs S8:HIC and CHX:HIC have error bounds of 0.002 and 0.003 for window sizes of 297 and 266 bp, respectively. The bounds are smaller for two reasons. First, the windows are larger and thus fewer windows are used for the sequence. Second, dependencies between overlapping windows are smaller since dissimilar PFMs have smaller overlap probabilities. Hence, in the case of dissimilar PFMs, one can use overlapping windows and still obtain significant cooperativity.
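The way an error bound limits the attainable P-value can be made concrete with a small sketch. It computes the Poisson tail probability for the number of co-occurrence windows, the Chen–Stein bound $\lambda^{-1}(1 - e^{-\lambda})(b_1 + b_2)$ from given values of $b_1$ and $b_2$, and a conservative P-value obtained by adding the bound to the Poisson tail. The last step assumes that the total variation distance is taken as the supremum over events, so that any tail probability differs from its Poisson approximation by at most the bound. All function names are hypothetical, and $b_1$ and $b_2$ are assumed to be supplied by the caller.

```python
import math
from typing import Optional


def number_of_windows(n: int, w: int, s: Optional[int] = None) -> int:
    """Number of windows of length w placed every s positions in a sequence of length n.

    With s = w (the default) the windows do not overlap.
    """
    step = w if s is None else s
    if not (0 < step and 0 < w <= n):
        raise ValueError("need 0 < w <= n and s > 0")
    return (n - w) // step + 1


def poisson_tail(x: int, lam: float) -> float:
    """P(W >= x) for W ~ Poisson(lam)."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    if x <= 0:
        return 1.0
    term = math.exp(-lam)      # P(W = 0)
    cdf = term
    for k in range(1, x):      # accumulate P(W <= x - 1)
        term *= lam / k
        cdf += term
    return max(0.0, 1.0 - cdf)


def chen_stein_bound(lam: float, b1: float, b2: float) -> float:
    """Upper bound (1 - exp(-lam)) / lam * (b1 + b2) on the total variation distance."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    return -math.expm1(-lam) / lam * (b1 + b2)


def conservative_p_value(x: int, lam: float, tv_bound: float) -> float:
    """Poisson tail plus the approximation error bound, capped at 1."""
    return min(1.0, poisson_tail(x, lam) + tv_bound)
```

With an error bound of 0.04, for example, `conservative_p_value` can never fall below 0.04, however many co-occurrence windows are observed.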
4 DISCUSSION

In conclusion, we can state that detection of significant co-occurrences and cooperativity based on PFM occurrences is a difficult problem due to strong dependencies induced by similarity between PFMs. We show a reasonable approximation to adjust the window size such that co-occurrence and cooperativity probabilities are comparable between similar and dissimilar PFMs. Therefore, statistical follow-up analyses can ignore the similarity issue. Instead, the interpretation of cooperativity changes slightly: the window size defines the longest distance between two motifs such that the corresponding TFs are assumed to interact. Therefore, similar pairs of interacting TFs are required to have smaller distances between occurrences than dissimilar pairs of TFs. This is due to the fact that interaction over longer distances cannot be predicted with sufficient statistical support for similar TF pairs.

Furthermore, we propose a new approximation for cooperativity using overlapping windows. Using the Chen–Stein technique, we can bound the approximation error. Results show that similar PFMs imply strong dependencies between overlapping windows. This leads to high approximation errors. In contrast, dissimilar PFMs yield low approximation errors. Based on our error bounds, one can precompute the approximation errors and select an appropriate overlap scheme before running the analysis.

We give strong evidence for the accuracy of our approach and the necessity of incorporating similarities by comparison with the empirical distribution and the independence approach. Our results underline the difficulty in applying overlapping windows, especially for similar motifs. However, it is important to use overlapping windows; otherwise, a motif occurring at the end of one window and another occurring at the beginning of the next window would not be counted as a co-occurrence event, although the distance between them might only be a few base pairs.
Hence, one could derive statistics for the distances between motifs instead of using windows (see Fig. 1).

The main shortcoming of the approach is the limitation to an i.i.d. background model. Extension to a Markov model is not straightforward since the calculation of co-occurrence probabilities relies on the independence between sequence positions. In addition, we require the distribution of occurrences on both strands to be equal. This can be justified by Chargaff's second law (Chargaff et al., 1951). Furthermore, in contrast to coding sequence, there is no motivation to handle both strands in the upstream region differently. Therefore, modeling of CpG islands and other higher order sequence features cannot be done by using a more elaborate sequence model. However, one can circumvent this problem by using different window sizes for different sequences incorporating the respective GC content. Another strategy could use a mixture Poisson distribution based on different rate parameters incorporating variable GC content as approximation.

ACKNOWLEDGEMENTS

We thank the organizers of the GCB 2008 for the opportunity to present this work at the conference. Furthermore, discussions with Hugues Richard helped to improve the manuscript.

Funding: International Research Training Group - Genomics and Systems Biology of Molecular Networks (to H.K.).

Conflict of Interest: none declared.