# ChIP-on-chip significance analysis reveals large-scale binding and regulation by human transcription factor oncogenes

A.A.M.^{a,b,c,1}, Teresa Palomero^{d,e}, Pavel Sumazin^{b}, Andrea Califano^{a,b,d,2,3}, Adolfo A. Ferrando^{d,e,f,2,3}, and Gustavo Stolovitzky^{b,c,2,3}

^{a}Department of Biomedical Informatics,

^{b}Joint Centers for Systems Biology,

^{d}Institute for Cancer Genetics,

^{e}Department of Pathology, and

^{f}Department of Pediatrics, Columbia University, New York, NY 10032; and

^{c}Functional Genomics and Systems Biology Group, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598

^{3}To whom correspondence may be addressed. E-mail: gustavo@us.ibm.com, califano@c2b2.columbia.edu, or af2196@columbia.edu

Author contributions: A.A.M., P.S., A.C., A.A.F., and G.S. designed research; A.A.M., T.P., and P.S. performed research; T.P. and A.A.F. contributed new reagents/analytic tools; A.A.M. and P.S. analyzed data; and A.A.M., T.P., P.S., A.C., A.A.F., and G.S. wrote the paper.

^{1}Present address: The Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA 02142.

^{2}A.C., A.A.F., and G.S. contributed equally to this work.

## Abstract

ChIP-on-chip has emerged as a powerful tool to dissect the complex network of regulatory interactions between transcription factors and their targets. However, most ChIP-on-chip analysis methods use conservative approaches aimed at minimizing false-positive transcription factor targets. We present a model with improved sensitivity in detecting binding events from ChIP-on-chip data. Its application to human T cells, followed by extensive biochemical validation, reveals that 3 oncogenic transcription factors, NOTCH1, MYC, and HES1, bind to several thousand target gene promoters, up to an order of magnitude increase over conventional analysis methods. Gene expression profiling upon NOTCH1 inhibition shows broad-scale functional regulation across the entire range of predicted target genes, establishing a closer link between occupancy and regulation. Finally, the increased sensitivity reveals a combinatorial regulatory program in which MYC cobinds to virtually all NOTCH1-bound promoters. Overall, these results suggest an unappreciated complexity of transcriptional regulatory networks and highlight the fundamental importance of genome-scale analysis to represent transcriptional programs.

**Keywords:** regulatory networks, T cell lymphoblastic leukemia, transcriptional regulation, systems biology

The dysregulated activity of oncogenic transcription factors (TFs) contributes to neoplastic transformation by promoting aberrant expression of target genes involved in regulating cell homeostasis. Therefore, characterization of the regulatory networks controlled by these TFs is a critical objective in understanding the molecular mechanisms of cell transformation. ChIP-on-chip (ChIP^{2}) (1) has emerged as a promising technology in the dissection of transcriptional networks by providing high-resolution maps of genome-wide TF–chromatin interactions.

ChIP^{2} uses microarray technology to measure the relative abundance of genomic fragments derived from an immunoprecipitate (IP) sample, which is enriched in fragments bound by an immunoprecipitated protein (usually a TF), and a whole-cell extract (WCE) sample, containing fragments derived from a total chromatin preparation (input control) or an immunoprecipitation with a nonspecific control antibody (2). The 2 samples may either be hybridized to different arrays or labeled with different dyes and hybridized to the same array. Correct interpretation of ChIP^{2} data depends critically on an accurate statistical model to compute the probability that a given IP/WCE ratio is produced by a binding event rather than experimental noise.

Recently, several elegant ChIP^{2} analysis methods have been proposed to tackle problems such as integrating measurements from adjacent probes (3–6) or inferring binding site locations at subprobe resolution (7). However, the lower-level problem of developing an accurate error model to define meaningful statistical thresholds has received comparably little attention [see SI and Fig. 1]. Thus, ChIP^{2} data analysis methods often use highly conservative approaches aimed at minimizing the rate of false-positive predictions. Although several studies have experimentally validated novel target collections produced at a given statistical threshold (8–12), these studies likely miss a large number of true binding events, obscuring the full complexity of transcriptional processes.

Fig. 1. log_{2} IP/WCE probe ratio values from a MYC ChIP^{2} experiment. The histogram displays distinct, overlapping ...

Using an empirically determined model of the distribution of intensity ratios for non-IP-enriched probes in ChIP^{2} experiments, we developed an analytical method called ChIP^{2} Significance Analysis (CSA). When applied to ChIP^{2} data from the NOTCH1, MYC, and HES1 protooncogenes in human T cell acute lymphoblastic leukemia (T-ALL) cells, CSA increased the number of detected binding sites by up to an order of magnitude compared with other routinely used methods. Both binding site analysis and biochemical validation demonstrate quantitative agreement with CSA-predicted false-positive rates. Analysis of gene expression signatures indicates functional regulation by NOTCH1 across the entire range of predicted targets. Finally, the increased sensitivity reveals that virtually all NOTCH1-bound promoters are also bound by MYC. Overall, these results highlight the power of the proposed analysis framework for the identification of transcriptional networks and provide an improved and fundamentally different picture of the transcriptional programs controlled by NOTCH1, HES1, and MYC in T-ALL.

## Results

### Probe Statistics Are Accurately Modeled by CSA.

T-ALL is a malignant tumor characterized by the aberrant activation of oncogenic TFs (13). We recently demonstrated that constitutive activation of NOTCH1 signaling due to mutations in the *NOTCH1* gene activates a transcriptional network that controls leukemic cell growth (11, 14–16). These studies also demonstrated a fundamental role for *HES1* and *MYC* as transcriptional mediators of NOTCH1 signals (15, 17). To characterize the structure of the oncogenic transcriptional network driven by activated NOTCH1 in T cell transformation, we sought to identify the direct transcriptional targets of NOTCH1, HES1, and MYC. We hypothesized that the development of an accurate statistical model would result in improved sensitivity in the identification of TF targets and a more accurate description of the individual and combinatorial regulatory programs controlled by these TFs.

We first generated an empirical model of the distribution of IP/WCE intensity ratios for probes associated with unbound fragments (see *Materials and Methods*), and we used it to assign a *P* value to each probe in the analysis of ChIP^{2} assays representing replicate experiments for NOTCH1, MYC, and HES1. ChIP^{2} assays for these TFs were performed in HPB-ALL cells, a well-characterized T-ALL cell line with high expression levels of activated NOTCH1, MYC, and HES1. For NOTCH1, ChIP^{2} assays were also performed in CUTLL1 cells, another NOTCH1-dependent T-ALL cell line. The magnitude versus amplitude plots (Fig. 2*A*) of the intensity-dependent distributions of probe-ratio values showed marked differences for the four experiments. In each case CSA accurately modeled the left tail of the probe ratio probability distribution, where the contribution from bound probes is expected to be minimal (Fig. 2 *A* and *B*). We note that if bound-probe ratios are well separated from the experimental noise, the *P* value distribution for all probes should be uniform between zero and one (unbound probes) with a single peak near zero (bound probes). Importantly, CSA accurately captured these statistical properties (see *SI*).

### Improved ChIP^{2} Sensitivity by CSA.

CSA then incorporates the probe significance model with an analytical method that integrates the statistics for replicate experiments and probes with nearby genomic locations (to account for ChIP^{2} fragmentation lengths, see *Materials and Methods*). We used CSA to compute the false discovery rate (FDR) associated with the most significant 500-bp region on each of the 16,697 promoters represented on the array. Analysis of NOTCH1, MYC, and HES1 promoter occupancy in T-ALL showed a larger than anticipated number of candidate target genes for these TFs. Specifically, using CSA at a conservative FDR of 0.05, the number of promoters on the array bound by the TFs in this study are: MYC (8,016; 48.0%), NOTCH1 in CUTLL1 (3,154; 18.9%), HES1 (3,074; 18.4%), and NOTCH1 in HPB-ALL (2,471; 14.8%) (Table 1).

The numbers reported above are far larger than the number of predicted targets commonly reported in ChIP^{2} analysis studies. To benchmark CSA directly, we compared its predictions against those from several published analysis methods with available software. One class of methods relies heavily on analyzing the shape of ratio values from multiple probes in close genomic proximity (3–7). These methods are generally not applicable to the relatively sparse arrays used in this study and produced very few predicted targets. As an initial benchmark, we compared against the Single Array Error Model (SAEM) (1, 18), the standard method packaged with the Agilent analysis software. This method models the intensity-dependent variance of probe ratio values, but then computes significance based on whole-dataset statistics. CSA predicted approximately an order of magnitude more bound promoters than SAEM (Table 1). Two published methods, ChIPOTle (19) and Chipper (8), compute significance using only probes with low ratio values, and these methods indeed predicted more targets than SAEM. However, both methods use normalization techniques that, to varying extents, rely on whole-dataset statistics, and as a result CSA predicted more targets than both. This was most apparent for MYC, which contained the largest number of high ratio probes. For the other TFs, CSA predicted roughly the same number of targets as ChIPOTle and twice as many targets as Chipper (Table 1).

### Accuracy of CSA Predictions Is Supported by Binding Site Enrichment Analysis.

As a first test of the broad TF binding predictions generated by CSA, we evaluated the enrichment of MYC binding sites, using the TRANSFAC (20) position-specific scoring matrix M00322, in the promoters of target genes identified by CSA and other analysis methods. The DNA-binding component of NOTCH1 transcriptional complexes, CSL, is not represented in TRANSFAC or JASPAR (21), and the only HES1-associated matrix was found to be of low quality and a poor predictor of HES1 binding, independent of the algorithm used. For each analysis method, promoters were ranked by their *P* values computed from the MYC ChIP^{2} experiment, and MYC/M00322 matching sites were identified in the 600-bp fixed-length window centered on the most significant probe in the highest-scoring promoter region. The match threshold was set so that a negative set, *S*^{(−)}, of 3,000 fragments showing the least amount of MYC binding would produce a false-positive rate of 30%. Details on the procedure are given in the *SI*.
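The threshold calibration described above can be illustrated with a minimal sketch. It assumes each fragment has already been reduced to a single motif match score (the scoring itself is described in the *SI* and is not reproduced here); the function names `match_threshold` and `is_match` and the toy scores are hypothetical.

```python
def match_threshold(negative_scores, fpr_percent=30):
    """Return a score cutoff such that at most fpr_percent% of the
    negative-set scores lie strictly above it."""
    if not negative_scores:
        raise ValueError("negative set is empty")
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives allowed to exceed the cutoff (integer arithmetic).
    allowed = len(ranked) * fpr_percent // 100
    allowed = min(allowed, len(ranked) - 1)
    return ranked[allowed]


def is_match(score, threshold):
    """A fragment counts as a motif match if its score exceeds the cutoff."""
    return score > threshold


# Toy negative set (hypothetical scores, not data from the study).
negatives = [float(s) for s in range(1, 11)]
threshold = match_threshold(negatives, fpr_percent=30)
fp_rate = sum(is_match(s, threshold) for s in negatives) / len(negatives)
```

In the study the negative set contains 3,000 fragments; the same calibration then fixes the cutoff applied to every ranked fragment.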

Analysis of the cumulative proportion of MYC/M00322-matching fragments as a function of their ChIP^{2} ranking by the corresponding method showed that fragments inferred by all methods were enriched in MYC/M00322 sites and that site enrichment was correlated with the ChIP^{2} ranking (Fig. 3). However, fragments identified by CSA were more likely to contain MYC binding sites than those identified by the other methods. The highest-scoring ≈2,000 sites identified by Chipper performed better than those identified by CSA; however, CSA-identified fragments beyond this ranking were significantly more enriched in MYC binding sites. Comparing nonoverlapping fragments in the top 5,000 promoters inferred by CSA versus each of the other 3 methods demonstrated a statistically significant enrichment of MYC binding sites in CSA-inferred fragments (*P* < 10^{−10}, based on the hypergeometric distribution, for each comparison).
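As an illustration of a hypergeometric enrichment test of the kind cited above, the following sketch computes an exact upper-tail probability. How the fragments were partitioned for each comparison is not specified in this section, so the function name, its arguments, and the toy counts are assumptions made for illustration only.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """Exact P(X >= k) when `draws` items are sampled without replacement
    from `population` items, of which `successes` carry the motif."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(max(k, 0), upper + 1)
    )
    return Fraction(favorable, total)


# Toy example: 10 fragments, 4 with a motif match, 3 fragments selected,
# at least 2 of the selected fragments carrying a match.
p_enrichment = hypergeom_upper_tail(2, population=10, successes=4, draws=3)
```

Exact rational arithmetic avoids underflow for the very small *P* values that arise with thousands of fragments, at the cost of speed.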

Fig. 3 (*Inset*): For bins of 100 genes ranked by CSA, we ...

To compare the predicted false-positive rate by CSA with the significance of MYC binding site enrichment, we binned the fragments based on their CSA rankings (100/bin) and assessed whether the MYC/M00322 motif could be successfully used to distinguish the fragments in each bin from those in the negative set, *S*^{(−)}. The classification *P* value based on binding site enrichment was in excellent agreement with the CSA-inferred false-positive rate of each bin, suggesting significant enrichment of MYC sites in the promoters of ≈7,000 genes, corresponding to the range of high-confidence targets predicted by CSA (Fig. 3 *Inset*). Beyond this threshold, both quantities degraded very rapidly, and for ranks greater than ≈8,000 the CSA-inferred false-positive rate reached 100% and, correspondingly, fragments showed no ability to be classified by MYC binding sites.

Overall, these results suggest that CSA has increased sensitivity in identifying a larger number of binding events and that meaningful statistical cutoffs can be determined from data.

### Experimental Validation of CSA TF Binding Predictions.

To further test the accuracy of CSA-based TF target predictions, we performed independent chromatin immunoprecipitation (ChIP) experiments for each of the 4 ChIP^{2} conditions and tested the IP enrichment of specific promoters by quantitative PCR (qPCR). We first analyzed 8 predicted NOTCH1 targets in HPB-ALL cells, randomly sampled at an FDR ≤20%. Seven of these 8 predicted fragments were validated as bound by NOTCH1, and only the least significant fragment failed validation (Table 2).

We tested an additional 12 targets each for HES1 and MYC in HPB-ALL and for NOTCH1 in CUTLL1, sampling predicted targets uniformly at an FDR of 20% (i.e., 20% expected false-positives) (Table 2). In this analysis, 26 of 36 targets (72.2%) were positive by ChIP/qPCR, and 9 (25%) were negative. The remaining gene (the second least significant for MYC) could not be amplified by qPCR. Nonvalidated/false-positive targets were, in general, at the end of the ranked lists (Table 2). The only outlier was the first-ranked fragment for HES1 (*KIAA1407* gene promoter). To obtain experimental evidence on the robustness of our validation assay, we randomly selected 10 genomic regions not identified as bound by MYC and 10 not identified as bound by HES1. Nine selected regions were within promoters, and 11 were in intergenic regions. As expected, none of these 20 regions showed evidence of binding by MYC or HES1 when tested by ChIP/qPCR.

For all experiments, numerous validated genes had CSA ranks in the thousands. The lowest-ranking validated genes before encountering a false-positive were as follows: 2,223 for NOTCH1 in CUTLL1; 2,958 for NOTCH1 in HPB-ALL; 4,901 for MYC; and although the top-ranking gene for HES1 failed validation, the following 7, down to rank 3,247, were positive. Notably, many of the validated targets showed subtle ChIP^{2} signals. For example, *C6orf82*, a validated HES1 target, had ChIP^{2} binding ratios in replicate experiments of 1.37 and 1.68 for the most significant probe in its promoter, and there was no enrichment (ratios of 0.81 and 1.15) for its adjacent probe. However, upon ChIP/qPCR validation, this region showed binding ratios of 2.69 and 4.65. ChIP/qPCR results are available in the *SI*.

Overall, 33 of the 44 genes (75%) selected from those with an FDR of 20% by CSA were validated by ChIP/qPCR. These biochemical validation results support our computationally derived conclusions regarding the broad range of binding for all tested TFs and demonstrate the power of CSA for reducing the false-negative rate in ChIP^{2} experiments.

### NOTCH1 Regulates Direct Target Genes Predicted by CSA.

To test whether CSA-predicted NOTCH1-bound genes are also functionally regulated by this TF, we treated a panel of 10 T-ALL cell lines with Compound E, a γ-secretase inhibitor that blocks an essential proteolytic cleavage step required for release of the intracellular domains of NOTCH1 from the membrane and their translocation to the nucleus (22). Genome-wide expression profiles of cells treated for 72 h with Compound E (100 nM) or vehicle only (DMSO) were measured using microarrays, and expression changes were compared with NOTCH1 promoter occupancy identified by CSA analysis of the ChIP^{2} data. Overall, 11,606 genes were represented on both the ChIP^{2} and the expression arrays. For each gene we computed: (*i*) the ChIP^{2} FDR based on the highest-scoring 500-bp region in its promoter; (*ii*) the log_{2} expression ratio of the control versus treatment, averaged over the 10 cell lines and duplicate experiments; and (*iii*) the number of microarray experiments in which the gene was expressed (not called absent by MAS5), considering both Compound E-treated and DMSO-treated samples (because the group of expressed genes is essentially the same for both treatments, considering expressed genes using only one subgroup does not substantially change the results).

Predicted NOTCH1-bound genes were more likely to be expressed than genes not identified as bound by NOTCH1 and showed clear down-regulation upon NOTCH1 inhibition (Fig. 4). The 2,000 most confident NOTCH1 targets (FDR < 0.058) were expressed in 83.3% of experiments, whereas the 6,000 least confident NOTCH1 targets were expressed in 38.8% of experiments (*P* < 10^{−100}). The top 2,000 targets also showed coordinated down-regulation upon NOTCH1 inhibition that was subtle in magnitude (mean = 12.2%) but extremely significant (*P* < 10^{−100}). The ChIP^{2} analysis predicted a rapid increase in false-positives beyond the top 2,000 targets and, correspondingly, their likelihoods to be expressed and regulated by NOTCH1 decreased. However, even for genes with ChIP^{2} ranks between 4,000 and 5,000, there was significant enrichment for both the percent of expressed genes (59.4%; *P* < 10^{−33}) and the expression change upon NOTCH1 inhibition (*P* < 10^{−15}). These results demonstrate that, in contrast with previous analysis based on a limited number of targets (17), NOTCH1 directly contributes to the transcriptional activity of thousands of genes.

### Interaction of NOTCH1 and MYC Regulatory Networks.

NOTCH1 and MYC operate as highly interrelated regulators of cell growth, proliferation, and survival during T cell development and transformation. In a recent study (11), we compared the regulatory networks controlled by NOTCH1 and MYC by using the ARACNE reverse engineering algorithm (23, 24) to predict 58 and 61 targets of NOTCH1 and MYC, respectively, and observed a significant overlap of 12 genes between the 2 lists (*P* < 10^{−51}). We went on to characterize a feed-forward loop in which NOTCH1 directly regulates *MYC*, and both TFs regulate a common set of targets promoting leukemic cell growth. Based on these findings, we sought to further investigate the relationship between the genes bound by MYC and by NOTCH1 using the much larger list of targets inferred by CSA. Strikingly, the analysis predicted that MYC bound to 1,668 of the 1,804 (92.5%; *P* < 10^{−11}) genes that were bound by NOTCH1, using a ChIP^{2} FDR threshold of 0.01. In agreement with the fundamental role of NOTCH1 in controlling leukemia cell growth (11), the NOTCH1-bound genes were highly enriched in gene ontology (GO) (25) categories related to cellular growth and metabolism, such as cellular metabolism (*P* < 10^{−41}), RNA metabolism (*P* < 10^{−24}), and protein biosynthesis (*P* < 10^{−9}). The complete output of the GO enrichment analysis is given in the *SI*.

## Discussion

We have shown that the choice of a realistic statistical model can dramatically affect the result of ChIP^{2} data analysis and its biological interpretation, and we have proposed the CSA algorithm to assign meaningful statistical significance scores used to predict a more complete range of TF–target interactions. The method of assessing probe statistical significance relies on minimal assumptions: that the null distribution is symmetric and that bound fragments do not significantly affect the left tail of the null hypothesis statistics. As a result, it should generalize well to ChIP^{2} experiments performed using other platforms and cellular conditions. We used an independence model for replicate experiments and adjacent probes in the null hypothesis. Although this assumption is valid for relatively sparse arrays, denser arrays may introduce correlation for unbound nearby probes that are within the DNA fragmentation length, leading to unrealistically low FDR values if independence is assumed. We therefore recommend verifying that the independence assumption holds before analyzing denser arrays; where it does not, the CSA method may be further improved by incorporating existing, more sophisticated models for the integration of data from nearby probes (3–7). However, for the arrays used in this study, which have an average probe spacing of more than 200 nucleotides, we show that our results are in quantitatively good agreement with biochemical validation assays and that no correction seems to be required.

The analysis of ChIP^{2} data from 3 oncogenic TFs reveals that CSA identifies far more bound gene promoters than standard analyses. Specifically, CSA predicts that each studied TF binds to several thousand target genes, with MYC binding to roughly half of the assayed promoters, providing additional insight into the extreme pluripotency of this protooncogene (26). These predictions might still be an underestimate, because only the proximal promoter regions (−0.8 kb to +0.2 kb, relative to transcription start site) are represented on the arrays used in this study.

CSA predictions were validated by 3 independent tests. ChIP/qPCR experiments are in excellent correspondence with CSA-inferred FDRs, especially considering that ChIP/qPCR itself has a 10–20% false-negative rate (9–12). Computational validation by sequence analysis further indicates that CSA-inferred FDRs are in agreement with MYC binding site enrichments. Finally, gene expression analysis after NOTCH1 inhibition both provides further support for the CSA predictions and reveals a stronger than expected association between bound and regulated genes. We found that NOTCH1 binds to a large number of promoters (>2,000) and that the set of corresponding genes is consistently, albeit weakly, regulated upon NOTCH1 inhibition. These results are highly consistent with a previous study performed in yeast (27) that also observed correspondence of ChIP^{2} results with both binding site enrichment and expression changes for a large number of genes.

GO enrichment analysis shows that NOTCH1 subtly regulates a large number of genes involved in the cellular growth machinery. These results add an additional layer of regulation to the effects of NOTCH1 signaling in promoting cell growth, with important implications for understanding the role of NOTCH1 signaling in development and transformation. Thus, in addition to the established role of NOTCH1 in promoting growth through its interaction with *MYC* (17) and the PI3K-AKT (15) signaling pathway, NOTCH1 also has a direct effect in promoting cell growth. This irreversibly couples the developmental programs involved in stem cell homeostasis and lineage commitment activated upon NOTCH1 activation with the metabolic pathways needed for the expansion of stem cells and T cell progenitors.

Finally, the availability of a more complete repertoire of bound promoters allows us to truly assess the extent of a TF's regulatory program and the combinatorial overlap between independent programs. Our analysis shows that 92.5% of the promoters bound by NOTCH1 are also bound by MYC. Indeed, it appears that NOTCH1 coregulates a specific subset of the MYC regulatory program. Although this was previously hinted at by the similarity of the regulatory programs inferred for the 2 TFs by expression analysis (17), the true extent of this overlap can only be grasped after resolving a more complete map of NOTCH1 and MYC targets. While contributing to our understanding of transcriptional regulation at the genome-scale, our findings suggest an even greater than expected complexity of transcriptional networks.

## Materials and Methods

### CSA Algorithm.

The CSA algorithm takes as input probe intensity measurements after background subtraction and correction for factors such as spatial position on the array and print tip variability. In this study we used the standard Agilent procedure to obtain background-corrected probe intensity values, and we determined the statistical significance of binding regions by using the procedure described below.

### Single-Probe Significance Analysis.

For each probe, the statistical significance of a measurement is inferred by computing the conditional probability of the magnitude (*M*) given the amplitude (*A*), *P*_{null}(*M*|*A*), where *M* = log_{2}(*IP*/*WCE*) and *A* = [log_{2}(*IP*) + log_{2}(*WCE*)]/2, under the null hypothesis (i.e., no enrichment in the IP compared with the WCE channel). Here, *IP* and *WCE* represent, respectively, the probe intensity measurements for the IP and WCE channels. The dependency of *M* on *A* is illustrated in Fig. 2*A*.
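The definitions of *M* and *A* can be made concrete with a short sketch. The function name `ma_values` and the example intensities are hypothetical; the formulas are those given above.

```python
import math


def ma_values(ip, wce):
    """Return (M, A) for one probe from background-corrected intensities.

    M = log2(IP / WCE) is the log ratio (magnitude);
    A = [log2(IP) + log2(WCE)] / 2 is the average log intensity (amplitude).
    """
    if ip <= 0 or wce <= 0:
        raise ValueError("intensities must be positive after background correction")
    m = math.log2(ip / wce)
    a = 0.5 * (math.log2(ip) + math.log2(wce))
    return m, a


# Example probe: a 4-fold IP/WCE ratio gives M = 2.
m, a = ma_values(800.0, 200.0)
```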

The method begins by estimating the joint probability distribution, *P*(*M*,*A*), using a bivariate Gaussian kernel density estimator (28). The kernel width of the estimator is calculated using the AMISE criterion (29). Conditioning on *A* yields the conditional distribution *P*(*M*|*A*) = *P*(*M*,*A*)/*P*(*A*), where *P*(*A*) is calculated using a univariate Gaussian kernel. For a particular average intensity value, *A*_{0}, the conditional mean of the null distribution is inferred as

The conditional null distribution given *A* = *A*_{0} is inferred by projecting (i.e., reflecting) *P*(*M*|*A* = *A*_{0}) across μ̂_{M|A_{0}} for *M* < μ̂_{M|A_{0}}, so that the left half of the distribution, where bound probes contribute least, determines a symmetric null. This procedure is used to calculate *P*_{null}(*M*|*A*) for an evenly spaced grid of *A* and *M* values, excluding the 1% of probes with the lowest *A* values (which are assigned a *P* value of 1). In this work we used step sizes of 0.05 and 0.01 for the *A* and *M* values, respectively. The complete conditional null distribution, *P*_{null}(*M*|*A*), is computed using 2-dimensional linear interpolation. For each probe, statistical significance is assessed using a 1-tailed test with reference to this distribution. Because the distribution is empirical, there is a limit to the inferable minimum *P* value, which depends on the number of arrayed probes. For the arrays used in this study, we set the minimum *P* value to 10^{−5}, which is roughly 1 divided by the number of probes on the array. We stress the importance of using an empirical distribution because we have observed that the empirical data generally display significantly non-Gaussian tails.
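The reflection idea can be sketched for the probes falling in a single narrow intensity range. This is a simplified stand-in and not the CSA implementation: it replaces the kernel density estimate and grid interpolation with raw counts, and it assumes the center of the null is taken at the mode of the *M* values (estimated here from a coarse histogram), which is an assumption of this sketch. All names and the toy data are hypothetical.

```python
import bisect
import math
from collections import Counter


def reflected_null(m_values, bin_width=0.25):
    """Build a symmetric empirical null from the left half of the M values."""
    # Assumed center: midpoint of the most populated histogram bin.
    counts = Counter(math.floor(m / bin_width) for m in m_values)
    mode_bin = max(counts, key=lambda b: (counts[b], -b))
    center = (mode_bin + 0.5) * bin_width
    left = [m for m in m_values if m < center]
    if not left:
        raise ValueError("no values below the estimated center")
    # Mirror the left half about the center to obtain the right half.
    null = sorted(left + [2.0 * center - m for m in left])
    return center, null


def one_tailed_p(m_obs, null_sorted, p_min=1e-5):
    """Fraction of null values >= m_obs, floored at the minimum P value."""
    idx = bisect.bisect_left(null_sorted, m_obs)
    p = (len(null_sorted) - idx) / len(null_sorted)
    return max(p, p_min)


# Toy data: a symmetric bulk of unbound probes plus a few enriched probes.
m_values = ([-1.0] * 2 + [-0.5] * 5 + [0.0] * 10 + [0.5] * 5 + [1.0] * 2
            + [2.0] * 3 + [3.0] * 2)
center, null = reflected_null(m_values)
p_high = one_tailed_p(3.0, null)
p_mid = one_tailed_p(1.0, null)
```

Because only values left of the center enter the null, the enriched probes at *M* = 2 and *M* = 3 do not inflate the right tail against which they are tested.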

### Combining Replicates.

We use Fisher's method (30) to compute an aggregate *P* value for each probe based on measurements from replicate experiments, under the null hypothesis that the probe is unbound in all replicates. Let *p*_{i}^{j} denote the *P* value computed for the *i*^{th} probe in the *j*^{th} replicate experiment. Assuming that replicates are independent in the null hypothesis, a test statistic for evaluating the probability of a joint observation of *P* values across experiments is the product of the individual *P* values, *T*_{i} = Π_{j=1}^{M} *p*_{i}^{j}, where *T*_{i} is the test statistic and *M* is the number of replicate experiments. If modeled correctly, *P* values under the null hypothesis should be uniformly distributed (see *SI*). It is useful to log-transform this equation such that we evaluate −log(*T*_{i}) = −Σ_{j=1}^{M} log(*p*_{i}^{j}). Because the negative logarithm of a uniformly distributed random variable is exponentially distributed with mean 1, this equation is a sum of *M* exponentially distributed random variables, which is a gamma-distributed random variable with shape parameter *M* and unit scale. Thus, significance can be evaluated as the upper-tail probability 1 − Γ_{CDF}^{M}(−Σ_{j=1}^{M} log(*p*_{i}^{j})), where Γ_{CDF}^{M} is the gamma cumulative distribution function with shape parameter *M* and unit scale.
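A minimal sketch of this combination step follows. It relies on the closed form of the gamma upper tail for integer shape (the Erlang distribution), so no external statistics library is needed; the function names are hypothetical.

```python
import math


def gamma_upper_tail(x, shape):
    """P(X >= x) for X ~ Gamma(shape, scale=1) with integer shape.

    Closed form: exp(-x) * sum_{k=0}^{shape-1} x**k / k!
    """
    term, total = 1.0, 1.0
    for k in range(1, shape):
        term *= x / k
        total += term
    return math.exp(-x) * total


def fisher_combine(p_values):
    """Aggregate independent P values with Fisher's method."""
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    stat = -sum(math.log(p) for p in p_values)
    return gamma_upper_tail(stat, len(p_values))


# Two replicates, each individually at P = 0.05.
p_combined = fisher_combine([0.05, 0.05])
```

With a single replicate the function returns the input *P* value unchanged, and the minimum *P* value of 10^{−5} imposed at the probe level keeps every logarithm finite.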

### Combining Regions.

Because of sonication, the signal derived from a binding event may be detected by multiple probes in close genomic proximity to the binding site. To compute a combined statistic representing the probability of a binding event within the region spanned by multiple probes, we adapt a commonly used strategy (19) of using a fixed-size sliding window and integrating the values of probes falling within this window. Based on published measurements of fragmentation lengths (7), in this work we used a 500-bp window and a step size of 150 bp. Assuming that measurements from adjacent probes are independent in the null hypothesis, Fisher's method can again be applied to integrate the values from nearby probes. That is, let *W* represent the set of probes falling within a given 500-bp window. The integrated probability for this region is then calculated as
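The sliding-window integration can be sketched as follows, under stated assumptions: windows start at the beginning of the region and advance by the step size, a probe belongs to a window if its position lies in the half-open interval [start, start + window), and the values within a window are combined with Fisher's method as for replicates. The function names, the window placement, and the toy probes are assumptions of this sketch.

```python
import math


def fisher_combine(p_values):
    """Fisher's method via the Erlang upper tail (integer shape, unit scale)."""
    x = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for k in range(1, len(p_values)):
        term *= x / k
        total += term
    return math.exp(-x) * total


def window_p_values(probes, region_start, region_end, window=500, step=150):
    """Return (window_start, combined_p) for every window holding a probe.

    probes: list of (genomic_position, p_value) pairs for one promoter.
    """
    results = []
    start = region_start
    while start < region_end:
        in_window = [p for pos, p in probes if start <= pos < start + window]
        if in_window:
            results.append((start, fisher_combine(in_window)))
        start += step
    return results


# Toy promoter with 3 probes; the first 2 share the window starting at 0.
probes = [(100, 0.05), (350, 0.05), (900, 0.5)]
windows = window_p_values(probes, region_start=0, region_end=1000)
best_start, best_p = min(windows, key=lambda w: w[1])
```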

To compute the probability that any region within a gene's promoter is bound, we consider the most significant window, controlling for multiple tests using Bonferroni correction based on the number of probes in the promoter. This correction is not exact, because the number of tests (i.e., the number of windows containing unique subsets of probes) is likely greater than the number of probes in a promoter, causing underestimation of the significance, and the tests are not independent (i.e., the same probe may fall within multiple windows), causing overestimation of the significance. However, because the number of probes in each promoter (and therefore the number of probes within each window) is relatively small, especially for the arrays used in this study, we expect this simplification to have little impact on the calculated statistics. For very dense arrays, a more sophisticated multiple-test correction procedure, such as those described in (31), may yield more accurate results.
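The promoter-level correction described above amounts to a single line of arithmetic, sketched here. The function name is hypothetical, and capping the result at 1 is an assumption added so that the value remains a valid probability.

```python
def promoter_p_value(window_p_values, n_probes):
    """Bonferroni-correct the most significant window in a promoter,
    using the number of probes in the promoter as the number of tests."""
    if not window_p_values or n_probes < 1:
        raise ValueError("need at least one window and one probe")
    return min(1.0, n_probes * min(window_p_values))


# A promoter with 4 probes whose best window has P = 0.001.
p_gene = promoter_p_value([0.02, 0.001, 0.3], n_probes=4)
```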

### FDR Calculation.

After computing a corrected *P* value for each gene representing the probability that the most significant region on the gene's promoter is bound, we control for multiple tests across genes and compute a false discovery rate using the Benjamini–Hochberg procedure (32). Let *p*_{k} represent the corrected *P* value computed for gene *k*, let *r*_{k} represent the rank of gene *k* sorted by the ChIP^{2} *P* values, and let *G* represent the total number of genes on the array; then, the false discovery rate for gene *k* is computed as *FDR*_{k} = *G* × *p*_{k}/*r*_{k}.
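The FDR formula can be sketched directly. The function name is hypothetical; ties are ranked in sort order and the result is capped at 1, both of which are assumptions of this sketch. No monotonicity adjustment is applied, because the formula as stated does not include one.

```python
def fdr_by_rank(p_values):
    """Compute FDR_k = G * p_k / r_k, where r_k is the 1-based rank of
    gene k when genes are sorted by ascending P value."""
    g = len(p_values)
    order = sorted(range(g), key=lambda k: p_values[k])
    fdr = [0.0] * g
    for rank, k in enumerate(order, start=1):
        fdr[k] = min(1.0, g * p_values[k] / rank)
    return fdr


# Toy example with 4 genes (values returned in the input gene order).
fdr = fdr_by_rank([0.001, 0.04, 0.01, 0.5])
```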

## Acknowledgments.

A.A.M. was supported by an IBM PhD fellowship. This work was supported by National Cancer Institute Grants R01CA109755 (to A.C.) and R01CA120196 (to A.A.F.), National Institute of Allergy and Infectious Diseases Grant R01AI066116, the Alex Lemonade Stand Foundation (T.P.), The Cancer Research Institute, the WOLF Foundation, the National Centers for Biomedical Computing National Institutes of Health Roadmap Initiative (U54CA121852), and the Leukemia and Lymphoma Society (Grants 1287-08 and 6237-08). A.A.F. is a Leukemia and Lymphoma Scholar.

## Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: The microarray data have been deposited in the Gene Expression Omnibus (GEO) Database, www.ncbi.nlm.nih.gov/geo (accession no. GSE12868). ChIP^{2} data are available at http://wiki.c2b2.columbia.edu/califanolab/PNASAM2009/.

This article contains supporting information online at www.pnas.org/cgi/content/full/0806445106/DCSupplemental.

Published in: Proceedings of the National Academy of Sciences of the United States of America. Jan 6, 2009; 106(1): 244.
