![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||
Copyright © 2008 The Author(s) Predicting functional transcription factor binding through alignment-free and affinity-based analysis of orthologous promoter sequences 1Department of Biological Sciences, Columbia University, New York, NY 10027 and 2Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10032, USA *To whom correspondence should be addressed. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. This article has been cited by other articles in PMC.Abstract Motivation: The identification of transcription factor (TF) binding sites and the regulatory circuitry that they define is currently an area of intense research. Data from whole-genome chromatin immunoprecipitation (ChIP–chip), whole-genome expression microarrays, and sequencing of multiple closely related genomes have all proven useful. By and large, existing methods treat the interpretation of functional data as a classification problem (between bound and unbound DNA), and the analysis of comparative data as a problem of local alignment (to recover phylogenetic footprints of presumably functional elements). Both of these approaches suffer from the inability to model and detect low-affinity binding sites, which have recently been shown to be abundant and functional. Results: We have developed a method that discovers functional regulatory targets of TFs by predicting the total affinity of each promoter for those factors and then comparing that affinity across orthologous promoters in closely related species. At each promoter, we consider the minimum affinity among orthologs to be the fraction of the affinity that is functional. Because we calculate the affinity of the entire promoter, our method is independent of local alignment. By comparing with functional annotation information and gene expression data in Saccharomyces cerevisiae, we have validated that this biophysically motivated use of evolutionary conservation gives rise to dramatic improvement in prediction of regulatory connectivity and factor–factor interactions compared to the use of a single genome. We propose novel biological functions for several yeast TFs, including the factors Snt2 and Stb4, for which no function has been reported. Our affinity-based approach towards comparative genomics may allow a more quantitative analysis of the principles governing the evolution of non-coding DNA. Availability: The MatrixREDUCE software package is available from http://www.bussemakerlab.org/software/MatrixREDUCE Contact: Harmen.Bussemaker/at/columbia.edu Supplementary information: Supplementary data are available at Bioinformatics online. 1 INTRODUCTION Genome sequences encode not only the sequences of RNAs, but also the rates at which these are transcribed under various conditions. This cis-regulatory code is a consequence of the sequence specificity of transcription factors (TFs) and their interactions with other TFs, nucleosomes and other chromatin-associated proteins. The identification of cis-regulatory elements on a genomic scale is complicated by the fact that, although TF sequence specificity is generally well-characterized in vitro, functional elements in the genome are in fact much sparser than would be predicted from sequence alone (Gao et al., 2004; Pilpel et al., 2001). There are two types of constraint on the in vivo selection of functional targets by a given TF: those that prevent the TF from binding to DNA, and those that prevent a bound TF from driving transcription (Fig. 1
Most comparative genomics methods rely on local alignment of orthologous promoters or statistical measures of sequence overrepresentation (Cliften et al., 2003; Kellis et al., 2003; Li and Wong, 2005; Moses et al., 2004; Siddharthan et al., 2005; Sinha et al., 2004), neither of which reflect the evolutionary constraints on regulatory sequence. Local alignment is not well-suited to detect lower affinity binding sites, which may be distant in sequence space yet functional; nor can it capture the rapid turnover of binding sites, which often occurs without conservation of position (Dermitzakis and Clark, 2002; Ludwig, 2002; Tautz, 2000; Wray, 2003). These limitations could be overcome by not directly comparing orthologous sequences, but rather comparing their predicted affinities for various TFs. Various biophysically motivated models of promoter-TF affinity have been developed (Bintu et al., 2005; Djordjevic et al., 2003; Liu and Clarke, 2002; Roider et al., 2007; Ronen et al., 2002; Stormo et al., 1986) that allow such a comparison. In this article, we build upon the principles of conservation of promoter-TF affinity across a wide range of interaction strengths (Tanay, 2006) and conservation of the core transcriptional network (Pritsker et al., 2004), and posit that the fraction of a promoter's affinity that is conserved in all species—that is, the minimum total affinity among orthologous promoters—can be used as a proxy for the fraction of the affinity that is functional. We tested this idea using Saccharomyces cerevisiae and three closely related yeast species. We used the position-specific affinity matrix (PSAM) model (Foat et al., 2006) to predict the affinity of each promoter in each species for a set of TFs with previously characterized sequence specificities (MacIsaac et al., 2006). For each TF, this yields a value for the total affinity at every promoter in S. cerevisiae (NT). The minimum of the four orthologous promoters’ affinities defines the conserved promoter affinity (NC) at each promoter, and the unconserved promoter affinity (NU) is calculated by subtracting NC from NT (Fig. 2
We also develop a measure of correlation between genome-wide NC landscapes for pairs of TFs (affinity co-conservation). The interactions thus predicted are highly enriched for known physical or functional interactions between TFs. When the same approach is repeated using NT (affinity co-occurrence), no such enrichment is detected. Our method holds promise for predicting in vivo function when only in vitro TF binding data and an ensemble of closely related genome sequences are available. It is fundamentally different from other methods because it is free of the parameters which govern local alignment, thresholding between targets and non-targets, and any distinction between conserved and non-conserved instances of individual binding sites. 2 METHODS 2.1 Modeling TF affinities for orthologous promoter sequences Genome sequences for S. bayanus, S. cerevisiae, S. mikatae and S. paradoxus were obtained from Kellis et al. (2003). Only genes for which the authors defined orthologs in all species were considered. For the correlation and GO analyses, we extracted sequences 500 bp upstream from each start codon and truncated them to exclude any upstream coding sequences. For the co-occurrence and co-conservation analyses, to negate length effects and spurious correlations, we extracted 500 bp upstream sequences without regard for upstream coding sequences; additionally, to avoid any overlap, we considered only the sequence upstream of the gene encoded on the Watson strand at divergent promoters with a length <1 kb.We used the PSAM model for TF–DNA binding affinity (Foat et al., 2006), consisting of a matrix of parameters wjb. These weights represent the energetic consequences of mutations from the highest affinity sequence Sref to nucleotide j at position b within the binding window, resulting in a mutated sequence Smut with an increased free energy of binding:
We used the AffinityProfile utility from the MatrixREDUCE software package to calculate the affinity Ngfs for each promoter g for each TF f in each species s. We define the total promoter affinity as the affinity Ngfs for s= S. cerevisiae, and the conserved promoter affinity ( ) of a promoter g for a TF f as the minimum of Ngfs among all four species s. The unconserved promoter affinity is given by
2.2 Correlation with functional genomics data We selected those TFs for which both a PSAM and a corresponding TF deletion expression microarray experiment from Hughes et al. (2000) were available and calculated the Pearson correlation between NU for all promoters and the corresponding log2 expression ratios, and between NC for all promoters and the corresponding log2 expression ratios. Similarly, we considered TFs for which Gao et al. (2004) applied the MA-Networker algorithm, which predicts the regulatory susceptibility of each promoter to a given TF by combining whole-genome chromatin immunoprecipitation (ChIP–chip) data with a compendium of expression profiles. We calculated the Pearson correlation between both NU and NC for each TF and the corresponding promoter–TF coupling T-values from MA-Networker. To investigate the relationship between conservation and nucleosome occupancy, we used the nucleosome ChIP–chip data from Bernstein et al. (2004). We calculated the Pearson correlation between either NU or NC for each TF on one hand, and the mean log2 ratios across nucleosome ChIP experiments on the other. 2.3 GO enrichment analysis We performed GO (Ashburner et al., 2000) analysis using a non-parametric variant of the T-profiler algorithm (Boorsma et al., 2005) described by Scheer et al. (2006) and Bussemaker et al. (2007a). Briefly, using the NT, NC, and NU for each TF, we performed a Mann–Whitney–Wilcoxon test between the affinities for genes within each GO category and affinities for all other genes. We performed a Bonferroni correction on the resulting P-values by multiplying them by the number of unique GO categories. We use the YEAST package from the BioConductor platform (Gentleman et al., 2004) within the R statistical programming environment. 2.4 Affinity co-occurrence and co-conservation We define the affinity co-occurrence between two TFs as the rank correlation between NT profiles of those TFs beyond what is explained by the similarity between pairs of PSAMs. Affinity co-conservation is defined similarly, but using NC instead of NT. We first calculated the Spearman correlation coefficient ρ using each of the 7626 pairs of TFs. To control for the confounding effect of similarity between PSAMs, we then randomly permuted the sequence of every promoter in each genome, and performed the same analysis. Using 3874 random samples, we obtained a null distribution for ρ for each TF pair. This procedure was performed separately for co-occurrence and co-conservation. These distributions were found to be normal upon examination of a Q–Q plot, which then allowed us to calculate a mean and SD for the null distribution for each TF pair, and assign two P-values to each TF–TF pair which served as our co-occurrence and co-conservation metrics. We also assigned a false-discovery rate (FDR) threshold α to each pair as described by Benjamini and Hochberg (1995). We assigned ranks i to the co-affinity and co-occurrence P-values, and calculated the FDR threshold α=pn/i, where n is the number of pairs (7626). We compared our co-occurrence and co-conservation scores against four sources of validation: the subset of cofactor pairs reported by Banerjee and Zhang (2003) that had been experimentally validated; physical interactions deposited in the Saccharomyces Genome Database (SGD) (Cherry et al., 1998), including interactions detected by mass spectrometry affinity capture, Western Blot affinity capture, complex reconstitution and yeast two-hybrid; synthetic lethal interactions from SGD; and synthetic rescue interactions from SGD. We used the Mann–Whitney–Wilcoxon P-value and the area under the ROC curve (ROC AUC) to assess the performance of the co-occurrence and co-conservation scores at predicting validated interactions of each type. We used the ROCR package developed by Sing et al. (2005) within the R statistical programming environment for calculating the ROC AUC. 3 RESULTS 3.1 Most predicted binding affinity for TFs is unconserved For each TF, we first analyzed the overall conservation of affinity cf, defined as the ratio of the genome-wide sum of conserved promoter affinities NC and the genome-wide sum of total promoter affinities NT. The overall conservation varies greatly between factors (Fig. 3
3.2 Conserved affinity correlates with in vivo regulatory susceptibility To explore whether the conserved fraction of the affinity at a promoter NC corresponds to the part of the affinity that is functional, and conversely whether the unconserved fraction NU corresponds to non-functional affinity, we analyzed transcriptional response to TF deletion as measured using expression microarrays by Hughes et al. (2000). Using the Pearson correlation between both components of the affinity and the expression log2 ratios shows that, indeed, the NC tends to predict the transcriptional response of a promoter better than the NU (Fig. 4
Such microarray experiments are often difficult to interpret, because of the drastic physiological change induced and the lack of distinction between direct and indirect targets. The analysis by Gao et al. (2004) was one of a family of techniques to integrate binding and expression data to estimate the regulatory susceptibility of each S. cerevisiae promoter to each of a variety of TFs. The authors noted that this quantity did not correlate well with sequence-based predictions of affinity due to an abundance of non-functional binding. Again using the Pearson correlation, we found that these susceptibilities are explained by the conserved affinity NC, in contrast to the unconserved affinity NU (Fig. 4 3.3 Conserved affinity correlates with nucleosome depletion There are a variety of reasons why a promoter with high affinity for a TF might not be regulated by that TF and consequently would not display conserved affinity across orthologous promoters. Specifically, (i) the TF may occupy the promoter, yet lack the appropriate cofactors at that promoter, or the preferred binding site within that promoter might be in the wrong position or orientation relative to the transcription start site to recruit or activate polymerase; or (ii) the TF might not bind as predicted by sequence because of competing occupancy by nucleosomes. We can distinguish between these two models using ChIP–chip data that probe genomic occupancy by nucleosomes and by the factors themselves. When we compare NC and NU for TFs with their corresponding promoter occupancies as measured by Harbison et al. (2004), we find modest support for the former mechanism; NC and NU tend to both correlate with binding of the TF, although NC often correlates more strongly (data not shown). However, when we look at nucleosome occupancy data from Bernstein et al. (2004), we observe a strong and consistent relationship between nucleosome depletion and NC, but not NU (Fig. 5
3.4 Conserved affinity profiles allow predictions of novel TF functions Any genome-wide measure of promoter–TF connectivity can be combined with prior classifications of genes such as those curated by the GO consortium (Ashburner et al., 2000) to detect functional bias (Bussemaker et al., 2007a). We tested the utility of calculating the conserved affinities NC by analyzing their bias toward GO categories, and by comparing this bias to that which is detected using unconserved affinities NU. We expected NU to be distributed at random in the genome, and that any functional bias should be restricted to NC. Indeed, we find that in general, the significance of the bias toward the most enriched GO category is much stronger when using NC (Fig. 6
We also find that the number of GO categories within which significant bias is detected is almost always greater when using the conserved affinity NC compared to the unconserved affinity NU (Supplementary Table 1). In some cases, we are able to predict functions for TFs using NC that are not ascertainable from NT; these predictions tend to agree with those provided by the literature, when available (Supplementary Table 1). We are able to formulate several hypotheses about TF functions using NC that have not been reported in the literature:
3.5 Affinity for TFs that regulate RP genes displays a unique conservation pattern In the course of our analysis, we made the surprising discovery that the factors Rap1, Sfp1, and Fhl1 do not follow the same pattern as the other studied factors: the unconserved affinity NU for these factors appears to be functionally biased and associated with nucleosome depletion more so than the conserved affinity NC. Each of these TFs is known to regulate RP genes, whose expression constitutes half of the RNA polymerase II transcription in rapidly growing yeast cells (Warner, 1999). Our GO analysis for these factors indicates that the strongest functional bias toward RPs in each case is found by considering NT rather than either NU or NC. One explanation might be that the function of Rap1/Sfp1/Fhl1 affinity is context independent, leading even unconserved affinity to be functional, whereas for other TFs, a uniform background level of NU is found throughout the genome (because it is non-functional outside of the proper genomic context, and therefore not selected against). The association of NU with RP promoters, which one would expect to be exceptionally nucleosome depleted, explains the observed anomalous correlation between NU for these factors and nucleosome depletion. 3.6 Affinity co-conservation provides evidence for TF–TF interactions Co-occurrence and co-conservation of individual TF binding sites has been an extensively studied line of evidence for predicting cofactor pairs (Chiang et al., 2003; Pilpel et al., 2001; Sudarsanam et al., 2002). Pairwise comparison of genome-wide NT and NC distributions constitutes a natural extension to these methods, which we term affinity co-occurrence and affinity co-conservation. Similarity in genome-wide NT and NC profiles between two TFs is governed, to an extent, by the similarity between their sequence specificities (PSAMs); we controlled for this effect by developing a null model for each TF–TF pair based on scrambled genomes. At a FDR threshold of α<0.05, we discovered 29 pairs using co-occurrence and 1530 by co-conservation (reported in full in Supplementary Table 2). We first compared these results to the subset of 11 cofactors predicted by Banerjee and Zhang (2003) that were experimentally validated. None of these validated pairs were among the 29 co-occurring pairs, while 8 of the 11 validated pairs were among the 1530 co-conserved pairs (hypergeometric P=1.9×10−5). At this FDR-level stringency it is difficult to compare the two methods, so we used the ROC AUC (Fig. 7 We then compared our predictions to other sets of TF interactions discovered via high-throughput experiments: physical interactions, synthetic lethal interactions, and synthetic rescue interactions (Table 1).
While physical interaction or synthetic lethality between two TFs does not necessarily imply that they act together as cofactors or even target the same genes, we nevertheless find that these types of evidence are associated with significant signals of affinity co-conservation. Synthetic rescue interactions are not associated with co-conservation, which is to be expected because such interactions indicate a complementary rather than synergistic relationship. Again, it is notable that in none of these cases does co-occurrence of affinity in a single genome provide evidence that agrees with experimentally validated pairs. The pairs of interactions that we detect by co-conservation (Supplementary Table 2) are biologically plausible. The master cell-cycle regulators Fkh1 and Fkh2 have many interaction partners; the cell-cycle regulators Mcm1 and Yox1 are connected; the RP regulators Fhl1, Rap1, and Sfp1 are connected; heme-activator proteins Hap1, Hap2, Hap3, and Hap4 are connected; sulfur amino acid biosynthesis regulators Met4 and Met32 are connected; and glycolysis regulators Gcr1 and Gcr2 are connected. Interestingly, the serum response factor like protein Rlm1 has many interaction partners, suggesting that it has a genome-wide regulatory function that is shared by many factors with diverse functions. CONCLUSION We have exploited the pattern of TF affinity conservation across orthologous promoters to infer the fraction of TF affinity at each promoter that is functional. This method should be broadly applicable to situations where the in vitro binding specificity of a TF is known, but its in vivo function has not been demonstrated experimentally. It may also be especially useful in the analysis of the genomes of higher eukaryotes, in which non-functional binding sites outnumber functional ones to an even higher degree than in yeast. In contrast to phylogenetic footprinting methods, information from the full range of potential binding affinities is incorporated, allowing for more sensitive detection of potential TF–promoter and TF–TF interactions. [Supplementary Data]
ACKNOWLEDGEMENTS The author would like to thank Richard Mann, Gabor Halasz, and Ronald Tepper for critical readings of the manuscript, and members of the Bussemaker laboratory for helpful discussions. Funding: This work was supported by National Institutes of Health (NIH) grants R24GM074105, T32GM008798, R01HG003008, and U54CA121852. Conflict of Interest: none declared. REFERENCES
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||
Nat Genet. 2001 Oct; 29(2):153-9.
[Nat Genet. 2001]Science. 2003 Jul 4; 301(5629):71-6.
[Science. 2003]Nature. 2003 May 15; 423(6937):241-54.
[Nature. 2003]Proc Natl Acad Sci U S A. 2005 Jul 5; 102(27):9481-6.
[Proc Natl Acad Sci U S A. 2005]Genome Biol. 2004; 5(12):R98.
[Genome Biol. 2004]PLoS Comput Biol. 2005 Dec; 1(7):e67.
[PLoS Comput Biol. 2005]Bioinformatics. 2006 Jul 15; 22(14):e141-9.
[Bioinformatics. 2006]BMC Bioinformatics. 2006 Mar 7; 7():113.
[BMC Bioinformatics. 2006]Nature. 2003 May 15; 423(6937):241-54.
[Nature. 2003]Bioinformatics. 2006 Jul 15; 22(14):e141-9.
[Bioinformatics. 2006]BMC Bioinformatics. 2006 Mar 7; 7():113.
[BMC Bioinformatics. 2006]Annu Rev Biophys Biomol Struct. 2007; 36():329-47.
[Annu Rev Biophys Biomol Struct. 2007]J Mol Biol. 1987 Feb 20; 193(4):723-50.
[J Mol Biol. 1987]Nucleic Acids Res. 1986 Aug 26; 14(16):6661-79.
[Nucleic Acids Res. 1986]Cell. 2000 Jul 7; 102(1):109-26.
[Cell. 2000]Genome Biol. 2004; 5(9):R62.
[Genome Biol. 2004]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]Nucleic Acids Res. 2005 Jul 1; 33(Web Server issue):W592-5.
[Nucleic Acids Res. 2005]Nucleic Acids Res. 2006 Jul 1; 34(Web Server issue):W510-5.
[Nucleic Acids Res. 2006]BMC Bioinformatics. 2007 Sep 27; 8 Suppl 6():S6.
[BMC Bioinformatics. 2007]Genome Biol. 2004; 5(10):R80.
[Genome Biol. 2004]Nucleic Acids Res. 2003 Dec 1; 31(23):7024-31.
[Nucleic Acids Res. 2003]Nucleic Acids Res. 1998 Jan 1; 26(1):73-9.
[Nucleic Acids Res. 1998]Bioinformatics. 2005 Oct 15; 21(20):3940-1.
[Bioinformatics. 2005]Cell. 2000 Jul 7; 102(1):109-26.
[Cell. 2000]Cell. 2000 Jul 7; 102(1):109-26.
[Cell. 2000]Nature. 2004 Sep 2; 431(7004):99-104.
[Nature. 2004]Genome Biol. 2004; 5(9):R62.
[Genome Biol. 2004]Nat Genet. 2001 Aug; 28(4):327-34.
[Nat Genet. 2001]Proc Natl Acad Sci U S A. 2004 Oct 5; 101(40):14315-22.
[Proc Natl Acad Sci U S A. 2004]Mol Cell Biol. 1999 Aug; 19(8):5279-88.
[Mol Cell Biol. 1999]Genome Biol. 2004; 5(9):R62.
[Genome Biol. 2004]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]BMC Bioinformatics. 2007 Sep 27; 8 Suppl 6():S6.
[BMC Bioinformatics. 2007]Mol Cell Biol. 1997 Dec; 17(12):6982-93.
[Mol Cell Biol. 1997]FEBS Lett. 1998 Mar 27; 425(2):323-8.
[FEBS Lett. 1998]Nucleic Acids Res. 2001 May 15; 29(10):2181-90.
[Nucleic Acids Res. 2001]Genes Dev. 2002 Dec 1; 16(23):3034-45.
[Genes Dev. 2002]Nucleic Acids Res. 2003 Dec 1; 31(23):7024-31.
[Nucleic Acids Res. 2003]Mol Cell Biol. 1997 Dec; 17(12):6982-93.
[Mol Cell Biol. 1997]FEBS Lett. 1998 Mar 27; 425(2):323-8.
[FEBS Lett. 1998]Nucleic Acids Res. 2001 May 15; 29(10):2181-90.
[Nucleic Acids Res. 2001]Genes Dev. 2002 Dec 1; 16(23):3034-45.
[Genes Dev. 2002]Nucleic Acids Res. 2003 Dec 1; 31(23):7024-31.
[Nucleic Acids Res. 2003]Trends Biochem Sci. 1999 Nov; 24(11):437-40.
[Trends Biochem Sci. 1999]Genome Biol. 2003; 4(7):R43.
[Genome Biol. 2003]Nat Genet. 2001 Oct; 29(2):153-9.
[Nat Genet. 2001]Genome Res. 2002 Nov; 12(11):1723-31.
[Genome Res. 2002]Nucleic Acids Res. 2003 Dec 1; 31(23):7024-31.
[Nucleic Acids Res. 2003]Nucleic Acids Res. 2003 Dec 1; 31(23):7024-31.
[Nucleic Acids Res. 2003]