Nucleic Acids Res. Dec 15, 2002; 30(24): 5549–5560.
PMCID: PMC140044

Discovery of novel transcription factor binding sites by statistical overrepresentation

Abstract

Understanding the complex and varied mechanisms that regulate gene expression is an important and challenging problem. A fundamental sub-problem is to identify DNA binding sites for unknown regulatory factors, given a collection of genes believed to be co-regulated. We discuss a computational method that identifies good candidates for such binding sites. Unlike local search techniques such as expectation maximization and Gibbs samplers that may not reach a global optimum, the method discussed enumerates all motifs in the search space, and is guaranteed to produce the motifs with greatest z-scores. We discuss the results of validation experiments in which this algorithm was used to identify candidate binding sites in several well studied regulons of Saccharomyces cerevisiae, where the most prominent transcription factor binding sites are largely known. We then discuss the results on gene families in the functional and mutant phenotype catalogs of S.cerevisiae, where the algorithm suggests many promising novel transcription factor binding sites. The program is available at http://bio.cs.washington.edu/software.html.

INTRODUCTION

One of the major challenges facing biologists is to understand the varied and complex mechanisms governing the regulation of gene expression. This paper focuses on one important aspect of this challenge, the identification of binding sites in DNA for the factors involved in regulation. This is a necessary first step in determining which factors regulate the gene and how.

The analysis of non-coding regions in eukaryotic genomes in order to identify regulatory elements is a difficult problem and one that is not yet well solved. Some of the reasons for this difficulty are as follows: (i) binding sites of multiple interacting transcription factors often play a role in the regulation of a single gene; (ii) there can be great variability in the binding sites of a single factor, and the nature of the allowable variations is not well understood; (iii) the regulatory elements may be located quite far from the corresponding coding region, either upstream or downstream or in the introns.

Any algorithm whose goal is to discover novel regulatory elements takes as input a set of regulatory regions of genes, many of which are suspected to contain a common regulatory element. There are many possible sources for such co-regulated genes, including expression microarray experiments, gene knockout experiments and functional classes from the literature. This paper focuses on the regulation of genes in the yeast Saccharomyces cerevisiae, since much is known both about its transcription factors and about the functions of its genes.

A number of algorithms to discover general motifs have been proposed (1–9). Many of these algorithms are designed to find longer or more general motifs than are required for identifying transcription factor binding sites. The price paid for this generality is that many of the cited algorithms are not guaranteed to find globally optimal solutions, since they employ some form of local search, such as Gibbs sampling, expectation maximization or greedy algorithms, that may terminate in a locally optimal solution. There have been some studies that have applied these local search techniques specifically to the problem of identifying transcription factor binding sites in S.cerevisiae, with some success (10–14).

The number of well conserved bases in the collection of binding sites of a single S.cerevisiae transcription factor is typically six to ten (15,16). This number is small enough that, for this particular problem, one need not rely on such general local search heuristics. Instead, one can afford to use enumerative methods that guarantee global optimality. This is the approach taken by the current paper, whose method is most closely allied to those of van Helden et al. (17–19) and Tompa (20). There are also other studies using an enumerative approach to motif finding (21–23).

We review a motif model that is tailored to accurately represent transcription factor binding sites in S.cerevisiae. We then review an enumerative algorithm from Sinha and Tompa (24) called YMF (Yeast Motif Finder) which, given the regulatory regions of several related genes, is guaranteed to produce the motifs with greatest z-scores. The present paper focuses on the application of that method to classes of yeast genes. We first present the results of validation experiments in which YMF was used to identify candidate binding sites in several well studied regulons of S.cerevisiae, where the most prominent transcription factor binding sites are largely known. We then present results on gene families in the functional and mutant phenotype catalogs of S.cerevisiae taken from the MIPS database (25), where YMF suggests many novel transcription factor binding sites. Our goal was to discover motifs in the classes from these catalogs, since genes with common mutant phenotypes or common function may have the same regulatory mechanism and hence may share informative binding sites.

Hughes et al. (11) performed a similar analysis of the MIPS functional catalog using AlignACE, a local search algorithm based on Gibbs sampling. There are a number of differences between their results and ours. Most important is that differences in the motif model and search method (local search heuristic versus enumerative search) lead to different significant motifs. In a separate paper (S.Sinha and M.Tompa, in preparation) we compare the accuracy of YMF and other methods such as AlignACE on both simulated data and on yeast regulons. Those results suggest that YMF may provide more accurate prediction of regulatory elements. A second difference is that Hughes et al. (11) merge motifs found in different functional classes. As a result, the important connection between transcription factor binding site and gene function, necessary for understanding regulatory relationships, is not apparent from their tables, whereas it is explicit in ours. Finally, Hughes et al. (11) report results only on the functional catalog and not on the mutant phenotype catalog.

MATERIALS AND METHODS

Variability among binding site instances

The first question that must be addressed is ‘What constitutes a motif?’ for the application of transcription factor binding sites in S.cerevisiae. An inspection of transcription factor databases such as TRANSFAC (15; http://transfac.gbf-braunschweig.de/TRANSFAC/) and SCPD (16; http://cgsigma.cshl.org/jian/) and of the relevant literature (26–33), particularly Jones et al. (26), which is rich in examples, reveals that there is significant variation among the binding sites of any single transcription factor. Moreover, the nature of the variability itself varies from factor to factor, so that the ‘correct’ motif model is far from clear.

Certain trends that must be incorporated in the motif model do, however, emerge from this literature, particularly from SCPD (see the column labeled ‘Consensus’ in Table 1 for examples). (i) Many of the motifs, such as the Gal4p binding site CGGNNNNNNNNNNNCCG, have spacers varying in length from 1 to 11 bp. The spacers usually occur near the middle of the motif, often because the factors bind as dimers or tetramers. (ii) The number of well conserved bases (not including spacers, of course) is usually in the range 6–10. This number is called the length of the motif. (iii) When there is variation in a conserved motif position, it is often a transition (i.e. the substitution of a purine for a purine or a pyrimidine for a pyrimidine) rather than a transversion. This is because of the similarity in nucleotide size necessary to fit the transcription factor’s fixed DNA-binding domain. Somewhat less often, the variation in a given position may be between a pair of complementary bases. Other positional variations are rarer. (iv) Insertions and deletions among binding sites are uncommon, again because of the fixed structure of the factor’s DNA-binding domain.

Table 1.
Performance of YMF on regulons in SCPD with known binding sites

Based on these observations, a motif for our application is a string of length 6–10 over the alphabet {A,C,G,T,R,Y,S,W}, with 0 or more consecutive N residues inserted at the center, and a limited number of R (purine), Y (pyrimidine), S (strong) and W (weak) characters, also called degenerate symbols. We choose such a consensus model rather than (say) a weight matrix in order to be able to enumerate motifs. An examination of the 50 binding site consensi included in SCPD (16) revealed that the number of consensi that exactly fit this characterization is 34 (68%). About 10 more fit the characterization if very slight differences from the exact consensus are tolerated.
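The motif model described above can be made concrete with a short sketch. The Python below is illustrative only and is not taken from YMF. The function names (fits_model, count_occurrences), the reading of ‘at the center’ as flanks whose lengths differ by at most one, and the choice to count a reverse-palindromic motif only once per site are all assumptions made for this example.

```python
import re

# Symbols of the consensus model: four bases, four two-base degenerate
# symbols, and N for spacer positions.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYSWN", "TGCAYRSWN")


def reverse_complement(motif):
    """Reverse complement of a motif written in the alphabet above."""
    return motif.translate(COMPLEMENT)[::-1]


def fits_model(motif, max_degenerate=2):
    """True if the motif has 6-10 non-spacer symbols, at most
    max_degenerate of R/Y/S/W, and one central run of N (assumption:
    'central' means the two flanks differ in length by at most one)."""
    core = motif.replace("N", "")
    if not re.fullmatch(r"[ACGTRYSW]{6,10}", core):
        return False
    if sum(c in "RYSW" for c in core) > max_degenerate:
        return False
    if "N" in motif:
        m = re.fullmatch(r"([^N]+)N+([^N]+)", motif)
        if m is None or abs(len(m.group(1)) - len(m.group(2))) > 1:
            return False
    return True


def count_occurrences(motif, sequence):
    """Count occurrences in both orientations, overlaps included."""
    def hits(m):
        # A zero-width lookahead lets overlapping matches be counted.
        pattern = "(?=" + "".join(IUPAC[c] for c in m) + ")"
        return sum(1 for _ in re.finditer(pattern, sequence))
    rc = reverse_complement(motif)
    # Assumption: a motif equal to its own reverse complement is
    # counted once per site rather than once per strand.
    return hits(motif) if rc == motif else hits(motif) + hits(rc)
```

Under these assumptions the Gal4p consensus CGGNNNNNNNNNNNCCG fits the model, whereas CNCGAAA does not, because its N is not central.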

Measure of statistical significance

Given some set of (presumably co-regulated) S.cerevisiae genes, the input to YMF is the corresponding set of promoter regions, each having length 800 bp and having its 3′ end at the gene translation start site. For each motif s, let Ns be the number of occurrences of s in the input sequences, allowing an arbitrary number of occurrences in both orientations per promoter region. A reasonable measure of s as a motif will reflect how unlikely it would be to have Ns occurrences, if the sequences were instead drawn at random according to the background distribution. We use as this measure the statistical significance of the ‘z-score’ of Ns. First, to specify the background distribution, let X be a set of random DNA sequences of the same number and length as the input promoter sequences, but generated by a Markov chain of order m, whose transition probabilities are determined by the (m + 1) mer frequencies in the full complement of 6000+ promoter regions (each of length 800 bp) of S.cerevisiae. In our experiments, we chose m = 3 in order for the background model to account for the TATA, AAAA and TTTT sequences that are ubiquitous throughout the genome’s promoter regions (17). Let the random variable Xs be the number of occurrences of the motif s in these random sequences X and let E(Xs) and σ(Xs) be its mean and standard deviation, respectively. Then the z-score associated with s is

$$z_s = \frac{N_s - E(X_s)}{\sigma(X_s)} \qquad (1)$$

The measure zs is the number of standard deviations by which the observed value Ns exceeds its expectation. See Leung et al. (34) for a detailed discussion of this statistic.
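To illustrate the definition, the following Python sketch trains an order-m Markov background from a set of sequences, samples random sequences from it, and computes a z-score from simulated counts. This is a Monte Carlo stand-in chosen for brevity: YMF itself computes E(Xs) and σ(Xs) with the method of Sinha and Tompa (24), which is not reproduced here. All function names are hypothetical, m is assumed to be at least 1, and the uniform fallback for unseen contexts is an assumption of this sketch.

```python
import random
import statistics
from collections import Counter, defaultdict


def train_markov(sequences, m=3):
    """Tabulate (m+1)-mer counts as: m-mer context -> Counter of next base."""
    table = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - m):
            table[seq[i:i + m]][seq[i + m]] += 1
    return dict(table)


def sample_sequence(table, length, m, rng):
    """Draw one sequence from the order-m chain (m >= 1). The starting
    context is drawn in proportion to its frequency in the training data."""
    contexts = list(table)
    weights = [sum(table[c].values()) for c in contexts]
    seq = rng.choices(contexts, weights)[0]
    while len(seq) < length:
        counts = table.get(seq[-m:])
        if counts:
            bases = list(counts)
            seq += rng.choices(bases, [counts[b] for b in bases])[0]
        else:
            # Assumption: a context never seen in training falls back
            # to a uniformly chosen base.
            seq += rng.choice("ACGT")
    return seq[:length]


def z_score(observed, null_counts):
    """Equation 1, with mean and standard deviation estimated from
    motif counts observed in simulated background sequences."""
    mean = statistics.fmean(null_counts)
    sd = statistics.pstdev(null_counts)
    if sd == 0:
        raise ValueError("null counts have zero variance")
    return (observed - mean) / sd
```

For example, an observed count of 10 against simulated counts with mean 5 and standard deviation 2 gives a z-score of 2.5.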

The z-score zs obeys a normal distribution in the asymptotic limit as the total length of the input promoter regions increases (35). If the assumption of normality is inaccurate, it may not be as meaningful to compare the z-scores of different motifs. In view of this, YMF will be most accurate when the total length of the input promoter region is large. The ultimate test of the method, however, is not whether the z-score passes normality tests, but whether YMF successfully predicts true transcription factor binding sites. Therefore, in order to demonstrate the robustness of the method, in the validation experiments on known regulons we report the results on regulons consisting of as few as three genes.

Since YMF enumerates a large motif space, thereby sampling a large number of points from the distribution, it is expected that some motifs will have a high z-score by chance. To address this, we associate with z-score x a significance pmax(x), which measures the probability that the maximum z-score is at least x, if the input sequences were random. This maximum is taken over all motifs of the given length, number of spacers and number of degenerate symbols. We precompute pmax for a variety of motif parameters and input sequence lengths, by simulation. Random sequences of the same length as the input promoter regions are generated according to the Markov model being used, and YMF is run on these random sequences. The maximum z-score reported is recorded. This experiment is repeated 100 times. The fraction of experiments that yielded maximum z-score at least x is used as an estimate of pmax(x).
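The simulation just described can be sketched in a few lines of Python. The two callables passed to null_maxima are placeholders (assumptions of this example): sample_inputs stands for a routine that generates one random set of promoter-like sequences from the background model, and max_z_score stands for a run of the motif finder that returns the largest z-score it reports.

```python
import random


def null_maxima(sample_inputs, max_z_score, trials=100, seed=0):
    """Run the motif finder on `trials` random inputs and record the
    maximum z-score from each run. Both callables are hypothetical
    stand-ins supplied by the caller."""
    rng = random.Random(seed)
    return [max_z_score(sample_inputs(rng)) for _ in range(trials)]


def estimate_pmax(x, maxima):
    """Fraction of random runs whose maximum z-score is at least x."""
    return sum(z >= x for z in maxima) / len(maxima)


if __name__ == "__main__":
    # Toy stand-ins: each "run" yields 20 Gaussian scores and the
    # "motif finder" simply returns the largest one.
    maxima = null_maxima(lambda rng: [rng.gauss(0, 1) for _ in range(20)], max)
    print(estimate_pmax(2.5, maxima))
```

With 100 repetitions the smallest non-zero estimate is 0.01, so a z-score that exceeds every simulated maximum can only be reported as pmax < 0.01.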

Algorithm summary

The algorithm used by YMF is summarized here. The inputs to the algorithm are as follows: (i) a set of promoter regions; (ii) the number l of non-spacer characters in the motifs to be enumerated (called the motif length); (iii) the maximum number w of spacers in the motifs; (iv) the transition matrix for a third order Markov chain modeling the background distribution of promoter regions.

The parameters l and w, along with the implicitly assumed motif model, define a search space of all candidate motifs that will be evaluated. This space consists of all motifs that have l characters from {A,C,G,T,R,Y,S,W}, and between 0 and w spacers (N) in the middle. Typically, the maximum number of degenerate symbols (R, Y, S or W) was restricted to 2 for computational efficiency, although YMF can be configured to handle different values of this parameter. YMF first makes a pass over the input sequences, tabulating the number Ns of occurrences of each motif s in either orientation, including overlapping occurrences. For each motif s for which Ns > 0, it then computes the mean and standard deviation of the motif count using a method described by Sinha and Tompa (24). Finally, it uses equation 1 to compute the z-score zs and pmax(zs) and outputs the motifs sorted by z-score.

Because the number of motifs is exponential in l, we can afford this enumerative method only for modest values of l. In contrast, however, the running time is linear in the size of the input sequences, so that the method scales very well to larger gene families and longer promoter regions. The current implementation typically runs in a few seconds for motifs of length 6 on a Pentium processor with 256 MB memory. For length 9 motifs it requires a few minutes.
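The size of the search space can be illustrated with a small enumeration sketch in Python. It is not the YMF implementation: the function names are hypothetical, the spacer run is assumed to be inserted after the first l // 2 characters, and a motif and its reverse complement are enumerated as separate strings.

```python
from itertools import combinations, product
from math import comb


def enumerate_motifs(l, w, max_degenerate=2):
    """Yield every motif with l non-spacer symbols, at most
    max_degenerate of R/Y/S/W, and 0..w central N spacers."""
    half = l // 2  # assumption: spacers go after the first l // 2 symbols
    for k in range(max_degenerate + 1):
        for degenerate_at in combinations(range(l), k):
            pools = ["RYSW" if i in degenerate_at else "ACGT"
                     for i in range(l)]
            for symbols in product(*pools):
                core = "".join(symbols)
                for gap in range(w + 1):
                    yield core[:half] + "N" * gap + core[half:]


def search_space_size(l, w, max_degenerate=2):
    """Closed-form count of the motifs yielded above: each position has
    four choices whether or not it is degenerate."""
    cores = sum(comb(l, k) * 4 ** l for k in range(max_degenerate + 1))
    return (w + 1) * cores


if __name__ == "__main__":
    print(search_space_size(6, 11))  # 1081344
    print(search_space_size(9, 0))
```

The factor 4^l in the count is the exponential dependence on l referred to above, while the spacer bound w contributes only a linear factor.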

Both a web interface and the source code for YMF are freely available at http://bio.cs.washington.edu/software.html.

Experimental methods

The maximum number w of spacers allowed in a motif was varied depending on the motif length parameter l. For l = 6, we used w = 11, which means that length 6 motifs were allowed to have between 0 and 11 spacers in the middle. This is in accord with observed motifs from SCPD. However, this introduces an inherent bias in the method toward finding motifs with spacers, since there are 11 times as many motifs with spacers as without. To include some runs without this bias, when YMF was run with l > 6, we used w = 0, i.e. no spacers allowed.

There are three different types of post-processing steps that were used to produce the most promising candidate binding sites to report. The first is a tool called FindExplanators (36). A set of promoter sequences having binding sites for a few different transcription factors typically contains hundreds of statistically overrepresented motifs, most of them being minor variations of the true binding site motifs. YMF will report all these overrepresented motifs. For example, suppose a factor binds to TCACGCT in a set of sequences, causing this motif to be overrepresented. Many of its variations, e.g. CACGCTT or TCACGCW, are also likely to be overrepresented, simply because each has its number of occurrences artificially increased by the presence of TCACGCT. FindExplanators is a tool that extracts the few significantly independent motifs from the vast number that are simply artifacts of these few.

Since YMF evaluates a motif based on its total number of occurrences in a set of sequences, a motif may have a high z-score (low pmax) even if it occurs unusually often in only one of the promoters. Such motifs may not be interesting candidates for transcription factor binding sites. Multiple occurrences of a motif in a promoter suggest some significance, but a very large number of occurrences in the same promoter may suggest a repetitive element rather than a regulatory element. Thus motifs are post-processed so that those that have high z-scores due to a large number of occurrences in one or two promoters are not reported. For this purpose we developed a numerical measure that captures the notion of a motif being well distributed among the promoters. Given a set X of promoters and a motif s, we first count the occurrences of s in each promoter. Let X+ be the set of promoters that have at least one occurrence of s and let D = {d_1, d_2, …, d_p} (where p = |X+|) be the distribution of occurrences of s in X+. Intuitively, a well distributed motif is one for which D has a low variance. However, the variance itself is not comparable for sample distributions obtained from different populations, so we normalize it by dividing by the expectation. Thus, our statistic is $w = \sum_{i=1}^{p} (d_i - \mu)^2 / \mu$, where µ is the mean of the distribution D. We call this the w-score of motif s. Lower values of the w-score indicate better distributed motifs. Note that w is identical to the χ² statistic and we use the χ² distribution with p – 1 degrees of freedom to compute a significance threshold on the w-score. Notice that we compute w from X+ and not from X, since in general we may not find the binding site present in all the input promoter sequences.
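A minimal Python sketch of the w-score follows. The function names are hypothetical. To keep the example free of external dependencies, the χ² critical value is obtained from the Wilson-Hilferty approximation rather than an exact quantile, and a motif is assumed to be rejected when its w-score exceeds the 95% critical value; both choices are assumptions of this sketch.

```python
from statistics import NormalDist


def w_score(counts_per_promoter):
    """Return (w, degrees of freedom) for per-promoter motif counts.
    Promoters with zero occurrences are dropped, i.e. only X+ is used."""
    d = [c for c in counts_per_promoter if c > 0]
    if not d:
        raise ValueError("motif does not occur in any promoter")
    mu = sum(d) / len(d)
    w = sum((x - mu) ** 2 for x in d) / mu
    return w, len(d) - 1


def chi2_critical(df, level=0.95):
    """Approximate chi-square quantile (Wilson-Hilferty), for df >= 1.
    An exact quantile function could be substituted."""
    z = NormalDist().inv_cdf(level)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * a ** 0.5) ** 3


if __name__ == "__main__":
    # Hypothetical counts: evenly spread versus concentrated in one promoter.
    for counts in ([1, 1, 0, 2, 1, 1], [1, 1, 9, 1]):
        w, df = w_score(counts)
        print(counts, round(w, 3), w > chi2_critical(df))
```

In the hypothetical example, the evenly spread counts give w of about 0.67, while the counts dominated by a single promoter give w = 16, well above the approximate 95% critical value for three degrees of freedom.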

As another means of evaluating the motifs, we introduce a co-expression score, which measures the similarity of the expression profiles of the genes corresponding to X+. This score is computed from data in the database ExpressDB (37), which catalogs mRNA expression level information from several different studies under a common framework. The expression data are normalized across studies by conversion into estimated relative abundances or ‘ERAs’. Such values are available for all yeast genes under 217 different conditions. For each pair of genes, we compute the correlation coefficient of their ERA values. Given the set X+, we compute the average pairwise correlation coefficient over all pairs of genes in X+. We then estimate a p-value of this average pairwise correlation coefficient, by choosing |X+| random genes and computing the same score for these, and repeating several times. This p-value is called the co-expression score of X+ and a low value indicates that the genes in X+ have an unusually high pairwise correlation coefficient on average.
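The co-expression score can be sketched as an empirical p-value, as in the Python below. The data layout (a dictionary mapping each gene name to its list of expression values), the use of the Pearson coefficient, the number of random draws and the function names are assumptions of this example. At least two genes are required, and every profile is assumed to be non-constant.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_pairwise_correlation(profiles):
    """Average correlation over all unordered pairs of profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def coexpression_score(gene_set, expression, trials=1000, rng=None):
    """Fraction of random gene sets of the same size whose average
    pairwise correlation is at least that of gene_set."""
    rng = rng or random.Random()
    observed = mean_pairwise_correlation([expression[g] for g in gene_set])
    genes = list(expression)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(genes, len(gene_set))
        score = mean_pairwise_correlation([expression[g] for g in sample])
        if score >= observed:
            hits += 1
    return hits / trials
```

A low returned value indicates that few random gene sets are as tightly correlated as the set under study.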

RESULTS AND DISCUSSION

Validation on known regulons

The SCPD database (16) has a collection of transcription factors and the genes regulated by each factor. Each such set of genes comprises a regulon. For each gene in a regulon, the database lists the experimentally determined binding sites of the transcription factor, and in many cases the consensus sequence of the binding sites in the regulon is also given. It is not always clear from the binding sites alone what their consensus should be, because there is often more than one way to align them, and to choose a consensus with degenerate symbols. Hence, we rely on the consensus listed at SCPD. YMF was run on each regulon in SCPD that has at least three genes and has a cataloged consensus sequence for its binding sites. There are 23 such regulons. The success of YMF was assessed by comparing the top motifs reported with the known consensus for the regulon. The program was run three times on each regulon, to find motifs of length 6, 7 and 8, respectively. For length 6 motifs, a maximum of 11 spacers in the middle was allowed. For lengths 7 and 8, the motif model did not include spacers (see Materials and Methods for details). In all runs, a maximum of two degenerate symbols (R, Y, S or W) was allowed in the candidate motifs.

The results are summarized in Table 1. Each row corresponds to a regulon. For each of the three runs of YMF on that regulon, the motif with greatest z-score is presented, along with its total count in the input promoter regions, its z-score z, and pmax(z). Lower pmax values are indicative of higher statistical significance (see Materials and Methods). Reported motifs that can be superimposed with the known consensus for the regulon without conflicting characters at any position, and that have at least four positions (possibly degenerate symbols) identical to the consensus are considered matches and are typeset in bold.
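One possible reading of this match criterion is sketched in Python below. Several details are assumptions, since the text does not fix them: two symbols ‘conflict’ when their sets of allowed bases are disjoint, identical positions are counted only for non-N symbols, partial overlaps at any offset are permitted, and the reverse complement of the reported motif is also tried. The function names are hypothetical.

```python
# Full IUPAC table, since some cataloged consensi use symbols such as M.
BASES = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
         "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
         "H": "ACT", "V": "ACG", "N": "ACGT"}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    return motif.translate(COMPLEMENT)[::-1]


def best_overlap(motif, consensus):
    """Largest number of identical non-N positions over all offsets at
    which the two strings superimpose without a conflict."""
    best = 0
    for offset in range(-len(motif) + 1, len(consensus)):
        identical = 0
        for i, a in enumerate(motif):
            j = offset + i
            if 0 <= j < len(consensus):
                b = consensus[j]
                if not set(BASES[a]) & set(BASES[b]):
                    identical = -1  # conflicting characters: reject offset
                    break
                if a == b and a != "N":
                    identical += 1
        best = max(best, identical)
    return best


def is_match(motif, consensus, min_identical=4):
    """Apply the criterion in either orientation of the reported motif."""
    forward = best_overlap(motif, consensus)
    reverse = best_overlap(reverse_complement(motif), consensus)
    return max(forward, reverse) >= min_identical
```

Under this reading, CACGAAA matches the SCB consensus CNCGAAA, whereas TCYAGAA does not match the cataloged consensus GAANNTCC.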

For 15 of the 23 regulons, the top motif reported (for one or more values of the motif length parameter) was a match. For 14 of the 15 regulons, there was a match with pmax < 0.1, the exception being MATa2. In another regulon, MCM1, the top ranking motifs (for length 6) were variants of the poly(A) element (any motif that can be instantiated to a string of all A residues, e.g. AAAAWAAA), and the first non-poly(A) motif, at rank 11 with pmax 0.01, was CCSNNNNAGG, similar to the known consensus CCNNNWWRGG. For the regulon RAP1, the top motif reported (for length 7) is GCAYGTG, which matches part of the inositol/choline response element (ICRE) with consensus SCAYRTGAARW (we discuss this motif and its connection to the RAP1 regulon below). The first motif reported by YMF (for the same length parameter) that is not a variant of GCAYGTG is RCACCCA, at rank 11 with pmax 0.02. Note that this closely matches the known consensus RMACCCA for the RAP1 regulon. For the regulon HSE,HSTF, the consensus cataloged at SCPD is GAANNTCC. However, an alignment of the known binding sites of this transcription factor, as reported in the same database, reveals a consensus pattern of TCTAGAA. This closely matches the top motif TCYAGAA reported by YMF for length 7. Thus, counting MCM1, RAP1 and HSE,HSTF also as successes, we are left with only five regulons (GCR1, ROX1, SFF, TBP and UASPHR) on which YMF failed to report any match to the known binding site consensus. Note that the 23 regulons represent the typical input for a motif-finder; they are of varying sizes (3–38 genes) and have a variety of known binding sites (length 5–10, with few to many spacers or degenerate symbols). The results thus demonstrate the applicability of the method on a variety of data sets.

In most cases, a match was found in the top three motifs for multiple values of l, indicating that the performance is not crucially dependent on prior knowledge of the motif length. In some cases, YMF found a match even though the known consensus of the binding site does not conform to the motif model YMF uses. For instance, the regulon SCB has the sequence CNCGAAA as its binding site consensus, with an ‘N’ that is not in the middle. Nevertheless, a very similar motif CACGAAA was reported. Similarly, for HAP1 (consensus CGGNNNTANCGG), the motif SGGNNNNNNSGG was discovered.

The regulon ABF1 is an example of a case where multiple occurrences of the binding site are found in the same promoter region. Of the 19 genes in this regulon, eight have two or more occurrences of the motif TCRNNNNNNACG in their promoter region. There are a total of 36 occurrences of the motif, giving it a very high z-score of 10.07. If each of the 19 genes had only one occurrence of the motif, for a total of 19 occurrences, the z-score would have been about 4.03, which is rather low, meaning that the motif would not have been reported as significant.

As noted above, a pmax value of 0.1 or less served as a good indicator of a significant motif, in the sense that most of the matches occurred with pmax < 0.1. We therefore examined all motifs (from Table 1) that are reported to have a pmax value less than this threshold, to see if there are interesting signals in the regulon that are different from the known binding sites, and also to have an idea of the false positive rate. Table 2 summarizes our observations. It includes each motif from Table 1 that has a pmax value <0.1, is not a poly(A), poly(T) or TATA motif and is not a match. There are 13 such motifs. Three of them (CGCWCGG and CGCACGGA in the GAL4 regulon and ARCCGCCG in the MIG1 regulon) occur overlapping with known Gal4 binding sites in the respective promoters. The motif CGGNNNNNNNNNNNCCG in the MIG1 regulon is identical to the Gal4 binding site consensus, and this family contains the genes Gal3, Gal10, Gal1 and Gal4, which are known to contain this binding site. The motif GCAYGTG in the RAP1 regulon matches a prefix of the ICRE consensus SCAYRTGAARW (38,39). Among the genes of this regulon are Fas1, known to contain the ICRE motif (40), and Opi3 and Itr1, both known to have the similar motif CATGTGAA, which is shared by promoters of phospholipid synthetic enzymes such as these two (25). Thus, for five of the motifs in Table 2, there is strong evidence that they correspond to known binding sites of other transcription factors, leaving eight motifs about which we do not have any clear evidence. Even if we regard all these eight motifs as spurious, the resulting false positive rate would be small, considering that a total of 39 motifs in Table 1 meet the criteria of having pmax < 0.1 and not being a poly(A), poly(T) or TATA motif. Some occur at approximately conserved positions relative to the translation start site, strengthening the possibility that they might be targets of other transcription factors.

Table 2.
Motifs in SCPD regulons that are different from the known principal binding sites

Results on MIPS catalogs

The MIPS database at the Munich Information Center for Protein Sequences (25) catalogs yeast genes classified according to different criteria. One such catalog is based on gene function, while another classifies genes based on phenotypes with which mutant versions of the genes have been implicated. These catalogs will be referred to as the functional and the phenotype catalogs, respectively. Each catalog has a hierarchical organization, the different levels of the hierarchy corresponding to different degrees of specificity of the classification criterion. Our goal was to discover motifs in the classes from these catalogs, since many genes with common mutant phenotypes or common function may have the same regulatory mechanism and hence may share binding sites. We extracted from each catalog the classes that were at or near the bottom of the hierarchy and had five or more genes. YMF was run on the 800 bp long promoter regions of genes in each class, with the same set of parameter values as in the experiments on SCPD regulons. In some cases, a class of genes contains one or more pairs of divergent genes, whose promoter regions overlap. For such pairs, the single promoter region between the two genes replaced two separate promoters. The top 1000 motifs from each of the three runs of YMF were input to the program FindExplanators (see Materials and Methods), which reported the three best independent motifs in its input list of 1000 motifs. For classes with over 100 genes, only the single best motif of the 1000 was reported, for computational efficiency. For each motif obtained from the previous step, the w-score (see Materials and Methods) was computed to measure how well distributed the motif is, and motifs with poor w-scores (at 95% level of significance) were rejected. All motifs with pmax > 0.1, as well as those that are poly(A), poly(T) or TATA repeats, were rejected. 
Matches of each of the remaining motifs to binding sites of known transcription factors in yeast (as cataloged in the database TRANSFAC) are reported. Also, for each remaining motif, the co-expression score was computed (see Materials and Methods) for the set of genes in the class that contain the motif in their promoters.

Functional catalog. There were 204 classes extracted from the functional catalog and the motif-finding steps reported a total of 465 motifs. Tables 3 and 4 present a selection of the results. This selection was done by manual inspection of the 465 reported motifs, using the following more stringent criteria. Motifs with pmax > 0.05 were eliminated. If a single functional class had motifs of different lengths that were variants of each other, only that with the least pmax value was retained. Motifs that had more than two matches to known binding sites of the same transcription factor are presented in Table 3, while the others are in Table 4, and are good candidates as novel transcription factor binding sites.

Table 3.
Significant motifs in classes from the MIPS functional catalog
Table 4.
Significant novel motifs in classes from the MIPS functional catalog

Most of the motifs in Table 3 match the known binding site consensus of some transcription factor, in which case the name of the factor is reported along with the consensus. Some of the motifs in this table do not match a known consensus, but do match two or more binding sites of a single transcription factor. For such motifs, we report the name of the factor, along with the number of matching binding sites. In either case, it would be interesting to pursue, for each of the motifs in the table, whether the transcription factor whose binding sites it matches has some regulatory role for the genes in that functional class. For instance, the motif CACGTGSG, which matches the Pho4 consensus, is found to be significant in the functional class ‘phosphate metabolism’ (Table 3), and it may be verified from the literature that the Pho4 transcription factor indeed regulates many of the genes in this class that have the motif in their promoters. Many other similar connections can be found in the comments column of Table 3.

We will now discuss some of the most interesting observations from Table 4, showing that some of these motifs are excellent candidates as novel transcription factor binding sites. The 7mer CGATGAG is highly overrepresented in the promoters of the functional class ‘rRNA transcription’. This motif was also discovered by Hughes et al. (11), who recognized it as the PAC box (41), an element for which neither function nor binding factor has been identified. It occurs a total of 50 times in 45 of the 109 promoters in the class, with a z-score of 17.00 and pmax < 0.01. It is a very well distributed motif, its 50 occurrences being spread over 45 promoters. Moreover, these 45 promoters belong to genes that are highly co-expressed. Their co-expression score is 0.04, which means that the average pairwise correlation coefficient of their expression data has a p-value of 0.04. Another property that makes this motif a very compelling candidate for a binding site is its extremely high conservation in position in the promoter sequences. Figure 1 illustrates this point. It shows a plot of the occurrences of the motif in the 45 promoters, the 3′ end being on the right. We also plotted this motif in promoter regions of orthologs of the 45 yeast genes in other yeast species (Fig. 2). The orthologous genes considered here belong to other yeast strains and the orthology information was obtained from Paul Cliften (personal communication). We see that the motif occurs frequently and is conserved in position in these orthologous promoters also, even though the orthologous genes were identified based on their protein sequences. Moreover, a very similar motif GATGAGS is found to be significant in the related functional class ‘tRNA transcription’. This motif occurs 45 times in 34 promoters of the class, with a z-score of 9.11 (pmax < 0.01).

Figure 1
Occurrences of motif CGATGAG in 45 promoters of the MIPS functional class ‘rRNA transcription’. Each horizontal line represents a promoter, the right end being at the translation start site. Vertical bars represent motif occurrences.
Figure 2
Occurrences of motif CGATGAG in orthologous promoters of genes in the MIPS functional class ‘rRNA transcription’. The orthology information was obtained from Paul Cliften (personal communication). The sequences are plotted with their ...

Another motif worth special mention is the 8mer CGGAGWWA, which occurs in the functional class ‘C-compound and carbohydrate transporters’ that has 46 genes. It occurs a total of 28 times in 16 different promoters of this class, whose corresponding genes have a co-expression score of 0.05. This motif is significantly well conserved in its location (Fig. 3), although not as strongly as the previous motif. Included in the 16 promoters of the class that contain the motif are nine of the glucose transporting HXT genes (Hxt2, Hxt3, Hxt5, Hxt8, Hxt11, Hxt13, Hxt15, Hxt16 and Hxt17). The regulation of these genes has been the subject of detailed biological studies. One study by Theodoris and Bisson (42) shows that ‘DNA sequence dependent suppressing elements’ (DDSEs) located in the promoters of HXT genes affect glucose sensing, and the authors further hypothesize that the DDSE region contains binding sites for the Rgt1p transcriptional repressor/activator. Rgt1p is believed to bind to promoters of Hxt2, Hxt3 and Hxt4. However, the Rgt1p binding site they propose for the HXT genes is TTTCAC GGAAAATTATATTTTG, which does not match our motif CGGAGWWA. A review by Ozcan and Johnston (43) describes another mechanism that represses transcription of some HXT genes in high glucose conditions through Mig1, which is a transcription factor (repressor) known to bind to the promoters of the Hxt2 and Hxt4 genes. Again, we verified that the known binding sites of Mig1 (consensus CCCCRNN WWWWW) do not match the motif CGGAGWWA. Only limited information is available on the expression of Hxt5 and Hxt8 to Hxt17. In fact, it is not even certain whether these are involved in glucose transport; they could act as transporters for other sugars. Hxt11 is bound by the transcription factor PDR3, although the PDRE (the binding site for PDR3, with consensus TCCGYGGA) does not match our motif CGGAGWWA. 
The promoter of Hxt13 was obtained in a screen for targets of the transcription factor Hap2, whose binding sites are quite different from CGGAGWWA. Gal2, a galactose permease that is >60% similar to the HXT proteins, is also one of the 16 genes whose promoters contain the motif under investigation. However, it is well established that Gal2 is regulated by the Gal4p transcription factor, which binds to the element CGGNNNNNNNNNNNCCG. In summary, while much is known about the transcriptional regulation of the glucose transporting genes, none of the known mechanisms seem to explain the presence of such a strong shared motif, which is therefore worth investigating further.
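Counts such as ‘28 times in 16 different promoters’ can be reproduced for a degenerate motif by expanding its IUPAC symbols (for example W = A or T) into a pattern and scanning each promoter. The sketch below is illustrative only: the function names are hypothetical, the promoter sequences in the usage example are made up, and it scans only the given strand, which may differ from how YMF counts occurrences.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_regex(motif):
    """Compile a degenerate motif into a regex that finds overlapping hits.

    The zero-width lookahead lets a new match start at every position,
    so overlapping occurrences are all reported.
    """
    body = "".join(IUPAC[symbol] for symbol in motif.upper())
    return re.compile("(?=(" + body + "))")


def count_occurrences(motif, promoters):
    """Return (total occurrences, number of promoters with >= 1 occurrence).

    `promoters` maps a gene name to its promoter sequence. Only the
    given strand is scanned (an assumption of this sketch).
    """
    pattern = motif_regex(motif)
    total = 0
    with_motif = 0
    for sequence in promoters.values():
        n = sum(1 for _ in pattern.finditer(sequence.upper()))
        total += n
        if n > 0:
            with_motif += 1
    return total, with_motif


# Made-up sequences, for illustration only.
toy_promoters = {
    "geneA": "TTCGGAGTTACC",
    "geneB": "ACGT",
    "geneC": "CGGAGAAACGGAGATA",
}
total, promoters_hit = count_occurrences("CGGAGWWA", toy_promoters)
```

The same matcher can be used to check whether a known consensus, written out as a concrete site, contains an instance of a newly reported motif.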

Figure 3
Occurrences of motif CGGAGWWA in 16 promoters of the MIPS functional class ‘C-compound and carbohydrate transporters’.

The motif AAWTTTTY occurs 170 times in 52 of the 64 promoters of the class ‘translation’, with a z-score of 9.95 (pmax < 0.01). These 52 genes are highly co-expressed, as indicated by a co-expression score of 0.01. Another compelling feature of the motif is its high conservation in location in the promoters, as revealed by Figure 4. The class ‘amino acid transport’ has a significant motif GCCGTRCS, which occurs 13 times in nine promoters of the class, with a very high z-score of 17.70 (pmax < 0.01). Among the nine promoters that have this motif are TAT1, DIP5, GAP1 and GNP1. SPS-initiated signals are known to modulate the expression of these four genes (44), and it would be interesting to find out if the discovered motif is related to this known regulation.

Figure 4
Occurrences of motif AAWTTTTY in 52 promoters of the MIPS functional class ‘translation’.

Phenotype catalog. The phenotype catalog from MIPS yielded 138 classes. All motifs reported by the motif-finding steps described above were examined. There were a total of 265 such motifs. Tables 5 and 6 present a selection from these motifs. Once again, we find among these motifs both known binding sites (Table 5) and novel motifs (Table 6) that may be good candidates for experimental verification. The motifs in Table 5 match the binding sites of the transcription factors Reb1, Mcb, repressor of Car1, Cbf1, Rap1, Mata1 and Pho4. It would be interesting to find out if these transcription factors are known to regulate some of the genes in the respective phenotype classes.

Table 5.
Significant motifs in classes from the MIPS phenotype catalog
Table 6.
Significant novel motifs in classes from the MIPS mutant phenotypes catalog

We now discuss some of the most interesting motifs reported in Table 6. The motif GTYGCCG occurs a total of nine times (z-score 8.30, pmax 0.01) in seven of the 14 promoters of the class ‘sensitivity to immunosuppressants’. The seven promoters belong to highly co-expressed genes, as indicated by the low co-expression score of 0.015. The motif does not match any known transcription factor binding site.

The motif CTSCCCSG, found in the class ‘mating efficiency’, deserves special mention. It occurs eight times in seven different promoters of the class, with a z-score of 9.56 (pmax 0.01). An interesting feature of this motif is that its instances occur, with up to one mismatch, overlapping Gal4p binding sites in five of the six genes regulated by this transcription factor. Though it does not match the Gal4p consensus CGGNNNNNNNNNNNCCG, this coincidence seems worth investigating.

Another interesting motif is CCGCACRC, found in four of the 21 promoters of the class ‘killer toxin resistance’. It occurs a total of five times in these promoters, with a z-score of 10.30 (pmax 0.04). The occurrences of the motif are conserved in position in the four promoters, as Figure 5 reveals. Moreover, the four corresponding genes are highly co-expressed, having a co-expression score of 0.025.

Figure 5
Occurrences of motif CCGCACRC in four promoters of the MIPS phenotype class ‘killer toxin resistance’.

Unclassified proteins. The MIPS database also has a table of 230 ORFs with strong sequence similarity to known proteins. YMF was run on sets of genes that have similarity to the same protein or family of proteins, and in some cases significant motifs were reported. For instance, there are seven ORFs with a strong similarity to members of the SRP1/TIP1 family. YMF reported two strong motifs, AGGCAY (pmax < 0.01) and TCGTWYA (pmax < 0.01), in this set. Both of these motifs were also found to be significant when YMF was run on the eight members of the SRP1/TIP1 family, providing further evidence for a relation between the seven unclassified ORFs and the SRP1/TIP1 family.

Further research

In Tables 3 and 5, it would be interesting to pursue connections (if not already described in these tables) between the known transcription factor listed in the last column and the gene family in which it was found, to determine whether the transcription factor plays a role in the regulation of genes in this family.

Even more interesting would be to pursue the novel motifs listed in Tables 2, 4 and 6 to see whether they lead to transcription factors of heretofore unknown function.

Finally, the motif model described in Materials and Methods was developed from a study of S.cerevisiae transcription factor binding sites. It would be very interesting to understand whether this model is also suitable for the discovery of transcription factor binding sites in other organisms and, if not, how it should be modified. It is relatively easy to extend our algorithm to handle motif models where a motif corresponds to a fixed set of strings over the alphabet {A,C,G,T,N}. The mismatch model, where the motif is a string over this alphabet, and its occurrences may have some fixed number of mismatches from the consensus, belongs to this category of models.
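One way to read the mismatch model as ‘a fixed set of strings over {A,C,G,T,N}’ is the following: a site lies within k mismatches of a consensus exactly when it matches at least one pattern obtained by replacing k consensus positions with N. The sketch below illustrates this equivalence; the function names are hypothetical and this is an interpretation of the text, not code from YMF.

```python
from itertools import combinations

BASES = "ACGT"


def mismatches(consensus, site):
    """Number of positions where `site` differs from `consensus`.

    An N in the consensus matches any base and never counts as a mismatch.
    """
    if len(consensus) != len(site):
        raise ValueError("consensus and site must have the same length")
    return sum(1 for c, s in zip(consensus, site) if c != "N" and c != s)


def is_occurrence(consensus, site, max_mismatches):
    """Mismatch model: site is an occurrence if within max_mismatches."""
    return mismatches(consensus, site) <= max_mismatches


def mismatch_patterns(consensus, max_mismatches):
    """Set of strings over {A,C,G,T,N} that together encode the model.

    Each pattern is the consensus with exactly k non-N positions replaced
    by N, where k is max_mismatches (capped at the number of such positions).
    """
    positions = [i for i, c in enumerate(consensus) if c != "N"]
    k = min(max_mismatches, len(positions))
    patterns = set()
    for chosen in combinations(positions, k):
        symbols = list(consensus)
        for i in chosen:
            symbols[i] = "N"
        patterns.add("".join(symbols))
    return patterns


def matches_pattern(pattern, site):
    """Exact match of a site against one pattern, N being a wildcard."""
    return mismatches(pattern, site) == 0
```

For example, `mismatch_patterns("ACG", 1)` yields the three patterns NCG, ANG and ACN, and a site is an occurrence under the one-mismatch model exactly when it matches one of them.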

ACKNOWLEDGEMENTS

We thank Linda Breeden and Chris Roberts for sharing their insights on transcription factor binding sites, Mathieu Blanchette and Rimli Sengupta for numerous thorough and helpful discussions, and Jacques van Helden for helpful comments. This material is based upon work supported in part by the National Science Foundation and DARPA under grant DBI-9601046, in part by the National Science Foundation under grant DBI-9974498, and in part by a Microsoft Fellowship.

REFERENCES

1. Bailey T.L. and Elkan,C. (1995) Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learn., 21, 51–80.
2. Fraenkel Y.M., Mandel,Y., Friedberg,D. and Margalit,H. (1995) Identification of common motifs in unaligned DNA sequences: application to Escherichia coli Lrp regulon. Comput. Appl. Biosci., 11, 379–387. [PubMed]
3. Galas D.J., Eggert,M. and Waterman,M.S. (1985) Rigorous pattern-recognition methods for DNA sequences: analysis of promoter sequences from Escherichia coli. J. Mol. Biol., 186, 117–128. [PubMed]
4. Hertz G.Z. and Stormo,G.D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 563–577. [PubMed]
5. Lawrence C.E., Altschul,S.F., Boguski,M.S., Liu,J.S., Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214. [PubMed]
6. Lawrence C.E. and Reilly,A.A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins Struct. Funct. Genet., 7, 41–51. [PubMed]
7. Rigoutsos I. and Floratos,A. (1998) Motif discovery without alignment or enumeration. In RECOMB98: Proceedings of the Second Annual International Conference on Computational Molecular Biology. ACM Press, New York, NY, pp. 221–227.
8. Rocke E. and Tompa,M. (1998) An algorithm for finding novel gapped motifs in DNA sequences. In RECOMB98: Proceedings of the Second Annual International Conference on Computational Molecular Biology. ACM Press, New York, NY, pp. 228–233.
9. Staden R. (1989) Methods for discovering novel motifs in nucleic acid sequences. Comput. Appl. Biosci., 5, 293–298. [PubMed]
10. Chu S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O. and Herskowitz,I. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699–705. [PubMed]
11. Hughes J.D., Estep,P.W., Tavazoie,S. and Church,G.M. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205–1214. [PubMed]
12. Roth F.P., Hughes,J.D., Estep,P.W. and Church,G.M. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16, 939–945. [PubMed]
13. Spellman P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K., Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297. [PMC free article] [PubMed]
14. Tavazoie S., Hughes,J.D., Campbell,M.J., Cho,R.J. and Church,G.M. (1999) Systematic determination of genetic network architecture. Nature Genet., 22, 281–285. [PubMed]
15. Wingender E., Dietze,P., Karas,H. and Knüppel,R. (1996) TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res., 24, 238–241. [PMC free article] [PubMed]
16. Zhu J. and Zhang,M.Q. (1999) SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics, 15, 607–611. [PubMed]
17. van Helden J., André,B. and Collado-Vides,J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol., 281, 827–842. [PubMed]
18. van Helden J., Olmo,M. and Pérez-Ortin,J. (2000) Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res., 28, 1000–1010. [PMC free article] [PubMed]
19. van Helden J., Rios,A. and Collado-Vides,J. (2000) Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res., 28, 1808–1818. [PMC free article] [PubMed]
20. Tompa M. (1999) An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 262–271. [PubMed]
21. Brāzma A., Jonassen,I., Vilo,J. and Ukkonen,E. (1998) Predicting gene regulatory elements in silico on a genomic scale. Genome Res., 8, 1202–1215. [PMC free article] [PubMed]
22. Sagot M. (1998) Spelling approximate repeated or common motifs using a suffix tree. In Latin ’98: Theoretical Informatics, Springer-Verlag Lecture Notes in Computer Science no. 1380. Springer-Verlag, Heidelberg, Germany, pp. 111–127.
23. Pavesi G., Mauri,G. and Pesole,G. (2001) An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17, S207–S214. [PubMed]
24. Sinha S. and Tompa,M. (2000) A statistical method for finding transcription factor binding sites. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 344–354.
25. Mewes H.W., Albermann,K., Heumann,K., Liebl,S. and Pfeiffer,F. (1997) MIPS: a database for protein sequences, homology data and yeast genome information. Nucleic Acids Res., 25, 28–30. [PMC free article] [PubMed]
26. Jones E.W., Pringle,J.R. and Broach,J.R. (eds) (1992) The Molecular and Cellular Biology of the Yeast Saccharomyces: Gene Expression. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
27. Blaiseau P.-L., Isnard,A.-D., Surdin-Kerjan,Y. and Thomas,D. (1997) Met31p and Met32p, two related zinc finger proteins, are involved in transcriptional regulation of yeast sulfur amino acid metabolism. Mol. Cell. Biol., 17, 3640–3648. [PMC free article] [PubMed]
28. Mai B. and Breeden,L. (1997) Xbp1, a stress-induced transcriptional repressor of the Saccharomyces cerevisiae Swi4/Mbp1 family. Mol. Cell. Biol., 17, 6491–6501. [PMC free article] [PubMed]
29. McInerny C.J., Partridge,J.F., Mikesell,G.E., Creemer,D.P. and Breeden,L.L. (1997) A novel Mcm1-dependent element in the SWI4, CLN3, CDC6 and CDC47 promoters activates M/G1-specific transcription. Genes Dev., 11, 1277–1288. [PubMed]
30. Nurrish S.J. and Treisman,R. (1995) DNA binding specificity determinants in MADS-box transcription factors. Mol. Cell. Biol., 15, 4076–4085. [PMC free article] [PubMed]
31. Oshima Y., Nobuo,O. and Harashima,S. (1996) Regulation of phosphatase synthesis in Saccharomyces cerevisiae – a review. Gene, 179, 171–177. [PubMed]
32. Roulet E., Bucher,P., Schneider,R., Wingender,E., Dusserre,Y., Werner,T. and Mermod,N. (2000) Experimental analysis and computer prediction of CTF/NFI transcription factor DNA binding sites. J. Mol. Biol., 297, 833–848. [PubMed]
33. Wemmie J.A., Szczypka,M.S., Thiele,D.J. and Moye-Rowley,W.S. (1994) Cadmium tolerance mediated by the yeast AP-1 protein requires the presence of an ATP-binding cassette transporter-encoding gene, YCS1. J. Biol. Chem., 269, 32592–32597. [PubMed]
34. Leung M.-Y., Marsh,G.M. and Speed,T.P. (1996) Over- and underrepresentation of short DNA words in herpesvirus genomes. J. Comput. Biol., 3, 345–360. [PMC free article] [PubMed]
35. Nicodème P., Salvy,B. and Flajolet,P. (1999) Motif statistics. In 7th Annual European Symposium on Algorithms—ESA99. Springer-Verlag Lecture Notes in Computer Science no. 1643. Springer-Verlag, Heidelberg, Germany, pp. 194–211.
36. Blanchette M. and Sinha,S. (2001) Separating real motifs from their artifacts. Bioinformatics, 17, S30–S38. [PubMed]
37. Aach J., Rindone,W. and Church,G.M. (2000) Systematic management and analysis of yeast gene expression data. Genome Res., 10, 431–445. [PubMed]
38. Bachhawat N., Ouyang,Q. and Henry,S.A. (1995) Functional characterization of an inositol-sensitive upstream activation sequence in yeast. A cis-regulatory element responsible for inositol-choline mediated regulation of phospholipid biosynthesis. J. Biol. Chem., 270, 25087–25095. [PubMed]
39. Schüller H.-J., Richter,K., Hoffman,B., Ebbert,R. and Schweizer,E. (1995) DNA binding site of the yeast heteromeric Ino2p/Ino4p basic helix-loop-helix transcription factor; structural requirements as defined by saturation mutagenesis. FEBS Lett., 370, 149–152. [PubMed]
40. Wagner C., Blank,M., Strohmann,B. and Schüller,H.-J. (1999) Overproduction of the Opi1 repressor inhibits transcriptional activation of structural genes required for phospholipid biosynthesis in the yeast Saccharomyces cerevisiae. Yeast, 15, 843–854. [PubMed]
41. Dequard-Chablat M., Riva,M., Carles,C. and Sentenac,A. (1991) RPC19, the gene for a subunit common to yeast RNA polymerases A (I) and C (III). J. Biol. Chem., 266, 15300–15307. [PubMed]
42. Theodoris G. and Bisson,L.F. (2001) DDSE: downstream targets of the SNF3 signal transduction pathway. FEMS Microbiol. Lett., 197, 73–77. [PubMed]
43. Ozcan S. and Johnston,M. (1999) Function and regulation of yeast hexose transporters. Microbiol. Mol. Biol. Rev., 63, 554–569. [PMC free article] [PubMed]
44. Forsberg H., Gilstring,C.F., Zargari,A., Martinez,P. and Ljungdahl,P.O. (2001) The role of the yeast plasma membrane SPS nutrient sensor in the metabolic response to extracellular amino acids. Mol. Microbiol., 42, 215–228. [PubMed]
45. Lowndes N.F., Johnson,A.L., Breeden,L. and Johnston,L.H. (1992) SWI6 protein is required for transcription of the periodically expressed DNA synthesis genes in budding yeast. Nature, 357, 505–508. [PubMed]
46. Kean L.S., Grant,A.M., Angeletti,C., Mahe,Y., Kuchler,K., Fuller,R.S. and Nichols,J.W. (1997) Plasma membrane translocation of fluorescent-labeled phosphatidylethanolamine is controlled by transcription regulators, PDR1 and PDR3. J. Cell Biol., 138, 255–270. [PMC free article] [PubMed]
47. Cheng C., Kacherovsky,N., Dombek,K.M., Camier,S., Thukral,S.K., Rhim,E. and Young,E.T. (1994) Identification of potential target genes for Adr1p through characterization of essential nucleotides in UAS1. Mol. Cell. Biol., 14, 3842–3852. [PMC free article] [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press