Nucleic Acids Res. Jul 1, 2003; 31(13): 3580–3585.
PMCID: PMC169014

Gibbs Recursive Sampler: finding transcription factor binding sites

Abstract

The Gibbs Motif Sampler is a software package for locating common elements in collections of biopolymer sequences. In this paper we describe a new variation of the Gibbs Motif Sampler, the Gibbs Recursive Sampler, which has been developed specifically for locating multiple transcription factor binding sites for multiple transcription factors simultaneously in unaligned DNA sequences that may be heterogeneous in DNA composition. Here we describe the basic operation of the web-based version of this sampler. The sampler may be accessed at http://bayesweb.wadsworth.org/gibbs/gibbs.html and at http://www.bioinfo.rpi.edu/applications/bayesian/gibbs/gibbs.html. An online user guide is available at http://bayesweb.wadsworth.org/gibbs/bernoulli.html and at http://www.bioinfo.rpi.edu/applications/bayesian/gibbs/manual/bernoulli.html. Solaris, Solaris.x86 and Linux versions of the sampler are available as stand-alone programs for academic and not-for-profit users. Commercial licenses are also available. The Gibbs Recursive Sampler is distributed in accordance with the ISCB level 0 guidelines and a requirement for citation of use in scientific publications.

INTRODUCTION

Transcription regulation is arguably the most important foundation of cellular function, since it exerts the most fundamental control over the abundance of virtually all of a cell's functional macromolecules. A predominant feature of transcription regulation is the binding of regulatory proteins, transcription factors (TFs), to cognate DNA binding sites known as transcription factor binding sites (TFBS) in the genome. The computational identification of TFBS through the analysis of DNA sequence data has emerged in the last decade as a major new technology for the elucidation of transcription regulatory networks. The Gibbs Motif Sampler is a software package used to locate common elements in collections of biopolymer sequences. It has been applied to the analysis of protein sequences (1,2). Gibbs sampling has also been used extensively in the identification of TFBS (3,4) and an earlier version of this software has been available at this web location for some time. In this paper we describe a new variation, the Gibbs Recursive Sampler, designed to search for multiple TFBS simultaneously. It includes several features that are designed specifically for locating TFBS in unaligned DNA sequences. These features are based on characteristics of TF/DNA complexes or their components.

THE GIBBS RECURSIVE SAMPLER

Gibbs sampling is a Markov Chain Monte Carlo procedure that has seen wide application in the statistical community. It was first applied in bioinformatics as a tool for multiple sequence alignment in 1993 (1). Gibbs sampling techniques have subsequently seen numerous enhancements and applications (2,5). A key feature of sequence-based Gibbs sampling algorithms and related expectation maximization algorithms (6,7) is the use of motif models in the form of product multinomial models to capture sequence patterns common to the binding sites of each TF.

The recursive sampler, described here, was specifically developed for the identification of TFBS in unaligned DNA sequences. It includes several features that are unique to this software: a rigorous Bayesian method for inferring the number and the locations of the TFBS for multiple TF motifs simultaneously; a background model of the heterogeneity in the composition of non-coding nucleotide sequence; and the ability to use prior information on binding motifs. In addition, it includes features to allow the use of palindromic, direct repeat and concentrated alphabet models, preferred binding site locations, and a rigorous test of the statistical significance of the results, the Wilcoxon signed-rank test.

In the following, we briefly describe how the algorithm incorporates these features and we provide instructions on the use of the algorithm and on the interpretation of its results.

RECURSIVE DISCOVERY OF SITES AND SITE COUNTS

Multiple TFs often bind in a combinatorial fashion to regulate transcription. These collections of factors may contain multiple TFBS for any or all of the factors. Furthermore, some of the DNA sequences in a data set may contain no sites at all, or no sites for one or more of the factors involved in regulating the rest. While the total number of sites in a given input sequence often spans a specifiable and relatively short range, the exact number of sites and the number of sites corresponding to each motif in any input sequence are unknown and often vary among the input sequences. To address these unknowns, the sampler uses recursive sums over all possible alignments of 0 ≤ k ≤ Kmax sites in a sequence, to obtain Bayesian inferences on the number of sites for each motif and the total number of sites in each sequence. The recursion examines the placements of sites for k TFs, as represented by p different motifs, in each sequence. The algorithm infers for each sequence the total number of sites, the number of each of the p motifs and the alignments and orderings of these sites in the sequence. In its back sampling step it simultaneously samples sites in each sequence according to these inferences. As in previous Gibbs sampling algorithms, the widths of sites are inferred using a fragmentation algorithm (5). The sampling process iterates over the sequences one at a time, using currently sampled values in all other sequences, to guide the sampling process toward a converged result.
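The idea of a recursive sum over all placements of 0 ≤ k ≤ Kmax sites can be illustrated with a deliberately simplified sketch. It assumes a single motif of fixed width, non-overlapping sites, a uniform prior over k, and a precomputed list of motif-to-background likelihood ratios for each start position. The function name and these simplifications are illustrative assumptions; the sampler itself handles p motifs, site orderings and fragmentation, which are not reproduced here.

```python
def site_count_posterior(ratios, width, k_max):
    """Posterior over the number of sites k in one sequence (simplified).

    ratios[i] is the likelihood ratio (motif vs. background) for a site
    starting at position i; the sequence length is len(ratios) + width - 1.
    Assumes one motif, non-overlapping sites and a uniform prior over k.
    """
    length = len(ratios) + width - 1
    # forward[k][j]: summed weight of every way to place exactly k
    # non-overlapping sites within the first j positions of the sequence.
    forward = [[0.0] * (length + 1) for _ in range(k_max + 1)]
    forward[0] = [1.0] * (length + 1)  # exactly one way to place no sites
    for k in range(1, k_max + 1):
        for j in range(1, length + 1):
            # Either position j-1 is not the last base of a site ...
            forward[k][j] = forward[k][j - 1]
            # ... or a site ends there, having started at j - width.
            if j >= width:
                start = j - width
                forward[k][j] += ratios[start] * forward[k - 1][start]
    weights = [forward[k][length] for k in range(k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    print(site_count_posterior([2.0, 1.0, 3.0], width=1, k_max=2))
```

The recursion costs on the order of Kmax times the sequence length, rather than enumerating every alignment explicitly, which is what makes summing over all site counts practical.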

After all of the sequences have been examined and a set of multiple motif positions has been determined, the log of the posterior alignment probability is calculated (5). The maximum value of this probability, the MAP (maximum a posteriori probability), provides the optimal solution. The MAP value is measured relative to an empty or ‘null’ alignment, by taking the difference between the log of the probability of the alignment and the log of the probability of an empty alignment. A value greater than zero indicates that the alignment is more likely than unaligned background. The process continues until a maximal number of iterations has been executed or until a certain number of iterations, the plateau period, has occurred without an increase in the estimated MAP. As a default we report the MAP solution. Alternatively, we provide a Bayesian inference of site frequencies based on continued sampling after convergence. These Bayesian inferences take the form of estimated probabilities for each predicted site. Both types of solution adjust inferences for the lengths of the input sequences and of sites and motifs in a Bayesian manner analogous to Webb (8).
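The stopping rule and the relative MAP score described in this paragraph can be sketched as follows. The function names are hypothetical, and feeding the loop a precomputed stream of log MAP values stands in for the actual sampling iterations, which are not shown.

```python
def relative_log_map(log_p_alignment, log_p_null):
    """Log MAP measured relative to the empty ('null') alignment.

    A value above zero means the alignment is more likely than
    unaligned background.
    """
    return log_p_alignment - log_p_null


def run_until_plateau(log_map_values, plateau_period, max_iterations):
    """Track the best log MAP seen; stop after a plateau or an iteration cap.

    log_map_values: iterable yielding one relative log MAP per iteration.
    Returns (best_value, iteration_of_best).
    """
    best = float("-inf")
    best_iteration = None
    since_improvement = 0
    for iteration, value in enumerate(log_map_values):
        if iteration >= max_iterations:
            break
        if value > best:
            best, best_iteration = value, iteration
            since_improvement = 0
        else:
            since_improvement += 1
        if since_improvement >= plateau_period:
            break  # plateau period elapsed without an increase
    return best, best_iteration


if __name__ == "__main__":
    stream = [1.0, 3.0, 2.5, 2.9, 2.0, 10.0]
    print(run_until_plateau(stream, plateau_period=3, max_iterations=100))
```

In this toy stream the loop stops before reaching the final, larger value, which illustrates why the choice of plateau period trades run time against the chance of missing a better solution.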

HETEROGENEOUS BACKGROUND

As in previous Gibbs sampling models, the probability that a particular position in the sequence is sampled as a site is calculated as the ratio of the probability of the site under motif models to the probability under a background model that describes the sequence in the absence of TFBS. In previous implementations, background models assumed homogeneity in the composition of each sequence. However, variations in sequence composition in non-coding DNA are often complex. The most common approach to this complexity has been to model the background sequence using Markov chains (9,10).

Non-coding sequence is also often heterogeneous in composition, particularly in eukaryotes (Fig. 1). This variation in local base composition can adversely affect sequence alignment (11,12). In addition, since TFBS are often A-T or G-C rich, masking algorithms are often not useful for reducing the effect of background variation (13). To address this heterogeneity, the recursive sampler uses the Bayesian segmentation algorithm (14) to produce a position-specific background model. A two-step process is employed. First, the individual input sequences are analyzed for heterogeneity in composition using the Bayesian segmentation algorithm. This algorithm returns the probabilities of observing each of the four bases at each position in a sequence, based only on that specific sequence's compositional heterogeneity and on the uncertainty in this heterogeneity. These probabilities are then used as position-specific background models in the above ratio.
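A minimal sketch of the ratio with a position-specific background is given below. It assumes the per-position base probabilities have already been produced (for example by a segmentation step, which is not implemented here) and that the motif is a simple product multinomial model. The function name and data layout are illustrative assumptions.

```python
def site_likelihood_ratio(sequence, start, motif, background):
    """Ratio of motif probability to position-specific background probability.

    motif[j][base]: probability of `base` at motif column j.
    background[i][base]: probability of `base` at sequence position i,
        assumed to come from a per-sequence composition model.
    """
    ratio = 1.0
    for j, column in enumerate(motif):
        position = start + j
        base = sequence[position]
        ratio *= column[base] / background[position][base]
    return ratio


if __name__ == "__main__":
    motif = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    ]
    background = [
        {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},  # locally C-rich region
    ]
    print(site_likelihood_ratio("AC", 0, motif, background))
```

Because the second position sits in a C-rich region, the C in the candidate site earns less credit there than it would under a homogeneous background, which is the intended effect of the position-specific model.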

Figure 1
The compositional variation of a 500 bp region upstream of the translation start site of the YDR226W/ADK1 gene from Saccharomyces cerevisiae. The probability of each base at each position is calculated by the Bayesian segmentation algorithm ...

PALINDROMES AND DIRECT REPEATS

Several TFs that bind as symmetric homodimers or homo-multimeric protein complexes have palindromic DNA binding motifs (6). Other regulatory TFs such as Escherichia coli PhoB and certain zinc finger proteins bind in directly repeating multimers and therefore have directly repeating binding patterns (15). Often, the spacing between the palindromic half-sites is unknown. The algorithm makes inferences in these cases, using a modified version of the fragmentation algorithm (5), by restricting fragmentation to the center positions. In a number of cases, AT/TA and GC/CG base pairs may be indistinguishable to TF binding (16). For example, bases in the minor groove of DNA are less accessible to the protein, which limits the ability to distinguish the bases. In addition, AT/TA pairs may allow the DNA greater flexibility to facilitate the DNA bending often associated with TF binding (6). To address all of these cases the recursive sampler includes motif models that use reduced alphabets.
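The constraints behind palindromic and reduced-alphabet models can be illustrated with a small sketch. It assumes a palindromic model can be represented by pooling each site with its reverse complement, and that a reduced alphabet merges A with T and C with G. Both representations, and the function names, are illustrative assumptions rather than the sampler's internal implementation.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Assumed two-letter reduced alphabet: W = A or T, S = C or G.
REDUCED = {"A": "W", "T": "W", "C": "S", "G": "S"}


def reverse_complement(site):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(site))


def palindromic_counts(sites):
    """Column counts that are symmetric under reverse complementation.

    Each site is counted together with its reverse complement, so the
    count of a base in column j equals the count of its complement in
    column width - 1 - j.
    """
    width = len(sites[0])
    counts = [dict.fromkeys("ACGT", 0) for _ in range(width)]
    for site in sites:
        for strand in (site, reverse_complement(site)):
            for j, base in enumerate(strand):
                counts[j][base] += 1
    return counts


def reduce_alphabet(site):
    """Collapse a site onto the assumed W/S alphabet."""
    return "".join(REDUCED[base] for base in site)


if __name__ == "__main__":
    print(reverse_complement("AACG"))
    print(palindromic_counts(["AACG"]))
    print(reduce_alphabet("AACG"))
```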

PREFERRED SITE LOCATIONS

While at least in principle a TFBS may be located over the entire range of the input DNA sequences, there are often preferred locations for binding. For example, in prokaryotes, while TFBS may occur anywhere in an intergenic sequence as well as in the coding region, they are much more common in and around the RNA polymerase binding site. In some cases, an empirical distribution of the locations of the sites within the sequences may be known. For example, Figure 2 shows the distribution of 182 experimentally reported intergenic TFBS relative to the start codons of the regulated genes in E.coli. The use of such a distribution will allow the algorithm to focus on the most probable locations for sites (4). By default, the algorithm uses a uniform distribution of site positions.

Figure 2
The histogram shows the distribution of the distances of 182 reported TFBS from start codons in E.coli regulatory sequences. The solid line is a mixture of Gaussian functions fitted to the histogram data and represents the default prokaryotic probability ...
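A position prior built from a mixture of Gaussian functions, as in Figure 2, might be sketched as follows. The component weights, means and standard deviations used here are made-up placeholders, not the fitted prokaryotic values, and the function names are hypothetical.

```python
import math


def uniform_position_prior(num_positions):
    """Default case: every site position is equally likely."""
    return [1.0 / num_positions] * num_positions


def gaussian_mixture_position_prior(num_positions, components):
    """Discrete prior over site positions from a Gaussian mixture.

    components: list of (weight, mean, standard_deviation) tuples.
    The density is evaluated at each integer position and renormalized
    so that the prior sums to one over the sequence.
    """
    def density(x):
        return sum(
            weight
            * math.exp(-0.5 * ((x - mean) / sd) ** 2)
            / (sd * math.sqrt(2.0 * math.pi))
            for weight, mean, sd in components
        )

    raw = [density(position) for position in range(num_positions)]
    total = sum(raw)
    return [value / total for value in raw]


if __name__ == "__main__":
    # Placeholder parameters chosen only for illustration.
    prior = gaussian_mixture_position_prior(11, [(1.0, 5.0, 2.0)])
    print([round(p, 3) for p in prior])
```

Multiplying such a prior into the site ratio at each position is one way an empirical location distribution could shift sampling toward the most probable regions.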

PRIOR INFORMATION ON CANDIDATE MOTIFS

Position weight matrices are now available for many TFs. When a sufficient number of sites from prior studies are available the corresponding multiple alignments can be converted to position weight matrices that can be used to scan sequences for additional sites (2,17). However, it is often the case that limited data are available for the determination of a position weight matrix. Fortunately, Bayesian statistics provides the means to use such limited prior information through the specification of prior motif models. These informed prior models provide clues to the expected patterns in DNA binding motifs that influence but do not control posterior inference of sites and motifs. The recursive sampler permits incorporation of informed motif priors and gives the user control over the strength of the clue. By default, uninformed prior motif models are assumed.
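One standard way to express an informed prior that influences but does not control the posterior is through Dirichlet pseudocounts, sketched below. The sampler's exact parameterization is not given in this text, so the formula, the function name and the `prior_strength` parameter are assumptions made for illustration.

```python
def posterior_motif_column(counts, prior_probs, prior_strength):
    """Posterior mean base probabilities for one motif column.

    counts: observed base counts among the currently aligned sites.
    prior_probs: expected base probabilities from prior knowledge.
    prior_strength: total pseudocount weight given to the prior; larger
        values make the prior harder for the data to overrule.
    """
    total = sum(counts.values()) + prior_strength
    return {
        base: (counts[base] + prior_strength * prior_probs[base]) / total
        for base in "ACGT"
    }


if __name__ == "__main__":
    counts = {"A": 3, "C": 0, "G": 1, "T": 0}
    informed = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
    print(posterior_motif_column(counts, informed, prior_strength=4.0))
```

With a uniform `prior_probs` the same function reduces to an uninformed prior, which corresponds to the default behavior described above.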

WILCOXON TEST OF MOTIF SIGNIFICANCE

The assessment of statistical significance of any multiple sequence alignment is difficult. A general, robust and non-parametric procedure for this assessment has been described by Liu and coworkers (5). Conceptually, this procedure formalizes comparisons with negative controls to assess statistical significance. Under the null hypothesis, sites will be no more likely to occur in study sequences than in negative controls. In this case, application of the Gibbs sampler or other site-finding algorithm to an extended set of sequences, the first half of which are study sequences and the second half of which are negative control sequences, will be equally likely to find a site in either half. The Wilcoxon signed-rank test, included in the recursive sampler, implements this concept in a manner that uses not only the differences in the number of sites predicted in study sequences versus the negative control sequences, but also the ranks of the scores of each predicted site. Two characteristics of this test contribute to its robustness: it is non-parametric and there are no restrictions on the characteristics of the negative controls that can be used with the procedure.
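The signed-rank idea can be illustrated with a self-contained sketch. It assumes that each study sequence is paired with one negative control sequence and that a single score per sequence is available; how the sampler forms these pairs and scores is not described in this section, so that pairing is an assumption. The one-sided p-value is computed exactly by counting sign assignments, which is practical for the small numbers of sequences typical of these data sets.

```python
def wilcoxon_signed_rank(study_scores, control_scores):
    """One-sided Wilcoxon signed-rank test for study > control.

    Returns (w_plus, p_value), where w_plus is the sum of the ranks of
    the positive differences and p_value is the exact probability of a
    value at least that large when every sign is equally likely.
    """
    diffs = [s - c for s, c in zip(study_scores, control_scores) if s != c]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    # Ranks are stored doubled so that averaged ranks of ties stay integers.
    doubled_ranks = [0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        for k in range(i, j + 1):
            doubled_ranks[order[k]] = (i + 1) + (j + 1)  # 2 * average rank
        i = j + 1
    observed = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)
    # Null distribution: each rank is independently positive or negative.
    ways_by_total = {0: 1}
    for rank in doubled_ranks:
        updated = {}
        for total, ways in ways_by_total.items():
            updated[total] = updated.get(total, 0) + ways
            updated[total + rank] = updated.get(total + rank, 0) + ways
        ways_by_total = updated
    extreme = sum(w for t, w in ways_by_total.items() if t >= observed)
    return observed / 2, extreme / 2 ** len(diffs)


if __name__ == "__main__":
    print(wilcoxon_signed_rank([5, 6, 7, 8], [1, 1, 1, 1]))
```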

PROGRAM INTERFACE

An effort was made to keep the web-based program interface straightforward. At a minimum, to use the recursive sampler the user must supply a set of sequences in FASTA format, the maximum number of sites per sequence and an initial width estimate for each motif. The sequence data may be pasted into the entry window or loaded from a file. Sets of prokaryotic or eukaryotic default values for all parameters can be chosen. Alternatively, the user can specify values for these parameters. Figure 3 illustrates the initial interface screen. Required entries are marked with asterisks. Each entry field has an associated hyperlink that gives details about the required data format.

Figure 3
The basic Gibbs Recursive Sampler data entry form, showing the minimum amount of data required by the program.
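For readers preparing input, the following sketch shows one way to check that sequence data are in FASTA format before submission. It is an independent helper written for illustration, not part of the sampler, and the function name is hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs.

    Header lines start with '>'; sequence lines that follow are joined
    and upper-cased. Blank lines and any text before the first header
    are ignored.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 upstream region\nACGT\nacg\n\n>seq2\nTTT\n"
    for name, sequence in parse_fasta(example):
        print(name, len(sequence))
```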

The sequence data is typically the non-coding sequences of co-regulated genes from a single species (18) or the non-coding sequences of orthologous genes from several species (4,19). McCue and colleagues (19) describe a number of factors influencing the choice of species for use in the identification of TFBS in prokaryotes.

ADVANCED OPTIONS

Clicking on the Show Advanced Options link brings up a new screen with more options (Fig. 4). While defaults have been set for all advanced features, specification of alternative values on this page allows the user greater control over how the sampler searches for sites, and also allows adjustment of the program runtime parameters. For example, in a particular prokaryotic application, the user may decide to replace the default choice of palindromic sites with a non-palindromic model. A number of other parameters influence the operation of the program. Several items have been mentioned above and further information on formats and acceptable values can be obtained by clicking on the associated hyperlinks.

Figure 4
The Gibbs Recursive Sampler advanced entry page, showing modifiable parameters.

A FREQUENCY-BASED SOLUTION

By default, the algorithm returns the single best solution that it has found as measured by the MAP value. However, no prediction method is perfect, and not all of the predictions of this or other algorithms are equally compelling. For examination of the credibility of each prediction, a Bayesian sampling solution is also available as an option. For this solution, after the MAP solution has been determined, the algorithm continues to sample, in order to explore variations in the models. The frequencies with which sites are sampled are recorded and reported to the user. Variations in models may alter estimates of site frequencies from the values obtained with the single MAP solution. Our experience shows that some sites are strong and are sampled ~100% of the time, while others are weaker and are sampled with a lower frequency. These sampling frequencies are an estimate of the probability that the cognate TF binds at each predicted site. By default, only sites that have a frequency of at least 50% are reported.
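The bookkeeping behind a frequency-based solution can be sketched as follows. It assumes that each post-convergence sample is available as a collection of (sequence index, start position) pairs. That representation and the function name are illustrative assumptions; only the 50% default threshold is taken from the text.

```python
from collections import Counter


def site_frequencies(samples, threshold=0.5):
    """Fraction of samples in which each site appears, filtered by threshold.

    samples: list of collections of (sequence_index, start) pairs, one
        collection per sampling iteration after convergence.
    Returns a dict mapping each site to its sampling frequency, keeping
    only sites whose frequency is at least `threshold`.
    """
    # A site is counted at most once per sample.
    counts = Counter(site for sample in samples for site in set(sample))
    n_samples = len(samples)
    return {
        site: count / n_samples
        for site, count in counts.items()
        if count / n_samples >= threshold
    }


if __name__ == "__main__":
    samples = [
        {(0, 10), (1, 5)},
        {(0, 10)},
        {(0, 10), (1, 7)},
        {(0, 10), (1, 5)},
    ]
    print(site_frequencies(samples))
```

In this toy example the site at (0, 10) is sampled every time, the site at (1, 5) half the time, and the site at (1, 7) falls below the threshold and is not reported.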

PROGRAM OUTPUT

The web-based version of Gibbs Recursive Sampler produces the same output as does the stand-alone version. Normally, the output is returned via email. The user also has the option of receiving the output online. Figure 5 illustrates typical program output. These results are for a set of 18 sequences extracted from E.coli regulatory regions. The sequences are known to contain binding sites for the protein CRP (6). The results were generated using the recursive sampler with the following parameters: a width of 16, a palindromic model, a maximum number of sites per sequence of two and heterogeneous background composition. Since we used a palindromic model, the search for sites in the reverse complement direction was turned off. The Wilcoxon signed-rank test was applied to test the statistical significance of the resulting alignment.

Figure 5
Sample output from Gibbs Recursive Sampler.

INTERPRETATION OF RESULTS

The first portion of the results in Figure 5 is a list of the options used for the current run. This is followed by a list of the FASTA headings for the input sequences. Next are the results from the MAP solution for each motif. The listing for each motif begins with a table having a column for each of the bases and a row for each of the positions in the motif. The numbers in the table indicate the estimated probability of occurrence of each base in each position within the motif, throughout the alignment. The last column gives the information contribution to the model of each position, with a maximum of two bits. Thus, this table gives the estimated binding pattern of the motif and a measure of the degree of conservation of each position in bits.
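The two-bit maximum follows from the four-letter DNA alphabet. A minimal sketch of a per-position information calculation is given below, assuming the information is measured against a uniform background of 0.25 per base. The sampler may instead measure it against its own background model, so this formula and the function name are assumptions for illustration.

```python
import math


def column_information(probs):
    """Information content of one motif position in bits.

    probs: dict of base probabilities for the position. Measured against
    an assumed uniform background, so the value ranges from 0 bits
    (all bases equally likely) to 2 bits (one base always present).
    """
    return 2.0 + sum(p * math.log2(p) for p in probs.values() if p > 0)


if __name__ == "__main__":
    print(column_information({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}))
    print(column_information({"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}))
    print(column_information({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))
```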

The next portion of the results is the alignment of the motif sites, along with the probability that each site belongs to the current motif model. Immediately below the listing of motif sites is a row containing asterisks. The asterisks indicate the conserved columns of the site. In the example in Figure 5, even though the initial width of the site was entered as 16, the program fragmented the sites to a total width of 22: 16 conserved columns and six non-conserved columns. Only the conserved columns are used in the alignments. The fragmentation algorithm dynamically learns which columns are conserved. The p-value of 0.000396 from the Wilcoxon signed-rank test indicates that the solution is highly significant. In addition, the predicted sites match well with the reported CRP binding sites for these sequences (6).

The MAP calculation includes terms for each component of the model: the motif model, the fragmentation, the background model and the alignment. The individual contributions of the two motif-specific terms, the motif model term and the fragmentation term are listed for each individual motif. The final full log MAP minus the log of the null MAP (MAP of the sequence with no sites) is returned at the bottom of the output, as are the background and null MAP terms. A full description of the output, along with other examples can be found at http://bayesweb.wadsworth.org/gibbs/content.html#results.

ADDITIONAL FEATURES

The Gibbs Motif Sampler web page also allows for the analysis of amino acid sequences. With the exception of nucleotide-specific features described above, all of the sampling options are applicable to proteins. Low complexity regions of amino acid sequences can be removed with an XNU process (20) using a BLOSUM62 matrix.

OTHER RESOURCES

A number of other web sites exist for the discovery of motifs in nucleotide sequences. Several of these are based on Gibbs sampling methods (9,21–23). Others use expectation maximization (7, http://www.cse.ucsc.edu/~kent/improbizer/index.html), neural nets (24) or suffix trees or similar enumerative algorithms (25–27) to search for motifs in unaligned sequence data. Workman and Stormo (24) provide a comparison of an earlier version of this software with neural network and expectation maximization algorithms.

FOR FURTHER INFORMATION

The Bayesian Bioinformatics web page at http://www.wadsworth.org/resnres/bioinfo/ provides a number of other web based applications for sequence analysis, a database of predicted E.coli TFBS, a tutorial on Bayesian bioinformatics and references.

ACKNOWLEDGEMENTS

We thank our colleagues at the Wadsworth Center for their assistance, especially C. Steven Carmack and Clarence Chan for assistance with the web programming, and Lee Ann McCue and Michael Palumbo for critical readings of the manuscript. This work was supported by NIH grant R01HG01257 and DOE grant DE-FG02-01ER63204.

REFERENCES

1. Lawrence C.E., Altschul,S., Boguski,M., Liu,J., Neuwald,A. and Wootton,J. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214. [PubMed]
2. Neuwald A., Liu,J. and Lawrence,C.E. (1995) Gibbs Motif Sampling: detection of bacterial outer membrane protein repeats. Protein Sci., 4, 1618–1632. [PMC free article] [PubMed]
3. Robison K., McGuire,A.M. and Church,G.M. (1998) A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J. Mol. Biol., 284, 241–254. [PubMed]
4. McCue L.A., Thompson,W., Carmack,C.S., Ryan,M., Liu,J., Derbyshire,V. and Lawrence,C.E. (2001) Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res., 29, 774–782. [PMC free article] [PubMed]
5. Liu J., Neuwald,A. and Lawrence,C.E. (1995) Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Amer. Stat. Assoc., 90, 1156–1170.
6. Lawrence C.E. and Reilly,A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7, 41–51. [PubMed]
7. Bailey T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Altman,R., Brutlag,D., Karp,P., Lathrop,R. and Searls,D. (eds), Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, pp. 28–36.
8. Webb B.J., Liu,J.S. and Lawrence,C.E. (2002) BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res., 30, 1268–1277. [PMC free article] [PubMed]
9. Liu X., Brutlag,DL. and Liu,J.S. (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Proc. Pac. Symp. Biocomp., 6, 127–138. [PubMed]
10. Thijs G., Lescot,M., Marchal,K., Rombauts,S., De Moor,B., Rouze,P. and Moreau,Y. (2001). A higher order background model improves the detection of regulatory elements by Gibbs Sampling. Bioinformatics, 17, 1113–1122. [PubMed]
11. Marchal K., Thijs,G., De Keersmaecker,S., Monsieurs,P., De Moor,B. and Vanderleyden,J. (2003) Genome-specific higher-order background models to improve motif detection, Trends Microbiol., 11, 61–66. [PubMed]
12. Altschul S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410. [PubMed]
13. Wootton J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571. [PubMed]
14. Liu J. and Lawrence,C.E. (1999) Bayesian inference on biopolymer models. Bioinformatics, 15, 38–52. [PubMed]
15. Wanner B.L. (1996) Phosphorus assimilation and control of the phosphate regulon. In Neidhardt,F.C. (ed.), Escherichia coli and Salmonella: Cellular and Molecular Biology. ASM Press, Washington, DC, pp. 1357–1381.
16. Twyman R.M. (1998) Advanced Molecular Biology, A Concise Reference. Springer-Verlag, New York, NY, pp. 246–247.
17. Staden R. (1989) Methods for calculating the probabilities of finding patterns in sequences. Comput. Appl. Biosci., 5, 89–96. [PubMed]
18. Wasserman W., Palumbo,M., Thompson,W., Fickett,J. and Lawrence,C.E. (2000) Human-mouse genome comparisons to locate regulatory sites. Nature Genet., 26, 225–228. [PubMed]
19. McCue L.A., Thompson,W., Carmack,C.S. and Lawrence,C.E. (2002) Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res., 12, 1523–1532. [PMC free article] [PubMed]
20. Claverie J.M. and States,D. (1993) Information enhancement methods for large scale sequence analysis, Comp. Chem., 17, 191–201.
21. Hughes J.D., Estep,P.W., Tavazoie,S. and Church,G.M. (2000) Computational identification of cis-regulatory elements associated with functionally coherent groups of genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205–1214. [PubMed]
22. Thijs G., Marchal,K., Lescot,M., Rombauts,S., De Moor,B., Rouze,P. and Moreau,Y. (2002) A Gibbs sampling method to detect over-represented motifs in upstream regions of coexpressed genes. J. Comp. Biol., 9, 447–464. [PubMed]
23. van Helden J., André,B. and Collado-Vides,J. (2000) A web site for the computational analysis of yeast regulatory sequences. Yeast, 16, 177–187. [PubMed]
24. Workman C. and Stormo,G.D. (2000) ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Proc. Pac. Symp. Biocomp., 5, 464–475. [PubMed]
25. Sinha S. and Tompa,M. (2000) A statistical method for finding transcription factor binding sites. In Bourne,P., Gribskov,M., Altman,R., Jensen,N., Hope,D., Lengauer,T., Mitchell,J., Scheeff,E., Smith,C., Strande,S. and Weissig,H. (eds), Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, pp. 344–354.
26. Eskin E. and Pevzner,P.A. (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics, 18, 354–363. [PubMed]
27. Marsan L. and Sagot,M.F. (2000) Algorithms for extracting structured motifs using a suffix-tree with application to promoter and regulatory site consensus identification. J. Comput. Biol., 7, 345–360. [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press