# Gibbs Recursive Sampler: finding transcription factor binding sites

^{1}The Wadsworth Center, New York State Department of Health, Albany, NY 12201-0509, USA

^{2}Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, USA

^{3}Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

^{*}To whom correspondence should be addressed. Tel: +1 518 486 7882; Fax: +1 518 473 2900; Email: thompson@wadsworth.org

## Abstract

The Gibbs Motif Sampler is a software package for locating common elements in collections of biopolymer sequences. In this paper we describe a new variation of the Gibbs Motif Sampler, the Gibbs Recursive Sampler, which has been developed specifically for locating multiple transcription factor binding sites for multiple transcription factors simultaneously in unaligned DNA sequences that may be heterogeneous in DNA composition. Here we describe the basic operation of the web-based version of this sampler. The sampler may be accessed at http://bayesweb.wadsworth.org/gibbs/gibbs.html and at http://www.bioinfo.rpi.edu/applications/bayesian/gibbs/gibbs.html. An online user guide is available at http://bayesweb.wadsworth.org/gibbs/bernoulli.html and at http://www.bioinfo.rpi.edu/applications/bayesian/gibbs/manual/bernoulli.html. Solaris, Solaris.x86 and Linux versions of the sampler are available as stand-alone programs for academic and not-for-profit users. Commercial licenses are also available. The Gibbs Recursive Sampler is distributed in accordance with the ISCB level 0 guidelines and a requirement for citation of use in scientific publications.

## INTRODUCTION

Transcription regulation is arguably the most important foundation of cellular function, since it exerts the most fundamental control over the abundance of virtually all of a cell's functional macromolecules. A predominant feature of transcription regulation is the binding of regulatory proteins, transcription factors (TFs), to cognate DNA binding sites known as transcription factor binding sites (TFBS) in the genome. The computational identification of TFBS through the analysis of DNA sequence data has emerged in the last decade as a major new technology for the elucidation of transcription regulatory networks. The Gibbs Motif Sampler is a software package used to locate common elements in collections of biopolymer sequences. It has been applied to the analysis of protein sequences (1,2). Gibbs sampling has also been used extensively in the identification of TFBS (3,4) and an earlier version of this software has been available at this web location for some time. In this paper we describe a new variation, the Gibbs Recursive Sampler, designed to search for multiple TFBS simultaneously. It includes several features that are designed specifically for locating TFBS in unaligned DNA sequences. These features are based on characteristics of TF/DNA complexes or their components.

## THE GIBBS RECURSIVE SAMPLER

Gibbs sampling is a Markov Chain Monte Carlo procedure that has seen wide application in the statistical community. It was first applied in bioinformatics as a tool for multiple sequence alignment in 1993 (1). Gibbs sampling techniques have subsequently seen numerous enhancements and applications (2,5). A key feature of sequence-based Gibbs sampling algorithms and related expectation maximization algorithms (6,7) is the use of motif models in the form of product multinomial models to capture sequence patterns common to the binding sites of each TF.

The recursive sampler, described here, was specifically developed for the identification of TFBS in unaligned DNA sequences. It includes several features that are unique to this software: a rigorous Bayesian method for inferring the number and the locations of the TFBS for multiple TF motifs simultaneously; a background model of the heterogeneity in the composition of non-coding nucleotide sequence; and the ability to use prior information on binding motifs. In addition, it includes features to allow the use of palindromic, direct repeat and concentrated alphabet models, preferred binding site locations, and a rigorous test of the statistical significance of the results, the Wilcoxon signed-rank test.

In the following, we briefly describe how the algorithm incorporates these features and we provide instructions on the use of the algorithm and on the interpretation of its results.

## RECURSIVE DISCOVERY OF SITES AND SITE COUNTS

Multiple TFs often bind in a combinatorial fashion to regulate transcription. These collections of factors may contain multiple TFBS for any or all of the factors. Furthermore, some of the DNA sequences in a data set may contain no sites at all, or no sites for one or more of the factors involved in regulating the rest. While the total number of sites in a given input sequence often spans a specifiable and relatively short range, the exact number of sites and the number of sites corresponding to each motif in any input sequence are unknown and often vary among the input sequences. To address these unknowns, the sampler uses recursive sums over all possible alignments of 0≤*k*≤*K*_{max} sites in a sequence, to obtain Bayesian inferences on the number of sites for each motif and the total number of sites in each sequence. The recursion examines the placements of sites for *k* TFs, as represented by *p* different motifs, in each sequence. The algorithm infers for each sequence the total number of sites, the number of each of the *p* motifs and the alignments and orderings of these sites in the sequence. In its back sampling step it simultaneously samples sites in each sequence according to these inferences. As in previous Gibbs sampling algorithms, the widths of sites are inferred using a fragmentation algorithm (5). The sampling process iterates over the sequences one at a time, using currently sampled values in all other sequences, to guide the sampling process toward a converged result.
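The recursion and its back sampling step can be illustrated with a deliberately simplified sketch. It assumes a single motif of fixed width and a precomputed list `ratios` of positive motif-to-background likelihood ratios, one per candidate start position; the function names, the optional `k_prior` argument and the restriction to one motif are illustrative assumptions, not the sampler's actual implementation, which also sums over the *p* motif types.

```python
import random

def forward_sums(ratios, width, k_max):
    """F[k][j] = summed weight of every way to place exactly k
    non-overlapping sites within the first j sequence positions."""
    n = len(ratios) + width - 1          # sequence length
    F = [[0.0] * (n + 1) for _ in range(k_max + 1)]
    F[0] = [1.0] * (n + 1)               # one way to place zero sites
    for k in range(1, k_max + 1):
        for j in range(1, n + 1):
            F[k][j] = F[k][j - 1]        # position j-1 is background
            if j >= width:               # or a site ends at position j-1
                F[k][j] += F[k - 1][j - width] * ratios[j - width]
    return F

def sample_sites(ratios, width, k_max, rng, k_prior=None):
    """Sample the number of sites, then walk backwards to sample positions."""
    F = forward_sums(ratios, width, k_max)
    n = len(F[0]) - 1
    if k_prior is None:                  # assumed flat prior on the site count
        k_prior = [1.0] * (k_max + 1)
    weights = [k_prior[k] * F[k][n] for k in range(k_max + 1)]
    k = rng.choices(range(k_max + 1), weights=weights)[0]
    sites, j = [], n
    while k > 0:
        # F[k][j-1] / F[k][j] is the probability that no site ends here.
        if rng.random() * F[k][j] < F[k][j - 1]:
            j -= 1
        else:
            sites.append(j - width)      # record the start of the sampled site
            j -= width
            k -= 1
    return sorted(sites)
```

With `ratios = [1.0, 1.0, 1.0]` and `width = 2`, the forward table simply counts placements: three ways to place one site and one way to place two.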

After all of the sequences have been examined and a set of multiple motif positions has been determined, the log of the posterior alignment probability is calculated (5). The maximum value of this probability, the MAP (maximum *a posteriori* probability), provides the optimal solution. The MAP value is measured relative to an empty or ‘null’ alignment, by taking the difference between the log of the probability of the alignment and the log of the probability of an empty alignment. A value greater than zero indicates that the alignment is more likely than unaligned background. The process continues until a maximal number of iterations has been executed or until a certain number of iterations, the plateau period, has occurred without an increase in the estimated MAP. As a default we report the MAP solution. Alternatively, we provide a Bayesian inference of site frequencies based on continued sampling after convergence. These Bayesian inferences take the form of estimated probabilities for each predicted site. Both types of solution adjust inferences for the lengths of the input sequences and of sites and motifs in a Bayesian manner analogous to Webb (8).
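The stopping rule can be sketched as follows. Here `step` is a hypothetical callable that performs one full sampling iteration and returns the log MAP relative to the null alignment; the names and the exact bookkeeping are assumptions made for illustration.

```python
def run_until_plateau(step, max_iterations, plateau_period):
    """Iterate until max_iterations is reached or until plateau_period
    consecutive iterations fail to improve the best relative log MAP."""
    best = float("-inf")
    since_improvement = 0
    iterations = 0
    while iterations < max_iterations:
        iterations += 1
        # value = log P(alignment) - log P(null alignment)
        value = step()
        if value > best:
            best = value
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= plateau_period:
                break
    return best, iterations
```

A returned `best` above zero corresponds to an alignment that is more likely than the unaligned background.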

## HETEROGENEOUS BACKGROUND

As in previous Gibbs sampling models, the probability that a particular position in the sequence is sampled as a site is calculated as the ratio of the probability of the site under motif models to the probability under a background model that describes the sequence in the absence of TFBS. In previous implementations, background models assumed homogeneity in the composition of each sequence. However, variations in sequence composition in non-coding DNA are often complex. The most common approach to this complexity has been to model the background sequence using Markov chains (9,10).

Non-coding sequence is also often heterogeneous in composition, particularly in eukaryotes (Fig. 1). This variation in local base composition can adversely affect sequence alignment (11,12). In addition, since TFBS are often A-T or G-C rich, masking algorithms are often not useful for reducing the effect of background variation (13). To address this heterogeneity, the recursive sampler uses the Bayesian segmentation algorithm (14) to produce a position-specific background model. A two-step process is employed. First, the individual input sequences are analyzed for heterogeneity in composition using the Bayesian segmentation algorithm. This algorithm returns the probabilities of observing each of the four bases at each position in a sequence, based only on that specific sequence's compositional heterogeneity and on the uncertainty in this heterogeneity. These probabilities are then used as position-specific background models in the above ratio.
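As a minimal sketch of the ratio with a position-specific background, assume the motif is a list of per-column base probabilities and the background is a list with one base-probability dictionary per sequence position (the output one would expect from a segmentation step). The data layout and the function name are assumptions for illustration only.

```python
def site_ratio(seq, start, motif, background):
    """Likelihood ratio for a site beginning at `start`:
    product over motif columns of P(base | motif column) divided by
    P(base | background at that sequence position)."""
    ratio = 1.0
    for offset, column in enumerate(motif):
        position = start + offset
        base = seq[position]
        ratio *= column[base] / background[position][base]
    return ratio
```

With a homogeneous background every entry of `background` is identical; the heterogeneous model only changes what is stored at each position, not the form of the ratio.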

## PALINDROMES AND DIRECT REPEATS

Several TFs that bind as symmetric homodimers or homo-multimeric protein complexes have palindromic DNA binding motifs (6). Other regulatory TFs such as *Escherichia coli* PhoB and certain zinc finger proteins bind in directly repeating multimers and therefore have directly repeating binding patterns (15). Often, the spacing between the palindromic half-sites is unknown. The algorithm makes inferences in these cases, using a modified version of the fragmentation algorithm (5), by restricting fragmentation to the center positions. In a number of cases, AT/TA and GC/CG base pairs may be indistinguishable to TF binding (16). For example, bases in the minor groove of DNA are less accessible to the protein, which limits the ability to distinguish the bases. In addition, AT/TA pairs may allow the DNA greater flexibility to facilitate the DNA bending often associated with TF binding (6). To address all of these cases the recursive sampler includes motif models that use reduced alphabets.
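To make the palindromic and reduced-alphabet models concrete, the sketch below ties each motif column to the reverse-complement column on the other half-site, and collapses a column onto a two-letter alphabet in which A/T and C/G are not distinguished. The helper names, the simple averaging of counts and the two-letter labels `W` and `S` are illustrative assumptions; the sampler's own parameterization may differ.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def symmetrize(counts):
    """Tie column i to column w-1-i so that the motif equals its own
    reverse complement (a palindromic model)."""
    width = len(counts)
    return [
        {
            base: (counts[i][base] + counts[width - 1 - i][COMPLEMENT[base]]) / 2
            for base in "ACGT"
        }
        for i in range(width)
    ]

def collapse(column):
    """Reduced alphabet: W = A or T (weak pair), S = C or G (strong pair)."""
    return {"W": column["A"] + column["T"], "S": column["C"] + column["G"]}
```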

## PREFERRED SITE LOCATIONS

While at least in principle a TFBS may be located over the entire range of the input DNA sequences, there are often preferred locations for binding. For example, in prokaryotes, while TFBS may occur anywhere in an intergenic sequence as well as in the coding region, they are much more common in and around the RNA polymerase binding site. In some cases, an empirical distribution of the locations of the sites within the sequences may be known. For example, Figure 2 shows the distribution of 182 experimentally reported intergenic TFBS relative to the start codons of the regulated genes in *E.coli*. The use of such a distribution will allow the algorithm to focus on the most probable locations for sites (4). By default, the algorithm uses a uniform distribution of site positions.
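One simple way to picture the effect of a position distribution is to weight each candidate start position's likelihood ratio by its prior probability and renormalize. The sketch below assumes a single site per sequence and hypothetical inputs `ratios` and `position_prior`; it is not taken from the sampler's code.

```python
def position_weights(ratios, position_prior=None):
    """Normalized sampling weights for each candidate start position.
    A missing prior is treated as uniform, matching the stated default."""
    if position_prior is None:
        position_prior = [1.0 / len(ratios)] * len(ratios)
    weights = [r * p for r, p in zip(ratios, position_prior)]
    total = sum(weights)
    return [w / total for w in weights]
```

For `ratios = [1.0, 1.0, 2.0]`, an empirical prior such as `[0.5, 0.25, 0.25]` shifts weight toward the first position relative to the uniform default.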

## PRIOR INFORMATION ON CANDIDATE MOTIFS

Position weight matrices are now available for many TFs. When a sufficient number of sites from prior studies are available the corresponding multiple alignments can be converted to position weight matrices that can be used to scan sequences for additional sites (2,17). However, it is often the case that limited data are available for the determination of a position weight matrix. Fortunately, Bayesian statistics provides the means to use such limited prior information through the specification of prior motif models. These informed prior models provide clues to the expected patterns in DNA binding motifs that influence but do not control posterior inference of sites and motifs. The recursive sampler permits incorporation of informed motif priors and gives the user control over the strength of the clue. By default, uninformed prior motif models are assumed.
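A common way to express an informed prior with adjustable strength is through pseudocounts. The sketch below assumes a Dirichlet-style prior in which `strength` is the total pseudocount weight given to the prior column; this parameterization and the function name are assumptions for illustration, not a description of the sampler's input format.

```python
def posterior_column(counts, prior_probs, strength):
    """Posterior mean base probabilities for one motif column:
    observed counts blended with `strength` pseudocounts drawn
    from the prior column."""
    total = sum(counts.values()) + strength
    return {
        base: (counts[base] + strength * prior_probs[base]) / total
        for base in "ACGT"
    }
```

A small `strength` lets the data dominate, so the prior acts as a clue; a large `strength` pulls the column toward the prior pattern.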

## WILCOXON TEST OF MOTIF SIGNIFICANCE

The assessment of statistical significance of any multiple sequence alignment is difficult. A general, robust and non-parametric procedure for this assessment has been described by Liu and coworkers (5). Conceptually, this procedure formalizes comparisons with negative controls to assess statistical significance. Under the null hypothesis, sites will be no more likely to occur in study sequences than in negative controls. In this case, application of the Gibbs sampler or other site-finding algorithm to an extended set of sequences, the first half of which are study sequences and the second half of which are negative control sequences, will be equally likely to find a site in either half. The Wilcoxon signed-rank test, included in the recursive sampler, implements this concept in a manner that uses not only the differences in the number of sites predicted in study sequences versus the negative control sequences, but also the ranks of the scores of each predicted site. Two characteristics of this test contribute to its robustness: it is non-parametric and there are no restrictions on the characteristics of the negative controls that can be used with the procedure.
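For readers unfamiliar with the test itself, the following is a generic, self-contained sketch of a one-sided Wilcoxon signed-rank test on paired scores. How study and control sites are paired in the sampler is not described here, so the pairing of `study` with `control` scores is an assumption, as is the use of a normal approximation without tie or continuity corrections.

```python
import math

def wilcoxon_signed_rank(study, control):
    """One-sided test that study scores exceed paired control scores.
    Returns (W+, approximate p-value)."""
    diffs = [s - c for s, c in zip(study, control) if s != c]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return w_plus, p_value
```

Because the statistic depends only on signs and ranks, it makes no assumption about the distribution of the site scores, which is the non-parametric property noted above.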

## PROGRAM INTERFACE

An effort was made to keep the web-based program interface straightforward. At a minimum, to use the recursive sampler the user must supply a set of sequences in FASTA format, the maximum number of sites per sequence and an initial width estimate for each motif. The sequence data may be pasted into the entry window or loaded from a file. Sets of prokaryotic or eukaryotic default values for all parameters can be chosen. Alternatively, the user can specify values for these parameters. Figure 3 illustrates the initial interface screen. Required entries are marked with asterisks. Each entry field has an associated hyperlink that gives details about the required data format.
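Before submitting data, it can help to check that the input parses as FASTA. The short reader below is a generic sketch, not part of the sampler; it returns the header line (without the leading `>`) and the concatenated, upper-cased sequence for each record.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```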

The sequence data is typically the non-coding sequences of co-regulated genes from a single species (18) or the non-coding sequences of orthologous genes from several species (4,19). McCue and colleagues (19) describe a number of factors influencing the choice of species for use in the identification of TFBS in prokaryotes.

## ADVANCED OPTIONS

Clicking on the Show Advanced Options link brings up a new screen with more options (Fig. 4). While defaults have been set for all advanced features, specification of alternative values on this page allows the user greater control over how the sampler searches for sites, and also allows adjustment of the program runtime parameters. For example, in a particular prokaryotic application, the user may decide to replace the default choice of palindromic sites with a non-palindromic model. A number of other parameters influence the operation of the program. Several items have been mentioned above and further information on formats and acceptable values can be obtained by clicking on the associated hyperlinks.

## A FREQUENCY-BASED SOLUTION

By default, the algorithm returns the single best solution that it has found as measured by the MAP value. However, no prediction method is perfect, and not all of the predictions of this or other algorithms are equally compelling. For examination of the credibility of each prediction, a Bayesian sampling solution is also available as an option. For this solution, after the MAP solution has been determined, the algorithm continues to sample, in order to explore variations in the models. The frequencies with which sites are sampled are recorded and reported to the user. Variations in models may alter estimates of site frequencies from the values obtained with the single MAP solution. Our experience shows that some sites are strong and are sampled ~100% of the time, while others are weaker and are sampled with a lower frequency. These sampling frequencies are an estimate of the probability that the cognate TF binds at each predicted site. By default, only sites that have a frequency of at least 50% are reported.
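The frequency-based report amounts to counting how often each site appears across the post-convergence samples and applying the reporting threshold. In this sketch each sample is assumed to be a list of site start positions for one sequence; the representation and the function name are illustrative.

```python
from collections import Counter

def site_frequencies(samples, threshold=0.5):
    """Fraction of samples in which each site appears, keeping only
    sites at or above the reporting threshold (50% by default)."""
    counts = Counter(site for sample in samples for site in set(sample))
    n = len(samples)
    return {
        site: count / n
        for site, count in sorted(counts.items())
        if count / n >= threshold
    }
```

For the samples `[[10, 40], [10], [10, 40], [10, 75]]`, position 10 is sampled every time, position 40 half the time, and position 75 falls below the default threshold.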

## PROGRAM OUTPUT

The web-based version of the Gibbs Recursive Sampler produces the same output as does the stand-alone version. Normally, the output is returned via email. The user also has the option of receiving the output online. Figure 5 illustrates typical program output. These results are for a set of 18 sequences extracted from *E.coli* regulatory regions. The sequences are known to contain binding sites for the protein CRP (6). The results were generated using the recursive sampler with the following parameters: a width of 16, a palindromic model, a maximum number of sites per sequence of two and heterogeneous background composition. Since we used a palindromic model, the search for sites in the reverse complement direction was turned off. The Wilcoxon signed-rank test was applied to test the statistical significance of the resulting alignment.

## INTERPRETATION OF RESULTS

The first portion of the results in Figure 5 is a list of the options used for the current run. This is followed by a list of the FASTA headings for the input sequences. Next are the results from the MAP solution for each motif. The listing for each motif begins with a table having a column for each of the bases and a row for each of the positions in the motif. The numbers in the table indicate the estimated probability of occurrence of each base in each position within the motif, throughout the alignment. The last column gives the information contribution to the model of each position, with a maximum of two bits. Thus, this table gives the estimated binding pattern of the motif and a measure of the degree of conservation of each position in bits.
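The two-bit maximum follows from the four-letter alphabet. A minimal sketch of a per-position information measure, assuming it is computed against a uniform reference distribution (the program may instead measure information relative to its background model), is:

```python
import math

def column_information(column):
    """Information content of one motif position in bits:
    2 minus the Shannon entropy of the base probabilities."""
    return 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)
```

A fully conserved position scores two bits, a position split evenly between two bases scores one bit, and a position with uniform base probabilities scores zero.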

The next portion of the results is the alignment of the motif sites, along with the probability that each site belongs to the current motif model. Immediately below the listing of motif sites is a row containing asterisks. The asterisks indicate the conserved columns of the site. In the example in Figure 5, even though the initial width of the site was entered as 16, the program fragmented the sites to a total width of 22: 16 conserved columns and six non-conserved columns. Only the conserved columns are used in the alignments. The fragmentation algorithm dynamically learns which columns are conserved. The p-value of 0.000396 from the Wilcoxon signed-rank test indicates that the solution is highly significant. In addition, the predicted sites match well with the reported CRP binding sites for these sequences (6).

The MAP calculation includes terms for each component of the model: the motif model, the fragmentation, the background model and the alignment. The individual contributions of the two motif-specific terms, the motif model term and the fragmentation term, are listed for each individual motif. The final full log MAP minus the log of the null MAP (MAP of the sequence with no sites) is returned at the bottom of the output, as are the background and null MAP terms. A full description of the output, along with other examples, can be found at http://bayesweb.wadsworth.org/gibbs/content.html#results.

## ADDITIONAL FEATURES

The Gibbs Motif Sampler web page also allows for the analysis of amino acid sequences. With the exception of the nucleotide-specific features described above, all of the sampling options are applicable to proteins. Low complexity regions of amino acid sequences can be removed with an XNU process (20) using a BLOSUM62 matrix.

## OTHER RESOURCES

A number of other web sites exist for the discovery of motifs in nucleotide sequences. Several of these are based on Gibbs sampling methods (9,21–23). Others use expectation maximization (7, http://www.cse.ucsc.edu/~kent/improbizer/index.html), neural nets (24) or suffix trees or similar enumerative algorithms (25–27) to search for motifs in unaligned sequence data. Workman and Stormo (24) provide a comparison of an earlier version of this software with neural network and expectation maximization algorithms.

## FOR FURTHER INFORMATION

The Bayesian Bioinformatics web page at http://www.wadsworth.org/resnres/bioinfo/ provides a number of other web based applications for sequence analysis, a database of predicted *E.coli* TFBS, a tutorial on Bayesian bioinformatics and references.

## ACKNOWLEDGEMENTS

We thank our colleagues at the Wadsworth Center for their assistance, especially C. Steven Carmack and Clarence Chan for assistance with the web programming, and Lee Ann McCue and Michael Palumbo for critical readings of the manuscript. This work was supported by NIH grant R01HG01257 and DOE grant DE-FG02-01ER63204.

## REFERENCES

*Escherichia coli* K-12 genome. J. Mol. Biol., 248, 241–254.

*cis*-regulatory elements associated with functionally coherent groups of genes in *Saccharomyces cerevisiae*. J. Mol. Biol., 296, 1205–1214.

**Oxford University Press**

Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Research. 2003 Jul 1; 31(13): 3580.
