Copyright Habib et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

A Novel Bayesian DNA Motif Comparison Method for Clustering and Retrieval

1School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
2Department of Molecular Genetics and Biotechnology, Faculty of Medicine, The Hebrew University, Jerusalem, Israel

Ernest Fraenkel, Academic Editor, Massachusetts Institute of Technology, United States of America

#Contributed equally.
* E-mail: nir/at/cs.huji.ac.il

Conceived and designed the experiments: NH TK HM NF. Performed the experiments: NH. Analyzed the data: NH TK. Wrote the paper: NH TK HM NF.

Received July 19, 2007; Accepted January 24, 2008.

Abstract

Characterizing the DNA-binding specificities of transcription factors is a key problem in computational biology that has been addressed by multiple algorithms. These usually take as input sequences that are putatively bound by the same factor and output one or more DNA motifs. A common practice is to apply several such algorithms simultaneously to improve coverage at the price of redundancy. In interpreting such results, two tasks are crucial: clustering of redundant motifs, and attributing the motifs to transcription factors by retrieval of similar motifs from previously characterized motif libraries. Both tasks inherently involve motif comparison. Here we present a novel method for comparing and merging motifs, based on Bayesian probabilistic principles. This method takes into account both the similarity in positional nucleotide distributions of the two motifs and their dissimilarity to the background distribution.
We demonstrate the use of the new comparison method as a basis for motif clustering and retrieval procedures, and compare it to several commonly used alternatives. Our results show that the new method outperforms other available methods in accuracy and sensitivity. We incorporated the resulting motif clustering and retrieval procedures in a large-scale automated pipeline for analyzing DNA motifs. This pipeline integrates the results of various DNA motif discovery algorithms and automatically merges redundant motifs from multiple training sets into a coherent annotated library of motifs. Application of this pipeline to recent genome-wide transcription factor location data in S. cerevisiae successfully identified DNA motifs in a manner that is as good as semi-automated analysis reported in the literature. Moreover, we show how this analysis elucidates the mechanisms of condition-specific preferences of transcription factors. Author Summary Regulation of gene expression plays a central role in the activity of living cells and in their response to internal (e.g., cell division) or external (e.g., stress) stimuli. Key players in determining gene-specific regulation are transcription factors that bind sequence-specific sites on the DNA, modulating the expression of nearby genes. To understand the regulatory program of the cell, we need to identify these transcription factors, when they act, and on which genes. Transcription regulatory maps can be assembled by computational analysis of experimental data, by discovering the DNA recognition sequences (motifs) of transcription factors and their occurrences along the genome. Such an analysis usually results in a large number of overlapping motifs. To reconstruct regulatory maps, it is crucial to combine similar motifs and to relate them to transcription factors. To this end we developed an accurate fully-automated method, termed BLiC, based upon an improved similarity measure for comparing DNA motifs. 
By applying it to genome-wide data in yeast, we identified the DNA motifs of transcription factors and their putative target genes. Finally, we analyze motifs of transcription factors that alter their target genes under different conditions, and show how cells adjust their regulatory program in response to environmental changes.

Introduction

Transcription initiation is modulated by transcription factors that recognize sequence-specific binding sites in regulatory regions. The organization of binding sites around a gene specifies which factors can bind to it and where, and consequently determines to what extent the gene is transcribed under different conditions. To understand this regulatory mechanism, one must specify the DNA binding preferences of transcription factors. These preferences are usually characterized by a motif that summarizes the commonalities among the binding sites of a transcription factor [1]. Multiple tools were developed for finding motifs (e.g., [2]–[5]); however, there are several problems in interpreting their output. Typically these algorithms output multiple results, which require filtering and scoring. Moreover, different motif discovery methods have complementary successes, and therefore it is beneficial to apply multiple methods simultaneously and collate their results [6]. In addition, the motif discovery algorithms frequently produce redundant output, and the transcription factor that binds each motif is usually unknown. As similar motifs may represent binding sites of the same factor, eliminating this redundancy is essential for elucidating the true transcriptional regulatory program. The general strategy is thus to cluster similar motifs and merge motifs within each cluster to create a library of non-redundant motifs [6] (Figure 1B
In the literature there is an ongoing discussion regarding the best representation of DNA motifs [1], [7]–[10]. Here we use a Position Frequency Matrix (PFM), which has the benefits of being relatively simple yet flexible. A PFM is a matrix of positions in the binding site versus nucleotide preferences, where each row represents one residue and each column represents the nucleotide count at each position in a set of aligned binding sites. This representation assumes that the choice of nucleotides at different positions is independent of all other positions. To compare two PFMs, we need to consider all possible alignments between them. Given two aligned PFMs, we utilize the position-independence assumption to decompose the similarity score into a sum of the similarities of single aligned positions. Several similarity scores can be used to compare a pair of aligned positions. One approach uses the Pearson correlation coefficient (e.g., [11],[12]). This measure, however, might inappropriately capture similarities between probabilities (Figure 2
We use this motif similarity score to identify similar motifs that represent binding sites of the same factor, and to cluster motifs. For the latter we devised a hierarchical agglomerative clustering procedure that is based on our motif similarity score. Our results show that the new method outperforms other alternatives in accuracy and sensitivity in both the clustering and retrieval tasks. Using our new similarity score and the clustering method based upon it, we developed a large-scale analysis pipeline of DNA motif sets. This pipeline is designed for analysis following concurrent motif search by a combination of methods (using the TAMO package [19]). The goal is to process the set of DNA motifs into a set of reliable non-redundant motifs. We use our method to identify sets of DNA motifs from a large-scale ChIP-chip assay in S. cerevisiae [13]. This allows us to examine how transcription factors alter their DNA binding preferences under various environmental conditions and elucidate mechanisms for condition-specific preferences.

Results

A Novel DNA Motif Similarity Score

Our goal is to determine whether two DNA motifs represent the same binding preferences. Since the less informative positions in a motif do not contribute to sequence-specific binding of proteins, we developed a similarity score that measures the similarity between two DNA motifs, while taking into account their dissimilarity from the background distribution. We now develop the details of the score. We can view DNA motifs as a list of binding sites from which the nucleotide distribution at each position is estimated. This view allows us to perform statistical evaluations. We assume that each binding site was sampled independently from a common distribution over nucleotides, which satisfies the position independence properties (in correspondence with the motif PFM representation described above).
Then, we can evaluate the likelihood ratio of different source distributions of the sampled binding sites. In practice, we keep only the sufficient statistics allowing us to evaluate the likelihood of the binding sites. These sufficient statistics are the counts of each nucleotide in each position, represented as a PFM. Our score is composed of two components: the first measures whether the two motifs were generated from a common distribution, while the second reflects the distance of that common distribution from the background (see Methods). Our Bayesian Likelihood 2-Component (BLiC) score for comparing motifs m1 and m2 is:
$$\mathrm{BLiC}(m_1, m_2) = \log\frac{P(m_1, m_2 \mid \text{common source})}{P(m_1, m_2 \mid \text{independent sources})} + \log\frac{P(m_1, m_2 \mid \text{common source})}{P(m_1, m_2 \mid \text{background})} \qquad (1)$$

For a pair of aligned positions with nucleotide count vectors n1 and n2 (where $n_{1,b}$ and $n_{2,b}$ are the counts of nucleotide $b$), the score is:

$$\mathrm{Score}(n_1, n_2) = 2\sum_{b \in NT} (n_{1,b} + n_{2,b}) \log \hat{P}_{12}(b) - \sum_{b \in NT} n_{1,b} \log \hat{P}_{1}(b) - \sum_{b \in NT} n_{2,b} \log \hat{P}_{2}(b) - \sum_{b \in NT} (n_{1,b} + n_{2,b}) \log P_{bg}(b) \qquad (2)$$

where $\hat{P}_1$, $\hat{P}_2$, and $\hat{P}_{12}$ are the estimators for the source distribution of n1, n2 and the common source distribution, respectively, Pbg is the background nucleotide distribution, and NT = {A,C,G,T}.

Since the source distribution is unknown, we must estimate it from the nucleotide counts at each position. We used a Bayesian estimator, where a priori knowledge and the number of samples were integrated into the estimation process. We considered two alternative priors. The first is a standard Dirichlet prior [20], which is conjugate to the multinomial distribution, enabling us to compute the estimations efficiently (see Methods). However, with this prior we cannot model our prior knowledge that a position in a DNA motif tends to have a specific preference for one or more nucleotides. Such prior knowledge can be described with a Dirichlet mixture prior [17],[18], which represents a prior that consists of several “typical” distributions. Specifically, we used a five-component mixture prior, with four components representing an informative distribution, giving high probability for a single nucleotide: A, C, G, or T. The fifth component represents the uniform distribution (see Methods).

In the above discussion we assumed that the motifs are aligned, but in practice, we compare unaligned motifs. Thus, we defined the similarity score for two motifs as the score of the best possible alignment (without gaps) between them, including the reverse complement alignment. In addition, we need statistical calibration of the similarity scores, since a high similarity score might be due to chance events [21],[22]. In particular, when comparing a single motif against motifs of different lengths, the probability of similarity by chance depends on the query motif and the length of the target. To circumvent these problems we use the p-value of the similarity score, which is computed empirically for each query against the distribution of scores of random motifs of a given length (see Methods and Figure 3
Clustering motifs

An important application of motif similarity scores is clustering. There are many clustering methods [23] that can be applied to motifs. Here we consider one of the simplest and most straightforward clustering procedures, in which we combine a similarity score, such as our BLiC score, with a hierarchical agglomerative clustering algorithm. In each iteration, the algorithm computes the similarity between all pairs of motifs and then merges the most similar pair into a new motif based on the best alignment between the two motifs (see Figure 1
Comprehensive Evaluation of Similarity Scores We set out to compare our similarity score to existing ones in the literature, in the context of both motif comparison and clustering. We use two different data sets. The first data set, which we refer to as “Yeast” is a synthetic one where we know the true labeling of motifs and use it to benchmark different procedures by relating their results with the underlying truth. To generate synthetic motifs in a realistic manner that reflects true binding properties of transcription factors, we use the genome-wide catalogue of transcription factor binding locations in S. cerevisiae [13]. This catalogue has high-confidence binding sites (based on combination of experimental assays with evolutionary conservation considerations). From these, we selected nine transcription factors to represent different binding patterns (in terms of inner arrangements of informative positions and length). From the binding sites of each factor we sampled sets of binding sites, and from each set generated a motif (see Figure 3A The second data set, which we refer to as “Structural”, was compiled by Mahony et al. [24]. Their evaluation is based on structural information. Since structurally related transcription factors often have similar DNA-binding preferences, the best match to a given motif is expected to be a motif associated with a member of the same structural class. Mahony et al. compiled a data set that contains the motifs of the families with four or more profiles in JASPAR [25]. Using these two data sets we compared different possible similarity scores for DNA motifs. Specifically, we compared the Pearson correlation coefficient; the information-theory based Jensen-Shannon divergence; the Euclidean distance; and our BLiC score. 
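The per-position measures being compared can be illustrated with a minimal Python sketch. It assumes each aligned position is given as a list of A, C, G, T counts that is converted to frequencies; the function names and the use of base-2 logarithms are illustrative assumptions, not taken from the evaluated implementations.

```python
import math

def frequencies(counts):
    """Convert one PFM column of A, C, G, T counts into probabilities."""
    total = sum(counts)
    return [c / total for c in counts]

def pearson(p, q):
    """Pearson correlation between two columns (0.0 if either column is flat)."""
    mean_p, mean_q = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0  # the correlation is undefined for a uniform column
    return cov / math.sqrt(var_p * var_q)

def euclidean(p, q):
    """Euclidean distance between two columns (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; zero-probability terms of p are skipped."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (0 for identical, 1 for disjoint columns)."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)
```

Pearson correlation and Euclidean distance treat a column only as a vector of four numbers. A uniform column has zero variance, so the correlation is undefined there and the sketch returns 0.0 by convention.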
Motif comparison evaluation—Identifying similar motifs We evaluated the sensitivity and specificity of motif similarity scoring methods by comparing all possible pairs of motifs from the data sets described above, and testing whether pairs that have high similarity indeed were generated from the same source. In the “Yeast” data set we call a pair as true if the two motifs were generated from binding locations of the same transcription factor, and in the “Structural” data set we call a pair as true if the motifs are of factors from the same structural class. For each motif pair, if the similarity is statistically significant we label this as a positive pair, and otherwise call it a negative. We compared this prediction to the label of the pair, and calculated the sensitivity and specificity for each p-value threshold to create ROC curves (Figure 3B and 3C The construction of the “Yeast” data set allows examining different parameters that make the task more challenging. We do so by restricting the number of binding sites or by checking whether the motif is partial or not. Using a smaller number of sites results in higher variability among motifs of the same factor, and using partial coverage means smaller overlap between compared motifs; see Figure S2. These results show that as the task becomes harder all the methods have reduced success rate: for 5% False Positive Rate (FPR), the True Positive Rates (TPR) vary from 65% (for partial overlapping motifs from samples of size 5) to 99% (for the motif with different offsets compared to the full length motifs from sample of size 35). 
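The operating points quoted here (TPR at 5% FPR) can be read off a list of pairwise p-values and true/false pair labels. The following Python sketch shows one way to do so; the function name and its arguments are hypothetical, and it assumes that a pair is called positive when its p-value is at or below the threshold.

```python
def tpr_at_fpr(pvalues, is_true_pair, max_fpr=0.05):
    """Highest TPR reachable by a p-value threshold whose FPR stays <= max_fpr."""
    n_pos = sum(1 for label in is_true_pair if label)
    n_neg = len(is_true_pair) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true pair and one false pair")
    best_tpr = 0.0
    # Each observed p-value is a candidate threshold.
    for threshold in sorted(set(pvalues)):
        called = [label for p, label in zip(pvalues, is_true_pair) if p <= threshold]
        true_pos = sum(1 for label in called if label)
        false_pos = len(called) - true_pos
        if false_pos / n_neg <= max_fpr:
            best_tpr = max(best_tpr, true_pos / n_pos)
    return best_tpr
```

Sweeping `max_fpr` from 0 to 1 traces the ROC curve. As the comparison task gets harder, the TPR that any score reaches at a fixed FPR drops.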
Nonetheless, using our score improves the retrieval rates substantially in most tasks; for example, when looking at sub-motifs with partial overlap from samples of size 35, for 5% FPR using the BLiC score leads to 80% TPR, compared to 62% with the Euclidean distance or 57% with the Pearson correlation (see Figure 3B

Comparing our two alternative priors, the Dirichlet prior versus the Dirichlet mixture prior, our results show that the more complex prior, which better models the nucleotide distribution in binding sites, leads to better results as the number of samples decreases (see Figure S2). When the number of samples is larger, the two priors result in similar performance.

Motif clustering evaluation: Reducing the redundancy

To further evaluate the accuracy of the different similarity scores we used these scores in clustering motifs from the two data sets. For this, we used the hierarchical agglomerative clustering algorithm described above. We then examined whether clusters consisted of motifs that are considered similar (either from the same factor in the “Yeast” data set, or the same structural family in the “Structural” data set). Examining the cluster hierarchy at different levels of granularity, we get a tradeoff curve between two criteria, the True Positive Rate (TPR) of all clusters and the number of clusters; see Figure 4

As in the motif comparison evaluation, we can perform the clustering evaluation on various subsets of the “Yeast” data set (see Figure 4

Large-Scale DNA Motif Analyses

Motif analysis pipeline

To facilitate analysis of many motifs we developed an automatic motif analysis pipeline, based on our BLiC score. This is a three-step method for processing and integrating large-scale data of newly discovered DNA motifs into coherent and reliable sets of non-redundant motifs.
The inputs for this procedure are multiple groups of co-regulated DNA sequences, and the output is a set of non-redundant motifs and a ranking of their relevance for each of the input groups (Figure 5
Step 1: Motif searching and filtering

Step 2: Clustering and merging motifs

The integrated sets of motifs (from all input groups) are clustered and merged to create a non-redundant set. First, the discovered motifs for each group are clustered and merged separately. Then, motifs from all groups are assembled, clustered and merged. After each stage of clustering, a subset of refined motifs is automatically chosen based on the clustering tree (see Methods).

Step 3: Ranking and identifying motifs

Finally, the non-redundant set of motifs is ranked and filtered once again, using the abundance of the motifs in the original groups of DNA sequences (see Methods). To determine which motifs are new and which were previously characterized, we compare the motifs to a library of known DNA motifs from the literature (TRANSFAC [26], SCPD [27], YPD [28]). By this comparison we associate the motifs with transcription factors.

Genome-Wide Yeast Motif Library

As a real-life application of this pipeline we examined genome-wide ChIP-chip measurements in S. cerevisiae of 177 transcription factors under several environmental conditions. In total we analyzed 301 experiments for different factors and conditions [13]. We used seven motif discovery algorithms to produce a set of motifs for each ChIP-chip experiment. These motifs were clustered, filtered, ranked and compared to known motifs from the literature (as described above and in the Methods). This resulted in a concise set of DNA motifs attributed to each transcription factor under each environmental condition (all the motif sets can be found at the Supplementary Web site http://compbio.cs.huji.ac.il/BLiC). To further analyze the resulting yeast DNA motif library, we compared it against the wealth of genomic annotations in the yeast literature. To do so, we scanned each motif in the library against the promoters of yeast genes (see Methods) and created a target gene set for the motif.
We then scored the enrichment of these motif gene sets against different types of gene annotations: the original ChIP-chip data [13], GO functional annotations [29], and groups of genes which are up or down regulated according to gene expression data (assembled by [30]–[32]). This allowed us to relate each motif to specific genomic annotations. To visualize these relationships we created a combined clustering of motifs and annotations using EdgeCluster - a clustering algorithm recently developed in our lab [33]. The novelty of EdgeCluster is in the integration of various sources of information into the clustering process. These information sources can be attributes of motifs (e.g., extent of enrichment in different gene sets) and pairwise information about motifs (i.e., the similarity of motif pairs). Figure S4 demonstrates the clustering of all the motifs. Clustering of a partial set of motifs is presented in Figure 6
Comparison to Previous Work In the works of Harbison et al. [13] and MacIsaac et al. [34], the same ChIP-chip data was used to construct a global transcriptional regulatory map in yeast. The motif analyses performed in these two works differ from ours in the similarity score used (the Euclidean distance) and in the different motif clustering and merging methods. In addition, the output of these two works was a single motif for each transcription factor. To be consistent with these previous works in the comparison, we narrowed down our set of motifs for each ChIP experiment to a single motif. We first looked only at transcription factors with previously characterized motifs. Our criterion for comparison is measuring the similarity to known motifs from the literature (TRANSFAC [26], SCPD [27], YPD [28]), using our BLiC score. To narrow down our motif set to a single motif for each factor we chose (as done in these previous works) the motif most similar to the known motif. In 65% of the cases our motifs have the highest similarity to the known motifs (Figure 7
For transcription factors with no previously known binding motif in the literature, we compared the enrichment of the motifs within the ChIP-chip groups of sequences. For the comparison, we narrowed the motif sets by choosing the most significant motif for each factor and environmental condition (similarly to what was done in these previous studies). We scanned the genomic sequences and computed the enrichment of each motif (see Methods), using the same procedure and parameters for motifs from all three methods. Our motifs were found to have the highest enrichments in 80% of the cases (see Figure 7

To ensure that the improvement we see is not due to differences in motif discovery methods, we repeated the analysis using the original output of the motif discovery of Harbison et al. (data not shown). This led to slight changes in the output motifs, as our original analysis used a superset of these motifs. Comparing these modified results against the results of Harbison et al. and MacIsaac et al., we see essentially the same improvement as the one we reported above (in 62% of the cases our motifs have the highest similarity to the known motifs, and in 65% of the cases our motifs were found to have the highest enrichments).

Elucidating Conditional Binding of Transcription Factors

Using the motif sets we have learned, we next turned to examine the change in the binding specificities of the transcription factors under different conditions. We distinguish between two types of factors. A condition-independent factor binds the same targets in multiple conditions, while a condition-dependent factor changes its set of targets between conditions. An example of a condition-independent transcription factor in yeast is Fhl1, a master regulator of ribosomal genes, which according to the ChIP data remains bound to 75% of its targets under different conditions (see Figure S5A).
This is consistent with previous work [35] and with the motif analysis, where similar motifs are related to Fhl1 in all three conditions (see Figure S5B). A condition-dependent regulator can show a range of behaviors in response to a change in condition. It may expand and bind additional targets, it may alter and bind to a different set of targets, or it may even not bind any targets [13]. Various mechanisms may be involved in monitoring condition-dependent binding. A factor may expand its targets, due to dosage change of the active transcription factor in the nucleus [13]. Alternatively, a factor may alter its targets due to several probable mechanisms (see Figure S6). One mechanism is changing the factor's specificity to the DNA, which we can trace by identifying variations in the DNA motif (Figure S6A). Another possible mechanism is a change in the factor's binding partner, which may be detected through co-occurrence of motifs of different factors (Figure S6B). In addition, a change of targets may be caused by a change in the accessibility to the binding site, which we cannot identify by analyzing motifs (Figure S6C). We focus here on factors that alter their targets under different conditions and try to elucidate the mechanism. We defined a transcription factor as altering its target genes between two conditions, if the number of target genes in the intersection is less than half of the number in each condition separately. In addition, we considered only factors with at least 20 target genes in each of the two conditions (a sufficient number for motif discovery). Out of the 72 transcription factors for which ChIP-chip experiments were carried out in more than one condition, 50 factors alter their target genes between two conditions (in total, 112 pairs of differential conditions) (Table S2). We searched for differential motifs in the motif set of each factor at every condition. 
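The criterion for a factor that alters its targets can be written down directly. The Python sketch below is only an illustration: the function name is hypothetical, and returning `None` for pairs with too few targets is a convention chosen here.

```python
def alters_targets(targets_a, targets_b, min_targets=20):
    """Return True if the factor alters its targets between two conditions.

    Returns None when either condition has fewer than min_targets targets,
    since such pairs are excluded from the analysis.
    """
    a, b = set(targets_a), set(targets_b)
    if len(a) < min_targets or len(b) < min_targets:
        return None
    shared = len(a & b)
    # The intersection must be smaller than half of each target set.
    return shared < len(a) / 2 and shared < len(b) / 2
```

For example, two conditions with 30 targets each that share 10 genes satisfy the criterion, whereas sharing 20 genes does not.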
We say a motif is differential if there is a significant difference (p<0.05, chi-square test) in the fraction of ChIP targets containing the motif between the two conditions (excluding the genes in the intersection). This analysis can potentially elucidate the mechanism through which a factor changes its DNA targets, by finding different variants of motifs, or co-occurrence of motifs of different factors as explained above. In about half of these pairs we did not find statistically significant motifs in at least one of the compared conditions and thus could not search for differential motifs. Finding a motif only for one condition could be meaningful on its own, since this may indicate that in the other condition there is no direct binding of the factor to the DNA. On the other hand it could result from technical reasons, such as noise in the input set of sequences, and thus in this work we do not analyze these cases. Out of the remaining 52 pairs (spanned over 27 different transcription factors), we found differential motifs for 88% of the factors (47 cases spanned over 24 factors, see Table S3) with a p-value of less than 0.05. Condition-Dependent Binding of Ste12 under Conditions of Mating and Filamentous Growth An example of a transcription factor that shows condition-dependent binding is Ste12, which activates genes in two alternative pathways—mating and filamentous growth [36],[37] (Figure 8A
Condition-Dependent Binding of the Iron-Regulated Factor Aft2 Another interesting example is provided by the iron-regulated transcription factor Aft2, required for iron homeostasis and resistance to oxidative stress [43]. This factor exhibits a significant environmental-dependent binding, switching targets between low and high H2O2 conditions (Figure 9A
Discussion

An accurate motif comparison method is important for clustering redundant DNA motifs into coherent groups and for connecting the discovered motifs to previously characterized motifs. In this study we present a novel similarity score, the BLiC score, based on Bayesian probabilistic principles. We use the new comparison method as a basis for motif clustering and retrieval procedures, and compare it to several commonly used alternatives. This comparison shows that our BLiC score improves the specificity and sensitivity of motif comparisons and clustering tasks. The resulting motif clustering and retrieval procedures are incorporated in a large-scale automated pipeline for analyzing DNA motifs, which integrates the output of various DNA motif discovery algorithms and automatically merges redundant motifs from multiple training sets. The output of our pipeline is a coherent annotated library of motifs. Application of this pipeline to genome-wide location data of transcription factors in S. cerevisiae successfully identified DNA motifs in a manner that is as good as semi-automated analyses reported in the literature. Moreover, we demonstrate how motif analysis can lead to insights into regulatory mechanisms.

Hierarchical Agglomerative Clustering

We used our BLiC score to develop a hierarchical agglomerative clustering algorithm for merging similar motifs, in which we ensure that the motifs within every sub-tree are properly aligned. Furthermore, such an approach allows us to trim the cluster tree at various levels, and thus to merge motifs at different resolutions. In our method a new agglomerative node results from aligning and merging the motifs of its descendant nodes, and then computing the similarity of this new motif to all other nodes. As a consequence, the hierarchical progression ensures that each sub-tree is coherent.
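The merging procedure can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: `column_score` is a stand-in similarity (0.5 minus the Euclidean distance between column frequencies) used in place of the BLiC score, and `min_overlap` and `threshold` are hypothetical parameters. A motif is a list of columns, each a list of A, C, G, T counts.

```python
import math

def column_score(c1, c2):
    """Stand-in similarity: positive for similar columns, negative otherwise."""
    p = [x / (sum(c1) or 1) for x in c1]
    q = [x / (sum(c2) or 1) for x in c2]
    return 0.5 - math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def reverse_complement(motif):
    """Reverse the positions and swap A<->T, C<->G (columns are A, C, G, T)."""
    return [col[::-1] for col in reversed(motif)]

def best_alignment(m1, m2, min_overlap=3):
    """Best ungapped alignment of m2 (either strand) against m1."""
    best = (float("-inf"), 0, m2)
    for oriented in (m2, reverse_complement(m2)):
        for offset in range(-len(oriented) + 1, len(m1)):
            lo, hi = max(0, offset), min(len(m1), offset + len(oriented))
            if hi - lo < min_overlap:
                continue
            score = sum(column_score(m1[i], oriented[i - offset]) for i in range(lo, hi))
            if score > best[0]:
                best = (score, offset, oriented)
    return best

def merge(m1, m2, offset):
    """Sum the counts of two motifs, with m2 starting at `offset` relative to m1."""
    merged = []
    for pos in range(min(0, offset), max(len(m1), offset + len(m2))):
        col = [0, 0, 0, 0]
        if 0 <= pos < len(m1):
            col = [a + b for a, b in zip(col, m1[pos])]
        if 0 <= pos - offset < len(m2):
            col = [a + b for a, b in zip(col, m2[pos - offset])]
        merged.append(col)
    return merged

def cluster(motifs, threshold=0.0):
    """Repeatedly merge the most similar pair until no pair reaches threshold."""
    clusters = [[list(col) for col in m] for m in motifs]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score, offset, oriented = best_alignment(clusters[i], clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j, offset, oriented)
        score, i, j, offset, oriented = best
        if score < threshold:
            break
        merged = merge(clusters[i], oriented, offset)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

Because each merge aligns the two motifs and sums their counts, every cluster in this sketch remains a single aligned motif that is then compared with all remaining nodes.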
This is in contrast to many clustering methods, such as k-means and typical hierarchical clustering [45] which find a set of motifs that are all similar to each other, but are not necessarily coherent in the sense that they cannot all be aligned. Motif Analysis Our motif analysis pipeline is designed to process discovered DNA motifs into a set of non-redundant motifs and compare these with known motifs. As we have shown, our approach improves the sensitivity and specificity in the analysis of the outputs of standard motif discovery methods. By automating all the steps, we enable the analysis of hundreds of input groups. In addition, we achieve a wide view on transcription regulation by running several motif discovery algorithms in parallel, and integrating their outputs. By comparing motifs from different input groups we are able to connect between transcription factors that play a role in different processes. Our analysis does not focus on finding the “best” single motif for each input group (e.g., targets of ChIP-chip assay), but rather we find a set of non-redundant motifs and their relations (enrichment) to each input group. This output better captures the complexity of the underlying regulatory program. For example, in many cases we find motifs of co-factors (e.g., Ste12 and Tec1). In other cases we see that a factor changes its binding specificity under different conditions (e.g., Aft2). For these cases, several DNA motifs better capture the DNA binding preferences of the transcription factor than a single motif. Relations to Previous Work There are several different approaches attempting to quantify similarities between DNA motifs. Two previous works [21],[22] showed that using p-values when comparing motifs is more accurate than the actual similarity scores. 
Specifically, Gupta et al.[21], compared seven motif-motif position similarity functions, including the Pearson Correlation coefficient (e.g., [11],[46]), average log-likelihood ratio (ALLR) [16], Kullback-Leibler divergence [47]–[49], and the Euclidean distance (ED) [13],[50]. They found that the Euclidean distance is slightly better than the alternatives they considered. The data set used by Gupta et al. has a similar design as our data set, but it is based on the TRANSFAC database [26]. Not surprisingly, our results are consistent with theirs. Here we also use p-values to calibrate similarity scores, and show that our score is more accurate than the Euclidean distance, which is the second best. Several resources are available for DNA motif analysis. There are many open access motif discovery tools available (e.g., [2],[3],[11]) and motif comparison tools [11],[21],[51]. In addition there are several available tools that integrate multiple motif discovery tools, and supply additional tools for filtering, comparison and ranking motifs [19],[49],[52]. In our motif analysis pipeline we use the TAMO package [19], for motif discovery and filtering, with a different genomic scan approach using statistical tools [53]. The main difference is that for the motif comparison and clustering we use our new BLiC score and a hierarchical agglomerative clustering (as discussed above). From DNA Motifs to Regulatory Mechanisms Sequence information is a highly accessible resource, and thus it is interesting to ask whether it can help elucidate mechanisms of transcription regulation. We examined transcription factors that alter their targets in response to an environmental change, and found a differential motif in 88% of these cases (24/27 factors). These differential motifs can suggest the potential mechanism through which the factor changes its targets. 
We show that motifs provide an indication of potential mechanisms when the factor changes its binding partner (Figure S6A) or its specificity to the DNA (Figure S6B), as we discussed thoroughly for the cases of Ste12 and Aft2. Nevertheless, motif analysis obviously does not reveal the whole regulatory picture. For example, chromatin-modeling mediated regulation cannot be inferred from motif analysis (Figure S6C). Thus, for a complete understanding of the regulatory mechanisms, additional information is needed. A significant limitation of motif analysis in general is the discrepancy between putative binding sites and actual functional binding events. This raises the question, addressed frequently before [10],[54], of whether our representation of transcription factor binding preferences is sufficiently accurate.

In this study we overcome a basic obstacle in DNA motif analysis by developing an accurate motif comparison method. Our motif analysis pipeline, which includes clustering and retrieval procedures based on our novel score, is fully automated and produces accurate results. This is highly important in large-scale analyses, such as the one reported here. We showed the power of motif analysis, which is useful not only for building regulatory maps but also for understanding regulatory mechanisms more profoundly.

Methods

Motif Representation

We use a Position Frequency Matrix (PFM) representation for a DNA motif. This is an n×4 matrix, where cell (i,j) contains the count of nucleotide j in position i of the motif.

Scores

We define the similarity score for two aligned PFMs. Due to the positional independence assumption in PFMs, the score decomposes into the sum of scores for corresponding positions. Our score is composed of two components: the first measures whether the two motifs were generated from a common distribution; the second reflects the distance of that common distribution from the background.
Thus, for positions n1 and n2, our score is as described in Equation 2. Statistically, the score sums the log-likelihood ratios of two pairs of hypotheses.

The first component:
H0: The two samples were drawn from a common source distribution.
H1: The two samples were drawn independently from different source distributions.

The second component:
H0: The two samples were drawn from a common source distribution that is distinct from the background.
H1: The two samples were drawn from the background distribution.

Estimation

We estimate the source distributions from the PFM using a Bayesian approach with a Dirichlet prior. The Dirichlet prior is specified by a set of hyper-parameters α = (α1,α2,…,αn) and has the form:

$$P(\theta) = \frac{\Gamma\left(\sum_{i}\alpha_i\right)}{\prod_{i}\Gamma(\alpha_i)}\prod_{i}\theta_i^{\alpha_i-1}$$

where Γ(x) is the Gamma function. We use two prior variants. The first is a standard Dirichlet prior [20], with hyper-parameters of (1,1,1,1). When using this prior, the estimated distribution for position n is:

$$\hat{\theta}_j = \frac{n_j + \alpha_j}{\sum_{l}\left(n_l + \alpha_l\right)} = \frac{n_j + 1}{\sum_{l} n_l + 4}$$

where n_j is the count of nucleotide j at position n.
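The second prior we use is a five-component Dirichlet mixture prior [17]. We combine five Dirichlet priors using uniform weights. Four of the components give high probability to a single DNA nucleotide: A, C, G, or T. The fifth component represents the uniform distribution. We use the hyper-parameters (5,1,1,1) for A, (1,5,1,1) for C, etc. For the fifth component we use the hyper-parameters (2,2,2,2). Using this, the estimated distribution for position n is:

$$\hat{\theta}_j = \sum_{k=1}^{5} P(k \mid n)\,\frac{n_j + \alpha^{k}_j}{\sum_{l}\left(n_l + \alpha^{k}_l\right)}$$

where,

$$P(k \mid n) = \frac{P(n \mid \alpha^{k})}{\sum_{k'=1}^{5} P(n \mid \alpha^{k'})}, \qquad P(n \mid \alpha^{k}) = \frac{\Gamma\left(\sum_{l}\alpha^{k}_l\right)}{\Gamma\left(\sum_{l}\left(n_l + \alpha^{k}_l\right)\right)}\prod_{l}\frac{\Gamma\left(n_l + \alpha^{k}_l\right)}{\Gamma\left(\alpha^{k}_l\right)}$$

Here n_j is the count of nucleotide j at position n, and α^k is the hyper-parameter vector of the k-th mixture component; the uniform mixture weights cancel in P(k | n).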
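The estimation and scoring steps of the Scores and Estimation subsections can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that each estimate is the posterior mean under its prior, that mixture components are weighted by their posterior probability given the counts, that the background is uniform, and that the per-position score plugs the estimated distributions into multinomial log-likelihoods. All function and constant names are hypothetical.

```python
import math

BACKGROUND = (0.25, 0.25, 0.25, 0.25)  # assumption: uniform background
DIRICHLET = (1.0, 1.0, 1.0, 1.0)       # standard Dirichlet prior from the text
MIXTURE = [(5, 1, 1, 1), (1, 5, 1, 1), (1, 1, 5, 1), (1, 1, 1, 5), (2, 2, 2, 2)]


def dirichlet_estimate(counts, alpha=DIRICHLET):
    """Posterior-mean estimate of one PFM position under a Dirichlet prior."""
    total = sum(counts) + sum(alpha)
    return [(c + a) / total for c, a in zip(counts, alpha)]


def log_marginal(counts, alpha):
    """Log marginal likelihood of the counts under one Dirichlet component.
    The multinomial coefficient is omitted; it is equal for all components."""
    value = math.lgamma(sum(alpha)) - math.lgamma(sum(counts) + sum(alpha))
    for c, a in zip(counts, alpha):
        value += math.lgamma(c + a) - math.lgamma(a)
    return value


def mixture_estimate(counts, components=MIXTURE):
    """Posterior-mean estimate under a uniform-weight Dirichlet mixture."""
    logs = [log_marginal(counts, alpha) for alpha in components]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]  # uniform prior weights cancel
    norm = sum(weights)
    estimate = [0.0] * len(counts)
    for w, alpha in zip(weights, components):
        for j, p in enumerate(dirichlet_estimate(counts, alpha)):
            estimate[j] += (w / norm) * p
    return estimate


def log_likelihood(counts, theta):
    """Multinomial log-likelihood of the counts (without the coefficient)."""
    return sum(c * math.log(t) for c, t in zip(counts, theta))


def column_score(n1, n2, estimate=dirichlet_estimate, background=BACKGROUND):
    """Sum of two log-likelihood ratios for one pair of aligned positions:
    common source vs. independent sources, and common source vs. background."""
    pooled = [a + b for a, b in zip(n1, n2)]
    common = log_likelihood(pooled, estimate(pooled))
    separate = log_likelihood(n1, estimate(n1)) + log_likelihood(n2, estimate(n2))
    from_background = log_likelihood(pooled, background)
    return (common - separate) + (common - from_background)


same = column_score((10, 0, 0, 0), (10, 0, 0, 0))
different = column_score((10, 0, 0, 0), (0, 10, 0, 0))
flat = column_score((3, 3, 2, 2), (3, 3, 2, 2))
print(same > different, same > flat)
```

In this sketch, two identical informative positions score higher than either two conflicting positions or two near-uniform positions, which is the behavior the two components are meant to produce. Passing `estimate=mixture_estimate` swaps in the Dirichlet-mixture prior.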
Clustering and Comparing Motifs

Comparing motifs

The similarity score of two PFMs is the score of the best possible ungapped alignment between them, including alignments to the reverse complement. The unaligned flanks of the motifs are scored according to their distance from the background distribution, multiplied by a relaxing factor of 0.2.

Assigning p-values to motif similarity scores

We devised an empirical p-value estimation procedure for motif similarity scores. For each motif, we computed the score distribution against motifs of all possible lengths, by comparison to 1000 random motifs of a specified length. Since the BLiC score distribution depends on the specificity of each motif, the distribution is computed for each motif separately to retain the overall characteristics of the motif. The random motifs were generated by sampling positions of motifs from the TRANSFAC database [26]. The p-value of the similarity of a given DNA motif to another is calculated empirically from the score distribution of the first motif against random motifs of the same length as the second motif (as the fraction of random motifs that received the same score or higher).

Clustering and Trimming the Tree

To cluster motifs, we implemented a hierarchical agglomerative clustering algorithm, using various motif comparison scores. In each iteration, the algorithm computes the similarity between all pairs of motifs and then merges the pair with the highest similarity score into a new motif (see Figure 4A). The clustering tree is used to distill the input set into a non-redundant group, by splitting the tree into clusters representing groups of redundant motifs. To obtain this non-redundant set, which covers the initial set, we choose a frontier in the clustering tree. A frontier in a tree is a subset of nodes, none a descendant of another, such that every leaf in the tree is a descendant of one of them.
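As an illustration, the greedy agglomerative step and the choice of a frontier can be sketched as below. This is a simplified sketch rather than the pipeline's code. It assumes user-supplied `similarity` and `merge` functions; the toy versions here operate on a single column of counts and merely stand in for the BLiC score and for alignment-aware merging. It also assumes that a merge is kept when its score is at least a fixed fraction of the larger self-similarity of the two merged motifs (0.6 here), and that a node can belong to a cluster only if both of its subtrees were kept whole. All names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Optional

Motif = List[int]  # toy stand-in: one column of A, C, G, T counts


@dataclass
class Node:
    motif: Motif
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    ratio: float = 1.0  # merge score divided by the best self-similarity


def build_tree(motifs: List[Motif],
               similarity: Callable[[Motif, Motif], float],
               merge: Callable[[Motif, Motif], Motif]) -> Node:
    """Greedy agglomerative clustering: repeatedly merge the most similar pair."""
    nodes = [Node(m) for m in motifs]
    while len(nodes) > 1:
        i, j = max(
            combinations(range(len(nodes)), 2),
            key=lambda p: similarity(nodes[p[0]].motif, nodes[p[1]].motif),
        )
        a, b = nodes[i], nodes[j]
        score = similarity(a.motif, b.motif)
        # Maximal attainable score: the larger of the two self-similarities
        # (assumed positive here, so that the ratio is meaningful).
        best = max(similarity(a.motif, a.motif), similarity(b.motif, b.motif))
        parent = Node(merge(a.motif, b.motif), a, b, score / best)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [parent]
    return nodes[0]


def frontier(root: Node, threshold: float = 0.6) -> List[Node]:
    """Bottom-up selection of frontier nodes in the clustering tree."""
    def visit(node: Node):
        if node.left is None:  # a leaf is always a coherent cluster
            return [node], True
        left_nodes, left_ok = visit(node.left)
        right_nodes, right_ok = visit(node.right)
        if left_ok and right_ok and node.ratio >= threshold:
            return [node], True  # keep the merge; the decision moves up the tree
        return left_nodes + right_nodes, False  # split: children join the frontier
    return visit(root)[0]


def toy_similarity(a: Motif, b: Motif) -> float:
    """Toy score: dot product of the two frequency vectors (not BLiC)."""
    pa = [x / sum(a) for x in a]
    pb = [x / sum(b) for x in b]
    return sum(x * y for x, y in zip(pa, pb))


def toy_merge(a: Motif, b: Motif) -> Motif:
    """Toy merge: add the counts of two already aligned columns."""
    return [x + y for x, y in zip(a, b)]


tree = build_tree([[10, 0, 0, 0], [9, 1, 0, 0], [0, 0, 10, 0]],
                  toy_similarity, toy_merge)
clusters = [node.motif for node in frontier(tree)]
print(clusters)
```

With the toy score, the two A-rich columns are merged into one cluster and the G-rich column remains a cluster of its own; raising the threshold splits the tree into more clusters.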
The frontier is chosen by a bottom-up traversal of the tree in which we select the set of nodes in the required frontier. Specifically, we consider every two motifs that were merged into one in the tree. We want to identify situations where this merge resulted in a motif that is very different from each of the two motifs that were merged. To test that, we compare the similarity score between the two motifs to the maximal score we could have attained (the maximum of the similarity of each one to itself). If the ratio of the observed score to this maximum is less than a preset threshold, the two motifs are added to the frontier. In the motif analysis pipeline, we use a stringent threshold of 60% of the maximum for creating non-redundant motifs (chosen according to hand-curated splits of 10 trees).

Motif Analysis

Motif discovery algorithms

In the analysis pipeline we applied several motif discovery algorithms: MDScan [2], AlignACE [11], and MEME [3] were used through the TAMO package [19] with the default parameters (apart from MEME, for which we changed the parameters to output six motifs). We also included conserved and abundant motifs in the yeast genome [55], and the output of MEME_c [13], Converge [13], and the SeedSearcher motif discovery algorithm [56]. The discovered motifs underwent an initial filtration according to their enrichment among the initial group of sequences (p-value threshold of 10⁻⁵, calculated using the TAMO package [19]). All motifs were converted to a PFM representation.

Clustering motifs

In the second step of the pipeline we cluster the motifs: first we cluster the motifs discovered for each transcription factor under each environmental condition separately, then the (merged) motifs for each factor under all conditions, and finally the entire set of motifs. The motifs are clustered and merged as described above.

Truncating motifs

Uninformative positions at the two edges of motifs were truncated automatically.
Truncation was done by a chi-square test (threshold of 0.05), testing whether the nucleotides at a motif position are distributed according to the background.

Identifying the motifs

Ranking motifs

In the third step of the pipeline, we rank and filter the merged motifs according to their enrichment (−log hyper-geometric p-value) in the input groups of DNA sequences. For filtering we use a threshold of 3 after applying a Bonferroni correction for multiple hypotheses. For this, we find the occurrences of each motif using a statistical tool for genomic scans, the TestMotif program [53]. To scan the genome with our motifs, we converted them from PFMs (count matrices) to profiles (frequencies), using estimation with the Dirichlet-mixture prior described above. After scanning with the TestMotif program [53], we incorporate evolutionary conservation data to find the occurrences of motifs. In particular, we decide that a DNA sequence contains a motif if one of the two following criteria holds:
Parameter tuning

The threshold values listed above were chosen according to an extensive search for parameters that maximize the true positive rate while allowing up to 2% false positive calls. This optimization was based on location analysis data of Gcn4 [13], and location and expression data for Sko1 (unpublished data).

Figure S1

Distinguishing between informative and non-informative positions: Two pairs of aligned motifs are presented (by a sequence-logo). This is an alignment of the known motif for the invertebrate factor Dfd versus two variants of the vertebrate factor Pax4, all taken from TRANSFAC [26] (matrix accessions I$DFD_01, V$PAX4_02, and V$PAX4_04, all from version 8.3). While it is clear the first motif (left) should get a lower similarity score than the second motif (right), scoring the two pairs of aligned motifs using the Jensen-Shannon divergence yields a higher score for the first motif. The desired similarity score should distinguish between high similarity of informative positions and non-informative positions. (0.73 MB TIF)

Figure S2

Evaluation of motif comparison. Using different subsets of motifs out of the “Yeast” data set, we compare our BLiC score (green, using a Dirichlet prior, and blue, using a Dirichlet-mixture prior) with other similarity scores: Jensen-Shannon divergence (red), Euclidean distance (purple) and Pearson correlation coefficient (cyan). Each of the nine panels represents a different comparison. The columns correspond to the number of samples used for constructing the motifs. The rows correspond to different choices of query sets and target sets for comparison (illustrated by the logos on the right): In the top row, all motifs of partial offsets are queries against the same set. In the middle row, all motifs, including full-length motifs and partial offsets, are compared against themselves. In the bottom row, we use partial offset motifs as queries and full motifs as targets. In each panel we plot True Positive Rate (y-axis) vs. False Positive Rate (x-axis) as in Figure 3B. (1.67 MB TIF)

Figure S3

Evaluation of motif clustering. Using different subsets of motifs out of the “Yeast” data set, we compare our BLiC score (green, using a Dirichlet prior, and blue, using a Dirichlet-mixture prior) with other similarity scores: Jensen-Shannon divergence (red), Euclidean distance (purple) and Pearson correlation coefficient (cyan). Each of the nine panels represents the average performance of 9 repeats of clustering over different motif sets. The columns correspond to the number of samples used for constructing the motifs. The rows correspond to different choices of motif sets (illustrated by the logos on the right): In the top row, we cluster all motifs of partial offsets. In the middle row, we cluster all motifs, including full-length motifs and partial offsets. In the bottom row, we cluster only full motifs. In each panel we plot True Positive Rate (y-axis) vs. number of clusters (x-axis) as in Figure 4B. (1.88 MB TIF)

Figure S4

Overview of the discovered motifs. Investigation of the properties of discovered motifs. Each motif (column) is compared to other motifs using the BLiC score (rows, top group), to average expression of its targets in different experiments [30]–[32] (second group), to enrichment of its targets in GO annotations [29] (third group) and in ChIP-chip location assays [13] (bottom group). The rows and columns were clustered using the EdgeCluster [33] algorithm, which integrates various sources of information into the clustering process. These information sources are attributes of motifs and pairwise information about motifs. The results are clusters of motifs that have not only similar attributes, as in a regular clustering algorithm, but also similar relations to motifs in other clusters. (8.43 MB TIF)

Figure S5

Condition independent binding of Fhl1. (A) A Venn diagram representing the results of the ChIP-chip experiment [13] for the transcription factor Fhl1 under YPD conditions, amino-acid starvation and nutrient deprived conditions. The targets of Fhl1 do not change under these three environments. (B) Under all conditions the same motif is found to be highly enriched. (2.70 MB TIF)

Figure S6

Possible mechanisms for condition-dependent binding of TFs. Motif analysis for condition-dependent transcription factors that bind different targets under different conditions. Here, three possible mechanisms that may be involved in monitoring condition-dependent binding, which lead to altered targets, are presented schematically. For each mechanism we show the scheme of the promoter organization of the target genes (above the dashed line) and the result of motif discovery (under the dashed line). (A) The first mechanism is through a change in the cofactor. This may be detected through co-occurrence of motifs of different factors. (B) The second mechanism is through a change in the specificity to the DNA. This change can be traced by identifying variations in the DNA motif. (C) The third mechanism is a change in the chromatin state. This change cannot be traced using motif analysis. (0.98 MB TIF)

Table S1

Comparison of the Yeast motif library to previous works. (0.02 MB XLS)

Table S2

Transcription factors with condition-dependent binding which alter their targets. (0.02 MB XLS)

Acknowledgments

We thank Aviv Regev, Andrew Capaldi, and the members of Nir Friedman's and Hanah Margalit's labs for useful discussions related to this work.
We thank Yoseph Barash for his help with the SeedSearcher motif discovery algorithm, Kenzie MacIsaac for help with the TAMO package, and Takis Benos for providing us with the data set of Mahony et al. [24].

Footnotes

The authors have declared that no competing interests exist.

This study was supported by grants from the Israeli Science Foundation administered by the Israeli Academy of Sciences and Humanities (HM), The Israeli Cancer Research Foundation (HM), the Human Frontiers Science Program (NF), and the National Institutes of Health (NF).

References

1. Stormo G. DNA binding sites: representation and discovery. Bioinformatics. 2000;16:16–23.
2. Liu X, Brutlag D, Liu J. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol. 2002;20:835–839.
3. Bailey T, Elkan C. The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol. 1995;3:21–29.
4. Kaplan T, Friedman N, Margalit H. Ab initio prediction of transcription factor targets using structural knowledge. PLoS Comput Biol. 2005;1:e1.
5. Morozov A, Havranek J, Baker D, Siggia E. Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 2005;33:5781–5798.
6. MacIsaac K, Fraenkel E. Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol. 2006;2:e36.
7. Osada R, Zaslavsky E, Singh M. Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics. 2004;20:3516–3525.
8. Day W, McMorris F. Critical comparison of consensus methods for molecular sequences. Nucleic Acids Res. 1992;20:1093–1099.
9. Benos P, Bulyk M, Stormo G. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. 2002;30:4442–4451.
10. Barash Y, Elidan G, Kaplan T, Friedman N. 2003. Modeling Dependencies in Protein-DNA Binding Sites.
11. Hughes J, Estep P, Tavazoie S, Church G. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000;296:1205–1214.
12. Xie X, Lu J, Kulbokas E, Golub T, VM, Lindblad-Toh K, et al. Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature. 2005;434:338–345.
13. Harbison C, Gordon D, Lee T, Rinaldi N, Macisaac K, et al. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431:99–104.
14. Lin J. Divergence measures based on the Shannon entropy. 1991:145–151.
15. Yona G, Levitt M. Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol. 2002;315:1257–1275.
16. Wang T, Stormo GD. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics. 2003;19:2369–2380.
17. Sjolander K, Karplus K, Brown M, Hughey R, AK, Mian I, et al. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci. 1996;12:327–345.
18. Xing EP, Jordan MI, Karp RM, Russell R. 2002. A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences. NIPS 15. MIT Press.
19. Gordon D, Nekludova L, McCallum S, Fraenkel E. TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs. Bioinformatics. 2005;21:3164–3165.
20. DeGroot M. Optimal Statistical Decisions. New York: McGraw-Hill; 1970.
21. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8:R24.
22. Bailey TL, Gribskov M. Methods and statistics for combining motif match scores. J Comput Biol. 1998;5:211–221.
23. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. J ACM Comput Surv. 1999;31:264–323.
24. Mahony S, Auron PE, Benos PV. DNA familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Comput Biol. 2007;3:e61.
25. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004;32:D91–D94.
26. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003;31:374–378.
27. Zhu J, Zhang M. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics. 1999;15:607–611.
28. Csank C, Costanzo M, Hirschman J, Hodges P, Kranz J, et al. Three yeast proteome databases: YPD, PombePD, and CalPD (MycoPathPD). Methods Enzymol. 2002;350:347–373.
29. Harris M, Clark J, Ireland A, Lomax J, Ashburner M, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32:D258–D261.
30. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997;278:680–686.
31. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000;11:4241–4257.
32. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998;9:3273–3297.
33. Friedman N, Jaimovich A. 2008. EdgeCluster: Probabilistic Agglomerative Clustering of Genes with Relational Observations. School of Computer Science and Engineering, The Hebrew University TR-2008-2.
34. MacIsaac K, Wang T, Gordon D, Gifford D, Stormo G, et al. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics. 2006;7:113.
35. Martin D, Soulard A, Hall M. TOR regulates ribosomal protein gene expression via PKA and the Forkhead transcription factor FHL1. Cell. 2004;119:969–979.
36. Zeitlinger J, Simon I, Harbison C, Hannett N, Volkert T, et al. Program-specific distribution of a transcription factor dependent on partner transcription factor and MAPK signaling. Cell. 2003;113:395–404.
37. Chou S, Lane S, Liu H. Regulation of mating and filamentation genes by two distinct Ste12 complexes in Saccharomyces cerevisiae. Mol Cell Biol. 2006;26:4794–4805.
38. Madhani H, Fink G. Combinatorial control required for the specificity of yeast MAPK signaling. Science. 1997;275:1314–1317.
39. Chou S, Huang L, Liu H. Fus3-regulated Tec1 degradation through SCFCdc4 determines MAPK signaling specificity during mating in yeast. Cell. 2004;119:981–990.
40. Schaber J, Kofahl B, Kowald A, Klipp E. A modelling approach to quantify dynamic crosstalk between the pheromone and the starvation pathway in baker's yeast. FEBS J. 2006;273:3520–3533.
41. Wang Y, Dohlman H. Pheromone-regulated sumoylation of transcription factors that mediate the invasive to mating developmental switch in yeast. J Biol Chem. 2006;281:1964–1969.
42. Gelli A. Rst1 and Rst2 are required for the a/alpha diploid cell type in yeast. Mol Microbiol. 2002;46:845–854.
43. Courel M, Lallet S, Camadro J, Blaiseau P. Direct activation of genes involved in intracellular iron use by the yeast iron-responsive transcription factor Aft2 without its paralog Aft1. Mol Cell Biol. 2005;25:6760–6771.
44. Rutherford J, Jaron S, Ray E, Brown P, Winge D. A second iron-regulatory system in yeast independent of Aft1p. Proc Natl Acad Sci U S A. 2001;98:14322–14327.
45. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998;95:14863–14868.
46. Pietrokovski S. Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res. 1996;24:3836–3845.
47. Roepcke S, Grossmann S, Rahmann S, Vingron M. T-Reg Comparator: an analysis tool for the comparison of position weight matrices. Nucleic Acids Res. 2005;33:W438–W441.
48. Thijs G, Marchal K, Lescot M, Rombauts S, Moor B, et al. A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J Comput Biol. 2002;9:447–464.
49. Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, et al. Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res. 2003;31:1753–1764.
50. Choi IG, Kwon J, Kim SH. Local feature frequency profile: a method to measure structural similarity in proteins. Proc Natl Acad Sci U S A. 2004;101:3797–3802.
51. Mahony S, Benos PV. STAMP: a web tool for exploring DNA-binding motif similarities. Nucleic Acids Res. 2007;35:W253–W258.
52. Che D, Jensen S, Cai L, Liu JS. BEST: binding-site estimation suite of tools. Bioinformatics. 2005;21:2909–2911.
53. Barash Y, Elidan G, Kaplan T, Friedman N. CIS: compound importance sampling method for protein-DNA binding site p-value estimation. Bioinformatics. 2005;21:596–600.
54. Bulyk M, Johnson P, Church G. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 2002;30:1255–1261.
55. Kellis M, Patterson N, Endrizzi M, Birren B, Lander E. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003;423:241–254.
56. Barash Y. Unified Models for Regulatory Mechanisms. PhD thesis. Jerusalem: Hebrew University; 2005.
57. Siepel A, Bejerano G, Pedersen J, Hinrichs A, Hou M, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050.
58. Karolchik D, Baertsch R, Diekhans M, Furey T, Hinrichs A, et al. The UCSC Genome Browser Database. Nucleic Acids Res. 2003;31:51–54.