Methods Mol Biol. Author manuscript; available in PMC 2009 June 27.
Published in final edited form as:
doi: 10.1007/978-1-59745-251-9_13.
PMCID: PMC2702474
NIHMSID: NIHMS112925
Promoter Analysis: Gene Regulatory Motif Identification with A-GLAM
Leonardo Mariño-Ramírez, Kannan Tharakaraman, John L. Spouge, and David Landsman
Reliable detection of cis-regulatory elements in promoter regions is a difficult and unsolved problem in computational biology. The intricacy of transcriptional regulation in higher eukaryotes, primarily in metazoans, could be a major driving force of organismal complexity. Eukaryotic genome annotations have improved greatly due to large-scale characterization of full-length cDNAs, transcriptional start sites (TSSs), and comparative genomics. Regulatory elements are identified in promoter regions using a variety of enumerative or alignment-based methods. Here we present a survey of recent computational methods for eukaryotic promoter analysis and describe the use of an alignment-based method implemented in the A-GLAM program.
Keywords: Promoter regions, transcription factor binding sites, enumerative methods, promoter comparison
The establishment and maintenance of temporal and spatial patterns of gene expression are achieved primarily by transcription regulation. Additionally, the precise control of timing and location of gene expression depends on the interaction between transcription factors and cis-acting sequence elements in promoter regions. Transcription factors can induce or repress gene expression upon binding of their cognate sequence element in the DNA. The discovery and categorization of the entire collection of transcription factor-binding sites (TFBSs) of an organism are among the greatest challenges in computational biology (1). Large-scale efforts involving genome mapping and identification of TFBS in lower eukaryotes, such as the yeast Saccharomyces cerevisiae, have been successful (2). In contrast, similar efforts in vertebrates have proven difficult due to the presence of repetitive elements and an increased regulatory complexity (3-5).
The accurate prediction and identification of regulatory elements in higher eukaryotes remains a challenge for computational biology, despite recent progress in the development or improvement of different algorithms (6-19). Different strategies for motif recognition have been benchmarked to compare their performance (20). Typically, computational methods for identifying cis-regulatory elements in promoter sequences fall into two classes, enumerative and alignment techniques (21). We have developed algorithms that use enumerative approaches to identify cis-regulatory elements that are statistically significantly over-represented in promoter regions (22). Subsequently, we developed an algorithm that combines both enumeration and alignment techniques to identify statistically significant cis-regulatory elements that are positionally clustered relative to a specific genomic landmark (23,24).
Promoter identification is the first step in the computational analysis that leads to the prediction of regulatory elements. In lower eukaryotes this is a rather simple problem because gene density is relatively high with respect to genome size. The yeast Saccharomyces cerevisiae has ~70% of its genome coding for proteins, and its intergenic regions are fairly short (~440 bp in length) (25). In contrast, the human genome has a relatively low gene density, with ~3% of the genome coding for proteins (26); this poses significant challenges for the identification of both the promoter and its regulatory elements. Despite the complexity of gene expression regulation in higher eukaryotes (27), we now have a number of experimental and computational resources that can assist in the delineation of mammalian promoter regions. The experimental resources include full-length cDNA collections (28) and transcriptional start sites (TSSs) (29). Additionally, complementary computational resources include the database of transcriptional start sites (DBTSS) (30) and promoter identification services (31-33). Many regulatory elements are located in the proximal promoter region (PPR), a few hundred bases upstream of the TSS (22), and the PPR can be generally defined by its low transposable element content (34).
The computational methods for the prediction and identification of transcription factor binding sites can be divided into two broad categories: algorithms for de novo identification and algorithms that recognize elements using prior knowledge of the elements. Enumerative and alignment methods form part of the de novo algorithms. Enumerative algorithms use exhaustive methods to examine exact DNA words of a fixed length and rank them according to a specific function that determines over-representation relative to a background distribution. An enumerative method that estimates p-values with the standard normal approximation associated with z-scores (22) has been successfully applied for the identification of regulatory elements in higher eukaryotes (35). Other enumerative methods include Weeder (16, 17), oligonucleotide frequency analysis (36), and QuickScore (14).
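The enumerative idea can be illustrated with a minimal Python sketch. It is not the statistic used in (22): it assumes an independent-base background and a simple binomial approximation that ignores correlations between overlapping words. The function names and the input format are hypothetical.

```python
import math
from collections import Counter

def word_zscores(sequences, k, background):
    """Return {word: z-score} for every k-mer observed in the sequences.

    Assumption: each of the N windows contains a given word with
    probability p = product of the background frequencies of its letters,
    so the count is treated as Binomial(N, p).
    """
    counts = Counter()
    n_windows = 0
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            word = seq[start:start + k]
            # Windows containing ambiguous letters are skipped.
            if all(base in background for base in word):
                counts[word] += 1
                n_windows += 1
    zscores = {}
    for word, observed in counts.items():
        p = math.prod(background[base] for base in word)
        expected = n_windows * p
        variance = n_windows * p * (1.0 - p)
        zscores[word] = (observed - expected) / math.sqrt(variance)
    return zscores

def upper_tail_pvalue(z):
    """One-sided p-value from the standard normal approximation."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

Words can then be ranked by decreasing z-score; with many words tested, the resulting p-values would still need a multiple-testing correction.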
Alignment methods aim to identify functional elements by a multiple local alignment of all sequences of interest. The most popular algorithms in this category use an optimization procedure based on probabilistic sequence models to find statistically significant motifs; these include Gibbs sampling (37) and expectation maximization (11). Approaches that use a combination of enumerative and alignment methods have shown a significant improvement in the identification of regulatory elements in promoter sequences (23, 24).
Algorithms that use prior knowledge of known motifs often use position frequency matrices (PFMs) that contain the number of observed nucleotides at each position (38). Methods that assess statistical over-representation of known motifs in a set of sequences have been particularly successful (9). Additionally, motif scores determined by over-representation can be used as a proxy to perform promoter comparisons (39).
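As a small illustration of a position frequency matrix, the following Python sketch counts nucleotides column by column in a set of aligned sites of equal length. The function name and the example sites are hypothetical and are not taken from any motif database.

```python
def position_frequency_matrix(sites, alphabet="ACGT"):
    """Count each nucleotide at each position of equal-length aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in alphabet}
    for site in sites:
        for position, base in enumerate(site.upper()):
            if base in pfm:  # ambiguous codes such as N are skipped
                pfm[base][position] += 1
    return pfm

# Hypothetical TATA-like sites, for illustration only.
example = position_frequency_matrix(["TATA", "TATT", "TACA"])
```

Each row of the result holds the counts of one nucleotide across the motif positions.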
2.1. The A-GLAM Algorithm
The A-GLAM software package uses a Gibbs sampling algorithm to identify functional motifs in a set of sequences. Gibbs sampling, also known as successive substitution sampling, is a well-known Markov-chain Monte Carlo procedure for discovering sequence motifs (37). In brief, A-GLAM takes a set of sequences in FASTA format as input. The Gibbs sampling step in A-GLAM uses simulated annealing to maximize an “overall score,” corresponding to a Bayesian marginal log-odds score. The overall score is given by
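$$ s = \sum_{i=1}^{w} \left\{ \log_2 \frac{(a-1)!}{(c+a-1)!} + \sum_{j=1}^{4} \left[ \log_2 \frac{(c_{ij}+a_j-1)!}{(a_j-1)!} - c_{ij} \log_2 p_j \right] \right\} \tag{1} $$
In equation (1), m! = m(m − 1)…1 denotes a factorial; w, the width of the alignment window; aj, the pseudo-counts for nucleic acid j in each position; a = a1 + a2 + a3 + a4, the total pseudo-counts in each position; cij, the count of nucleic acid j in position i; c = ci1 + ci2 + ci3 + ci4, the total number of aligned windows, which is independent of the position i; and pj, the background frequency of nucleic acid j. The underlying principle behind the overall score s in A-GLAM is explained in detail elsewhere (23).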
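A minimal Python sketch of the overall score follows. It assumes that the score has the Bayesian marginal log-odds form, summed over the w alignment columns, of log2[(a − 1)!/(c + a − 1)!] plus, for each nucleic acid j, log2[(cij + aj − 1)!/(aj − 1)!] − cij log2 pj, where pj is the background frequency. Factorials are evaluated through the gamma function so that non-integer pseudo-counts are allowed; that choice, the function names, and the data layout are assumptions of this sketch, not a description of the A-GLAM source code.

```python
import math

def log2_factorial(m):
    """log2(m!) via the gamma function, so m may be non-integer (m > -1)."""
    return math.lgamma(m + 1.0) / math.log(2.0)

def overall_score(counts, pseudocounts, background):
    """Overall score s of a gapless alignment.

    counts:       list of dicts, one per alignment column, base -> c_ij
    pseudocounts: dict base -> a_j
    background:   dict base -> p_j
    """
    a = sum(pseudocounts.values())
    score = 0.0
    for column in counts:
        c = sum(column.values())
        score += log2_factorial(a - 1.0) - log2_factorial(c + a - 1.0)
        for base, a_j in pseudocounts.items():
            c_ij = column.get(base, 0)
            score += log2_factorial(c_ij + a_j - 1.0) - log2_factorial(a_j - 1.0)
            score -= c_ij * math.log2(background[base])
    return score
```

With one pseudo-count per base and a uniform background, a single column holding one A scores 0 bits and a column holding two A's scores log2(1.6) bits, as expected from a log-odds score that rewards columns whose composition departs from the background.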
The annealing maximization is initialized when A-GLAM places a single window of arbitrary size and position on every sequence, generating a gapless multiple alignment of the windowed subsequences. The program then proceeds through a series of iterations; on each iteration step, A-GLAM proposes a set of adjustments to the alignment. The proposal step is either a repositioning step or a resizing step. In a repositioning step, a single sequence is chosen uniformly at random from the alignment, and the set of adjustments includes all possible positions in the sequence where the alignment window would fit without overhanging the ends of the sequence. In a resizing step, either the right or the left end of the alignment window is selected, and the set of proposed adjustments includes expanding or contracting the corresponding end of all alignment windows by one position at a time. Each adjustment leads to a different value of the overall score s. Then, A-GLAM accepts one of the adjustments randomly, with probability proportional to exp(s/T). A-GLAM may even exclude a sequence if doing so would improve alignment quality. The temperature T is gradually lowered toward T = 0, with the intent of finding the gapless multiple alignment of the windows that maximizes s. The maximization implicitly determines the final window size. The randomness in the algorithm helps it escape local maxima and approach the global maximum of s. However, due to the stochastic nature of the procedure, finding the optimum alignment is not guaranteed.
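The acceptance rule, choosing one proposed adjustment with probability proportional to exp(s/T), can be sketched in Python as follows. The function names are hypothetical, and the sketch covers only the selection step, not the proposal of adjustments or the cooling schedule. It requires T > 0; as T approaches 0 the choice concentrates on the highest-scoring adjustment.

```python
import math
import random

def acceptance_probabilities(scores, temperature):
    """Probabilities proportional to exp(s / T) for each proposed adjustment."""
    best = max(scores)
    # Subtracting the maximum avoids overflow and leaves the ratios unchanged.
    weights = [math.exp((s - best) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def choose_adjustment(scores, temperature, rng=random):
    """Pick the index of one adjustment at random with those probabilities."""
    probabilities = acceptance_probabilities(scores, temperature)
    return rng.choices(range(len(scores)), weights=probabilities, k=1)[0]
```

At high temperature the probabilities are nearly uniform, which lets the alignment move away from a local maximum; at low temperature the rule behaves almost greedily.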
In the default mode, A-GLAM repeats the annealing maximization procedure ten times from different starting points (ten runs). The rationale behind this is that if several of the runs converge to the same best alignment, the user has increased confidence that it is indeed the optimum alignment.
The individual score and its E-value in A-GLAM: The Gibbs sampling step produces an alignment whose overall score s is given by equation (1). Consider a window of length w that is about to be added to A-GLAM’s alignment. Let δi(j) equal 1 if the window has nucleic acid j in position i, and 0 otherwise. The addition of the new window changes the overall score by
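$$ \Delta s = \sum_{i=1}^{w} \sum_{j=1}^{4} \delta_i(j) \log_2 \left\{ \frac{(c_{ij}+a_j)/(c+a)}{p_j} \right\} \tag{2} $$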
The score change corresponds to scoring the new window according to a position specific scoring matrix (PSSM) that assigns the “individual score”
$$ s_i(j) = \log_2 \left\{ \frac{(c_{ij}+a_j)/(c+a)}{p_j} \right\} \tag{3} $$
to nucleic acid j in position i. Equation (3) represents a log-odds score for nucleic acid j in position i under an alternative hypothesis with probability (cij + aj)/(c + a) and a null hypothesis with probability pj. PSI-BLAST (40) uses equation (3) to calculate E-values. The derivation through equation (2) confirms the PSSM in equation (3) as the natural choice for evaluating individual sequences.
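The log-odds PSSM described here, with alternative probability (cij + aj)/(c + a) and a background null probability for each nucleic acid, can be sketched in Python as follows. The base-2 logarithm, the function names, and the dictionary layout are assumptions of this sketch.

```python
import math

def build_pssm(counts, pseudocounts, background):
    """Log-odds PSSM: log2(((c_ij + a_j) / (c + a)) / p_j) per column and base."""
    a = sum(pseudocounts.values())
    pssm = []
    for column in counts:
        c = sum(column.values())
        pssm.append({
            base: math.log2((column.get(base, 0) + a_j) / (c + a) / background[base])
            for base, a_j in pseudocounts.items()
        })
    return pssm

def window_score(pssm, window):
    """Individual score of one window: the sum of the PSSM entries it selects."""
    if len(window) != len(pssm):
        raise ValueError("window length must equal the PSSM width")
    return sum(column[base] for column, base in zip(pssm, window.upper()))
```

For example, with one pseudo-count per base, a uniform background, and column counts {A: 3, C: 1} and {T: 4}, the window AT scores 1 + log2(2.5) bits, whereas GG scores −2 bits.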
The assignment of an E-value to a subsequence with a particular individual score is done as follows. Consider the alignment sequence containing the subsequence. Let n be the sequence length, and recall that w is the window size. If ΔSi denotes the quantity in equation (2) when the final letter in the window falls at position i of the alignment sequence, then ΔS* = max{ΔSi : i = w, …, n} is the maximum individual score over all sequence positions i. We assign an E-value to the actual value ΔS* = Δs* as follows. Staden’s method (41) yields the probability P{ΔSi ≥ Δs*} (independent of i) under the null hypothesis of bases chosen independently and randomly from the frequency distribution {pj}. The E-value E = (n − w + 1) P{ΔSi ≥ Δs*} is therefore the expected number of sequence positions with an individual score of at least Δs*. The factor n − w + 1 in E is essentially a multiple test correction.
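The E-value calculation can be illustrated with a short Python sketch. The tail probability is obtained by convolving the per-position score distributions under an independent-base background, in the spirit of Staden's method (41); restricting the PSSM to integer scores, so that equal totals merge exactly, is a simplifying assumption of this sketch, as are the function names.

```python
from collections import defaultdict

def score_tail_probability(pssm, background, threshold):
    """P(window score >= threshold) for bases drawn independently from background.

    pssm: list of dicts, one per position, base -> integer score.
    """
    distribution = {0: 1.0}  # total score -> probability
    for column in pssm:
        updated = defaultdict(float)
        for total, probability in distribution.items():
            for base, p in background.items():
                updated[total + column[base]] += probability * p
        distribution = updated
    return sum(p for total, p in distribution.items() if total >= threshold)

def e_value(n, w, tail_probability):
    """Expected number of the n - w + 1 window positions reaching the score."""
    if n < w:
        raise ValueError("sequence is shorter than the window")
    return (n - w + 1) * tail_probability
```

As a check on the arithmetic, a two-column matrix that scores 1 for A and 0 otherwise gives a tail probability of 1/16 for a score of 2 under a uniform background, so a sequence of length 10 has E = 9/16.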
More recently, the A-GLAM package has been improved to allow the identification of multiple instances of an element within a target sequence (24). The optional “scanning step” after Gibbs sampling produces a PSSM given by equation (3). The new scanning step resembles an iterative PSI-BLAST search based on the PSSM (Fig. 13.1). First, it assigns an “individual score” to each subsequence of appropriate length within the input sequences using the initial PSSM. Second, it computes an E-value from each individual score to assess the agreement between the corresponding subsequence and the PSSM. Third, it permits subsequences with E-values falling below a threshold to contribute to the underlying PSSM, which is then updated using the Bayesian calculus. A-GLAM iterates its scanning step to convergence, at which point no new subsequences contribute to the PSSM. After convergence, A-GLAM reports predicted regulatory elements within each sequence in order of increasing E-values; users then have a statistical evaluation of the predicted elements in a convenient presentation. Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per input sequence, the scanning step can now rapidly locate further instances of the element in each sequence.
Fig. 13.1. Strategy to find multiple motif instances in A-GLAM
2.2. Hardware
The minimum hardware requirements are a personal computer with at least 512 MB of random access memory (RAM) connected to the Internet, as well as access to a Linux or UNIX workstation where A-GLAM will be installed. The connection between the personal computer and the workstation is typically established with the Secure Shell (SSH) protocol; a widely used open-source implementation of the protocol is available at http://www.openssh.org/.
2.3. Software
A modern version of the Perl programming language (freely available at http://www.perl.com/) installed on the Linux or UNIX workstation will allow the user to parse A-GLAM’s output. The A-GLAM package (23), freely available at http://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM/, is currently distributed as source code and as binary packages for the Linux operating system.
Installation of the Linux binary: get the executable from the FTP site and set execute permissions.
$chmod +x aglam
Installation from source: unpack the glam archive in a convenient location and compile A-GLAM.
$tar -zxvf aglam.tar.gz
$cd aglam
$make aglam
You can then place the binary in a directory on your path, such as $HOME/bin or /usr/local/bin/.
2.4. Data Files
A-GLAM accepts input data in FASTA format containing the sequences to be analyzed. The FASTA format consists of one or more sequences identified by a line beginning with the “>” character that should include a unique identifier and a short description about the sequence. The next line(s) should contain the sequence string. A-GLAM expects the standard nucleic acid IUPAC code.
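Before running A-GLAM it can be useful to check that an input file follows this layout. The following Python sketch reads FASTA records from an iterable of lines; the function names are hypothetical, and splitting the header at the first space into identifier and description is an assumption of the sketch rather than an A-GLAM requirement.

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _record(header, chunks)

def _record(header, chunks):
    """Split the header line and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks).upper()
```

Passing an open file handle, for example `list(parse_fasta(open("myseqs.fa")))`, returns all records; duplicate identifiers or empty sequences can then be detected before the motif search.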
2.5. A-GLAM Options
Some important options to modify A-GLAM’s behavior are described below:
$aglam <fasta_file.fa>
This command simply uses the standard Gibbs sampling procedure to find sequence motifs in “fasta_file.fa.”
$aglam <fasta_file.fa> -n 30000 -a 8 -b 16 -j
These options instruct the program to search only the given strand of the sequences (-j) for motifs between 8 and 16 bp long (-a and -b). The -n flag sets the number of iterations performed in each of the ten runs; more precisely, each run ends after this many iterations without improvement. Low values of n are adequate when the problem size is small, i.e., when the sequences are short and, more importantly, there are few of them, but high values of n are needed for large problems. In addition, smaller values of n are sufficient when there is a strong alignment to be found, but larger values are necessary when there is no strong alignment, e.g., for finding the optimal alignment of random sequences. You will have to choose n on a case-by-case basis. This parameter also controls the tradeoff between speed and accuracy.
$aglam <fasta_file.fa> -i TATA
This important option sets the program to run in a “seed” oriented mode. The above command restricts the search to sequences that are TATA-like. Unlike the procedure followed in the standard Gibbs sampling algorithm, however, A-GLAM continues to align one exact copy of the “seed” in all “seed sequences.” Therefore, A-GLAM uses the seed sequences to direct its search in the remaining non-seed “target sequences.” Using this option leads to the global optimum quickly.
$aglam <fasta_file.fa> -l -k 0.05 -g 2000
Usable only with version 1.1. These options instruct the program to find multiple motif instances in each input sequence via the scanning step (described above). Instances that receive an E-value less than 0.05 are included in the PSSM. The search for multiple instances continues until either (a) no new instances are found or (b) the user-specified number of iterations (here, 2000) is reached, whichever comes first.
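When many datasets are analyzed, the command lines above can be assembled programmatically. The Python sketch below builds an argument list using only flags described in this chapter; the function name and keyword names are hypothetical, and placing the options before the FASTA file follows the program's usage summary rather than the examples above.

```python
def build_aglam_command(fasta_path, iterations=None, min_width=None,
                        max_width=None, single_strand=False, seed_word=None,
                        multiple_instances=False, evalue_cutoff=None,
                        scan_iterations=None, executable="aglam"):
    """Assemble an A-GLAM argument list from the flags described in the text."""
    command = [executable]
    if iterations is not None:
        command += ["-n", str(iterations)]       # iterations per run
    if min_width is not None:
        command += ["-a", str(min_width)]        # minimum alignment width
    if max_width is not None:
        command += ["-b", str(max_width)]        # maximum alignment width
    if single_strand:
        command.append("-j")                     # examine only one strand
    if seed_word is not None:
        command += ["-i", seed_word]             # word seed query
    if multiple_instances:
        command.append("-l")                     # scanning step
    if evalue_cutoff is not None:
        command += ["-k", str(evalue_cutoff)]    # E-value cutoff for the PSSM
    if scan_iterations is not None:
        command += ["-g", str(scan_iterations)]  # scanning-step iterations
    command.append(fasta_path)
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a workstation where the aglam binary is on the path.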
The A-GLAM package includes documentation and test datasets. Here, we will use a dataset obtained from a large-scale chromatin immunoprecipitation in Saccharomyces cerevisiae (2), combined with DNA microarrays (42) to detect interactions between transcription factors and a DNA sequence in vivo. The DNA sequence binding specificity of various transcription factors can then be inferred using A-GLAM on intergenic regions bound by a particular transcription factor. Here, we will use the intergenic regions bound by Snt2p (see Note 1).
3.1. Promoter Identification
The Saccharomyces Genome Database (SGD) maintains the most current annotations of the yeast genome (see http://www.yeastgenome.org/). The SGD FTP site contains the DNA sequences annotated as intergenic regions in FASTA format (available at ftp://genome-ftp.stanford.edu/pub/yeast/sequence/genomic_sequence/intergenic/), indicating the 5′ and 3′ flanking features. Additionally, a tab-delimited file with the annotated features of the genome is necessary to determine the orientation of the intergenic regions relative to the genes (available at ftp://genome-ftp.stanford.edu/pub/yeast/chromosomal_feature/). The two files can be used to extract upstream intergenic regions. Additionally, there are a number of Web services that facilitate the identification of proximal promoters in mammalian genomes; these include TRED (32), EPD (33), and Promoser (31).
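Once the strand of the downstream gene is known, each upstream intergenic region can be oriented so that it reads 5′ to 3′ toward the gene. The Python sketch below assumes that intergenic sequences are stored on the forward (Watson) strand and that the gene's strand is supplied as "+" or "W" for forward and "-" or "C" for reverse; these labels and the function names are assumptions, not a description of the SGD file formats.

```python
# Complement table for the IUPAC nucleotide codes, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVNacgtrykmswbdhvn",
                            "TGCAYRMKSWVHDBNtgcayrmkswvhdbn")

def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]

def orient_upstream(sequence, gene_strand):
    """Orient an intergenic region so that it reads toward the gene start."""
    if gene_strand in ("-", "C"):
        return reverse_complement(sequence)
    if gene_strand in ("+", "W"):
        return sequence
    raise ValueError("unrecognized strand label: %r" % (gene_strand,))
```

Orienting all promoters the same way matters when A-GLAM is run with the -j option, which examines only the given strand.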
3.2. Identification of cis-Regulatory Elements in Promoter Regions
Construct FASTA files for each of the promoters to be included in the analysis. The Perl programming language can be used in conjunction with BioPerl libraries (freely available at http://www.bioperl.org/) to generate files in FASTA format. In this particular example, all relevant files can be found on the Fraenkel Web site at http://fraenkel.mit.edu/Harbison/release_v24.
The A-GLAM package has a number of options that can be used to adjust search parameters.
$aglam
Usage summary: aglam [options] myseqs.fa
Options:
-h help: print documentation
-n end each run after this many iterations without improvement (10000)
-r number of alignment runs (10)
-a minimum alignment width (3)
-b maximum alignment width (10000)
-j examine only one strand
-i word seed query ()
-f input file containing positions of the motifs ()
-z turn off ZOOPS (force every sequence to participate in the alignment)
-v print all alignments in full
-e turn off sorting individual sequences in an alignment on p-value
-q pretend residue abundances = 1/4
-d frequency of width-adjusting moves (1)
-p pseudocount weight (1.5)
-u use uniform pseudocounts: each pseudocount = p/4
-t initial temperature (0.9)
-c cooling factor (1)
-m use modified Lam schedule (default = geometric schedule)
-s seed for random number generator (1)
-w print progress information after each iteration
-l find multiple instances of motifs in each sequence
-k add instances of motifs that satisfy the cutoff e-value (0)
-g number of iterations to be carried out in the post-processing step (1000)
Run A-GLAM to identify regulatory elements present in the promoter regions bound by Snt2p. A-GLAM uses sequences in FASTA format as input. There are 46 intergenic regions bound by Snt2p that were identified by ChIP-chip in a large-scale study (2). These regions vary in length from 71 to 1,512 bp, with an average of 398 bp. A-GLAM is able to identify statistically significant motifs for Snt2p and rank them according to their individual p-values. A-GLAM has a number of useful command line options that can be adjusted to improve ab initio motif finding; in this example we have restricted the search to motifs no larger than 20 bp and instructed the program to find multiple instances of motifs in each sequence using a strategy that resembles an iterative PSI-BLAST search based on the PSSM constructed by the Gibbs sampling step (24). The output of the A-GLAM program is presented in Fig. 13.2. In the default mode, A-GLAM repeats the annealing maximization procedure ten times from different starting points (ten runs); if several of the runs converge to the same best alignment, the user has increased confidence that it is indeed the optimum alignment. The user can adjust the number of alignment runs by setting the -r flag (see Note 2). The number of iterations can also be adjusted for large datasets: by default each run ends after 10,000 iterations without alignment improvement, and the -n flag can be used to increase this number to extend coverage of the sequence space.
Fig. 13.2. A-GLAM output for a set of sequences containing an Snt2p motif identified using ChIP-chip
A-GLAM identifies candidate sequences that could serve as Snt2p binding sites. The candidate sequences found by A-GLAM agree with previous findings obtained with other motif-finding algorithms (2) (Fig. 13.3). Additional examples where we have successfully used A-GLAM to complement experimental efforts for the identification of regulatory elements include motifs for Spt10p in yeast and the CREB-binding protein (34, 35). In this particular example, the program constructs a PSSM using the sequences from the optimal alignment to find multiple instances (see Note 3). The multiple alignments produced by A-GLAM can be represented graphically by sequence logos (43, 44) (see Note 4).
Fig. 13.3. Snt2p regulatory motif identified with A-GLAM
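The letter-stack heights in a sequence logo reflect the information content of each alignment column. The Python sketch below computes that quantity as log2(4) minus the Shannon entropy of the observed base frequencies; it omits the small-sample correction used in sequence logos (43), and the function name is hypothetical.

```python
import math

def column_information(column_counts, alphabet="ACGT"):
    """Information content of one alignment column, in bits."""
    total = sum(column_counts.get(base, 0) for base in alphabet)
    if total == 0:
        return 0.0
    entropy = 0.0
    for base in alphabet:
        frequency = column_counts.get(base, 0) / total
        if frequency > 0:
            entropy -= frequency * math.log2(frequency)
    return math.log2(len(alphabet)) - entropy
```

A perfectly conserved column carries 2 bits, a column split evenly between two bases carries 1 bit, and a column with all four bases equally frequent carries 0 bits.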
  1. The primary data can be obtained from the Fraenkel Laboratory Web site at http://fraenkel.mit.edu/Harbison/.
  2. The number of alignment runs is 10 by default; however, the user can increase the number of runs to boost the confidence of the results. The user has the option -v to print all alignments generated in each run; by default A-GLAM will report only the highest scoring alignment.
  3. Alternatively, the user could run A-GLAM without the -l flag and construct a position frequency matrix that in turn could be used to scan the target sequences for additional instances of the motif. The TFBS Perl modules for transcription factor binding detection and analysis provide a flexible and powerful framework (available at http://tfbs.genereg.net/).
  4. Other Web servers for logo generation include enoLOGOS (available on the Web at http://biodev.hgen.pitt.edu/enologos/) and Pictogram (http://genes.mit.edu/pictogram.html).
Acknowledgments
This research was supported by the Intramural Research Program of the NIH, NLM, NCBI.
1. Elnitski L, Jin VX, Farnham PJ, Jones SJ. Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques. Genome Res. 2006;16:1455–64.
2. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, et al. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431:99–104.
3. Bieda M, Xu X, Singer MA, Green R, Farnham PJ. Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome Res. 2006;16:595–605.
4. Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ, et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell. 2004;116:499–509.
5. Guccione E, Martinato F, Finocchiaro G, Luzi L, Tizzoni L, Dall’Olio V, Zardo G, Nervi C, Bernard L, Amati B. Myc-binding-site recognition in the human genome is determined by chromatin context. Nat Cell Biol. 2006;8:764–70.
6. Hughes JD, Estep PW, Tavazoie S, Church GM. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000;296:1205–14.
7. Workman CT, Stormo GD. ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput. 2000;5:467–78.
8. Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999;15:563–77.
9. Frith MC, Fu Y, Yu L, Chen JF, Hansen U, Weng Z. Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res. 2004;32:1372–81.
10. Ao W, Gaudet J, Kent WJ, Muttumu S, Mango SE. Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR. Science. 2004;305:1743–6.
11. Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994;2:28–36.
12. Eskin E, Pevzner PA. Finding composite regulatory patterns in DNA sequences. Bioinformatics. 2002;18(Suppl 1):S354–63.
13. Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P, Moreau Y. A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics. 2001;17:1113–22.
14. Régnier M, Denise A. Rare events and conditional events on random strings. Discrete Math Theor Comput Sci. 2004;6:191–214.
15. Favorov AV, Gelfand MS, Gerasimova AV, Ravcheev DA, Mironov AA, Makeev VJ. A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length. Bioinformatics. 2005;21:2240–5.
16. Pavesi G, Mereghetti P, Zambelli F, Stefani M, Mauri G, Pesole G. MoD Tools: regulatory motif discovery in nucleotide sequences from coregulated or homologous genes. Nucleic Acids Res. 2006;34:W566–70.
17. Pavesi G, Zambelli F, Pesole G. WeederH: an algorithm for finding conserved regulatory motifs and regions in homologous sequences. BMC Bioinformatics. 2007;8:46.
18. Sinha S, Tompa M. YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2003;31:3586–8.
19. Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, Lefebvre C, Deblois G, Giguere V, Ferretti V, Bergeron D, et al. Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res. 2006;16:656–68.
20. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23:137–44.
21. Ohler U, Niemann H. Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet. 2001;17:56–60.
22. Marino-Ramirez L, Spouge JL, Kanga GC, Landsman D. Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Res. 2004;32:949–58.
23. Tharakaraman K, Marino-Ramirez L, Sheetlin S, Landsman D, Spouge JL. Alignments anchored on genomic landmarks can aid in the identification of regulatory elements. Bioinformatics. 2005;21(Suppl 1):i440–8.
24. Tharakaraman K, Marino-Ramirez L, Sheetlin S, Landsman D, Spouge JL. Scanning sequences after Gibbs sampling to find multiple occurrences of functional elements. BMC Bioinformatics. 2006;7:408.
25. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, et al. Life with 6000 genes. Science. 1996;274:546, 563–7.
26. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, Fitz-Hugh W, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
27. Levine M, Tjian R. Transcription regulation and animal diversity. Nature. 2003;424:147–51.
28. Carninci P, Waki K, Shiraki T, Konno H, Shibata K, Itoh M, Aizawa K, Arakawa T, Ishii Y, Sasaki D, et al. Targeting a complex transcriptome: the construction of the mouse full-length cDNA encyclopedia. Genome Res. 2003;13:1273–89.
29. Kimura K, Wakamatsu A, Suzuki Y, Ota T, Nishikawa T, Yamashita R, Yamamoto J, Sekine M, Tsuritani K, Wakaguri H, et al. Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes. Genome Res. 2006;16:55–65.
30. Suzuki Y, Yamashita R, Sugano S, Nakai K. DBTSS, DataBase of Transcriptional Start Sites: progress report 2004. Nucleic Acids Res. 2004;32:D78–81.
31. Halees AS, Weng Z. PromoSer: improvements to the algorithm, visualization and accessibility. Nucleic Acids Res. 2004;32:W191–4.
32. Jiang C, Xuan Z, Zhao F, Zhang MQ. TRED: a transcriptional regulatory element database, new entries and other development. Nucleic Acids Res. 2007;35:D137–40.
33. Schmid CD, Perier R, Praz V, Bucher P. EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res. 2006;34:D82–5.
34. Eriksson PR, Mendiratta G, McLaughlin NB, Wolfsberg TG, Marino-Ramirez L, Pompa TA, Jainerin M, Landsman D, Shen CH, Clark DJ. Global regulation by the yeast Spt10 protein is mediated through chromatin structure and the histone upstream activating sequence elements. Mol Cell Biol. 2005;25:9127–37.
35. Riz I, Akimov SS, Eaker SS, Baxter KK, Lee HJ, Marino-Ramirez L, Landsman D, Hawley TS, Hawley RG. TLX1/HOX11-induced hematopoietic differentiation blockade. Oncogene. 2007;26:4115–23.
36. van Helden J, André B, Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol. 1998;281:827–42.
37. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993;262:208–14.
38. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004;5:276–87.
39. Marino-Ramirez L, Jordan IK, Landsman D. Multiple independent evolutionary solutions to core histone gene regulation. Genome Biol. 2006;7:R122.
40. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
41. Staden R. Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci. 1989;5:89–96.
42. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. Genome-wide location and function of DNA binding proteins. Science. 2000;290:2306–9.
43. Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18:6097–100.
44. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14:1188–90.
