From: Ovcharenko, Ivan (NIH/NLM/NCBI) [E]
Sent: Friday, January 23, 2009 9:10 AM
To: NLM/NCBI List ncbi-seminar
Subject: FW: NCBI special seminar, January 23
Follow Up Flag: Follow up
Flag Status: Red

Today at 11AM in the B2 Library.

-----------------------------------------------------------------------

Uwe Ohler, Duke University
Friday, January 23, 2009

Revisiting old friends: New approaches for the computational identification of functional non-coding elements and promoter regions

The recent arrival of large-scale Cap Analysis of Gene Expression (CAGE) datasets in mammals provides a wealth of quantitative information on coding and non-coding RNA polymerase II transcription start sites (TSS). This allowed us to revisit the problem of how to computationally identify TSSs in higher eukaryotes. We propose a new model for TSSs based solely on known transcription factors (TFs) and their respective regions of positional enrichment. This probabilistic model leads to near-perfect classification results in cross-validation. Furthermore, the interpretable model structure suggests a DNA code in which canonical sequence features such as TATA box, Initiator, and GC content do play a significant role, but many additional TFs show distinct spatial biases with respect to TSS location and are important contributors. Performance in genomic scans demonstrates that TSS prediction with both high accuracy and spatial resolution is achievable for both coding and non-coding genes.

In the second part, I will turn to the ever-popular problem of identifying regulatory sequence motifs in non-coding regions. The increasing availability of genome-wide data sets which directly or indirectly reflect gene regulation has allowed for the following problem definition: identify enriched sequence motifs, given quantitative experimental evidence for each regulatory region in a genome-wide set.
We have developed the (conserved) Evidence-Ranked Motif Identification Tool, (c)ERMIT, an efficient enumerative strategy which operates on a set of non-coding regulatory regions and their corresponding evidence, for example p-values resulting from chromatin- or RNA-immunoprecipitation experiments, or differential expression scores from knockdown assays. (c)ERMIT is validated extensively on curated yeast datasets and substantially outperforms existing state-of-the-art approaches. More importantly, it is easily applicable to recent large datasets, e.g. from ChIP-seq, which provide measurements for tens of thousands of input sequences.
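
For readers unfamiliar with the problem definition above, here is a toy sketch of what "evidence-ranked" motif enumeration could look like. It is not the (c)ERMIT algorithm, whose scoring and search strategy are not described in this announcement. The function name `rank_kmers_by_evidence`, the parameters `k` and `min_hits`, the rank-based score, and the example sequences and p-values are all assumptions made up for illustration.

```python
from itertools import product
from statistics import mean


def rank_kmers_by_evidence(seqs, pvalues, k=4, min_hits=2):
    """Toy evidence-ranked k-mer scan (hypothetical; NOT the (c)ERMIT method).

    Sequences are ranked by their evidence (smallest p-value first), and each
    k-mer is scored by how far the sequences containing it sit above the
    middle of that ranking.
    """
    n = len(seqs)
    # Rank sequences by evidence: the smallest p-value gets rank 0.
    order = sorted(range(n), key=lambda i: pvalues[i])
    rank_of = {idx: r for r, idx in enumerate(order)}
    expected = (n - 1) / 2  # mean rank of a k-mer unrelated to the evidence

    results = []
    # Enumerate every possible k-mer over the DNA alphabet.
    for kmer in map("".join, product("ACGT", repeat=k)):
        hits = [rank_of[i] for i, s in enumerate(seqs) if kmer in s]
        # Skip k-mers that are too rare or present in every sequence.
        if len(hits) < min_hits or len(hits) == n:
            continue
        # Positive score: sequences with this k-mer cluster near the top.
        score = (expected - mean(hits)) / n
        results.append((kmer, score, len(hits)))

    # Best-supported k-mers first.
    results.sort(key=lambda t: t[1], reverse=True)
    return results


# Made-up example: the two best-supported regions share the 4-mer "TGAC".
example_seqs = ["GGTGACGG", "CCTGACCC", "GGGGCCCC", "CCCCGGGG"]
example_pvalues = [0.001, 0.002, 0.5, 0.9]
ranked = rank_kmers_by_evidence(example_seqs, example_pvalues)
top_kmer, top_score, top_hits = ranked[0]
```

The sketch uses the full ranking of regions rather than a fixed significance cutoff, which is the general idea behind using quantitative evidence for every region. A real tool would also need to handle degenerate motifs, both strands, conservation, and statistical significance, none of which is attempted here.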