Nucleic Acids Res. 2004 Feb 12;32(3):949-58. Print 2004.

Statistical analysis of over-represented words in human promoter sequences.

Author information

  • Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA.

Erratum in

  • Nucleic Acids Res. 2004 Nov 8;32(19):5972.
  • Nucleic Acids Res. 2004 Dec;32(22):6718.


The identification and characterization of regulatory sequence elements in the proximal promoter region of a gene can be facilitated by knowing the precise location of the transcriptional start site (TSS). Using known TSSs from over 5700 different human full-length cDNAs, we extracted a set of 4737 distinct putative promoter regions (PPRs) from the human genome. Each PPR consisted of nucleotides from -2000 to +1000 bp, relative to the corresponding TSS. Since many regulatory regions contain short, highly conserved strings of fewer than 10 nucleotides, we counted eight-letter words within the PPRs, using z-scores and other related statistics to evaluate their over- and under-representation. Several over-represented eight-letter words have known biological functions described in the eukaryotic transcription factor database TRANSFAC; however, many do not. Besides calculating a P-value with the standard normal approximation associated with z-scores, we used two extra statistical controls to evaluate the significance of over-represented words. These controls have important implications for evaluating over- and under-represented words with z-scores.
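
The word-counting and z-score step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the background model, so the sketch assumes independent nucleotides with frequencies estimated from the input sequences and a binomial approximation for the variance of each word count. The function names (count_words, base_frequencies, word_z_scores) are hypothetical.

```python
from collections import Counter
from math import erfc, sqrt


def count_words(seqs, k=8):
    """Count every k-letter word over A/C/G/T; windows with other symbols are skipped."""
    counts = Counter()
    positions = 0  # number of valid windows examined
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
                positions += 1
    return counts, positions


def base_frequencies(seqs):
    """Estimate single-nucleotide frequencies from the sequences themselves."""
    c = Counter()
    for seq in seqs:
        c.update(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(c.values())
    return {b: c[b] / total for b in "ACGT"}


def word_z_scores(seqs, k=8):
    """Return {word: (observed, expected, z, upper-tail P-value)}.

    Assumption: independent-nucleotide background, binomial variance.
    Overlapping occurrences are not independent, so this variance is
    only a rough approximation.
    """
    counts, n = count_words(seqs, k)
    freqs = base_frequencies(seqs)
    scores = {}
    for word, observed in counts.items():
        p = 1.0
        for ch in word:
            p *= freqs[ch]
        expected = n * p
        variance = n * p * (1.0 - p)
        if variance > 0:
            z = (observed - expected) / sqrt(variance)
            # Upper-tail P-value from the standard normal approximation
            p_value = 0.5 * erfc(z / sqrt(2.0))
            scores[word] = (observed, expected, z, p_value)
    return scores
```

Sorting the returned dictionary by z would rank candidate over-represented words. Because the normal approximation can be unreliable for rare words and overlapping occurrences, a ranking of this kind is a starting point for further statistical controls, not a final significance call.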

[PubMed - indexed for MEDLINE]