Nucleic Acids Res. Jun 11, 1992; 20(11): 2871–2875.
PMCID: PMC336935

WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences.


We present a fast and sensitive method for isolating short nucleotide sequences that have non-random statistical properties and may therefore be biologically active. It is based on a first-order Markov analysis and detects sequence motifs six to ten nucleotides long that are significantly shared (or avoided) among the sequences under investigation. The method was tested on a set of 521 sequences extracted from the Eukaryotic Promoter Database (2). The results demonstrate its accuracy and efficiency: sequence motifs known to act as eukaryotic promoter elements, such as the TATA-box and the CAAT-box, were clearly identified. In addition, we found other statistically significant motifs whose biological roles are yet to be clarified.
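The idea of scoring words against a first-order Markov background can be illustrated with a minimal Python sketch. It is not the WORDUP implementation: the function name `score_words`, the toy sequences, and the Poisson-style deviation used as a score are all assumptions made for illustration, and the abstract does not specify the statistic actually used. The expected count of a word is estimated from the mononucleotide frequency of its first base and the dinucleotide transition frequencies of each following base, then compared with the observed count.

```python
from collections import Counter
from math import sqrt

ALPHABET = set("ACGT")


def score_words(seqs, k):
    """Score each observed k-mer against a first-order Markov background.

    Returns {word: (observed, expected, z)}. The z value is an
    illustrative Poisson-style deviation (assumption), not the
    statistic used by WORDUP.
    """
    mono, di, first, obs = Counter(), Counter(), Counter(), Counter()
    n_windows = 0
    for s in seqs:
        s = s.upper()
        # Mononucleotide counts, ignoring ambiguous symbols such as N.
        mono.update(c for c in s if c in ALPHABET)
        # Dinucleotide counts and counts of each base as a first position.
        for i in range(len(s) - 1):
            pair = s[i:i + 2]
            if set(pair) <= ALPHABET:
                di[pair] += 1
                first[pair[0]] += 1
        # Observed k-mer counts over all valid windows.
        for i in range(len(s) - k + 1):
            word = s[i:i + k]
            if set(word) <= ALPHABET:
                obs[word] += 1
                n_windows += 1

    total = sum(mono.values())
    scores = {}
    for word, n_obs in obs.items():
        # P(word) = P(w1) * product of P(w[i+1] | w[i]).
        p = mono[word[0]] / total
        for a, b in zip(word, word[1:]):
            p *= di[a + b] / first[a]
        expected = n_windows * p
        z = (n_obs - expected) / sqrt(expected)
        scores[word] = (n_obs, expected, z)
    return scores


if __name__ == "__main__":
    # Hypothetical toy sequences, each containing TATAAA once.
    toy = ["ACGTTATAAAGGC", "TTGCATATAAACC", "GGTATAAATCGA"]
    ranked = sorted(score_words(toy, 6).items(), key=lambda kv: -kv[1][2])
    for word, (n_obs, expected, z) in ranked[:5]:
        print(word, n_obs, round(expected, 3), round(z, 2))
```

A positive score marks a word seen more often than the background predicts (shared), a negative score one seen less often (avoided). This sketch scores only words that occur at least once, so completely absent words would need a separate enumeration.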


Selected References

These references are in PubMed. This may not be the complete list of references from this article.
  • Brendel V, Beckmann JS, Trifonov EN. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J Biomol Struct Dyn. 1986 Aug;4(1):11–21. [PubMed]
  • Pevzner PA, Borodovsky MYu, Mironov AA. Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J Biomol Struct Dyn. 1989 Apr;6(5):1013–1026. [PubMed]
  • Beckmann JS, Brendel V, Trifonov EN. Intervening sequences exhibit distinct vocabulary. J Biomol Struct Dyn. 1986 Dec;4(3):391–400. [PubMed]
  • Borodovsky MYu, Gusein-Zade SM. A general rule for ranged series of codon frequencies in different genomes. J Biomol Struct Dyn. 1989 Apr;6(5):1001–1012. [PubMed]
  • Nussinov R. Eukaryotic dinucleotide preference rules and their implications for degenerate codon usage. J Mol Biol. 1981 Jun 15;149(1):125–131. [PubMed]
  • Beutler E, Gelbart T, Han JH, Koziol JA, Beutler B. Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proc Natl Acad Sci U S A. 1989 Jan;86(1):192–196. [PMC free article] [PubMed]
  • Kozhukhin CG, Pevzner PA. Genome inhomogeneity is determined mainly by WW and SS dinucleotides. Comput Appl Biosci. 1991 Jan;7(1):39–49. [PubMed]
  • Korn LJ, Queen CL, Wegman MN. Computer analysis of nucleic acid regulatory sequences. Proc Natl Acad Sci U S A. 1977 Oct;74(10):4401–4405. [PMC free article] [PubMed]
  • Lipman DJ, Wilbur WJ, Smith TF, Waterman MS. On the statistical significance of nucleic acid similarities. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):215–226. [PMC free article] [PubMed]
  • Stormo GD. Consensus patterns in DNA. Methods Enzymol. 1990;183:211–221. [PubMed]
  • Mengeritsky G, Smith TF. Recognition of characteristic patterns in sets of functionally equivalent DNA sequences. Comput Appl Biosci. 1987 Sep;3(3):223–227. [PubMed]
  • Staden R. Searching for patterns in protein and nucleic acid sequences. Methods Enzymol. 1990;183:193–211. [PubMed]
  • Bucher P, Bryan B. Signal search analysis: a new method to localize and characterize functionally important DNA sequences. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):287–305. [PMC free article] [PubMed]
  • Gartmann CJ, Grob U. SQUIRREL: Sequence QUery, Information Retrieval and REporting Library. A program package for analyzing signals in nucleic acid sequences for the VAX. Nucleic Acids Res. 1991 Nov 11;19(21):6033–6040. [PMC free article] [PubMed]
  • Galas DJ, Eggert M, Waterman MS. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J Mol Biol. 1985 Nov 5;186(1):117–128. [PubMed]
  • Stückle EE, Emmrich C, Grob U, Nielsen PJ. Statistical analysis of nucleotide sequences. Nucleic Acids Res. 1990 Nov 25;18(22):6641–6647. [PMC free article] [PubMed]
  • Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A. 1986 Jul;83(14):5155–5159. [PMC free article] [PubMed]
  • Smith TF, Waterman MS, Sadler JR. Statistical characterization of nucleic acid sequence functional domains. Nucleic Acids Res. 1983 Apr 11;11(7):2205–2220. [PMC free article] [PubMed]
  • Gouy M, Gautier C, Attimonelli M, Lanave C, di Paola G. ACNUC--a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput Appl Biosci. 1985 Sep;1(3):167–172. [PubMed]
  • Breathnach R, Chambon P. Organization and expression of eucaryotic split genes coding for proteins. Annu Rev Biochem. 1981;50:349–383. [PubMed]
  • Everett RD, Baty D, Chambon P. The repeated GC-rich motifs upstream from the TATA box are important elements of the SV40 early promoter. Nucleic Acids Res. 1983 Apr 25;11(8):2447–2464. [PMC free article] [PubMed]
  • Efstratiadis A, Posakony JW, Maniatis T, Lawn RM, O'Connell C, Spritz RA, DeRiel JK, Forget BG, Weissman SM, Slightom JL, et al. The structure and evolution of the human beta-globin gene family. Cell. 1980 Oct;21(3):653–668. [PubMed]
  • Ghosh D. A relational database of transcription factors. Nucleic Acids Res. 1990 Apr 11;18(7):1749–1756. [PMC free article] [PubMed]
  • Nussinov R. Doublet frequencies in evolutionary distinct groups. Nucleic Acids Res. 1984 Feb 10;12(3):1749–1763. [PMC free article] [PubMed]
  • Devereux J, Haeberli P, Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):387–395. [PMC free article] [PubMed]
  • Waterman MS, Arratia R, Galas DJ. Pattern recognition in several sequences: consensus and alignment. Bull Math Biol. 1984;46(4):515–527. [PubMed]
  • Chen RP, Ingraham HA, Treacy MN, Albert VR, Wilson L, Rosenfeld MG. Autoregulation of pit-1 gene expression mediated by two cis-active promoter elements. Nature. 1990 Aug 9;346(6284):583–586. [PubMed]
  • Dudley JP. Discrete high molecular weight RNA transcribed from the long interspersed repetitive element L1Md. Nucleic Acids Res. 1987 Mar 25;15(6):2581–2592. [PMC free article] [PubMed]
  • Davidson I, Fromental C, Augereau P, Wildeman A, Zenke M, Chambon P. Cell-type specific protein binding to the enhancer of simian virus 40 in nuclear extracts. Nature. 1986 Oct 9;323(6088):544–548. [PubMed]
  • Casey JL, Di Jeso B, Rao KK, Rouault TA, Klausner RD, Harford JB. Deletional analysis of the promoter region of the human transferrin receptor gene. Nucleic Acids Res. 1988 Jan 25;16(2):629–646. [PMC free article] [PubMed]
  • Hertz GZ, Hartzell GW, 3rd, Stormo GD. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci. 1990 Apr;6(2):81–92. [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

