Proc Natl Acad Sci U S A. Mar 1990; 87(6): 2264–2268.

Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.


An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern can have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared. For single sequences, such scores can reflect biophysical properties such as charge, volume, hydrophobicity, or secondary structure potential; for multiple sequences, they can reflect nucleotide or amino acid similarity measured in a wide variety of ways. Using an appropriate random model, we present a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score. A second class of results describes the composition of high-scoring segments. In certain contexts, these permit the choice of scoring systems which are "optimal" for distinguishing biologically relevant patterns. Examples are given of applications of the theory to a variety of protein sequences, highlighting segments with unusual biological features. These include distinctive charge regions in transcription factors and protooncogene products, pronounced hydrophobic segments in various receptor and transport proteins, and statistically significant subalignments involving the recently characterized cystic fibrosis gene.
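
The abstract gives no formulas, so the following is only a minimal sketch of how such an assessment might be computed, not the authors' implementation. It assumes the commonly cited form of the result: residues drawn independently with probabilities p_i and scores s_i, a negative expected score with at least one positive score, a scale parameter lambda defined as the positive root of sum_i p_i exp(lambda s_i) = 1, and a tail probability for the best segment score in a sequence of length n of approximately 1 - exp(-K n exp(-lambda x)). The constant K is treated here as a caller-supplied value, since computing it is outside the scope of this sketch. All function names are hypothetical.

```python
import math


def max_segment_score(scores):
    """Return (best, start, end) for the highest-scoring contiguous segment.

    Uses a single linear pass (Kadane's algorithm); `end` is exclusive.
    """
    best, best_start, best_end = 0.0, 0, 0
    running, start = 0.0, 0
    for i, s in enumerate(scores):
        if running <= 0:
            # A non-positive prefix can never help; restart the segment here.
            running, start = 0.0, i
        running += s
        if running > best:
            best, best_start, best_end = running, start, i + 1
    return best, best_start, best_end


def solve_lambda(probs, scores, tol=1e-12):
    """Positive root of sum_i p_i * exp(lam * s_i) = 1, found by bisection.

    Assumes a negative expected score and at least one positive score,
    which guarantees that exactly one positive root exists.
    """
    expected = sum(p * s for p, s in zip(probs, scores))
    if expected >= 0 or max(scores) <= 0:
        raise ValueError("need negative expected score and a positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in zip(probs, scores)) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) <= 0:  # expand until the function turns positive
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def segment_p_value(x, n, lam, k):
    """Approximate P(best segment score >= x) for a sequence of length n.

    `k` is the constant K of the assumed formula, supplied by the caller.
    """
    return -math.expm1(-k * n * math.exp(-lam * x))


def target_frequencies(probs, scores, lam):
    """Assumed residue composition of high-scoring segments: p_i * exp(lam * s_i)."""
    return [p * math.exp(lam * s) for p, s in zip(probs, scores)]
```

As a usage illustration, a two-letter scoring scheme with score +1 at probability 0.25 and score -1 at probability 0.75 has lambda = ln 3, and under the assumed composition formula the frequencies inside high-scoring segments swap to 0.75 and 0.25.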



Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences

