Nucleic Acids Res. Feb 11, 1993; 21(3): 607–613.
PMCID: PMC309159

Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks.

Abstract

Dynamic programming (DP) is applied to the problem of precisely identifying internal exons and introns in genomic DNA sequences. The program GeneParser first scores the sequence of interest for splice sites and for these intron- and exon-specific content measures: codon usage, local compositional complexity, 6-tuple frequency, length distribution and periodic asymmetry. This information is then organized for interpretation by DP. GeneParser employs the DP algorithm to enforce the constraints that introns and exons must be adjacent and non-overlapping, and finds the highest scoring combination of introns and exons subject to these constraints. Weights for the various classification procedures are determined by training a simple feed-forward neural network to maximize the number of correct predictions. In a pilot study, the system has been trained on a set of 56 human gene fragments containing 150 internal exons in a total of 158,691 bp of genomic sequence. When tested against the training data, GeneParser precisely identifies 75% of the exons and correctly predicts 86% of coding nucleotides as coding, while only 13% of non-exon bp are predicted to be coding. This corresponds to a correlation coefficient for exon prediction of 0.85. Because of the simplicity of the network weighting scheme, generalization performance is nearly as good as with the training set.
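
The adjacency and non-overlap constraints described above can be illustrated with a small two-state dynamic program. This is a minimal sketch, not GeneParser's actual recurrence: it assumes each position already carries a combined exon score and intron score (in the real system these would come from the weighted content measures), and that a boundary score is available for starting an intron (donor) or an exon (acceptor) at each position. The function name `parse_segments` and all four input arrays are hypothetical.

```python
from typing import List, Tuple


def parse_segments(
    exon_score: List[float],
    intron_score: List[float],
    donor: List[float],
    acceptor: List[float],
) -> Tuple[float, str]:
    """Label every position 'E' (exon) or 'I' (intron) to maximize total score.

    donor[i]    : assumed score for an exon->intron boundary, intron starting at i
    acceptor[i] : assumed score for an intron->exon boundary, exon starting at i
    Because each position gets exactly one label, segments are necessarily
    adjacent and non-overlapping.
    """
    n = len(exon_score)
    if n == 0:
        return 0.0, ""

    # best_e[i] / best_i[i]: best score of a labelling of 0..i ending in E / I
    best_e = [exon_score[0]] + [0.0] * (n - 1)
    best_i = [intron_score[0]] + [0.0] * (n - 1)
    # back_e[i] / back_i[i]: label of position i-1 on the best path
    back_e = [""] * n
    back_i = [""] * n

    for i in range(1, n):
        stay_e = best_e[i - 1]
        enter_e = best_i[i - 1] + acceptor[i]
        if stay_e >= enter_e:
            best_e[i], back_e[i] = stay_e + exon_score[i], "E"
        else:
            best_e[i], back_e[i] = enter_e + exon_score[i], "I"

        stay_i = best_i[i - 1]
        enter_i = best_e[i - 1] + donor[i]
        if stay_i >= enter_i:
            best_i[i], back_i[i] = stay_i + intron_score[i], "I"
        else:
            best_i[i], back_i[i] = enter_i + intron_score[i], "E"

    # Trace back from the better of the two final states.
    state = "E" if best_e[-1] >= best_i[-1] else "I"
    total = best_e[-1] if state == "E" else best_i[-1]
    labels = []
    for i in range(n - 1, -1, -1):
        labels.append(state)
        state = back_e[i] if state == "E" else back_i[i]
    return total, "".join(reversed(labels))


# Toy input: positions 2-3 look intronic, with good boundary signals at 2 and 4.
exon = [1, 1, -1, -1, 1, 1]
intron = [-1, -1, 1, 1, -1, -1]
donor_s = [-5, -5, 0, -5, -5, -5]
acceptor_s = [-5, -5, -5, -5, 0, -5]
print(parse_segments(exon, intron, donor_s, acceptor_s))  # (6, 'EEIIEE')
```

The sketch runs in time linear in the sequence length because per-base scores are additive. Segment-level measures such as the length distribution mentioned above would need a recurrence over candidate segment boundaries instead, which this simplified version does not attempt.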

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
