J Mol Biol. 1995 Apr 21;248(1):1-18.

Identification of protein coding regions in genomic DNA.

Author information

Department of Molecular, Cellular and Developmental Biology, University of Colorado, Boulder 80309-0347, USA.

Abstract

We have developed a computer program, GeneParser, which identifies and determines the fine structure of protein genes in genomic DNA sequences. The program scores all subintervals in a sequence for content statistics indicative of introns and exons, and for sites that identify their boundaries. This information is weighted by a neural network to approximate the log-likelihood that each subinterval exactly represents an intron or exon (first, internal or last). A dynamic programming algorithm is then applied to this data to find the combination of introns and exons that maximizes the likelihood function. Using this method, we can rapidly generate ranked suboptimal solutions, each of which is the optimum solution containing a given intron-exon junction. We have tested the system on a large collection of human genes. On sequences not used in training, we achieved a correlation coefficient for exon nucleotide prediction of 0.89. For a subset of G + C-rich genes, a correlation coefficient of 0.94 was achieved. We have also quantified the robustness of the method to substitution and frame-shift errors and show how the system can be optimized for performance on sequences with known levels of sequencing errors.
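The dynamic programming step described in the abstract can be illustrated with a minimal sketch. This is not GeneParser's implementation: it assumes that each subinterval already has a log-likelihood-style score (here supplied by the hypothetical callables `exon_score` and `intron_score`), it ignores the first/internal/last exon distinction, and it requires the parse to start and end with an exon. The per-position toy scores are invented purely for illustration.

```python
from math import inf


def best_parse(n, exon_score, intron_score):
    """Find the highest-scoring alternating exon/intron parse of [0, n).

    exon_score(i, j) and intron_score(i, j) are hypothetical callables that
    return a finite score for the half-open interval [i, j). Scores are
    summed, as log-likelihoods would be.
    """
    # best[k][j]: best score of a parse of [0, j) whose last segment has kind k.
    best = {"E": [-inf] * (n + 1), "I": [-inf] * (n + 1)}
    # back[k][j]: start position of that last segment.
    back = {"E": [None] * (n + 1), "I": [None] * (n + 1)}

    for j in range(1, n + 1):
        # A single exon covering [0, j) is always a valid parse.
        best["E"][j] = exon_score(0, j)
        back["E"][j] = 0
        for i in range(1, j):
            # Exon [i, j) must follow a parse that ends with an intron at i.
            cand = best["I"][i] + exon_score(i, j)
            if cand > best["E"][j]:
                best["E"][j], back["E"][j] = cand, i
            # Intron [i, j) must follow a parse that ends with an exon at i.
            cand = best["E"][i] + intron_score(i, j)
            if cand > best["I"][j]:
                best["I"][j], back["I"][j] = cand, i

    # Trace back from the final exon to recover the segments.
    segments = []
    kind, j = "E", n
    while j > 0:
        i = back[kind][j]
        segments.append((kind, i, j))
        kind = "I" if kind == "E" else "E"
        j = i
    segments.reverse()
    return best["E"][n], segments


# Toy per-position scores (made up): positions 0-1 and 4-5 look exon-like,
# positions 2-3 look intron-like.
exon_pos = [1, 1, -1, -1, 1, 1]
intron_pos = [-1, -1, 1, 1, -1, -1]

score, segments = best_parse(
    len(exon_pos),
    lambda i, j: sum(exon_pos[i:j]),
    lambda i, j: sum(intron_pos[i:j]),
)
print(score, segments)
```

In this toy setting the interval scores are simple sums of per-position values; in the system the abstract describes, the interval scores instead come from a neural network that weights content statistics and boundary-site scores.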

PMID: 7731036
DOI: 10.1006/jmbi.1995.0198
[Indexed for MEDLINE]