Prediction of genetic structure in eukaryotic DNA using reference point logistic regression and sequence alignment

Bioinformatics. 2000 May;16(5):425-38. doi: 10.1093/bioinformatics/16.5.425.

Abstract

Motivation: Current software tools are moderately effective in predicting genetic structure (exons, introns, intergenic regions, and complete genes) from raw DNA sequence data. Improvements in accuracy and speed are needed to deal with the increasing volume of data from large scale sequencing projects.

Results: We present a two-stage computer program to predict genetic structure in eukaryotic DNA. The first stage makes use of a novel statistical technique, called reference point logistic (RPL) regression, to calculate scores for potential functional sites. These site scores are combined with interval content, length, and state scores, via a Generalized Hidden Markov Model, to determine a combined score for each possible parse of a given DNA sequence into exons, introns, and intergenic regions. An optimal parse is found using a dynamic programming algorithm. In the second stage, protein sequence alignment methods are applied to improve the accuracy of the initial parse. Computation in the first stage of the program is very fast (1 s on a 360 MHz CPU for a 16 kb sequence) and its predictive accuracy typically matches or exceeds the best results reported for other methods (Sensitivity = 0.93 and Specificity = 0.93 for the Burset/Guigó test set). Computation in the second stage is slower, but the final predictions are more accurate (Sn = 0.97, Sp = 0.97). The program (called GRPL) can handle partial, single, and multi-gene sequences. The program is also capable of predicting the genetic structure of vertebrate, invertebrate, and plant DNA with nearly equal accuracy. Statistical techniques have also been introduced to model the effects of varying C+G content in a continuous manner and to control overfitting of parameters for smaller training sets.
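As an illustration of the first stage described above, the following is a minimal sketch of a dynamic programming search for the highest-scoring parse of a sequence into exons, introns, and intergenic regions. It is not the GRPL implementation. The transition table `ALLOWED`, the function names, and the toy scoring functions are all assumptions made for this example, and `site_log_score` uses ordinary logistic regression as a stand-in, since the abstract does not specify how RPL regression is computed.

```python
import math

STATES = ("intergenic", "exon", "intron")
# Hypothetical transition rules between consecutive intervals (an assumption).
ALLOWED = {
    "intergenic": ("exon",),
    "exon": ("intron", "intergenic"),
    "intron": ("exon",),
}


def site_log_score(features, weights, bias):
    """Log-probability of a site under a plain logistic model (not RPL)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return -math.log1p(math.exp(-z))  # equals log(1 / (1 + exp(-z)))


def best_parse(n, interval_score, boundary_score):
    """Return (score, segments) for the best parse of positions [0, n), n >= 1.

    interval_score(state, i, j) scores labelling [i, j) with `state`;
    boundary_score(prev_state, state, i) scores a functional site at i.
    Any state may start or end the parse, so partial genes are permitted.
    """
    neg_inf = float("-inf")
    # best[j][s]: best score for [0, j) when the last interval has state s.
    best = [dict.fromkeys(STATES, neg_inf) for _ in range(n + 1)]
    # back[j][s]: (start, previous state) of that last interval, or None
    # when the interval starts at position 0.
    back = [dict.fromkeys(STATES) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in STATES:
            for i in range(j):
                seg = interval_score(s, i, j)
                if i == 0:
                    candidates = [(seg, None)]
                else:
                    candidates = [
                        (best[i][p] + boundary_score(p, s, i) + seg, (i, p))
                        for p in STATES
                        if s in ALLOWED[p] and best[i][p] > neg_inf
                    ]
                for score, pointer in candidates:
                    if score > best[j][s]:
                        best[j][s] = score
                        back[j][s] = pointer
    # Trace back from the best final state to recover the intervals.
    state = max(STATES, key=lambda s: best[n][s])
    total = best[n][state]
    segments, j = [], n
    while True:
        pointer = back[j][state]
        start = 0 if pointer is None else pointer[0]
        segments.append((state, start, j))
        if pointer is None:
            break
        j, state = pointer
    segments.reverse()
    return total, segments


# Toy input: one preferred label per position (I, E, N), purely illustrative.
LABELS = {"I": "intergenic", "E": "exon", "N": "intron"}
toy = "IIIEEENNNNEEEII"


def toy_interval_score(state, i, j):
    # +1 for each position whose preferred label matches, -1 otherwise.
    return sum(1.0 if LABELS[c] == state else -1.0 for c in toy[i:j])


def toy_boundary_score(prev_state, state, i):
    return 0.0  # a real model would use site scores here


score, segments = best_parse(len(toy), toy_interval_score, toy_boundary_score)
```

On this toy input the only parse with no mismatched positions is the five-interval one (intergenic, exon, intron, exon, intergenic), so `segments` recovers those boundaries. The sketch evaluates every interval start for every end position, which is quadratic in sequence length; a practical gene finder would restrict candidate boundaries to high-scoring sites and bound interval lengths.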

Availability: An academic implementation of GRPL, compiled for SUN workstations, is available by anonymous ftp from snipe.pharmacy.ualberta.ca/pub. The training and test sets used in this work, together with supplementary material, can be found at the same location. A commercial implementation is available as a component of GeneTool (BioTools Inc., http://biotools.com).

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Arabidopsis / genetics
  • Base Composition
  • DNA / chemistry
  • DNA / genetics*
  • DNA, Plant / genetics
  • Databases, Factual
  • Drosophila / genetics
  • Humans
  • Logistic Models*
  • Markov Chains
  • Models, Genetic
  • Sequence Alignment*
  • Software*

Substances

  • DNA, Plant
  • DNA