Analysing grouping of nucleotides in DNA sequences using lumped processes constructed from Markov chains

J Math Biol. 2006 Mar;52(3):343-72. doi: 10.1007/s00285-005-0358-y. Epub 2006 Feb 7.

Abstract

The most commonly used models for analysing local dependencies in DNA sequences are (high-order) Markov chains. Incorporating knowledge about possible groupings of the nucleotides makes it possible to define dedicated sub-classes of Markov chains. The problem of formulating lumpability hypotheses for a Markov chain is therefore addressed. In the classical approach to lumpability, this problem can be formulated as the determination of an appropriate state space (smaller than the original state space) such that the lumped chain defined on this state space retains the Markov property. We propose a different perspective on lumpability in which the state space is fixed and the partitioning of this state space is represented by a one-to-many probabilistic function within a two-level stochastic process. Three nested classes of lumped processes can be defined in this way as sub-classes of first-order Markov chains. These lumped processes enable parsimonious reparameterizations of Markov chains that help to reveal relevant partitions of the state space. Characterizations of the lumped processes on the original transition probability matrix are derived. Different model selection methods, relying either on hypothesis testing or on penalized log-likelihood criteria, are presented, as well as extensions to lumped processes constructed from high-order Markov chains. The relevance of the proposed approach to lumpability is illustrated by the analysis of DNA sequences. In particular, the use of lumped processes highlights differences between intronic sequences and gene untranslated region sequences.
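To make the classical notion of lumpability concrete, the following minimal sketch checks the standard strong-lumpability condition for a first-order chain on the four nucleotides: a partition is lumpable if, for every pair of blocks, all states in the first block have the same total transition probability into the second. This illustrates only the classical approach mentioned in the abstract, not the two-level lumped processes proposed in the paper. The transition matrix `P`, the function names, and the purine/pyrimidine grouping are illustrative assumptions, not data or notation from the paper.

```python
# Illustrative sketch of the classical strong-lumpability check.
# All names and numbers below are hypothetical, chosen for the example.

STATES = ["A", "C", "G", "T"]

# Hypothetical first-order transition matrix (rows and columns ordered as STATES).
P = [
    [0.3, 0.2, 0.3, 0.2],  # from A
    [0.1, 0.4, 0.2, 0.3],  # from C
    [0.4, 0.1, 0.2, 0.3],  # from G
    [0.2, 0.5, 0.1, 0.2],  # from T
]


def block_probability(P, states, source, target_block):
    """Total probability of moving from `source` into any state of `target_block`."""
    index = {s: k for k, s in enumerate(states)}
    return sum(P[index[source]][index[t]] for t in target_block)


def is_strongly_lumpable(P, states, partition, tol=1e-9):
    """Return True if every state of a block has the same probability
    of moving into each block of the partition (up to `tol`)."""
    for block_a in partition:
        for block_b in partition:
            sums = [block_probability(P, states, s, block_b) for s in block_a]
            if max(sums) - min(sums) > tol:
                return False
    return True


def lumped_matrix(P, states, partition):
    """Transition matrix of the lumped chain; meaningful only if the
    partition is strongly lumpable (the first state of each block is used)."""
    return [
        [block_probability(P, states, block_a[0], block_b) for block_b in partition]
        for block_a in partition
    ]


purine_pyrimidine = [["A", "G"], ["C", "T"]]  # purines vs pyrimidines
other_grouping = [["A", "C"], ["G", "T"]]     # an alternative grouping

print(is_strongly_lumpable(P, STATES, purine_pyrimidine))
print(is_strongly_lumpable(P, STATES, other_grouping))
print(lumped_matrix(P, STATES, purine_pyrimidine))
```

In this toy matrix the purine/pyrimidine partition satisfies the condition while the alternative grouping does not, which is the kind of contrast a lumpability hypothesis is meant to capture. With estimated transition probabilities the equalities would hold only approximately, which is why the abstract refers to hypothesis testing and penalized log-likelihood criteria rather than an exact check.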

MeSH terms

  • 3' Untranslated Regions / genetics
  • Algorithms
  • Animals
  • Base Sequence
  • DNA / chemistry
  • DNA / genetics
  • Humans
  • Introns / genetics
  • Markov Chains*
  • Models, Statistical*
  • Sequence Analysis, DNA / methods*

Substances

  • 3' Untranslated Regions
  • DNA