Discovering Motifs in Biological Sequences Using the Micron Automata Processor

IEEE/ACM Trans Comput Biol Bioinform. 2016 Jan-Feb;13(1):99-111. doi: 10.1109/TCBB.2015.2430313.

Abstract

Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l, d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-complete, and the largest solved instance reported to date is (26,11). We propose a novel algorithm for the (l,d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA). This solution is designed to take advantage of the micron automata processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We demonstrate the capability for solving much larger instances of the (l, d) motif search problem using the resources available within a single automata processor board, by estimating run-times for problem instances (39,18) and (40,17). The paper serves as a useful guide to solving problems using this new accelerator technology.

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Amino Acid Motifs / genetics*
  • Computational Biology / methods*
  • Nucleotide Motifs / genetics*
  • Sequence Analysis, DNA / methods*
  • Sequence Analysis, Protein / methods*