Determining word sequence variation patterns in clinical documents using multiple sequence alignment

AMIA Annu Symp Proc. 2011:2011:934-43. Epub 2011 Oct 22.

Abstract

Sentences and phrases that express a given meaning often exhibit patterns of variation in which they differ from a basic structural form by one or two words. We present an algorithm that uses multiple sequence alignments (MSAs) to generate a representation of groups of phrases that have the same semantic meaning and share the same basic word sequence structure. The MSA enables the determination not only of the words that compose the basic word sequence, but also of the locations within the structure that exhibit variation. The algorithm can be used to generate patterns of text sequences that can serve as the basis for a pattern-based classifier, as a starting point to bootstrap the pattern-building process for a regular expression-based classifier, or as a means of revealing the variation characteristics of sentences and phrases within a particular domain.
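
To make the idea concrete, the following is a minimal sketch of how a word-level MSA can expose both the fixed words and the variable slots of a phrase group. It is not the authors' implementation: the abstract does not specify the alignment method, so the sketch assumes a simple progressive scheme in which each phrase is aligned to the growing alignment with Needleman-Wunsch scoring (match +1, mismatch -1, gap -1). The function names `align_to_profile`, `build_msa`, and `to_pattern`, the scoring values, and the example phrases are all hypothetical, and the phrases are assumed to have already been grouped by meaning.

```python
GAP = None  # marks a position where a phrase has no word in a column


def align_to_profile(profile, tokens, n_rows):
    """Align one token list to an existing alignment (list of columns).

    Assumed scoring: +1 if the token already occurs in the column,
    -1 for a mismatch, -1 for a gap. Returns the extended alignment.
    """
    m, n = len(profile), len(tokens)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = -i
    for j in range(1, n + 1):
        score[0][j] = -j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 1 if tokens[j - 1] in profile[i - 1] else -1
            score[i][j] = max(score[i - 1][j - 1] + match,
                              score[i - 1][j] - 1,
                              score[i][j - 1] - 1)

    # Trace back from the bottom-right corner, rebuilding the columns.
    columns = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            match = 1 if tokens[j - 1] in profile[i - 1] else -1
            if score[i][j] == score[i - 1][j - 1] + match:
                columns.append(profile[i - 1] + [tokens[j - 1]])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] - 1:
            # The new phrase has no word for this existing column.
            columns.append(profile[i - 1] + [GAP])
            i -= 1
        else:
            # The new phrase introduces a word that earlier phrases lack.
            columns.append([GAP] * n_rows + [tokens[j - 1]])
            j -= 1
    columns.reverse()
    return columns


def build_msa(phrases):
    """Progressively align whitespace-tokenized phrases, in input order."""
    token_lists = [p.lower().split() for p in phrases]
    profile = [[tok] for tok in token_lists[0]]
    for n_rows, tokens in enumerate(token_lists[1:], start=1):
        profile = align_to_profile(profile, tokens, n_rows)
    return profile


def to_pattern(profile):
    """Summarize each column: a fixed word, alternatives, or an optional slot."""
    parts = []
    for column in profile:
        variants = sorted({t for t in column if t is not GAP})
        optional = GAP in column
        if len(variants) == 1 and not optional:
            parts.append(variants[0])
        else:
            slot = "|".join(variants)
            parts.append(f"({slot})?" if optional else f"({slot})")
    return " ".join(parts)


phrases = [
    "no evidence of pneumonia",
    "no evidence of acute pneumonia",
    "no sign of pneumonia",
]
print(to_pattern(build_msa(phrases)))
# prints: no (evidence|sign) of (acute)? pneumonia
```

In the printed pattern, bare words are columns where every phrase agrees (the basic word sequence), parenthesized alternatives are columns where the word varies, and a trailing `?` marks a column that some phrases skip. The output is a readable summary rather than a ready-to-compile regular expression; it would need whitespace handling and escaping before being used in a classifier.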

MeSH terms

  • Algorithms*
  • Language
  • Natural Language Processing*
  • Pattern Recognition, Automated*
  • Semantics