Stud Health Technol Inform. 2007;129(Pt 1):524-8.

A reappraisal of sentence and token splitting for life sciences documents.

Author information

Jena University Language and Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany.

Abstract

Natural language processing of real-world documents requires several low-level preprocessing tasks, such as splitting a piece of text into its constituent sentences and splitting each sentence into its constituent tokens, to be performed prior to linguistic analysis. While these tasks are often considered unsophisticated clerical work, in the life sciences domain they pose enormous problems due to complex naming conventions. In this paper, we first introduce an annotation framework for sentence and token splitting underlying a newly constructed sentence- and token-tagged biomedical text corpus. This corpus serves as a training environment and test bed for machine-learning-based sentence and token splitters using Conditional Random Fields (CRFs). Our evaluation experiments reveal that CRFs with a rich feature set substantially increase sentence and token detection performance.
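
To illustrate the kind of problem the abstract refers to, the following minimal Python sketch shows a deliberately naive punctuation-based sentence splitter failing on a species abbreviation, together with a few per-token features of the sort a sequence labeller such as a CRF could use at candidate boundaries. The example sentence, the function names, and the feature set are assumptions made for illustration only; they are not the authors' annotation framework, features, or implementation.

```python
import re


def naive_sentence_split(text):
    """Naive baseline: split at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def boundary_features(tokens, i):
    """Hypothetical features for deciding whether tokens[i] ends a sentence.

    These are illustrative guesses, not the feature set used in the paper.
    """
    tok = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "token": tok.lower(),
        "ends_with_period": tok.endswith("."),
        # One to three letters plus a period, e.g. the genus abbreviation "E."
        "is_short_abbrev": bool(re.fullmatch(r"[A-Za-z]{1,3}\.", tok)),
        "contains_digit": any(c.isdigit() for c in tok),
        "next_is_capitalized": nxt[:1].isupper(),
        "is_last_token": nxt == "",
    }


# Made-up example text containing a genus abbreviation and a gene-like name.
text = "The protein was isolated from E. coli cells. It binds IL-2 in vitro."

# The naive splitter wrongly breaks the first sentence after "E.".
sentences = naive_sentence_split(text)
for s in sentences:
    print(repr(s))

# Features at the false boundary ("E.") and at the true boundary ("cells.").
tokens = text.split()
print(boundary_features(tokens, 5))
print(boundary_features(tokens, 7))
```

In this sketch the false boundary after "E." differs from the true boundary after "cells." in the abbreviation and next-token capitalization features, which is the kind of contextual evidence a trained model could weigh instead of relying on punctuation alone.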

PMID: 17911772
[Indexed for MEDLINE]
