Categorization of sentence types in medical abstracts

AMIA Annu Symp Proc. 2003;2003:440-4.

Abstract

This study evaluated the use of machine learning techniques in the classification of sentence type. A total of 7253 structured abstracts and 204 unstructured abstracts of Randomized Controlled Trials (RCTs) from MEDLINE were parsed into sentences, and each sentence was labeled as one of four types (Introduction, Method, Result, or Conclusion). Support Vector Machine (SVM) and Linear Classifier models were generated and evaluated on cross-validated data. Treating sentences as a simple "bag of words", the SVM model had an average ROC area of 0.92. Adding relative sentence location as a feature improved performance markedly for some models and raised the overall average ROC area to 0.95. Linear classifier performance was significantly worse than that of the SVM in all datasets. Using the SVM model trained on structured abstracts to predict sentence types in unstructured abstracts yielded performance similar to that of models trained on unstructured abstracts in 3 of the 4 types. We conclude that classification of sentence type seems feasible within the domain of RCTs. Identification of sentence types may be helpful for providing context to end users or to other text summarization techniques.
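The two feature sets and the evaluation metric described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sentence splitter, the tokenizer, the feature key `__relative_location__`, the 0-to-1 scaling of sentence position, and all function names are assumptions made for illustration. The ROC area is computed with the standard pairwise-ranking definition, not with whatever tooling the study used.

```python
import re
from collections import Counter

def split_sentences(abstract):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [p for p in parts if p]

def sentence_features(sentences):
    """Return one feature dict per sentence: bag-of-words counts plus
    a relative-location feature scaled from 0.0 (first) to 1.0 (last)."""
    n = len(sentences)
    features = []
    for i, sentence in enumerate(sentences):
        # Bag of words: lowercase token counts, word order discarded.
        bag = Counter(re.findall(r"[a-z0-9]+", sentence.lower()))
        feats = dict(bag)
        # Hypothetical encoding of "relative sentence location".
        feats["__relative_location__"] = i / (n - 1) if n > 1 else 0.0
        features.append(feats)
    return features

def roc_area(scores, labels):
    """ROC area as the probability that a positive example outscores
    a negative one, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    text = "We studied X. Patients were randomized. Mortality fell. X works."
    for feats in sentence_features(split_sentences(text)):
        print(feats)
    # One-vs-rest scores for a single sentence type, e.g. "Result".
    print(roc_area([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

In a setup like the one described, feature dicts of this kind would be fed to a classifier such as an SVM, with one ROC area computed per sentence type and then averaged.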

MeSH terms

  • Abstracting and Indexing*
  • Artificial Intelligence*
  • Linear Models
  • Linguistics*
  • MEDLINE
  • ROC Curve
  • Randomized Controlled Trials as Topic