AMIA Annu Symp Proc. 2005; 2005: 410–414.
PMCID: PMC1560806

Towards Semantic Role Labeling & IE in the Medical Literature

Abstract

Introduction

In this work, we introduce the concept of semantic role labeling to the medical domain. We report first results of porting and adapting an existing resource, Propbank, to the medical field. Propbank is an adjunct to the Penn Treebank that provides semantic annotation of predicates and the roles played by their arguments. The main aim of this work is to assess the applicability of the Propbank frame files to predicates typically encountered in the medical literature.

Methods

We analyzed a target corpus of 610,100 abstracts, which was selected by searching for publication type “case reports”. From this target corpus, we randomly selected 10,000 sample abstracts to estimate the predicate distribution, and matched the predicates from this sample to the predicates in Propbank.

Results

Of the 1998 unique verbs in our sample, 76% were represented in Propbank. This included the 40 most frequent verbs, which represented 49% of all predicate instances in our sample and which matched the Propbank usage in a study of representative sentences. We propose extensions to Propbank to handle medical predicates that it does not adequately cover.

Conclusion

We believe that semantic role labeling using Propbank is a valid approach to capture predicate relations in the medical literature.

INTRODUCTION

Medicine is very much an observational and inductive science, where observations of patient symptoms lead to diagnoses, and the assessments of a medical intervention lead to treatment guidelines. Not surprisingly, the medical literature is rich with observational reports on patients’ symptomatology and on patients’ response to treatment. These reports are at the heart of modern evidence-based medical practice, where physicians treat their patients based on previous observations of the efficacy of a particular intervention. This comprehensive body of medical knowledge is not easily accessible to computational analysis, and unlike in the molecular biology domain (and other domains such as newswire reports), there have been only a few attempts to build information extraction systems that specifically target the medical literature.

There are numerous research projects that deal with extracting information from (electronic) medical records [1–3]; however, the structure of such reports, as well as their use of language, is quite distinct from that of the medical literature. Medical reports and the medical literature share common sublanguage features (such as a distinct set of domain terms), but specific features may be unique to medical reports. This includes the omission of verbs in sentences conveying patient symptoms [4]. In fact, there are sub-types of medical reports, such as signout notes, that are quite removed from the morphologic constraints of the English language [5]. Medical reports aim for brevity and may drop sentence elements that can be implied by the wider context. In contrast, medical literature English is usually well formed. In an attempt to build an information extraction engine for the medical literature, we came across the question of how closely everyday English is related to ‘medical literature’ English. This question is of practical importance: if there is a considerable overlap between the two language spheres, it should be possible to port existing information extraction resources (corpora or tools) that have been developed for other domains (such as newswire reports) to the medical domain. The reuse of such resources would then considerably speed up the development of a medical literature-specific information extraction engine.

In this paper, we report first results of porting, reusing and adapting a semantically oriented linguistic resource for information extraction from the medical literature. Our goal is to introduce the concept of semantic role labeling to the medical domain, taking advantage of recent advancements in semantic role labeling that were driven by the introduction of Propbank a few years ago. In essence, Propbank [6, 7] is an adjunct to the Penn Treebank [8] that provides a semantic annotation layer with predicates and the roles played by their arguments. Propbank-annotated sentences provide the desired structure for information extraction (IE): for example, in the patient reported heartburn and dysphagia, the predicate to report (Propbank report.01, see below) defines the two roles reporter (the patient) and thing reported (heartburn and dysphagia). Automatically labeling these roles in the above sentence enables the automatic extraction of structured information about a patient and his symptoms (heartburn and dysphagia in the sentence above) from free text. As we will see below, semantic role labeling addresses a known problem of existing IE engines: the mapping of various syntactic surface structures to the same target frame. By defining predicate templates that guide the annotation of training corpora, the approach circumvents the need to explicitly construct myriads of text patterns that are traditionally used to capture the syntactic variety of the textual data.

While our ultimate goal is to build a semantic role labeling-based system for extracting patient descriptions from published case reports, this paper has the less ambitious but important goal of discussing the appropriateness of using Propbank (which contains predicates found in the Wall Street Journal corpus) for labeling predicates typically encountered in medical abstracts. We are not aiming to construct another Propbank with a complete coverage of medical predicates; rather, we will limit our project mostly to high-frequency medical predicates, and aim to reuse existing Propbank material to the maximum extent possible. The paper follows the approach taken by [10], which recently discussed porting Propbank to the field of molecular biology. The work also follows other medical text mining initiatives, which aim at extracting specific relationships from the literature, such as hypernymic propositions [11] or drug-disease interactions [12].

THE STRUCTURE OF PROPBANK

Identifying predicates and their arguments (predicate argument structures [PAS]) was an early goal of the Penn Treebank project. The Penn Treebank II employed a novel tagging schema that identified simple PAS through grammatical function tags and a limited number of semantic role tags. It also provided a mechanism for recovering discontinuous predicate argument constructs. Propbank expanded the scope of such PAS labeling by explicitly linking all surface arguments of a given predicate to their semantic roles. Propbank consists of two core resources, the frame files and the Treebank standoff annotation. The frame files (approximately 3300 in the current Propbank release) contain one or more Framesets for each distinct meaning of a verb (see footnote 1). For example, the frame file bank defines two different Framesets of to bank:

[Figure amia2005_0410f1.jpg: the two Framesets of to bank (image not reproduced)]

Framesets include syntactic frames that document the different syntactic realizations of the verb roles. For example, Frameset bank.01 provides sample sentences that illustrate the appropriate mapping of the sentence constituents to the semantic roles of bank.01, as illustrated in the following sentence:

What do you say [we all]Arg0 close down the poker game, go home and [bank]rel [the $16 billion]Arg1?

The appropriateness of using Propbank for semantic role labeling in IE is exemplified by its handling of syntactic variations that map to the same Frameset. The two sentences

[The patient]Arg0 [reported]rel [heartburn and dysphagia]Arg1

and

[Heartburn and dysphagia]Arg1 [were reported]rel by [the patient]Arg0 (the passive voice construct)

illustrate the situation where the same Frameset (report.01) is instantiated by two different syntactic structures. Semantic role labeling (possibly by automatic means) would map the exact same constituents in both sentences (patient and heartburn/dysphagia) to the same predicate arguments (reporter and thing reported), irrespective of the surface syntax structure. Although the machine-learning algorithms responsible for automatic labeling consider syntactic sentence features, such features are usually generated automatically by specialized language parsers. This is unlike traditional IE, where it would be necessary to construct distinct text patterns (such as regular expressions) for extracting the appropriate arguments from syntactically distinct sentence constructs.
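The point about surface syntax can be made concrete with a minimal Python sketch. It assumes the bracket notation used in the two sentences above and takes the role names of report.01 from the text; the helper names (extract_roles, to_frame) are hypothetical and not part of Propbank or any labeling tool. It reads already labeled sentences and does not perform the labeling itself.

```python
import re

# One bracketed constituent followed by its label, e.g. "[the patient]Arg0".
LABELED = re.compile(r"\[([^\]]+)\]\s*(Arg\d|rel)")

# Role names for report.01 as described in the text (assumption: only these two).
REPORT_01 = {"Arg0": "reporter", "Arg1": "thing reported"}

def extract_roles(annotated):
    """Map each argument label to its lowercased constituent, skipping the predicate."""
    return {label: text.lower()
            for text, label in LABELED.findall(annotated)
            if label != "rel"}

def to_frame(annotated, roleset):
    """Rename argument labels to the role names of the given roleset."""
    return {roleset[label]: text
            for label, text in extract_roles(annotated).items()}

active = "[The patient]Arg0 [reported]rel [heartburn and dysphagia]Arg1"
passive = "[Heartburn and dysphagia]Arg1 [were reported]rel by [the patient]Arg0"

# Both surface forms fill the same frame.
assert to_frame(active, REPORT_01) == to_frame(passive, REPORT_01)
print(to_frame(passive, REPORT_01))
```

A pattern-based IE system would need one extraction pattern per syntactic variant; here the variation is absorbed by the role labels, so a single frame definition suffices.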

The verb-centric approach of PAS raises the question of whether the information in the medical literature can be adequately described by event-type PAS constructs. We would strongly argue that this is the case, as we found ample evidence for a predicate-based way of conveying medical information. To illustrate this point, this work will examine the semantic overlap of the Propbank verb frames with predicates usually encountered in abstracts of case reports from the medical literature.

METHODS

Corpus and Predicate Selection

Our target corpus consists of 610,100 abstracts from PubMed (release 2005), which was selected by searching for publication type “case reports”. From this target corpus, we randomly selected 10,000 abstracts (working corpus) for estimating the predicate distribution in our target corpus. Using Charniak’s maximum entropy-based parser [14], the working corpus was automatically augmented with Treebank-compliant syntactic and POS information. Predicates were then identified by their verb-specific tags (VB*), which indicate different verb forms, such as past tense or gerund. In order to normalize these forms to the verb base form (such as normalizing reporting to report), we used the program morpha, a robust morphological analyzer for English [15]. We measured the accuracy of this process (parsing and morphological transformation), and manually eliminated obvious tagging errors. After transformation, we counted the occurrence of each (normalized) verb in the working corpus. We also matched the list of unique verbs in the working corpus to a similar list of verbs that are contained in the current Propbank release. For matching verbs, we examined whether the medical predicate senses are adequately represented in Propbank.
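The predicate-counting step can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is a list of already POS-tagged tokens (the study obtained tags from Charniak's parser), and the small BASE_FORMS table is a hypothetical stand-in for the morphological analyzer morpha, covering only the toy tokens shown.

```python
from collections import Counter

# Hypothetical stand-in for a morphological analyzer such as morpha:
# a lookup table that only covers the toy tokens used below.
BASE_FORMS = {"reported": "report", "reporting": "report",
              "was": "be", "identified": "identify"}

def count_predicates(tagged_tokens):
    """Count verb base forms among (token, tag) pairs carrying a VB* tag."""
    counts = Counter()
    for token, tag in tagged_tokens:
        if tag.startswith("VB"):  # VB, VBD, VBG, VBN, VBP, VBZ
            word = token.lower()
            counts[BASE_FORMS.get(word, word)] += 1
    return counts

# Toy input; tokens and tags are illustrative, not taken from the working corpus.
tagged = [
    ("The", "DT"), ("patient", "NN"), ("reported", "VBD"), ("heartburn", "NN"),
    ("A", "DT"), ("lesion", "NN"), ("was", "VBD"), ("identified", "VBN"),
    ("Reporting", "VBG"), ("bias", "NN"),
]

counts = count_predicates(tagged)
print(counts.most_common())
```

The resulting counter corresponds to the per-verb frequency list described above; its keys are the unique base forms that would then be matched against the Propbank verb list.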

Propbank-style Framesets

For medical predicate senses that were not contained in Propbank, or medical predicates that were completely absent from Propbank, we constructed sample Propbank-style Framesets. To this end, we followed the guidelines set forth by [6, 7]. In summary, core arguments are labeled ArgX (X being a cardinal number from 0 to the number of arguments of a given predicate). Core arguments can be seen as the (traditional) subject and objects of verbs, and are often paired with adjuncts (ArgMs), which are distinguished by function tags such as LOC (location), TMP (time), NEG (negation marker) or MNR (manner). In addition, the tag PRD denotes ‘secondary predication’ for arguments that are predicates of another argument of the same verb. It should be noted that there are many cases where the frame files may define roles that are not instantiated in a given sentence.
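One possible in-memory representation of these conventions is sketched below in Python. The class and field names are assumptions made for illustration and do not reflect the actual Propbank frame file format; the report.01 roles follow the description given earlier in the text, and the annotated sentence is a made-up example.

```python
from dataclasses import dataclass, field

# Adjunct function tags named in the text.
ARGM_TAGS = {"LOC": "location", "TMP": "time",
             "NEG": "negation marker", "MNR": "manner"}

@dataclass
class Frameset:
    frameset_id: str
    roles: dict  # core argument number -> role description

@dataclass
class Annotation:
    frameset: Frameset
    rel: str                                      # the predicate as it appears
    core: dict = field(default_factory=dict)      # argument number -> text
    adjuncts: dict = field(default_factory=dict)  # function tag -> text

    def __post_init__(self):
        # Reject arguments or tags the Frameset / tag list does not define.
        for n in self.core:
            if n not in self.frameset.roles:
                raise ValueError(f"Arg{n} is not defined in {self.frameset.frameset_id}")
        for tag in self.adjuncts:
            if tag not in ARGM_TAGS:
                raise ValueError(f"unknown adjunct tag {tag}")

    def labeled(self):
        """Return all instantiated arguments keyed by their label."""
        out = {f"Arg{n}": text for n, text in sorted(self.core.items())}
        out.update({f"ArgM-{tag}": text for tag, text in self.adjuncts.items()})
        return out

    def uninstantiated(self):
        """Core roles defined by the Frameset but absent from this sentence."""
        return [f"Arg{n}" for n in sorted(self.frameset.roles) if n not in self.core]

report_01 = Frameset("report.01", {0: "reporter", 1: "thing reported"})

# Hypothetical sentence: "Heartburn was reported yesterday" (no reporter given).
ann = Annotation(report_01, "was reported",
                 core={1: "Heartburn"}, adjuncts={"TMP": "yesterday"})
print(ann.labeled())
print(ann.uninstantiated())
```

The uninstantiated method reflects the last remark above: a frame file may define roles that a given sentence leaves unfilled.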

RESULTS

Our working corpus of 10,000 abstracts (sampled from a target corpus of abstracts of medical case reports) contained 91,707 predicate instances (verbs in different verb forms) that mapped to 1998 unique verbs (to be exact: the base forms of these verbs). The accuracy of identifying predicates and converting them into the verb base form was 92.9%, based on a testing set of 282 manually annotated random sentences from the working corpus. The distribution of these verbs revealed the following picture: the 40 most frequent verbs covered 49% of all verb instances. In other words, approximately 45,000 predicates in our working corpus consisted of different verb forms of the same set of 40 verbs. Overall, 1522 verbs (76%) of our working corpus were represented in Propbank (i.e., matched a frame file of the current Propbank release), including the 40 most frequent verbs. These matching verbs corresponded to 97% of all predicate instances. 466 verbs (24%) were not represented in Propbank, corresponding to the remaining predicate instances (see footnote 2).
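The difference between coverage of unique verbs (76%) and coverage of predicate instances (97%) can be illustrated with a short Python sketch. The function names and the toy frequency table are assumptions for illustration; only the two figures recomputed at the end are taken from the paragraph above.

```python
def coverage(counts, covered_verbs):
    """Return (share of unique verbs, share of verb instances) found in covered_verbs."""
    matched = [verb for verb in counts if verb in covered_verbs]
    type_share = len(matched) / len(counts)
    token_share = sum(counts[verb] for verb in matched) / sum(counts.values())
    return type_share, token_share

def top_k_share(counts, k):
    """Share of all verb instances accounted for by the k most frequent verbs."""
    top = sorted(counts.values(), reverse=True)[:k]
    return sum(top) / sum(counts.values())

# Toy frequency table; the counts are invented to show the effect of a skewed distribution.
toy_counts = {"report": 50, "show": 30, "reveal": 15, "metastasize": 3, "occlude": 2}
toy_covered = {"report", "show", "reveal"}

print(coverage(toy_counts, toy_covered))   # 60% of verbs, 95% of instances
print(top_k_share(toy_counts, 2))          # two verbs account for 80% of instances

# Figures reported above, recomputed from the stated totals.
print(round(1522 / 1998, 3))   # share of unique verbs matching a frame file
print(round(0.49 * 91707))     # instances covered by the 40 most frequent verbs
```

Because verb frequencies are highly skewed, a resource that misses many rare verbs can still cover nearly all predicate instances, which is the pattern reported above.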

For verbs represented in Propbank, we explored whether the medical verb usage was adequately described in Propbank. We encountered the following two situations:

  A. The Propbank frame file adequately describes the use of a predicate.
  B. The Propbank frame file describes a different use of a predicate, and a new Frameset is added to Propbank.

In addition, the following situation applied to non-matching verbs:

  C. No Propbank frame file adequately describes the medical predicate, and a new frame file is created.

In the following section, we describe these situations separately, and list a few manually selected sample predicates and sentences.
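The three situations amount to a simple decision rule, sketched below in Python. The function name and the data structures are hypothetical; in this study the judgment of whether a Propbank sense fits the medical usage was made manually, which the sketch represents as a hand-curated set. The example verbs are those named in the sections that follow.

```python
def classify_verb(verb, propbank_verbs, sense_mismatch):
    """Return 'A', 'B' or 'C' for a verb from the working corpus.

    propbank_verbs: verbs that have a Propbank frame file.
    sense_mismatch: verbs whose Propbank sense was judged (manually)
                    not to cover the medical usage.
    """
    if verb not in propbank_verbs:
        return "C"  # no frame file: a new frame file is created
    if verb in sense_mismatch:
        return "B"  # frame file exists, but a new Frameset is added
    return "A"      # existing frame file can be reused as is

# Toy inputs built from the example verbs mentioned in this paper.
propbank_verbs = {"report", "identify", "discharge", "enhance"}
sense_mismatch = {"discharge", "enhance"}

for verb in ["report", "discharge", "metastasize"]:
    print(verb, classify_verb(verb, propbank_verbs, sense_mismatch))
```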

A. Propbank frame file adequately describes the use of a predicate in the working corpus

The usage of verbs in this category seems to be the same in Propbank as well as our working corpus. These Propbank frames may be reused ‘as is’ in any attempt to port Propbank-style semantic role labeling to predicates of medical case reports. In many instances, fewer or more arguments than proposed in the Propbank frames are needed. An analysis of 5 sample sentences from our working corpus for each of the 40 most frequent verbs revealed that all verbs match the Propbank sense, and that all verbs have appropriate rolesets (sets of arguments) in Propbank. These include verbs such as report and identify.

[Figure amia2005_0410f2.jpg: examples for report and identify (image not reproduced)]

It should be noted that the first example includes a second predicate, to diagnose, which is handled separately by Propbank.

B. Propbank frame file describes a different use of a predicate in the working corpus

We created new Framesets for a few sample verbs in our working corpus that matched a verb in Propbank but failed to match the Propbank verb sense (or roleset). These include discharge and enhance (we will first list the Propbank Frameset, such as discharge.01, followed by the new Frameset discharge.02).

[Figure amia2005_0410f3.jpg: Framesets for discharge and enhance (image not reproduced)]

C. No Propbank frame file adequately describes the medical predicate, and a new frame file is created

We found 466 verbs in our working corpus that were not represented in Propbank. For these, we created novel frame files, including frame files for the verbs metastasize and occlude.

[Figure amia2005_0410f4.jpg: frame files for metastasize and occlude (image not reproduced)]

A more complete list of 30 predicates from our working corpus can be found at http://ycmi.med.yale.edu/krauthammer/rolelabeling.htm

DISCUSSION

In this work, we introduced the concept of semantic role labeling to the medical domain. We report first results of porting and adapting an existing resource, Propbank, to the medical field. We found that 76% of verbs in a working corpus of 10,000 abstracts match a Propbank frame file. We examined whether these frame files adequately describe the verb usage in our medical sample corpus (case A), or whether the frame files needed additional Framesets to cover the medical usage of the verb (case B). We realize that the success of our approach relies on the successful identification of the case B frame files, which may complicate the labeling or tagging of medical predicates. As we are not striving to build a complete medical Propbank, we examined whether the frame files are adequate for the 40 most frequent verbs in our corpus. A small study with random sample sentences containing those predicates showed that the verb usage matched the Propbank usage. We are therefore confident that the use of these high-frequency frame files will give satisfactory results, especially in large-scale knowledge-discovery and statistical applications. However, we need a better mechanism to identify case B frame files (such as discharge), which were manually looked up for this project.

We also presented medicine-specific predicates that are not represented in Propbank. In those instances, we were able to generate Propbank-compliant verb frames. We may be able to generate such frame files automatically by determining the VerbNet classes of these verbs (see footnote 1) and copying the rolesets of existing Propbank verbs with matching VerbNet classes. This is obvious for verbs such as underdiagnose or misdiagnose (which were found in our sample corpus), whose senses closely match the Propbank frame file diagnose. A similar approach has been discussed elsewhere [13].
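The proposed shortcut can be sketched in Python as follows. This is a sketch of the idea only, not an implemented component of this work: the class labels ("CLASS-X", "CLASS-Y") and the role descriptions are hypothetical placeholders rather than actual VerbNet classes or Propbank rolesets, and the function name is an assumption.

```python
def propose_roleset(new_verb, verb_class, rolesets):
    """Copy the roleset of a known verb that shares new_verb's class, if any."""
    target = verb_class.get(new_verb)
    if target is None:
        return None
    for verb, roles in rolesets.items():
        if verb != new_verb and verb_class.get(verb) == target:
            return {"copied_from": verb, "roles": dict(roles)}
    return None

# Hypothetical class assignments; real VerbNet class labels are not reproduced here.
verb_class = {
    "diagnose": "CLASS-X",
    "misdiagnose": "CLASS-X",
    "underdiagnose": "CLASS-X",
    "occlude": "CLASS-Y",
}

# Placeholder roleset standing in for the existing frame file of diagnose.
rolesets = {"diagnose": {"Arg0": "placeholder role 0", "Arg1": "placeholder role 1"}}

print(propose_roleset("misdiagnose", verb_class, rolesets))
print(propose_roleset("occlude", verb_class, rolesets))  # no class mate with a roleset
```

Verbs without a class mate in Propbank, such as occlude in this toy setup, would still require a manually constructed frame file.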

We think this work is another indication that language resources from outside medicine are useful for medical natural language processing (a recent report found a strong performance of a non-medical POS tagger in the medical domain [16]). Limitations of the current study include a sub-optimal parsing accuracy of approximately 93%. Nevertheless, we are very excited about these results, as they hopefully allow us to quickly move forward in our quest to build a semantic role labeling-based information extraction system for medical case reports. It has been shown that machine learning algorithms that are trained on parts of Propbank are capable of automatic semantic role labeling with an F-score of around 80 [9]. We are planning to compare different setups, including learning classifiers (for automated role identification) from Propbank, and learning classifiers on hand-annotated sentences from medical case reports. It is obvious that PAS does not cover all aspects of the IE task. While verb nominalizations can possibly be handled well by PAS (such as partial decompression of the tumor), PAS does not extend to issues such as detailed classification of arguments (such as the type of disease) or argument identification (mapping an argument term to a controlled vocabulary). PAS also does not handle body location (such as the left side of the lung), which is an important issue in medical case reports. These aspects have to be dealt with separately.

Footnotes

1. The VerbNet project [13] (which is related to Propbank) groups verbs with similar meanings and the same set of syntactic frames into verb classes. However, defining general arguments for a set of verbs is difficult. Propbank defines rolesets on a verb-by-verb basis, avoiding inconsistencies and achieving a high verb coverage. Where possible, Propbank verbs are linked to their appropriate VerbNet classes.

2. See supporting material for the complete verb frequency distribution: http://ycmi.med.yale.edu/krauthammer/rolelabeling.htm

REFERENCES

1. Friedman C. A broad-coverage natural language processing system. Proc AMIA Symp. 2000:270–4.
2. Hahn U, Romacker M, Schulz S. MEDSYNDIKATE--a natural language system for the extraction of medical information from findings reports. Int J Med Inform. 2002;67(1–3):63–74.
3. Haug PJ, et al. Experience with a mixed semantic/syntactic parser. Proc Annu Symp Comput Appl Med Care. 1995:284–8.
4. Friedman C, Kra P, Rzhetsky A. Two biomedical sublanguages: a description based on the theories of Zellig Harris. J Biomed Inform. 2002;35(4):222–35.
5. Stetson PD, et al. The sublanguage of cross-coverage. Proc AMIA Symp. 2002:742–6.
6. Kingsbury P, Palmer M. From Treebank to Propbank. In: 3rd International Conference on Language Resources and Evaluation (LREC-2002). 2002. Las Palmas.
7. Kingsbury P, Palmer M, Marcus M. Adding Semantic Annotation to the Penn TreeBank. In: Human Language Technology Conference. 2002. San Diego, CA.
8. Abeille A, ed. TREEBANKS - Building and Using Parsed Corpora. Text, Speech and Language Technology, ed. N. Ide and J. Veronis. 2003, Kluwer Academic Publishers: Dordrecht.
9. Pradhan S, et al. Shallow semantic parsing using support vector machines. In: HLT/NAACL. 2004. Boston.
10. Wattarujeekrit T, Shah PK, Collier N. PASBio: predicate-argument structures for event extraction in molecular biology. BMC Bioinformatics. 2004;5(1):155.
11. Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003;36(6):462–77.
12. Srinivasan P, Rindflesch T. Exploring text mining from MEDLINE. Proc AMIA Symp. 2002:722–6.
13. Kipper K, Palmer M, Rambow O. Extending PropBank with VerbNet Semantic Predicates. In: Workshop on Applied Interlinguas AMTA 2002. 2002. Tiburon, CA.
14. Charniak E. A Maximum-Entropy-Inspired Parser. 1999, Brown University.
15. Minnen G, Carroll J, Pearce D. Applied morphological processing of English. Natural Language Engineering. 2001;7(3):207–223.
16. Wermter J, Hahn U. Really Is Medical Sublanguage That Different? Experimental Counter-evidence from Tagging Medical and Newspaper Corpora. Medinfo. 2004;2004:560–4.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association