
TaggerOne: Joint Named Entity Recognition and Normalization with Semi-Markov Models

Authors: Robert Leaman and Zhiyong Lu (PI)

Research highlights

TaggerOne is a system for locating and identifying concepts such as diseases and chemicals in biomedical text, as shown in Figure 1. Unlike other concept identification systems, TaggerOne is trainable and not limited to a specific concept type. It employs a novel machine learning model to simultaneously locate (named entity recognition) and identify (normalization) the entity type(s) of interest. This joint model directly addresses term variation, which improves performance.

Figure 1. Example text demonstrating term variation in diseases (yellow) and chemicals (green).

Use cases

TaggerOne is a general toolkit for biomedical named entity recognition and normalization. As a machine learning system, it is not tied to a particular entity type, but it does require training data. The training data needed depends on the use case:

  • To simultaneously perform named entity recognition (NER) and normalization for one entity type, the training data must be annotated with a location (span) and concept identifier for each mention. This data can be created with tools like PubTator. In this use case, TaggerOne also requires a lexicon that lists the entities of the concept type along with their names and known synonyms.
  • TaggerOne can be used to perform only named entity recognition (NER). For this use case, the training data only needs spans (not concept identifiers) and the lexicon is optional.
  • TaggerOne can handle multiple concept types simultaneously. In this case, the training data should contain annotations for all concepts of each type and the lexicon should contain entities for each type.
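
The annotation requirements above can be pictured as simple records. The following is a minimal sketch and not TaggerOne's actual input format: the names `Mention`, `LexiconEntry`, and `check_training_data`, as well as the concept identifiers, are hypothetical and chosen only for illustration. It shows that joint NER and normalization needs a span plus a concept identifier that appears in the lexicon, while NER-only training needs just the span.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Mention:
    start: int                        # character offset where the mention begins
    end: int                          # character offset one past the mention's end
    entity_type: str                  # e.g. "Disease" or "Chemical"
    concept_id: Optional[str] = None  # None is acceptable for NER-only training


@dataclass(frozen=True)
class LexiconEntry:
    concept_id: str
    entity_type: str
    names: Tuple[str, ...]            # preferred name followed by known synonyms


def check_training_data(text: str, mentions: List[Mention],
                        lexicon: List[LexiconEntry],
                        require_ids: bool) -> List[str]:
    """Return a list of problems; an empty list means the data looks usable."""
    known = {(e.entity_type, e.concept_id) for e in lexicon}
    problems = []
    for m in mentions:
        if not (0 <= m.start < m.end <= len(text)):
            problems.append(f"bad span {m.start}-{m.end}")
        if require_ids:
            if m.concept_id is None:
                problems.append(f"missing concept id at {m.start}-{m.end}")
            elif (m.entity_type, m.concept_id) not in known:
                problems.append(f"{m.concept_id} not in lexicon")
    return problems


# Hypothetical example; the identifiers below are placeholders, not real IDs.
text = "Patients with hepatic failure received aspirin."
d = text.index("hepatic failure")
c = text.index("aspirin")
lexicon = [
    LexiconEntry("DISEASE:0001", "Disease", ("hepatic failure", "liver failure")),
    LexiconEntry("CHEMICAL:0001", "Chemical", ("aspirin", "acetylsalicylic acid")),
]
joint = [
    Mention(d, d + len("hepatic failure"), "Disease", "DISEASE:0001"),
    Mention(c, c + len("aspirin"), "Chemical", "CHEMICAL:0001"),
]
ner_only = [Mention(m.start, m.end, m.entity_type) for m in joint]

print(check_training_data(text, joint, lexicon, require_ids=True))
print(check_training_data(text, ner_only, [], require_ids=False))
```

Because each mention and lexicon entry carries its own entity type, the same records cover the multi-type case: one annotated text can mix disease and chemical mentions as long as the lexicon has entries for both types.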

Methodology and implementation

Our model consists of a semi-Markov structured linear classifier, combining a rich feature approach for named entity recognition and supervised semantic indexing for normalization. It is open source, implemented in Java, and has been optimized for high throughput.
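
To illustrate what "semi-Markov" means here, the sketch below decodes a sentence by scoring whole segments rather than individual tokens. It is a simplified illustration under stated assumptions, not the TaggerOne implementation: the function names, the toy lexicon-based scoring function, and the scores are all hypothetical, and the sketch omits label transitions, the normalization score, and training.

```python
def semi_markov_decode(tokens, labels, score, max_len):
    """Find the highest-scoring segmentation of tokens into labeled segments.

    score(tokens, start, end, label) returns the score of labeling
    tokens[start:end] as one segment. Non-entity segments ("O") are
    restricted to single tokens in this sketch.
    """
    n = len(tokens)
    best = [float("-inf")] * (n + 1)   # best[i]: best score for tokens[:i]
    back = [None] * (n + 1)            # back[i]: (start, label) of last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            for label in labels:
                if label == "O" and length > 1:
                    continue
                candidate = best[start] + score(tokens, start, end, label)
                if candidate > best[end]:
                    best[end] = candidate
                    back[end] = (start, label)
    # Follow back-pointers from the end of the sentence to recover segments.
    segments = []
    end = n
    while end > 0:
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    segments.reverse()
    return segments


# Toy scoring function (an assumption for illustration only): reward segments
# whose text appears in a small hypothetical lexicon, penalize other entities.
TOY_LEXICON = {
    "Disease": {"chronic kidney disease", "kidney disease"},
    "Chemical": {"aspirin"},
}


def toy_score(tokens, start, end, label):
    if label == "O":
        return 0.0
    text = " ".join(tokens[start:end]).lower()
    return float(end - start) if text in TOY_LEXICON[label] else -1.0


tokens = ["chronic", "kidney", "disease", "and", "aspirin"]
print(semi_markov_decode(tokens, ["O", "Disease", "Chemical"], toy_score, max_len=3))
```

Because the score applies to the whole segment, the decoder can prefer "chronic kidney disease" as one mention over the shorter "kidney disease". Segment-level scoring is also what allows a joint model to compare an entire candidate mention against lexicon names.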

Figure 2. Overview of the TaggerOne workflow.

Results

TaggerOne was validated by measuring the named entity recognition (NER) and normalization performance on both the NCBI Disease corpus and the BioCreative V Chemical-Disease Relation corpus. TaggerOne achieved the highest reported performance for both diseases (NCBI Disease corpus) and chemicals (BioCreative V Chemical-Disease Relation corpus).

System       NER                              Normalization
             Precision  Recall  F-measure     Precision  Recall  F-measure
TaggerOne    0.851      0.808   0.829         0.822      0.792   0.807
DNorm        0.822      0.775   0.798         0.803      0.763   0.782
Table 1. Evaluation on the NCBI Disease Corpus (disease mentions).

System           NER                              Normalization
                 Precision  Recall  F-measure     Precision  Recall  F-measure
TaggerOne        0.942      0.888   0.914         0.888      0.903   0.895
tmChem model 1   0.932      0.840   0.884         0.950      0.808   0.873
Table 2. Evaluation on the BioCreative V Chemical-Disease Relation corpus (chemical mentions).
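
As a quick consistency check on Tables 1 and 2, the sketch below recomputes each F-measure from the reported precision and recall. It assumes that "F-measure" means the balanced F-measure (the harmonic mean of precision and recall); the function and variable names are illustrative only.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from Tables 1 and 2.
REPORTED = {
    "TaggerOne, diseases, NER": (0.851, 0.808, 0.829),
    "TaggerOne, diseases, normalization": (0.822, 0.792, 0.807),
    "DNorm, diseases, NER": (0.822, 0.775, 0.798),
    "DNorm, diseases, normalization": (0.803, 0.763, 0.782),
    "TaggerOne, chemicals, NER": (0.942, 0.888, 0.914),
    "TaggerOne, chemicals, normalization": (0.888, 0.903, 0.895),
    "tmChem model 1, chemicals, NER": (0.932, 0.840, 0.884),
    "tmChem model 1, chemicals, normalization": (0.950, 0.808, 0.873),
}


def max_discrepancy(rows):
    """Largest absolute gap between recomputed and reported F-measure."""
    return max(abs(f_measure(p, r) - f) for p, r, f in rows.values())


for name, (p, r, f) in REPORTED.items():
    print(f"{name}: computed {f_measure(p, r):.3f}, reported {f:.3f}")
```

Under that assumption, every recomputed value agrees with the reported one to three decimal places.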