Stud Health Technol Inform. 2018;253:165-169.

De-Identification of German Medical Admission Notes.

Author information

1 Department of Computational Linguistics, University of Heidelberg, Heidelberg, Germany.
2 Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology and Department of Internal Medicine III, University Hospital Heidelberg, German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim.

Abstract

Medical texts are a vast resource for medical and computational research. In contrast to newswire or Wikipedia texts, medical texts must be de-identified before they can be made accessible to the wider NLP research community. We created a prototype for German medical text de-identification and named entity recognition using a three-step approach. First, we used well-known rule-based models based on regular expressions and gazetteers. Second, we used a spelling variant detector based on Levenshtein distance, exploiting the fact that the medical texts contain semi-structured headers that include sensitive personal data. Third, we trained a named entity recognition model on out-of-domain data to add statistical capabilities to our prototype. Compared with a baseline based on regular expressions and gazetteers, we improved the F2-score for de-identification from 78% to 85%. Our prototype is a first step toward further research on German medical text de-identification and shows that spelling variant detection and statistical models trained on out-of-domain data can improve de-identification performance significantly.
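
To make the second step and the evaluation metric concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that personal terms (for example a surname) have already been extracted from a note's semi-structured header, and it flags tokens in the free text that lie within a small Levenshtein distance of those terms. The function names `levenshtein`, `find_variants`, and `f_beta`, the threshold `max_distance=1`, and the sample tokens are all hypothetical choices made for illustration. The `f_beta` helper uses the standard F-beta definition; with beta = 2 it weights recall more heavily than precision, which suits de-identification, where a missed identifier is costlier than a false alarm.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def find_variants(header_terms, body_tokens, max_distance=1):
    """Return body tokens within max_distance edits of any header term.

    Assumption: comparison is case-insensitive and the threshold is fixed.
    A real system would likely scale the threshold with token length,
    since short tokens produce many accidental near-matches.
    """
    hits = []
    for token in body_tokens:
        for term in header_terms:
            if levenshtein(token.lower(), term.lower()) <= max_distance:
                hits.append(token)
                break  # one matching header term is enough
    return hits


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 favours recall over precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Hypothetical header surname and free-text tokens.
    header = ["Meier"]
    body = ["Herr", "Meyer", "wurde", "aufgenommen"]
    print(find_variants(header, body))
    print(round(f_beta(0.5, 1.0), 3))
```

A fixed threshold keeps the sketch short. Whether an exact match (distance 0) should also be reported here or left to the rule-based first step is a design choice the abstract does not specify.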

KEYWORDS:

De-identification; anonymization; medical admission notes; named entity recognition; personal health information

PMID: 30147065 [Indexed for MEDLINE]
