Stud Health Technol Inform. 2019 Aug 21;264:283-287. doi: 10.3233/SHTI190228.

Impact of De-Identification on Clinical Text Classification Using Traditional and Deep Learning Classifiers.

Author information

Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA.
Department of Clinical Pharmacy and Outcome Sciences, Medical University of South Carolina, Charleston, SC, USA.
Department of Emergency Medicine, Medical University of South Carolina, Charleston, SC, USA.
Department of Computer Science, University of South Carolina, Columbia, SC, USA.


Clinical text de-identification enables collaborative research while protecting patient privacy and confidentiality; however, concerns persist that de-identified text is less useful for information extraction and machine learning tasks. In the context of a deep learning experiment to detect altered mental status in emergency department provider notes, we tested several classifiers on clinical notes in their original form and on their automatically de-identified counterparts. We tested both traditional bag-of-words-based machine learning models and word-embedding-based deep learning models. We evaluated the models on 1,113 history of present illness notes. A total of 1,795 protected health information tokens were replaced in the de-identification process across all notes. The deep learning models performed best, with accuracies of 95% on both original and de-identified notes. However, none of the models showed a significant difference in performance between the original and the de-identified notes.
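
The comparison described above is paired: the same notes are classified once in original form and once after de-identification. The abstract does not state which significance test was used, so the following is only a minimal sketch that assumes an exact McNemar test, one common choice for paired predictions. The function names and the toy labels are hypothetical illustrations, not data or code from the study.

```python
from math import comb


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two paired prediction lists.

    Assumption: this test is chosen for illustration only; the abstract
    does not name the test the authors applied.
    """
    # Discordant pairs: notes where exactly one of the two runs is correct.
    a_only = sum(pa == g and pb != g for g, pa, pb in zip(gold, pred_a, pred_b))
    b_only = sum(pa != g and pb == g for g, pa, pb in zip(gold, pred_a, pred_b))
    n = a_only + b_only
    if n == 0:
        return 1.0  # the two runs never disagree in correctness
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy example (made-up labels): 1 = altered mental status, 0 = not.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred_original = [1, 0, 1, 1, 0, 0, 0, 0]
pred_deidentified = [1, 0, 1, 0, 0, 0, 1, 0]

print(accuracy(gold, pred_original), accuracy(gold, pred_deidentified))
print(mcnemar_exact(gold, pred_original, pred_deidentified))
```

In this toy case both runs reach the same accuracy and disagree on only two notes, one in each direction, so the test gives no evidence of a difference.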


Data Anonymization; Machine Learning; Natural Language Processing

[Indexed for MEDLINE]
Free PMC Article
