AMIA Jt Summits Transl Sci Proc. 2019 May 6;2019:462-471. eCollection 2019.

Efficient Active Learning for Electronic Medical Record De-identification.

Author information

Vanderbilt University, Nashville, TN, USA.
Privacy Analytics, Ottawa, ON, Canada.
Google, Mountain View, CA.


Electronic medical records are often de-identified before being disseminated for secondary uses. However, unstructured natural language records are challenging to de-identify, and doing so requires a considerable amount of expensive human annotation. In this investigation, we incorporate active learning into the de-identification workflow to reduce annotation requirements. We apply this approach to a real clinical trials dataset and a publicly available i2b2 dataset to illustrate that, when the machine learning de-identification system can actively request information from beyond the system (e.g., from a knowledgeable human assistant) to help create a better model, less training data is needed to maintain or improve the performance of trained models than in the typical passive learning framework. Specifically, with a batch size of 10 documents, an active learning approach requires only 40 documents to reach an F-measure of 0.9, while passive learning needs at least 25% more data to train a comparable model.
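The batch-wise workflow described in the abstract can be illustrated with a minimal sketch. The abstract does not state which query strategy the authors use, so the least-confidence heuristic below is an assumption chosen only for illustration, and every name (`least_confidence`, `select_batch`, `active_learning`, `predict_confidences`) is hypothetical. The defaults of 10 documents per batch and a 40-document budget mirror the figures quoted above.

```python
from typing import Callable, Dict, List, Sequence


def least_confidence(token_probs: Sequence[float]) -> float:
    """Document-level uncertainty: mean of (1 - p) over per-token confidences.

    Assumption: each p is the model's probability for its predicted label.
    """
    if not token_probs:
        return 0.0
    return sum(1.0 - p for p in token_probs) / len(token_probs)


def select_batch(pool: Dict[str, Sequence[float]], batch_size: int) -> List[str]:
    """Return the ids of the `batch_size` most uncertain documents.

    Ties are broken by document id so the result is deterministic.
    """
    ranked = sorted(pool, key=lambda doc_id: (-least_confidence(pool[doc_id]), doc_id))
    return ranked[:batch_size]


def active_learning(
    unlabeled: List[str],
    predict_confidences: Callable[[List[str], str], Sequence[float]],
    batch_size: int = 10,
    budget: int = 40,
) -> List[str]:
    """Pick documents for annotation in batches until the budget is spent.

    `predict_confidences(labeled, doc_id)` is a hypothetical stand-in for
    retraining on the documents labeled so far and scoring one unlabeled
    document; a real system would train once per round, not once per call.
    """
    labeled: List[str] = []
    remaining = list(unlabeled)
    while remaining and len(labeled) < budget:
        k = min(batch_size, budget - len(labeled))
        pool = {doc_id: predict_confidences(labeled, doc_id) for doc_id in remaining}
        batch = select_batch(pool, k)
        # In practice, a human annotator would label `batch` at this point.
        labeled.extend(batch)
        chosen = set(batch)
        remaining = [doc_id for doc_id in remaining if doc_id not in chosen]
    return labeled
```

Passive learning corresponds to replacing `select_batch` with a random draw from the pool; the comparison in the abstract is between those two ways of choosing what to annotate next.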

