Send to

Choose Destination
J Biomed Inform. 2015 Dec;58 Suppl:S11-9. doi: 10.1016/j.jbi.2015.06.007. Epub 2015 Jul 28.

Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1.

Author information

School of Library and Information Science, Simmons College, Boston, MA, USA. Electronic address:
Department of Information Studies, State University of New York at Albany, Albany, NY, USA.


The 2014 i2b2/UTHealth Natural Language Processing (NLP) shared task featured four tracks. The first of these was the de-identification track focused on identifying protected health information (PHI) in longitudinal clinical narratives. The longitudinal nature of clinical narratives calls particular attention to details of information that, while benign on their own in separate records, can lead to identification of patients in combination in longitudinal records. Accordingly, the 2014 de-identification track addressed a broader set of entities and PHI than covered by the Health Insurance Portability and Accountability Act - the focus of the de-identification shared task that was organized in 2006. Ten teams tackled the 2014 de-identification task and submitted 22 system outputs for evaluation. Each team was evaluated on their best performing system output. Three of the 10 systems achieved F1 scores over .90, and seven of the top 10 scored over .75. The most successful systems combined conditional random fields and hand-written rules. Our findings indicate that automated systems can be very effective for this task, but that de-identification is not yet a solved problem.


Machine learning; Medical records; Natural language processing; Shared task

[Indexed for MEDLINE]
Free PMC Article

Supplemental Content

Full text links

Icon for Elsevier Science Icon for PubMed Central
Loading ...
Support Center