J Biomed Inform. 2017 May;69:86-92. doi: 10.1016/j.jbi.2017.04.003. Epub 2017 Apr 4.

Crowd control: Effectively utilizing unscreened crowd workers for biomedical data annotation.

Author information

1. Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, United States; Computer and Information Sciences Department, University of Pennsylvania, United States. Electronic address: acocos@seas.upenn.edu.
2. Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, United States. Electronic address: qiant@email.chop.edu.
3. Computer and Information Sciences Department, University of Pennsylvania, United States.
4. Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, United States. Electronic address: masinoa@email.chop.edu.

Abstract

Annotating unstructured text in electronic health record (EHR) data is usually a necessary step for conducting machine learning research on such datasets. Manual annotation by domain experts provides the highest-quality data, but has become increasingly impractical given the rapid growth in the volume of EHR data. In this article, we examine the effectiveness of crowdsourcing with unscreened online workers as an alternative for transforming unstructured text in EHRs into annotated data that are directly usable in supervised learning models. We find the crowdsourced annotation data to be just as effective as expert data in training a sentence classification model to detect mentions of abnormal ear anatomy in audiology radiology reports. Furthermore, we find that enabling workers to self-report a confidence level for each annotation can help researchers pinpoint less-accurate annotations requiring expert scrutiny. Our findings suggest that even crowd workers without specific domain knowledge can contribute effectively to the task of annotating unstructured EHR datasets.
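The idea of using self-reported confidence to route annotations to experts can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract does not say how crowd labels were aggregated or how confidence was scored. The function name `aggregate`, the binary labels, the confidence scale in [0, 1], the majority-vote rule, and the 0.6 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate(annotations, confidence_threshold=0.6):
    """Combine crowd labels per sentence and flag low-confidence items.

    annotations: iterable of (sentence_id, worker_id, label, confidence),
    where label is 0 or 1 and confidence is a self-reported value in [0, 1].
    All of these conventions are hypothetical, not taken from the paper.
    """
    votes_by_sentence = defaultdict(list)
    for sentence_id, _worker_id, label, confidence in annotations:
        votes_by_sentence[sentence_id].append((label, confidence))

    results = {}
    for sentence_id, votes in votes_by_sentence.items():
        positives = sum(1 for label, _ in votes if label == 1)
        # Simple majority vote; ties fall back to the negative label.
        majority = 1 if positives * 2 > len(votes) else 0
        avg_confidence = mean(conf for _, conf in votes)
        results[sentence_id] = {
            "label": majority,
            "mean_confidence": avg_confidence,
            # Assumed rule: low average confidence triggers expert review.
            "needs_expert_review": avg_confidence < confidence_threshold,
        }
    return results


if __name__ == "__main__":
    example = [
        ("s1", "w1", 1, 0.9), ("s1", "w2", 1, 0.8), ("s1", "w3", 0, 0.7),
        ("s2", "w1", 0, 0.3), ("s2", "w2", 1, 0.4), ("s2", "w3", 0, 0.5),
    ]
    for sentence_id, summary in aggregate(example).items():
        print(sentence_id, summary)
```

In this sketch, sentences whose workers report low average confidence are set aside for expert scrutiny, while the rest keep their majority label and could then be used to train a sentence classifier.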

KEYWORDS:

Crowdsourcing; EHR data; Logistic regression; Sentence classification; Text annotations

PMID: 28389234
DOI: 10.1016/j.jbi.2017.04.003