The NCBI Disease Corpus

Rezarta Islamaj Doğan
Robert Leaman
Zhiyong Lu

At a glance

The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.

Corpus Characteristics

  • 793 PubMed abstracts
  • 6,892 disease mentions
  • 790 unique disease concepts
    • Medical Subject Headings (MeSH®)
    • Online Mendelian Inheritance in Man (OMIM®)
  • 91% of the mentions map to a single disease concept
  • Divided into training, development, and test sets
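As an illustration of the "map to a single disease concept" figure above, here is a minimal sketch of how such a ratio could be computed. The `Mention` type, the `single_concept_fraction` helper, and the in-memory representation (each mention carrying a list of concept identifiers) are assumptions made for this example and do not describe the corpus file format; the four mentions are toy data, not corpus records.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    """Hypothetical in-memory record for one annotated disease mention."""
    text: str
    concept_ids: List[str]  # MeSH or OMIM identifiers assigned to the mention


def single_concept_fraction(mentions: List[Mention]) -> float:
    """Return the share of mentions that map to exactly one concept."""
    if not mentions:
        return 0.0
    single = sum(1 for m in mentions if len(m.concept_ids) == 1)
    return single / len(mentions)


# Toy data for illustration only; identifiers are not taken from the corpus files.
mentions = [
    Mention("breast cancer", ["D001943"]),
    Mention("colorectal cancer", ["D015179"]),
    Mention("breast and ovarian cancer", ["D001943", "D010051"]),
    Mention("ataxia-telangiectasia", ["D001260"]),
]

fraction = single_concept_fraction(mentions)
print(f"{fraction:.0%} of mentions map to a single concept")
```

On the full corpus, the same calculation over all 6,892 mentions would be expected to give roughly the 91% reported above, provided the mentions are loaded into an equivalent structure.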

Corpus Annotation

  • Fourteen annotators
  • Two annotators per document (randomly paired)
  • Three annotation phases
  • Checked for corpus-wide consistency of annotations

An improved corpus of disease mentions in PubMed citations (ACL-WEB link)
NCBI Disease Corpus: A Resource for Disease Name Recognition and Normalization (PubMed link)
Disease Name Normalization with Pairwise Learning to Rank (PubMed link)

Fig. 1. Illustration of the annotation process, in which 12 annotators worked in pairs on 793 PubMed abstracts to annotate disease names, covering every sentence of each PubMed citation.

We welcome your feedback:
Rezarta Islamaj Doğan, Robert Leaman, Zhiyong Lu

Revised: August 27, 2013.

See also:

NCBI Disease Corpus (Mention Level)
NCBI Disease Corpus (Complete - Train set)
NCBI Disease Corpus (Complete - Development set)
NCBI Disease Corpus (Complete - Test set)

Public Domain Notice
Characteristics and Results
Annotation Guidelines