AI Datasets

Description
The NCBI disease corpus is a collection of 793 PubMed abstracts fully annotated at both the mention and concept levels.
Description
The BC5CDR corpus consists of 1,500 PubMed articles with 4,409 annotated chemicals, 5,818 annotated diseases, and 3,116 chemical-disease interactions.
Description
The tmVar corpus contains 500 PubMed articles manually annotated with mutation mentions of various kinds.
Description
The weakly labeled corpus used by Peng et al. (2016) consists of 18,410 abstracts and 33,224 chemical-induced disease (CID) relations. The raw data were extracted from curated data in the CTD-Pfizer collaboration, which carry document-level annotations of drug-disease and drug-phenotype interactions. We applied tmChem and DNorm to recognize and normalize chemical and disease mentions, respectively. To maximize recall, we also applied a dictionary look-up method with a controlled vocabulary (MeSH). Finally, we filtered out those without CID relations in the title or abstract, because some asserted relations appear only in the full text.
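The dictionary look-up and filtering steps described above can be illustrated with a minimal sketch. It does not call tmChem or DNorm and is not the authors' implementation. The vocabulary, the concept identifiers, the document records, and the function names `find_concepts` and `keep_supported` are all hypothetical, chosen only to show the idea of keeping a document-level relation when both of its concepts can be found in the title or abstract text.

```python
import re

# Hypothetical toy vocabulary: surface term -> (entity type, made-up concept ID).
# A real run would load terms from a controlled vocabulary such as MeSH.
VOCAB = {
    "aspirin": ("Chemical", "C:0001"),
    "acetylsalicylic acid": ("Chemical", "C:0001"),
    "gastric ulcer": ("Disease", "D:0001"),
    "headache": ("Disease", "D:0002"),
}


def find_concepts(text):
    """Return the concept IDs whose terms occur in text (case-insensitive, whole words)."""
    lowered = text.lower()
    found = set()
    for term, (_entity_type, concept_id) in VOCAB.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(concept_id)
    return found


def keep_supported(docs):
    """Keep documents with at least one relation whose two concepts both appear in the text.

    Relations whose concepts are not both found are dropped from the kept documents.
    """
    kept = []
    for doc in docs:
        present = find_concepts(doc["text"])
        supported = [
            (chem, dis) for chem, dis in doc["relations"]
            if chem in present and dis in present
        ]
        if supported:
            kept.append({**doc, "relations": supported})
    return kept


# Hypothetical records: title/abstract text plus document-level (chemical, disease) pairs.
docs = [
    {
        "id": "doc1",
        "text": "Acetylsalicylic acid was associated with gastric ulcer in 12 patients.",
        "relations": [("C:0001", "D:0001")],
    },
    {
        "id": "doc2",
        "text": "We review the pharmacology of aspirin.",
        "relations": [("C:0001", "D:0002")],
    },
]

filtered = keep_supported(docs)
print([doc["id"] for doc in filtered])
```

In this toy input, the second record is dropped because its disease concept never appears in the text, which mirrors the case where an asserted relation is only supported by the full text.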
Description
The dataset contains a collection of 705,915 PubMed Phrases (Kim et al., 2018) that are beneficial for information retrieval and human comprehension.
Description
The NLM-Chem corpus is a manually annotated full-text resource on chemicals in the biomedical literature. The corpus contains 150 full-text journal articles, selected both for being rich in chemical mentions and for being articles where human annotation was expected to be most valuable. The corpus was doubly annotated by ten expert NLM indexers, with high inter-annotator agreement, and contains ~5000 unique chemical name annotations mapped to ~2000 MeSH identifiers.