Named Entity Recognition and Relationship Extraction in Biomedicine
Introduction
Mining useful knowledge from the biomedical literature holds potential for facilitating literature search, biological database
curation, and many other scientific tasks. A key step toward that goal is recognizing the various types of biological
entities (e.g. genes and gene products), as they are the research focus of most biomedical studies.
Indeed, our previous investigation revealed that most PubMed users search for publications mentioning those biomedical
concepts. For instance, approximately 20% of PubMed queries contain a gene/protein name.
Goals and Objectives
Our overall goal is to develop automated techniques to identify and annotate various biological entities and concepts (e.g. gene names) in the biomedical literature. Furthermore, we aim to develop state-of-the-art computational technologies for automatically extracting biologically meaningful relationships between those pre-identified entities in free text.
Team Members
Research Highlights
- Gene Normalization Task at BioCreative III:
In view of the importance and difficulty of identifying gene names in free text, we organized the Gene Normalization (GN) challenge task in BioCreative III to engage the text mining community in advancing the state of the art in this area.
By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results [Lu et al., BMC Bioinformatics, 2011].
- Disease Name Recognition:
In addition to recognizing gene names, we investigated methods for the automatic detection of disease concepts in different
genres of biomedical text [Neveol et al., BioNLP 2009; Mork et al., JAMIA 2010] that come from various
sources [Neveol et al., IHI 2012]. We examined several automatic methods, including both statistical and rule-based
approaches. For more accurate evaluation, we are developing new disease corpora consisting of human annotations of several
hundred abstracts containing disease mentions.
- Relation Extraction:
We also explored means of automatically identifying relationships between various biological entities, in an effort to build an
end-to-end system that includes both entity recognition and relationship extraction. In this research, we used the data from the
4th i2b2 challenge, a corpus of fully de-identified medical records with manually annotated information for clinical
concepts (e.g. medical problems) and relationships (e.g. treatments improve medical problems). We applied a support vector machine (SVM) with a customized
feature representation schema and achieved better performance than other state-of-the-art approaches
[Dogan et al., BMC Bioinformatics, 2011].
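To make the end-to-end idea above concrete, the following is a minimal sketch of rule-based entity recognition followed by candidate relation generation. It is not the system described in the cited papers. The lexicon, its placeholder identifiers, and the function names `recognize` and `candidate_pairs` are all assumptions made for illustration. A real system would draw names from curated resources and would pass the resulting features to a trained classifier such as an SVM.

```python
import re
from itertools import combinations

# Toy lexicon: lowercased surface form -> (entity type, identifier).
# The identifiers are made-up placeholders, not real database accessions.
LEXICON = {
    "brca1": ("Gene", "TOY:G1"),
    "breast cancer 1": ("Gene", "TOY:G1"),
    "tp53": ("Gene", "TOY:G2"),
    "breast cancer": ("Disease", "TOY:D1"),
    "tamoxifen": ("Treatment", "TOY:T1"),
}

# Longest names come first so "breast cancer 1" is tried before "breast cancer".
_NAMES = sorted(LEXICON, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _NAMES) + r")\b",
    re.IGNORECASE,
)


def recognize(text):
    """Return (start, end, mention, type, identifier) for each lexicon hit."""
    hits = []
    for match in _PATTERN.finditer(text):
        etype, ident = LEXICON[match.group(1).lower()]
        hits.append((match.start(), match.end(), match.group(1), etype, ident))
    return hits


def candidate_pairs(text):
    """Pair recognized entities and collect the words between each pair.

    The "between" words are one simple feature a relation classifier
    could use; they are an illustrative choice, not a published schema.
    """
    pairs = []
    for first, second in combinations(recognize(text), 2):
        between = re.findall(r"\w+", text[first[1]:second[0]].lower())
        pairs.append({
            "arg1": first[4],
            "arg2": second[4],
            "types": (first[3], second[3]),
            "between": between,
        })
    return pairs


if __name__ == "__main__":
    sentence = "Tamoxifen improves outcomes in breast cancer patients with BRCA1 mutations."
    for pair in candidate_pairs(sentence):
        print(pair)
```

Mapping both "BRCA1" and "breast cancer 1" to the same identifier is a small-scale version of normalization: different surface forms resolve to one concept. Dictionary matching of this kind cannot resolve ambiguous names, which is one reason statistical methods are examined alongside rule-based ones.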
Selected Publications
- Neveol et al., Linking multiple disease-related resources through UMLS, to appear at IHI 2012.
- Lu et al., The Gene Normalization Task in BioCreative III, to appear in BMC Bioinformatics.
- Dogan et al., A context-blocks model for identifying clinical relationships in patient records, BMC Bioinformatics, 2011 (free access).
- Mork et al., Extracting Rx Information from Clinical Narrative, JAMIA, 2010 (free access).
- Neveol et al., Exploring two biomedical text genres for disease recognition, BioNLP 2009 (free access).