J Biomed Inform. 2002 Oct-Dec;35(5-6):322-30.

Automatically identifying gene/protein terms in MEDLINE abstracts.

Author information

Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY 10027, USA. Hongyu@cs.columbia.edu

Abstract

MOTIVATION:

Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, identifying terms that correspond to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the markup process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore, our methods may also be used for automatic lexicon generation.
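
To make the idea of a symbol/full-name pair concrete, the following is a minimal sketch of one simple way such a pair could be checked: comparing a symbol against the initial letters of a candidate full name. This is an illustrative heuristic only and is not the method used by GPmarkup, which the abstract does not describe. The function name `matches_initials` and the `STOPWORDS` list are assumptions introduced for this example.

```python
# Illustrative heuristic only; NOT the method used by GPmarkup.
# STOPWORDS is an assumed list of short function words to skip.
STOPWORDS = {"of", "the", "and", "for", "in"}


def matches_initials(symbol: str, full_name: str) -> bool:
    """Return True if the symbol equals the initials of the full name's
    content words (case-insensitive, stopwords ignored)."""
    # Treat hyphens as word separators, then drop stopwords.
    words = [
        w for w in full_name.replace("-", " ").split()
        if w.lower() not in STOPWORDS
    ]
    initials = "".join(w[0] for w in words)
    return initials.upper() == symbol.upper()


# The pair quoted in the abstract: "of" is skipped, leaving L-A-R-D.
print(matches_initials("LARD", "lymphocyte associated receptor of death"))  # True
# A non-matching full name yields initials T-N-F.
print(matches_initials("LARD", "tumor necrosis factor"))  # False
```

Many real gene/protein symbols do not follow their full names this closely, so a check like this would miss a large share of valid pairs.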

RESULTS:

GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts.
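
For readers unfamiliar with the two metrics, here is a short sketch of how recall and precision are conventionally computed from counts of correct, spurious, and missed terms. The counts below are made up for illustration and are not taken from the GPmarkup evaluation; the function name `precision_recall` is likewise an assumption.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Compute (precision, recall) from term-level counts.

    Assumes at least one term was marked up and at least one
    reference term exists, so neither denominator is zero.
    """
    # Precision: fraction of marked-up terms that are correct.
    precision = true_pos / (true_pos + false_pos)
    # Recall: fraction of reference terms that were marked up.
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts: 8 correct terms, 2 spurious, 4 missed.
p, r = precision_recall(true_pos=8, false_pos=2, false_neg=4)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.67
```

Read this way, the reported figures mean that most terms the system marks up are correct, while a noticeable share of gene/protein terms in the abstracts go unmarked.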

AVAILABILITY:

A random sample of gene/protein symbols and full names and a sample set of marked-up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/. Contact: hy52@columbia.edu. Voice: 212-939-7028; fax: 212-666-0140.

PMID: 12968781 [Indexed for MEDLINE]