Display Settings:


Send to:

Choose Destination
See comment in PubMed Commons below
Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W634-7.

NLProt: extracting protein names and sequences from papers.

Author information

  • 1CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA. mika@cubic.bioc.columbia.edu


Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers.

[PubMed - indexed for MEDLINE]
Free PMC Article

Images from this publication.See all images (2)Free text

Figure 1
Figure 2
PubMed Commons home

PubMed Commons

How to join PubMed Commons

    Supplemental Content

    Icon for HighWire Icon for PubMed Central
    Write to the Help Desk