NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
Bioinformatics. Author manuscript; available in PMC Aug 14, 2008.
Published in final edited form as:
PMCID: PMC2516306
NIHMSID: NIHMS58265

MutationFinder: a high-performance system for extracting point mutation mentions from text

J. Gregory Caporaso,1,2,* William A. Baumgartner, Jr,2 David A. Randolph,3,4 K. Bretonnel Cohen,2,5 and Lawrence Hunter2
Monitoring Editor: Alfonso Valencia
1Department of Biochemistry and Molecular Genetics, University of Colorado Health Sciences Center, Aurora, CO
2Center for Computational Pharmacology, University of Colorado Health Sciences Center, Aurora, CO
3Motorola Mobile Devices, Libertyville, IL
4Department of Computer Science, University of Colorado at Boulder, Boulder, CO, USA
5Department of Linguistics, University of Colorado at Boulder, Boulder, CO, USA

Abstract

Summary

Discussion of point mutations is ubiquitous in biomedical literature, and manually compiling databases or literature on mutations in specific genes or proteins is tedious. We present an open-source, rule-based system, MutationFinder, for extracting point mutation mentions from text. On blind test data, it achieves nearly perfect precision and a markedly improved recall over a baseline.

Availability

MutationFinder, a high-quality gold standard data set and a scoring script for mutation extraction systems have been made publicly available. Implementations, source code and unit tests are available in Python, Perl and Java. MutationFinder can be used as a stand-alone script, or imported by other applications.

1 INTRODUCTION

Comprehensive and accurate data on mutations that have been studied in specific genes or proteins is often required by researchers, but is time-consuming and tedious to compile manually. Traditional keyword-based information retrieval is often inadequate for retrieving mutation literature because mutation mentions are often abbreviated in non-standard ways. Automatically recognizing mentions of point mutations in text is therefore important for basic information retrieval, in addition to providing the framework for more complex processing of biomedical literature, such as automating the construction of mutation databases. We present MutationFinder, an open-source, rule-based system for recognizing descriptions of point mutations in text and extracting them into consistent and unambiguous representations.

The seminal papers on extracting mentions of point mutations from text (Horn et al., 2004; Rebholz-Schuhmann et al., 2004) provide the initial bases on which to develop these systems. These techniques have proven useful in applications such as OSIRIS (Bonis et al., 2006), an IR engine for compiling literature documenting mutations in specific proteins, and Mutation Miner (Baker and Witte, 2006), which annotates protein structures with mutation data extracted from biomedical literature. However, an information extraction system for extracting descriptions of point mutations from free text has not been made publicly available, and a corpus for directly comparing different approaches is lacking.

The novel contributions of this work are (i) improved precision and recall over existing approaches for identifying mutation mentions in text, (ii) an open-source and publicly available system for extracting mutation mentions from text, (iii) a large, high-quality gold standard data set for judging and comparing the performance of mutation extraction systems and (iv) a script for judging the performance of mutation mention extraction systems.

2 METHODS

2.1 Baseline mutation extraction system

A partial reimplementation of MuteXt served as a baseline for performance comparisons. The rules described in Horn et al. (2004) for recognizing point mutation mentions were implemented, although extracted mutations were not validated against sequence data; this validation step would likely have led to higher precision but lower recall in the baseline system.

2.2 MutationFinder

MutationFinder builds on our baseline system, and similarly uses regular expressions to identify mutations. MutationFinder's regular expressions are documented with examples in the source code. The baseline system is modified in six ways to improve precision and recall:

  1. wNm-format mentions (see footnote 1) with one-letter abbreviations must have N>9;
  2. wNm-format mentions with one-letter abbreviations must appear in upper-case letters;
  3. Wild-type and mutant residue/base identities must not be the same;
  4. MutationFinder specifies patterns incorporating non-alphanumeric characters, whereas the baseline system removes non-alphanumerics;
  5. MutationFinder identifies mutations described in natural language, as opposed to completely abbreviated formats (see footnote 2), with specific patterns, whereas the baseline system uses a heuristic to match these mentions;
  6. MutationFinder splits text on sentences and applies its regular expressions to each sentence, whereas the baseline system splits both on words and sentences and applies different regular expressions to each.

These modifications yield the performance data presented in this document; 1 through 5 result in improvements to precision, and 4 through 6 result in improvements to recall. Modifications to future versions of MutationFinder will be described via updates to the software documentation.
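To make the first three rules and the sentence-level matching concrete, the following is a minimal Python sketch. It is not MutationFinder's actual regular expression set (those are documented in the distributed source code); the pattern, the function names and the crude sentence splitter are all assumptions made for illustration. The position threshold follows the statement in the Summary that the sequence position must be greater than 9 in single-letter mentions.

```python
import re

# Assumption: threshold taken from the Summary ("sequence position be greater than 9").
MIN_POSITION = 9

# Hypothetical pattern for one-letter wNm mentions such as "A42G".
# The character classes are upper-case only, so matching is case-sensitive.
ONE_LETTER_WNM = re.compile(
    r"(?<![A-Za-z0-9])"                   # not preceded by a letter or digit
    r"(?P<wt>[ACDEFGHIKLMNPQRSTVWY])"     # wild-type residue/base
    r"(?P<pos>[0-9]+)"                    # sequence position
    r"(?P<mut>[ACDEFGHIKLMNPQRSTVWY])"    # mutant residue/base
    r"(?![A-Za-z0-9])"                    # not followed by a letter or digit
)


def split_sentences(text):
    """Very rough sentence splitter (an assumption, not the system's own)."""
    return re.split(r"(?<=[.!?])\s+", text)


def extract_one_letter_mutations(text):
    """Return wNm strings found in text, applying the three filters."""
    mutations = []
    for sentence in split_sentences(text):
        for match in ONE_LETTER_WNM.finditer(sentence):
            wt, pos, mut = match.group("wt", "pos", "mut")
            if int(pos) <= MIN_POSITION:  # position must exceed the threshold
                continue
            if wt == mut:                 # wild-type and mutant must differ
                continue
            mutations.append(f"{wt}{pos}{mut}")
    return mutations
```

Under these assumptions, a gene name such as E2F is rejected by the position filter, while a mutation-like cell line name such as T98G would still be extracted.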

2.3 Gold standard data

We created two independent gold standard data sets: one for developing and one for evaluating the performance of mutation extraction systems. The data sets consisted of abstracts, which were randomly selected from the set of primary citations in the PDB for mutant protein structures. Annotation was performed with Knowtator (Ogren, 2006), using an ontology that was developed for describing point mutations. To build the development data set, the primary system developer (Caporaso) annotated 605 point mutation mentions in 305 abstracts.

This data set informed the manual development of MutationFinder's regular expressions. Development and tuning of these regular expressions for mutation extraction was performed by testing variations of MutationFinder on this mutation corpus (e.g. by comparing case-sensitive and case-insensitive regular expressions when matching amino-acid residue names). These performance comparisons additionally motivated the six modifications presented earlier.

To test MutationFinder, the system was applied to an independent (completely blind) test data set. This data set consisted of 910 mutation mentions from 508 article abstracts, annotated by two of the authors not involved in system development (Baumgartner and Randolph; see footnote 3). To calculate interannotator agreement, fifty of the abstracts annotated by Caporaso were randomly selected and dispersed throughout these annotators' corpora. Mean pairwise interannotator agreement, calculated on the fifty overlapping abstracts, was 94%. (These fifty abstracts were not included in the test data.) In total, 1515 point mutation mentions were annotated in 813 abstracts, and we have made this complete mutation corpus publicly available.

2.4 Performance metrics

Systems were compared in terms of precision, recall and F-measure on three separate metrics: Extracted Mentions, Normalized Mutations and Document Retrieval. When calculating Extracted Mentions, each mention of a mutation (including duplicates) must be identified. When calculating Normalized Mutations, on the other hand, a perfect score can be achieved if each of the mutations that is mentioned in text is extracted at least once, even if some mentions are missed. Last, Document Retrieval measures a system's ability to recognize if any mutations have been mentioned in a document. Extracted Mentions is the strictest metric in terms of recall, and is perhaps the most discriminative for judging the performance of the information extraction system (since each missed mutation mention is scored as a false negative). Normalized Mutations is the best measure of a system's utility in terms of document-level entity recognition, whereas Document Retrieval measures a system's utility in terms of document-level information retrieval.
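The three metrics can be illustrated with a short Python sketch. This is not the distributed scoring script; it assumes that gold and predicted annotations are given as dictionaries mapping a document identifier to a list of normalized mutation strings, and it counts Extracted Mentions by multiplicity only, without comparing text offsets. The function names `prf` and `score` are hypothetical.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def score(gold, predicted):
    """Score predictions on the three metrics; inputs map doc id -> list of mutations."""
    counts = {
        "Extracted Mentions": [0, 0, 0],    # tp, fp, fn
        "Normalized Mutations": [0, 0, 0],
        "Document Retrieval": [0, 0, 0],
    }
    for doc_id in gold.keys() | predicted.keys():
        g = Counter(gold.get(doc_id, []))
        p = Counter(predicted.get(doc_id, []))

        # Extracted Mentions: every mention counts, duplicates included.
        em = counts["Extracted Mentions"]
        em[0] += sum((g & p).values())
        em[1] += sum((p - g).values())
        em[2] += sum((g - p).values())

        # Normalized Mutations: each distinct mutation counts once per document.
        nm = counts["Normalized Mutations"]
        nm[0] += len(set(g) & set(p))
        nm[1] += len(set(p) - set(g))
        nm[2] += len(set(g) - set(p))

        # Document Retrieval: does the document mention any mutation at all?
        dr = counts["Document Retrieval"]
        dr[0] += int(bool(g) and bool(p))
        dr[1] += int(bool(p) and not g)
        dr[2] += int(bool(g) and not p)

    return {name: prf(*c) for name, c in counts.items()}
```

In this sketch, a system that finds one of two identical mentions of a mutation in a document loses recall on Extracted Mentions but is not penalized on Normalized Mutations.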

3 SYSTEM DETAILS

Three functionally identical implementations of MutationFinder, in Python, Perl and Java, along with source code, extensive unit tests, the annotated development and test data and a scoring script for mutation extraction systems, are available at http://bionlp.sourceforge.net. MutationFinder can be used either as a stand-alone application, or in other Python, Perl or Java applications. We provide our annotation set and scoring script with the hope that these resources will foster continued work on this problem and provide a means for comparing approaches between groups.

The simplest way to use MutationFinder is via the provided scripts. These take as input a single file comprising a document collection, where each line in the input file defines an independent ‘document’ in the following format:

identifier <tab> text

Our tests utilize PubMed identifiers and article abstracts, but any combination of identifiers and text sources may be used. For example, if users wish to process abstracts at the sentence level, they can split their abstracts on sentences and assign a unique identifier to each. As output, the system prints one line for each text source in the following tab-delimited format:

identifier <tab> mutation1 <tab> mutation2

Mutations are listed in wNm format with one-letter residue/base abbreviations. Start and end byte-offsets of the matched text span can optionally be provided for each mention as well.
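The input and output formats described above can be sketched in Python as follows. This does not reproduce the provided scripts or their options; `toy_extractor` is a hypothetical stand-in for the real extractor, and `process_documents` is an illustrative name. The sketch only shows how one tab-delimited input line maps to one tab-delimited output line.

```python
import re


def toy_extractor(text):
    """Hypothetical stand-in extractor: returns wNm-like strings found in text."""
    return re.findall(r"\b[A-Z][1-9][0-9]+[A-Z]\b", text)


def process_documents(lines, extractor=toy_extractor):
    """Map 'identifier<TAB>text' lines to 'identifier<TAB>mutation...' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so the text itself may contain tabs.
        identifier, _, text = line.partition("\t")
        mutations = extractor(text)
        yield "\t".join([identifier, *mutations])
```

A document with no extracted mutations yields a line containing only its identifier, which keeps the output aligned one-to-one with the input.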

4 SYSTEM PERFORMANCE

MutationFinder was compared with the baseline system on our independent test collection. It achieved nearly perfect precision and a marked increase in recall over the baseline system on this blind test set (Table 1).

Table 1
MutationFinder and a baseline system are compared in terms of precision (P), recall (R) and F-measure(F)

To estimate the error of MutationFinder and the baseline system, the independent test set was subsampled into 100 random data sets, ranging in size from 40% of the test set (204 entries) to 85% of the test set (431 entries). (This range was selected to allow for performance variability which might arise from having test sets of different sizes, but constrained to avoid variance from being over- or under-estimated.) Both systems exhibited low variance on these samples (Table 2), and mean performances were nearly identical to the values presented in Table 1. MutationFinder's low variance suggests that its high performance should generalize well to other mutation-related literature.
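A subsampling procedure of this kind can be sketched in Python as below. The exact sampling scheme used for Table 2 is not specified beyond the description above, so this sketch assumes that each sample size is drawn uniformly between the two bounds and that documents are sampled without replacement. The function name, the fraction parameters and the `score_fn` callback are hypothetical.

```python
import random
import statistics


def subsample_scores(documents, score_fn, min_frac, max_frac, n_samples=100, seed=0):
    """Return (mean, sample SD) of score_fn over random subsets of documents.

    documents: list of test items; score_fn: maps a list of items to a number.
    min_frac and max_frac bound the subset size as fractions of the full set.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    low = int(min_frac * len(documents))
    high = int(max_frac * len(documents))
    scores = []
    for _ in range(n_samples):
        size = rng.randint(low, high)         # inclusive bounds
        subset = rng.sample(documents, size)  # sampling without replacement
        scores.append(score_fn(subset))
    return statistics.mean(scores), statistics.stdev(scores)
```

Because the subsets overlap heavily, the standard deviation computed this way may underestimate the true variance, which is the caveat noted in the caption of Table 2.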

Table 2
Mean±SD for precision, recall and F-measure of MutationFinder and the baseline system calculated on 100 randomly selected subsets of the blind test data. (Sample variance calculations may underestimate true variance calculations.)

5 SUMMARY

We have published MutationFinder, an open-source, rule-based system for extracting point mutation mentions from literature with high precision and recall; a high-quality, human-annotated mutation corpus; and a scoring script for mutation extraction systems.

We recognize that a system for identifying the gene or protein incurring a mutation, in conjunction with the mutation data, would be even more valuable. Additionally, the ability to distinguish between mutations in genes and proteins would be useful (e.g. ‘C10T’ could refer to either a cysteine to threonine mutation in a protein, or a cytosine to thymine mutation in DNA). Accurately identifying gene and protein names in text is an open area of research (Yeh et al., 2005), and classifying extracted names as referring to genes or proteins is a task which human experts only do with around 80% agreement (Hatzivassiloglou et al., 2001). We view mutation extraction, gene/protein named-entity recognition and gene/protein mutation disambiguation as separate language processing problems, which may collectively be solved by combining independent systems. MutationFinder provides reliable extraction of mutation data, and its modular nature and open-source availability in multiple languages facilitate its incorporation into more complex systems. Combining the output of MutationFinder with the output of an independent gene/protein name extraction system would provide a basis for assigning mutations to their gene/protein source, and the ability to distinguish gene and protein names would provide the requisite information to disambiguate mutation types. Alternatively, MutationFinder could be adapted to more specifically extract mutations either in genes or proteins, in which case it might be useful in disambiguating gene/protein names in mutation literature. However, we consider it a strength of our system to function generally by extracting both point mutation types.

A potential source of false positives for mutation extraction systems is mentions of other entities, such as genes, proteins or cell lines, whose names look similar to mutation mentions. For example, MutationFinder would mistakenly extract the gene name L23A and the cell line T98G as mutation mentions. This difficulty is commonly encountered by information extraction systems and, in many cases, can be avoided by beginning with a good information retrieval system. (Note that MutationFinder requires that sequence position be greater than 9 in single-letter abbreviated ‘mutation’ mentions; this rule is aimed at avoiding common errors of this type, such as extraction of the gene names E2F or H4M as mutation mentions.)

MutationFinder achieves impressive precision and recall on blind test data. Work on techniques for further increasing system recall, and for connecting mutations with gene or protein sequences, is currently in progress.

ACKNOWLEDGEMENTS

The authors would like to thank Sonia Leach and Hyunmin Kim, of the University of Colorado Health Sciences Center Computational Bioscience Program and James Martin, of the University of Colorado at Boulder Department of Computer Science, for helpful comments and discussion during system and corpus development and testing. We would additionally like to thank the two anonymous reviewers for their useful suggestions during the preparation of this manuscript, and reviewer 2, in particular, for providing the example mutation-like cell line we mention in Section 5: T98G. This work was partially funded by NLM grant R01-NLM-009254 to LH.

Conflicts of Interest

None declared.

Footnote 1: In this notation, w and m refer to the wild-type and mutant residue/base identities, respectively, and N refers to the sequence position. For example, A42G and Ala42Gly are both wNm-formatted mutation mentions.

Footnote 2: For example, ‘alanine 42 was mutated to glycine’, as opposed to ‘A42G’.

Footnote 3: These authors were involved with porting MutationFinder from Python to Java and Perl, but no changes were made to the system during this process.

REFERENCES

  • Baker CJO, Witte R. Mutation-mining a prospector's tale. J. Infor. Sys. Frontiers. 2006;8:47–57.
  • Bonis J, et al. OSIRIS: a tool for retrieving literature about sequence variants. Bioinformatics. 2006;22:2567–2569.
  • Hatzivassiloglou V, et al. Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics. 2001;17(Suppl 1):S97–S106.
  • Horn F, et al. Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics. 2004;20:557–568.
  • Ogren PV. Knowtator: a plug-in for creating training and evaluation data sets for biomedical natural language systems. Proceedings of the 9th International Protégé Conference; 2006. pp. 73–76.
  • Rebholz-Schuhmann D, et al. Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucl. Acids Res. 2004;32:135–142.
  • Yeh A, et al. BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics. 2005;6(Suppl 1).