
tmVar: A text mining approach for extracting sequence variants in biomedical literature

Authors: Chih-Hsuan Wei, Bethany R. Harris, Hung-Yu Kao and Zhiyong Lu (PI)

Research highlights (demo)

Text mining of mutation information from the literature has become a critical part of the bioinformatics approach to analyzing and interpreting sequence variations in complex diseases in the post-genomic era. Current approaches are mostly rule-based and focus on a limited range of sequence variation types, such as protein point mutations. Here we report tmVar, a text-mining approach based on conditional random fields (CRF) for extracting a wide range of sequence variants at both the protein and gene levels, according to the standard sequence variant nomenclature developed by the Human Genome Variation Society (HGVS). By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model with a set of customized features, our method achieves an F-measure of over 90% on both our own corpus and a publicly available benchmarking data set, and compares favorably to state-of-the-art methods.

  tmVar is now able to normalize extracted variant mentions to unique identifiers (dbSNP RSIDs). In benchmarking results, tmVar achieves state-of-the-art performance (~90% in F-measure). See the article below for more details.

  • Wei C-H, Phan L, Feltz J, Maiti R, Hefferon T, Lu Z. tmVar 2.0: Integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine. Bioinformatics, 2017.

Method overview

As shown in Figure 1 below, our method first tokenizes the input text as a pre-processing step. Next, it extracts mutation mentions from the text using a CRF-based approach, followed by some post-processing steps. As illustrated in the figure, instead of extracting a mutation mention such as c.2708_2711delTTAG as a whole, our CRF module identifies each mutation component (e.g., 'del' as the mutation type) individually. Finally, we have implemented a post-processing module to handle some rare mutation formulas and natural language mentions that are not curated in our own corpus.

Figure 1. An overview of the tmVar workflow.
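
The component-level view of a mention can be illustrated with a small sketch. This is not the CRF label model itself: the tokenization rule (splitting at letter-case, digit and punctuation boundaries), the regular expression and the names TOKEN, DNA_INDEL, tokenize and split_components are all assumptions made for illustration, and the pattern only covers simple DNA-level deletions, insertions and duplications.

```python
import re

# Assumed tokenization rule: split at letter-case, digit and punctuation
# boundaries so that each mutation component ends up in its own token.
TOKEN = re.compile(r"[A-Z]+|[a-z]+|\d+|[^A-Za-z\d\s]")

# Hypothetical pattern for one family of DNA-level mentions. It is a regex
# stand-in for illustration only, not the CRF-based extraction.
DNA_INDEL = re.compile(
    r"(?P<level>[cgmr])\."           # sequence level, e.g. 'c' for coding DNA
    r"(?P<start>\d+)"                # first affected position
    r"(?:_(?P<end>\d+))?"            # optional end position of a range
    r"(?P<type>delins|del|ins|dup)"  # mutation type
    r"(?P<bases>[ACGTU]*)"           # affected bases, if given
)


def tokenize(text):
    """Split text into letter, digit and punctuation tokens."""
    return TOKEN.findall(text)


def split_components(mention):
    """Return the named components of a mention, or None if it does not match."""
    match = DNA_INDEL.fullmatch(mention)
    if match is None:
        return None
    # Drop components that are absent or empty in this mention.
    return {name: value for name, value in match.groupdict().items() if value}


mention = "c.2708_2711delTTAG"
print(tokenize(mention))
# ['c', '.', '2708', '_', '2711', 'del', 'TTAG']
print(split_components(mention)["type"])
# del
```

A sequence labeler such as a CRF would assign a label to each token instead of relying on one fixed pattern, which is what lets it generalize to mention forms a regular expression does not anticipate.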

  Next, we developed a new module for mapping each previously detected mutation mention to a corresponding RS number (RSID), as shown in Figure 2. tmVar finds the corresponding RSIDs for variants using two main strategies: pattern matching and dictionary lookup. We first developed a set of patterns (e.g., “[Gene/Protein] ([DNAMutation] with [RSID])”) to detect a mutation and an RSID that co-occur in the same sentence. For the remaining mentions, we generated a list of candidate RSIDs by searching our lexicon.

Figure 2. The overall workflow of our mutation normalization process.
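
The two normalization strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the tmVar implementation: the sentence splitter is naive, any RSID in the same sentence as the mention is accepted (the real patterns are more specific), and the lexicon, gene name, mutation and RSIDs are made-up placeholders rather than real dbSNP records.

```python
import re

RSID = re.compile(r"\brs\d+\b")

# Hypothetical lexicon mapping (gene, mutation) to candidate RSIDs.
# The entry is a placeholder, not a real dbSNP identifier.
LEXICON = {("GENE1", "c.100A>G"): ["rs0000001"]}


def normalize(text, gene, mutation, lexicon=LEXICON):
    """Return (candidate RSIDs, name of the strategy that produced them)."""
    # Strategy 1: an RSID mentioned in the same sentence as the mutation.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if mutation in sentence:
            rsids = RSID.findall(sentence)
            if rsids:
                return rsids, "pattern"
    # Strategy 2: fall back to candidates from the lexicon.
    return list(lexicon.get((gene, mutation), [])), "lexicon"


text = "GENE1 (c.100A>G with rs0000002) was found in patients. It was rare."
print(normalize(text, "GENE1", "c.100A>G"))
# (['rs0000002'], 'pattern')
print(normalize("GENE1 c.100A>G was found.", "GENE1", "c.100A>G"))
# (['rs0000001'], 'lexicon')
```

Trying the sentence-level evidence first reflects the order described above: an RSID stated by the authors next to the mention is used directly, and the lexicon is consulted only for the remaining mentions.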



                Methods           Precision   Recall    F-measure
All Mutations   MutationFinder    91.66%      33.21%    48.76%
                MutationFinder+   89.66%      69.15%    78.08%
                tmVar             91.38%      91.40%    91.39%

                MutationFinder    84.21%      25.29%    38.90%
                MutationFinder+   84.09%      63.25%    72.20%
                tmVar             87.74%      87.46%    87.60%
Table 1. Results on the test set of our corpus for mutation individual component.
                Methods           Precision   Recall    F-measure
All Mutations   MutationFinder    98.41%      81.92%    89.41%
                tmVar             98.80%      89.62%    93.98%

                MutationFinder    98.47%      80.63%    88.66%
                tmVar             97.58%      83.96%    90.26%
Table 2. Results on the MutationFinder corpus for mutation individual component.

Corpus   Method   TP    FP   FN    Precision   Recall    F-score
tmVar2   tmVar    565   16   60    97.25%      90.40%    93.70%
         SETH     466   5    159   98.94%      74.56%    85.04%
OSIRIS   tmVar    208   6    50    97.20%      80.62%    88.14%
         SETH     179   11   79    94.21%      69.38%    79.91%
Thomas   tmVar    465   52   62    89.94%      88.24%    89.08%
         SETH     303   14   224   95.58%      57.50%    71.80%
Table 3. Normalization results on the tmVar normalization corpus, OSIRIS and SETH.
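
The percentage columns in Table 3 follow from the TP, FP and FN counts under the standard definitions of precision, recall and F-score (the harmonic mean of the two). The short check below uses the tmVar row for the tmVar2 corpus; the function name prf is only illustrative.

```python
def prf(tp, fp, fn):
    """Compute precision, recall and F-score from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# tmVar on the tmVar2 corpus: TP = 565, FP = 16, FN = 60
p, r, f = prf(565, 16, 60)
print(f"{p:.2%} {r:.2%} {f:.2%}")
# 97.25% 90.40% 93.70%
```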


tmVar 2.0 Software (Includes normalization) in Java
tmVar NER Corpus (Mention forms, Annotation guidelines)
tmVar Normalization Corpus
tmVar-tagged PubMed results in PubTator
tmVar RESTful API

Please cite