J Biomed Inform. 2019 Jan 14:103096. doi: 10.1016/j.jbi.2019.103096. [Epub ahead of print]

Unsupervised Low-Dimensional Vector Representations for Words, Phrases and Text that are Transparent, Scalable, and produce Similarity Metrics that are not Redundant with Neural Embeddings.

Author information

1. Department of Psychiatry and Psychiatric Institute, University of Illinois College of Medicine, Chicago, IL 60612, USA. Electronic address: neils@uic.edu.
2. Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239, USA.
3. Department of Psychiatry and Psychiatric Institute, University of Illinois College of Medicine, Chicago, IL 60612, USA.

Abstract

Neural embeddings are a popular set of methods for representing words, phrases or text as low-dimensional vectors (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low-dimensional vector in which the meaning and relative importance of the dimensions are transparent to inspection. Using the set of titles and abstracts in PubMed as a corpus, we have created a near-comprehensive vector representation of words and of selected bigrams, trigrams and abbreviations. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and equal or outperform all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture aspects of word-word relatedness different from those captured by word2vec-based metrics, and the two are only partially correlated (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title+abstracts are all publicly available at http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html under a CC-BY-NC license. Several public web query interfaces are also available at the same site, including one that allows the user to specify a word and view its most closely related terms according to direct co-occurrence as well as the different implicit similarity metrics.
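As an illustration of the kind of comparison the abstract reports, the following minimal sketch computes a similarity score between two word vectors and a rank correlation (rho) between two sets of word-pair scores. The abstract does not state how the implicit metrics are computed, so cosine similarity and Spearman's rho are assumptions here, and the function names (`cosine_similarity`, `average_ranks`, `spearman_rho`) are hypothetical and not taken from the paper or its released data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Under these assumptions, scoring the same list of word pairs with two different metrics and passing the two score lists to `spearman_rho` gives a single number describing how far the metrics agree in their ordering of the pairs, which is one way to read a statement that two metrics are "only partially correlated".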

KEYWORDS:

Word2vec; dimensional reduction; implicit features; natural language processing; pvtopic; semantic similarity; text mining; vector representation

PMID:
30654030
DOI:
10.1016/j.jbi.2019.103096
