Automatic Analysis and Annotation of Document Keywords in Biomedical Literature
As a document retrieval system, PubMed aims to provide efficient access to millions of scientific documents. For this purpose, it relies on matching keywords and semantic representations of PubMed documents to user queries. One type of semantic representation used in MEDLINE citations is the Medical Subject Headings (MeSH) indexing terms, which are assigned by professional human indexers at the National Library of Medicine. Alternatively, author keywords, provided by authors when submitting an article, capture the essence of a document's topic from the authors' perspective. Last but not least, readers have their own opinions about which words are important to an article, and these may or may not agree with the MeSH terms or author keywords of the same article.
Goals and Objectives
Our overall goal is to develop automated techniques to analyze and annotate various important document keywords in biomedical literature using different perspectives from curators, authors, and readers in the context of document indexing and retrieval. Furthermore, it is our goal to develop machine learning approaches for automated prediction of such important document keywords.
- MeSH Term Prediction: PubMed relies on human indexers to assign the appropriate MeSH indexing terms to PubMed articles, a very time- and labor-intensive process. As a result, these terms are not immediately available for new articles. In fact, our analysis shows that on average it takes over 90 days for a PubMed citation to be manually annotated with MeSH terms [Huang and Lu, Coling 2010]. In response, we have developed a machine learning algorithm that automatically predicts MeSH terms using a set of novel features. When compared to other state-of-the-art methods, our approach achieved significantly better performance [Huang et al., JAMIA, 2011].
- Author Keywords vs. MeSH terms: Whereas MeSH terms require human curation, author keywords can be obtained freely from journal articles when they are available. We conducted a first study of author keywords in biomedical articles, in which we described the growth of author keywords in biomedical journal articles and compared author keywords with MeSH indexing terms. A similarity metric from our past study was used to automatically assess the relatedness between pairs of author keywords and MeSH indexing terms. Furthermore, a set of 300 pairs was manually reviewed to evaluate the metric and characterize the relationships between the two term types. Results show that author keywords are increasingly available in biomedical articles and that over 60% of author keywords can be linked to a closely related indexing term. These results have implications for both MEDLINE document indexing and MeSH terminology development [Neveol et al., AMIA, 2010].
- User Click Words: Finally, by comparison, we found that neither MeSH terms nor author keywords overlap significantly with the words that are important from the users' point of view. This motivated us to learn what characteristics make document words important from a collective user perspective. Specifically, we applied machine learning to identify document keywords that are likely to be used frequently in user queries. Each word was represented by a set of features covering different types of information, such as semantic type, part-of-speech tag, TF-IDF weight, and location in the abstract. We examined both traditional features, such as TF-IDF, and novel ones, such as named entities, which have not been explored before in this context. We identified the most important features and evaluated our model using months of real-world PubMed log data. Our results suggest that, in addition to carrying high TF-IDF weight, important words from the users' perspective tend to be biomedical entities, to appear in article titles, and to occur repeatedly in article abstracts. This study enabled us to automatically predict words likely to appear in user queries that lead to document clicks. The relative importance of predicted words can also play a role in ranking documents by relevance [Dogan and Lu, Bioinformatics, 2010].
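As a rough illustration of the word-level features described for click words, the sketch below computes three of them for each word of an article: a TF-IDF weight, whether the word appears in the title, and how often it occurs in the abstract. It is not the published model. The function names (`tokenize`, `word_features`), the simple `log(N / df)` IDF variant, and the toy texts are assumptions made for this example, and features such as semantic type, part-of-speech tag, and named entities are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_features(title, abstract, corpus):
    """Return a dict mapping each word of the article to simple features.

    `corpus` is a list of other document texts, used only to estimate
    document frequency (hypothetical stand-in for a real collection).
    """
    title_tokens = set(tokenize(title))
    abstract_counts = Counter(tokenize(abstract))
    doc_counts = Counter(tokenize(title + " " + abstract))

    corpus_token_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus) + 1  # the corpus plus the article itself

    features = {}
    for word, tf in doc_counts.items():
        # Document frequency counts the article itself, so df >= 1.
        df = 1 + sum(word in tokens for tokens in corpus_token_sets)
        idf = math.log(n_docs / df)
        features[word] = {
            "tfidf": tf * idf,
            "in_title": word in title_tokens,
            "abstract_count": abstract_counts[word],
        }
    return features


if __name__ == "__main__":
    feats = word_features(
        title="BRCA1 mutations in breast cancer",
        abstract="BRCA1 mutations raise cancer risk. BRCA1 testing is common.",
        corpus=["Diabetes risk in adults", "Cancer screening is common"],
    )
    # Rank words by TF-IDF weight, highest first.
    for word, f in sorted(feats.items(), key=lambda kv: -kv[1]["tfidf"]):
        print(word, f)
```

In a setting like the one described, feature vectors of this kind would be paired with labels derived from query logs and passed to a classifier; that step is omitted here because it depends on data this page does not include.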
- Huang et al., Recommending MeSH terms for annotating biomedical articles, JAMIA 2011. Download data sets.
- Neveol et al., Author keywords in biomedical journal articles, AMIA 2010.
- Huang & Lu, Learning to annotate scientific publications, Coling 2010. Download data sets.
- Dogan & Lu, Click-words: Learning to predict document keywords from a user perspective, Bioinformatics 2010.