Defining and evaluating classification algorithm for high-dimensional data based on latent topics

PLoS One. 2014 Jan 9;9(1):e82119. doi: 10.1371/journal.pone.0082119. eCollection 2014.

Abstract

Automatic text categorization is a key technique in information retrieval and data mining. Classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to address this problem, but few achieve satisfactory efficiency. In this paper, we present a method that combines the Latent Dirichlet Allocation (LDA) algorithm with the Support Vector Machine (SVM). LDA is first used to generate a reduced-dimensional representation in which topics serve as the features of the vector space model (VSM). This reduces the number of features dramatically while keeping the necessary semantic information. The SVM is then employed to classify the data based on the generated features. We evaluate the algorithm on the 20 Newsgroups and Reuters-21578 datasets. The experimental results show that classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure, and does so in a much shorter time. Our approach improves greatly upon previous work in this field and shows strong potential to streamline classification for a wide range of applications.
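For readers unfamiliar with the evaluation measures named in the abstract, the following is a minimal sketch of how precision, recall and F1 can be computed for a single class. The function name `precision_recall_f1`, the labels, and the predictions are hypothetical and chosen only for illustration; they are not data or code from the paper, and the abstract does not state how the authors averaged these measures across classes.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    # False positives: predicted positive but actually another class.
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    # False negatives: actually positive but predicted as another class.
    fn = sum(1 for t, p in pairs if p != positive and t == positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels, made up for illustration (not results from the paper).
y_true = ["sport", "sport", "sport", "politics", "politics", "sport"]
y_pred = ["sport", "sport", "politics", "politics", "sport", "sport"]

p, r, f = precision_recall_f1(y_true, y_pred, positive="sport")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy example the "sport" class has three true positives, one false positive and one false negative, so precision, recall and F1 all equal 0.75. For a multi-class corpus such as 20 Newsgroups, per-class values like these are typically combined by macro- or micro-averaging; which one applies here is an assumption the reader would need to check against the full paper.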

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Databases as Topic*
  • Principal Component Analysis
  • Time Factors

Grants and funding

This work was supported by the National Natural Science Foundation of China (No. 61170192) and the Natural Science Foundation Project of CQ (No. CSTC2012JJB40012). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.