Luo Le, Li Li
Faculty of Computer and Information Science, Southwest University, Chongqing, China.
PLoS One. 2014 Jan 9;9(1):e82119. doi: 10.1371/journal.pone.0082119. eCollection 2014.
Automatic text categorization is one of the key techniques in information retrieval and data mining. Classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few achieve satisfactory efficiency. In this paper, we present a method that combines the Latent Dirichlet Allocation (LDA) algorithm with the Support Vector Machine (SVM). LDA is first used to generate a reduced-dimensional topic representation that serves as the features in the vector space model (VSM). This reduces the number of features dramatically while keeping the necessary semantic information. The SVM is then employed to classify the data based on the generated features. We evaluate the algorithm on the 20 Newsgroups and Reuters-21578 datasets. The experimental results show that classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure, and does so in much less time. Our process improves greatly upon previous work in this field and shows strong potential to streamline classification for a wide range of applications.
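The two-stage idea in the abstract (LDA topic proportions as low-dimensional features, then an SVM on those features) can be illustrated with a minimal, self-contained sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the toy corpus, the function names `lda_gibbs`, `train_linear_svm` and `predict`, the hyperparameters, the use of collapsed Gibbs sampling for LDA, and a Pegasos-style linear SVM. For brevity, LDA is fitted on all documents at once (it is unsupervised) rather than folding in unseen documents.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total tokens per topic
    assign = []
    for d, doc in enumerate(docs):                         # random initial topic per token
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed topic proportions: the reduced-dimensional feature vectors.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style subgradient descent on the hinge loss; labels are -1 or +1."""
    rng = random.Random(seed)
    X = [x + [1.0] for x in X]          # constant feature acts as a bias term
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1:              # example violates the margin
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    """Return +1 or -1 according to the sign of the linear decision function."""
    score = sum(wj * xj for wj, xj in zip(w, x + [1.0]))
    return 1 if score >= 0 else -1

# Hypothetical toy corpus: +1 = sports, -1 = computing.
docs = [
    "team game score win ball team".split(),
    "ball game player team win".split(),
    "player score game ball".split(),
    "code software computer program code".split(),
    "computer program data software".split(),
    "data code program computer".split(),
]
labels = [1, 1, 1, -1, -1, -1]

theta = lda_gibbs(docs, n_topics=2)     # 2 topic features instead of one per word
train_idx, test_idx = [0, 1, 3, 4], [2, 5]
w = train_linear_svm([theta[i] for i in train_idx], [labels[i] for i in train_idx])
preds = [predict(w, theta[i]) for i in test_idx]
print(theta[0], preds)
```

The point of the sketch is the change in feature dimensionality: each document is represented by `n_topics` numbers rather than one count per vocabulary word, which is what makes the downstream SVM cheaper to train. On real corpora such as 20 Newsgroups, a library implementation of both stages would normally replace these hand-written routines.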