Seara Vieira A, Borrajo L, Iglesias E L
Department of Computer Science, Higher Technical School of Computer Engineering, University of Vigo, 32004 Ourense, Spain.
Comput Methods Programs Biomed. 2016 Nov;136:119-30. doi: 10.1016/j.cmpb.2016.08.018. Epub 2016 Aug 26.
In text classification problems, the representation of a document has a strong impact on the performance of learning systems. The high dimensionality of classical structured representations can lead to burdensome computations, given the large size of real-world data. Consequently, the quantity of information handled needs to be reduced to improve the classification process. In this paper, we propose a method that reduces the dimensionality of a classical text representation by using a clustering technique to group documents and a previously developed Hidden Markov Model to represent them. We tested the proposed dimensionality reduction technique with the k-NN and SVM classifiers on the OHSUMED and TREC benchmark text corpora. The experimental results compare favorably with those of commonly used techniques such as InfoGain, and the statistical tests performed demonstrate the suitability of the proposed technique for the preprocessing step of a text classification task.
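For readers unfamiliar with the InfoGain baseline named above, the following is a minimal sketch of information-gain term selection on a toy corpus. It is not the authors' clustering and Hidden Markov Model method, whose details the abstract does not give. The function names (`entropy`, `information_gain`, `select_top_terms`), the set-of-terms document representation, and the four toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs in a document."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_top_terms(docs, labels, k):
    """Keep the k terms with the highest information gain (ties broken alphabetically)."""
    vocabulary = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, t), t) for t in vocabulary]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


# Hypothetical toy corpus: each document is the set of terms it contains.
docs = [
    {"protein", "cell", "study"},
    {"protein", "gene", "study"},
    {"market", "price", "study"},
    {"market", "trade", "study"},
]
labels = ["bio", "bio", "econ", "econ"]

top_terms = select_top_terms(docs, labels, 2)
print(top_terms)
```

In this toy corpus, "protein" and "market" separate the two classes perfectly and are retained, while "study" occurs in every document, carries no class information, and is discarded. Reducing the vocabulary this way shrinks the document vectors before they are passed to a classifier such as k-NN or SVM.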