Feng Guozhong, An Baiguo, Yang Fengqin, Wang Han, Zhang Libiao
Key Laboratory of Intelligent Information Processing of Jilin Universities, School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China.
Key Laboratory for Applied Statistics of MOE, Northeast Normal University, Changchun, 130024, China.
PLoS One. 2017 Apr 5;12(4):e0174341. doi: 10.1371/journal.pone.0174341. eCollection 2017.
Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. Traditional feature selection methods such as information gain and chi-square often rely on the number of documents that contain a particular term (i.e., the document frequency). However, the frequency with which a given term appears within each document has not been fully investigated, even though it is a promising feature for producing accurate classifications. In this paper, we propose a new feature selection scheme based on a term-event multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing the inner parameters with their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), numerical experiments using two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperforms representative feature selection methods.
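The abstract's exact scoring measurement is derived in the paper itself; as a minimal illustrative sketch only, the general idea of scoring terms by a multinomial naive Bayes class-conditional probability ratio, estimated from within-document term frequencies rather than document frequency alone, can be written as follows. The function name, toy corpus, and Laplace smoothing choice below are assumptions for illustration, not the authors' published formula.

```python
import math
from collections import Counter

def mnb_term_scores(docs, labels, alpha=1.0):
    """Score each term by the absolute log-ratio of its smoothed
    multinomial class-conditional probabilities P(t|c1) / P(t|c0).

    Illustrative sketch only: uses within-document term frequencies
    (the multinomial event model), not just document frequency.
    """
    classes = sorted(set(labels))  # assumes a binary task here
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)  # accumulate term frequencies per class
    vocab = sorted({t for c in classes for t in counts[c]})
    totals = {c: sum(counts[c].values()) for c in classes}
    scores = {}
    for t in vocab:
        # Laplace-smoothed multinomial estimates of P(t | c)
        p = {c: (counts[c][t] + alpha) / (totals[c] + alpha * len(vocab))
             for c in classes}
        scores[t] = abs(math.log(p[classes[1]] / p[classes[0]]))
    return scores

# Toy corpus: two "sports" documents and two "finance" documents.
docs = [["ball", "goal", "goal"], ["ball", "team"],
        ["stock", "market"], ["market", "market", "price"]]
labels = [0, 0, 1, 1]
scores = mnb_term_scores(docs, labels)
top = sorted(scores, key=scores.get, reverse=True)
```

On this toy data, "market" receives the highest score because its term frequency is strongly skewed toward one class; a feature selection step would then keep only the top-scoring terms before training the downstream classifier.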