Relevance popularity: A term event model based feature selection scheme for text classification.

Author information

Feng Guozhong, An Baiguo, Yang Fengqin, Wang Han, Zhang Libiao

Affiliations

Key Laboratory of Intelligent Information Processing of Jilin Universities, School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China.

Key Laboratory for Applied Statistics of MOE, Northeast Normal University, Changchun, 130024, China.

Publication information

PLoS One. 2017 Apr 5;12(4):e0174341. doi: 10.1371/journal.pone.0174341. eCollection 2017.

Abstract

Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. Traditional feature selection methods such as information gain and chi-square often use the number of documents that contain a particular term (i.e., the document frequency). However, the frequency with which a given term appears in each document has not been fully investigated, even though it is a promising feature for producing accurate classifications. In this paper, we propose a new feature selection scheme based on a term event multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing inner parameters with their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), numerical experiments with two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperforms the representative feature selection methods.
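To illustrate why a multinomial naive Bayes score factorizes over terms, here is a minimal sketch. It is not the relevance popularity measure derived in the paper, whose exact formula the abstract does not give. It only shows the general idea: under the multinomial model, the log ratio of class probabilities for a document is a sum of per-term contributions weighted by term counts, so each term can be scored on its own. The function name `term_log_ratio_scores`, the Laplace smoothing constant `alpha`, and the toy corpus are all assumptions made for this illustration.

```python
import math
from collections import Counter


def term_log_ratio_scores(docs, labels, target, alpha=1.0):
    """Illustrative per-term score (hypothetical helper, not the paper's measure).

    docs   : list of token lists; repeated tokens count as repeated term events
    labels : class label for each document
    target : the class to score terms for (versus all other classes)
    alpha  : Laplace smoothing constant (an assumption of this sketch)
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    in_class, out_class = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        # Count every occurrence (term frequency), not just document presence.
        (in_class if label == target else out_class).update(tokens)

    n_in = sum(in_class.values())
    n_out = sum(out_class.values())
    v = len(vocab)

    scores = {}
    for term in vocab:
        # Smoothed multinomial estimates of P(term | class).
        p_in = (in_class[term] + alpha) / (n_in + alpha * v)
        p_out = (out_class[term] + alpha) / (n_out + alpha * v)
        # Contribution of one occurrence of `term` to the log probability ratio.
        scores[term] = math.log(p_in / p_out)
    return scores


# Toy corpus, invented for this example.
docs = [["ball", "ball", "game"], ["game", "vote"], ["vote", "vote", "law"]]
labels = ["sport", "politics", "politics"]
scores = term_log_ratio_scores(docs, labels, target="sport")
ranked = sorted(scores, key=scores.get, reverse=True)
```

Terms with large positive scores are more typical of the target class, and terms with large negative scores are more typical of the other classes. A feature selector built on this idea would keep the terms with the most extreme scores. How the paper combines such per-term factors into its final measurement is not specified in the abstract.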

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c65/5381872/efbbd2abef9b/pone.0174341.g001.jpg
