Relevance popularity: A term event model based feature selection scheme for text classification.

Authors

Feng Guozhong, An Baiguo, Yang Fengqin, Wang Han, Zhang Libiao

Affiliations

Key Laboratory of Intelligent Information Processing of Jilin Universities, School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China.

Key Laboratory for Applied Statistics of MOE, Northeast Normal University, Changchun, 130024, China.

Publication information

PLoS One. 2017 Apr 5;12(4):e0174341. doi: 10.1371/journal.pone.0174341. eCollection 2017.

DOI:10.1371/journal.pone.0174341
PMID:28379986
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5381872/
Abstract

Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. Traditional feature selection methods such as information gain and chi-square often rely on the number of documents that contain a particular term (i.e., the document frequency). However, the frequency with which a given term appears in each document has not been fully investigated, even though it is a promising feature for producing accurate classifications. In this paper, we propose a new feature selection scheme based on a term event multinomial naive Bayes probabilistic model. Under the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. We then derive a feature selection measurement for each term by replacing the inner parameters with their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), numerical experiments with two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperformed the representative feature selection methods.
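The general idea of scoring terms by within-document term counts, rather than by document frequency, can be illustrated with a minimal sketch. This is not the relevance popularity measure from the paper, whose exact formula is not given in the abstract. It is an assumed stand-in: each term is scored by the largest absolute log ratio between its Laplace-smoothed multinomial probability inside a class and outside that class. The function name `term_scores`, the smoothing parameter `alpha`, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def term_scores(docs, labels, alpha=1.0):
    """Score each term by a smoothed in-class vs. out-of-class log ratio.

    docs   : list of token lists (repeated tokens count, i.e. term frequency)
    labels : list of class labels, one per document
    alpha  : Laplace smoothing constant (an assumption of this sketch)
    """
    vocab = sorted({t for d in docs for t in d})
    classes = sorted(set(labels))

    # Term-frequency counts per class: every occurrence of a term counts.
    counts = {c: Counter() for c in classes}
    for tokens, y in zip(docs, labels):
        counts[y].update(tokens)

    # Corpus-wide totals, used to get the "all other classes" counts.
    total = Counter()
    for c in classes:
        total.update(counts[c])
    n_total = sum(total.values())
    n_class = {c: sum(counts[c].values()) for c in classes}

    v = len(vocab)
    scores = {}
    for t in vocab:
        best = 0.0
        for c in classes:
            # Smoothed multinomial estimate of P(term | class).
            p_in = (counts[c][t] + alpha) / (n_class[c] + alpha * v)
            # Same estimate over all documents outside the class.
            p_out = (total[t] - counts[c][t] + alpha) / (
                n_total - n_class[c] + alpha * v
            )
            best = max(best, abs(math.log(p_in / p_out)))
        scores[t] = best
    return scores


# Hypothetical toy corpus: two classes, two short documents each.
docs = [
    ["ball", "ball", "game", "the"],
    ["game", "ball", "the"],
    ["vote", "vote", "law", "the"],
    ["law", "vote", "the"],
]
labels = ["sport", "sport", "politics", "politics"]

scores = term_scores(docs, labels)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus, "the" appears equally often in both classes and so scores zero, while "ball" and "vote" each occur three times within a single class and rank highest. A document-frequency-based score would not separate "ball" from "game" here, since both appear in the same two documents.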
