Generalized Term Similarity for Feature Selection in Text Classification Using Quadratic Programming.

Authors

Lim Hyunki, Kim Dae-Won

Affiliations

Image and Media Research Center, Korea Institute of Science and Technology, 5 Hwarang-Ro 14-gil, Seongbuk-Gu, Seoul 02792, Korea.

School of Computer Science and Engineering, Chung-Ang University, 221 Heukseok-Dong, Dongjak-Gu, Seoul 06974, Korea.

Publication details

Entropy (Basel). 2020 Mar 30;22(4):395. doi: 10.3390/e22040395.

DOI: 10.3390/e22040395
PMID: 33286170
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7516869/

Abstract

The rapid growth of Internet technologies has led to an enormous increase in the number of electronic documents used worldwide. To organize and manage big data for unstructured documents effectively and efficiently, text categorization has been employed in recent decades. To conduct text categorization tasks, documents are usually represented using the bag-of-words model, owing to its simplicity. In this representation for text classification, feature selection becomes an essential method because the terms in the vocabulary induce an enormous feature space corresponding to the documents. In this paper, we propose a new feature selection method that considers term similarity to avoid the selection of redundant terms. Term similarity is measured using a general method such as mutual information, and serves as a second measure in feature selection in addition to term ranking. To balance term ranking and term similarity in feature selection, we use a quadratic programming-based numerical optimization approach. Experimental results demonstrate that considering term similarity is effective and yields higher accuracy than conventional methods.
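The abstract does not spell out the optimization problem, so the following is only a minimal sketch of one way such a trade-off could be posed, not the authors' formulation. It assumes binary term occurrences, uses mutual information both for term ranking (relevance to the class label) and for term similarity, and assigns each term a weight on the probability simplex by approximately minimizing (1 - alpha)/2 * w'Sw - alpha * r'w. The names `qp_term_weights`, `select_terms`, the parameter `alpha`, and the projected-gradient solver are all illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) == 1}."""
    cumulative, theta = 0.0, 0.0
    for i, u in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u
        t = (cumulative - 1.0) / i
        if u - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def qp_term_weights(docs, labels, alpha=0.5, steps=500):
    """Approximately minimize (1 - alpha)/2 * w'Sw - alpha * r'w over the simplex.

    r[j] is the relevance of term j (MI with the class label); S[i][j] is the
    similarity of terms i and j (MI between their occurrence columns).
    Assumption: alpha and the projected-gradient solver are illustrative choices.
    """
    m = len(docs[0])
    cols = [[doc[j] for doc in docs] for j in range(m)]
    r = [mutual_information(col, labels) for col in cols]
    S = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            S[i][j] = S[j][i] = mutual_information(cols[i], cols[j])
    # Step size from a row-sum bound on the curvature of the quadratic term.
    lr = 1.0 / max((1 - alpha) * max(sum(row) for row in S), 1e-12)
    w = [1.0 / m] * m
    for _ in range(steps):
        grad = [(1 - alpha) * sum(S[i][j] * w[j] for j in range(m)) - alpha * r[i]
                for i in range(m)]
        w = project_to_simplex([wi - lr * gi for wi, gi in zip(w, grad)])
    return w


def select_terms(docs, labels, k, **kwargs):
    """Return the indices of the k terms with the largest weights."""
    w = qp_term_weights(docs, labels, **kwargs)
    return sorted(range(len(w)), key=lambda j: -w[j])[:k]


# Toy binary bag-of-words matrix: rows are documents, columns are terms.
# Terms 0 and 1 are identical copies; term 3 is independent of the label.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
docs = [
    [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 1],
    [1, 1, 0, 0], [1, 1, 0, 1], [1, 1, 1, 0], [1, 1, 1, 1],
]
print(select_terms(docs, labels, k=2))
```

In this toy data terms 0, 1, and 2 are equally relevant, so a ranking by relevance alone could pick the duplicated pair. The similarity penalty makes the two copies share their weight, which favors pairing one of them with term 2. A mutual-information similarity matrix is not guaranteed to be positive semidefinite, in which case projected gradient descent only finds a stationary point rather than a global optimum.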

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9882/7516869/c01c9c159fba/entropy-22-00395-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9882/7516869/526b8e839c55/entropy-22-00395-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9882/7516869/0c76e5ae415a/entropy-22-00395-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9882/7516869/9713ae8171a0/entropy-22-00395-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9882/7516869/92a3e5a68981/entropy-22-00395-g005.jpg

Similar articles

1. Generalized Term Similarity for Feature Selection in Text Classification Using Quadratic Programming. Entropy (Basel). 2020 Mar 30;22(4):395. doi: 10.3390/e22040395.
2. Relevance popularity: A term event model based feature selection scheme for text classification. PLoS One. 2017 Apr 5;12(4):e0174341. doi: 10.1371/journal.pone.0174341. eCollection 2017.
3. Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification. PeerJ Comput Sci. 2022 Apr 25;8:e961. doi: 10.7717/peerj-cs.961. eCollection 2022.
4. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents. Springerplus. 2016 Jun 30;5(1):942. doi: 10.1186/s40064-016-2573-y. eCollection 2016.
5. Improved feature-selection method considering the imbalance problem in text categorization. ScientificWorldJournal. 2014;2014:625342. doi: 10.1155/2014/625342. Epub 2014 May 26.
6. Latent Topic Text Representation Learning on Statistical Manifolds. IEEE Trans Neural Netw Learn Syst. 2018 Nov;29(11):5643-5654. doi: 10.1109/TNNLS.2018.2808332. Epub 2018 Mar 16.
7. TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring. Front Genet. 2022 Jun 20;13:893378. doi: 10.3389/fgene.2022.893378. eCollection 2022.
8. Competitive Particle Swarm Optimization for Multi-Category Text Feature Selection. Entropy (Basel). 2019 Jun 18;21(6):602. doi: 10.3390/e21060602.
9. Improving Classification of Protein Interaction Articles Using Context Similarity-Based Feature Selection. Biomed Res Int. 2015;2015:751646. doi: 10.1155/2015/751646. Epub 2015 Aug 3.
10. A Novel Feature Selection Technique for Text Classification Using Naïve Bayes. Int Sch Res Notices. 2014 Oct 28;2014:717092. doi: 10.1155/2014/717092. eCollection 2014.

Cited by

1. Forecasting mergers and acquisitions failure based on partial-sigmoid neural network and feature selection. PLoS One. 2021 Nov 17;16(11):e0259575. doi: 10.1371/journal.pone.0259575. eCollection 2021.
