Generalized Term Similarity for Feature Selection in Text Classification Using Quadratic Programming

Author information

Lim Hyunki, Kim Dae-Won

Affiliations

Image and Media Research Center, Korea Institute of Science and Technology, 5 Hwarang-Ro 14-gil, Seongbuk-Gu, Seoul 02792, Korea.

School of Computer Science and Engineering, Chung-Ang University, 221 Heukseok-Dong, Dongjak-Gu, Seoul 06974, Korea.

Publication information

Entropy (Basel). 2020 Mar 30;22(4):395. doi: 10.3390/e22040395.

DOI:10.3390/e22040395

PMID:33286170

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7516869/

Abstract

The rapid growth of Internet technologies has led to an enormous increase in the number of electronic documents in use worldwide. Text categorization has been employed in recent decades to organize and manage this large volume of unstructured documents effectively and efficiently. For text categorization tasks, documents are usually represented with the bag-of-words model because of its simplicity. Under this representation, feature selection becomes essential, because treating every term in the vocabulary as a feature induces an enormous feature space. In this paper, we propose a new feature selection method that considers term similarity to avoid selecting redundant terms. Term similarity is measured using a general method such as mutual information and serves as a second criterion in feature selection, in addition to term ranking. To balance term ranking and term similarity, we use a quadratic programming-based numerical optimization approach. Experimental results demonstrate that considering term similarity is effective and yields higher accuracy than conventional methods.
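The trade-off described in the abstract can be illustrated with a small sketch. It is not the authors' formulation, which the abstract does not spell out. It assumes one common quadratic-programming form: minimize (1 - alpha)/2 · wᵀSw - alpha · rᵀw over weights w that are nonnegative and sum to 1, where r holds each term's mutual information with the class label (standing in for term ranking) and S holds pairwise mutual information between terms (standing in for term similarity). The function names, the `alpha` parameter, the toy data, and the projected-gradient solver are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    # sum over observed cells of p(a,b) * log(p(a,b) / (p(a) * p(b)))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(sorted(v, reverse=True), start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def select_terms(columns, labels, k, alpha=0.5, steps=500):
    """Weight terms by an assumed QP and return the k highest-weighted indices.

    Assumed objective (not taken from the paper):
        (1 - alpha) / 2 * w^T S w  -  alpha * r^T w,   w on the simplex.
    """
    m = len(columns)
    relevance = [mutual_information(c, labels) for c in columns]
    similarity = [[mutual_information(columns[i], columns[j]) for j in range(m)]
                  for i in range(m)]
    # The largest row sum bounds the spectral norm of the symmetric matrix S,
    # so its reciprocal is a safe step size. S is not guaranteed to be
    # positive semidefinite; this sketch does not correct for that.
    lipschitz = (1 - alpha) * max(sum(abs(s) for s in row)
                                  for row in similarity) or 1.0
    w = [1.0 / m] * m
    for _ in range(steps):
        grad = [(1 - alpha) * sum(similarity[i][j] * w[j] for j in range(m))
                - alpha * relevance[i] for i in range(m)]
        w = project_to_simplex([wi - gi / lipschitz for wi, gi in zip(w, grad)])
    ranked = sorted(range(m), key=lambda i: w[i], reverse=True)
    return ranked[:k], w

# Toy bag-of-words data: each column is one term's presence across 8 documents.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
term_columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # term 0: matches the label
    [0, 0, 0, 0, 1, 1, 1, 1],  # term 1: exact duplicate of term 0
    [0, 0, 0, 1, 1, 1, 1, 0],  # term 2: weaker, partly independent signal
    [0, 1, 0, 1, 0, 1, 0, 1],  # term 3: unrelated to the label
]
selected, weights = select_terms(term_columns, labels, k=2, alpha=0.5)
print(selected, [round(x, 3) for x in weights])
```

In this sketch `alpha` controls the balance: values near 1 rank terms almost purely by relevance, while smaller values penalize mutually similar terms more heavily. On such a tiny example the outcome is sensitive to `alpha`, so it would need tuning on real data.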

