A review of semi-supervised learning for text classification.

Authors

Duarte José Marcio, Berton Lilian

Affiliation

Science and Technology Department, Federal University of São Paulo, Cesare Mansueto Giulio Lattes Ave, 1201, São José dos Campos, SP 12247-014 Brazil.

Publication

Artif Intell Rev. 2023 Jan 31:1-69. doi: 10.1007/s10462-023-10393-8.

Abstract

A huge amount of data is generated daily, leading to big data challenges. One of them is related to text mining, especially text classification. To perform this task, we usually need a large set of labeled data, which can be expensive, time-consuming, or difficult to obtain. Given this scenario, semi-supervised learning (SSL), the branch of machine learning concerned with using both labeled and unlabeled data, has expanded in volume and scope. Since no recent survey covers how SSL has been used in text classification, we aim to fill this gap and present an up-to-date review of SSL for text classification. We retrieved 1794 works from the last 5 years from IEEE Xplore, ACM Digital Library, Science Direct, and Springer, of which 157 articles were selected for inclusion in this review. We present the application domains, datasets, and languages employed in the works, as well as the text representations and machine learning algorithms. We also summarize and organize the works following a recent taxonomy of SSL. We analyze the percentage of labeled data used, the evaluation metrics, and the results obtained. Lastly, we present some limitations and future trends in the area. We aim to provide researchers and practitioners with an outline of the area as well as useful information for their current research.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6f6/9887265/3d7b2cdb1ab1/10462_2023_10393_Fig1_HTML.jpg
