A review of semi-supervised learning for text classification.

Authors

Duarte José Marcio, Berton Lilian

Affiliation

Science and Technology Department, Federal University of São Paulo, Cesare Mansueto Giulio Lattes Ave, 1201, São José dos Campos, SP 12247-014 Brazil.

Publication information

Artif Intell Rev. 2023 Jan 31:1-69. doi: 10.1007/s10462-023-10393-8.

DOI: 10.1007/s10462-023-10393-8

PMID: 36743267

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9887265/

Abstract

A huge amount of data is generated daily, leading to big data challenges. One of them is related to text mining, especially text classification. To perform this task, we usually need a large set of labeled data, which can be expensive, time-consuming, or difficult to obtain. Given this scenario, semi-supervised learning (SSL), the branch of machine learning concerned with using both labeled and unlabeled data, has expanded in volume and scope. Since no recent survey gives an overview of how SSL has been used in text classification, we aim to fill this gap and present an up-to-date review of SSL for text classification. We retrieved 1794 works from the last 5 years from IEEE Xplore, ACM Digital Library, Science Direct, and Springer, and then selected 157 articles for inclusion in this review. We present the application domains, datasets, and languages employed in the works, as well as the text representations and machine learning algorithms. We also summarize and organize the works following a recent taxonomy of SSL. We analyze the percentage of labeled data used, the evaluation metrics, and the results obtained. Lastly, we present some limitations and future trends in the area. We aim to provide researchers and practitioners with an outline of the area as well as useful information for their current research.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6f6/9887265/3d7b2cdb1ab1/10462_2023_10393_Fig1_HTML.jpg

Similar articles

Audio self-supervised learning: A survey.

Patterns (N Y). 2022 Dec 9;3(12):100616. doi: 10.1016/j.patter.2022.100616.

A Survey on Self-Supervised Learning: Algorithms, Applications, and Future Trends.

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):9052-9071. doi: 10.1109/TPAMI.2024.3415112. Epub 2024 Nov 6.

Unsupervised and semi-supervised learning: the next frontier in machine learning for plant systems biology.

Plant J. 2022 Sep;111(6):1527-1538. doi: 10.1111/tpj.15905. Epub 2022 Jul 27.

Comprehensive study of semi-supervised learning for DNA methylation-based supervised classification of central nervous system tumors.

BMC Bioinformatics. 2022 Jun 8;23(1):223. doi: 10.1186/s12859-022-04764-1.

Semi-supervised clinical text classification with Laplacian SVMs: an application to cancer case management.

J Biomed Inform. 2013 Oct;46(5):869-75. doi: 10.1016/j.jbi.2013.06.014. Epub 2013 Jul 8.

Deep Source Semi-Supervised Transfer Learning (DS3TL) for Cross-Subject EEG Classification.

IEEE Trans Biomed Eng. 2024 Apr;71(4):1308-1318. doi: 10.1109/TBME.2023.3333327. Epub 2024 Mar 20.

Multi-class motor imagery EEG classification using collaborative representation-based semi-supervised extreme learning machine.

Med Biol Eng Comput. 2020 Sep;58(9):2119-2130. doi: 10.1007/s11517-020-02227-4. Epub 2020 Jul 16.

Social media based surveillance systems for healthcare using machine learning: A systematic review.

J Biomed Inform. 2020 Aug;108:103500. doi: 10.1016/j.jbi.2020.103500. Epub 2020 Jul 2.

Semi-supervised classifier guided by discriminator.

Sci Rep. 2022 Aug 29;12(1):14665. doi: 10.1038/s41598-022-18947-6.

Cited by

Machine learning approaches for predicting the link of the global trade network of liquefied natural gas.

PLoS One. 2025 Jul 30;20(7):e0326952. doi: 10.1371/journal.pone.0326952. eCollection 2025.

Leveraging AI to Drive Timely Improvements in Patient Experience Feedback: Algorithm Validation.

JMIR Med Inform. 2025 Jul 10;13:e60900. doi: 10.2196/60900.

The Role of Artificial Intelligence in Advancing Biosensor Technology: Past, Present, and Future Perspectives.

Adv Mater. 2025 Aug;37(34):e2504796. doi: 10.1002/adma.202504796. Epub 2025 Jun 16.

Predicting the availability of power line communication nodes using semi-supervised learning algorithms.

Sci Rep. 2025 May 21;15(1):17670. doi: 10.1038/s41598-025-01064-5.

Weakly supervised text classification on free-text comments in patient-reported outcome measures.

Front Digit Health. 2025 Apr 30;7:1345360. doi: 10.3389/fdgth.2025.1345360. eCollection 2025.

Transfer learning-based English translation text classification in a multimedia network environment.

PeerJ Comput Sci. 2024 Jan 31;10:e1842. doi: 10.7717/peerj-cs.1842. eCollection 2024.

sscNOVA: a semi-supervised convolutional neural network for predicting functional regulatory variants in autoimmune diseases.

Front Immunol. 2024 Feb 6;15:1323072. doi: 10.3389/fimmu.2024.1323072. eCollection 2024.
