
Research on performance variations of classifiers with the influence of pre-processing methods for Chinese short text classification.

Author affiliations

School of Computer and Communication Engineering, University of Science and Technology Beijing, Haidian, Beijing, China.

Beijing Key Laboratory of Knowledge Engineering for Materials Science, University of Science and Technology Beijing, Haidian, Beijing, China.

Publication information

PLoS One. 2023 Oct 12;18(10):e0292582. doi: 10.1371/journal.pone.0292582. eCollection 2023.

DOI:10.1371/journal.pone.0292582
PMID:37824464
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10569603/
Abstract

Text preprocessing is an important component of Chinese text classification. At present, however, most studies on this topic explore the influence of preprocessing methods on only a few text classification algorithms, using English text. In this paper, we experimentally compared fifteen commonly used classifiers on two Chinese datasets using three widely used Chinese preprocessing methods: word segmentation, Chinese-specific stop word removal, and Chinese-specific symbol removal. We then explored the influence of the preprocessing methods on the final classifications under various conditions, such as classification evaluation, combination style, and classifier selection. Finally, we conducted a battery of additional experiments and found that most of the classifiers improved in performance after proper preprocessing was applied. Our general conclusion is that the systematic use of preprocessing methods can have a positive impact on the classification of Chinese short text, given a suitable evaluation metric (such as macro-F1), a suitable combination of preprocessing methods (such as word segmentation with Chinese-specific stop word and symbol removal), and a suitable choice of classifier (such as machine learning and deep learning models). The best macro-F1 scores for the two datasets are 92.13% and 91.99%, which represent improvements of 0.3% and 2%, respectively, over the compared baselines.
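
The three preprocessing steps and the macro-F1 metric named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the vocabulary, the stop word list, the symbol pattern, and the forward-maximum-matching segmenter are all hypothetical stand-ins chosen so the example runs with the standard library alone. A real experiment would more likely use a dedicated segmenter and a published stop word list.

```python
import re

# Hypothetical toy resources (assumptions for illustration only; the
# paper's actual lexicon and stop word list are not given in the abstract).
VOCAB = {"我", "喜欢", "这", "部", "电影", "的", "剧情"}
STOP_WORDS = {"的", "这", "部"}
MAX_WORD_LEN = max(len(word) for word in VOCAB)

# Assumed symbol filter: drop everything that is not a common CJK
# ideograph, an ASCII letter, or a digit.
SYMBOL_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def remove_symbols(text):
    """Strip punctuation and other symbols, including full-width ones."""
    return SYMBOL_RE.sub("", text)


def segment(text):
    """Forward maximum matching against VOCAB.

    Characters not covered by any vocabulary word become
    single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in VOCAB:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(text, symbols=True, stop_words=True):
    """Apply one combination of the three steps; segmentation always runs."""
    if symbols:
        text = remove_symbols(text)
    tokens = segment(text)
    if stop_words:
        tokens = remove_stop_words(tokens)
    return tokens


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(preprocess("我喜欢这部电影的剧情!"))
    print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Toggling the `symbols` and `stop_words` flags gives the different preprocessing combinations that could then be compared per classifier. Because macro-F1 averages per-class F1 without weighting by class size, it gives minority classes the same weight as majority ones.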



