Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media.

Authors

Albalawi Yahya, Buckley Jim, Nikolov Nikola S

Affiliations

Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland.

Department of Computer and Information Sciences, College of Arts and Science, University of Taibah, Al-Ula, Saudi Arabia.

Publication information

J Big Data. 2021;8(1):95. doi: 10.1186/s40537-021-00488-w. Epub 2021 Jul 2.

DOI: 10.1186/s40537-021-00488-w
PMID: 34249602
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8253467/
Abstract

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques for Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets when training a classifier to identify health-related tweets. For this task we use the traditional machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another used only for testing the generality of our models. Our results point to the conclusion that only four of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F score of 75.2% and accuracy of 90.7%, compared to an F score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.
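The abstract does not list the 26 pre-processing techniques or show how the two reported metrics are computed. As a rough illustration only, the sketch below applies a few normalization steps that are common for Arabic tweets (stripping URLs and mentions, removing diacritics and tatweel, unifying alef variants) and computes accuracy and F score from predictions. The function names `normalize_arabic` and `accuracy_and_f1` and the choice of steps are assumptions made for this example, not the authors' pipeline.

```python
import re

# Assumed normalization rules; the paper's 26 techniques may differ.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")   # tashkeel marks
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda/hamza
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
TATWEEL = "\u0640"                                   # elongation character
BARE_ALEF = "\u0627"


def normalize_arabic(text: str) -> str:
    """Apply a few common Arabic tweet normalization steps."""
    text = URL_OR_MENTION.sub(" ", text)       # drop links and @mentions
    text = DIACRITICS.sub("", text)            # remove short-vowel marks
    text = text.replace(TATWEEL, "")           # remove elongation
    text = ALEF_VARIANTS.sub(BARE_ALEF, text)  # unify alef forms
    return " ".join(text.split())              # collapse whitespace


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for a binary task with the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tweet = "\u0623\u064E\u0647\u0652\u0644\u0627\u064B @user http://t.co/x"
    print(normalize_arabic(tweet))
    # Hypothetical labels: 2 health-related tweets out of 10, one of them missed.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    print(accuracy_and_f1(y_true, y_pred))
```

On the hypothetical labels, a single missed positive leaves accuracy high while pulling the F score down sharply, which is one reason the two metrics can rank models differently on an imbalanced test set.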
