Pretrained Transformer Language Models Versus Pretrained Word Embeddings for the Detection of Accurate Health Information on Arabic Social Media: Comparative Study.

Authors

Albalawi Yahya, Nikolov Nikola S, Buckley Jim

Affiliations

Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland.

Department of Computer and Information Sciences, College of Arts and Science, University of Taibah, Al-Ula, Saudi Arabia.

Publication information

JMIR Form Res. 2022 Jun 29;6(6):e34834. doi: 10.2196/34834.

DOI: 10.2196/34834
PMID: 35767322
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9280463/
Abstract

BACKGROUND

In recent years, social media has become a major channel for health-related information in Saudi Arabia. Prior health informatics studies have suggested that a large proportion of health-related posts on social media are inaccurate. Given the subject matter and the scale of dissemination of such information, it is important to be able to automatically discriminate between accurate and inaccurate health-related posts in Arabic.

OBJECTIVE

The first aim of this study is to generate a data set of generic health-related tweets in Arabic, labeled as either accurate or inaccurate health information. The second aim is to leverage this data set to train a state-of-the-art deep learning model for detecting the accuracy of health-related tweets in Arabic. In particular, this study aims to train and compare the performance of multiple deep learning models that use pretrained word embeddings and transformer language models.

METHODS

We used 900 health-related tweets from a previously published data set extracted between July 15, 2019, and August 31, 2019. Furthermore, we applied a pretrained model to extract an additional 900 health-related tweets from a second data set collected specifically for this study between March 1, 2019, and April 15, 2019. The 1800 tweets were labeled by 2 physicians as accurate, inaccurate, or unsure. The physicians agreed on 43.3% (779/1800) of tweets, which were thus labeled as accurate or inaccurate. A total of 9 variations of the pretrained transformer language models were then trained and validated on 80% (623/779 tweets) of the data set and tested on the remaining 20% (156/779 tweets). For comparison, we also trained a bidirectional long short-term memory model with 7 different pretrained word embeddings as the input layer on the same data set. The models were compared in terms of their accuracy, precision, recall, F score, and macroaverage of the F score.
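The evaluation metrics named above can be illustrated with a minimal sketch in plain Python. The function names (`per_class_scores`, `evaluate`) and the toy labels are assumptions made for illustration; they are not taken from the study, and the study's own evaluation code is not shown in the abstract.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1), treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(y_true, y_pred):
    """Return accuracy, per-class (precision, recall, f1), and macro-averaged F score."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = {lab: per_class_scores(y_true, y_pred, lab) for lab in labels}
    # Macro average: unweighted mean of the per-class F scores.
    macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(labels)
    return accuracy, per_class, macro_f1


# Toy labels (hypothetical, not from the study's data set).
y_true = ["accurate", "accurate", "accurate", "inaccurate", "inaccurate", "accurate"]
y_pred = ["accurate", "inaccurate", "accurate", "inaccurate", "accurate", "accurate"]

accuracy, per_class, macro_f1 = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
```

Because the macro average weights both classes equally, it is less dominated by the larger "accurate" class than plain accuracy is.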

RESULTS

We constructed a data set of labeled tweets, 38% (296/779) of which were labeled as inaccurate health information, and 62% (483/779) of which were labeled as accurate health information. We suggest that this approach was highly efficacious, as we excluded every tweet on which the physician annotators were unsure or in disagreement. Among the investigated deep learning models, the large variant of the Transformer-based Model for Arabic Language Understanding version 0.2 (AraBERTv0.2-large) was the most accurate, with an F score of 87%, followed by AraBERT version 2-large and AraBERTv0.2-base.
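As a quick arithmetic check, the reported proportions can be recomputed from the counts given in the abstract. The helper `pct` and the variable names are illustrative assumptions; only the counts come from the source.

```python
# Counts as reported in the abstract.
total_tweets = 1800   # tweets shown to the 2 physicians
agreed = 779          # tweets on which both physicians agreed
inaccurate = 296
accurate = 483


def pct(part, whole, digits=1):
    """Percentage of `whole` represented by `part`, rounded to `digits` places."""
    return round(100 * part / whole, digits)


# The two classes should partition the agreed-upon tweets.
assert inaccurate + accurate == agreed

agreement_pct = pct(agreed, total_tweets)
inaccurate_pct = pct(inaccurate, agreed)
accurate_pct = pct(accurate, agreed)
print(agreement_pct, inaccurate_pct, accurate_pct)  # 43.3 38.0 62.0
```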

CONCLUSIONS

Our results indicate that the pretrained language model AraBERTv0.2 is the best model for classifying tweets as carrying either inaccurate or accurate health information. Future studies should consider applying ensemble learning to combine the best models as it may produce better results.
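The abstract does not say which ensemble scheme future work should use. One simple possibility is hard (majority) voting over the labels predicted by several models, sketched below. The function `majority_vote`, the model names, and the predictions are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions by hard voting.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties are broken in favor of the earliest model in the list.
    """
    n_items = len(predictions_per_model[0])
    if any(len(p) != n_items for p in predictions_per_model):
        raise ValueError("all models must predict the same number of items")
    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # First vote (in model order) that belongs to a top-scoring label.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Hypothetical predictions from three classifiers on four tweets.
model_a = ["accurate", "inaccurate", "accurate", "inaccurate"]
model_b = ["accurate", "accurate", "accurate", "inaccurate"]
model_c = ["inaccurate", "inaccurate", "accurate", "accurate"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)  # ['accurate', 'inaccurate', 'accurate', 'inaccurate']
```

Using an odd number of models avoids ties in a two-class setting; other schemes, such as averaging predicted probabilities, are equally plausible readings of the suggestion.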

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/90bd/9280463/fcd428ba81b3/formative_v6i6e34834_fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/90bd/9280463/e82c44d59aee/formative_v6i6e34834_fig1.jpg