Identifying Disinformation on the Extended Impacts of COVID-19: Methodological Investigation Using a Fuzzy Ranking Ensemble of Natural Language Processing Models.

Authors

Chen Jian-An, Chung Wu-Chun, Hung Che-Lun, Wu Chun-Ying

Affiliations

Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan.

Department of Information and Computer Engineering, Chung Yuan Christian University, Taoyuan, Taiwan.

Publication information

J Med Internet Res. 2025 May 21;27:e73601. doi: 10.2196/73601.

DOI: 10.2196/73601
PMID: 40397945
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12138316/

Abstract

BACKGROUND

During the COVID-19 pandemic, the continuous spread of misinformation on the internet posed an ongoing threat to public trust and understanding of epidemic prevention policies. Although the pandemic is now under control, information regarding the risks of long-term COVID-19 effects and reinfection still needs to be integrated into COVID-19 policies.

OBJECTIVE

This study aims to develop a robust and generalizable deep learning framework for detecting misinformation related to the prolonged impacts of COVID-19 by integrating pretrained language models (PLMs) with an innovative fuzzy rank-based ensemble approach.

METHODS

A dataset comprising 566 genuine and 2361 fake samples was curated from reliable open sources and preprocessed. The dataset was randomly split using the scikit-learn package for training and evaluation. Deep learning models were trained for 20 epochs: the hierarchical attention network (HAN) on a Tesla T4 GPU and the other models on an RTX A5000 GPU. To enhance performance, we implemented an ensemble learning strategy built on a reparameterized Gompertz function, which assigned fuzzy ranks based on each model's prediction confidence for each test case. This method fused the outputs of state-of-the-art PLMs such as robustly optimized bidirectional encoder representations from transformers pretraining approach (RoBERTa), decoding-enhanced bidirectional encoder representations from transformers with disentangled attention (DeBERTa), and XLNet.
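The abstract does not spell out the fusion rule, so the following is only a minimal sketch of one published formulation of a fuzzy rank-based ensemble with a reparameterized Gompertz function. It assumes the rank function 1 - exp(-exp(-2p)) and a final score equal to the rank sum multiplied by the complement of the mean confidence, with the lowest score winning. The function names, the example probabilities, and the class order [genuine, fake] are hypothetical, and the top-k penalty step used in some variants is omitted because the task here is binary.

```python
import math


def gompertz_rank(p):
    """Map a confidence p in [0, 1] to a fuzzy rank (assumed form).

    Higher confidence gives a lower rank: p=0 maps to about 0.632,
    p=1 maps to about 0.127.
    """
    return 1.0 - math.exp(-math.exp(-2.0 * p))


def fuzzy_rank_ensemble(model_probs):
    """Fuse per-model class probabilities for a single sample.

    model_probs: list with one entry per model; each entry is a list of
    class probabilities in the same class order.
    Returns (predicted_class_index, fused_scores); lower score is better.
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    scores = []
    for k in range(n_classes):
        # Sum of fuzzy ranks: small when models are confident in class k.
        rank_sum = sum(gompertz_rank(probs[k]) for probs in model_probs)
        # Complement of the mean confidence: small when support is strong.
        complement = 1.0 - sum(probs[k] for probs in model_probs) / n_models
        scores.append(rank_sum * complement)
    predicted = min(range(n_classes), key=scores.__getitem__)
    return predicted, scores


# Hypothetical softmax outputs [P(genuine), P(fake)] from three models.
sample = [
    [0.30, 0.70],  # stand-in for RoBERTa
    [0.55, 0.45],  # stand-in for DeBERTa
    [0.10, 0.90],  # stand-in for XLNet
]
label, scores = fuzzy_rank_ensemble(sample)
print("predicted class index:", label)
print("fused scores:", [round(s, 4) for s in scores])
```

In this toy case one model leans toward "genuine" while the other two favor "fake" with higher confidence, so the fused score for "fake" is the smaller one. Because both factors shrink as confidence grows, the product rewards agreement among confident models rather than a simple majority vote.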

RESULTS

After training on the dataset, various classification methods were evaluated on the test set, including the fuzzy rank-based method and state-of-the-art large language models. Experimental results show that language models, particularly XLNet, outperform traditional approaches that combine term frequency-inverse document frequency features with a support vector machine or that use deep models such as HAN. The evaluation metrics (accuracy, precision, recall, F-score, and area under the curve [AUC]) indicated a clear performance advantage for models with a larger number of parameters. However, this study also highlights that model architecture, training procedures, and optimization techniques are critical determinants of classification effectiveness. XLNet's permutation language modeling approach enhances bidirectional context understanding, allowing it to surpass even larger models in the bidirectional encoder representations from transformers (BERT) series despite having relatively fewer parameters. Notably, the fuzzy rank-based ensemble method, which combines multiple language models, achieved an accuracy of 93.52%, a precision of 94.65%, an F-score of 96.03%, and an AUC of 97.15% on the test set.
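For readers who want to see how the reported metrics relate to each other, here is a small pure-Python sketch. The labels, predictions, and scores are made-up toy values, the choice of which class counts as positive is an assumption, and the function names are hypothetical. The last helper shows that, if the reported F-score is the balanced F1 (an assumption the abstract does not confirm), a recall value can be back-calculated from precision and F-score.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, y_score, positive=1):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in sample count, fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == positive]
    neg = [s for t, s in zip(y_true, y_score) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_from_f1(precision, f1):
    """Solve F1 = 2PR / (P + R) for R, assuming the F-score is F1."""
    return f1 * precision / (2 * precision - f1)


# Toy data: 1 is the positive class (assumed), scores are P(positive).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.2]

print(binary_metrics(y_true, y_pred))
print("AUC:", roc_auc(y_true, y_score))
print("implied recall:", round(recall_from_f1(0.9465, 0.9603), 4))
```

Under the F1 assumption, the reported precision of 94.65% and F-score of 96.03% imply a recall of roughly 97.45%. This is derived arithmetic, not a figure stated in the abstract.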

CONCLUSIONS

The fusion of ensemble learning with PLMs and the Gompertz function, employing fuzzy rank-based methodology, introduces a novel prediction approach with prospects for enhancing accuracy and reliability. Additionally, the experimental results imply that training solely on textual content can yield high prediction accuracy, thereby providing valuable insights into the optimization of fake news detection systems. These findings not only aid in detecting misinformation but also have broader implications for the application of advanced deep learning techniques in public health policy and communication.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d30/12138316/f266266f5e0b/jmir_v27i1e73601_fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d30/12138316/cad47b97068f/jmir_v27i1e73601_fig1.jpg