COVID-19 Misinformation Detection: Machine-Learned Solutions to the Infodemic.

Authors

Kolluri Nikhil, Liu Yunong, Murthy Dhiraj

Affiliations

Computational Media Lab, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, United States.

School of Engineering, College of Science and Engineering, University of Edinburgh, Edinburgh, United Kingdom.

Publication information

JMIR Infodemiology. 2022 Aug 25;2(2):e38756. doi: 10.2196/38756. eCollection 2022 Jul-Dec.

DOI:10.2196/38756
PMID:37113446
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9987189/
Abstract

BACKGROUND

The volume of COVID-19-related misinformation has long exceeded the resources available to fact checkers to effectively mitigate its ill effects. Automated and web-based approaches can provide effective deterrents to online misinformation. Machine learning-based methods have achieved robust performance on text classification tasks, including potentially low-quality-news credibility assessment. Despite the progress of initial, rapid interventions, the enormity of COVID-19-related misinformation continues to overwhelm fact checkers. Therefore, improvement in automated and machine-learned methods for an infodemic response is urgently needed.

OBJECTIVE

The aim of this study was to achieve improvement in automated and machine-learned methods for an infodemic response.

METHODS

We evaluated three strategies for training a machine-learning model to determine which yields the highest model performance: (1) COVID-19-related fact-checked data only, (2) general fact-checked data only, and (3) combined COVID-19 and general fact-checked data. We created two COVID-19-related misinformation data sets from fact-checked "false" content combined with programmatically retrieved "true" content. The first set contained ~7000 entries from July to August 2020, and the second contained ~31,000 entries from January 2020 to June 2022. We crowdsourced 31,441 votes to human-label the first data set.
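The three training strategies can be pictured as three ways of slicing one pool of fact-checked examples. The sketch below is illustrative only: the `Example` record, its `topic` tag, the 1/0 label encoding, and `build_training_sets` are hypothetical names introduced here, not taken from the study.

```python
from typing import Dict, List, NamedTuple


class Example(NamedTuple):
    text: str
    label: int  # assumed encoding: 1 = "true", 0 = "false"
    topic: str  # hypothetical tag: "covid" or "general"


def build_training_sets(examples: List[Example]) -> Dict[str, List[Example]]:
    """Split one pool of labeled examples into the three training strategies."""
    covid = [e for e in examples if e.topic == "covid"]
    general = [e for e in examples if e.topic == "general"]
    return {
        "covid_only": covid,            # strategy 1
        "general_only": general,        # strategy 2
        "combined": covid + general,    # strategy 3
    }


# Toy data, invented purely for illustration.
toy_examples = [
    Example("claim about a COVID-19 cure", 0, "covid"),
    Example("public health agency guidance", 1, "covid"),
    Example("fabricated political quote", 0, "general"),
]
training_sets = build_training_sets(toy_examples)
```

Each of the three lists would then be used to train or fine-tune a separate classifier, so that the resulting models can be compared on the same external validation data.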

RESULTS

The models achieved an accuracy of 96.55% and 94.56% on the first and second external validation data sets, respectively. Our best-performing model was developed using COVID-19-specific content. We were able to successfully develop combined models that outperformed human votes of misinformation. Specifically, when we blended our model predictions with human votes, the highest accuracy we achieved on the first external validation data set was 99.1%. When we considered outputs where the machine-learning model agreed with human votes, we achieved accuracies up to 98.59% on the first validation data set. This outperformed human votes alone with an accuracy of only 73%.
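The abstract does not say how model predictions and human votes were blended, so the following is a minimal sketch under stated assumptions rather than the authors' method: blending is taken to be a weighted average of the model's probability and the share of "true" votes, and the agreement subset keeps only items where the model's label matches the majority vote. All function names, the `weight` and `threshold` parameters, and the toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def vote_share(votes: Sequence[int]) -> float:
    """Fraction of crowd votes that say 'true' (1 = true, 0 = false)."""
    return sum(votes) / len(votes)


def to_label(score: float, threshold: float = 0.5) -> int:
    """Turn a score in [0, 1] into a binary label."""
    return 1 if score >= threshold else 0


def blend(model_prob: float, votes: Sequence[int], weight: float = 0.5) -> int:
    """Assumed blending rule: weighted average of model probability and vote share."""
    return to_label(weight * model_prob + (1 - weight) * vote_share(votes))


def agreement_subset(
    model_probs: Sequence[float],
    votes_per_item: Sequence[Sequence[int]],
    labels: Sequence[int],
) -> Tuple[List[int], List[int]]:
    """Keep only items where the model label equals the majority human label."""
    preds, kept = [], []
    for prob, votes, gold in zip(model_probs, votes_per_item, labels):
        if to_label(prob) == to_label(vote_share(votes)):
            preds.append(to_label(prob))
            kept.append(gold)
    return preds, kept


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    return sum(p == g for p, g in zip(preds, labels)) / len(labels)


# Toy inputs, invented for illustration only.
model_probs = [0.9, 0.2, 0.6, 0.4]
votes_per_item = [[1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 1, 1]]
gold_labels = [1, 0, 1, 1]

blended = [blend(p, v) for p, v in zip(model_probs, votes_per_item)]
agreed_preds, agreed_gold = agreement_subset(model_probs, votes_per_item, gold_labels)
```

On real data, accuracy on the agreement subset is computed over fewer items than the full validation set, so it should be reported together with the fraction of items the subset covers.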

CONCLUSIONS

External validation accuracies of 96.55% and 94.56% are evidence that machine learning can produce superior results for the difficult task of classifying the veracity of COVID-19 content. Pretrained language models performed best when fine-tuned on a topic-specific data set, while other models achieved their best accuracy when fine-tuned on a combination of topic-specific and general-topic data sets. Crucially, our study found that blended models, trained/fine-tuned on general-topic content with crowdsourced data, improved our models' accuracies up to 99.7%. The successful use of crowdsourced data can increase the accuracy of models in situations when expert-labeled data are scarce. The 98.59% accuracy on a "high-confidence" subsection comprised of machine-learned and human labels suggests that crowdsourced votes can optimize machine-learned labels to improve accuracy above human-only levels. These results support the utility of supervised machine learning to deter and combat future health-related disinformation.
