Kolluri Nikhil, Liu Yunong, Murthy Dhiraj
Computational Media Lab Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX United States.
School of Engineering College of Science and Engineering University of Edinburgh Edinburgh United Kingdom.
JMIR Infodemiology. 2022 Aug 25;2(2):e38756. doi: 10.2196/38756. eCollection 2022 Jul-Dec.
The volume of COVID-19-related misinformation has long exceeded the resources available to fact checkers to effectively mitigate its ill effects. Automated and web-based approaches can provide effective deterrents to online misinformation. Machine learning-based methods have achieved robust performance on text classification tasks, including credibility assessment of potentially low-quality news. Despite the progress of initial, rapid interventions, the sheer volume of COVID-19-related misinformation continues to overwhelm fact checkers. Therefore, improvement in automated and machine-learned methods for an infodemic response is urgently needed.
The aim of this study was to achieve improvement in automated and machine-learned methods for an infodemic response.
We evaluated three strategies for training a machine-learning model to determine which yielded the highest model performance: (1) COVID-19-related fact-checked data only, (2) general fact-checked data only, and (3) combined COVID-19 and general fact-checked data. We created two COVID-19-related misinformation data sets from fact-checked "false" content combined with programmatically retrieved "true" content. The first set contained ~7000 entries from July to August 2020, and the second contained ~31,000 entries from January 2020 to June 2022. We crowdsourced 31,441 votes to obtain human labels for the first data set.
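The three-way comparison described above can be sketched as a small harness. This is a minimal illustration, not the study's pipeline: the names `build_strategies`, `compare_strategies`, and `train_keyword_model` are hypothetical, the keyword "model" is a toy stand-in for whichever classifier is being fine-tuned, and the example texts are invented for illustration and are not drawn from the study's data sets.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]  # (text, label): 1 = fact-checked "false", 0 = "true"
Predictor = Callable[[str], int]


def build_strategies(covid: List[Example], general: List[Example]) -> Dict[str, List[Example]]:
    """Assemble the three training sets: topic-specific, general, and combined."""
    return {
        "covid_only": list(covid),
        "general_only": list(general),
        "combined": list(covid) + list(general),
    }


def compare_strategies(covid, general, validation, train: Callable[[List[Example]], Predictor]):
    """Train once per strategy and report accuracy on a held-out validation set."""
    results = {}
    for name, train_set in build_strategies(covid, general).items():
        predict = train(train_set)
        correct = sum(predict(text) == label for text, label in validation)
        results[name] = correct / len(validation)
    return results


def train_keyword_model(train_set: List[Example]) -> Predictor:
    """Toy stand-in (assumption): flag words seen only in false-labeled examples."""
    false_words = {w for t, y in train_set if y == 1 for w in t.lower().split()}
    true_words = {w for t, y in train_set if y == 0 for w in t.lower().split()}
    flagged = false_words - true_words
    return lambda text: int(any(w in flagged for w in text.lower().split()))


# Invented toy examples, for illustration only.
covid = [("garlic cures covid", 1), ("vaccines reduce covid severity", 0)]
general = [("moon landing was staged", 1), ("water boils at 100 celsius", 0)]
validation = [("garlic tea cures infection", 1), ("vaccines reduce hospital stays", 0)]

results = compare_strategies(covid, general, validation, train_keyword_model)
best = max(results, key=results.get)
print(results, "best:", best)
```

Passing the training routine in as a function keeps the comparison identical across strategies, so any difference in validation accuracy comes from the training data alone. The toy scores say nothing about the accuracies reported in the study.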
The models achieved accuracies of 96.55% and 94.56% on the first and second external validation data sets, respectively. Our best-performing model was developed using COVID-19-specific content. We were able to successfully develop combined models that outperformed human votes in identifying misinformation. Specifically, when we blended our model predictions with human votes, the highest accuracy we achieved on the first external validation data set was 99.1%. When we considered only outputs where the machine-learning model agreed with human votes, we achieved accuracies up to 98.59% on the first validation data set. This outperformed human votes alone, which had an accuracy of only 73%.
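The two ways of combining model output with crowd votes mentioned above (blending, and restricting to items where model and crowd agree) can be sketched as follows. The abstract does not state how the blending was done, so the weighted average, the 0.5 threshold, the `weight` parameter, and all names here are assumptions made for illustration. The four items are invented toy data, not study data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    label: int         # reference label: 1 = misinformation, 0 = true content
    model_prob: float  # model's estimated probability of misinformation
    votes_false: int   # crowd votes marking the item as misinformation
    votes_total: int   # all crowd votes cast on the item


def crowd_fraction(item: Item) -> float:
    """Share of crowd votes calling the item misinformation (0.5 if no votes)."""
    return item.votes_false / item.votes_total if item.votes_total else 0.5


def model_prediction(item: Item) -> int:
    return int(item.model_prob >= 0.5)


def crowd_prediction(item: Item) -> int:
    return int(crowd_fraction(item) >= 0.5)


def blended_prediction(item: Item, weight: float = 0.5) -> int:
    """Assumed blending rule: weighted average of model probability and vote share."""
    score = weight * item.model_prob + (1 - weight) * crowd_fraction(item)
    return int(score >= 0.5)


def agreement_subset(items: List[Item]) -> List[Item]:
    """Keep only items where the model and the crowd majority give the same label."""
    return [it for it in items if model_prediction(it) == crowd_prediction(it)]


def accuracy(items: List[Item], predict: Callable[[Item], int]) -> float:
    if not items:
        return float("nan")
    return sum(predict(it) == it.label for it in items) / len(items)


# Invented toy items, for illustration only.
items = [
    Item(label=1, model_prob=0.90, votes_false=4, votes_total=5),
    Item(label=0, model_prob=0.20, votes_false=1, votes_total=5),
    Item(label=1, model_prob=0.95, votes_false=1, votes_total=5),  # crowd is wrong
    Item(label=0, model_prob=0.60, votes_false=0, votes_total=5),  # model is wrong
]

agreed = agreement_subset(items)
print("model only:", accuracy(items, model_prediction))
print("crowd only:", accuracy(items, crowd_prediction))
print("blended:", accuracy(items, blended_prediction))
print("agreement subset:", len(agreed), "items, accuracy", accuracy(agreed, model_prediction))
```

The agreement subset trades coverage for confidence: it scores only the items where both signals concur, so its accuracy should be read alongside the number of items it retains.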
External validation accuracies of 96.55% and 94.56% are evidence that machine learning can produce superior results for the difficult task of classifying the veracity of COVID-19 content. Pretrained language models performed best when fine-tuned on a topic-specific data set, while other models achieved their best accuracy when fine-tuned on a combination of topic-specific and general-topic data sets. Crucially, our study found that blended models, trained/fine-tuned on general-topic content with crowdsourced data, improved our models' accuracies up to 99.7%. The successful use of crowdsourced data can increase the accuracy of models in situations where expert-labeled data are scarce. The 98.59% accuracy on a "high-confidence" subsection composed of machine-learned and human labels suggests that crowdsourced votes can optimize machine-learned labels to improve accuracy above human-only levels. These results support the utility of supervised machine learning to deter and combat future health-related disinformation.