

Data Exploration and Classification of News Article Reliability: Deep Learning Study.

Author information

Zhan Kevin, Li Yutong, Osmani Rafay, Wang Xiaoyu, Cao Bo

Affiliations

Department of Psychiatry, University of Alberta, Edmonton, AB, Canada.

Department of Cell Biology, University of Alberta, Edmonton, AB, Canada.

Publication information

JMIR Infodemiology. 2022 Sep 22;2(2):e38839. doi: 10.2196/38839. eCollection 2022 Jul-Dec.

Abstract

BACKGROUND

During the ongoing COVID-19 pandemic, we are being exposed to large amounts of information each day. This "infodemic" is defined by the World Health Organization as the mass spread of misleading or false information during a pandemic. This spread of misinformation during the infodemic ultimately leads to misunderstandings of public health orders or direct opposition against public policies. Although there have been efforts to combat misinformation spread, current manual fact-checking methods are insufficient to combat the infodemic.

OBJECTIVE

We propose the use of natural language processing (NLP) and machine learning (ML) techniques to build a model that can be used to identify unreliable news articles online.

METHODS

First, we preprocessed the ReCOVery data set to obtain 2029 English news articles tagged with COVID-19 keywords from January to May 2020, which are labeled as reliable or unreliable. Data exploration was conducted to determine major differences between reliable and unreliable articles. We built an ensemble deep learning model using the body text, as well as features, such as sentiment, Empath-derived lexical categories, and readability, to classify the reliability.
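As an illustration of how a readability feature could be derived from body text, the sketch below computes the Flesch Reading Ease score. The abstract does not say which readability measure the authors used, so the choice of Flesch Reading Ease, the function names, and the vowel-group syllable heuristic are all assumptions made for this example. Higher scores indicate text that is easier to read.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of vowels (heuristic, not exact)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(len(groups), 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word).

    Assumed here as a stand-in readability feature; the source abstract
    does not name the metric it used.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    dense = "Epidemiological misinformation complicates governmental communication."
    print(round(flesch_reading_ease(simple), 2))
    print(flesch_reading_ease(dense) < flesch_reading_ease(simple))
```

A score like this could sit alongside sentiment and lexical-category counts as one numeric input to a classifier. A production pipeline would likely use a dictionary-based syllable counter rather than this heuristic.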

RESULTS

We found that reliable news articles have a higher proportion of neutral sentiment, while unreliable articles have a higher proportion of negative sentiment. Our analysis also showed that reliable articles are easier to read than unreliable articles and differ from them in lexical categories and keywords. On evaluation, our new model achieved an area under the curve (AUC) of 0.906, a specificity of 0.835, and a sensitivity of 0.945. These values are above the baseline performance of the original ReCOVery model.
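For readers less familiar with the three reported metrics, the following minimal sketch shows how sensitivity, specificity, and AUC are computed from labels and model scores. It is not the authors' evaluation code. The label convention (1 as the positive class, 0 as the negative class), the 0.5 decision threshold, and all function names are assumptions for illustration, and the toy data are made up.

```python
from typing import Sequence


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """True negative rate: TN / (TN + FP)."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example with hypothetical labels and scores.
    y_true = [1, 1, 1, 0, 0]
    scores = [0.9, 0.8, 0.3, 0.4, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
    print(sensitivity(y_true, y_pred), specificity(y_true, y_pred), roc_auc(y_true, scores))
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the three numbers are reported together.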

CONCLUSIONS

This paper identified novel differences between reliable and unreliable news articles; moreover, the model was trained using state-of-the-art deep learning techniques. We aim to use our findings to help researchers and the public more easily identify false information and unreliable media in their everyday lives.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8dc6/10117316/54217b084027/infodemiology_v2i2e38839_fig1.jpg
