Zhan Kevin, Li Yutong, Osmani Rafay, Wang Xiaoyu, Cao Bo
Department of Psychiatry, University of Alberta, Edmonton, AB, Canada.
Department of Cell Biology, University of Alberta, Edmonton, AB, Canada.
JMIR Infodemiology. 2022 Sep 22;2(2):e38839. doi: 10.2196/38839. eCollection 2022 Jul-Dec.
During the ongoing COVID-19 pandemic, we are exposed to large amounts of information each day. The World Health Organization defines this "infodemic" as the mass spread of misleading or false information during a pandemic. The spread of misinformation ultimately leads to misunderstanding of public health orders or direct opposition to public policies. Although there have been efforts to combat the spread of misinformation, current manual fact-checking methods are insufficient to address the infodemic.
We propose the use of natural language processing (NLP) and machine learning (ML) techniques to build a model that can be used to identify unreliable news articles online.
First, we preprocessed the ReCOVery data set to obtain 2029 English news articles tagged with COVID-19 keywords, published from January to May 2020 and labeled as reliable or unreliable. We then conducted data exploration to determine the major differences between reliable and unreliable articles. Finally, we built an ensemble deep learning model that classifies article reliability using the body text together with features such as sentiment, Empath-derived lexical categories, and readability.
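To make the readability feature concrete, the following is a minimal sketch of one common readability score, the Flesch Reading Ease. The abstract does not state which readability metric was used, so the choice of this formula, the function names, and the crude vowel-group syllable counter are all assumptions made for illustration only.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of vowel groups, at least 1.

    This heuristic is an assumption for illustration; it is not a
    dictionary-based syllable counter.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical example texts, not taken from the ReCOVery data set.
easy_text = "The cat sat on the mat. It was a good day."
hard_text = ("Epidemiological misinformation proliferation undermines "
             "institutional credibility considerably.")
easy_score = flesch_reading_ease(easy_text)
hard_score = flesch_reading_ease(hard_text)
```

A score like this could be computed per article and passed to a classifier alongside the other features; how the features were actually combined in the ensemble is not described in the abstract.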
We found that reliable news articles have a higher proportion of neutral sentiment, while unreliable articles have a higher proportion of negative sentiment. Our analysis also showed that reliable articles are easier to read than unreliable articles and that the two groups differ in lexical categories and keywords. On evaluation, our new model achieved an area under the curve (AUC) of 0.906, a specificity of 0.835, and a sensitivity of 0.945. These values exceed the baseline performance of the original ReCOVery model.
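For readers unfamiliar with the reported metrics, the sketch below shows how sensitivity, specificity, and AUC are computed for a binary classifier. The labels and scores are made-up toy values, the function names are hypothetical, and the encoding (1 as the positive class, 0.5 as the decision threshold) is an assumption; the abstract does not state which class was treated as positive.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (pairwise formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (not results from the paper).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the three values are reported together.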
This paper identified novel differences between reliable and unreliable news articles; moreover, the model was trained using state-of-the-art deep learning techniques. We aim to use our findings to help researchers and the public more easily identify false information and unreliable media in their everyday lives.