Analytics, Intelligence, and Technology Division, Los Alamos National Laboratory, Los Alamos, NM, United States.
Department of Computer Science, University of New Mexico, Albuquerque, NM, United States.
JMIR Public Health Surveill. 2021 Apr 14;7(4):e26527. doi: 10.2196/26527.
The COVID-19 outbreak has left many people isolated within their homes; these people are turning to social media for news and social connection, which leaves them vulnerable to believing and sharing misinformation. Health-related misinformation threatens adherence to public health messaging, and monitoring its spread on social media is critical to understanding the evolution of ideas that have potentially negative public health impacts.
The aim of this study is to use Twitter data to explore methods to characterize and classify four COVID-19 conspiracy theories and to provide context for each of these conspiracy theories through the first 5 months of the pandemic.
We began with a corpus of approximately 120 million COVID-19 tweets spanning late January to early May 2020. We first filtered the tweets using regular expressions, which retained approximately 1.8 million, and then used random forest classification models to identify tweets related to four conspiracy theories. The classified data sets were then used in downstream sentiment analysis and dynamic topic modeling to characterize the linguistic features of COVID-19 conspiracy theories as they evolved over time.
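The regular expression filtering step can be illustrated with a minimal sketch. The labels, patterns, and sample tweets below are hypothetical placeholders chosen for illustration; they are not the expressions or data used in the study, and the random forest step that follows the filter is not shown.

```python
import re

# Hypothetical patterns for illustration only; the study's actual
# expressions are not reproduced here.
PATTERNS = {
    "theory_a": re.compile(r"\b5g\b", re.IGNORECASE),
    "theory_b": re.compile(r"\blab[- ]?(?:made|grown)\b", re.IGNORECASE),
}


def match_theories(text):
    """Return the sorted labels whose pattern matches the tweet text."""
    return sorted(label for label, pat in PATTERNS.items() if pat.search(text))


def prefilter(tweets):
    """Keep only tweets that match at least one pattern, with their labels."""
    kept = []
    for text in tweets:
        labels = match_theories(text)
        if labels:
            kept.append((text, labels))
    return kept


if __name__ == "__main__":
    # Made-up sample tweets, not drawn from the study corpus.
    sample = [
        "Stay home and wash your hands",
        "They say 5G towers spread the virus",
        "Was it lab-made? Asking for a friend",
    ]
    for text, labels in prefilter(sample):
        print(labels, text)
```

A keyword filter of this kind only narrows the corpus to candidate tweets; a match does not by itself indicate misinformation, which is why a classifier is applied afterward.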
Using model-labeled data increased the proportion of data matching misinformation indicators. Random forest classifier performance varied across the four conspiracy theories considered (F1 scores between 0.347 and 0.857) and was higher for more narrowly defined conspiracy theories. We showed that misinformation tweets demonstrate more negative sentiment than nonmisinformation tweets and that theories evolve over time, incorporating details from unrelated conspiracy theories as well as real-world events.
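For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the function name and the counts in the usage example are made up for illustration and are not results from the study.

```python
def f1_score(tp, fp, fn):
    """Compute F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        # With no true positives, precision and recall are zero or undefined;
        # this sketch reports 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts: 8 correct hits, 2 false alarms, 2 misses.
    print(round(f1_score(tp=8, fp=2, fn=2), 3))
```

Because F1 ignores true negatives, it suits tasks where the class of interest is rare relative to the rest of the data.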
Although we focus here on health-related misinformation, this combination of approaches is not specific to public health and is valuable for characterizing misinformation in general, which is an important first step in creating targeted messaging to counteract its spread. Initial messaging should aim to preempt generalized misinformation before it becomes widespread, while later messaging will need to target evolving conspiracy theories and the new facets of each as they become incorporated.