
Interdisciplinary Approach to Identify and Characterize COVID-19 Misinformation on Twitter: Mixed Methods Study.

Author Information

Isip Tan Iris Thiele, Cleofas Jerome, Solano Geoffrey, Pillejera Jeanne Genevive, Catapang Jasper Kyle

Affiliations

Medical Informatics Unit, College of Medicine, University of the Philippines Manila, Manila, Philippines.

Behavioral Sciences Department, De La Salle University, Manila, Philippines.

Publication Information

JMIR Form Res. 2023 Jun 28;7:e41134. doi: 10.2196/41134.

Abstract

BACKGROUND

Studying COVID-19 misinformation on Twitter presents methodological challenges. A computational approach can analyze large data sets, but it is limited when interpreting context. A qualitative approach allows for a deeper analysis of content, but it is labor-intensive and feasible only for smaller data sets.

OBJECTIVE

We aimed to identify and characterize tweets containing COVID-19 misinformation.

METHODS

Tweets geolocated to the Philippines (January 1 to March 21, 2020) containing the words coronavirus, covid, and ncov were mined using the GetOldTweets3 Python library. This primary corpus (N=12,631) was subjected to biterm topic modeling. Key informant interviews were conducted to elicit examples of COVID-19 misinformation and determine keywords. Using NVivo (QSR International) and a combination of word frequency and text search using key informant interview keywords, subcorpus A (n=5881) was constituted and manually coded to identify misinformation. Constant comparative, iterative, and consensual analyses were used to further characterize these tweets. Tweets containing key informant interview keywords were extracted from the primary corpus and processed to constitute subcorpus B (n=4634), of which 506 tweets were manually labeled as misinformation. This training set was subjected to natural language processing to identify tweets with misinformation in the primary corpus. These tweets were further manually coded to confirm labeling.
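The Methods describe the pipeline only at a high level, so the following is a minimal Python sketch of the mining, keyword-filtering, and classification steps under stated assumptions: the query terms and date range are taken from the abstract, while the geolocation parameters, the key informant interview keywords, the training examples, and the choice of a TF-IDF plus logistic regression classifier are hypothetical placeholders that the paper does not specify.

```python
# Sketch of the mining, keyword-filtering, and classification steps in the
# Methods. Query terms and date range come from the abstract; the geolocation
# settings, keyword list, training labels, and classifier are illustrative
# assumptions, since the abstract does not specify them.
import GetOldTweets3 as got
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Mine tweets geolocated to the Philippines, January 1 to March 21, 2020.
criteria = (
    got.manager.TweetCriteria()
    .setQuerySearch("coronavirus OR covid OR ncov")
    .setSince("2020-01-01")
    .setUntil("2020-03-21")
    .setNear("Philippines")   # hypothetical geolocation value
    .setWithin("500km")       # hypothetical search radius
    .setMaxTweets(0)          # 0 = no limit
)
primary_corpus = [t.text for t in got.manager.TweetManager.getTweets(criteria)]

# 2. Constitute subcorpus B: tweets containing key informant interview keywords.
kii_keywords = ["bioweapon", "hoax", "miracle cure"]  # hypothetical keywords
subcorpus_b = [
    tweet for tweet in primary_corpus
    if any(kw in tweet.lower() for kw in kii_keywords)
]

# 3. Train on the manually labeled tweets, then flag candidate misinformation
#    in the primary corpus. The abstract does not name the NLP model; TF-IDF
#    features with logistic regression are a stand-in for illustration only.
train_texts = ["drinking salt water kills the virus",    # hypothetical example
               "DOH confirms three new cases in Manila"]  # hypothetical example
train_labels = [1, 0]  # 1 = misinformation, 0 = not (from manual coding)
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)
flagged = [t for t in primary_corpus if model.predict([t])[0] == 1]
```

Note that GetOldTweets3 scraped a legacy Twitter search endpoint that has since been retired, so the mining step is shown only to mirror the workflow described above; the biterm topic modeling and NVivo coding steps are not reproduced in this sketch.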

RESULTS

Biterm topic modeling of the primary corpus revealed the following topics: uncertainty, lawmaker's response, safety measures, testing, loved ones, health standards, panic buying, tragedies other than COVID-19, economy, COVID-19 statistics, precautions, health measures, international issues, adherence to guidelines, and frontliners. These were categorized into 4 major topics: nature of COVID-19, contexts and consequences, people and agents of COVID-19, and COVID-19 prevention and management. Manual coding of subcorpus A identified 398 tweets with misinformation in the following formats: misleading content (n=179), satire and/or parody (n=77), false connection (n=53), conspiracy (n=47), and false context (n=42). The discursive strategies identified were humor (n=109), fear mongering (n=67), anger and disgust (n=59), political commentary (n=59), performing credibility (n=45), overpositivity (n=32), and marketing (n=27). Natural language processing identified 165 tweets with misinformation. However, a manual review showed that 69.7% (115/165) of tweets did not contain misinformation.

CONCLUSIONS

An interdisciplinary approach was used to identify tweets with COVID-19 misinformation. Natural language processing mislabeled tweets, likely due to tweets written in Filipino or a combination of the Filipino and English languages. Identifying the formats and discursive strategies of tweets with misinformation required iterative, manual, and emergent coding by human coders with experiential and cultural knowledge of Twitter. An interdisciplinary team composed of experts in health, health informatics, social science, and computer science combined computational and qualitative methods to gain a better understanding of COVID-19 misinformation on Twitter.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b267/10337476/bc8b9a34b306/formative_v7i1e41134_fig1.jpg
