
Garbage in, Garbage Out: Data Collection, Quality Assessment and Reporting Standards for Social Media Data Use in Health Research, Infodemiology and Digital Disease Detection.

Authors

Kim Yoonsang, Huang Jidong, Emery Sherry

Affiliation

Health Media Collaboratory, Institute for Health Research and Policy, University of Illinois at Chicago, Chicago, IL, United States.

Publication

J Med Internet Res. 2016 Feb 26;18(2):e41. doi: 10.2196/jmir.4738.

DOI: 10.2196/jmir.4738
PMID: 26920122
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4788740/

Abstract

BACKGROUND

Social media have transformed the communications landscape. People increasingly obtain news and health information online and via social media. Social media platforms also serve as novel sources of rich observational data for health research (including infodemiology, infoveillance, and digital disease detection). While the number of studies using social data is growing rapidly, very few of these studies transparently outline their methods for collecting, filtering, and reporting those data. Keywords and search filters applied to social data form the lens through which researchers may observe what and how people communicate about a given topic. Without a properly focused lens, research conclusions may be biased or misleading. Standards of reporting data sources and quality are needed so that data scientists and consumers of social media research can evaluate and compare methods and findings across studies.

OBJECTIVE

We aimed to develop and apply a framework of social media data collection and quality assessment and to propose a reporting standard, which researchers and reviewers may use to evaluate and compare the quality of social data across studies.

METHODS

We propose a conceptual framework consisting of three major steps in collecting social media data: develop, apply, and validate search filters. This framework is based on two criteria: retrieval precision (how much of the retrieved data is relevant) and retrieval recall (how much of the relevant data is retrieved). We then discuss two conditions on which estimation of retrieval precision and recall relies (accurate human coding and full data collection) and how to calculate these statistics in cases that deviate from these two ideal conditions. We then apply the framework to a real-world example using approximately 4 million tobacco-related tweets collected from the Twitter firehose.
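
The two criteria can be made concrete with a short calculation. The following is a minimal sketch under the ideal conditions named above (no human coding errors, full data collection); it is not the authors' implementation. It assumes that reviewers hand-code one random sample of retrieved messages and one of unretrieved messages, and that the sample rates are scaled up to each stratum. The `CodedSample` type, the function names, and all counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CodedSample:
    """A human-coded random sample drawn from one stratum of the archive."""
    population: int  # total messages in the stratum (retrieved or unretrieved)
    sampled: int     # messages that reviewers coded by hand
    relevant: int    # sampled messages coded as relevant

    def relevant_rate(self) -> float:
        """Share of the sample coded as relevant."""
        return self.relevant / self.sampled

    def estimated_relevant(self) -> float:
        """Sample rate scaled up to the whole stratum."""
        return self.population * self.relevant_rate()


def retrieval_precision(retrieved: CodedSample) -> float:
    """How much of the retrieved data is relevant."""
    return retrieved.relevant_rate()


def retrieval_recall(retrieved: CodedSample, unretrieved: CodedSample) -> float:
    """How much of the relevant data is retrieved."""
    hits = retrieved.estimated_relevant()
    misses = unretrieved.estimated_relevant()
    return hits / (hits + misses)


# Hypothetical counts for illustration only; they are not taken from the paper.
retrieved = CodedSample(population=80_000, sampled=500, relevant=480)
unretrieved = CodedSample(population=920_000, sampled=1_000, relevant=10)

precision = retrieval_precision(retrieved)
recall = retrieval_recall(retrieved, unretrieved)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

The sketch does not cover the non-ideal cases discussed in the abstract (coder errors, unretrieved messages that could not be archived); those require adjustments that the abstract does not spell out.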

RESULTS

We developed and applied a search filter to retrieve e-cigarette-related tweets from the archive based on three keyword categories: devices, brands, and behavior. The search filter retrieved 82,205 e-cigarette-related tweets from the archive and was validated. Retrieval precision was calculated to be above 95% in all cases. Retrieval recall was 86% assuming ideal conditions (no human coding errors and full data collection), 75% when unretrieved messages could not be archived, 86% assuming no false negative errors by coders, and 93% allowing both false negative and false positive errors by human coders.
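
A keyword-category search filter of this kind might look like the sketch below. It is an illustration only: the abstract does not list the actual keywords or matching rules, so every term in `KEYWORDS` is a placeholder (including the made-up brand name), and whole-term, case-insensitive regex matching is an assumed design choice.

```python
import re

# Placeholder keywords grouped by the three categories named in the abstract.
# These terms are illustrative assumptions, not the authors' keyword list.
KEYWORDS = {
    "devices": ["e-cig", "e-cigarette", "vape pen"],
    "brands": ["examplebrand"],  # hypothetical brand name
    "behavior": ["vaping"],
}


def compile_filters(keywords):
    """Compile one case-insensitive, whole-term regex per keyword category."""
    filters = {}
    for category, terms in keywords.items():
        alternatives = "|".join(re.escape(term) for term in terms)
        # Lookarounds keep a term from matching inside a longer word.
        filters[category] = re.compile(
            rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE
        )
    return filters


def matched_categories(text, filters):
    """Return the set of categories whose keywords appear in the text."""
    return {name for name, pattern in filters.items() if pattern.search(text)}


def apply_filter(messages, filters):
    """Keep only the messages that match at least one category."""
    return [m for m in messages if matched_categories(m, filters)]


FILTERS = compile_filters(KEYWORDS)
archive = [
    "Trying my new vape pen today",
    "Vaping is banned here",
    "ExampleBrand e-cigarette deal",
    "Smoking a cigarette",
]
retrieved = apply_filter(archive, FILTERS)
print(len(retrieved), "of", len(archive), "messages retrieved")
```

Recording which category matched each message makes it easier to see, during validation, which keyword group contributes false positives.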

CONCLUSIONS

This paper sets forth a conceptual framework for the filtering and quality evaluation of social data that addresses several common challenges and moves toward establishing a standard of reporting social data. Researchers should clearly delineate data sources, how data were accessed and collected, the search filter building process, and how retrieval precision and recall were calculated. The proposed framework can be adapted to other public social media platforms.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e73/4788740/4dde352cab7f/jmir_v18i2e41_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e73/4788740/c18ec12ed5cf/jmir_v18i2e41_fig2.jpg

Similar articles

1. Garbage in, Garbage Out: Data Collection, Quality Assessment and Reporting Standards for Social Media Data Use in Health Research, Infodemiology and Digital Disease Detection.
   J Med Internet Res. 2016 Feb 26;18(2):e41. doi: 10.2196/jmir.4738.
2. Toward a Mixed-Methods Research Approach to Content Analysis in The Digital Age: The Combined Content-Analysis Model and its Applications to Health Care Twitter Feeds.
   J Med Internet Res. 2016 Mar 8;18(3):e60. doi: 10.2196/jmir.5391.
3. Applying Multiple Data Collection Tools to Quantify Human Papillomavirus Vaccine Communication on Twitter.
   J Med Internet Res. 2016 Dec 5;18(12):e318. doi: 10.2196/jmir.6670.
4. Establishing a Link Between Prescription Drug Abuse and Illicit Online Pharmacies: Analysis of Twitter Data.
   J Med Internet Res. 2015 Dec 16;17(12):e280. doi: 10.2196/jmir.5144.
5. A Scalable Framework to Detect Personal Health Mentions on Twitter.
   J Med Internet Res. 2015 Jun 5;17(6):e138. doi: 10.2196/jmir.4305.
6. How to exploit twitter for public health monitoring?
   Methods Inf Med. 2013;52(4):326-39. doi: 10.3414/ME12-02-0010. Epub 2013 Jul 23.
7. Identifying Topics for E-Cigarette User-Generated Contents: A Case Study From Multiple Social Media Platforms.
   J Med Internet Res. 2017 Jan 20;19(1):e24. doi: 10.2196/jmir.5780.
8. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet.
   J Med Internet Res. 2009 Mar 27;11(1):e11. doi: 10.2196/jmir.1157.
9. Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter.
   J Biomed Inform. 2018 Nov;87:68-78. doi: 10.1016/j.jbi.2018.10.001. Epub 2018 Oct 4.
10. Public Discussion of Anthrax on Twitter: Using Machine Learning to Identify Relevant Topics and Events.
    JMIR Public Health Surveill. 2021 Jun 18;7(6):e27976. doi: 10.2196/27976.

Cited by

1. A Custom Keyword Tool for Improving the Quality of Social Media Monitoring on Vaccine Safety: A Proof of Concept.
   Int J Public Health. 2025 Aug 21;70:1608480. doi: 10.3389/ijph.2025.1608480. eCollection 2025.
2. Promotion of Health-Harming Products on Instagram: Characterizing Strategies Boosting Audience Engagement with Cigar Marketing Messages.
   Int J Environ Res Public Health. 2025 Aug 17;22(8):1285. doi: 10.3390/ijerph22081285.
3. Food Access in New York City During the COVID-19 Pandemic: Social Media Monitoring Study.
   JMIR Form Res. 2025 May 9;9:e49520. doi: 10.2196/49520.
4. Leveraging social media data to study disease and treatment characteristics of Hodgkin's lymphoma Using Natural Language Processing methods.
   PLOS Digit Health. 2025 Mar 19;4(3):e0000765. doi: 10.1371/journal.pdig.0000765. eCollection 2025 Mar.
5. Content analysis of substance use disorder recovery discourse on Twitter: From personal recovery narratives to marketing of addiction treatment.
   Alcohol Clin Exp Res (Hoboken). 2025 Mar;49(3):629-640. doi: 10.1111/acer.15531. Epub 2025 Feb 22.
6. AI for Tobacco Control: Identifying Tobacco-Promoting Social Media Content Using Large Language Models.
   Nicotine Tob Res. 2025 May 22;27(6):988-996. doi: 10.1093/ntr/ntae276.
7. Deciphering Influence on Social Media: A Comparative Analysis of Influential Account Detection Metrics in the Context of Tobacco Promotion.
   Soc Media Soc. 2024 Jan;10(1). doi: 10.1177/20563051231224268. Epub 2024 Jan 27.
8. The Normalization of Vaping on TikTok Using Computer Vision, Natural Language Processing, and Qualitative Thematic Analysis: Mixed Methods Study.
   J Med Internet Res. 2024 Sep 11;26:e55591. doi: 10.2196/55591.
9. ChatGPT for Automated Qualitative Research: Content Analysis.
   J Med Internet Res. 2024 Jul 25;26:e59050. doi: 10.2196/59050.
10. Understanding public perceptions and discussions on diseases involving chronic pain through social media: cross-sectional infodemiology study.
    BMC Musculoskelet Disord. 2024 Jul 22;25(1):569. doi: 10.1186/s12891-024-07687-5.

References

1. Identifying Adverse Effects of HIV Drug Treatment and Associated Sentiments Using Twitter.
   JMIR Public Health Surveill. 2015 Jul 27;1(2):e7. doi: 10.2196/publichealth.4488. eCollection 2015 Jul-Dec.
2. Electronic Cigarette Marketing Online: a Multi-Site, Multi-Product Comparison.
   JMIR Public Health Surveill. 2015 Sep 11;1(2):e11. doi: 10.2196/publichealth.4777. eCollection 2015 Jul-Dec.
3. Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use: An Infoveillance Study.
   J Med Internet Res. 2015 Nov 6;17(11):e251. doi: 10.2196/jmir.4466.
4. Social Listening: A Content Analysis of E-Cigarette Discussions on Twitter.
   J Med Internet Res. 2015 Oct 27;17(10):e243. doi: 10.2196/jmir.4969.
5. Electronic Cigarettes Among Priority Populations: Role of Smoking Cessation and Tobacco Control Policies.
   Am J Prev Med. 2016 Feb;50(2):199-209. doi: 10.1016/j.amepre.2015.06.032. Epub 2015 Sep 26.
6. The Canary in the Coal Mine Tweets: Social Media Reveals Public Perceptions of Non-Medical Use of Opioids.
   PLoS One. 2015 Aug 7;10(8):e0135072. doi: 10.1371/journal.pone.0135072. eCollection 2015.
7. Disease detection or public opinion reflection? Content analysis of tweets, other social media, and online newspapers during the measles outbreak in The Netherlands in 2013.
   J Med Internet Res. 2015 May 26;17(5):e128. doi: 10.2196/jmir.3863.
8. A new source of data for public health surveillance: Facebook likes.
   J Med Internet Res. 2015 Apr 20;17(4):e98. doi: 10.2196/jmir.3970.
9. Ebola and the social media.
   Lancet. 2014 Dec 20;384(9961):2207. doi: 10.1016/S0140-6736(14)62418-1. Epub 2014 Dec 19.
10. Psychological language on Twitter predicts county-level heart disease mortality.
    Psychol Sci. 2015 Feb;26(2):159-69. doi: 10.1177/0956797614557867. Epub 2015 Jan 20.