Department of Biomedical Informatics, Harvard Medical School, Harvard University, Boston, MA, United States.
Data Intelligence for Health Lab, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada.
JMIR Public Health Surveill. 2022 Feb 14;8(2):e32355. doi: 10.2196/32355.
Advances in automated data processing and machine learning (ML) models, together with the unprecedented growth in the number of social media users who publicly share and discuss health-related information, have made public health surveillance (PHS) one of the most enduring applications of social media. However, existing PHS systems that rely on social media data have not been widely deployed in national surveillance systems, which appears to stem from practitioners' and the public's lack of trust in social media data. Developing more robust and reliable data sets on which supervised ML models can be trained and tested is a significant step toward overcoming this hurdle. The health implications of daily behaviors (physical activity, sedentary behavior, and sleep [PASS]), an evergreen topic in PHS, are widely studied through traditional data sources such as surveillance surveys and administrative databases. These sources are often several months out of date by the time they are used and are costly to collect, and are thus limited in quantity and coverage.
The main objective of this study is to present a large-scale, multicountry, longitudinal, and fully labeled data set to enable and support digital PASS surveillance research in PHS. To support high-quality surveillance research using our data set, we have conducted further analysis on the data set to supplement it with additional PHS-related metadata.
We collected the data for this study from Twitter using the Twitter livestream application programming interface between November 28, 2018, and June 19, 2020. To obtain PASS-related tweets for manual annotation, we iteratively used regular expressions, unsupervised natural language processing, domain-specific ontologies, and linguistic analysis. We used Amazon Mechanical Turk to label the collected data into self-reported PASS categories and implemented a quality control pipeline to monitor and manage the validity of crowd-generated labels. Moreover, we used ML, latent semantic analysis, linguistic analysis, and label inference analysis to validate the different components of the data set.
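As an illustration of the regular-expression step only, the following is a minimal sketch of how candidate PASS tweets might be flagged before manual annotation. The keyword patterns and the function name `candidate_categories` are hypothetical and are not the expressions used in the study, which also relied on ontologies and linguistic analysis not shown here.

```python
import re

# Hypothetical keyword patterns, chosen for illustration only.
# The study's actual expressions are not reproduced here.
PASS_PATTERNS = {
    "physical_activity": re.compile(
        r"\b(jog(ging)?|gym|workout|walk(ed|ing)?|run(ning)?)\b", re.IGNORECASE
    ),
    "sedentary_behavior": re.compile(
        r"\b(binge[- ]?watch(ing|ed)?|sitting|couch|netflix)\b", re.IGNORECASE
    ),
    "sleep": re.compile(
        r"\b(sleep(ing|y)?|slept|nap(ping)?|insomnia)\b", re.IGNORECASE
    ),
}


def candidate_categories(text):
    """Return the sorted PASS categories whose pattern matches the tweet text."""
    return sorted(
        name for name, pattern in PASS_PATTERNS.items() if pattern.search(text)
    )
```

A keyword match of this kind only marks a tweet as a candidate; whether it is a genuine self-report is a question left to the annotators.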
LPHEADA (Labelled Digital Public Health Dataset) contains 366,405 crowd-generated labels (3 labels per tweet) for 122,135 PASS-related tweets that originated in Australia, Canada, the United Kingdom, or the United States, labeled by 708 unique annotators on Amazon Mechanical Turk. In addition to crowd-generated labels, LPHEADA provides details about the three critical components of any PHS system: place, time, and demographics (ie, gender and age range) associated with each tweet.
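With 3 labels per tweet, the label count follows directly from the tweet count (122,135 × 3 = 366,405). One common way to reduce three crowd labels to a single label is a strict majority vote, sketched below. This is an assumption made for illustration: it is not stated to be the label inference method used for LPHEADA, and the function name `majority_label` and the example label values are hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by a strict majority, or None if there is none."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Require more than half of the votes, so a 1-1-1 split yields None.
    return label if count > len(labels) / 2 else None


# Consistency check on the counts reported in the abstract.
TWEETS = 122_135
LABELS_PER_TWEET = 3
TOTAL_LABELS = TWEETS * LABELS_PER_TWEET
```

Tweets for which no majority exists would need a separate rule, for example exclusion or further annotation.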
Publicly available data sets for digital PASS surveillance are usually isolated and only provide labels for small subsets of the data. We believe that the novelty and comprehensiveness of the data set provided in this study will help develop, evaluate, and deploy digital PASS surveillance systems. LPHEADA will be an invaluable resource for both public health researchers and practitioners.