An unsupervised machine learning model for discovering latent infectious diseases using social media data.

Authors

Lim Sunghoon, Tucker Conrad S, Kumara Soundar

Affiliations

Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA.

School of Engineering Design, Technology, and Professional Programs, The Pennsylvania State University, University Park, PA 16802, USA; Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA.

Publication information

J Biomed Inform. 2017 Feb;66:82-94. doi: 10.1016/j.jbi.2016.12.007. Epub 2016 Dec 26.

DOI:10.1016/j.jbi.2016.12.007

PMID:28034788

Abstract

INTRODUCTION

This work proposes an unsupervised machine learning model that identifies real-world latent infectious diseases by mining social media data. In this study, a latent infectious disease is defined as a communicable disease that has not yet been formalized by national public health institutes and explicitly communicated to the general public. Most existing approaches to infectious-disease-related knowledge discovery through social media networks are top-down: they rely on already known information, such as the names of diseases and their symptoms. For a latent disease, that necessary information is still unknown, so top-down approaches mostly cannot identify the disease in social media data until national public health institutes have formalized it. Most formalization processes for latent infectious diseases are time-consuming. Therefore, this study presents a bottom-up approach that discovers latent infectious diseases in a given location without prior information such as disease names and related symptoms.

METHODS

Social media messages with user and temporal information are extracted during the data preprocessing stage. An unsupervised sentiment analysis model is then presented. Users' expressions about symptoms, body parts, and pain locations are also identified from social media data. Then, symptom weighting vectors for each individual and time period are created, based on their sentiment and social media expressions. Finally, latent-infectious-disease-related information is retrieved from individuals' symptom weighting vectors.
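The steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the lexicons, the sentiment model, or the weighting rule, so the term sets `SYMPTOM_TERMS` and `NEGATIVE_TERMS`, the `1.0 + negativity` weight, and the weekly time period are all hypothetical choices made only to show the shape of a per-user, per-period symptom weighting vector.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy lexicons; the abstract does not list the terms the authors used.
SYMPTOM_TERMS = {"cough", "fever", "headache", "sore"}
NEGATIVE_TERMS = {"awful", "terrible", "worse", "sick"}


def negativity(tokens):
    """Share of tokens found in the toy negative lexicon (0.0 to 1.0)."""
    if not tokens:
        return 0.0
    return sum(t in NEGATIVE_TERMS for t in tokens) / len(tokens)


def symptom_vectors(messages):
    """Map (user, ISO year, ISO week) to {symptom term: accumulated weight}."""
    vectors = defaultdict(lambda: defaultdict(float))
    for user, day, text in messages:
        # Crude tokenization: split on whitespace, strip punctuation, lowercase.
        tokens = [t.strip(".,!?").lower() for t in text.split()]
        # Assumed weighting rule: more negative messages count for more.
        weight = 1.0 + negativity(tokens)
        year, week, _ = day.isocalendar()
        for token in tokens:
            if token in SYMPTOM_TERMS:
                vectors[(user, year, week)][token] += weight
    return {key: dict(vec) for key, vec in vectors.items()}


# Each message carries user, temporal, and textual information.
messages = [
    ("u1", date(2012, 11, 5), "Awful fever and cough today"),
    ("u1", date(2012, 11, 6), "cough again"),
    ("u2", date(2012, 11, 5), "Great game last night"),
]
vectors = symptom_vectors(messages)
```

In this toy run, only user `u1` receives a vector, because the message from `u2` contains no symptom terms. A later stage would then look for symptom patterns shared across many users in the same period, which the sketch does not attempt.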

DATASETS AND RESULTS

Twitter data from August 2012 to May 2013 are used to validate this study. Real electronic medical records for 104 individuals who were diagnosed with influenza in the same period serve as ground truth. The results are promising, with highest precision, recall, and F score values of 0.773, 0.680, and 0.724, respectively.
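As a quick consistency check on the reported figures, the short Python sketch below computes the balanced F score (F1, the harmonic mean of precision and recall). The abstract does not state which F score variant was used or whether the three values come from the same configuration, so treating 0.724 as the F1 of 0.773 and 0.680 is an assumption.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the balanced F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_f = f_score(0.773, 0.680)
print(round(reported_f, 3))  # prints 0.724
```

Under that assumption, the reported F score matches the reported precision and recall.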

CONCLUSION

This work uses individuals' social media messages to identify latent infectious diseases, without prior information, sooner than national public health institutes formalize the disease(s). In particular, the unsupervised machine learning model combines user, textual, and temporal information in social media data with sentiment analysis to identify latent infectious diseases in a given location.

Similar articles

An unsupervised machine learning model for discovering latent infectious diseases using social media data.

J Biomed Inform. 2017 Feb;66:82-94. doi: 10.1016/j.jbi.2016.12.007. Epub 2016 Dec 26.

Results and Methodological Implications of the Digital Epidemiology of Prescription Drug References Among Twitter Users: Latent Dirichlet Allocation (LDA) Analyses.

J Med Internet Res. 2023 Jul 28;25:e48405. doi: 10.2196/48405.

Prediction of infectious diseases using sentiment analysis on social media data.

PLoS One. 2024 Sep 4;19(9):e0309842. doi: 10.1371/journal.pone.0309842. eCollection 2024.

Unsupervised machine learning for the discovery of latent disease clusters and patient subgroups using electronic health records.

J Biomed Inform. 2020 Feb;102:103364. doi: 10.1016/j.jbi.2019.103364. Epub 2019 Dec 28.

Exploring trends of nonmedical use of prescription drugs and polydrug abuse in the Twittersphere using unsupervised machine learning.

Addict Behav. 2017 Feb;65:289-295. doi: 10.1016/j.addbeh.2016.08.019. Epub 2016 Aug 17.

An ensemble heterogeneous classification methodology for discovering health-related knowledge in social media messages.

J Biomed Inform. 2014 Jun;49:255-68. doi: 10.1016/j.jbi.2014.03.005. Epub 2014 Mar 16.

Understanding Health Care Social Media Use From Different Stakeholder Perspectives: A Content Analysis of an Online Health Community.

J Med Internet Res. 2017 Apr 7;19(4):e109. doi: 10.2196/jmir.7087.

Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis.

J Med Internet Res. 2017 Oct 30;19(10):e361. doi: 10.2196/jmir.8164.

Mining Health Social Media with Sentiment Analysis.

J Med Syst. 2016 Nov;40(11):236. doi: 10.1007/s10916-016-0604-4. Epub 2016 Sep 23.

Demographic-Based Content Analysis of Web-Based Health-Related Social Media.

J Med Internet Res. 2016 Jun 13;18(6):e148. doi: 10.2196/jmir.5327.

Cited by

Global infectious disease early warning models: An updated review and lessons from the COVID-19 pandemic.

Infect Dis Model. 2024 Dec 3;10(2):410-422. doi: 10.1016/j.idm.2024.12.001. eCollection 2025 Jun.

Finding polarized communities and tracking information diffusion on Twitter: a network approach on the Irish Abortion Referendum.

R Soc Open Sci. 2025 Jan 15;12(1):240454. doi: 10.1098/rsos.240454. eCollection 2025 Jan.

Primary care research on hypertension: A bibliometric analysis using machine-learning.

Medicine (Baltimore). 2024 Nov 22;103(47):e40482. doi: 10.1097/MD.0000000000040482.

Tracking mosquito-borne diseases via social media: a machine learning approach to topic modelling and sentiment analysis.

PeerJ. 2024 Mar 1;12:e17045. doi: 10.7717/peerj.17045. eCollection 2024.

Identifying diseases symptoms and general rules using supervised and unsupervised machine learning.

Sci Rep. 2024 Aug 2;14(1):17956. doi: 10.1038/s41598-024-69029-8.

Recent advances and applications of artificial intelligence in 3D bioprinting.

Biophys Rev (Melville). 2024 Jul 19;5(3):031301. doi: 10.1063/5.0190208. eCollection 2024 Sep.

A Comparison of ChatGPT and Fine-Tuned Open Pre-Trained Transformers (OPT) Against Widely Used Sentiment Analysis Tools: Sentiment Analysis of COVID-19 Survey Data.

JMIR Ment Health. 2024 Jan 25;11:e50150. doi: 10.2196/50150.

Early-stage pregnancy recognition on microblogs: Machine learning and lexicon-based approaches.

Heliyon. 2023 Sep 14;9(9):e20132. doi: 10.1016/j.heliyon.2023.e20132. eCollection 2023 Sep.

Editorial: Infectious Disease Surveillance Using Artificial Intelligence (AI) and its Role in Epidemic and Pandemic Preparedness.

Med Sci Monit. 2023 Jun 1;29:e941209. doi: 10.12659/MSM.941209.

Predicting the Number of Reported Pulmonary Tuberculosis in Guiyang, China, Based on Time Series Analysis Techniques.

Comput Math Methods Med. 2022 Oct 30;2022:7828131. doi: 10.1155/2022/7828131. eCollection 2022.
