An unsupervised machine learning model for discovering latent infectious diseases using social media data.

Authors

Lim Sunghoon, Tucker Conrad S, Kumara Soundar

Affiliations

Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA.

School of Engineering Design, Technology, and Professional Programs, The Pennsylvania State University, University Park, PA 16802, USA; Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA.

Publication information

J Biomed Inform. 2017 Feb;66:82-94. doi: 10.1016/j.jbi.2016.12.007. Epub 2016 Dec 26.

Abstract

INTRODUCTION

The authors of this work propose an unsupervised machine learning model that has the ability to identify real-world latent infectious diseases by mining social media data. In this study, a latent infectious disease is defined as a communicable disease that has not yet been formalized by national public health institutes and explicitly communicated to the general public. Most existing approaches to modeling infectious-disease-related knowledge discovery through social media networks are top-down approaches that are based on already known information, such as the names of diseases and their symptoms. In existing top-down approaches, necessary but unknown information, such as disease names and symptoms, is mostly unidentified in social media data until national public health institutes have formalized that disease. Most of the formalizing processes for latent infectious diseases are time consuming. Therefore, this study presents a bottom-up approach for latent infectious disease discovery in a given location without prior information, such as disease names and related symptoms.

METHODS

Social media messages with user and temporal information are extracted during the data preprocessing stage. An unsupervised sentiment analysis model is then presented. Users' expressions about symptoms, body parts, and pain locations are also identified from social media data. Then, symptom weighting vectors for each individual and time period are created, based on their sentiment and social media expressions. Finally, latent-infectious-disease-related information is retrieved from individuals' symptom weighting vectors.
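The abstract does not specify how the symptom weighting vectors are computed, so the following is only a minimal sketch of the general idea: messages carrying user and date information are grouped by individual and time period, and each symptom mention is weighted by how negative the message reads. The lexicons, the weekly time period, the `negativity` scorer, and the `1 + negativity` weighting are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from datetime import date

# Hypothetical lexicons, for illustration only; the abstract does not list
# the terms or the sentiment model the authors used.
SYMPTOM_TERMS = ("cough", "fever", "headache", "sore throat")
NEGATIVE_WORDS = {"awful", "terrible", "sick", "worse", "hurts"}


def negativity(text):
    """Toy sentiment score: share of tokens that are negative words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(t.strip(".,!?") in NEGATIVE_WORDS for t in tokens)
    return hits / len(tokens)


def symptom_vectors(messages):
    """Map (user, ISO year, ISO week) to a symptom weight vector.

    Each symptom mention adds 1 + negativity(text), so mentions in more
    negative messages weigh more. This weighting is an assumption.
    """
    vectors = defaultdict(lambda: [0.0] * len(SYMPTOM_TERMS))
    for user, day, text in messages:
        year, week, _ = day.isocalendar()
        lowered = text.lower()
        weight = 1.0 + negativity(text)
        for i, term in enumerate(SYMPTOM_TERMS):
            if term in lowered:
                vectors[(user, year, week)][i] += weight
    return dict(vectors)


# Made-up messages: (user, date, text)
messages = [
    ("u1", date(2012, 11, 5), "Terrible cough and fever today"),
    ("u1", date(2012, 11, 6), "Headache is worse"),
    ("u2", date(2012, 11, 5), "Great game last night"),
]
vectors = symptom_vectors(messages)
```

In this toy run only user `u1` receives a vector, because `u2` mentions no symptom term. Retrieving disease-related information from such vectors (the final step described above) would require a further grouping or clustering step that the abstract does not detail.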

DATASETS AND RESULTS

Twitter data from August 2012 to May 2013 are used to validate this study. Real electronic medical records for 104 individuals, who were diagnosed with influenza in the same period, are used to serve as ground truth validation. The results are promising, with the highest precision, recall, and F score values of 0.773, 0.680, and 0.724, respectively.
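As a quick consistency check on these figures, the short sketch below computes the balanced F score (F1, the harmonic mean of precision and recall). It assumes that the reported F score is F1 and that the three values come from the same configuration; the abstract says only that they are the highest values, so this is an assumption rather than something the source confirms.

```python
def f1_score(precision, recall):
    """Balanced F score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract
precision, recall = 0.773, 0.680
f1 = f1_score(precision, recall)
print(round(f1, 3))  # prints 0.724
```

Under those assumptions the computed value matches the reported F score of 0.724.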

CONCLUSION

This work uses individuals' social media messages to identify latent infectious diseases without prior information, and to do so sooner than national public health institutes formalize the disease(s). In particular, the unsupervised machine learning model, which uses user, textual, and temporal information in social media data along with sentiment analysis, identifies latent infectious diseases in a given location.
