Graduate School of Information, Yonsei University, Seoul 03722, South Korea.
Department of Computer Science, Sangmyung University, Seoul 03016, South Korea.
J Biomed Inform. 2022 Sep;133:104148. doi: 10.1016/j.jbi.2022.104148. Epub 2022 Jul 22.
Perhaps no other generation in recorded human history has endured the risks of infectious diseases as the current generation has. The prevalence of infectious diseases is driven mainly by unrestricted contact between people in a highly globalized world. Centers for disease control and prevention (CDCs) promptly collect and produce disease outbreak statistics, but they rely on a curated, centralized collection system and require up to two weeks of lead time. Consequently, the quick release of disease outbreak information has become a great challenge. Infectious disease outbreak information is recorded and spread on the Internet much faster than CDC announcements, and Internet-sourced data have shown irreplaceable potential for monitoring and predicting infectious disease outbreaks in advance. In this study, we performed a thorough analysis of the similarity, in terms of time-series volume, between the Korean Center of Disease Control (KCDC) infectious disease datasets and three Internet-sourced datasets for nine major infectious diseases. The results show that many infectious disease outbreaks are strongly related to Internet-sourced data. We analyzed several factors that affect the similarity. Our analysis shows that an increase in the volume of Internet-sourced data correlates with an increase in the number of infected people, and thus the two show positive similarity. We also found that a greater number of infectious disease outbreaks corresponds to a wider spread of outbreak regions, which in turn is associated with higher similarity. We present infectious disease outbreak prediction results obtained using various Internet-sourced data and an effective deep learning algorithm. The results show that prediction accuracy correlates positively with both the number of infected people and the volume of related web data.
We developed and currently operate a web-based system that shows the similarity between KCDC data and related Internet-sourced data for infectious diseases. This paper helps readers identify which kind of Internet-sourced data to use to predict and track a specific infectious disease outbreak. Because we considered as many as nine major diseases and three kinds of Internet-sourced data together, we can say that our findings do not depend on a specific infectious disease or a specific kind of Internet-sourced data.
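The abstract does not say which similarity measure was used to compare the time series. As a rough illustration only, the sketch below assumes a lagged Pearson correlation between a series of officially reported case counts and a series of web-mention counts, and searches for the lead (in reporting periods) at which the web series best matches the official one. The function names `pearson` and `best_lead`, the `max_lead` parameter, and the sample numbers are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series has no trend to compare
    return cov / sqrt(vx * vy)


def best_lead(official, web, max_lead=4):
    """Return (lead, r) where `lead` is the number of periods by which
    `web` precedes `official` that gives the highest correlation,
    searched over 0..max_lead (assumed search range)."""
    best = (0, pearson(official, web))
    for lead in range(1, max_lead + 1):
        # Pair web[t] with official[t + lead].
        r = pearson(official[lead:], web[:-lead])
        if r > best[1]:
            best = (lead, r)
    return best


# Made-up weekly counts: the web series peaks one week before the official one.
official_cases = [0, 0, 1, 3, 7, 4, 2, 1, 0, 0]
web_mentions = [0, 1, 3, 7, 4, 2, 1, 0, 0, 0]
lead, r = best_lead(official_cases, web_mentions)
```

With these made-up counts the search selects a lead of one period, which is the kind of early signal the abstract attributes to Internet-sourced data. A real analysis would also need to handle differing scales, missing weeks, and seasonality, none of which this sketch addresses.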