Discovering Time-Varying Public Interest for COVID-19 Case Prediction in South Korea Using Search Engine Queries: Infodemiology Study.

Author information

Ahn Seong-Ho, Yim Kwangil, Won Hyun-Sik, Kim Kang-Min, Jeong Dong-Hwa

Affiliations

Department of Artificial Intelligence, The Catholic University of Korea, Bucheon-Si, Republic of Korea.

Department of Hospital Pathology, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.

Publication information

J Med Internet Res. 2024 Dec 16;26:e63476. doi: 10.2196/63476.

DOI:10.2196/63476

PMID:39680913

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11686031/

Abstract

BACKGROUND

The number of confirmed COVID-19 cases is a crucial indicator that informs policies and lifestyles. Previous studies have attempted to forecast cases with machine learning techniques that use past case counts and search engine queries predetermined by experts. However, these approaches are limited in their ability to reflect temporal variations in queries associated with pandemic dynamics.

OBJECTIVE

This study proposes a novel framework to extract keywords highly associated with COVID-19 while accounting for when they occur. We aim to extract relevant keywords that follow pandemic variations by using query expansion. Additionally, we examine time-delayed web-based search behavior related to public interest in COVID-19 and adjust for these delays to improve prediction performance.

METHODS

To capture temporal semantics regarding COVID-19, word embedding models were trained on a news corpus, and the top 100 words related to "Corona" were extracted over 4-month windows. Time-lagged cross-correlation was applied to select optimal time lags correlated to confirmed cases from the expanded queries. Subsequently, ElasticNet regression models were trained after reducing the feature dimensions using principal component analysis of the time-lagged features to predict future daily case counts.
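The lag-selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `pearson` and `best_lag`, the choice of the largest absolute Pearson correlation as the selection rule, the `max_lag` bound, and the toy data are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(query_volume, cases, max_lag=14):
    """Return (lag, r) such that query_volume[t - lag] correlates most strongly with cases[t].

    Assumption: "optimal" means the largest absolute Pearson correlation.
    """
    best = (0, pearson(query_volume, cases))
    for lag in range(1, max_lag + 1):
        if lag >= len(cases) - 1:  # keep at least two overlapping points
            break
        # Shift the query series forward by `lag` days before correlating.
        r = pearson(query_volume[:-lag], cases[lag:])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Toy data (made up): case counts follow the search volume with a 3-day delay.
query_volume = [1, 3, 2, 5, 4, 8, 6, 9, 7, 12, 10, 11, 15, 13, 14, 18]
cases = [0, 0, 0] + [2 * q for q in query_volume[:-3]]
lag, r = best_lag(query_volume, cases, max_lag=7)
print(lag, round(r, 3))
```

On this toy series the search recovers the 3-day delay built into the data. In a pipeline like the one described, each expanded query would be shifted by its selected lag before the features are passed to principal component analysis and the ElasticNet regressor.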

RESULTS

Our approach successfully extracted relevant keywords for each pandemic phase, encompassing keywords directly related to COVID-19, such as its symptoms, and keywords reflecting its societal impact. Specifically, during the first outbreak, keywords directly linked to COVID-19 and to past outbreaks of similar infectious diseases exhibited a high positive correlation. In the second phase of the pandemic, as community infections emerged, keywords related to the government's pandemic control policies were frequently observed with a high positive correlation. In the third phase of the pandemic, during the delta variant outbreak, keywords such as "economic crisis" and "anxiety" appeared, reflecting public fatigue. Consequently, prediction models trained on the queries extracted over 4-month windows outperformed previous methods for most predictions 1-14 days ahead. Notably, for predictions 9-11 days ahead, our approach showed significantly higher Pearson correlation coefficients than models based solely on the number of past cases (P=.02, P<.01, and P<.01), whereas heuristic- and symptom-based query sets did not.

CONCLUSIONS

This study proposes a novel COVID-19 case-prediction model that automatically extracts relevant queries over time using word embedding. The model outperformed previous methods that relied on static symptom-based or heuristic queries, even without prior expert knowledge. The results demonstrate the capability of our approach to track temporal shifts in public interest regarding changes in the pandemic.
