An ensemble approach improves the prediction of the COVID-19 pandemic in South Korea.

Author information

Han Kyulhee, Apio Catherine, Song Hanbyul, Lee Bogyeom, Hu Xuwen, Park Jiwon, Zhe Liu, Goo Taewan, Park Taesung

Affiliations

Interdisciplinary Program of Bioinformatics, Seoul National University, Seoul, Korea.

Department of Industrial Engineering, Seoul National University, Seoul, Korea.

Publication information

J Glob Health. 2025 Mar 28;15:04079. doi: 10.7189/jogh.15.04079.

DOI:10.7189/jogh.15.04079

PMID:40146993

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11949510/

Abstract

BACKGROUND

Modelling can contribute to disease prevention and control strategies. Accurate predictions of future cases and mortality rates were essential for establishing appropriate policies during the COVID-19 pandemic. However, no single model yielded definite conclusions, with each having specific strengths and weaknesses. Here we propose an ensemble learning approach that can offset the limitations of each model and improve prediction performance.

METHODS

We generated predictions for the transmission and impact of COVID-19 in South Korea using seven individual models, including mathematical, statistical, and machine learning approaches. We integrated these predictions using three ensemble methods: stacking, average, and weighted average ensemble (WAE). We used training and test errors to measure each model's performance and selected the best covariate combinations based on the lowest training error. We then evaluated model performance using five error measures (r, weighted mean absolute percentage error (WMAPE), mean squared error (MSE), root mean squared error (RMSE), and mean absolute percentage error (MAPE)) and selected the optimal covariate combination accordingly. To validate the generalisability of our approach, we applied the same modelling framework to USA data.
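The error measures and the two averaging ensembles named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: r is taken to be the Pearson correlation coefficient, the WAE weights are assumed to be proportional to the inverse of each model's training error (the abstract does not specify the weighting scheme), and all function names and sample numbers are hypothetical. Stacking, which trains a meta-model on the individual models' predictions, is left out for brevity.

```python
import math


def wmape(actual, predicted):
    """Weighted MAPE: total absolute error divided by total absolute actual value."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(abs(a) for a in actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; undefined if any actual value is 0."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(mse(actual, predicted))


def pearson_r(actual, predicted):
    """Pearson correlation (assumed meaning of 'r'); undefined for constant series."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def average_ensemble(predictions):
    """Unweighted mean of the model forecasts at each time step.

    predictions: list of per-model forecast lists, all of equal length.
    """
    return [sum(step) / len(step) for step in zip(*predictions)]


def weighted_average_ensemble(predictions, train_errors):
    """Weighted mean of the model forecasts at each time step.

    Assumption: weights are proportional to 1 / training error, so models
    with lower training error contribute more. Errors must be positive.
    """
    inverse = [1.0 / e for e in train_errors]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    return [sum(w * p for w, p in zip(weights, step)) for step in zip(*predictions)]


if __name__ == "__main__":
    # Hypothetical daily counts and forecasts from two made-up models.
    actual = [100, 200, 300]
    model_a = [110, 190, 330]
    model_b = [90, 220, 270]
    combined = average_ensemble([model_a, model_b])
    print("WMAPE model A:", wmape(actual, model_a))
    print("WMAPE model B:", wmape(actual, model_b))
    print("WMAPE average ensemble:", wmape(actual, combined))
```

WMAPE divides the summed absolute error by the summed actual values, so days with very small counts do not dominate the score the way they can with MAPE.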

RESULTS

Booster shot rate + Omicron variant BA.5 rate was the most commonly selected combination of covariates. For raw data evaluated using the WMAPE, the individual models achieved the following: the generalised additive model (GAM) reached 0.244 for the daily number of confirmed cases, the time series Poisson model reached 0.172 for the daily number of confirmed deaths, and both the autoregressive integrated moving average (ARIMA) model and the time series Poisson model reached 0.022 for the daily number of intensive care unit (ICU) patients. For smoothed data, the Holt-Winters model achieved 0.058 for daily confirmed cases, while ARIMA attained 0.058 for the daily number of confirmed deaths and 0.013 for the daily number of ICU patients. Among ensemble models, the stacking ensemble based on a support vector machine (SVM) achieved error values of 0.235 for the daily number of confirmed cases, 0.118 for the daily number of deaths, and 0.019 for the daily number of ICU patients on raw data. For smoothed data, the average ensemble and WAE achieved 0.060 for the daily number of confirmed cases and 0.013 for daily ICU patients. The ensemble models also generalised well when applied to data from the USA.

For raw data, GAM (0.244) predicted daily confirmed cases best, time series Poisson (0.172) predicted daily confirmed deaths best, and both ARIMA and time series Poisson (0.022) predicted daily ICU patients best, based on WMAPE. For smoothed data, time series Poisson predicted daily confirmed cases (0.065) best, while ARIMA best predicted daily confirmed deaths (0.058) and ICU patients (0.013). Among ensemble models, the stacking ensemble using SVM was the best model for predicting daily confirmed cases (0.228), deaths (0.11), and ICU patients (0.02). With smoothed data, the average ensemble and WAE were the best models for predicting daily confirmed cases (0.058) and ICU patients (0.011). The predictive performance of the ensemble models also generalised to another country when assessed on the USA data.

CONCLUSIONS

No single model performed consistently. While the ensemble models did not always provide the best predictions, a comparison of first-best and second-best models showed that they performed considerably better than the single models. When an ensemble model was not the best-performing model, its performance was never far from that of the best single model: the mean and variance of the error measures show that the ensemble models provided stable predictions, with little variation in performance compared with the single models. These results can be used to inform policymaking during future pandemics.
