Feasibility of using machine learning to predict COVID-19 cases.

Chen Shan, Ding Yuanzhao

Science of Learning in Education Centre, National Institute of Education, Nanyang Technological University, 637616, Singapore.

School of Geography and the Environment, University of Oxford, South Parks Road, Oxford OX1 3QY, United Kingdom.

Int J Med Inform. 2025 Apr;196:105786. doi: 10.1016/j.ijmedinf.2025.105786. Epub 2025 Jan 23.

BACKGROUND

Coronavirus Disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, emerged as a global health crisis in 2019, resulting in widespread morbidity and mortality. A persistent challenge during the pandemic has been the accuracy of reported epidemic data, particularly in underdeveloped regions with limited access to COVID-19 test kits and healthcare infrastructure. In the post-COVID era, this issue remains crucial. This study introduces a novel approach by leveraging machine learning to predict cases and uncover critical discrepancies, focusing on African regions where reported daily cases per million often deviate significantly from machine learning-predicted cases. These findings strongly suggest widespread underreporting of cases. By identifying these gaps, our research provides valuable insights for future pandemic preparedness, improving epidemic forecasting accuracy, data reliability, and response strategies to mitigate the impact of emerging global health crises.

OBJECTIVE

This study aims to assess the reliability of reported COVID-19 incidence data globally, particularly in underdeveloped regions, and to identify discrepancies between reported and predicted cases using machine learning methodologies.

METHODS

Data collected from March 2020 to September 2022 included demographic, healthcare, economic, and testing-related parameters. Several machine learning models (neural networks, decision trees, random forests, cross-validation, support vector machines, and logistic regression) were employed to predict COVID-19 incidence rates. Model performance was evaluated using testing accuracy metrics.

RESULTS

Testing accuracy rates for the models were as follows: neural networks (65.50%), decision trees (63.76%), random forests (63.33%), cross-validation (55.92%), support vector machines (63.62%), and logistic regression (64.70%). Comparative analysis using neural networks revealed significant discrepancies between reported and predicted COVID-19 cases, particularly in numerous African countries. These results suggest a considerable volume of underreported cases in regions with limited testing capabilities.
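
The comparison of reported and predicted cases can be illustrated with a short sketch. The abstract does not state how a discrepancy was quantified, so the ratio rule below, the `min_ratio` threshold, the function name, and the placeholder regions and numbers are all assumptions made for illustration; none of the values are study data.

```python
def flag_underreporting(reported, predicted, min_ratio=2.0):
    """Return regions whose predicted value is at least min_ratio times the reported one.

    Both arguments map a region name to daily cases per million.
    The ratio rule and the default threshold are illustrative assumptions.
    """
    flagged = {}
    for region, rep in reported.items():
        pred = predicted[region]
        if rep > 0:
            ratio = pred / rep
        else:
            # With zero reported cases, any positive prediction is an unbounded gap.
            ratio = float("inf") if pred > 0 else 1.0
        if ratio >= min_ratio:
            flagged[region] = ratio
    return flagged


# Placeholder values only; not taken from the study.
reported = {"Region A": 10.0, "Region B": 50.0, "Region C": 0.0}
predicted = {"Region A": 40.0, "Region B": 55.0, "Region C": 5.0}

for region, ratio in flag_underreporting(reported, predicted).items():
    print(region, ratio)
```

A large ratio marks a region as a candidate for underreporting rather than proving it, since with testing accuracy near 65% the predictions themselves carry substantial error.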

CONCLUSION

This study highlights the critical need for improved data accuracy and reporting mechanisms, especially in resource-constrained regions. International organizations and policymakers must implement strategies to enhance testing capacity and data reliability to better understand and manage the global impact of the pandemic. Our work emphasizes the potential of machine learning to identify gaps in epidemic reporting, facilitating evidence-based interventions.
