Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction.

Author information

Bjerre Lise M, Peixoto Cayden, Alkurd Rawan, Talarico Robert, Abielmona Rami

Affiliations

Institut du Savoir Montfort, 713, chemin Montréal, Ottawa, Ontario K1K 0T2, Canada.

University of Ottawa, Faculty of Medicine, Department of Family Medicine, 201-600 Peter-Morand Crescent, Ottawa ON, K1G 5Z3, Canada.

Publication information

Glob Epidemiol. 2024 Oct 4;8:100168. doi: 10.1016/j.gloepi.2024.100168. eCollection 2024 Dec.

Abstract

BACKGROUND

Research comparing artificial intelligence and machine learning (AI/ML) methods with classical statistical methods applied to large population health databases is limited.

OBJECTIVES

This retrospective cohort study aimed to compare the predictive performance of AI/ML algorithms against conventional multivariate logistic regression models using linked health administrative data.

METHODS

Using Ontario's population health databases, we created a cohort of residents of the city of Ottawa, Ontario, who underwent a PCR test for COVID-19 between March 10, 2020, and May 13, 2021. Using demographic, socio-economic and health data (including COVID-19 PCR test results and available symptom data), we developed predictive models for COVID-19 case identification using the following approaches: classical multivariate logistic regression (LR); deep neural network (DNN); random forest (RF); and gradient boosting trees (GBT). Model performance was compared using swarm plots of the area under the curve (AUC) across 10-fold cross-validation.
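The evaluation scheme described above (per-fold AUC under 10-fold cross-validation, summarized as mean ± SD) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the AUC is assumed to be the area under the ROC curve, the folds are stratified by outcome (the abstract does not say whether they were), and `fit_mean_difference` is a hypothetical one-feature scorer standing in for LR, DNN, RF or GBT. All function names are illustrative.

```python
import random
import statistics


def auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity, with tie handling."""
    pairs = sorted(zip(scores, labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_folds(labels, k=10, seed=0):
    """Assign each row to one of k folds, keeping the positive rate similar."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k
    return fold_of


def fit_mean_difference(xs, ys):
    """Hypothetical toy model: score by the feature, oriented toward positives."""
    mean_pos = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    mean_neg = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    sign = 1.0 if mean_pos >= mean_neg else -1.0
    return lambda x: sign * x


def cross_validated_auc(xs, ys, fit, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, return one AUC per fold."""
    fold_of = stratified_folds(ys, k, seed)
    fold_aucs = []
    for fold in range(k):
        train = [i for i in range(len(ys)) if fold_of[i] != fold]
        test = [i for i in range(len(ys)) if fold_of[i] == fold]
        scorer = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_aucs.append(auc([ys[i] for i in test], [scorer(xs[i]) for i in test]))
    return fold_aucs


# Synthetic data only: one noisy feature, roughly 10% positives.
rng = random.Random(42)
ys = [1 if rng.random() < 0.1 else 0 for _ in range(2000)]
xs = [rng.gauss(1.0 if y else 0.0, 1.0) for y in ys]

fold_aucs = cross_validated_auc(xs, ys, fit_mean_difference, k=10)
print(f"AUC = {statistics.mean(fold_aucs):.3f} ± {statistics.stdev(fold_aucs):.3f}")
```

Comparing models on the same fold assignment (same `seed`) yields paired per-fold AUCs, which is what makes pairwise comparisons between approaches meaningful.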

RESULTS

The cohort consisted of 351,248 Ottawa residents tested for COVID-19 during the study period, among whom a total of 883,879 unique COVID-19 tests were performed (2.6% positive test results). Inclusion of COVID-19 symptom data in the analysis improved model performance and variable predictive value across all tested models (P < 0.0001), with the 10-fold cross-validation AUC increasing to near or over 0.7 in all models when symptom data were included. In pairwise comparisons, the GBT method had the highest predictive ability (AUC = 0.796 ± 0.017), significantly outperforming multivariate logistic regression and the other AI/ML approaches.

CONCLUSIONS

In a moderately sized dataset with a reasonable number of features, conventional multivariate regression-based models achieved better predictive accuracy than some machine learning algorithms and worse than others. However, whenever possible, the AI/ML GBT approach should be considered.
