Predicting Measles Outbreaks in the United States: Evaluation of Machine Learning Approaches.

Authors

Ru Boshu, Kujawski Stephanie, Lee Afanador Nelson, Baumgartner Richard, Pawaskar Manjiri, Das Amar

Affiliations

Merck & Co, Inc, West Point, PA, United States.

Merck & Co, Inc, Rahway, NJ, United States.

Publication information

JMIR Form Res. 2023 Apr 4;7:e42832. doi: 10.2196/42832.

Abstract

BACKGROUND

Measles, a highly contagious viral infection, is resurging in the United States, driven by international importation and declining domestic vaccination coverage. Despite this resurgence, measles outbreaks are still rare events that are difficult to predict. Improved methods to predict outbreaks at the county level would facilitate the optimal allocation of public health resources.

OBJECTIVE

We aimed to validate and compare extreme gradient boosting (XGBoost) and logistic regression, 2 supervised learning approaches, to predict the US counties most likely to experience measles cases. We also aimed to assess the performance of hybrid versions of these models that incorporated additional predictors generated by 2 clustering algorithms, hierarchical density-based spatial clustering of applications with noise (HDBSCAN) and unsupervised random forest (uRF).

METHODS

We constructed a supervised machine learning model based on XGBoost and unsupervised models based on HDBSCAN and uRF. The unsupervised models were used to investigate clustering patterns among counties with measles outbreaks; these clustering data were also incorporated into hybrid XGBoost models as additional input variables. The machine learning models were then compared to logistic regression models with and without input from the unsupervised models.
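
The hybrid idea described above, feeding cluster output into a supervised model as extra input variables, can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the abstract does not say how cluster membership was encoded, so the one-hot encoding, the function name `add_cluster_features`, and the toy county values are all assumptions. The label -1 follows the common HDBSCAN convention for noise points.

```python
from typing import List, Sequence


def add_cluster_features(
    features: Sequence[Sequence[float]],
    cluster_labels: Sequence[int],
) -> List[List[float]]:
    """Append one-hot cluster-membership columns to each feature row.

    Assumption: cluster labels were already produced by a clustering
    algorithm; -1 marks noise, as HDBSCAN implementations commonly do.
    """
    if len(features) != len(cluster_labels):
        raise ValueError("features and cluster_labels must have equal length")

    # One extra column per distinct label (noise gets its own column).
    cluster_ids = sorted(set(cluster_labels))
    augmented = []
    for row, label in zip(features, cluster_labels):
        one_hot = [1.0 if label == cid else 0.0 for cid in cluster_ids]
        augmented.append(list(row) + one_hot)
    return augmented


# Hypothetical county-level predictors (illustrative values only).
county_features = [[0.91, 12.0], [0.78, 3.5], [0.95, 0.4], [0.80, 4.1]]
# Hypothetical cluster assignments; -1 = noise.
cluster_labels = [0, 1, -1, 1]

hybrid_inputs = add_cluster_features(county_features, cluster_labels)
for row in hybrid_inputs:
    print(row)
```

The augmented rows could then be passed to any supervised learner in place of the original predictors, which is what makes the comparison between plain and hybrid models possible.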

RESULTS

Both HDBSCAN and uRF identified clusters that included a high percentage of counties with measles outbreaks. XGBoost and XGBoost hybrid models outperformed logistic regression and logistic regression hybrid models, with area under the receiver operating characteristic curve values of 0.920-0.926 versus 0.900-0.908, area under the precision-recall curve values of 0.522-0.532 versus 0.485-0.513, and F scores of 0.595-0.601 versus 0.385-0.426. Logistic regression and logistic regression hybrid models had higher sensitivity than XGBoost and XGBoost hybrid models (0.837-0.857 vs 0.704-0.735) but lower positive predictive value (0.122-0.141 vs 0.340-0.367) and lower specificity (0.793-0.821 vs 0.952-0.958). The hybrid versions of the logistic regression and XGBoost models had slightly higher areas under the precision-recall curve, specificity, and positive predictive values than the respective models that did not include any unsupervised features.
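
The abstract reports "F scores" without naming the weighting. As a rough arithmetic check, the sketch below computes F-beta bounds from the reported positive predictive value and sensitivity ranges. The helper names are hypothetical, and any conclusion about which beta was used is an inference from these numbers, not something the abstract states.

```python
from typing import Tuple


def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta: weighted harmonic mean of precision (PPV) and recall (sensitivity)."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


def f_beta_bounds(
    ppv_range: Tuple[float, float],
    sens_range: Tuple[float, float],
    beta: float,
) -> Tuple[float, float]:
    """F-beta increases with both inputs, so the corner values
    (min, min) and (max, max) bound every possible pairing."""
    low = f_beta(ppv_range[0], sens_range[0], beta)
    high = f_beta(ppv_range[1], sens_range[1], beta)
    return low, high


# Ranges as reported in the abstract (plain and hybrid models pooled).
reported = {
    "XGBoost": {"ppv": (0.340, 0.367), "sens": (0.704, 0.735), "f": (0.595, 0.601)},
    "logistic regression": {"ppv": (0.122, 0.141), "sens": (0.837, 0.857), "f": (0.385, 0.426)},
}

for name, r in reported.items():
    for beta in (1, 2):
        low, high = f_beta_bounds(r["ppv"], r["sens"], beta)
        print(
            f"{name}: F{beta} between {low:.3f} and {high:.3f}; "
            f"reported F score {r['f'][0]}-{r['f'][1]}"
        )
```

With these inputs, the F1 bounds fall well below the reported F scores for both model families, while the F2 bounds (roughly 0.58 to 0.61 for XGBoost and 0.39 to 0.43 for logistic regression) bracket them up to rounding. That suggests a recall-weighted score, which would fit a setting where missing an outbreak county is costlier than a false alarm.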

CONCLUSIONS

XGBoost provided more accurate predictions of measles cases at the county level compared with logistic regression. The threshold of prediction in this model can be adjusted to align with each county's resources, priorities, and risk for measles. While clustering pattern data from unsupervised machine learning approaches improved some aspects of model performance in this imbalanced data set, the optimal approach for the integration of such approaches with supervised machine learning models requires further investigation.
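
To make the point about adjusting the prediction threshold concrete, the following sketch recomputes sensitivity, specificity, and positive predictive value at two cutoffs. The probabilities and labels are invented toy values, and `metrics_at_threshold` is a hypothetical helper, not part of the study.

```python
from typing import Dict, Sequence


def metrics_at_threshold(
    probabilities: Sequence[float],
    labels: Sequence[int],
    threshold: float,
) -> Dict[str, float]:
    """Classify a county as positive when its predicted probability is at or
    above the threshold, then summarize the resulting confusion matrix."""
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
    }


# Hypothetical predicted probabilities for 10 counties (toy data).
probabilities = [0.92, 0.81, 0.40, 0.35, 0.30, 0.22, 0.15, 0.10, 0.05, 0.02]
# 1 = county had measles cases, 0 = it did not (toy data).
labels = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]

for threshold in (0.5, 0.2):
    print(threshold, metrics_at_threshold(probabilities, labels, threshold))
```

In this toy data, lowering the cutoff from 0.5 to 0.2 catches every positive county but flags more counties that had no cases. A county with limited resources might prefer the higher cutoff, while one prioritizing early detection might accept the extra false alarms.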

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4220/10131820/ea2733756e5a/formative_v7i1e42832_fig1.jpg
