Ru Boshu, Kujawski Stephanie, Lee Afanador Nelson, Baumgartner Richard, Pawaskar Manjiri, Das Amar
Merck & Co, Inc, West Point, PA, United States.
Merck & Co, Inc, Rahway, NJ, United States.
JMIR Form Res. 2023 Apr 4;7:e42832. doi: 10.2196/42832.
Measles, a highly contagious viral infection, is resurging in the United States, driven by international importation and declining domestic vaccination coverage. Despite this resurgence, measles outbreaks are still rare events that are difficult to predict. Improved methods to predict outbreaks at the county level would facilitate the optimal allocation of public health resources.
We aimed to validate and compare extreme gradient boosting (XGBoost) and logistic regression, 2 supervised learning approaches, to predict the US counties most likely to experience measles cases. We also aimed to assess the performance of hybrid versions of these models that incorporated additional predictors generated by 2 clustering algorithms, hierarchical density-based spatial clustering of applications with noise (HDBSCAN) and unsupervised random forest (uRF).
We constructed a supervised machine learning model based on XGBoost and unsupervised models based on HDBSCAN and uRF. The unsupervised models were used to investigate clustering patterns among counties with measles outbreaks; these clustering data were also incorporated into hybrid XGBoost models as additional input variables. The machine learning models were then compared to logistic regression models with and without input from the unsupervised models.
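The hybrid step described above, feeding cluster assignments to a supervised model as extra inputs, can be illustrated with a minimal sketch. The abstract does not say how the cluster data were encoded, so the one-hot encoding, the function name `add_cluster_features`, and the toy values below are all assumptions made for illustration. The label -1 follows the common HDBSCAN convention of marking points not assigned to any cluster.

```python
from typing import List, Sequence, Tuple


def add_cluster_features(
    features: Sequence[Sequence[float]],
    cluster_labels: Sequence[int],
) -> Tuple[List[List[float]], List[int]]:
    """Append one-hot cluster indicators to each county's feature row.

    Hypothetical helper: the encoding used in the study is not stated.
    """
    if len(features) != len(cluster_labels):
        raise ValueError("one cluster label is required per county")
    # Sorted distinct labels define the column order of the indicators.
    categories = sorted(set(cluster_labels))
    augmented = []
    for row, label in zip(features, cluster_labels):
        indicators = [1.0 if label == c else 0.0 for c in categories]
        augmented.append(list(row) + indicators)
    return augmented, categories


# Hypothetical toy data: 4 counties, 2 baseline predictors each.
county_features = [[0.91, 12.0], [0.95, 3.0], [0.88, 20.0], [0.97, 1.0]]
# Hypothetical cluster assignments; -1 marks a county left unclustered.
labels = [0, 1, 0, -1]

hybrid_features, categories = add_cluster_features(county_features, labels)
print(categories)          # column order of the appended indicators
print(hybrid_features[0])  # baseline predictors followed by indicators
```

The augmented rows could then be passed to any supervised classifier in place of the baseline rows, which is what allows a like-for-like comparison of models with and without unsupervised input.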
Both HDBSCAN and uRF identified clusters that included a high percentage of counties with measles outbreaks. XGBoost and XGBoost hybrid models outperformed logistic regression and logistic regression hybrid models on area under the receiver operating characteristic curve (0.920-0.926 vs 0.900-0.908), area under the precision-recall curve (0.522-0.532 vs 0.485-0.513), and F score (0.595-0.601 vs 0.385-0.426). Logistic regression and logistic regression hybrid models had higher sensitivity than XGBoost and XGBoost hybrid models (0.837-0.857 vs 0.704-0.735) but lower positive predictive value (0.122-0.141 vs 0.340-0.367) and specificity (0.793-0.821 vs 0.952-0.958). The hybrid versions of the logistic regression and XGBoost models had slightly higher area under the precision-recall curve, specificity, and positive predictive value than the respective models without unsupervised features.
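The F score combines positive predictive value (precision) and sensitivity (recall), and the relationship can be checked with a short sketch. The abstract does not state which weighting (beta) was used. The range midpoints below are hypothetical operating points chosen only for illustration, since the abstract does not say which sensitivity value goes with which positive predictive value.

```python
def f_beta(ppv: float, sensitivity: float, beta: float) -> float:
    """F-beta score: weighted harmonic mean of PPV and sensitivity.

    beta > 1 weights sensitivity more heavily than PPV.
    """
    if ppv == 0.0 and sensitivity == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * ppv * sensitivity / (b2 * ppv + sensitivity)


# Hypothetical operating points: midpoints of the reported ranges.
xgb_ppv, xgb_sens = 0.3535, 0.7195
lr_ppv, lr_sens = 0.1315, 0.8470

for beta in (1.0, 2.0):
    print(
        f"beta={beta:.0f}: "
        f"XGBoost={f_beta(xgb_ppv, xgb_sens, beta):.3f}, "
        f"logistic regression={f_beta(lr_ppv, lr_sens, beta):.3f}"
    )
```

At these midpoints, beta = 2 gives values that fall inside the reported F score ranges, whereas beta = 1 does not. This suggests, but does not confirm, that the reported scores weight sensitivity more heavily than positive predictive value.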
XGBoost predicted measles cases at the county level more accurately than logistic regression. The prediction threshold of this model can be adjusted to align with each county's resources, priorities, and risk for measles. While clustering pattern data from unsupervised machine learning approaches improved some aspects of model performance in this imbalanced data set, the optimal approach for integrating such data with supervised machine learning models requires further investigation.
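Adjusting the prediction threshold trades sensitivity against positive predictive value and specificity. The sketch below shows one way this could be done: choose the highest threshold that still reaches a target sensitivity. The selection rule, the function names, and the toy scores are assumptions for illustration, not the procedure used in the study.

```python
from typing import Dict, Optional, Sequence


def metrics_at(
    probs: Sequence[float], labels: Sequence[int], threshold: float
) -> Dict[str, float]:
    """Sensitivity, specificity, and PPV when flagging probs >= threshold."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        flagged = p >= threshold
        if flagged and y:
            tp += 1
        elif flagged:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
    }


def highest_threshold_for_sensitivity(
    probs: Sequence[float], labels: Sequence[int], target: float
) -> Optional[float]:
    """Return the largest observed score that meets the target sensitivity."""
    for t in sorted(set(probs), reverse=True):
        if metrics_at(probs, labels, t)["sensitivity"] >= target:
            return t
    return None  # target cannot be met (e.g. no positive labels)


# Hypothetical predicted risks and outcomes (1 = county had measles cases).
probs = [0.92, 0.80, 0.65, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1, 0, 1, 0, 1, 0, 0, 0]

for target in (0.66, 1.0):
    t = highest_threshold_for_sensitivity(probs, labels, target)
    print(target, t, metrics_at(probs, labels, t))
```

In this toy example, raising the sensitivity target lowers the chosen threshold and flags more counties, which lowers positive predictive value and specificity. A county with limited resources might prefer the stricter threshold, and one with higher measles risk the more sensitive one.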