Division of Infectious Diseases, Department of Medicine, Emory University School of Medicine, Atlanta, Georgia, USA.
Epidemiology Division, Georgia Department of Public Health, Atlanta, Georgia, USA.
Clin Infect Dis. 2024 Sep 26;79(3):717-726. doi: 10.1093/cid/ciae100.
Advancements in machine learning (ML) have improved the accuracy of models that predict human immunodeficiency virus (HIV) incidence. These models have used electronic medical records and registries. We aim to broaden the application of these tools by using deidentified public health datasets for notifiable sexually transmitted infections (STIs) from a southern US county known for high HIV incidence. The goal is to assess the feasibility and accuracy of ML in predicting HIV incidence, which could inform and enhance public health interventions.
We analyzed 2 deidentified public health datasets of notifiable STIs covering January 2010 to December 2021. We processed the data and extracted features covering sociodemographic factors, STI cases, and social vulnerability index (SVI) metrics. Various ML models were trained to predict HIV incidence and evaluated using metrics such as accuracy, precision, recall, and F1 score.
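The evaluation metrics named above can all be derived from a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' code; the `ConfusionCounts` type, the `classification_metrics` function, and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a binary classifier (positive = incident HIV diagnosis)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return accuracy, precision, recall, and F1; 0.0 when a ratio is undefined."""
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, not taken from the study.
example = ConfusionCounts(tp=80, fp=20, fn=20, tn=880)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision and recall are reported alongside accuracy because they describe performance on the positive class specifically, which matters when positives are rare.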
We included 85 224 individuals; 2027 (2.37%) were newly diagnosed with HIV during the study period. The ML models demonstrated high performance in predicting HIV incidence among males and females. Influential features for males included age at STI diagnosis, previous STI information, provider type, and SVI. For females, predictive features included age, ethnicity, previous STI information, overall SVI, and race.
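As a point of reference for interpreting accuracy at this prevalence, the short calculation below uses the cohort counts reported above (85 224 individuals, 2027 new HIV diagnoses) to show what a trivial classifier that predicts "no HIV" for everyone would score. This baseline is an illustrative assumption and is not an analysis reported by the authors; the function name `all_negative_baseline` is hypothetical.

```python
COHORT_SIZE = 85_224        # individuals included, as reported above
NEW_HIV_DIAGNOSES = 2_027   # newly diagnosed during the study period, as reported above


def all_negative_baseline(n_total: int, n_positive: int) -> tuple[float, float]:
    """Accuracy and recall of a classifier that labels every individual negative."""
    true_negatives = n_total - n_positive
    accuracy = true_negatives / n_total
    recall = 0.0  # no positive case is ever identified
    return accuracy, recall


accuracy, recall = all_negative_baseline(COHORT_SIZE, NEW_HIV_DIAGNOSES)
print(f"baseline accuracy: {accuracy:.4f}, baseline recall: {recall:.1f}")
```

Such a baseline reaches roughly 97.6% accuracy while identifying no incident cases, which is one reason precision, recall, and F1 are informative alongside accuracy for an outcome this rare.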
The high accuracy of our ML models in predicting HIV incidence highlights the potential of public health datasets to inform interventions such as tailored HIV testing and prevention. While these findings are promising, further research is needed to translate these models into practical public health applications.