School of Public Health, Zhejiang University School of Medicine, Hangzhou, China.
School of Software Technology, Zhejiang University, Ningbo, China.
Front Public Health. 2022 Aug 25;10:967681. doi: 10.3389/fpubh.2022.967681. eCollection 2022.
The continuing rise in HIV incidence among men who have sex with men (MSM) in China, together with the low rate of HIV testing in this population, demonstrates a need for innovative strategies to improve the implementation of HIV prevention. Machine learning algorithms are increasingly used to predict disease diagnoses. We aimed to develop and validate machine learning models that predict HIV infection among MSM and can identify individuals at increased risk of HIV acquisition for transmission-reduction interventions.
We extracted data from MSM sentinel surveillance in Zhejiang province from 2018 to 2020. Univariate logistic regression was used to select significant variables in the 2018-2019 data (P < 0.05). After data processing and feature selection, we divided the model development data into two groups by stratified random sampling: training data (70%) and testing data (30%). The Synthetic Minority Oversampling Technique (SMOTE) was applied to address class imbalance. Model performance was evaluated by accuracy, precision, recall, F-measure, and the area under the receiver operating characteristic curve (AUC). We then compared three commonly used machine learning algorithms, decision tree (DT), support vector machine (SVM), and random forest (RF), with logistic regression (LR). Finally, the four models were validated prospectively with 2020 data from Zhejiang province.
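The splitting and oversampling steps described above can be illustrated with a minimal sketch. It is not the authors' code: the toy data, the function names `stratified_split` and `smote`, and the choice to oversample only the training portion are all assumptions made for illustration. In practice, libraries such as scikit-learn and imbalanced-learn provide tested implementations of both steps.

```python
import math
import random


def stratified_split(labels, test_frac=0.3, seed=0):
    """Return (train_idx, test_idx), drawing test_frac of each class for testing."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: 100 records, 2 numeric features, 10 positives.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(100)]
y = [1] * 10 + [0] * 90

train_idx, test_idx = stratified_split(y, test_frac=0.3)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]

# Assumption: oversample the training data only, so the test set keeps
# the original class ratio.
minority = [x for x, label in zip(X_train, y_train) if label == 1]
n_new = y_train.count(0) - y_train.count(1)
X_bal = X_train + smote(minority, n_new)
y_bal = y_train + [1] * n_new
```

Because each synthetic record lies on a line segment between two real minority records, SMOTE adds no values outside the observed range of the minority class; it only fills in the space between existing cases.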
A total of 6,346 MSM were included in the model development data, 372 of whom were diagnosed with HIV. Feature selection yielded 12 variables as model predictors. In SMOTE-processed data, DT, SVM, and RF all improved classification performance over LR, with AUCs of 0.778 (LR), 0.856 (DT), 0.887 (SVM), and 0.942 (RF). RF was the best-performing algorithm (accuracy = 0.871, precision = 0.960, recall = 0.775, F-measure = 0.858, and AUC = 0.942). The RF model also performed well on prospective validation (AUC = 0.846).
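As a consistency check on the reported RF metrics, the short sketch below recomputes the F-measure from the stated precision and recall. It assumes the F-measure is the usual F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function names and the confusion-matrix counts in the last example are hypothetical.

```python
def f_measure(precision, recall):
    """F1 score: harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall, f_measure(precision, recall)


# Reported RF values: precision = 0.960, recall = 0.775.
rf_f = f_measure(0.960, 0.775)
print(round(rf_f, 3))  # 0.858, matching the reported F-measure

# Hypothetical counts, only to show how the four metrics relate.
acc, prec, rec, f1 = classification_metrics(tp=8, fp=2, fn=2, tn=88)
```

The recomputed value agrees with the reported 0.858, which supports reading the F-measure as F1.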
The machine learning models performed substantially better than the conventional LR model, and RF should be considered for tools that predict HIV infection in Chinese MSM. Further studies are needed to optimize and promote these algorithms and to evaluate their impact on HIV prevention among MSM.