Soha Krittaka, Phuthomdee Sadiporn, Srichai Thanapat, Kittiratanawasin Lanchakorn, Han Win Min, Teeraananchai Sirinya
Master of Biomedical Data Science program, Kasetsart University, Bangkok, Thailand.
Department of Statistics, Kasetsart University, Bangkok, Thailand.
BMJ Health Care Inform. 2025 May 15;32(1):e101189. doi: 10.1136/bmjhci-2024-101189.
This study aimed to develop machine learning (ML) models to predict HIV status and to assess the factors associated with HIV infection among young men who have sex with men (MSM) under the Universal Health Coverage (UHC) programme in Thailand.
Young MSM aged 15-24 years who underwent HIV testing through the UHC programme from 2015 to 2022 were included. Data were divided into training (70%) and testing (30%) sets, with the Synthetic Minority Oversampling Technique (SMOTE) applied to address data set imbalance. ML models, including logistic regression, k-nearest neighbour (KNN), random forest, extreme gradient boosting (XGB) and AdaBoost, were used to predict HIV infection.
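The split-then-oversample step described above can be sketched in plain Python. This is a simplified illustration, not the authors' pipeline: the toy data, the two numeric features, the stratified split, `k=5`, and the choice to oversample the training set only are all assumptions made for the example, and `stratified_split` and `smote` are hypothetical helper names.

```python
import math
import random

def stratified_split(labels, test_frac=0.30, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Toy data (assumption): 1000 rows, 11% positive, two made-up features.
rng = random.Random(42)
y = [1] * 110 + [0] * 890
X = [[rng.gauss(1.0 if label else 0.0, 1.0), rng.gauss(0.0, 1.0)] for label in y]

# 70/30 split first, so the test set keeps the original class ratio.
train_idx, test_idx = stratified_split(y, test_frac=0.30)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]

# Oversample the minority class in the training set until classes match.
minority = [x for x, label in zip(X_train, y_train) if label == 1]
n_majority = len(y_train) - len(minority)
synthetic = smote(minority, n_new=n_majority - len(minority))

X_balanced = X_train + synthetic
y_balanced = y_train + [1] * len(synthetic)
```

Leaving the test set untouched means the metrics are measured at the original class ratio. In practice a maintained library implementation would normally be preferred over a hand-written one like this.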
Among 146 813 young MSM, 11% were diagnosed with HIV. While KNN initially outperformed the other ML models, all models trained on the original data set had low sensitivity because of class imbalance. After applying SMOTE, the XGB model showed the best performance, with an accuracy of 0.72, sensitivity of 0.73, specificity of 0.72 and an area under the curve of 0.72. The top predictors of HIV infection were the year of HIV testing (68%), age (55%) and targeted HIV testing (54%).
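To make the reported metrics concrete, the sketch below computes accuracy, sensitivity and specificity from confusion-matrix counts. The counts are hypothetical, chosen only to land near the reported rates on a toy test set of 1000 with 10% positives; they are not the study's data, and `classification_metrics` is an illustrative helper name.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (recall on positives) and specificity
    (recall on negatives) from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 100 positives, 900 negatives.
metrics = classification_metrics(tp=73, fn=27, tn=648, fp=252)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.72
# sensitivity: 0.73
# specificity: 0.72

# A classifier that labels everyone negative on the same toy set:
baseline = classification_metrics(tp=0, fn=100, tn=900, fp=0)
print(baseline)  # accuracy 0.9, sensitivity 0.0, specificity 1.0
```

The second call shows why accuracy alone is misleading on imbalanced data: labelling every case negative scores 0.90 accuracy while detecting no infections, which is why sensitivity is the metric to watch here.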
This study demonstrates the potential of ML models, particularly XGB, in predicting HIV infection among young MSM in Thailand under the UHC programme. The application of SMOTE improved model sensitivity, addressing data imbalance and enhancing predictive accuracy.
ML models have the potential to enhance HIV risk assessment and inform targeted prevention strategies for high-risk populations.