Ji Xiaopeng, Tang Zhaohui, Osborne Sonya R, Van Nguyen Thi Phuoc, Mullens Amy B, Dean Judith A, Li Yan
School of Mathematics, Physics and Computing, Centre for Health Research, University of Southern Queensland, Toowoomba, QLD, Australia.
School of Nursing and Midwifery, Centre for Health Research, Institute for Resilient Regions, University of Southern Queensland, Ipswich, QLD, Australia.
Front Public Health. 2025 Jan 3;12:1511689. doi: 10.3389/fpubh.2024.1511689. eCollection 2024.
A novel automatic framework is proposed for predicting the risk of sexually transmissible infections (STIs) and HIV globally. Four machine learning methods, namely Gradient Boosting Machine (GBM), Random Forest (RF), XGBoost, and an ensemble of GBM, RF, and XGBoost, are applied to and evaluated on data from the Demographic and Health Surveys Program (DHSP), with thirteen features ultimately selected as the most predictive. Classification and generalization experiments are conducted to measure the accuracy, F1-score, precision, and area under the curve (AUC) of the four algorithms. Two solutions for imbalanced data are also applied to reduce bias and improve classification performance. The experimental results show that the Random Forest algorithm yields the best results for HIV prediction, with a highest accuracy of 0.99 and a highest AUC of 0.99. STI prediction performs best when the Synthetic Minority Oversampling Technique (SMOTE) is applied (accuracy = 0.99, AUC = 0.99), outperforming the state-of-the-art baselines. Two factors that may affect classification and generalization performance are further analyzed. This automatic classification model helps to make HIV testing more convenient and less costly.
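The oversampling idea behind SMOTE, one of the imbalanced-data techniques named in the abstract, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `smote`, its parameters, and the toy data are assumptions made for illustration, and a real study would more likely rely on an established library. The sketch creates synthetic minority-class rows by interpolating between a minority sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples
    and one of their k nearest minority-class neighbours (illustrative only)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment
        # from the base sample to the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: two features, 3 minority rows versus 9 majority rows.
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
majority_count = 9
new_points = smote(minority, n_new=majority_count - len(minority), k=2)
balanced_minority = minority + new_points
print(len(balanced_minority))
```

Because each synthetic row is a convex combination of two real minority rows, it always lies between them in feature space. Oversampling of this kind is normally applied to the training split only, so that evaluation metrics such as accuracy and AUC are computed on unmodified data.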