Jaiteh Musa, Phalane Edith, Shiferaw Yegnanew A, Jallow Haruna, Phaswana-Mafuya Refilwe Nancy
South African Medical Research Council/University of Johannesburg Pan African Centre for Epidemics Research Extramural Unit, Faculty of Health Sciences, University of Johannesburg, Johannesburg 2006, South Africa.
Department of Statistics, Faculty of Science, University of Johannesburg, Johannesburg 2006, South Africa.
Trop Med Infect Dis. 2025 Jun 14;10(6):167. doi: 10.3390/tropicalmed10060167.
A substantial share of the South African population does not know their HIV status, which slows epidemic control despite progress in HIV testing. Machine learning (ML) has been effective in identifying individuals at higher risk of HIV infection, for whom testing is strongly recommended. However, there are too few predictive models to inform targeted HIV testing interventions in South Africa. Using supervised ML (SML) algorithms, this study aimed to identify the most consistent predictors of HIV testing in repeated adult population-based surveys in South Africa. The study applied four SML algorithms, namely decision trees, random forest, support vector machines (SVM), and logistic regression, to the five cross-sectional cycles of the South African National HIV Prevalence, Incidence, Behaviour and Communication Survey (SABSSM) datasets. The Human Sciences Research Council (HSRC) conducted the SABSSM surveys and made the datasets available for this study. Each dataset was split into 80% training and 20% testing sets, with 5-fold cross-validation. Random forest outperformed the other models across all five datasets, with the highest accuracy (80.98%), precision (81.51%), F-score (80.30%), area under the curve (AUC) (88.31%), and cross-validation average (79.10%) in the 2002 data. It achieved the highest classification performance in every survey year, especially in the 2017 survey. SVM had high recall (89.12% in 2005, 86.28% in 2008) but lower precision, leading to a suboptimal F-score in the initial analysis. We applied a soft margin to the SVM to improve its classification robustness and generalization, but accuracy and precision remained low in most surveys, increasing the chance of misclassifying individuals who tested for HIV.
Logistic regression performed well (accuracy 72.75%, precision 73.64%, and AUC 81.41% in 2002; F-score 73.83% in 2017), but its performance was somewhat lower than that of random forest. Decision trees showed moderate accuracy (73.80% in 2002) but were prone to overfitting. The most consistent predictors of HIV testing were knowledge of HIV testing sites, being female, being a younger adult, having high socioeconomic status, and being well informed about HIV through digital platforms. Random forest's ability to analyze complex datasets makes it a valuable tool for informing data-driven policy initiatives, such as raising awareness, engaging the media, improving employment outcomes, enhancing accessibility, and targeting high-risk individuals. By addressing the identified gaps in the existing healthcare framework, South Africa can make HIV testing more effective and progress towards the UNAIDS goal of ending AIDS by 2030.
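The evaluation protocol described above (an 80/20 split, 5-fold cross-validation, and accuracy, precision, recall, and F-score) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the sketch assumes cross-validation is run on the 80% training portion (the abstract does not say), it assumes the F-score is the F1 harmonic mean, and the confusion-matrix counts in the usage lines are made up for illustration rather than taken from the SABSSM data.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices 0..n-1 and return (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]  # k interleaved slices of the indices
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Assumption: 1000 hypothetical respondents, split 80/20.
train_idx, test_idx = train_test_split_indices(1000)

# Assumption: the 5 folds are drawn from the training portion only.
fold_sizes = [(len(tr), len(va)) for tr, va in k_fold_indices(train_idx, k=5)]

# Made-up counts showing how high recall with lower precision holds F1 down.
metrics = classification_metrics(tp=90, fp=60, fn=10, tn=40)
print(len(train_idx), len(test_idx), fold_sizes[0], metrics)
```

With these illustrative counts, recall is 0.90 but precision is only 0.60, so F1 comes out at 0.72. This is the same pattern the abstract reports for SVM: strong recall alone does not give a strong F-score.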