Jaiteh Musa, Phalane Edith, Shiferaw Yegnanew A, Phaswana-Mafuya Refilwe Nancy
South African Medical Research Council/University of Johannesburg Pan African Centre for Epidemics Research Extramural Unit, Faculty of Health Sciences, University of Johannesburg, Johannesburg, South Africa.
Department of Statistics, Faculty of Science, University of Johannesburg, Johannesburg, South Africa.
JMIR Res Protoc. 2025 Jan 27;14:e59916. doi: 10.2196/59916.
HIV testing is the cornerstone of HIV prevention and a pivotal step toward the Joint United Nations Programme on HIV/AIDS (UNAIDS) goal of ending AIDS by 2030. Although relevant survey data are available, a research gap remains in the use of machine learning (ML) to analyze and predict HIV testing among adults in South Africa. Further investigation is needed to bridge this gap and inform evidence-based interventions to improve HIV testing.
This study aims to determine consistent predictors of HIV testing by applying supervised ML algorithms to repeated population-based surveys of adults in South Africa.
A retrospective analysis of multiwave cross-sectional survey data will be conducted to determine the predictors of HIV testing among South African adults aged 18 years and older. Supervised ML techniques will be applied across the 5 cycles of the South African National HIV Prevalence, Incidence, Behavior, and Communication Survey (SABSSM), which the Human Sciences Research Council (HSRC) conducted in 2002, 2005, 2008, 2012, and 2017. The available SABSSM datasets will be imported into RStudio (version 4.3.2; Posit Software, PBC) for cleaning and outlier removal. A chi-square test will be used to select important predictors of HIV testing. Each dataset will be split into 80% training and 20% test samples. Logistic regression, support vector machines, random forests, and decision trees will be fitted. K-fold cross-validation will be used to divide the training sample into k folds, with each fold serving in turn as the validation set while the models are trained on the remaining folds. Model performance will be evaluated on the validation set using accuracy, precision, recall, F-score, area under the receiver operating characteristic curve, and the confusion matrix.
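The split, cross-validation, and metric steps described above can be sketched as follows. This is a minimal illustration only: the protocol specifies R/RStudio, whereas this sketch uses plain Python; the function names (`train_test_split`, `k_fold_splits`, `binary_metrics`) are hypothetical; the default of k = 5 is an assumption, since the protocol does not state k; and the convention that each fold serves once as the validation set is the usual reading of k-fold cross-validation rather than something the protocol spells out. The labels in the usage example are made-up toy values, not SABSSM data. The area under the receiver operating characteristic curve is omitted because it requires predicted probabilities rather than class labels.

```python
import random


def train_test_split(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a test fraction (80/20 by default)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


def k_fold_splits(train_indices, k=5):
    """Yield (fit, validation) index lists; each fold is the validation set once."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for a 0/1 outcome
    (for example, 1 = ever tested for HIV, 0 = never tested)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy usage: 100 hypothetical rows, 80/20 split, then 5-fold cross-validation.
    train, test = train_test_split(100)
    for fit, validation in k_fold_splits(train, k=5):
        print(len(fit), len(validation))
    # Toy labels only; a fitted model would supply y_pred in practice.
    print(binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0]))
```

In this arrangement the 20% test sample is never touched during cross-validation, so it remains available for a final check of whichever algorithm performs best across the folds.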
The SABSSM datasets are open access datasets available on the HSRC database. Ethics approval for this study was obtained from the University of Johannesburg Research and Ethics Committee on April 23, 2024 (REC-2725-2024). The HSRC gave the authors access to all 5 SABSSM datasets on August 20, 2024. The datasets were explored to identify the independent variables likely to influence HIV testing uptake. This study will identify variables that consistently predict HIV testing uptake among the South African adult population over the course of 20 years. Furthermore, it will evaluate and compare the performance metrics of the 4 ML algorithms, and the best-performing algorithm will be used to develop an HIV testing predictive model.
This study will contribute to existing knowledge and deepen understanding of factors linked to HIV testing beyond what traditional methods provide. Consequently, the findings will inform evidence-based policy recommendations that can guide policy makers in formulating more effective and targeted public health approaches to strengthening HIV testing.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/59916.