Ogutu Sarah, Mohammed Mohanad, Mwambi Henry
School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, 3201, South Africa.
School of Nursing and Public Health, University of KwaZulu-Natal, Pietermaritzburg, 3201, South Africa.
Sci Rep. 2024 Dec 2;14(1):29895. doi: 10.1038/s41598-024-81510-y.
HIV remains a critical global health issue, with an estimated 39.9 million people living with the virus worldwide at the end of 2023, according to the WHO. Although the epidemic's impact varies significantly across regions, Africa remains the most affected. In the past decade, considerable efforts have focused on developing preventive measures, such as vaccines and pre-exposure prophylaxis, to combat sexually transmitted HIV. Recently, cytokine profiles have gained attention as potential predictors of HIV incidence because of their involvement in immune regulation and inflammation, presenting new opportunities to enhance preventive strategies. However, the high-dimensional, time-varying nature of cytokine data collected in clinical research makes it difficult for traditional statistical methods, such as the Cox proportional hazards (PH) model, to analyze HIV-related survival data effectively. Machine learning (ML) survival models offer a robust alternative, especially for addressing the limitations imposed by the PH model's assumptions. In this study, we applied survival support vector machine (SSVM) and random survival forest (RSF) models, using either changes or means in cytokine levels as predictors, to assess their association with HIV incidence, evaluate variable importance, measure predictive accuracy using the concordance index (C-index) and integrated Brier score (IBS), and interpret the models' predictions using Shapley additive explanations (SHAP) values. Our results indicated that RSF models outperformed SSVM models, with the difference covariate model performing better than the mean covariate model. The highest C-index for SSVM was 0.7180 under the difference covariate model, while for RSF it reached 0.8801, also under the difference covariate model, using the log-rank split rule. Key cytokines identified as positive predictors of HIV incidence included TNF-A, BASIC-FGF, IL-5, MCP-3, and EOTAXIN, while 29 cytokines were negative predictors.
Baseline factors such as condom use frequency, treatment status, number of partners, and sexual activity also emerged as significant predictors. This study underscored the potential of cytokine profiles for predicting HIV incidence and highlighted the advantages of RSF models over SSVM models in analyzing high-dimensional, time-varying data. Through ablation studies, it further emphasized the importance of selecting key features within the mean-based and difference-based covariate models to achieve an optimal balance between model complexity and predictive accuracy.
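The concordance index reported above can be illustrated with a minimal sketch of Harrell's C-index for right-censored data. This is not the authors' code: the function name `concordance_index`, its arguments, and the toy data are hypothetical, and tied event times are simply skipped, which is a simplification of what survival libraries do. The statistic is the fraction of comparable subject pairs in which the subject with the earlier observed event also received the higher predicted risk.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C-index (hypothetical helper, illustration only).

    times       -- observed follow-up times
    events      -- 1 if the event (e.g. HIV infection) was observed, 0 if censored
    risk_scores -- model output where a higher value means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: pairs with tied times are skipped in this sketch
        # a is the subject with the shorter follow-up time, b the longer one
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier subject was censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk assigned to the earlier event
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): four subjects, the third one is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]
print(concordance_index(times, events, risk))  # 0.8 (4 of 5 comparable pairs concordant)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.7180 (SSVM) and 0.8801 (RSF) should be read.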