Kagendi Nancy, Mwau Matilu
Kenya Medical Research Institute, Nairobi, Kenya.
Health Data Sci. 2023 Oct 2;3:0019. doi: 10.34133/hds.0019. eCollection 2023.
Machine learning models are not in routine use for predicting HIV status. Our objective is to describe the development of a machine learning model that predicts HIV viral load (VL) hotspots as an early warning system in Kenya, based on data routinely collected by affiliate entities of the Ministry of Health. Following World Health Organization recommendations, hotspots are health facilities where ≥20% of people living with HIV have an unsuppressed VL. Predicting VL hotspots gives health administrators an early warning system for optimizing treatment and resource distribution.
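The hotspot definition above can be illustrated with a minimal sketch. The 20% threshold comes from the text; the suppression cutoff of 1,000 copies/mL, the function name `facility_hotspot_flags`, and the toy data are assumptions made for illustration and are not taken from the study.

```python
from collections import defaultdict

HOTSPOT_THRESHOLD = 0.20   # from the text: >=20% of patients with unsuppressed VL
SUPPRESSION_CUTOFF = 1000  # copies/mL; an assumed cutoff, not stated in the abstract


def facility_hotspot_flags(tests, cutoff=SUPPRESSION_CUTOFF, threshold=HOTSPOT_THRESHOLD):
    """Map facility id -> True if the share of unsuppressed results meets the threshold.

    `tests` is an iterable of (facility_id, viral_load) pairs, one per patient.
    """
    totals = defaultdict(int)
    unsuppressed = defaultdict(int)
    for facility_id, viral_load in tests:
        totals[facility_id] += 1
        if viral_load >= cutoff:
            unsuppressed[facility_id] += 1
    return {f: unsuppressed[f] / totals[f] >= threshold for f in totals}


# Hypothetical data: facility A has 1 of 5 unsuppressed (20%), facility B has 1 of 6.
example_tests = [
    ("A", 50), ("A", 1500), ("A", 200), ("A", 40), ("A", 30),
    ("B", 20), ("B", 80), ("B", 5000), ("B", 60), ("B", 10), ("B", 90),
]
flags = facility_hotspot_flags(example_tests)
```

Under these assumptions, facility A sits exactly on the threshold and is flagged, while facility B is not.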
A random forest model was built to predict the hotspot status of a health facility in the upcoming month, starting from 2016. Before model building, the datasets were cleaned and checked for outliers and multicollinearity at the patient level, and the patient-level data were then aggregated to the facility level. We analyzed data from 4 million tests and 4,265 facilities. The facility-level dataset was divided into training (75%) and test (25%) datasets.
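The 75%/25% division of the facility-level dataset can be sketched as follows. The abstract does not say how facilities were assigned to each set, so a seeded random shuffle is assumed here; `split_facilities`, the seed, and the placeholder rows are hypothetical.

```python
import random


def split_facilities(facility_rows, train_fraction=0.75, seed=0):
    """Shuffle facility-level rows reproducibly and cut them into train/test lists."""
    rows = list(facility_rows)
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    rng.shuffle(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Placeholder rows, one per facility; the feature value is made up for illustration.
facility_rows = [
    {"facility_id": i, "unsuppressed_share": (i % 10) / 20}
    for i in range(4265)
]
train_rows, test_rows = split_facilities(facility_rows)
```

With 4,265 facilities this yields 3,199 training rows and 1,066 test rows. A classifier such as scikit-learn's `RandomForestClassifier` could then be fitted on the training rows, though the abstract does not name the software the authors used.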
The model discriminates hotspots from non-hotspots with an accuracy of 78%. The F1 score of the model is 69% and the Brier score is 0.139. In December 2019, our model correctly predicted 434 VL hotspots in addition to the observed 446 VL hotspots.
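For readers unfamiliar with the three reported metrics, the sketch below computes accuracy, F1 score, and Brier score from true labels and predicted probabilities. The 0.5 decision cutoff, the function name `classification_metrics`, and the four-facility example are assumptions for illustration, not the study's data or settings.

```python
def classification_metrics(y_true, y_prob, cutoff=0.5):
    """Return accuracy, F1, and Brier score for binary labels and predicted probabilities."""
    y_pred = [1 if p >= cutoff else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # F1 is the harmonic mean of precision and recall, written in count form.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # Brier score: mean squared gap between predicted probability and outcome (lower is better).
    brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)
    return {"accuracy": accuracy, "f1": f1, "brier": brier}


# Hypothetical example: 1 = hotspot, 0 = non-hotspot.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
metrics = classification_metrics(y_true, y_prob)
```

Accuracy and F1 depend on the chosen cutoff, whereas the Brier score evaluates the predicted probabilities directly.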
The hotspot mapping model can be essential to antiretroviral therapy programs. It can help decision-makers identify VL hotspots ahead of time using cost-efficient, routinely collected data.