Institute of Global Health, University of Geneva, Geneva, Switzerland.
Institute of Mathematical Statistics and Actuarial Science, University of Bern, Bern, Switzerland.
PLoS One. 2022 Mar 3;17(3):e0264429. doi: 10.1371/journal.pone.0264429. eCollection 2022.
High-yield HIV testing strategies are critical to reaching epidemic control in high-prevalence, low-resource settings such as East and Southern Africa. In this study, we aimed to predict the HIV status of individuals living in Angola, Burundi, Ethiopia, Lesotho, Malawi, Mozambique, Namibia, Rwanda, Zambia and Zimbabwe with the highest possible precision and sensitivity for different policy targets and constraints, based on a minimal set of socio-behavioural characteristics.
We analysed the most recent Demographic and Health Survey from each of these 10 countries to predict individuals' HIV status using four algorithms: penalized logistic regression, a generalized additive model, a support vector machine (SVM), and gradient boosting trees. The algorithms were trained and validated on 80% of the data and tested on the remaining 20%. We compared predictions using the F1 score, the harmonic mean of sensitivity and positive predictive value (PPV), and assessed how well the models generalized by testing them on an independent left-out country. The best-performing algorithm was then trained on a minimal subset of the most predictive variables and used to 1) identify 95% of people living with HIV (PLHIV) while maximising precision and 2) identify groups of individuals for specific testing strategies by adjusting the probability threshold for being HIV positive (90% in our scenario).
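The F1 score and the threshold-based selection described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy scores and labels, and the rule of taking the highest threshold that reaches the target sensitivity (a simple proxy for "maximising precision") are all assumptions made for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of sensitivity (recall) and PPV (precision)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


def threshold_for_sensitivity(scores, labels, target):
    """Return (threshold, sensitivity, ppv, fraction_flagged).

    Everyone with score >= threshold is flagged for testing. The threshold
    returned is the highest one whose sensitivity is at least `target`,
    so it flags the fewest people among the thresholds that meet the target.
    """
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("labels contain no positives")
    # Walk candidate thresholds from strictest to most lenient.
    for thr in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= thr]
        tp = sum(flagged)
        sens = tp / total_pos
        if sens >= target:
            return thr, sens, tp / len(flagged), len(flagged) / len(labels)
    # Not reachable: the lowest threshold flags everyone (sensitivity 1.0).
    raise RuntimeError("no threshold reached the target")


# Hypothetical predicted probabilities and true statuses (1 = HIV positive).
toy_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]

thr, sens, ppv, frac = threshold_for_sensitivity(toy_scores, toy_labels, 0.75)
print(f"threshold={thr}, sensitivity={sens:.2f}, PPV={ppv:.2f}, tested={frac:.2f}")
```

Lowering the target sensitivity raises the threshold and shrinks the group flagged for testing, which is the trade-off the two scenarios in this study explore.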
Overall, 55,151 males and 69,626 females were included in the analysis. The gradient boosting trees algorithm performed best in predicting HIV status, with a mean F1 score of 76.8% [95% confidence interval (CI) 76.0%-77.6%] for males (vs [CI 67.8%-70.6%] for SVM) and 78.8% [CI 78.2%-79.4%] for females (vs [CI 73.4%-75.8%] for SVM). Among the ten most predictive variables for each sex, nine were identical: longitude, latitude, and altitude of place of residence, current age, age of most recent partner, total lifetime number of sexual partners, years lived in current place of residence, condom use during last intercourse, and wealth index. Only age at first sex for males (ranked 10th) and Rohrer's index for females (ranked 6th) differed between the sexes. Our large-scale scenario, which consisted of identifying 95% of all PLHIV, would have required testing 49.4% of males and 48.1% of females while achieving a precision of 15.4% for males and 22.7% for females. In the second scenario, only 4.6% of males and 6.0% of females would have had to be tested to find 55.7% of all males and 50.5% of all females living with HIV.
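The reported percentages are linked by a simple identity: precision equals sensitivity times prevalence divided by the fraction of people tested. The sketch below applies that identity to the figures above as a back-of-envelope check. The derived prevalence, second-scenario precision, and yield ratio are not reported in the source; they are implied values that ignore rounding and any survey weighting, and the function and variable names are hypothetical.

```python
def implied_prevalence(precision, fraction_tested, sensitivity):
    """Prevalence implied by: precision = sensitivity * prevalence / fraction_tested."""
    return precision * fraction_tested / sensitivity


def implied_precision(sensitivity, prevalence, fraction_tested):
    """Precision implied by the same identity, solved for precision."""
    return sensitivity * prevalence / fraction_tested


# Figures as reported in the abstract, expressed as proportions.
reported = {
    "males": {
        "scenario1": {"sensitivity": 0.95, "tested": 0.494, "precision": 0.154},
        "scenario2": {"sensitivity": 0.557, "tested": 0.046},
    },
    "females": {
        "scenario1": {"sensitivity": 0.95, "tested": 0.481, "precision": 0.227},
        "scenario2": {"sensitivity": 0.505, "tested": 0.060},
    },
}

derived = {}
for sex, r in reported.items():
    s1, s2 = r["scenario1"], r["scenario2"]
    prev = implied_prevalence(s1["precision"], s1["tested"], s1["sensitivity"])
    derived[sex] = {
        # Implied (not reported) HIV prevalence in the evaluated sample.
        "prevalence": prev,
        # Testing everyone yields precision equal to prevalence,
        # so this ratio is the gain over untargeted testing.
        "yield_ratio": s1["precision"] / prev,
        # Implied (not reported) precision of the high-threshold scenario.
        "scenario2_precision": implied_precision(s2["sensitivity"], prev, s2["tested"]),
    }

for sex, d in derived.items():
    print(sex, {k: round(v, 3) for k, v in d.items()})
```

Under these assumptions the implied prevalence is roughly 8% for males and 11.5% for females, the first scenario yields about 1.9 to 2.0 times as many positives per test as testing everyone, and the implied precision in the second scenario is above 90%, in line with the 90% probability threshold.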
We trained a gradient boosting trees algorithm to find 95% of PLHIV with a precision twice as high as that of general population testing, using only a limited number of socio-behavioural characteristics. We also successfully identified people at high risk of infection who may be offered pre-exposure prophylaxis or voluntary medical male circumcision. These findings can inform the implementation of new high-yield HIV tests and help develop highly targeted strategies that fit the constraints of low-resource settings.