Department of Health Management and Policy, School of Public Health, Capital Medical University, Beijing, China.
Department of Systems, Populations, and Leadership, University of Michigan School of Nursing, Ann Arbor, MI, United States.
Front Public Health. 2022 Oct 21;10:998549. doi: 10.3389/fpubh.2022.998549. eCollection 2022.
Chronic kidney disease (CKD) has become a major public health problem worldwide and imposes a heavy social and economic burden, especially in developing countries. No previous study has combined machine learning (ML) methods with longitudinal data to predict the 2-year risk of developing CKD amongst the elderly in China.
This study was based on panel data for 925 elderly individuals from the 2012 baseline survey and the 2014 follow-up survey of the Healthy Aging and Biomarkers Cohort Study (HABCS) database. Six ML models were developed to predict the probability of CKD amongst the elderly 2 years after baseline (in 2014): logistic regression (LR), lasso regression, random forests (RF), gradient-boosted decision tree (GBDT), support vector machine (SVM), and deep neural network (DNN). Decision curve analysis (DCA) was used to estimate the net benefit of each ML model across a range of threshold probabilities for the outcome.
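As a rough illustration of what DCA computes, the sketch below evaluates the standard net-benefit formula, TP/n - (FP/n) x pt/(1 - pt), at a single threshold probability pt. The function names and the toy data are assumptions made for this example; they are not the authors' code or the HABCS data.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of a model at one threshold probability.

    y_true: 0/1 outcomes (1 = CKD at follow-up).
    y_prob: predicted probabilities from any model.
    threshold: probability at or above which a person is flagged.
    """
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    n = len(y_true)
    # True and false positives among people flagged at this threshold.
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: flag everyone regardless of predicted risk."""
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data, for illustration only.
toy_outcomes = [1, 0, 1, 0]
toy_probabilities = [0.9, 0.2, 0.6, 0.7]
toy_nb = net_benefit(toy_outcomes, toy_probabilities, 0.5)
toy_nb_all = net_benefit_treat_all(toy_outcomes, 0.5)
```

Repeating the calculation over a grid of thresholds and plotting the result for each model gives the decision curves from which threshold ranges with the largest net benefit can be read off.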
Amongst the 925 elderly in the HABCS 2014 survey, 289 (18.8%) had CKD. LR, lasso regression, RF, GBDT, and DNN did not differ significantly in the area under the receiver operating characteristic curve (AUC), and all five had AUC values above 0.7, whereas SVM exhibited the lowest predictive performance (AUC = 0.633, p-value = 0.057). DNN had the highest positive predictive value (PPV) (0.328), whereas LR had the lowest (0.287). DCA results indicated that within the threshold ranges of ~0-0.03 and 0.37-0.40, the net benefit of GBDT was the largest. Within the threshold ranges of ~0.03-0.10 and 0.26-0.30, the net benefit of RF was the largest. Age was the most important predictor variable in the RF and GBDT models. Blood urea nitrogen, serum albumin, uric acid, body mass index (BMI), marital status, activities of daily living (ADL)/instrumental activities of daily living (IADL), and gender were also crucial in predicting CKD in the elderly.
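For readers less familiar with the two performance measures reported here, the following minimal sketch computes AUC as the probability that a randomly chosen case is scored above a randomly chosen non-case (ties count one half), and PPV as the share of flagged individuals who truly have the outcome. The function names, the 0.5 cut-off, and the toy data are assumptions for illustration; the abstract does not state which cut-off was used for PPV.

```python
def auc(y_true, y_score):
    """AUC via pairwise comparison of case and non-case scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    # A case scored above a non-case counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


def ppv(y_true, y_prob, threshold):
    """Positive predictive value at a given probability cut-off."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    if tp + fp == 0:
        return float("nan")  # nobody was flagged at this cut-off
    return tp / (tp + fp)


# Hypothetical toy data, for illustration only.
toy_outcomes = [1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.6, 0.7]
toy_auc = auc(toy_outcomes, toy_scores)
toy_ppv = ppv(toy_outcomes, toy_scores, 0.5)
```

The pairwise form is quadratic in sample size, which is acceptable for a cohort of this size; rank-based implementations give the same value more efficiently.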
The ML models could successfully capture the linear and nonlinear relationships of risk factors for CKD in the elderly. A decision support system based on the predictive models in this research can help medical staff detect health problems in the elderly early and intervene.