Cao Xia, Lin Yanhui, Yang Binfang, Li Ying, Zhou Jiansong
Health Management Center, The Third Xiangya Hospital, Central South University, Changsha, Hunan, People's Republic of China.
Health Management Research Center, Central South University, Changsha, Hunan, People's Republic of China.
Risk Manag Healthc Policy. 2022 Apr 26;15:817-826. doi: 10.2147/RMHP.S346856. eCollection 2022.
Using machine learning methods to make predictions on unseen data offers an opportunity to improve accuracy by exploring complex interactions between risk factors. We therefore evaluated the performance of machine learning (ML) algorithms and compared them with logistic regression for predicting the risk of renal function decline (RFD) using routine clinical data.
This retrospective cohort study included data from 2166 subjects, aged 35-74 years, provided by an adult health screening follow-up program between 2010 and 2020. Seven ML models were considered (random forest, gradient boosting, multilayer perceptron, support vector machine, K-nearest neighbors, adaptive boosting, and decision tree) and compared with standard logistic regression. There were 24 independent variables, and the baseline estimated glomerular filtration rate (eGFR) was used as the predictive variable.
A total of 2166 participants (mean age 49.2±11.2 years; 63.3% male) were enrolled and randomly divided into a training set (n=1732) and a test set (n=434). The area under the receiver operating characteristic curve (AUROC) for detecting RFD was above 0.85 for the different models during the training phase. The gradient boosting algorithm exhibited the best average prediction accuracy (AUROC: 0.914) among all algorithms validated in this study. Based on AUROC, the ML algorithms improved RFD prediction performance compared with the logistic regression model (AUROC: 0.882), except for the K-nearest neighbors and decision tree algorithms (AUROC: 0.854 and 0.824, respectively). However, the improvements over logistic regression were small (less than 4%) and nonsignificant.
Our results indicate that the proposed RFD prediction model, built on a health screening dataset using ML algorithms, is readily applicable and produces validated results. However, logistic regression performs as well as the ML models in predicting the risk of RFD from simple clinical predictors.
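The split sizes and the AUROC comparison reported in this abstract can be checked with a short sketch. It uses only the Python standard library and makes several assumptions not stated in the source: the 80/20 test fraction is inferred from the reported counts (1732 and 434), the helper names `auroc` and `split_sizes` are hypothetical, and the four-sample toy data exist only to exercise the AUROC function. The abstract does not say whether "less than 4%" is an absolute or a relative difference, so both are computed.

```python
import math


def auroc(labels, scores):
    """AUROC via the Mann-Whitney rank statistic, using average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to cover the whole group of tied scores.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def split_sizes(n_total, test_fraction=0.2):
    """Train/test sizes for a random split; test_fraction=0.2 is an assumption."""
    n_test = math.ceil(n_total * test_fraction)
    return n_total - n_test, n_test


# Cohort size from the abstract; the 80/20 fraction is inferred, not stated.
train_n, test_n = split_sizes(2166)

# AUROC values reported in the abstract.
auc_gb, auc_lr = 0.914, 0.882
abs_gain = auc_gb - auc_lr   # absolute AUROC difference
rel_gain = abs_gain / auc_lr  # difference relative to logistic regression

# Toy data (illustrative only, not study data).
toy_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

print(train_n, test_n, round(abs_gain, 3), round(rel_gain, 4), toy_auc)
```

Under these assumptions, the inferred split reproduces the reported set sizes, and the gain of gradient boosting over logistic regression stays below 4% whether it is read as an absolute or a relative AUROC difference.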