Yan Xiaoqian, Li Ximin, Lu Ying, Ma Dongfang, Mou Shenghong, Cheng Zhiyuan, Ding Yuan, Yan Bin, Zhang Xianzhen, Hu Gang
Department of Nephropathy, Tongde Hospital of Zhejiang Province, Hangzhou, Zhejiang 310012, China.
School of Micro-Nanoelectronics, Zhejiang University, Hangzhou, Zhejiang 310058, China.
Evid Based Complement Alternat Med. 2022 Jul 8;2022:6561721. doi: 10.1155/2022/6561721. eCollection 2022.
To establish a risk prediction model for chronic kidney disease (CKD) to guide its management and prevention.
A total of 1263 patients with CKD and 1948 patients without CKD admitted to the Tongde Hospital of Zhejiang Province from January 1, 2008, to December 31, 2018, were retrospectively analyzed. Spearman's correlation was used to analyze the relationship between CKD and laboratory parameters. XGBoost, random forest, Naive Bayes, support vector machine, and multivariate logistic regression algorithms were used to build prediction models for CKD risk evaluation. The accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC) of each model were compared. A new model combining bidirectional encoder representations from transformers with a light gradient boosting machine (MD-BERT-LGBM) was used to process the unstructured data and transform them into vectors suitable for analysis, and the AUC was compared before and after this processing.
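The five comparison metrics named above can be illustrated with a minimal, standard-library sketch. The function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration; they are not the study's code or data. AUC is computed here by its pairwise-ranking definition (the probability that a randomly chosen CKD case scores higher than a randomly chosen non-CKD case), which is practical only for small samples.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, where 1 = CKD and 0 = non-CKD."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 score from hard predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and model-predicted CKD probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = roc_auc(y_true, scores)
print(metrics)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC is computed from the raw scores and is threshold-independent, which is one reason the two kinds of metric are usually reported side by side.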
Laboratory parameters differed between CKD and non-CKD patients. The neutrophil ratio and white blood cell count were significantly associated with the occurrence of CKD. The XGBoost model showed the best predictive performance (accuracy = 0.9088, precision = 0.9175, recall = 0.8244, F1 score = 0.8868, AUC = 0.8244), followed by the random forest model (accuracy = 0.9020, precision = 0.9318, recall = 0.7905, F1 score = 0.581, AUC = 0.9519). The Naive Bayes and support vector machine models performed worse than the logistic regression model. The AUC of every model improved to some extent after processing with the new MD-BERT-LGBM model.
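Because the F1 score is defined as the harmonic mean of precision and recall, it can be recomputed from the precision and recall values reported above. The sketch below does this for the two best models; the function name and the dictionary layout are hypothetical, and only the precision and recall figures come from the text.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (defined as 0.0 when both are 0)."""
    total = precision + recall
    if total == 0:
        return 0.0
    return 2 * precision * recall / total


# (precision, recall) pairs as reported in the abstract.
reported = {
    "XGBoost": (0.9175, 0.8244),
    "random forest": (0.9318, 0.7905),
}

# Recompute F1 for each model, rounded to four decimals like the reported values.
recomputed = {
    name: round(f1_from_precision_recall(p, r), 4)
    for name, (p, r) in reported.items()
}

for name, f1 in recomputed.items():
    print(f"{name}: F1 = {f1}")
```

Setting the recomputed values beside the reported F1 scores gives a quick consistency check on a table of metrics.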
Including unstructured data through the new MD-BERT-LGBM model improved the accuracy, sensitivity, and specificity of the prediction models. Clinical features such as age, gender, urinary white blood cells, urinary red blood cells, thrombin time, serum creatinine, and total cholesterol were associated with CKD incidence.