Shi Xi, Qu Tingyu, Van Pottelbergh Gijs, van den Akker Marjan, De Moor Bart
Department of Electrical Engineering (ESAT), Stadius Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium.
Vlerick Business School, Leuven, Belgium.
Front Med (Lausanne). 2022 Mar 7;9:730748. doi: 10.3389/fmed.2022.730748. eCollection 2022.
Prognostic models can help identify patients at risk of end-stage kidney disease (ESKD) at an earlier stage so that preventive medical interventions can be provided. Previous studies mostly applied the Cox proportional hazards model. This study aims to present a resampling method that can handle the imbalanced data structure in prognostic modeling and help improve predictive performance.
Electronic health records of patients with chronic kidney disease (CKD) older than 50 years, collected from primary care in Belgium during 2005-2015, were used (n = 11,645). Both the Cox proportional hazards model and logistic regression analysis were applied as reference models. Then the resampling method, the Synthetic Minority Over-Sampling Technique-Edited Nearest Neighbor (SMOTE-ENN), was applied as a preprocessing procedure, followed by logistic regression analysis. Performance was evaluated by accuracy, the area under the curve (AUC), the confusion matrix, and the F1 score.
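The resampling step described above can be illustrated with a simplified, standard-library-only sketch. It is not the study's implementation: the function names (`smote`, `edited_nearest_neighbours`, `smote_enn`), the neighbour counts, the majority-vote cleaning rule, and the two-dimensional toy data are all assumptions made for illustration, and the sketch assumes exactly two classes. SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours; ENN then removes samples whose label disagrees with most of their nearest neighbours.

```python
import math
import random


def k_nearest(idx, points, candidates, k):
    """Return the k candidate indices closest to points[idx], excluding idx itself."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority=1, k=5, seed=0):
    """Add synthetic minority samples until both classes have the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(minority_idx)) - len(minority_idx)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        i = rng.choice(minority_idx)
        # Neighbours are searched among the original minority samples only.
        j = rng.choice(k_nearest(i, X, minority_idx, k))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        X_new.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_new.append(minority)
    return X_new, y_new


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority of their k nearest neighbours."""
    all_idx = range(len(X))
    keep = []
    for i in all_idx:
        neighbours = k_nearest(i, X, all_idx, k)
        agree = sum(1 for j in neighbours if y[j] == y[i])
        if 2 * agree > len(neighbours):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority=1, seed=0):
    """Oversample with SMOTE, then clean the result with ENN."""
    X_over, y_over = smote(X, y, minority=minority, seed=seed)
    return edited_nearest_neighbours(X_over, y_over)


# Hypothetical toy data: 190 negative cases and 10 positive cases in two features.
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(190)]
X += [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(10)]
y = [0] * 190 + [1] * 10

X_res, y_res = smote_enn(X, y)
print("before:", y.count(0), "negative,", y.count(1), "positive")
print("after: ", y_res.count(0), "negative,", y_res.count(1), "positive")
```

In practice one would more likely rely on a maintained library implementation of SMOTE-ENN and fit the logistic regression on the resampled training set only, leaving the evaluation set at its original class ratio.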
The C statistic for the Cox proportional hazards model was 0.807, while the AUC for the logistic regression analysis was 0.700, both comparable to previous studies. With the model trained on the resampled set, 86.3% of patients with ESKD were correctly identified, although at the cost of a high misclassification rate for negative cases. The F1 score was 0.245, much higher than 0.043 for the logistic regression analysis and 0.022 for the Cox proportional hazards model.
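The trade-off reported above can be made concrete with the definition of the F1 score as the harmonic mean of precision and recall. The sketch below is a back-of-the-envelope calculation, not a result from the study: it assumes that the 86.3% figure is the recall (sensitivity) measured on the same evaluation set as the F1 score of 0.245, and the helper names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for the precision P."""
    return f1 * recall / (2 * recall - f1)


recall = 0.863  # share of ESKD patients correctly identified (from the abstract)
f1 = 0.245      # F1 score of the model trained on the resampled set (from the abstract)

implied_precision = precision_from_f1(f1, recall)
print(f"implied precision: {implied_precision:.3f}")
```

Under that assumption the implied precision is roughly 0.14, meaning that about one in seven patients flagged as high risk would actually develop ESKD, which is consistent with the high misclassification rate for negative cases noted in the text.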
This study highlighted the imbalanced data structure and its effects on prediction accuracy, which were not thoroughly discussed in previous studies. From a clinical perspective, the resampling method allowed us to better identify patients at high risk of ESKD. However, it has the limitation of a high misclassification rate for negative cases. The technique can be widely used in other clinical topics where an imbalanced data structure needs to be considered.