Department of Internal Medicine, Division of Medical Informatics, University of Kansas Medical Center, Kansas City, Kansas, USA.
Big Data Decision Institute, Jinan University, Guangzhou, PRC.
J Am Med Inform Assoc. 2019 Mar 1;26(3):242-253. doi: 10.1093/jamia/ocy165.
Diabetic kidney disease (DKD) is one of the most frequent complications of diabetes and is associated with substantial morbidity and mortality. To accelerate DKD risk factor discovery, we present an ensemble feature selection approach that uses electronic medical records (EMRs) to identify a robust set of discriminant factors.
We identified a retrospective cohort of 15 645 adult patients with type 2 diabetes, excluding those with pre-existing kidney disease, and utilized all available clinical data types in modeling. We compared 3 machine-learning-based embedded feature selection methods in conjunction with 6 feature ensemble techniques, evaluating the top-ranked features they selected on robustness to data perturbations and on predictive performance for DKD onset.
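One way to make "robustness to data perturbations" concrete is to repeat feature selection on resampled versions of the cohort and measure how much the selected sets overlap. The sketch below uses the mean pairwise Jaccard similarity for this purpose. This is an assumption chosen for illustration: the abstract does not state which stability measure the study used, and the function names and feature names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of feature names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def selection_stability(selections):
    """Mean pairwise Jaccard similarity across repeated selections.

    `selections` holds one top-k feature list per perturbed sample
    (for example, one per bootstrap resample). Values near 1 mean the
    same features are chosen regardless of the perturbation.
    """
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two feature sets to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical top-3 selections from three resampled cohorts.
runs = [
    ["hba1c", "egfr", "sbp"],
    ["hba1c", "egfr", "bmi"],
    ["hba1c", "egfr", "sbp"],
]
score = selection_stability(runs)
```

A measure of this kind could be reported alongside AUC so that a method is compared on both stability and predictive performance.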
The gradient boosting machine (GBM) with the weighted mean rank feature ensemble technique achieved the best performance, with an AUC of 0.82 [95% CI, 0.81-0.83] on internal validation and 0.71 [95% CI, 0.68-0.73] on external temporal validation. The ensemble model identified a set of 440 features from 84 872 unique clinical features that are both predictive of DKD onset and robust against data perturbations, including 191 labs, 51 visit details (mainly vital signs), 39 medications, 34 orders, 30 diagnoses, and 95 other clinical features.
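A weighted mean rank ensemble can be sketched as follows: each resampled model ranks every feature, the ranks are averaged across models with a per-model weight, and the features with the lowest aggregated rank are kept. The abstract does not specify how the weights were defined, so the sketch assumes each model is weighted by its validation AUC. The function names, feature names, and numbers are hypothetical and are not taken from the study.

```python
def weighted_mean_rank(rank_lists, weights):
    """Aggregate per-model feature ranks into one weighted mean rank.

    rank_lists[i][j] is the rank (1 = most important) that model i
    assigned to feature j; weights[i] is the weight of model i.
    """
    if not rank_lists or len(rank_lists) != len(weights):
        raise ValueError("need one weight per rank list")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    n_features = len(rank_lists[0])
    if any(len(r) != n_features for r in rank_lists):
        raise ValueError("every model must rank the same features")
    total = sum(weights)
    return [
        sum(w * ranks[j] for w, ranks in zip(weights, rank_lists)) / total
        for j in range(n_features)
    ]


def top_k(feature_names, rank_lists, weights, k):
    """Return the k features with the lowest weighted mean rank."""
    scores = weighted_mean_rank(rank_lists, weights)
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    return [feature_names[j] for j in order[:k]]


# Hypothetical example: three resampled models ranking four features.
names = ["f_a", "f_b", "f_c", "f_d"]
ranks = [
    [1, 2, 3, 4],
    [2, 1, 4, 3],
    [1, 3, 2, 4],
]
aucs = [0.80, 0.70, 0.75]  # assumed weights: each model's validation AUC
selected = top_k(names, ranks, aucs, k=2)
```

Weighting by model quality lets rankings from better-performing resamples count for more, while averaging over many resamples dampens the effect of any single perturbed sample.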
Many of the top-ranked features have not been included in state-of-the-art DKD prediction models, but their relationships with kidney function have been suggested in the existing literature.
Our ensemble feature selection framework provides an option for identifying a robust and parsimonious feature set from EMR data in an unbiased manner, which effectively aids knowledge discovery for DKD risk factors.