Shameer Khader, Johnson Kipp W, Yahi Alexandre, Miotto Riccardo, Li Li, Ricks Doran, Jebakaran Jebakumar, Kovatch Patricia, Sengupta Partho P, Gelijns Annetine, Moskovitz Alan, Darrow Bruce, Reich David L, Kasarskis Andrew, Tatonetti Nicholas P, Pinney Sean, Dudley Joel T
Department of Genetics and Genomics, Icahn Institute of Genomics and Multiscale Biology, New York, NY, USA; Institute of Next Generation Healthcare, Mount Sinai Health System, New York, NY, USA.
Pac Symp Biocomput. 2017;22:276-287. doi: 10.1142/9789813207813_0027.
Reducing preventable hospital readmissions that result from chronic or acute conditions such as stroke, heart failure, myocardial infarction and pneumonia remains a significant challenge for improving outcomes and decreasing the cost of healthcare delivery in the United States. Readmission rates remain relatively high for conditions like heart failure (HF) despite the implementation of high-quality healthcare delivery operation guidelines created by regulatory authorities. Multiple predictive models are currently available to evaluate the potential 30-day readmission rates of patients. Most of these models are hypothesis-driven and repeatedly assess the same set of biomarkers as predictive features. In this manuscript, we discuss our attempt to develop a data-driven, electronic medical record-wide (EMR-wide) feature selection approach, followed by machine learning, to predict readmission probabilities. We assessed a large repertoire of variables from the electronic medical records of heart failure patients at a single center. The cohort included 1,068 patients, of whom 178 were readmitted within a 30-day interval (16.66% readmission rate). A total of 4,205 variables were extracted from the EMR, including diagnosis codes (n=1,763), medications (n=1,028), laboratory measurements (n=846), surgical procedures (n=564) and vital signs (n=4). We designed a multistep modeling strategy using the Naïve Bayes algorithm. In the first step, we created individual models to classify the cases (readmitted) and controls (non-readmitted). In the second step, features contributing to predictive risk in the individual models were combined into a composite model using a correlation-based feature selection (CFS) method. All models were trained and tested using 5-fold cross-validation, with 70% of the cohort used for training and the remaining 30% for testing.
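The correlation-based feature selection step can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not say which software was used. It scores a feature subset with the standard CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-to-class correlation and r_ff the mean feature-to-feature correlation, and grows the subset greedily. Two assumptions are made for illustration: absolute Pearson correlation stands in for the correlation measure, and the feature names and toy values (`bnp_lab`, `bnp_lab_dup`, `diuretic_med`, `noise`) are hypothetical, not variables from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(features, labels, subset):
    """CFS merit of a subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-to-class correlation.
    r_cf = sum(abs(pearson(features[f], labels)) for f in subset) / k
    # Mean absolute feature-to-feature correlation over all pairs.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(features, labels):
    """Forward selection: add the feature that most improves merit; stop when none does."""
    selected, best_merit = [], 0.0
    remaining = sorted(features)
    while remaining:
        scored = [(cfs_merit(features, labels, selected + [f]), f) for f in remaining]
        top_merit, top_name = scored[0]
        for merit_value, name in scored[1:]:
            if merit_value > top_merit:  # strict: ties keep the earlier name
                top_merit, top_name = merit_value, name
        if top_merit <= best_merit:
            break
        selected.append(top_name)
        remaining.remove(top_name)
        best_merit = top_merit
    return selected, best_merit


# Hypothetical toy data: 8 patients, 1 = readmitted within 30 days.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
features = {
    "bnp_lab":      [1, 1, 1, 1, 0, 0, 0, 0],  # informative
    "bnp_lab_dup":  [1, 1, 1, 1, 0, 0, 0, 0],  # exact duplicate, fully redundant
    "diuretic_med": [1, 1, 0, 0, 1, 1, 0, 0],  # informative, uncorrelated with bnp_lab
    "noise":        [1, 0, 1, 0, 1, 0, 1, 0],  # uncorrelated with the label
}

selected, merit = greedy_cfs(features, labels)
print(selected, round(merit, 4))
```

On this toy input the duplicate feature and the noise feature are left out, because adding either one lowers the merit: redundancy inflates r_ff, and an uninformative feature lowers r_cf. In a pipeline like the one described, the selected columns would then be passed to the composite classifier.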
Compared to existing predictive models for HF readmission rates (AUCs in the range of 0.6-0.7), results from our EMR-wide predictive model (AUC=0.78; Accuracy=83.19%) and phenome-wide feature selection strategies are encouraging and reveal the utility of such data-driven machine learning. Fine-tuning of the model, replication using multi-center cohorts and a prospective clinical trial to evaluate clinical utility would help the adoption of the model as a clinical decision system for evaluating readmission status.
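The cohort figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only numbers stated above; the variable names are illustrative. It also computes the share of non-readmitted patients, which is a useful reference point when reading an accuracy figure for an imbalanced cohort next to the AUC.

```python
# Figures as reported in the abstract.
cohort_size = 1068
readmitted = 178
variables = {
    "diagnosis_codes": 1763,
    "medications": 1028,
    "laboratory_measurements": 846,
    "surgical_procedures": 564,
    "vital_signs": 4,
}

# 30-day readmission rate: 178 / 1068 is exactly 1/6, which rounds to 16.67%;
# the abstract reports the truncated value 16.66%.
readmission_rate = readmitted / cohort_size

# The per-category counts should add up to the stated 4,205 variables.
total_variables = sum(variables.values())

# Share of the majority class (patients not readmitted within 30 days).
majority_share = (cohort_size - readmitted) / cohort_size

print(f"readmission rate: {readmission_rate:.4f}")
print(f"total variables:  {total_variables}")
print(f"majority share:   {majority_share:.4f}")
```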