Department of Medicine, Columbia University Irving Medical Center, New York, NY, United States.
Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, United States.
J Biomed Inform. 2022 Jul;131:104095. doi: 10.1016/j.jbi.2022.104095. Epub 2022 May 20.
The multi-modal and unstructured nature of observational data in Electronic Health Records (EHR) is currently a significant obstacle to applying machine learning to risk stratification. In this study, we develop a deep learning framework for incorporating longitudinal clinical data from EHR to infer risk for pancreatic cancer (PC). This framework includes a novel training protocol, which enforces an emphasis on early detection by applying an independent Poisson-random mask to proximal-time measurements for each variable. Data fusion for irregular multivariate time-series features is enabled by a "grouped" neural network (GrpNN) architecture, which uses representation learning to generate a dimensionally reduced vector for each measurement set before making a final prediction. These models were evaluated using EHR data from Columbia University Irving Medical Center-New York Presbyterian Hospital. Our framework demonstrated better performance on early detection (AUROC 0.671, 95% CI 0.667-0.675, p < 0.001) at 12 months prior to diagnosis compared with logistic regression, XGBoost, and feedforward neural network baselines. We demonstrate that our masking strategy results in greater improvements at distal times prior to diagnosis, and that our GrpNN model improves generalizability by reducing overfitting relative to the feedforward baseline. The results were consistent across reported race. Our proposed algorithm is potentially generalizable to other diseases, including but not limited to cancer, where early detection can improve survival.
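The two ideas in the abstract, Poisson-random masking of proximal-time measurements and per-group representation learning before a final prediction, can be illustrated with a minimal sketch. The abstract does not give the parameterization, so everything here is an assumption made for illustration: the cutoff is drawn in months with a hypothetical rate `lam`, each group encoder is a single tanh layer, and the names `sample_poisson`, `mask_proximal`, and `grouped_forward` are hypothetical rather than the authors' code.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def mask_proximal(months_before_dx, lam, rng):
    """Hide measurements taken too close to diagnosis.

    months_before_dx maps each variable to the months-before-diagnosis of
    its measurements. Each variable gets its own Poisson cutoff (assumed
    parameterization); measurements closer than the cutoff are masked.
    Returns, per variable, a list of booleans (True = keep).
    """
    keep = {}
    for name, months in months_before_dx.items():
        cutoff = sample_poisson(lam, rng)  # independent draw per variable
        keep[name] = [m >= cutoff for m in months]
    return keep


def dense(weights, bias, x, activation):
    """One fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def grouped_forward(groups, encoders, head):
    """Encode each measurement group separately, concatenate, then predict.

    groups:   {group name: feature vector}
    encoders: {group name: (weights, bias)} for that group's reduced vector
    head:     (weights, bias) of the final logistic unit
    """
    fused = []
    for name in sorted(groups):  # fixed order so the head sees a stable layout
        weights, bias = encoders[name]
        fused.extend(dense(weights, bias, groups[name], math.tanh))
    head_weights, head_bias = head
    return dense([head_weights], [head_bias], fused, sigmoid)[0]


# Toy usage with made-up numbers.
rng = random.Random(0)
history = {"hba1c": [24, 12, 6, 1], "weight": [18, 3]}
keep = mask_proximal(history, lam=6.0, rng=rng)

risk = grouped_forward(
    groups={"labs": [0.5, -1.0], "vitals": [1.0]},
    encoders={"labs": ([[0.1, 0.2]], [0.0]), "vitals": ([[0.3]], [0.1])},
    head=([0.5, -0.5], 0.0),
)
```

Because each variable draws its own cutoff, a model trained on such masked histories cannot rely on any one variable's most recent values, which is one plausible reading of how the protocol pushes toward early detection.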