Li Zhao, Lan Lan, Zhou Yujia, Li Ruoxing, Chavin Kenneth D, Xu Hua, Li Liang, Shih David J H, Zheng W Jim
McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin Street, Suite 600, Houston, Texas, 77030.
Department of Surgery, Case Western Reserve University School of Medicine, 11100 Euclid Ave, Cleveland OH 44106.
medRxiv. 2023 Nov 17:2023.11.17.23298691. doi: 10.1101/2023.11.17.23298691.
Deep learning models have shown great success and potential when applied to many biomedical problems. However, when structured electronic health records data are used, the accuracy of deep learning models for many disease prediction problems is affected by time-varying covariates, rare incidence, and covariate imbalance. The situation is further exacerbated when predicting the risk of one disease conditional on another, such as hepatocellular carcinoma risk among patients with nonalcoholic fatty liver disease, because of the slow, chronic progression, the scarcity of data covering both conditions, and the sex bias of the diseases.
The goal of this study was to investigate the extent to which time-varying covariates, rare incidence, and covariate imbalance influence deep learning performance, and to devise strategies to tackle these challenges. These strategies were applied to improve hepatocellular carcinoma risk prediction among patients with nonalcoholic fatty liver disease.
We evaluated two representative deep learning models on the task of predicting the occurrence of hepatocellular carcinoma in a cohort of patients with nonalcoholic fatty liver disease (n = 220,838) from a national EHR database. The disease prediction task was formulated as a classification problem that takes censoring and the length of follow-up into consideration.
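One way to picture a classification formulation that accounts for censoring and follow-up length is a fixed-horizon labeling rule. The sketch below is illustrative only and is not taken from the study: the function name `label_patient`, its parameters, and the rule itself (case if the event occurs within the horizon, control if follow-up covers the whole horizon without an event, excluded otherwise) are assumptions.

```python
from typing import Optional


def label_patient(
    days_to_event: Optional[int],  # days from index date to the outcome; None if never observed
    followup_days: int,            # days from index date to the last available record
    horizon_days: int,             # prediction window (hypothetical parameter)
) -> Optional[int]:
    """Return 1 (case), 0 (control), or None (excluded as censored)."""
    # Outcome observed inside the prediction window: positive label.
    if days_to_event is not None and days_to_event <= horizon_days:
        return 1
    # No outcome inside the window and follow-up covers the whole window: negative label.
    if followup_days >= horizon_days:
        return 0
    # Follow-up ended before the window closed, so the outcome status is unknown.
    return None


if __name__ == "__main__":
    print(label_patient(200, 200, 365))   # event inside the window
    print(label_patient(None, 400, 365))  # fully observed, no event
    print(label_patient(None, 100, 365))  # censored before the window closes
```

Under this assumed rule, patients whose follow-up ends before the window closes are dropped instead of being counted as negatives, which avoids mislabeling patients who may have developed the outcome after leaving the database.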
We developed a novel backward masking scheme to evaluate how the length of longitudinal information after the index date affects disease prediction. We observed that modeling time-varying covariates improved the performance of the algorithms, and that transfer learning mitigated the performance loss caused by the lack of data. In addition, covariate imbalance, such as sex bias in the data, impaired performance. Deep learning models trained on one sex and evaluated on the other showed reduced performance, indicating the importance of assessing covariate imbalance while preparing data for model training.
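The abstract does not spell out how backward masking is implemented. As a rough sketch under stated assumptions, one reading is that records are hidden backward from the end of the observation period, so that a model can be evaluated with progressively shorter longitudinal histories. The function name `backward_mask`, the record layout (date, list of codes), and the placeholder codes are all hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple

Visit = Tuple[date, List[str]]  # (visit date, codes recorded at that visit); assumed layout


def backward_mask(visits: List[Visit], end_date: date, mask_days: int) -> List[Visit]:
    """Hide every visit that falls within `mask_days` before `end_date`."""
    cutoff = end_date - timedelta(days=mask_days)
    # Keep only visits on or before the cutoff; later visits are masked out.
    return [(d, codes) for d, codes in visits if d <= cutoff]


if __name__ == "__main__":
    # Placeholder records, not real patient data.
    visits = [
        (date(2020, 1, 1), ["code_a"]),
        (date(2020, 6, 1), ["code_b"]),
        (date(2020, 12, 1), ["code_c"]),
    ]
    end = date(2020, 12, 31)
    for mask_days in (0, 90):
        kept = backward_mask(visits, end, mask_days)
        print(mask_days, [d.isoformat() for d, _ in kept])
```

Sweeping `mask_days` over a range of values and re-scoring the model at each setting would show how much predictive performance depends on the most recent records.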
Devising proper strategies to address the challenges posed by time-varying covariates, lack of data, and covariate imbalance can be key to counteracting data bias and accurately predicting disease occurrence with deep learning models. The novel strategies developed in this work can significantly improve the performance of hepatocellular carcinoma risk prediction among patients with nonalcoholic fatty liver disease. Furthermore, these strategies can be generalized to other disease risk predictions using structured electronic health records, especially the risk of one disease conditional on another.