Department of Biostatistics, Harvard T.H. Chan School of Public Health, United States of America.
Department of Biomedical Informatics, Harvard Medical School, United States of America.
J Biomed Inform. 2024 Sep;157:104685. doi: 10.1016/j.jbi.2024.104685. Epub 2024 Jul 14.
Risk prediction plays a crucial role in planning for prevention, monitoring, and treatment. Electronic Health Records (EHRs) offer an expansive repository of temporal medical data encompassing both risk factors and outcome indicators essential for effective risk prediction. However, challenges emerge due to the lack of readily available gold-standard outcomes and the complex effects of various risk factors. Compounding these challenges are false positives in diagnosis codes and the formidable task of pinpointing onset timing during annotation.
We develop a Semi-supervised Double Deep Learning Temporal Risk Prediction (SeDDLeR) algorithm based on extensive unlabeled longitudinal EHR data, augmented by a limited set of gold-standard labels on binary status indicating whether the clinical event of interest occurred during the follow-up period.
The SeDDLeR algorithm calculates an individualized risk of developing future clinical events over time using each patient's baseline EHR features via the following steps: (1) construction of an initial EHR-derived surrogate as a proxy for the onset status; (2) deep learning calibration of the surrogate against gold-standard onset status; and (3) semi-supervised deep learning for risk prediction combining calibrated surrogates and gold-standard onset status. To account for missing onset time and heterogeneous follow-up, we introduce temporal kernel weighting. We devise a Gated Recurrent Unit (GRU) module to capture temporal characteristics. We then assess the proposed SeDDLeR method in simulation studies and apply it to the Massachusetts General Brigham (MGB) Biobank to predict type 2 diabetes (T2D) risk.
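The abstract does not specify the form of the temporal kernel weighting, so the following is only a minimal sketch of the general idea, assuming a Gaussian kernel: labeled patients whose follow-up length is close to a target time t receive more weight when estimating the event status at t. The function names, the Gaussian kernel, and the `bandwidth` parameter are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def gaussian_kernel_weights(follow_up, t, bandwidth):
    """Normalized Gaussian kernel weights centered at target time t.

    Assumption: a Gaussian kernel is used purely for illustration; the
    source does not state which kernel SeDDLeR uses.
    """
    u = (np.asarray(follow_up, dtype=float) - t) / bandwidth
    w = np.exp(-0.5 * u ** 2)
    return w / w.sum()


def kernel_weighted_event_rate(status, follow_up, t, bandwidth):
    """Kernel-weighted average of binary event status around time t."""
    w = gaussian_kernel_weights(follow_up, t, bandwidth)
    return float(np.dot(w, np.asarray(status, dtype=float)))


# Hypothetical toy data: follow-up length (years) and binary event status.
follow_up = [1.0, 2.0, 5.0, 8.0]
status = [0, 0, 1, 1]
weights = gaussian_kernel_weights(follow_up, t=5.0, bandwidth=1.0)
rate = kernel_weighted_event_rate(status, follow_up, t=5.0, bandwidth=1.0)
```

In this toy setup the patient followed for exactly 5 years dominates the estimate at t = 5, which is the intended effect of weighting by proximity in follow-up time. A smaller bandwidth localizes the estimate more sharply at the cost of higher variance.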
SeDDLeR outperforms benchmark risk prediction methods, including the Semi-parametric Transformation Model (STM) and DeepHit, with consistently the best accuracy across experiments. In the MGB T2D study, SeDDLeR achieved the best C-statistic (0.815, SE 0.023; vs STM +0.084, SE 0.030, P-value 0.004; vs DeepHit +0.055, SE 0.027, P-value 0.024) and the best average time-specific AUC (0.778, SE 0.022; vs STM +0.059, SE 0.039, P-value 0.067; vs DeepHit +0.168, SE 0.032, P-value <0.001).
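For readers unfamiliar with the C-statistic reported above, the following sketch computes Harrell's concordance index, the most common version of the measure. The abstract does not say which estimator the study used, and this version assumes that event times are available for the evaluation set; the function name `harrell_c` and the toy data are hypothetical.

```python
import numpy as np


def harrell_c(time, event, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i has an observed event and
    a strictly shorter time than subject j. The pair is concordant when
    the subject with the earlier event has the higher predicted risk;
    tied risks count as half.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: the third subject is censored at time 3.
time = [1, 2, 3, 4]
event = [1, 1, 0, 1]
c_perfect = harrell_c(time, event, risk=[4, 3, 2, 1])
c_reversed = harrell_c(time, event, risk=[1, 2, 3, 4])
c_constant = harrell_c(time, event, risk=[1, 1, 1, 1])
```

A value of 1.0 indicates that predicted risk ranks every comparable pair correctly, 0.5 corresponds to uninformative predictions, and 0.0 to a fully reversed ranking.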
SeDDLeR can train robust risk prediction models on both real-world EHR and synthetic datasets with minimal requirements for labeled event times. It has the potential to be incorporated into future clinical trial recruitment or clinical decision-making.