Duke University, Durham, NC, USA; Harvard Medical School, Boston, MA, USA.
Harvard T.H. Chan School of Public Health, Boston, MA, USA.
J Biomed Inform. 2023 Aug;144:104425. doi: 10.1016/j.jbi.2023.104425. Epub 2023 Jun 16.
Electronic health records (EHRs), which contain detailed longitudinal clinical information on large numbers of patients and cover broad patient populations, open opportunities for comprehensive predictive modeling of disease progression and treatment response. However, because EHRs were originally constructed for administrative purposes rather than for research, it is often not feasible in EHR-linked studies to capture reliable information for analytical variables, especially in the survival setting, where both accurate event status and accurate event times are needed for model building. For example, progression-free survival (PFS), a commonly used survival outcome for cancer patients, often involves complex information embedded in free-text clinical notes and cannot be extracted reliably. Proxies of PFS time, such as the time to the first mention of progression in the notes, are at best good approximations of the true event time. This makes it difficult to estimate event rates efficiently for an EHR patient cohort. Estimating survival rates from error-prone outcome definitions can lead to biased results and reduce power in downstream analyses. On the other hand, extracting accurate event time information via manual annotation is time- and resource-intensive. The objective of this study is to develop a calibrated survival rate estimator using noisy outcomes from EHR data.
In this paper, we propose a two-stage semi-supervised calibration of noisy event rate (SCANER) estimator that effectively overcomes censoring-induced dependency and attains more robust performance (i.e., it is not sensitive to misspecification of the imputation model) by fully utilizing both a small labeled set of gold-standard survival outcomes annotated via manual chart review and a set of proxy features automatically captured from the EHR in the unlabeled set. We validate the SCANER estimator by estimating the PFS rates for a virtual cohort of lung cancer patients from one large tertiary care center and the ICU-free survival rates for COVID patients from two large tertiary care centers.
In terms of survival rate estimates, SCANER produced point estimates very similar to those of the complete-case Kaplan-Meier (KM) estimator. In contrast, the other benchmark methods, which fail to account for the induced dependency between the event time and the censoring time conditional on surrogate outcomes, produced biased results across all three case studies. In terms of standard errors, the SCANER estimator was more efficient than the KM estimator, with up to a 50% efficiency gain.
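For readers unfamiliar with the complete-case benchmark named above, the following is a minimal sketch of the standard Kaplan-Meier product-limit estimator applied to a labeled set. It is not the SCANER estimator and is not taken from the paper; the function name `kaplan_meier` and the toy data are assumptions made purely for illustration.

```python
import numpy as np


def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    times  : observed follow-up times (event or censoring time)
    events : 1 if the event was observed, 0 if the subject was censored

    Returns the distinct observed event times and the estimated
    survival probability just after each of them.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)

    event_times = np.unique(times[events])  # sorted distinct event times
    surv = np.empty(len(event_times))
    s = 1.0
    for i, t in enumerate(event_times):
        at_risk = np.sum(times >= t)             # subjects still under observation at t
        n_events = np.sum((times == t) & events)  # events occurring exactly at t
        s *= 1.0 - n_events / at_risk
        surv[i] = s
    return event_times, surv


# Hypothetical toy labeled set (illustrative values only, not study data).
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
km_times, km_surv = kaplan_meier(toy_times, toy_events)
```

Because this estimator uses only the manually labeled subjects, its standard errors shrink only as the labeled set grows, which is the inefficiency that a semi-supervised estimator drawing on unlabeled proxy features aims to reduce.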
Compared with existing approaches, the SCANER estimator achieves more efficient, robust, and accurate survival rate estimates. This promising new approach can also improve the resolution (i.e., the granularity of event time) by using labels conditional on multiple surrogates, particularly for less common or poorly coded conditions.