Semi-supervised calibration of noisy event risk (SCANER) with electronic health records.

Affiliations

Duke University, Durham, NC, USA; Harvard Medical School, Boston, MA, USA.

Harvard T.H. Chan School of Public Health, Boston, MA, USA.

Publication information

J Biomed Inform. 2023 Aug;144:104425. doi: 10.1016/j.jbi.2023.104425. Epub 2023 Jun 16.

DOI:10.1016/j.jbi.2023.104425

PMID:37331495

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10478159/

Abstract

OBJECTIVE

Electronic health records (EHRs) contain detailed longitudinal clinical information on large and broad patient populations, opening opportunities for comprehensive predictive modeling of disease progression and treatment response. However, because EHRs were originally built for administrative purposes rather than research, EHR-linked studies often cannot capture reliable information on analytical variables. The problem is especially acute in the survival setting, where model building needs both accurate event status and accurate event times. For example, progression-free survival (PFS), a commonly used survival outcome for cancer patients, often depends on complex information embedded in free-text clinical notes and cannot be extracted reliably. Proxies of PFS time, such as the time to the first mention of progression in the notes, are at best good approximations of the true event time. This makes it difficult to estimate event rates efficiently for an EHR patient cohort: survival rates estimated from error-prone outcome definitions can be biased and can reduce power in downstream analyses, while extracting accurate event times through manual annotation is time- and resource-intensive. The objective of this study is to develop a calibrated survival rate estimator using noisy outcomes from EHR data.
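To make the bias concern concrete, here is a minimal toy simulation. It is not taken from the paper. It assumes exponentially distributed true event times, no censoring, and a proxy that is always recorded some time after the true event (as when progression is first mentioned in a note written after it occurred). The names `simulate` and `empirical_survival` and all parameter values are hypothetical, chosen only for illustration. Real proxies may be early as well as late.

```python
import random
from math import exp


def empirical_survival(times, t):
    """Fraction of subjects whose recorded time exceeds t (no censoring)."""
    return sum(1 for x in times if x > t) / len(times)


def simulate(n=5000, mean_event_time=12.0, mean_delay=3.0, seed=0):
    """Draw true event times and delayed proxy times (toy assumptions)."""
    rng = random.Random(seed)
    # True event times: exponential with the given mean (assumption).
    true_times = [rng.expovariate(1.0 / mean_event_time) for _ in range(n)]
    # Proxy times: true time plus a non-negative reporting delay (assumption).
    proxy_times = [t + rng.expovariate(1.0 / mean_delay) for t in true_times]
    return true_times, proxy_times


true_times, proxy_times = simulate()
t0 = 12.0  # time point at which the survival rate is evaluated
true_surv = empirical_survival(true_times, t0)
proxy_surv = empirical_survival(proxy_times, t0)

print(f"theoretical S({t0}) = {exp(-1.0):.3f}")
print(f"estimate from true times  = {true_surv:.3f}")
print(f"estimate from proxy times = {proxy_surv:.3f}")
```

Because every proxy time in this sketch is at least as large as the true time, the proxy-based survival estimate can never fall below the estimate from the true times, so the error is systematic rather than random noise that averages out.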

MATERIALS AND METHODS

In this paper, we propose a two-stage semi-supervised calibration of noisy event rate (SCANER) estimator. It overcomes censoring-induced dependency and attains more robust performance (i.e., it is not sensitive to misspecification of the imputation model) by fully using both a small labeled set of gold-standard survival outcomes, annotated via manual chart review, and a set of proxy features captured automatically from the EHR in the unlabeled set. We validate the SCANER estimator by estimating PFS rates for a virtual cohort of lung cancer patients from one large tertiary care center and ICU-free survival rates for COVID patients from two large tertiary care centers.

RESULTS

In terms of survival rate estimates, the SCANER estimator produced point estimates very similar to those of the complete-case Kaplan-Meier (KM) estimator. In contrast, the other benchmark methods, which fail to account for the induced dependency between the event time and the censoring time conditional on the surrogate outcomes, produced biased results across all three case studies. In terms of standard errors, the SCANER estimator was more efficient than the KM estimator, with up to a 50% efficiency gain.
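For readers unfamiliar with the benchmark, the following is a minimal sketch of a Kaplan-Meier estimator with Greenwood standard errors, written in plain Python. It illustrates the comparison baseline only and is not the SCANER estimator. The function names `kaplan_meier` and `relative_efficiency` are hypothetical, and the variance-ratio definition of relative efficiency is one common convention; the abstract does not state which definition underlies the reported efficiency gain.

```python
from math import sqrt


def kaplan_meier(times, events):
    """Kaplan-Meier curve with Greenwood standard errors.

    times:  observed follow-up times (minimum of event and censoring time)
    events: 1 if the event was observed, 0 if the subject was censored
    Returns a list of (time, survival, standard_error) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    greenwood = 0.0  # running sum of d / (n * (n - d))
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0  # events at time t
        m = 0  # subjects leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            m += 1
            i += 1
        if d > 0:
            surv *= 1.0 - d / n_at_risk
            if n_at_risk > d:  # Greenwood term is undefined when all fail
                greenwood += d / (n_at_risk * (n_at_risk - d))
            curve.append((t, surv, surv * sqrt(greenwood)))
        n_at_risk -= m
    return curve


def relative_efficiency(se_reference, se_new):
    """Variance ratio of a reference estimator to a new one (assumed definition)."""
    return (se_reference / se_new) ** 2


# Toy data: five subjects, two of them censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for t, s, se in curve:
    print(f"t={t}  S(t)={s:.3f}  SE={se:.3f}")
```

Under this convention, a value above 1 means the new estimator has the smaller variance. A complete-case KM estimator of this kind uses only the labeled subjects, which is why an estimator that also draws on unlabeled proxy data can have smaller standard errors.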

CONCLUSION

The SCANER estimator achieves more efficient, robust, and accurate survival rate estimates than existing approaches. This promising new approach can also improve the resolution (i.e., granularity of event time) by using labels conditional on multiple surrogates, particularly among less common or poorly coded conditions.
