Leveraging undecided cases in chart-reviewed phenotypes to enhance EHR-based association studies.

Authors

Jian Xinyao, Zhang Dazheng, Yu Zehao, Xu Hua, Bian Jiang, Wu Yonghui, Tong Jiayi, Chen Yong

Affiliations

The Center for Health Analytics and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA.

Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA.

Publication information

J Biomed Inform. 2025 Jun;166:104839. doi: 10.1016/j.jbi.2025.104839. Epub 2025 Apr 30.

DOI:10.1016/j.jbi.2025.104839

PMID:40316004

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12243065/

Abstract

OBJECTIVES

In electronic health record (EHR)-based association studies, phenotyping algorithms efficiently classify patient clinical outcomes into binary categories but are susceptible to misclassification errors. The gold standard, manual chart review, involves clinicians determining the true disease status based on their assessment of health records. These clinician-labeled phenotypes are labor-intensive to obtain and typically limited to a small subset of patients, and they can introduce a third "undecided" category when phenotypes are indeterminate. We aim to effectively integrate the algorithm-derived and chart-reviewed outcomes when both are available in EHR-based association studies.

MATERIAL AND METHODS

We propose an augmented estimation method that combines the binary algorithm-derived phenotypes for the entire cohort with the trinary chart-reviewed phenotypes for a small, selected subset. Additionally, a cost-effective outcome-dependent sampling strategy is used to address rare disease scenarios. The proposed trinary chart-reviewed phenotype integrated cost-effective augmented estimation (TriCA) was evaluated across a wide range of simulation settings and in real-world applications, including EHR data on Alzheimer's disease and related dementias (ADRD) from the OneFlorida+ Clinical Research Network and cohort data on second breast cancer events (SBCE) from Kaiser Permanente Washington.
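The general idea of augmenting a small validated sample with cohort-wide algorithm-derived labels can be illustrated with a minimal sketch. This is not the TriCA procedure: the abstract does not specify how the "undecided" category or the outcome-dependent sampling weights enter the estimator, so the sketch assumes a binary chart-reviewed outcome and a simple random validation sample. It uses a generic control-variate style correction, in which the validation-set estimate is shifted by the discrepancy between surrogate-outcome models fitted on the validation set and on the full cohort. All function names, sample sizes, and the sensitivity and specificity values are illustrative assumptions.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic regression.

    Returns the coefficient vector and per-row influence values
    (inverse mean information times the score contribution).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    mean_info = (X * (p * (1 - p))[:, None]).T @ X / len(y)
    infl = np.linalg.solve(mean_info, (X * (y - p)[:, None]).T).T
    return beta, infl

def augmented_estimate(X, s, val_idx, y_val):
    """Generic augmented estimator (assumes a simple random validation sample).

    X: full-cohort design matrix, s: binary algorithm-derived phenotype,
    val_idx: indices of chart-reviewed patients, y_val: their binary labels.
    """
    n, m = len(s), len(val_idx)
    Xv, sv = X[val_idx], s[val_idx]
    beta_v, infl_b = fit_logistic(Xv, y_val)   # true outcome, validation set
    gamma_v, infl_g = fit_logistic(Xv, sv)     # surrogate, validation set
    gamma_f, _ = fit_logistic(X, s)            # surrogate, full cohort
    S_bb = infl_b.T @ infl_b / m
    S_bg = infl_b.T @ infl_g / m
    S_gg = infl_g.T @ infl_g / m
    W = S_bg @ np.linalg.inv(S_gg)
    beta_aug = beta_v - W @ (gamma_v - gamma_f)
    var_val = S_bb / m
    var_aug = var_val - (1.0 / m - 1.0 / n) * W @ S_bg.T
    return beta_v, beta_aug, var_val, var_aug

# Simulated cohort (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, m = 5000, 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * x))))
# Error-prone surrogate: sensitivity 0.85, specificity 0.90.
s = np.where(y == 1, rng.binomial(1, 0.85, n), rng.binomial(1, 0.10, n))
val_idx = rng.choice(n, size=m, replace=False)

beta_v, beta_aug, var_val, var_aug = augmented_estimate(X, s, val_idx, y[val_idx])
print("validation-only estimate:", beta_v)
print("augmented estimate:      ", beta_aug)
print("estimated SE ratio (aug / val):",
      np.sqrt(np.diag(var_aug) / np.diag(var_val)))
```

In this sketch the estimated variance of the augmented estimator can never exceed that of the validation-only estimator, because the subtracted term is positive semidefinite. The size of the gain depends on how strongly the surrogate tracks the true outcome.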

RESULTS

Compared to estimation based on random sampling, our augmented method improved mean square error by up to 28.3% in simulation studies; compared to estimation using only trinary chart-reviewed phenotypes, our method improved efficiency by up to 33.3% in ADRD data and 50.8% in SBCE data.

DISCUSSION

Our simulation studies and real-world applications demonstrate that, compared to existing methods, the proposed method provides unbiased estimates with higher statistical efficiency.

CONCLUSION

The proposed method effectively combined binary algorithm-derived phenotypes for the whole cohort with trinary chart-reviewed outcomes for a limited validation set, making it applicable to a broader range of applications and enhancing risk factor identification in EHR-based association studies.
