Generate Analysis-Ready Data for Real-world Evidence: Tutorial for Harnessing Electronic Health Records With Advanced Informatic Technologies.

Affiliations

Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, United States.

Department of Medicine, University of British Columbia, Vancouver, BC, Canada.

Publication information

J Med Internet Res. 2023 May 25;25:e45662. doi: 10.2196/45662.

DOI: 10.2196/45662

PMID: 37227772

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10251230/

Abstract

Although randomized controlled trials (RCTs) are the gold standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data has been vital in postapproval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of real-world data is electronic health records (EHRs), which contain detailed information on patient care in both structured (eg, diagnosis codes) and unstructured (eg, clinical notes and images) forms. Despite the granularity of the data available in EHRs, the critical variables required to reliably assess the relationship between a treatment and clinical outcome are challenging to extract. To address this fundamental challenge and accelerate the reliable use of EHRs for RWE, we introduce an integrated data curation and modeling pipeline consisting of 4 modules that leverage recent advances in natural language processing, computational phenotyping, and causal modeling techniques with noisy data. Module 1 consists of techniques for data harmonization. We use natural language processing to recognize clinical variables from RCT design documents and map the extracted variables to EHR features with description matching and knowledge networks. Module 2 then develops techniques for cohort construction using advanced phenotyping algorithms to both identify patients with diseases of interest and define the treatment arms. Module 3 introduces methods for variable curation, including a list of existing tools to extract baseline variables from different sources (eg, codified, free text, and medical imaging) and end points of various types (eg, death, binary, temporal, and numerical). Finally, module 4 presents validation and robust modeling methods, and we propose a strategy to create gold-standard labels for EHR variables of interest to validate data curation quality and perform subsequent causal modeling for RWE. 
In addition to the workflow proposed in our pipeline, we also develop a reporting guideline for RWE that covers the necessary information to facilitate transparent reporting and reproducibility of results. Moreover, our pipeline is highly data driven, enhancing study data with a rich variety of publicly available information and knowledge sources. We also showcase our pipeline and provide guidance on the deployment of relevant tools by revisiting the emulation of the Clinical Outcomes of Surgical Therapy Study Group Trial on laparoscopy-assisted colectomy versus open colectomy in patients with early-stage colon cancer. We also draw on existing literature on EHR emulation of RCTs together with our own studies with the Mass General Brigham EHR.
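To make the "description matching" step of module 1 concrete, the following is a minimal sketch, not the authors' method: it scores each EHR feature description against a trial variable by token-level Jaccard similarity and returns the best match. The feature identifiers (`FEAT_001` and so on), their descriptions, and the helper names `tokens`, `jaccard`, and `match_variable` are all hypothetical and chosen only for illustration; the abstract does not state which similarity measure the pipeline uses.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase a description and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_variable(trial_variable: str, ehr_features: dict[str, str]) -> tuple[str, float]:
    """Return the (feature id, score) whose description best matches the trial variable."""
    query = tokens(trial_variable)
    scored = [(code, jaccard(query, tokens(desc))) for code, desc in ehr_features.items()]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical EHR feature dictionary (assumed ids and descriptions, not real codes).
EHR_FEATURES = {
    "FEAT_001": "malignant neoplasm of colon",
    "FEAT_002": "laparoscopic colectomy procedure",
    "FEAT_003": "open colectomy procedure",
}

best_code, best_score = match_variable("open colectomy", EHR_FEATURES)
print(best_code, round(best_score, 3))
```

Pure token overlap misses lexical variants such as "laparoscopy-assisted" versus "laparoscopic", which is one reason the abstract pairs description matching with knowledge networks.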


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a1d9/10251230/8b74a1656b9a/jmir_v25i1e45662_fig1.jpg



