Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, United States.
Department of Medicine, University of British Columbia, Vancouver, BC, Canada.
J Med Internet Res. 2023 May 25;25:e45662. doi: 10.2196/45662.
Although randomized controlled trials (RCTs) are the gold standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data has been vital in postapproval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of real-world data is electronic health records (EHRs), which contain detailed information on patient care in both structured (eg, diagnosis codes) and unstructured (eg, clinical notes and images) forms. Despite the granularity of the data available in EHRs, the critical variables required to reliably assess the relationship between a treatment and clinical outcome are challenging to extract. To address this fundamental challenge and accelerate the reliable use of EHRs for RWE, we introduce an integrated data curation and modeling pipeline consisting of 4 modules that leverage recent advances in natural language processing, computational phenotyping, and causal modeling techniques with noisy data. Module 1 consists of techniques for data harmonization. We use natural language processing to recognize clinical variables from RCT design documents and map the extracted variables to EHR features with description matching and knowledge networks. Module 2 then develops techniques for cohort construction using advanced phenotyping algorithms to both identify patients with diseases of interest and define the treatment arms. Module 3 introduces methods for variable curation, including a list of existing tools to extract baseline variables from different sources (eg, codified, free text, and medical imaging) and end points of various types (eg, death, binary, temporal, and numerical). Finally, module 4 presents validation and robust modeling methods, and we propose a strategy to create gold-standard labels for EHR variables of interest to validate data curation quality and perform subsequent causal modeling for RWE. 
Beyond the workflow itself, we develop a reporting guideline for RWE that covers the information needed for transparent reporting and reproducible results. Our pipeline is also highly data driven, enriching study data with a wide variety of publicly available information and knowledge sources. We showcase the pipeline and provide guidance on deploying the relevant tools by revisiting the emulation of the Clinical Outcomes of Surgical Therapy Study Group Trial on laparoscopy-assisted colectomy versus open colectomy in patients with early-stage colon cancer, drawing on the existing literature on EHR emulation of RCTs and on our own studies with the Mass General Brigham EHR.
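The description-matching step in module 1 can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in a plain token-overlap (Jaccard) score for whatever matching and knowledge-network methods the pipeline actually uses. The feature identifiers (`FEAT_A`, etc.), the descriptions, the function names, and the 0.3 threshold are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase a description and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two descriptions."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_variables(
    rct_vars: list[str],
    ehr_features: dict[str, str],
    threshold: float = 0.3,  # hypothetical cutoff, not taken from the paper
) -> dict[str, str | None]:
    """Map each RCT variable to the best-matching EHR feature, or None."""
    mapping: dict[str, str | None] = {}
    for var in rct_vars:
        best_id, best_score = None, 0.0
        for feat_id, description in ehr_features.items():
            score = jaccard(var, description)
            if score > best_score:
                best_id, best_score = feat_id, score
        # Keep the match only if it clears the similarity threshold.
        mapping[var] = best_id if best_score >= threshold else None
    return mapping


# Hypothetical EHR feature dictionary (identifiers are placeholders).
ehr_features = {
    "FEAT_A": "malignant neoplasm of colon",
    "FEAT_B": "laparoscopic colectomy procedure",
    "FEAT_C": "open colectomy procedure",
}

# Hypothetical variables, as if extracted from an RCT design document.
rct_vars = ["colon neoplasm", "laparoscopic colectomy", "serum albumin"]

mapping = match_variables(rct_vars, ehr_features)
for var, feat in mapping.items():
    print(f"{var!r} -> {feat}")
```

Variables with no sufficiently similar description are mapped to `None`, so they can be flagged for manual review instead of being forced onto a poor match.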