Causal Inference via Electronic Health Records in the National Clinical Cohort Collaborative: Challenges and Solutions in Long COVID Research

Authors

Butzin-Dozier Zachary, Ji Yunwen, Wang Lin-Chiun, Anzalone A Jerrod, Hurwitz Eric, Patel Rena C, van der Laan Mark J, Colford John M, Hubbard Alan E

Affiliations

School of Public Health, University of California, Berkeley, Berkeley, CA USA.

University of Nebraska Medical Center, Omaha, NE USA.

Publication information

medRxiv. 2025 Jun 11:2025.06.06.25329168. doi: 10.1101/2025.06.06.25329168.

DOI:10.1101/2025.06.06.25329168

PMID:40502605

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12155030/

Abstract

Observational analyses of electronic health record (EHR) data from databases such as the National Clinical Cohort Collaborative pose unique challenges for researchers seeking to draw causal inferences, particularly when evaluating subjectively defined outcomes like Long COVID. We explore six such challenges and describe potential solutions.

1. Lack of true negatives: Many diagnoses and conditions have either a positive indicator or a missing status, so investigators must carefully consider which patients are likely negative for the condition.
2. Differential monitoring: EHR data include nonrandom missingness driven by patients engaging with the healthcare system at different rates, and the rate of engagement is often related to both the exposure and the outcome of interest.
3. Bias: EHR data sources face many biases, but are particularly vulnerable to informative missingness, differential monitoring, and model misspecification.
4. Large sample size: High precision (i.e., narrow confidence intervals) paired with potential bias leads to a high risk of incorrectly rejecting the null hypothesis.
5. Defining index time: Investigators should deliberately define index time (i.e., baseline) so that they adjust only for baseline confounders and do not adjust for (or condition on) factors that are affected by the exposure of interest (i.e., colliders or mediators).
6. Parameter selection: Investigators should select only parameters that are supported by the data distribution.

This manuscript provides an overview of these challenges and solutions, using both simulated and real-world data, with Long COVID as the running example outcome.
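
The large-sample-size challenge (item 4) can be made concrete with a short calculation. The sketch below is an illustration only and is not taken from the manuscript: it assumes a two-arm comparison with no true effect, a risk-difference estimator that is approximately normal, and a small fixed additive bias. The function name `false_rejection_rate`, the bias of 0.005, and the baseline risk of 0.10 are all hypothetical choices made for this example.

```python
from math import sqrt
from statistics import NormalDist


def false_rejection_rate(n_per_arm, bias, baseline_risk=0.10, alpha=0.05):
    """Probability that a two-sided z-test rejects a TRUE null (no effect)
    when the risk-difference estimator carries a fixed additive bias.

    Assumptions (illustrative, not from the manuscript): equal arm sizes,
    the same outcome risk in both arms, and a normal approximation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)

    # Standard error of a difference of two independent proportions.
    se = sqrt(2 * baseline_risk * (1 - baseline_risk) / n_per_arm)

    # The test statistic is centered at bias / se rather than at zero.
    shift = bias / se

    # P(Z < -z_crit) + P(Z > z_crit) for Z ~ Normal(shift, 1).
    return std_normal.cdf(-z_crit - shift) + 1 - std_normal.cdf(z_crit - shift)


# The same small bias is nearly harmless at n = 1,000 per arm but
# dominates the test once the standard error shrinks below it.
for n in (1_000, 100_000, 10_000_000):
    print(f"n per arm = {n:>10,}: rejection rate = "
          f"{false_rejection_rate(n, bias=0.005):.3f}")
```

With zero bias the rejection rate stays at the nominal 5% regardless of sample size. With any fixed nonzero bias it climbs toward 100% as the sample grows, because the standard error shrinks while the bias does not. This is why narrow confidence intervals in very large EHR cohorts should not be read as evidence that an estimate is accurate.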
