Blacketer Clair, DeFalco Frank J, Conover Mitchell M, Ryan Patrick B, Schuemie Martijn J, Rijnbeek Peter R
Coordinating Center, Observational Health Data Sciences and Informatics (OHDSI), New York, NY, 10032, United States.
Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, 3015 GD, The Netherlands.
J Am Med Inform Assoc. 2025 Sep 1;32(9):1434-1444. doi: 10.1093/jamia/ocaf119.
In real-world data (RWD), defining the observation period (the time during which a patient is considered observable) is critical for estimating incidence rates (IRs) and other outcomes. Yet, in the absence of explicit enrollment information, this period must often be inferred, introducing potential bias.
This study evaluates methods for defining observation periods and their impact on IR estimates across multiple database types. We applied 3 methods for defining observation periods: (1) a persistence + surveillance window approach, (2) an age- and gender-adjusted method based on time between healthcare events, and (3) the min/max method. These were tested across 11 RWD databases, including both enrollment-based and encounter-based sources. Enrollment time was used as the reference standard in eligible databases. To assess the impact on epidemiologic results, we replicated a prior study of adverse event incidence, comparing IRs and calculating mean squared error between methods.
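As an illustration only, two of the approaches named above can be sketched in Python. The abstract does not give the exact definitions, so the reading used here is an assumption: the min/max method spans a patient's first to last recorded event, while the persistence + surveillance approach joins events separated by no more than a persistence window and extends each resulting period by a surveillance window. The function and parameter names are hypothetical and are not taken from the study.

```python
from datetime import date, timedelta


def min_max_period(event_dates):
    """Min/max method: a single period from the first to the last event."""
    return [(min(event_dates), max(event_dates))]


def persistence_surveillance_periods(event_dates, persistence_days, surveillance_days):
    """Assumed reading of the persistence + surveillance approach.

    Events no more than `persistence_days` apart are joined into one
    period. Each period is extended by `surveillance_days` after its
    last event, but never past the day before the next period starts.
    """
    dates = sorted(event_dates)
    surveillance = timedelta(days=surveillance_days)
    periods = []
    start = prev = dates[0]
    for current in dates[1:]:
        if (current - prev).days > persistence_days:
            # Gap too long: close the running period and start a new one.
            end = min(prev + surveillance, current - timedelta(days=1))
            periods.append((start, end))
            start = current
        prev = current
    periods.append((start, prev + surveillance))
    return periods


# Hypothetical patient with two nearby events and one much later event.
events = [date(2020, 1, 1), date(2020, 3, 1), date(2022, 6, 1)]
print(min_max_period(events))
print(persistence_surveillance_periods(events, persistence_days=365, surveillance_days=30))
```

For this hypothetical patient, the min/max method counts the entire gap between March 2020 and June 2022 as observed time, whereas the windowed sketch splits the history into two shorter periods.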
Incidence rates decreased as observation periods lengthened, driven by increases in the person-time denominator. The persistence + surveillance method produced estimates closest to enrollment-based rates when appropriately balanced. The min/max approach yielded inconsistent results, particularly in encounter-based databases, with greater error observed in databases with longer time spans.
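The denominator effect described above follows from the standard definition of an incidence rate as events divided by person-time. The sketch below uses made-up counts, not figures from the study, and the helper name and the per-1000-person-years scaling are assumptions chosen for illustration.

```python
def incidence_rate(n_events, person_days, per_person_years=1000):
    """Events per `per_person_years` person-years of observation."""
    person_years = person_days / 365.25
    return n_events * per_person_years / person_years


# Same hypothetical 10 events, two different observation period definitions.
shorter = incidence_rate(10, person_days=365_250)  # 1000 person-years
longer = incidence_rate(10, person_days=730_500)   # 2000 person-years
print(shorter, longer)  # 10.0 5.0
```

With the event count held fixed, doubling the person-time halves the estimated rate, which is why a more permissive observation period definition lowers the IR.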
These findings suggest that assumptions about data completeness and population observability significantly affect incidence estimates. Observation period definitions substantially influence outcome measurement in RWD studies.
Standardized, transparent approaches are necessary to ensure valid, reproducible results, especially in databases lacking defined enrollment.