Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina, USA.
Center for Predictive Medicine, Duke Clinical Research Institute, Durham, North Carolina, USA.
J Am Med Inform Assoc. 2019 Dec 1;26(12):1609-1617. doi: 10.1093/jamia/ocz148.
Electronic health record (EHR) data have become a central data source for clinical research. One concern with using EHR data is that the process through which individuals engage with the health system, and thereby appear in EHR data, can be informative. We have termed this process informed presence. In this study, we use simulation and real data to assess how informed presence can impact inference.
We first simulated a visit process in which a series of biomarkers was observed informatively and uninformatively over time. We then compared inferences derived from a randomized controlled trial (ie, uninformative visits) and from EHR data (ie, potentially informative visits).
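A simulated visit process of the kind described above can be sketched in a few lines. This is only an illustrative toy and not the simulation design used in the study: the function name `simulate_slope`, the parameters `beta` (biomarker-outcome association) and `gamma` (biomarker-visit association), the normal distributions, the continuous outcome, and the logistic visit model are all assumptions made for the example.

```python
import math
import random


def simulate_slope(beta, gamma, n_patients=5000, n_times=10, seed=0):
    """Return the OLS slope of the outcome on the mean *observed* biomarker.

    Hypothetical toy model (assumed, not taken from the study):
      - each patient has a latent biomarker level mu ~ N(0, 1)
      - at each time point the biomarker is x = mu + N(0, 1)
      - a visit occurs with probability sigmoid(gamma * x);
        gamma = 0 gives uninformative visits (probability 0.5)
      - the outcome is y = beta * mu + N(0, 1)
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_patients):
        mu = rng.gauss(0.0, 1.0)
        observed = []
        for _ in range(n_times):
            x = mu + rng.gauss(0.0, 1.0)
            p_visit = 1.0 / (1.0 + math.exp(-gamma * x))
            if rng.random() < p_visit:
                observed.append(x)  # biomarker is recorded only at a visit
        if not observed:
            continue  # patients with no visits never appear in the data
        y = beta * mu + rng.gauss(0.0, 1.0)
        xs.append(sum(observed) / len(observed))
        ys.append(y)

    # Simple least-squares slope of y on the observed biomarker mean.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    for beta in (0.0, 1.0):
        for gamma in (0.0, 2.0):
            slope = simulate_slope(beta, gamma)
            print(f"beta={beta}, gamma={gamma}: estimated slope={slope:.3f}")
```

Comparing the estimated slope for `gamma = 0` against `gamma > 0` at a fixed `beta` gives a rough sense of how much an informative visit process shifts the estimate in this toy setting. Setting `beta = 0` checks whether informative visits alone produce an apparent association.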
We find that bias arises only when there is a strong association both between the biomarker and the outcome and between the biomarker and the visit process. Moreover, once some uninformative visits are present, this bias is mitigated. In the data example, we find that when the "true" associations are null, there is no observed bias.
These results suggest that an informative visit process can exaggerate an association but cannot induce one. Furthermore, careful study design can mitigate the potential bias when some noninformative visits are included.
While there are legitimate concerns regarding biases that "messy" EHR data may induce, the conditions for such biases are extreme and can be accounted for.