Weber Griffin M, Adams William G, Bernstam Elmer V, Bickel Jonathan P, Fox Kathe P, Marsolo Keith, Raghavan Vijay A, Turchin Alexander, Zhou Xiaobo, Murphy Shawn N, Mandl Kenneth D
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.
J Am Med Inform Assoc. 2017 Nov 1;24(6):1134-1141. doi: 10.1093/jamia/ocx071.
One promise of nationwide adoption of electronic health records (EHRs) is the availability of data for large-scale clinical research studies. However, because the same patient could be treated at multiple health care institutions, data from only a single site might not contain the complete medical history for that patient, meaning that critical events could be missing. In this study, we evaluate how simple heuristic checks for data "completeness" affect the number of patients in the resulting cohort and introduce potential biases.
We began with a set of 16 filters that check for the presence of demographics, laboratory tests, and other types of data, and then systematically applied all 2^16 (65,536) possible combinations of these filters to the EHR data for 12 million patients at 7 health care systems and a separate payor claims database of 7 million members.
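Applying every combination of filters amounts to enumerating all subsets of the filter set and counting the patients who pass each subset. The following is a minimal sketch of that idea, not the study's actual implementation: the toy patient records, the three filter names, and the `cohort_sizes` helper are all hypothetical, and each filter is assumed to simply require that one data type be present. With 16 filters the same loop would visit 2^16 = 65,536 subsets.

```python
from itertools import combinations

# Hypothetical toy records: each patient is the set of data types present.
patients = [
    {"demographics", "diagnoses", "medications"},
    {"demographics"},
    {"demographics", "labs", "diagnoses"},
    {"labs"},
]

# Hypothetical filters (3 here rather than the study's 16); each one
# requires that a single data type be present in the patient's record.
filters = ["demographics", "labs", "diagnoses"]


def cohort_sizes(patients, filters):
    """Map every subset of filters to the number of patients passing all of them."""
    sizes = {}
    for k in range(len(filters) + 1):
        for combo in combinations(filters, k):
            required = set(combo)
            # A patient passes when every required data type is present.
            sizes[combo] = sum(1 for p in patients if required <= p)
    return sizes


sizes = cohort_sizes(patients, filters)
print(len(sizes))                            # number of filter combinations
print(sizes[()])                             # cohort size with no filters
print(sizes[("demographics", "diagnoses")])  # cohort size with two filters
```

Comparing the cohort size for each subset against the unfiltered size shows how quickly stacked completeness requirements shrink the cohort.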
EHR data showed considerable variability in data completeness across sites and high correlation between data types. For example, the fraction of patients with diagnoses increased from 35.0% in all patients to 90.9% in those with at least 1 medication. An unrelated claims dataset independently showed that most filters select members who are older and more likely female and can eliminate large portions of the population whose data are actually complete.
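The correlation between data types described above can be expressed as a conditional fraction: the share of patients with one data type, computed first over all patients and then over only those who pass a filter. The sketch below illustrates this on invented toy data; the records and the `fraction_with` helper are hypothetical and do not reproduce the study's figures.

```python
# Hypothetical toy records: each patient is the set of data types present.
patients = [
    {"diagnoses", "medications"},
    {"diagnoses", "medications"},
    {"demographics"},
    {"demographics"},
    {"diagnoses"},
    {"medications"},
]


def fraction_with(patients, data_type, given=None):
    """Fraction of patients having `data_type`, optionally restricted to
    patients who also have the `given` data type."""
    pool = [p for p in patients if given is None or given in p]
    if not pool:
        return float("nan")  # no patients pass the restriction
    return sum(1 for p in pool if data_type in p) / len(pool)


overall = fraction_with(patients, "diagnoses")
among_medicated = fraction_with(patients, "diagnoses", given="medications")
print(f"all patients: {overall:.1%}")
print(f"patients with at least 1 medication: {among_medicated:.1%}")
```

A large gap between the two fractions indicates that a filter on one data type implicitly selects for the presence of another.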
As investigators design studies, they need to balance their confidence in the completeness of the data against the effects that data requirements have on the resulting patient cohort.