Crull Elizabeth, O'Brien Emily C, Antiperovitch Pavel, Asfaw Kirubel, Beatty Alexis L, Djibo Djeneba Audrey, Kaul Alan F, Kornak John, Marcus Gregory M, Modrow Madelaine Faulkner, Olgin Jeffrey E, Orozco Jaime, Park Soo, Peyser Noah, Pletcher Mark J, Carton Thomas W
Department of Health Services Research, Louisiana Public Health Institute, 400 Poydras Street, Suite 1250, New Orleans, LA, 70130, United States, 1 5044954903.
Duke Clinical Research Institute, School of Medicine, Duke University, Durham, NC, United States.
JMIR Form Res. 2025 Jul 28;9:e58097. doi: 10.2196/58097.
Real-world data reported by patients and extracted from electronic health records (EHRs) are increasingly leveraged for research, policy, and clinical decision-making. However, the extent to which these 2 data sources agree with each other is not always clear.
This study aimed to evaluate the concordance of variables reported by participants enrolled in an electronic cohort study and data available in their EHRs.
Survey data from COVID-19 Citizen Science, an electronic cohort study, were linked to EHR data from 7 health systems, comprising 34,908 participants. Concordance was evaluated for demographics, chronic conditions, and COVID-19 characteristics. Overall agreement, sensitivity, specificity, positive predictive value, negative predictive value, and κ statistics with 95% CIs were calculated.
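As an illustration of the measures named above, the following minimal sketch computes them from a single 2x2 table for one binary variable. It assumes participant report is treated as the reference and the EHR as the source being compared; the abstract does not state this convention explicitly. The function name `concordance` and the example counts are hypothetical, and the 95% CIs are omitted for brevity.

```python
def concordance(tp, fp, fn, tn):
    """Agreement measures for one binary variable from a 2x2 table.

    Assumed convention (not stated in the abstract):
      tp: positive in both sources
      fp: positive in EHR only
      fn: positive in participant report only
      tn: negative in both sources
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # overall agreement
    # Agreement expected by chance, from the marginal totals of each source.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "agreement": observed,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1 - expected),  # Cohen's kappa
    }


# Hypothetical counts, for illustration only (not study data).
for name, value in concordance(tp=40, fp=10, fn=20, tn=30).items():
    print(f"{name}: {value:.3f}")
```

Because κ subtracts the agreement expected from the marginals alone, a variable can show high overall agreement and still have a low κ when one category dominates.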
Of 34,017 participants with complete information, 62.3% (21,176/34,017) reported being female, and 62.4% (21,217/34,017) were female according to EHR data. The median age was 57 (IQR 42-68) years. Overall, 81.6% (27,744/34,017) of participants reported being White, and 79.5% (27,054/34,017) were White according to EHR data. In addition, 9.2% (3,124/34,017) of participants reported being Hispanic, and 6.6% (2,249/34,017) were Hispanic according to EHR data. Statistically significant discordance between data sources was detected for all demographic characteristics (P<.05) except the female category (P=.57) and the American Indian and Alaska Native (P=.21) and "other" race categories (P=.33). Statistically significant discordance was also detected for the 2 COVID-19 traits and for all baseline medical conditions except diabetes (P=.17). The starkest absolute difference between data sources was for COVID-19 vaccination, which was 48.4% according to the EHR and 97.4% according to participant report. Overall agreement was high for all demographic characteristics, although chance-corrected agreement (κ) and sensitivity were lower for the "other" race category (κ=0.31, sensitivity=26.6%), Hispanic ethnicity (κ=0.82, sensitivity=74%), and current smoker status (κ=0.54, sensitivity=49.4%). Specificity and negative predictive value (NPV) were higher than the corresponding sensitivity and positive predictive value (PPV) for all baseline medical conditions. Sleep apnea had the highest sensitivity of all medical conditions (83.5%), and anemia had the lowest (32.8%). Chance-corrected agreement (κ) was highly variable for baseline medical conditions, ranging from 0.26 for anemia to 0.71 for diabetes. Overall and chance-corrected agreement between data sources was lower for the COVID-19 traits, infection (84.6%, κ=0.34) and vaccination (51.0%, κ=0.05), than for all other evaluated traits. The sensitivity for COVID-19 infection was 32.2%, and the sensitivity for COVID-19 vaccination was 49.7%. Although PPV for COVID-19 vaccination was 99.9%, the NPV was 5%.
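The combination of a very high PPV and a very low NPV for vaccination follows from the two reported prevalences. The back-of-the-envelope sketch below rebuilds approximate 2x2 cell proportions from three rounded figures in the abstract (48.4% vaccinated per EHR, 97.4% per participant report, PPV 99.9%) and derives the other measures from them. It assumes participant report is the reference; the cell proportions are a reconstruction from rounded values, not numbers reported by the study, so the results are only approximate.

```python
# Reported (rounded) inputs from the abstract.
ehr_pos = 0.484   # vaccinated according to the EHR
self_pos = 0.974  # vaccinated according to participant report
ppv = 0.999       # share of EHR positives confirmed by participant report

# Reconstructed cell proportions (assumption: participant report is the reference).
tp = ppv * ehr_pos        # vaccinated in both sources
fp = ehr_pos - tp         # vaccinated in EHR only
fn = self_pos - tp        # vaccinated per participant only
tn = 1 - tp - fp - fn     # unvaccinated in both sources

sensitivity = tp / self_pos
npv = tn / (tn + fn)
agreement = tp + tn
expected = ehr_pos * self_pos + (1 - ehr_pos) * (1 - self_pos)
kappa = (agreement - expected) / (1 - expected)

print(f"sensitivity={sensitivity:.3f} npv={npv:.3f} "
      f"agreement={agreement:.3f} kappa={kappa:.3f}")
```

Under these assumptions, nearly half of all participants report a vaccination that is missing from the EHR, so a negative EHR record is rarely a true negative. That is consistent with the reported sensitivity, NPV, overall agreement, and κ for vaccination.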
Results suggest the need for improvements to point-of-care capture of patient demographic traits and COVID-19 infection and vaccination history, patient education about their medical conditions, and linkage to external data sources in EHR-only pragmatic research. Further, these results indicate that additional work is required to integrate and prioritize participant-reported data in pragmatic research.