Epidemiology, Merck & Co., Inc., Rahway, New Jersey, USA.
Biostatistics and Research Decision Sciences (BARDS), Merck & Co., Inc., Rahway, New Jersey, USA.
Pharmacoepidemiol Drug Saf. 2024 Oct;33(10):e70019. doi: 10.1002/pds.70019.
To assess the validity of privacy-preserving synthetic data by comparing results from analyses of synthetic versus original electronic health record (EHR) data.
A published retrospective cohort study by Maccabi Healthcare Services in Israel on the real-world effectiveness of COVID-19 vaccines was replicated using synthetic data generated from the same source, and the results from the synthetic and original datasets were compared. The endpoints were COVID-19 infection, symptomatic COVID-19 infection, and hospitalization due to infection; each was also assessed in several demographic and clinical subgroups. Synthetic and original estimates were compared using several metrics: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval overlap, and the Wald test. Synthetic data were generated five times to assess the stability of results.
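The comparison metrics named above can be illustrated with a minimal Python sketch. The abstract does not give formulas, so every definition here is an assumption based on common usage in synthetic-data validation, not the authors' implementation: decision agreement is taken to mean that both confidence intervals lead to the same conclusion about the null value, estimate agreement that the synthetic point estimate falls inside the original confidence interval, and confidence interval overlap as the average fraction of each interval covered by their intersection. The Wald test shown treats the two estimates as independent, which is a simplification. The `Estimate` class, the function names, and the example numbers are all hypothetical; ratio measures such as hazard ratios are assumed to be supplied on the log scale.

```python
import math
from dataclasses import dataclass


@dataclass
class Estimate:
    """Point estimate with a 95% CI (use the log scale for ratio measures)."""
    point: float
    lower: float
    upper: float


def smd_binary(p_orig: float, p_synth: float) -> float:
    """Standardized mean difference for a binary characteristic (assumed form)."""
    pooled_var = (p_orig * (1 - p_orig) + p_synth * (1 - p_synth)) / 2
    if pooled_var == 0:
        return 0.0
    return (p_synth - p_orig) / math.sqrt(pooled_var)


def decision(est: Estimate, null: float = 0.0) -> int:
    """+1 if the CI lies above the null, -1 if below, 0 if it contains the null."""
    if est.lower > null:
        return 1
    if est.upper < null:
        return -1
    return 0


def decision_agreement(orig: Estimate, synth: Estimate, null: float = 0.0) -> bool:
    """Assumed definition: both analyses reach the same conclusion."""
    return decision(orig, null) == decision(synth, null)


def estimate_agreement(orig: Estimate, synth: Estimate) -> bool:
    """Assumed definition: synthetic point estimate lies within the original CI."""
    return orig.lower <= synth.point <= orig.upper


def ci_overlap(orig: Estimate, synth: Estimate) -> float:
    """Average share of each CI covered by the intersection (negative if disjoint)."""
    shared = min(orig.upper, synth.upper) - max(orig.lower, synth.lower)
    return 0.5 * (shared / (orig.upper - orig.lower)
                  + shared / (synth.upper - synth.lower))


def wald_p_value(orig: Estimate, synth: Estimate, z_crit: float = 1.959964) -> float:
    """Two-sided Wald p-value for the difference, assuming independent estimates."""
    se_orig = (orig.upper - orig.lower) / (2 * z_crit)   # SE recovered from the CI
    se_synth = (synth.upper - synth.lower) / (2 * z_crit)
    z = (synth.point - orig.point) / math.sqrt(se_orig**2 + se_synth**2)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical log hazard ratios, for illustration only (not study results).
orig = Estimate(math.log(0.10), math.log(0.05), math.log(0.20))
synth = Estimate(math.log(0.11), math.log(0.06), math.log(0.21))
print(decision_agreement(orig, synth), estimate_agreement(orig, synth))
print(round(ci_overlap(orig, synth), 3), round(wald_p_value(orig, synth), 3))
```

Under these definitions a non-significant Wald test can coexist with a failed estimate or decision agreement, for example when the synthetic point estimate falls just outside a narrow original interval.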
The distributions of demographic and clinical characteristics differed very little between datasets (SMD < 0.01). In the comparison of vaccine effectiveness, measured as relative risk reduction, between synthetic and original data, there was 100% decision agreement, 100% estimate agreement, and a high level of confidence interval overlap (88.7%-99.7%) in all five replicates across all subgroups. Similar findings were obtained for vaccine effectiveness against symptomatic COVID-19 infection. In the comparison of hazard ratios for COVID-19-related hospitalization and odds ratios for symptomatic COVID-19 infection, the Wald tests suggested no significant difference between the respective effect estimates in any of the five replicates for any patient subgroup, but the estimate and decision metrics disagreed in some subgroups and replicates.
Overall, the comparison of synthetic versus original real-world data demonstrated good validity and reliability. Transparency about the process used to generate high-fidelity synthetic data, and assurances of patient privacy, are warranted.