Replica Analytics Ltd, Ottawa, ON, Canada.
Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada.
BMC Med Res Methodol. 2023 Mar 23;23(1):67. doi: 10.1186/s12874-023-01869-w.
Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals from Alberta Health's administrative health database. We assess how similar our synthetic data is to the real data using generic utility assessments that examine the structure and general patterns in the data, as well as by recreating, on the synthetic data, a specific analysis commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. The generic utility assessments used Hellinger distance to quantify the difference in distributions between the real and synthetic datasets: 0.027 for event types, a mean of 0.0417 for attributes, and, for Markov transition matrices, an order 1 mean absolute difference of 0.0896 (sd: 0.159) and an order 2 mean Hellinger distance of 0.2195 (sd: 0.2724). The Hellinger distance between the joint distributions was 0.352, and random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and a mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. When a realistic analysis was applied to both real and synthetic datasets, Cox regression achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating that synthetic data produces analytic results similar to those from real data.
The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics, our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
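The two utility measures reported above, Hellinger distance and confidence interval overlap, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`hellinger`, `empirical`, `ci_overlap`) are hypothetical, the Hellinger distance uses the standard definition for discrete distributions, and the overlap formula is one commonly used definition (the average of the overlap expressed as a fraction of each interval's length, truncated at zero). The abstract does not state which variant of either measure was used.

```python
import math
from collections import Counter


def empirical(values):
    """Turn a list of categorical values into a dict of relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    p and q map category -> probability. The result lies in [0, 1]:
    0 for identical distributions, 1 for distributions with disjoint support.
    """
    categories = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
        for c in categories
    )
    return math.sqrt(squared / 2.0)


def ci_overlap(real_ci, synth_ci):
    """Confidence interval overlap (assumed definition, see lead-in).

    Each argument is a (lower, upper) pair. The shared length is divided by
    each interval's own length and the two fractions are averaged.
    Non-overlapping intervals are scored as 0.
    """
    shared = max(0.0, min(real_ci[1], synth_ci[1]) - max(real_ci[0], synth_ci[0]))
    real_len = real_ci[1] - real_ci[0]
    synth_len = synth_ci[1] - synth_ci[0]
    return 0.5 * (shared / real_len + shared / synth_len)


if __name__ == "__main__":
    # Made-up event-type sequences, purely for illustration.
    real_events = ["visit", "visit", "lab", "rx", "visit", "lab"]
    synth_events = ["visit", "lab", "lab", "rx", "visit", "visit"]
    print(hellinger(empirical(real_events), empirical(synth_events)))

    # Made-up hazard ratio confidence intervals, purely for illustration.
    print(ci_overlap((1.10, 1.50), (1.20, 1.70)))
```

Under these definitions, a Hellinger distance near 0 and an overlap near 1 (100%) both indicate close agreement between the real and synthetic data. A per-outcome overlap would be averaged across outcomes to obtain a summary figure such as the 68% reported above.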