El Emam Khaled, Mosquera Lucy, Bass Jason
School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada.
Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada.
J Med Internet Res. 2020 Nov 16;22(11):e23139. doi: 10.2196/23139.
Interest in data synthesis as a way to share data for secondary analysis has been growing; however, a comprehensive privacy risk model for fully synthetic data is still needed. If the generative models have been overfit, it is possible to identify individuals from synthetic data and learn something new about them.
The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data.
A full risk model is presented that evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this "meaningful identity disclosure risk." The model is applied to samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data.
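The idea of a "meaningful" match can be illustrated with a toy sketch. This is not the estimator from the study: the function name, the choice of quasi-identifiers, and the exact-match rule are all assumptions made for illustration. It counts a real record as meaningfully disclosed only when some synthetic record matches it on the quasi-identifiers and also carries the same sensitive value, so that an adversary who makes the match would learn something correct.

```python
from typing import Dict, List, Sequence, Set, Tuple

Record = Dict[str, str]


def toy_meaningful_match_rate(
    real: List[Record],
    synthetic: List[Record],
    quasi_identifiers: Sequence[str],
    sensitive: str,
) -> float:
    """Fraction of real records with a synthetic match that also reveals
    the correct sensitive value (hypothetical toy measure, not the paper's)."""
    if not real:
        return 0.0

    # Index synthetic records by their quasi-identifier values.
    by_key: Dict[Tuple[str, ...], Set[str]] = {}
    for s in synthetic:
        key = tuple(s[q] for q in quasi_identifiers)
        by_key.setdefault(key, set()).add(s[sensitive])

    hits = 0
    for r in real:
        key = tuple(r[q] for q in quasi_identifiers)
        # A match on quasi-identifiers alone is not enough: the sensitive
        # value attached to the synthetic record must also be correct.
        if r[sensitive] in by_key.get(key, set()):
            hits += 1
    return hits / len(real)
```

A synthetic record that matches a real person on the quasi-identifiers but carries a different sensitive value contributes nothing to this rate, which is the distinction the term "meaningful" is drawing.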
The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively.
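The comparison against the threshold is simple arithmetic and can be checked directly. The sketch below uses only the figures reported above; the variable and function names are hypothetical.

```python
# Figures as reported in the results; names are illustrative only.
RISK_THRESHOLD = 0.09

synthetic_risk = {
    "Washington State Hospital discharge (2007)": 0.0198,
    "Canadian COVID-19 cases": 0.0086,
}


def is_below_threshold(risk: float, threshold: float = RISK_THRESHOLD) -> bool:
    """True when the estimated risk is strictly below the threshold."""
    return risk < threshold


def threshold_margin(risk: float, threshold: float = RISK_THRESHOLD) -> float:
    """How many times smaller the risk is than the threshold."""
    return threshold / risk


for name, risk in synthetic_risk.items():
    print(f"{name}: below threshold = {is_below_threshold(risk)}, "
          f"margin = {threshold_margin(risk):.2f}x")
```

The margin here is relative to the 0.09 threshold, not to the risk of the original datasets, whose exact values are not stated in the abstract.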
We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.