Knowledge Management, ZB MED - Information Centre for Life Sciences, Cologne, Germany.
Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
Stud Health Technol Inform. 2024 Aug 30;317:270-279. doi: 10.3233/SHTI240867.
A modern approach to ensuring privacy when sharing datasets is the use of synthetic data generation methods, which often claim to outperform classic anonymization techniques in the trade-off between data utility and privacy. Recently, it was demonstrated that various deep learning-based approaches are able to generate useful synthesized datasets, often based on domain-specific analyses. However, evaluating the privacy implications of releasing synthetic data remains a challenging problem, especially when the goal is to conform with data protection guidelines.
To address this, the recent privacy risk quantification framework Anonymeter was built to evaluate multiple possible vulnerabilities, specifically the privacy risks considered by the European Data Protection Board: singling out, linkability, and attribute inference. This framework was applied to a synthetic data generation study from the epidemiological domain, in which the synthesis replicates time and age trends previously found in data collected during the DONALD cohort study (1312 participants, 16 time points). The privacy analyses conducted are presented, with a focus on the vulnerability of outliers.
The resulting privacy scores, which vary greatly between the different types of attacks, are discussed.
Challenges encountered while implementing the attacks and interpreting their results are highlighted, and it is concluded that privacy risk assessment for synthetic data remains an open problem.
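To make the notion of a singling-out attack named above more concrete, the following is a minimal toy sketch in plain Python. It is not the Anonymeter API and not the procedure used in the study. The function names `singling_out_rate` and `excess_risk`, the example records, and the idea of normalising against a baseline measured on held-out control records are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Record = Dict[str, object]


def singling_out_rate(
    original: List[Record],
    synthetic: List[Record],
    attributes: Sequence[str],
) -> float:
    """Fraction of synthetic records whose values on `attributes`
    match exactly one record in `original` (hypothetical helper)."""
    if not synthetic:
        return 0.0
    hits = 0
    for syn in synthetic:
        key = tuple(syn[a] for a in attributes)
        # Count the original records that share this attribute combination.
        matches = sum(
            1 for rec in original
            if tuple(rec[a] for a in attributes) == key
        )
        if matches == 1:
            hits += 1
    return hits / len(synthetic)


def excess_risk(attack_rate: float, baseline_rate: float) -> float:
    """Attack success beyond a baseline, rescaled to [0, 1].

    Assumption of this sketch: the baseline is the success rate of the
    same attack against control records that were never used to train
    the generator.
    """
    if baseline_rate >= 1.0:
        return 0.0
    return max(0.0, (attack_rate - baseline_rate) / (1.0 - baseline_rate))


# Made-up toy records, not data from the DONALD study.
original = [
    {"age": 5, "sex": "f"},
    {"age": 5, "sex": "m"},
    {"age": 7, "sex": "f"},
    {"age": 7, "sex": "f"},
]
synthetic = [
    {"age": 5, "sex": "f"},  # matches exactly one original record
    {"age": 7, "sex": "f"},  # matches two original records
    {"age": 9, "sex": "m"},  # matches none
]

rate = singling_out_rate(original, synthetic, ["age", "sex"])
```

In this toy setting, outliers are the records most likely to be matched uniquely, which is one way to see why the abstract places a focus on their vulnerability.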