Adams Tim, Birkenbihl Colin, Otte Karen, Ng Hwei Geok, Rieling Jonas Adrian, Näher Anatol-Fiete, Sax Ulrich, Prasser Fabian, Fröhlich Holger
Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany.
Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02115, USA.
iScience. 2025 Apr 14;28(5):112382. doi: 10.1016/j.isci.2025.112382. eCollection 2025 May 16.
The use of synthetic data is a widely discussed and promising solution for privacy-preserving medical research. Synthetic data may, however, not always rule out the risk of re-identifying characteristics of real patients and can vary greatly in terms of data fidelity and utility. We systematically evaluate the trade-offs between privacy, fidelity, and utility across five synthetic data models and three patient-level datasets. We evaluate fidelity based on statistical similarity to the real data, utility on three machine learning use cases, and privacy via membership inference, singling out, and attribute inference risks. Synthetic data without differential privacy (DP) maintained fidelity and utility without evident privacy breaches, whereas DP-enforced models significantly disrupted correlation structures. K-anonymity-based data sanitization of demographic features, while preserving fidelity, introduced notable privacy risks. Our findings emphasize the need to advance methods that effectively balance privacy, fidelity, and utility in synthetic patient data generation.
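To make the fidelity criterion concrete: "statistical similarity to the real data" is typically measured per feature by comparing marginal distributions. The sketch below (my own minimal illustration, not the paper's actual evaluation code) computes the total variation distance between a categorical feature's distribution in real and synthetic records, using only the Python standard library.

```python
from collections import Counter

def marginal_tvd(real_col, synth_col):
    """Total variation distance between the marginal distributions of one
    feature in the real and synthetic datasets. Ranges from 0 (identical
    marginals) to 1 (disjoint); lower values mean higher marginal fidelity."""
    p, q = Counter(real_col), Counter(synth_col)
    n_p, n_q = len(real_col), len(synth_col)
    support = set(p) | set(q)  # all values seen in either dataset
    return 0.5 * sum(abs(p[v] / n_p - q[v] / n_q) for v in support)

# Toy example: a single demographic feature in real vs. synthetic records.
real = ["F", "F", "M", "M", "M", "F", "F", "M"]
synth = ["F", "M", "M", "M", "F", "F", "M", "M"]
print(marginal_tvd(real, synth))  # → 0.125
```

A full fidelity evaluation would aggregate such per-feature scores and additionally compare pairwise correlation structures, which is where the paper reports DP-enforced models degrading most.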