Sella Nadir, Guinot Florent, Lagrange Nikita, Albou Laurent-Philippe, Desponds Jonathan, Isambert Hervé
Institut Roche, Boulogne-Billancourt, France.
Institut Curie, CNRS UMR168, PSL University, Sorbonne University, Paris, 75005, France.
NPJ Digit Med. 2025 Jan 23;8(1):49. doi: 10.1038/s41746-025-01431-6.
Generating synthetic data from medical records is a complex task made harder by patient privacy concerns. In recent years, multiple approaches to synthetic data generation have been reported; however, limited attention has been given to jointly evaluating the quality and the privacy of the generated data. Both the quality and the privacy of synthetic data stem from multivariate associations across variables, which cannot be assessed by comparing univariate distributions with the original data. Here, we introduce a novel algorithm (MIIC-SDG) for generating synthetic data from electronic records based on a multivariate information framework and Bayesian network theory. We also propose a new metric to quantitatively assess the trade-off between the Quality and Privacy Scores (QPS) of synthetic data generation methods. The performance of MIIC-SDG is demonstrated on different clinical datasets and compares favorably with state-of-the-art synthetic data generation methods, based on the QPS trade-off between several quality and privacy metrics.
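The point about univariate comparisons can be illustrated with a toy example. The sketch below is not MIIC-SDG and does not compute the QPS metric; it is a minimal illustration, on invented two-variable data, of how a synthetic dataset can match every univariate distribution of the original while losing the association between variables. The helper names `marginal` and `mutual_information` are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2

def marginal(rows, idx):
    """Empirical distribution of the variable at position idx."""
    n = len(rows)
    return {value: count / n for value, count in Counter(r[idx] for r in rows).items()}

def mutual_information(rows):
    """Empirical mutual information (in bits) between the two columns."""
    n = len(rows)
    joint = Counter(rows)
    px = Counter(r[0] for r in rows)
    py = Counter(r[1] for r in rows)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Toy "original" data: the two binary variables are perfectly associated.
original = [(0, 0), (1, 1)] * 50
# Toy "synthetic" data: same marginals, but the variables are independent.
synthetic = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

print(marginal(original, 0) == marginal(synthetic, 0))  # univariate check on variable 0
print(marginal(original, 1) == marginal(synthetic, 1))  # univariate check on variable 1
print(mutual_information(original), mutual_information(synthetic))
```

A comparison restricted to the marginals would rate the two datasets as identical, whereas the pairwise mutual information separates them, which is why a multivariate assessment is needed.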