Guillaudeux Morgan, Rousseau Olivia, Petot Julien, Bennis Zineb, Dein Charles-Axel, Goronflot Thomas, Vince Nicolas, Limou Sophie, Karakachoff Matilde, Wargny Matthieu, Gourraud Pierre-Antoine
Octopize, Mimethik Data, Nantes, France.
Nantes Université, INSERM, CHU de Nantes, Ecole Centrale de Nantes, Centre de Recherche Translationnelle en Transplantation et Immunologie, CR2TI, Nantes, France.
NPJ Digit Med. 2023 Mar 10;6(1):37. doi: 10.1038/s41746-023-00771-5.
While nearly all computational methods operate on pseudonymized personal data, re-identification remains a risk. With personal health data, re-identification may be considered a betrayal of patients' trust. Herein, we present a new method to generate synthetic data of individual granularity while preserving patients' privacy. Developed for sensitive biomedical data, the method is patient-centric: it uses a local model to generate a random new synthetic record, called an "avatar", for each original sensitive individual. The method is compared with 2 other synthetic data generation techniques (Synthpop, CT-GAN) on real health data from a clinical trial and a cancer observational study, to evaluate the protection it provides while retaining the original statistical information. Compared to Synthpop and CT-GAN, the Avatar method shows a similar level of signal maintenance while allowing additional privacy metrics to be computed. According to distance-based privacy metrics, each individual produces an avatar simulation that is on average indistinguishable from 12 other generated avatar simulations for the clinical trial and 24 for the observational study. Data transformation using the Avatar method preserves both the evaluation of the treatment's effectiveness, with similar hazard ratios for the clinical trial (original HR = 0.49 [95% CI, 0.39-0.63] vs. avatar HR = 0.40 [95% CI, 0.31-0.52]), and the classification properties for the observational study (original AUC = 99.46 (s.e. 0.25) vs. avatar AUC = 99.84 (s.e. 0.12)). Once validated by privacy metrics, anonymous synthetic data enable the creation of value from analyses of sensitive pseudonymized data by addressing the risk of a privacy breach.
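The abstract describes two ideas that a small example may make more concrete: a local model that produces one synthetic record per individual, and a distance-based privacy metric that counts how many other avatars a record hides among. The sketch below is illustrative only and is not the authors' implementation. It assumes purely numeric data, Euclidean distance in the raw feature space, and a random convex combination of the k nearest neighbors as the local model. The function names `make_avatars` and `local_cloaking`, the parameter `k`, and the exponential weights are all hypothetical choices made for this example.

```python
import numpy as np


def make_avatars(X, k=5, seed=0):
    """Hypothetical local generator: one synthetic row per original row."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances between all original records.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a record is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # shape (n, k)
    # Random positive weights, normalized to sum to 1 for each individual
    # (an assumption of this sketch, not a detail taken from the abstract).
    weights = rng.exponential(size=(n, k))
    weights /= weights.sum(axis=1, keepdims=True)
    # Each avatar is a random convex combination of its source's neighbors.
    return np.einsum("ik,ikd->id", weights, X[neighbors])


def local_cloaking(X, avatars):
    """For each original, count the avatars strictly closer than its own."""
    X = np.asarray(X, dtype=float)
    # dist[i, j] is the distance from original i to avatar j.
    dist = np.linalg.norm(X[:, None, :] - avatars[None, :, :], axis=2)
    own = np.diag(dist)  # distance from each original to its own avatar
    return (dist < own[:, None]).sum(axis=1)


if __name__ == "__main__":
    # Toy random data standing in for a numeric patient table.
    X = np.random.default_rng(1).normal(size=(200, 4))
    avatars = make_avatars(X, k=10)
    print("median local cloaking:", np.median(local_cloaking(X, avatars)))
```

In this toy setting, a larger `k` tends to pull each avatar further from its source record, which generally raises the cloaking count at the cost of fidelity to the original distribution. The figures reported in the abstract (12 and 24) come from the authors' own method and data and cannot be reproduced by this sketch.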