Zhang Weili, Liu Ran, Zhu Xinyi, Yu Xiaojin, Jiang Depeng
Department of Epidemiology and Health Statistics, School of Public Health, Southeast University, Nanjing, 210009, China.
Department of Occupational and Environmental Health, School of Public Health, Southeast University, Nanjing, 210009, China.
BMC Med Inform Decis Mak. 2025 Sep 1;25(1):324. doi: 10.1186/s12911-025-03109-1.
Health physical examinations play a crucial role in early detection of cancer and chronic disease. However, privacy concerns limit the utilization of this kind of data for health interventions and research. Synthetic data methods based on differential privacy are increasingly used to create complete datasets that protect privacy while enabling data analysis and result interpretation. Hence, the use of synthetic algorithms based on differential privacy for privacy protection of physical examination data is a promising research direction.
Three synthetic data algorithms (PrivBayes, PeGS, and DP-Gibbs) were used to generate complete synthetic datasets that satisfy differential privacy from physical examination data composed of categorical variables, and the results were compared with those of the existing algorithm Private-PGM.
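As a rough illustration of what a differentially private synthesizer for categorical data does, the sketch below adds Laplace noise to a joint histogram and samples synthetic records from it. This is a generic baseline and not PrivBayes, PeGS, DP-Gibbs, or Private-PGM. The function names, the toy attributes, and the add/remove-one-record neighboring definition (L1 sensitivity 1) are assumptions made for this example.

```python
import itertools
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_synthetic_records(records, domains, epsilon, n_synth, seed=0):
    """Release a noisy joint histogram and sample synthetic records from it.

    records: list of tuples of categorical values (hypothetical toy input)
    domains: list of lists, the public set of levels for each attribute
    epsilon: privacy budget spent on the histogram
    """
    rng = random.Random(seed)
    counts = Counter(records)
    # Every cell of the full public domain gets noise, including empty ones;
    # noising only the observed cells would reveal which combinations occur.
    cells = list(itertools.product(*domains))
    # Assumption: neighboring datasets differ by adding or removing one record,
    # so one count changes by 1 and the Laplace scale is 1 / epsilon.
    noisy = [counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng) for c in cells]
    # Clipping negatives is post-processing and does not weaken the guarantee.
    weights = [max(v, 0.0) for v in noisy]
    if sum(weights) == 0:
        weights = [1.0] * len(cells)  # fall back to a uniform draw
    return rng.choices(cells, weights=weights, k=n_synth)


# Toy data, invented for illustration only.
domains = [["male", "female"], ["normal", "elevated"]]
records = [("male", "normal")] * 40 + [("female", "elevated")] * 10
synthetic = dp_synthetic_records(records, domains, epsilon=0.5, n_synth=5)
```

A full joint histogram grows exponentially with the number of attributes, which is one reason methods built on Bayesian networks or graphical models work with low-dimensional marginals instead.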
DP-Gibbs achieved a privacy-preserving capacity of 4.686 (ε = 0.5), whereas the existing algorithm achieved only 2.012. In addition, DP-Gibbs achieved a precision of 0.620, an F1-score of 0.539, a Kappa coefficient of 0.342, and an AUC of 0.765; the corresponding values for the existing algorithm were 0.520, 0.321, 0.188, and 0.695.
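For readers unfamiliar with the utility metrics reported above, the following sketch shows how precision, F1-score, Cohen's Kappa, and AUC can be computed for a binary classifier. The abstract does not state the prediction task or the evaluation pipeline, so the binary setting, the function names, and the toy labels are all assumptions of this example.

```python
def binary_metrics(y_true, y_pred):
    """Precision, F1-score, and Cohen's Kappa from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Kappa compares observed agreement with agreement expected by chance.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0
    return {"precision": precision, "f1": f1, "kappa": kappa}


def auc_score(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration only.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
metrics = binary_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

The pairwise AUC computation is quadratic in the sample size, which is acceptable for a sketch; a rank-based formula is the usual choice for larger datasets.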
The main contributions of this study are the exploration of models that combine different noise forms with Bayesian synthetic algorithms, and a comparative analysis against an existing algorithm. The study examined the balance between privacy protection and data utility at different levels of privacy protection. DP-Gibbs offers more stable technical support for de-identifying physical examination data prior to sharing and analysis, which allows a wider range of medical data to be mined and applied under privacy protection requirements. By leveraging this privacy protection technique, clinical researchers can extract valuable insights on diseases and population health from physical examination data without the risk of leaking private information.