Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?

Affiliation

Department of Computing, University of Turku, Turku, Finland.

Publication information

Methods Inf Med. 2024 May;63(1-02):35-51. doi: 10.1055/a-2385-1355. Epub 2024 Aug 13.

DOI:10.1055/a-2385-1355
PMID:39137913
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11495942/
Abstract

BACKGROUND

Synthetic data have been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential Privacy (DP) is currently considered the gold standard approach for balancing this trade-off.

OBJECTIVES

The aim of this study is to investigate how trustworthy group differences discovered by independent-sample tests on DP-synthetic data are. The evaluation is carried out in terms of the tests' Type I and Type II errors. The former quantifies the tests' validity, i.e., whether the probability of a false discovery is indeed below the significance level; the latter indicates the tests' power to make real discoveries.
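
For reference, the two error rates can be written in standard notation. Here $H_0$ denotes the null hypothesis of no group difference and $\alpha$ the significance level; the symbol $\beta$ is introduced only for this illustration and does not appear in the abstract.

```latex
\[
\text{Type I error} = \Pr(\text{reject } H_0 \mid H_0 \text{ true}), \qquad
\text{Type II error} = \beta = \Pr(\text{fail to reject } H_0 \mid H_0 \text{ false}), \qquad
\text{power} = 1 - \beta .
\]
```

Under this notation, a test is valid at level $\alpha$ when its Type I error does not exceed $\alpha$.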

METHODS

We evaluate the Mann-Whitney U test, Student's t-test, chi-squared test, and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n = 500) and a cardiovascular dataset (n = 70,000), as well as from bivariate and multivariate simulated data. Five DP-synthetic data generation methods are evaluated: two basic DP histogram release methods and the MWEM, Private-PGM, and DP GAN algorithms.
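
As a rough illustration of the four tests named above, the sketch below runs each of them with SciPy on made-up two-group data. The data, group sizes, and effect sizes are hypothetical and are not taken from the study; the abstract does not say how the tests were configured, so the options used here (two-sided alternative, equal variances) are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical two-group data: one continuous measurement and one binary trait.
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=0.5, scale=1.0, size=200)
binary_a = rng.random(200) < 0.3
binary_b = rng.random(200) < 0.5

# 2x2 contingency table of group versus binary trait for the chi-squared test.
table = [
    [int(binary_a.sum()), int((~binary_a).sum())],
    [int(binary_b.sum()), int((~binary_b).sum())],
]

p_values = {
    "mann_whitney_u": stats.mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue,
    "student_t": stats.ttest_ind(group_a, group_b, equal_var=True).pvalue,
    # median_test and chi2_contingency return (statistic, p-value, ...) sequences.
    "median_test": stats.median_test(group_a, group_b)[1],
    "chi_squared": stats.chi2_contingency(table)[1],
}

for name, p in p_values.items():
    print(f"{name}: p = {p:.4g}")
```

In an evaluation like the one described, the same calls would be applied to the synthetic records rather than to the original ones.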

CONCLUSION

A large portion of the evaluation results showed dramatically inflated Type I errors, especially at privacy levels of ε ≤ 1. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP Smoothed Histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested, but it required a large original dataset and a modest privacy budget (ε ≥ 5) to achieve reasonable Type II error levels.
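
The mechanism behind inflated Type I error can be sketched with a small simulation. This is a simplified illustration under stated assumptions, not the study's procedure: the data are two independent fair binary variables (so the null hypothesis is true), the histogram is privatized with Laplace noise of scale 1/ε (assuming add-or-remove-one-record neighbors, so each count has sensitivity 1), and the rounded, clipped noisy counts are treated directly as the synthetic table. All function names are hypothetical.

```python
import math
import random


def chi2_2x2_pvalue(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) on [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: no evidence against the null
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def laplace(rng, scale):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(counts, epsilon, rng):
    """Add Laplace(1/epsilon) noise per cell, then round and clip at zero."""
    return [max(0, round(c + laplace(rng, 1.0 / epsilon))) for c in counts]


def type1_error(epsilon, n=500, alpha=0.05, n_rep=1000, seed=0):
    """Fraction of null datasets whose DP histogram yields p < alpha."""
    rng = random.Random(seed)
    false_discoveries = 0
    for _ in range(n_rep):
        counts = [0, 0, 0, 0]
        for _ in range(n):
            group = rng.random() < 0.5    # group label
            outcome = rng.random() < 0.5  # independent of group: H0 holds
            counts[2 * group + outcome] += 1
        synthetic = dp_histogram(counts, epsilon, rng)
        false_discoveries += chi2_2x2_pvalue(*synthetic) < alpha
    return false_discoveries / n_rep


rate_strict = type1_error(epsilon=0.1)   # strong privacy, heavy noise
rate_loose = type1_error(epsilon=10.0)   # weak privacy, little noise
print(f"Type I error at epsilon=0.1: {rate_strict:.3f}")
print(f"Type I error at epsilon=10:  {rate_loose:.3f}")
```

In this toy setup the rejection rate at ε = 0.1 tends to come out well above the nominal 0.05, while at ε = 10 it stays close to it, because the test cannot distinguish privacy noise from a real association between group and outcome.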
