Collaborative learning from distributed data with differentially private synthetic data.

Affiliations

Aalto University, Espoo, 00076, Finland.

University of Helsinki, Helsinki, 00014, Finland.

Publication information

BMC Med Inform Decis Mak. 2024 Jun 14;24(1):167. doi: 10.1186/s12911-024-02563-7.

DOI:10.1186/s12911-024-02563-7
PMID:38877563
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11179391/
Abstract

BACKGROUND

Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population-level statistics, but pooling the sensitive data sets is not possible due to privacy concerns, and the parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy-preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank.

METHODS

We perform an empirical evaluation based on an existing prospective cohort study from the literature. We simulate multiple parties by splitting the UK Biobank cohort along assessment centers and generate synthetic data for each party using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis to the combined synthetic data sets and evaluate the effects of 1) the size of the local data set, 2) the number of participating parties, and 3) local shifts in distributions on the obtained likelihood scores.
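The workflow described in the Methods can be illustrated with a toy sketch. It is not the authors' pipeline: the abstract does not specify the generative model, so the sketch substitutes a Laplace-noised contingency table as a simple differentially private synthesizer. The data are simulated rather than taken from the UK Biobank, and all names (`simulate_party`, `dp_synthetic`, `fit_poisson`), the privacy budget `epsilon=1.0`, and the party sizes are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical ground truth: baseline risk 0.1, risk ratio 2 for x = 1.
TRUE_BETA = np.array([np.log(0.1), np.log(2.0)])

def simulate_party(n, rng):
    """Simulate one party's data: binary covariate x, binary outcome y."""
    x = rng.integers(0, 2, size=n)
    y = rng.binomial(1, np.exp(TRUE_BETA[0] + TRUE_BETA[1] * x))
    return x, y

def dp_synthetic(x, y, epsilon, rng):
    """Stand-in synthesizer (an assumption, not the paper's method).

    Builds the 2x2 contingency table of (x, y), adds Laplace noise with
    scale 1/epsilon to every cell (each record falls in exactly one cell,
    so the L1 sensitivity is 1 under add/remove-one-record adjacency),
    then expands the noisy counts back into synthetic records.
    """
    table = np.zeros((2, 2))
    np.add.at(table, (x, y), 1)
    noisy = table + rng.laplace(scale=1.0 / epsilon, size=table.shape)
    counts = np.clip(np.rint(noisy), 0, None).astype(int)
    cells = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])  # row-major order
    synth = np.repeat(cells, counts.ravel(), axis=0)
    return synth[:, 0], synth[:, 1]

def fit_poisson(x, y, n_iter=25):
    """Poisson regression with log link, fitted by Newton's method."""
    X = np.column_stack([np.ones(len(x)), x]).astype(float)
    beta = np.array([np.log(np.mean(y)), 0.0])  # start at the overall rate
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)            # score vector
        hess = X.T @ (X * mu[:, None])   # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
parties = [simulate_party(500, rng) for _ in range(5)]

# Each party releases only a synthetic copy of its data.
synthetic = [dp_synthetic(x, y, epsilon=1.0, rng=rng) for x, y in parties]

# Pool the released synthetic data sets and rerun the same analysis.
x_pool = np.concatenate([s[0] for s in synthetic])
y_pool = np.concatenate([s[1] for s in synthetic])

beta_local = fit_poisson(*parties[0])      # one party, real data only
beta_pooled = fit_poisson(x_pool, y_pool)  # all parties, synthetic data

print("true  :", TRUE_BETA)
print("local :", beta_local)
print("pooled:", beta_pooled)
```

Comparing `beta_local` and `beta_pooled` against `TRUE_BETA` over many random seeds and party sizes would mimic, in miniature, the kind of comparison the study reports. A single run of this sketch should not be read as evidence either way.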

RESULTS

We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups.

CONCLUSIONS

Based on our results, we conclude that sharing synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints, even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.

Figures

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/236c/11179391/9b8bde0bc989/12911_2024_2563_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/236c/11179391/d52f23861f29/12911_2024_2563_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/236c/11179391/091227df3100/12911_2024_2563_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/236c/11179391/bcf034641a66/12911_2024_2563_Fig4_HTML.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/236c/11179391/b3ae8cf6756d/12911_2024_2563_Fig5_HTML.jpg
Fig. 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/236c/11179391/42d1063d6e30/12911_2024_2563_Fig6_HTML.jpg
