Fold-stratified cross-validation for unbiased and privacy-preserving federated learning.

Affiliations

Centre of Research in Epidemiology and Statistics (CRESS), Université de Paris, French Institute of Health and Medical Research (INSERM), National Institute of Agricultural Research (INRA), Paris, France.

CIC 1413, Center for Research in Cancerology and Immunology Nantes-Angers (CRCINA), Dermatology Department, Centre Hospitalier Universitaire Nantes, Nantes University, Nantes, France.

Publication information

J Am Med Inform Assoc. 2020 Aug 1;27(8):1244-1251. doi: 10.1093/jamia/ocaa096.

DOI: 10.1093/jamia/ocaa096
PMID: 32620945
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7647321/
Abstract

OBJECTIVE

We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).

MATERIALS AND METHODS

Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.
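
The fold assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names `fold_of` and `fold_stratified_splits`, the use of SHA-256, and the sample dates of birth are all assumptions made for the example. The idea is that the fold index is a deterministic function of the stratifying covariate, so every site can compute it locally and duplicates of a record always fall in the same fold.

```python
import hashlib

def fold_of(covariate: str, n_folds: int = 5) -> int:
    """Map a stratifying covariate (e.g. a date of birth) to a fold index.

    Hypothetical helper: a cryptographic hash is used only so that every
    site derives the same fold from the same covariate value.
    """
    digest = hashlib.sha256(covariate.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds

def fold_stratified_splits(covariates, n_folds=5):
    """Yield (train_indices, valid_indices) pairs, one pair per fold."""
    folds = [fold_of(c, n_folds) for c in covariates]
    for k in range(n_folds):
        train = [i for i, f in enumerate(folds) if f != k]
        valid = [i for i, f in enumerate(folds) if f == k]
        yield train, valid

# Toy records: indices 0 and 1 are duplicates, as are indices 2 and 4.
covariates = ["1950-03-14", "1950-03-14", "1987-11-02", "2001-07-30", "1987-11-02"]
splits = list(fold_stratified_splits(covariates, n_folds=3))
```

Because the fold depends only on the covariate, no split places one copy of a record in training and the other in validation. Unrelated patients who share a covariate value are also grouped together, which is consistent with the abstract's description of folds containing patients with similar characteristics.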

RESULTS

In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.
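
The overoptimistic bias can be reproduced with a toy experiment. The sketch below is not the paper's Monte Carlo simulation: it assumes a 1-nearest-neighbour classifier, labels that are pure noise (so the honest accuracy is about 0.5), and a dataset in which every record appears twice. The patient identifier `pid` stands in for a stratifying covariate that is shared by duplicates and unrelated to the outcome.

```python
import random

def one_nn_predict(train, x):
    """Return the label of the training record whose feature is closest to x."""
    return min(train, key=lambda r: abs(r[0] - x))[1]

def cv_accuracy(records, fold_ids, n_folds):
    """Cross-validated accuracy of the 1-NN rule for a given fold assignment."""
    correct = 0
    for k in range(n_folds):
        train = [(x, y) for (x, y, _), f in zip(records, fold_ids) if f != k]
        valid = [(x, y) for (x, y, _), f in zip(records, fold_ids) if f == k]
        correct += sum(one_nn_predict(train, x) == y for x, y in valid)
    return correct / len(records)

rng = random.Random(0)
n_folds = 5

# Each patient: (feature, random label, patient id). Labels carry no signal.
patients = [(rng.random(), rng.randint(0, 1), pid) for pid in range(200)]
records = patients + patients  # every record is duplicated once

# Standard cross-validation: folds drawn at random, duplicates may be separated.
naive_folds = [rng.randrange(n_folds) for _ in records]
# Fold-stratified: the fold is a function of the covariate shared by duplicates.
strat_folds = [pid % n_folds for _, _, pid in records]

naive_acc = cv_accuracy(records, naive_folds, n_folds)
strat_acc = cv_accuracy(records, strat_folds, n_folds)
print(f"random folds: {naive_acc:.2f}, fold-stratified: {strat_acc:.2f}")
```

With random folds, a validation record often finds its own duplicate in the training set and is classified correctly by memorization, so the estimated accuracy is inflated well above chance. With fold-stratified assignment the duplicate is never in training, and the estimate stays near chance without any deduplication step.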

DISCUSSION

Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.
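
The interference from pseudonymization can be illustrated as follows. This is a sketch under stated assumptions: `shift_date` is a hypothetical pseudonymization that shifts dates by a site-specific offset, the offsets and the date of birth are invented, and `fold_of` is the same illustrative hash-based fold assignment as a site might use locally.

```python
import hashlib
from datetime import date, timedelta

def fold_of(covariate: str, n_folds: int = 5) -> int:
    """Hypothetical helper: derive a fold index from a covariate value."""
    digest = hashlib.sha256(covariate.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds

def shift_date(dob: date, offset_days: int) -> date:
    """Hypothetical pseudonymization: shift a date by a site-specific offset."""
    return dob + timedelta(days=offset_days)

dob = date(1950, 3, 14)  # the same patient is recorded at two sites

# Without pseudonymization both sites see the same covariate, hence the same fold.
site_a_raw = dob.isoformat()
site_b_raw = dob.isoformat()
same_fold_raw = fold_of(site_a_raw) == fold_of(site_b_raw)

# With site-specific date shifting the covariate no longer matches across sites,
# so nothing guarantees that the two copies end up in the same fold.
site_a_shifted = shift_date(dob, 17).isoformat()
site_b_shifted = shift_date(dob, -42).isoformat()
covariates_match = site_a_shifted == site_b_shifted
```

Here `same_fold_raw` is true and `covariates_match` is false: once the stratifying covariate differs between duplicates, the two copies may be assigned to different folds and the leakage protection is lost.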

CONCLUSION

Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.
