Fold-stratified cross-validation for unbiased and privacy-preserving federated learning.

Affiliations

Centre of Research in Epidemiology and Statistics (CRESS), Université de Paris, French Institute of Health and Medical Research (INSERM), National Institute of Agricultural Research (INRA), Paris, France.

CIC 1413, Center for Research in Cancerology and Immunology Nantes-Angers (CRCINA), Dermatology Department, Centre Hospitalier Universitaire Nantes, Nantes University, Nantes, France.

Publication information

J Am Med Inform Assoc. 2020 Aug 1;27(8):1244-1251. doi: 10.1093/jamia/ocaa096.

Abstract

OBJECTIVE

We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).

MATERIALS AND METHODS

Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.
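The fold assignment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout, the choice of date of birth as the stratifying covariate, the modulo mapping, and the names `fold_of` and `fold_stratified_splits` are all assumptions made for illustration. The point it shows is that the fold index depends only on the covariate, so each site can compute it locally and duplicates of a record always land in the same fold.

```python
from collections import defaultdict
from datetime import date


def fold_of(birth_date: date, n_folds: int) -> int:
    """Map the stratifying covariate (assumed here: date of birth) to a fold index."""
    return birth_date.toordinal() % n_folds


def fold_stratified_splits(records, n_folds):
    """Yield (train, validation) lists; records sharing a covariate value stay together."""
    folds = defaultdict(list)
    for rec in records:
        folds[fold_of(rec["birth_date"], n_folds)].append(rec)
    for k in range(n_folds):
        validation = folds[k]
        train = [r for j in range(n_folds) if j != k for r in folds[j]]
        yield train, validation


# Hypothetical records: a1 and a2 are duplicates held by two different sites.
records = [
    {"id": "a1", "birth_date": date(1950, 3, 14), "site": "hospital_1"},
    {"id": "a2", "birth_date": date(1950, 3, 14), "site": "hospital_2"},
    {"id": "b1", "birth_date": date(1972, 11, 2), "site": "hospital_1"},
    {"id": "c1", "birth_date": date(1988, 7, 30), "site": "hospital_2"},
]
splits = list(fold_stratified_splits(records, n_folds=3))
```

Because no site needs to see another site's records to compute a fold index, the split is compatible with a federated setting, and no explicit deduplication step is involved.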

RESULTS

In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.

DISCUSSION

Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.
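As a sketch of the hashed-identifier option, the snippet below derives a fold index from a personal identifier. The identifier strings, the use of SHA-256, and the function name `fold_from_identifier` are assumptions for illustration; the abstract does not specify a hash function. A cryptographic digest is used instead of Python's built-in `hash()`, because the latter is randomized per process for strings and would not give the same fold at different sites.

```python
import hashlib


def fold_from_identifier(identifier: str, n_folds: int) -> int:
    """Derive a fold index from a stable hash of a personal identifier (assumed format)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a fold index.
    return int.from_bytes(digest[:8], "big") % n_folds


# The same (hypothetical) identifier seen at two sites maps to the same fold.
fold_site_1 = fold_from_identifier("patient-0001", n_folds=5)
fold_site_2 = fold_from_identifier("patient-0001", n_folds=5)

# If each site replaced the identifier with its own pseudonym, the inputs would
# differ and the duplicates would no longer be guaranteed to share a fold.
fold_pseudonym_1 = fold_from_identifier("site1-pseudo-93f2", n_folds=5)
fold_pseudonym_2 = fold_from_identifier("site2-pseudo-07ab", n_folds=5)
```

The last two calls illustrate why site-specific pseudonymization interferes with the method: the stratifying value is no longer equal across duplicates, so the fold assignment may diverge.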

CONCLUSION

Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.
