Fold-stratified cross-validation for unbiased and privacy-preserving federated learning.

Affiliations

Centre of Research in Epidemiology and Statistics (CRESS), Université de Paris, French Institute of Health and Medical Research (INSERM), National Institute of Agricultural Research (INRA), Paris, France.

CIC 1413, Center for Research in Cancerology and Immunology Nantes-Angers (CRCINA), Dermatology Department, Centre Hospitalier Universitaire Nantes, Nantes University, Nantes, France.

Publication information

J Am Med Inform Assoc. 2020 Aug 1;27(8):1244-1251. doi: 10.1093/jamia/ocaa096.

DOI:10.1093/jamia/ocaa096

PMID:32620945

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7647321/

Abstract

OBJECTIVE

We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).

MATERIALS AND METHODS

Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.
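The fold assignment described above can be illustrated with a minimal sketch. It assumes that the stratifying covariate is a date of birth stored as a string and that folds are derived by hashing that value modulo the number of folds; the abstract does not specify this mapping, so the function names, the record layout, and the hashing rule are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict


def fold_of(covariate: str, k: int) -> int:
    """Map a stratifying covariate value to one of k folds, deterministically.

    Assumption: hashing modulo k is one possible mapping, not the authors' rule.
    """
    digest = hashlib.sha256(covariate.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % k


def fold_stratified_splits(records, k, key):
    """Yield (train, validation) lists, one pair per non-empty fold."""
    folds = defaultdict(list)
    for rec in records:
        folds[fold_of(key(rec), k)].append(rec)
    for held_out in sorted(folds):
        train = [r for f, recs in folds.items() if f != held_out for r in recs]
        yield train, folds[held_out]


# Hypothetical records held by two sites; a1 and b1 are duplicates of one patient.
site_a = [{"id": "a1", "dob": "1950-03-12"}, {"id": "a2", "dob": "1982-11-30"}]
site_b = [{"id": "b1", "dob": "1950-03-12"}, {"id": "b2", "dob": "1971-07-04"}]
records = site_a + site_b

K = 5
splits = list(fold_stratified_splits(records, K, key=lambda r: r["dob"]))

# Records sharing a date of birth never straddle training and validation.
for train, validation in splits:
    train_dobs = {r["dob"] for r in train}
    assert all(r["dob"] not in train_dobs for r in validation)
```

Because the fold depends only on the covariate value, each site could compute it locally on its own records without exchanging them, and duplicates held at different sites still end up in the same fold.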

RESULTS

In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.

DISCUSSION

Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.
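The contrast between a shared hash and pseudonymization can be sketched as follows. This is an illustration under stated assumptions: pseudonymization is modeled here as a site-specific salted hash, which is only one way it might be implemented, and the identifiers, salts, and function names are hypothetical.

```python
import hashlib


def hashed_id(identifier: str, salt: str = "") -> str:
    """Hash a personal identifier, optionally with a salt (hypothetical scheme)."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()


def fold_of(value: str, k: int) -> int:
    """Deterministic fold index derived from a covariate value."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16) % k


K = 5
patients = [f"patient-{i}" for i in range(200)]  # hypothetical identifiers

# Same unsalted hash at every site: duplicates share the covariate and the fold.
agree_shared = sum(
    fold_of(hashed_id(p), K) == fold_of(hashed_id(p), K) for p in patients
)

# Site-specific salts (assumed model of pseudonymization): the covariate differs
# between sites, so duplicates agree on a fold only by chance.
agree_salted = sum(
    fold_of(hashed_id(p, "site-A"), K) == fold_of(hashed_id(p, "site-B"), K)
    for p in patients
)

print(f"same fold with shared hash: {agree_shared}/{len(patients)}")
print(f"same fold with site-specific salts: {agree_salted}/{len(patients)}")
```

With the shared hash every duplicate pair lands in the same fold, whereas with site-specific salts agreement is expected only for roughly one pair in K, which is the leakage that fold stratification is meant to prevent.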

CONCLUSION

Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.
