Department of Computer Science, University of Toronto, Canada; Peter Munk Cardiac Centre, University Health Network, Canada; Vector Institute, Toronto, Canada.
Vector Institute, Toronto, Canada; CISPA Helmholtz Center for Information Security, Germany; Department of Electrical and Computer Engineering, University of Toronto, Canada.
EBioMedicine. 2024 Mar;101:105006. doi: 10.1016/j.ebiom.2024.105006. Epub 2024 Feb 19.
Machine Learning (ML) has demonstrated great potential in medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. Sharing data across healthcare institutions or jurisdictions is challenging because privacy and regulatory requirements are complex and vary from place to place. It is therefore difficult, but crucial, to allow multiple parties to collaboratively train an ML model on the private datasets each party holds, without sharing those datasets directly and without compromising their privacy through the collaboration itself.
In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). This framework offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets (i.e., no data centralization); (2) it safeguards patients' privacy by limiting the potential privacy leakage arising from any contents shared across the parties during the training process; and (3) it facilitates the ML model training without relying on a centralized party/server.
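The three properties listed above can be illustrated with a toy simulation. The sketch below is not the DeCaPH protocol itself, whose details are not given in this abstract; it assumes a common construction in which each party clips its per-example gradients, adds Gaussian noise locally, and shares only the noisy sum. All names (`clipped_grad_sum`, `collaborative_step`, `make_parties`, `train`), the logistic-regression model, the synthetic data, and the hyperparameters are hypothetical. The plain in-process summation stands in for whatever aggregation mechanism a real deployment would use between parties.

```python
import numpy as np


def clipped_grad_sum(w, X, y, clip_norm):
    """Sum of per-example logistic-loss gradients, each clipped to clip_norm."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grads = (p - y)[:, None] * X  # one gradient row per example, shape (n, d)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return (grads * scale).sum(axis=0)


def collaborative_step(w, parties, clip_norm, noise_mult, lr, rng):
    """One training round: raw data stays local, only noisy sums are combined."""
    k = len(parties)
    total = np.zeros_like(w)
    n_total = 0
    for X, y in parties:
        g = clipped_grad_sum(w, X, y, clip_norm)
        # Assumption: each party adds a 1/sqrt(k) share of the noise, so the
        # combined sum carries noise with std noise_mult * clip_norm.
        g = g + rng.normal(0.0, noise_mult * clip_norm / np.sqrt(k), size=w.shape)
        total += g  # stand-in for an aggregation step between parties
        n_total += len(y)
    return w - lr * total / n_total


def make_parties(rng, sizes, d=5):
    """Synthetic, linearly separable data split across hypothetical parties."""
    w_true = rng.normal(size=d)
    parties = []
    for n in sizes:
        X = rng.normal(size=(n, d))
        y = (X @ w_true > 0).astype(float)
        parties.append((X, y))
    return parties


def accuracy(w, parties):
    X = np.vstack([X for X, _ in parties])
    y = np.concatenate([y for _, y in parties])
    return float(np.mean(((X @ w) > 0).astype(float) == y))


def train(seed=0, rounds=200, clip_norm=1.0, noise_mult=1.0, lr=0.5):
    rng = np.random.default_rng(seed)
    parties = make_parties(rng, sizes=[100, 200, 300])
    w = np.zeros(parties[0][0].shape[1])
    for _ in range(rounds):
        w = collaborative_step(w, parties, clip_norm, noise_mult, lr, rng)
    return w, parties


if __name__ == "__main__":
    w, parties = train()
    print("pooled accuracy:", accuracy(w, parties))
```

Clipping bounds how much any single record can move the shared update, and the added noise masks what remains; a formal privacy guarantee would additionally require a privacy accountant, which this sketch omits.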
We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. Models trained with the DeCaPH framework show a performance drop of less than 3.2% compared with those trained with the non-privacy-preserving collaborative framework. Meanwhile, their average vulnerability to privacy attacks decreases by up to 16%. In addition, models trained with DeCaPH outperform models trained solely on the private datasets of individual parties without collaboration by up to 70%, and outperform models trained with the previous privacy-preserving collaborative training framework under the same privacy guarantee by up to 18.2%.
We demonstrate that ML models trained with the DeCaPH framework have an improved utility-privacy trade-off: DeCaPH enables the models to perform well while preserving the privacy of the training data points. In addition, ML models trained with the DeCaPH framework generally outperform those trained solely on the private datasets of individual parties, showing that DeCaPH enhances model generalizability.
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC, RGPIN-2020-06189 and DGECR-2020-00294), Canadian Institute for Advanced Research (CIFAR) AI Catalyst Grants, CIFAR AI Chair programs, Temerty Professor of AI Research and Education in Medicine, University of Toronto, Amazon, Apple, DARPA through the GARD project, Intel, Meta, the Ontario Early Researcher Award, and the Sloan Foundation. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.