Elhussein Ahmed, Gürsoy Gamze
Department of Biomedical Informatics, Columbia University, New York Genome Center, New York City, NY, U.S.A.
Department of Biomedical Informatics, Department of Computer Science, Columbia University, New York Genome Center, New York City, NY, U.S.A.
Proc Mach Learn Res. 2023;219:150-166.
Federated Learning (FL) is a machine learning framework that enables multiple organizations to train a model without sharing their data with a central server. However, its performance degrades significantly when the data are not independent and identically distributed (non-IID). This is a problem in medical settings, where variations in patient populations contribute significantly to distribution differences across hospitals. Personalized FL addresses this issue by accounting for site-specific distribution differences. Clustered FL, a Personalized FL variant, has been used to address this problem by clustering patients into groups across hospitals and training separate models on each group. However, privacy concerns remain a challenge because the clustering process requires the exchange of patient-level information. Previous work addressed this by forming clusters from aggregated data, which led to inaccurate groups and performance degradation. In this study, we propose Privacy-preserving Community-Based Federated machine Learning (PCBFL), a novel Clustered FL framework that can cluster patients using patient-level data while protecting privacy. PCBFL uses Secure Multiparty Computation, a cryptographic technique, to securely calculate patient-level similarity scores across hospitals. We evaluate PCBFL by training a federated mortality prediction model using 20 sites from the eICU dataset and compare its performance gain against traditional and existing Clustered FL frameworks. Our results show that PCBFL successfully forms clinically meaningful cohorts of low, medium, and high-risk patients. PCBFL outperforms traditional and existing Clustered FL frameworks, with an average AUC improvement of 4.3% and AUPRC improvement of 7.8%.
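To make the idea of a securely computed patient-level similarity score concrete, the following is a minimal sketch and not the protocol used in PCBFL. The abstract does not specify the similarity measure or the Secure Multiparty Computation scheme, so everything here is an assumption chosen for illustration: cosine similarity as the score, two-party additive secret sharing over a prime field, Beaver multiplication triples from a trusted dealer, and fixed-point encoding. All names (`secure_cosine`, `beaver_triple`, `PRIME`, `SCALE`) are hypothetical, and both hospitals are simulated in a single process.

```python
import math
import secrets

PRIME = 2**61 - 1   # field modulus (assumed; any sufficiently large prime works)
SCALE = 10**4       # fixed-point scale for encoding real-valued features (assumed)

def encode(x):
    """Map a real number to a fixed-point field element."""
    return round(x * SCALE) % PRIME

def decode(v, scale):
    """Map a field element back to a signed real number."""
    if v > PRIME // 2:
        v -= PRIME
    return v / scale

def share(v):
    """Split v into two additive shares that sum to v modulo PRIME."""
    r = secrets.randbelow(PRIME)
    return r, (v - r) % PRIME

def beaver_triple():
    """Trusted-dealer stand-in: shares of random a, b and of c = a * b."""
    a = secrets.randbelow(PRIME)
    b = secrets.randbelow(PRIME)
    return share(a), share(b), share(a * b % PRIME)

def normalize(vec):
    """Each hospital scales its own patient vector to unit length locally."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def secure_cosine(x, y):
    """Cosine similarity of x (hospital A) and y (hospital B).

    Index 0 marks values held by party 0 and index 1 those held by party 1.
    Only the masked differences d and e, and the final result, are opened.
    """
    z0 = z1 = 0
    for xi, yi in zip(normalize(x), normalize(y)):
        x0, x1 = share(encode(xi))
        y0, y1 = share(encode(yi))
        (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
        # Open d = x - a and e = y - b; a and b are uniform masks.
        d = (x0 - a0 + x1 - a1) % PRIME
        e = (y0 - b0 + y1 - b1) % PRIME
        # x * y = c + d*b + e*a + d*e, accumulated share-wise.
        z0 = (z0 + c0 + d * b0 + e * a0 + d * e) % PRIME
        z1 = (z1 + c1 + d * b1 + e * a1) % PRIME
    # Each product carries two fixed-point factors, hence SCALE squared.
    return decode((z0 + z1) % PRIME, SCALE * SCALE)
```

For example, `secure_cosine([3.0, 4.0], [4.0, 3.0])` reconstructs a value close to the plaintext cosine similarity of the two vectors, up to fixed-point rounding. In a real deployment the two parties would run on separate machines, the triples would come from an offline phase rather than a function call, and the resulting similarity matrix would feed the clustering step.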