Kumar Sourav, Lakshminarayanan A, Chang Ken, Guretno Feri, Mien Ivan Ho, Kalpathy-Cramer Jayashree, Krishnaswamy Pavitra, Singh Praveer
Department of Radiology, Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Boston, MA, USA.
Institute for Infocomm Research, A*STAR, Singapore.
Distrib Collab Fed Learn Afford AI Healthc Resour Div Glob Health (2022). 2022 Sep;13573:119-129. doi: 10.1007/978-3-031-18523-6_12. Epub 2022 Oct 7.
Federated Learning (FL), in which multiple institutions collaboratively train a machine learning model without sharing data, is becoming popular. Participating institutions might not contribute equally: some contribute more data, some better-quality data, and some more diverse data. To rank the contributions of different institutions fairly, the Shapley value (SV) has emerged as the method of choice. Exact SV computation is prohibitively expensive, especially when there are hundreds of contributors, so existing SV computation techniques rely on approximations. In healthcare, however, the number of contributing institutions is unlikely to reach that scale, and computing exact SVs is still very expensive but not impossible. For such settings, we propose an efficient SV computation technique called SaFE (Shapley Value for Federated Learning using Ensembling). We empirically show that SaFE computes values that are close to exact SVs and that it performs better than current SV approximations. This is particularly relevant in medical imaging settings, where heterogeneity across institutions is widespread and fast, accurate data valuation is required to determine the contribution of each participant in multi-institutional collaborative learning.
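To make the cost of exact SV computation concrete, the following is a minimal Python sketch of the standard exact Shapley value formula for a small number of institutions. It is not the SaFE method, which the abstract does not specify in detail. The institution names, the `DATA_SIZE` table, and `toy_utility` are hypothetical stand-ins; in an FL setting the utility of a coalition would typically be the performance of a model trained by that coalition, so every utility call corresponds to one training run.

```python
from itertools import combinations
from math import factorial, sqrt


def exact_shapley(players, utility):
    """Exact Shapley values by enumerating every coalition.

    `utility` maps a frozenset of players to a number. With n players,
    utility is requested for all 2**n coalitions, which is why exact
    computation becomes infeasible as n grows.
    """
    n = len(players)
    values = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that excludes p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                values[p] += weight * (utility(s | {p}) - utility(s))
    return values


# Hypothetical data sizes for three institutions (illustration only).
DATA_SIZE = {"A": 100, "B": 300, "C": 600}


def toy_utility(coalition):
    # Assumed stand-in for model performance: diminishing returns in the
    # total amount of data. A real utility would train and evaluate a model.
    return sqrt(sum(DATA_SIZE[p] for p in coalition) / 1000.0)


shapley = exact_shapley(list(DATA_SIZE), toy_utility)
for name, value in shapley.items():
    print(f"{name}: {value:.4f}")
```

In this toy setup the values sum to the utility of the full coalition (the efficiency property of the Shapley value), and institutions with more data receive larger values. Caching utility results per coalition would avoid repeated evaluations when each call is an expensive training run.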