Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Charlestown, Massachusetts; Center for Clinical Data Science, Massachusetts General Hospital and Brigham and Women's Hospital, Boston, Massachusetts.
Department of Ophthalmology, Oregon Health and Science University, Portland, Oregon.
Ophthalmol Retina. 2022 Aug;6(8):657-663. doi: 10.1016/j.oret.2022.02.015. Epub 2022 Mar 14.
To compare the performance of deep learning classifiers for the diagnosis of plus disease in retinopathy of prematurity (ROP) trained using 2 methods for developing models on multi-institutional data sets: centralizing the data versus federated learning (FL), in which no data leave each institution.
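The FL arrangement referenced above can be illustrated with a minimal federated averaging (FedAvg-style) sketch in Python; this is a generic illustration under assumed names (federated_average, local_models), not the study's implementation:

    # Minimal FedAvg-style aggregation (illustrative sketch, not the study's code).
    # Each institution trains locally and shares only model weights; images stay on site.
    import copy
    import torch  # state dicts below are assumed to hold PyTorch tensors

    def federated_average(local_state_dicts, n_samples):
        """Average model weights across sites, weighted by local training set size."""
        total = float(sum(n_samples))
        avg = copy.deepcopy(local_state_dicts[0])
        for key in avg:
            avg[key] = sum(sd[key].float() * (n / total)
                           for sd, n in zip(local_state_dicts, n_samples))
        return avg

    # Hypothetical usage after one round of local training at each of 7 sites:
    # global_model.load_state_dict(
    #     federated_average([m.state_dict() for m in local_models], site_sizes))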
Evaluation of a diagnostic test or technology.
Deep learning models were trained, validated, and tested on 5255 wide-angle retinal images collected in the neonatal intensive care units of 7 institutions as part of the Imaging and Informatics in ROP study. Each image received both a clinical label and a reference standard diagnosis (RSD) for the presence of plus, preplus, or no plus disease; the RSD was determined by 3 image-based ROP graders together with the clinical diagnosis.
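The exact RSD adjudication rule is described in the full paper; one plausible, purely hypothetical consensus scheme over the 3 image-based grades and the clinical diagnosis is a severity-biased majority vote, sketched below:

    # Hypothetical consensus labeling (the study's actual RSD rule may differ);
    # majority vote over 3 expert grades plus the clinical diagnosis.
    from collections import Counter

    LEVELS = ["no plus", "preplus", "plus"]  # ordered by increasing severity

    def consensus_label(grades):
        """grades: labels from the 3 image graders and the clinical diagnosis;
        ties are broken toward the more severe category."""
        counts = Counter(grades)
        best = max(counts.values())
        tied = [g for g, c in counts.items() if c == best]
        return max(tied, key=LEVELS.index)  # most severe among the modes

    print(consensus_label(["plus", "preplus", "plus", "preplus"]))  # -> "plus"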
We compared the area under the receiver operating characteristic curve (AUROC) for models developed on multi-institutional data, first using a centralized approach and then using FL, and compared locally trained models with both approaches. Using the Spearman correlation coefficient (CC), we related model performance (κ) to the agreement between clinical and RSD labels, the training set size, and the number of plus disease cases in each training cohort.
Model performance, measured by the AUROC and the linearly weighted κ.
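Both outcome measures have standard open-source implementations; a minimal sketch with scikit-learn (toy numbers, not study data) is:

    # Sketch of the two reported metrics (illustrative only).
    from sklearn.metrics import roc_auc_score, cohen_kappa_score

    # Hypothetical model outputs: probability of plus disease per image, and
    # discrete 3-level predictions (0 = no plus, 1 = preplus, 2 = plus).
    y_true_binary = [0, 0, 1, 1, 0, 1]              # plus vs. not plus
    y_score       = [0.1, 0.4, 0.8, 0.9, 0.2, 0.6]  # predicted probabilities
    y_true_3class = [0, 1, 2, 2, 0, 2]
    y_pred_3class = [0, 1, 2, 1, 0, 2]

    auroc = roc_auc_score(y_true_binary, y_score)
    kappa = cohen_kappa_score(y_true_3class, y_pred_3class, weights="linear")
    print(f"AUROC = {auroc:.3f}, linearly weighted kappa = {kappa:.3f}")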
Four experimental settings were compared: FL versus centralized training, both on RSD labels; FL versus centralized training, both on clinical labels; FL on RSD labels versus centralized training on clinical labels; and FL on clinical labels versus centralized training on RSD labels (P = 0.046, P = 0.126, P = 0.224, and P = 0.0173, respectively). Four of the 7 (57%) models trained on local institutional data performed worse than the FL models. Local model performance was positively correlated with the agreement between clinical and RSD labels (CC = 0.389, P = 0.387), the total number of plus cases (CC = 0.759, P = 0.047), and the overall training set size (CC = 0.924, P = 0.002).
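The correlation analysis above corresponds to a standard Spearman rank test; a sketch with SciPy, using invented placeholder values rather than the study's data, is:

    # Sketch of the correlation analysis (values below are invented placeholders).
    from scipy.stats import spearmanr

    local_model_kappa = [0.55, 0.61, 0.70, 0.72, 0.78, 0.81, 0.85]  # 7 sites
    training_set_size = [200, 350, 500, 650, 800, 1100, 1400]

    cc, p = spearmanr(local_model_kappa, training_set_size)
    print(f"Spearman CC = {cc:.3f}, P = {p:.3f}")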
We found that a trained FL model performs comparably to a centralized model, suggesting that FL may provide an effective, more feasible solution for interinstitutional learning. Smaller institutions benefited more from collaboration than larger ones, showing the potential of FL to address disparities in access to resources.