Chen Kun-Yi, Shyu Chi-Ren, Tsai Yuan-Yu, Baskett William I, Chang Chi-Yu, Chou Che-Yi, Tsai Jeffrey J P, Shae Zon-Yin
Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211 USA.
Department of Computer Science and Information Engineering, Asia University, Taichung, 413305 Taiwan.
J Healthc Inform Res. 2025 Mar 22;9(3):437-464. doi: 10.1007/s41666-025-00195-8. eCollection 2025 Sep.
Building unbiased and robust machine learning models using datasets from multiple healthcare systems is critical for addressing the needs of diverse patient populations. However, variations in patient demographics and healthcare protocols across systems often lead to significant differences in data distributions. Such non-independent and identically distributed (non-IID) data present a major challenge in developing effective federated learning (FL) frameworks. This study proposes a method to estimate the non-IID degree between datasets and introduces three metrics (variability, separability, and computational time) to evaluate and compare the performance of non-IID degree estimation methods. We developed a novel non-IID FL algorithm that incorporates the proposed non-IID degree estimation index as a regularization term into existing FL algorithms for acute kidney injury (AKI) risk prediction. Our results demonstrate that the proposed method for estimating non-IID degree outperforms previous approaches: it effectively identifies differences in data distributions between datasets, consistently produces similar estimates of non-IID degree across different subsamples of the same dataset, requires significantly less computational time, and provides better interpretability. Finally, we showed that the proposed non-IID FL algorithm achieves higher test accuracy than local learning, concurrent FL algorithms, and centralized learning for the AKI prediction task.