Peng Junfeng, Liu Xujiang, Cai Ziwei, Huang Yuanpei, Lin Jiayi, Zhou Mi, Xiao Zhenpei, Lai Huifang, Cao Zhihao, Peng Hui, Wang Jihong, Xu Jun
Department of Computer Science and Engineering, Guangdong University of Education, Guangzhou 510303, China.
Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou 510640, China.
Heliyon. 2024 Jun 28;10(13):e33566. doi: 10.1016/j.heliyon.2024.e33566. eCollection 2024 Jul 15.
Chronic obstructive pulmonary disease (COPD) has high prevalence, morbidity, and mortality and is clinically heterogeneous, so the data generated by patient visits are scattered across different medical units. The high cost of integrating these scattered data for analysis and modeling, together with the legal requirement to protect patient privacy, has led to the emergence of data islands.
Integrating the scattered data of patients from different medical units for high-quality modeling, while protecting patient privacy, would help advance digital health. To this end, we develop a distributed COPD diagnosis system, termed COPD average federated learning (COPD_AVG_FL), using the federated averaging algorithm (FedAvg).
First, to build COPD_AVG_FL, we collect real-world clinical data from COPD patients and pre-process it to handle incorrect entries, outlier samples, and missing values. Then, we design COPD_AVG_FL as a classical federated learning architecture. Finally, we develop a Centralized Machine Learning (CML) model to evaluate the established COPD_AVG_FL system.
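The aggregation step at the core of FedAvg can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each medical unit is assumed to send back a flat list of model parameters together with its local sample count, and the server combines them as a sample-size-weighted average. The function name `fed_avg` and its arguments are hypothetical.

```python
from typing import List, Sequence


def fed_avg(
    client_params: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Sample-size-weighted average of client parameter vectors.

    client_params: one flat parameter vector per client (hypothetical layout).
    client_sizes:  number of local training samples held by each client.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client, and at least one client")

    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    dim = len(client_params[0])
    averaged = [0.0] * dim
    for params, n in zip(client_params, client_sizes):
        if len(params) != dim:
            raise ValueError("all clients must share the same parameter shape")
        weight = n / total  # clients with more data contribute more
        for i, p in enumerate(params):
            averaged[i] += weight * p
    return averaged


# Two hypothetical clients: only parameters and sample counts reach the server,
# never the raw patient records.
global_params = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a full training loop the server would broadcast `global_params` back to the clients, each client would continue training locally, and the aggregation would repeat for a number of rounds.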
Our results suggest that, with the assistance of COPD_AVG_FL, the absolute improvements on the test data are 13.4% (accuracy), 13.3% (precision), 12.8% (recall), 13.1% (F1-score), and 12.9% (AUC). Decoupling model training from the raw training data protects patients' privacy and makes it possible to securely integrate more COPD data from different medical units into a more comprehensive COPD_AVG_FL model. This approach promotes the deployment of wise information technology of medicine for COPD in real-world clinical practice. Code for our model will be made available at https://github.com/Cczhh/COPD_AVG_FL/tree/master.