Lu Yuying, Gu Tian, Duan Rui
Department of Biostatistics, Columbia Mailman School of Public Health, New York, NY 10032, USA.
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.
Stat Biosci. 2024 Aug 13. doi: 10.1007/s12561-024-09449-2.
Large-scale genomics data combined with electronic health records (EHRs) open a path toward personalized disease management and more effective medical interventions. However, the absence of "gold standard" disease labels makes developing machine learning models challenging. In addition, imbalanced demographic representation within datasets compromises the development of unbiased healthcare solutions. In response to these challenges, we introduce FEderated Semi-Supervised Transfer Learning (FEST) for improving disease risk prediction in underrepresented populations. FEST enables collaborative model training across institutions by leveraging both labeled and unlabeled data from diverse subpopulations. It addresses distributional differences across populations and healthcare institutions by combining density ratio reweighting with model calibration. We develop federated learning algorithms that train models using only summary-level statistics. We conduct simulation studies to assess the efficacy of FEST in comparison with several alternative methods. We then apply FEST to train a genetic risk prediction model for type 2 diabetes that targets the African-ancestry population, using data from the Massachusetts General Brigham (MGB) Biobank. Both the simulations and the real-data application show that FEST outperforms the competing methods.
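The density ratio reweighting mentioned in the abstract can be illustrated with a minimal sketch. This is not the FEST estimator itself, which the abstract does not specify; it assumes a plain covariate-shift setting and uses a common textbook approach, a logistic "domain classifier" that separates source from target samples and whose odds are converted into weights. The function names `fit_logistic` and `density_ratio_weights` are hypothetical, and the federated, summary-statistics aspect is not covered here.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, n_iter=3000):
    """Logistic regression by gradient descent (intercept included).

    Illustrative helper only; any standard logistic solver would do.
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
        beta -= lr * (X1.T @ (p - y)) / len(y)
    return beta

def density_ratio_weights(X_source, X_target):
    """Estimate w(x) = p_target(x) / p_source(x) at the source points.

    Assumption: the ratio is recovered from a domain classifier,
    w(x) = [P(target | x) / P(source | x)] * (n_source / n_target).
    """
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    beta = fit_logistic(X, d)
    X1 = np.column_stack([np.ones(len(X_source)), X_source])
    odds = np.exp(X1 @ beta)  # P(target | x) / P(source | x)
    return odds * len(X_source) / len(X_target)

# Toy example: source covariates ~ N(0, 1), target covariates ~ N(1, 1).
rng = np.random.default_rng(0)
X_source = rng.normal(0.0, 1.0, size=(2000, 1))
X_target = rng.normal(1.0, 1.0, size=(2000, 1))
weights = density_ratio_weights(X_source, X_target)

# Reweighted source mean should move toward the target mean.
reweighted_mean = np.average(X_source[:, 0], weights=weights)
```

In a transfer setting, such weights would typically multiply each source sample's contribution to the loss, so that the fitted risk model reflects the covariate distribution of the target population rather than that of the larger source population.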