Enhancing Genetic Risk Prediction through Federated Semi-Supervised Transfer Learning with Inaccurate Electronic Health Record Data.

Author information

Lu Yuying, Gu Tian, Duan Rui

Affiliations

Department of Biostatistics, Columbia Mailman School of Public Health, New York, NY 10032, USA.

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.

Publication information

Stat Biosci. 2024 Aug 13. doi: 10.1007/s12561-024-09449-2.

DOI:10.1007/s12561-024-09449-2

PMID:40917581

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12409711/

Abstract

Large-scale genomics data combined with Electronic Health Records (EHRs) illuminate the path towards personalized disease management and enhanced medical interventions. However, the absence of "gold standard" disease labels makes the development of machine learning models a challenging task. Additionally, imbalances in demographic representation within datasets compromise the development of unbiased healthcare solutions. In response to these challenges, we introduce FEderated Semi-Supervised Transfer Learning (FEST) for improving disease risk predictions in underrepresented populations. FEST facilitates the collaborative training of models across institutions by leveraging both labeled and unlabeled data from diverse subpopulations. It addresses distributional variations across populations and healthcare institutions by combining density ratio reweighting and model calibration techniques. Federated learning algorithms are developed for training models using only summary-level statistics. We perform simulation studies to assess the efficacy of FEST in comparison with several alternative methods. Subsequently, we apply FEST to train a genetic risk prediction model for type 2 diabetes that targets the African-Ancestry population using data from the Massachusetts General Brigham (MGB) Biobank. Both our simulation studies and the real-world data application indicate that FEST outperforms the competing methods.
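The abstract does not spell out FEST's algorithms, so the following is only a generic sketch of what "training with summary-level statistics" can look like, not the authors' method. It assumes a plain logistic regression without an intercept, fitted by gradient descent, in which each site shares only its summed gradient and total weight. The function names (`site_summary`, `federated_fit`, `make_site`), the simulated data, and the optional `weights` argument (a hypothetical hook where density-ratio weights could enter) are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def site_summary(X, y, beta, weights=None):
    """Run locally at one site; only the returned summaries leave the site."""
    if weights is None:
        # Hypothetical hook: density-ratio weights could be supplied here.
        weights = np.ones(len(y))
    residual = sigmoid(X @ beta) - y
    grad_sum = X.T @ (weights * residual)   # summed weighted gradient
    return grad_sum, weights.sum()          # no individual-level rows

def federated_fit(sites, n_features, lr=0.5, n_iter=500):
    """Coordinator: gradient descent on the aggregated site summaries."""
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        summaries = [site_summary(X, y, beta, w) for X, y, w in sites]
        total_grad = sum(g for g, _ in summaries)
        total_weight = sum(n for _, n in summaries)
        beta = beta - lr * total_grad / total_weight
    return beta

def make_site(rng, n, true_beta, shift):
    """Simulated site whose covariate mean differs by `shift` (assumption)."""
    X = rng.normal(loc=shift, size=(n, len(true_beta)))
    y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)
    return X, y, None

rng = np.random.default_rng(0)
true_beta = np.array([1.0, -0.5])
sites = [make_site(rng, 2000, true_beta, s) for s in (0.0, 0.5, -0.5)]

# Federated estimate: sites exchange only gradients and weight totals.
beta_fed = federated_fit(sites, n_features=2)

# Reference: the same fit on pooled individual-level data.
X_all = np.vstack([X for X, _, _ in sites])
y_all = np.concatenate([y for _, y, _ in sites])
beta_pooled = federated_fit([(X_all, y_all, None)], n_features=2)

print("federated:", beta_fed)
print("pooled:   ", beta_pooled)
```

Because the logistic-loss gradient is a sum over individuals, adding up per-site gradients reproduces the pooled-data update at every step, which is why the two estimates agree up to floating-point error in this toy setting. Handling site heterogeneity, unlabeled data, and label inaccuracy, as FEST is described as doing, would require additional machinery not shown here.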
