Li Sai, Cai Tianxi, Duan Rui
Institute of Statistics and Big Data, Renmin University of China.
Department of Biostatistics, Harvard T.H. Chan School of Public Health.
Ann Appl Stat. 2023 Dec;17(4):2970-2992. doi: 10.1214/23-AOAS1747. Epub 2023 Oct 30.
The limited representation of minorities and disadvantaged populations in large-scale clinical and genomics research poses a significant barrier to translating precision medicine research into practice. Prediction models are likely to underperform in underrepresented populations due to heterogeneity across populations, thereby exacerbating known health disparities. To address this issue, we propose FETA, a two-way data integration method that leverages a federated transfer learning approach to integrate heterogeneous data from diverse populations and multiple healthcare institutions, with a focus on a target population of interest having limited sample sizes. We show that FETA achieves performance comparable to the pooled analysis, where individual-level data is shared across institutions, with only a small number of communications across participating sites. Our theoretical analysis and simulation study demonstrate how FETA's estimation accuracy is influenced by communication budgets, privacy restrictions, and heterogeneity across populations. We apply FETA to multisite data from the electronic Medical Records and Genomics (eMERGE) Network to construct genetic risk prediction models for extreme obesity. Compared to models trained using target data only, source data only, and all data without accounting for population-level differences, FETA shows superior predictive performance. FETA has the potential to improve estimation and prediction accuracy in underrepresented populations and reduce the gap in model performance across populations.
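As a rough illustration of the general idea behind federated transfer learning, and not the FETA algorithm itself (its details are not given in this abstract), the sketch below assumes a linear model, three simulated source sites, and a small target sample whose coefficients differ from the source coefficients in one entry. Source sites share only gradient vectors over a fixed number of communication rounds, and the target site then fits a sparse correction to the shared estimate with a lasso penalty. All function names, sample sizes, and tuning values here are hypothetical choices made for the example.

```python
import numpy as np

def local_gradient(X, y, w):
    """Least-squares gradient computed inside one site; only this vector is shared."""
    return -X.T @ (y - X @ w) / len(y)

def federated_fit(sites, rounds=30, step=0.5):
    """Gradient descent in which each round is one communication with every site."""
    w = np.zeros(sites[0][0].shape[1])
    n_total = sum(len(y) for _, y in sites)
    for _ in range(rounds):
        # Sample-size-weighted average of site gradients equals the pooled gradient.
        grad = sum(len(y) * local_gradient(X, y, w) for X, y in sites) / n_total
        w = w - step * grad
    return w

def soft_threshold(v, lam):
    """Proximal operator of the L1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def fit_sparse_correction(X, y, w_source, lam=0.1, iters=500):
    """Lasso fit (ISTA) of the target-vs-source coefficient difference on target data."""
    n = len(y)
    residual = y - X @ w_source
    lipschitz = np.linalg.norm(X, 2) ** 2 / n  # largest eigenvalue of X'X / n
    delta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -X.T @ (residual - X @ delta) / n
        delta = soft_threshold(delta - grad / lipschitz, lam / lipschitz)
    return delta

# Simulated data (assumed setup, not the eMERGE data).
rng = np.random.default_rng(0)
p = 10
beta_source = rng.normal(size=p)
beta_target = beta_source.copy()
beta_target[0] += 1.0  # population-level difference in a single coefficient

def simulate(n, beta):
    X = rng.normal(size=(n, p))
    return X, X @ beta + rng.normal(size=n)

source_sites = [simulate(500, beta_source) for _ in range(3)]
X_t, y_t = simulate(40, beta_target)

w_fed = federated_fit(source_sites)
delta_hat = fit_sparse_correction(X_t, y_t, w_fed)
beta_transfer = w_fed + delta_hat
beta_target_only = np.linalg.lstsq(X_t, y_t, rcond=None)[0]

print("transfer error:   ", np.linalg.norm(beta_transfer - beta_target))
print("target-only error:", np.linalg.norm(beta_target_only - beta_target))
```

In this toy setting the federated estimate matches what a pooled least-squares fit on the stacked source data would give, which mirrors the abstract's comparison with pooled analysis. The privacy restrictions and the theoretical guarantees discussed in the paper are outside the scope of this sketch.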