Gu Tian, Han Yi, Duan Rui
Department of Biostatistics, Columbia University Mailman School of Public Health, New York, NY 10032, USA.
Department of Statistics, Columbia University, New York, NY 10027, USA.
J R Stat Soc Series B Stat Methodol. 2024 Dec 3;87(3):723-745. doi: 10.1093/jrsssb/qkae111. eCollection 2025 Jul.
Transfer learning improves target model performance by leveraging data from related source populations, especially when target data are scarce. This study addresses the challenge of training high-dimensional regression models with limited target data in the presence of heterogeneous source populations. We focus on a practical setting where only parameter estimates of pretrained source models are available, rather than individual-level source data. For a single source model, we propose a novel angle-based transfer learning (angleTL) method that leverages concordance between source and target model parameters. AngleTL adapts to the signal strength of the target model, unifies several benchmark methods, and mitigates negative transfer when between-population heterogeneity is large. We extend angleTL to incorporate multiple source models, accounting for varying levels of relevance among them. Our high-dimensional asymptotic analysis provides insights into when a source model benefits the target model and demonstrates the superiority of angleTL over other methods. Extensive simulations validate these findings and highlight the feasibility of applying angleTL to transfer genetic risk prediction models across multiple biobanks.
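The abstract does not state the estimator itself, so the following is only an illustrative sketch of how an angle-type penalty could be combined with ridge regression. It assumes a penalized least-squares objective of the form (1/2n)||y - Xb||^2 + (lam/2)||b||^2 - eta * w'b, where w is the pretrained source estimate. The objective, the function name `angle_penalized_ridge`, and the tuning parameters `lam` and `eta` are assumptions made for illustration and are not taken from the paper. Under this assumed form, eta = 0 reduces to target-only ridge and eta = lam reduces to a distance-based penalty (lam/2)||b - w||^2, which is one way such a method could unify benchmark approaches.

```python
import numpy as np

def angle_penalized_ridge(X, y, w, lam, eta):
    """Closed-form minimizer of the ASSUMED objective
    (1/2n)||y - Xb||^2 + (lam/2)||b||^2 - eta * w @ b.
    Hypothetical helper, not the authors' implementation."""
    n, p = X.shape
    A = X.T @ X / n + lam * np.eye(p)   # ridge-regularized Gram matrix
    rhs = X.T @ y / n + eta * w         # inner-product term pulls b toward w
    return np.linalg.solve(A, rhs)

# Simulated high-dimensional target data (p > n) with a related source estimate.
rng = np.random.default_rng(0)
n, p = 50, 200
beta = rng.normal(size=p) / np.sqrt(p)                  # target coefficients
w = 0.8 * beta + 0.6 * rng.normal(size=p) / np.sqrt(p)  # noisy source estimate
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

lam = 1.0
A = X.T @ X / n + lam * np.eye(p)

# Special case eta = 0: ordinary target-only ridge regression.
ridge = np.linalg.solve(A, X.T @ y / n)

# Special case eta = lam: distance-based shrinkage toward w,
# written as w plus a ridge fit to the residual y - Xw.
dist = w + np.linalg.solve(A, X.T @ (y - X @ w) / n)

b0 = angle_penalized_ridge(X, y, w, lam, 0.0)
b1 = angle_penalized_ridge(X, y, w, lam, lam)
```

In practice `lam` and `eta` would be chosen by cross-validation on the target data. A small or negative `eta` would downweight a poorly aligned source, which is in the spirit of the negative-transfer protection the abstract describes.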