Pan Lulu, Gao Qian, Wei Kecheng, Yu Yongfu, Qin Guoyou, Wang Tong
Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China.
Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China.
PLoS Comput Biol. 2025 Jan 10;21(1):e1012739. doi: 10.1371/journal.pcbi.1012739. eCollection 2025 Jan.
Transfer learning aims to integrate useful information from multiple source datasets to improve learning performance on target data. It applies naturally in genomics: when learning gene associations in a target tissue, data from other tissues can be integrated. However, heavy-tailed distributions and outliers are common in genomics data, which challenges the effectiveness of current transfer learning approaches. In this paper, we study transfer learning under high-dimensional linear models with t-distributed errors. The proposed method, Trans-PtLR, aims to improve estimation and prediction on target data by borrowing information from useful source data while remaining robust to complex data with heavy tails and outliers. In the oracle case, where the transferable source datasets are known, we establish a transfer learning algorithm based on penalized maximum likelihood and the expectation-maximization algorithm. To avoid including non-informative sources, we propose selecting the transferable sources by cross-validation. Extensive simulation experiments and an application show that, when heavy tails and outliers are present, Trans-PtLR is more robust and gives better estimation and prediction than transfer learning for linear regression models with normally distributed errors.
Keywords: Data integration; Variable selection; t distribution; Expectation-maximization algorithm; Genotype-Tissue Expression; Cross-validation.
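The abstract names penalized maximum likelihood combined with the expectation-maximization algorithm for a linear model with t-distributed errors, but gives no algorithmic detail. The following is a minimal sketch of that single building block only, not the authors' Trans-PtLR implementation: it omits the transfer step and the cross-validated source selection. It assumes a lasso (L1) penalty and a fixed, known degrees-of-freedom value `nu`, neither of which is stated in the abstract, and the function names `soft_threshold`, `weighted_lasso`, and `em_t_lasso` are hypothetical. The E-step computes per-observation weights that shrink toward zero for large residuals, and the M-step solves a weighted lasso by coordinate descent.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def weighted_lasso(X, y, w, lam, beta, n_sweeps=50):
    """Coordinate descent for
    (1 / (2n)) * sum_i w_i * (y_i - x_i' beta)^2 + lam * ||beta||_1,
    warm-started at `beta`."""
    n, p = X.shape
    beta = beta.copy()
    resid = y - X @ beta
    col_scale = (w[:, None] * X**2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]  # partial residual without coordinate j
            rho = (w * X[:, j] * resid).sum() / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
            resid -= X[:, j] * beta[j]
    return beta


def em_t_lasso(X, y, lam=0.1, nu=3.0, n_iter=30):
    """EM for linear regression with t-distributed errors and an L1 penalty.

    Assumptions of this sketch: `nu` is fixed rather than estimated, and the
    penalty is a lasso. Returns the coefficients, the scale estimate, and the
    weights from the last E-step.
    """
    n, p = X.shape
    beta = np.zeros(p)
    sigma2 = np.var(y) + 1e-12
    for _ in range(n_iter):
        # E-step: expected latent precision weights; large residuals get small weights.
        resid = y - X @ beta
        w = (nu + 1.0) / (nu + resid**2 / sigma2)
        # M-step: weighted lasso for beta, then weighted residual scale.
        beta = weighted_lasso(X, y, w, lam, beta)
        resid = y - X @ beta
        sigma2 = max(np.mean(w * resid**2), 1e-12)
    return beta, sigma2, w


if __name__ == "__main__":
    # Simulated heavy-tailed data with one gross outlier (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))
    beta_true = np.zeros(10)
    beta_true[:2] = [2.0, -1.5]
    y = X @ beta_true + rng.standard_t(3, size=200)
    y[0] += 50.0
    beta_hat, sigma2_hat, w = em_t_lasso(X, y, lam=0.05)
    print("estimated coefficients:", np.round(beta_hat, 2))
    print("weight of the outlier:", w[0])
```

The design point is that the t model reduces to iteratively reweighted penalized least squares: observations with large scaled residuals receive weights near zero, which is where the robustness to heavy tails and outliers comes from. With normally distributed errors all weights would equal one and the procedure would reduce to an ordinary lasso.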