Li Sai, Cai T Tony, Li Hongzhe
Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104.
Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104.
J R Stat Soc Series B Stat Methodol. 2022 Feb;84(1):149-173. doi: 10.1111/rssb.12479. Epub 2021 Nov 16.
This paper considers estimation and prediction of a high-dimensional linear regression in the setting of transfer learning where, in addition to observations from the target model, auxiliary samples from different but possibly related regression models are available. When the set of informative auxiliary studies is known, an estimator and a predictor are proposed and their optimality is established. The optimal rates of convergence for prediction and estimation are faster than the corresponding rates without using the auxiliary samples. This implies that knowledge from the informative auxiliary samples can be transferred to improve the learning performance of the target problem. When the set of informative auxiliary samples is unknown, we propose a data-driven procedure for transfer learning, called Trans-Lasso, and show its robustness to non-informative auxiliary samples and its efficiency in knowledge transfer. The proposed procedures are demonstrated in numerical studies and are applied to a dataset concerning the associations among gene expressions. It is shown that Trans-Lasso leads to improved performance in gene expression prediction in a target tissue by incorporating data from multiple different tissues as auxiliary samples.
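The abstract does not spell out the form of the estimator, so the following is only a minimal sketch of one way the known-informative-set case could be approached: pool the informative auxiliary samples to get a rough coefficient vector, then correct it with a second sparse fit on the target residuals. The function names (`soft_threshold`, `lasso_cd`, `oracle_trans_lasso`), the plain coordinate-descent Lasso solver, and the tuning parameters `lam_w` and `lam_delta` are all assumptions introduced for illustration. The sketch does not cover the data-driven Trans-Lasso procedure for an unknown informative set.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    A deliberately simple solver (fixed number of sweeps, no convergence
    check), used here only to keep the sketch self-contained.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()                      # residual y - X @ beta
    col_sq = (X ** 2).sum(axis=0) / n     # per-column scale
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                  # skip all-zero columns
            # Partial correlation of column j with the residual
            # that excludes column j's current contribution.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * (new_bj - beta[j])
            beta[j] = new_bj
    return beta


def oracle_trans_lasso(X0, y0, aux, lam_w, lam_delta):
    """Two-step pool-then-correct sketch (hypothetical helper).

    X0, y0 : target design matrix and response
    aux    : list of (X, y) pairs assumed to be informative
    """
    # Step 1: Lasso on the pooled informative auxiliary samples.
    X_A = np.vstack([X for X, _ in aux])
    y_A = np.concatenate([y for _, y in aux])
    w = lasso_cd(X_A, y_A, lam_w)
    # Step 2: sparse correction fitted to the target residuals.
    delta = lasso_cd(X0, y0 - X0 @ w, lam_delta)
    return w + delta
```

In practice the two penalties would be chosen by cross-validation or a similar criterion; the sketch leaves that choice to the caller.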