Wang Peiyao, Li Quefeng, Shen Dinggang, Liu Yufeng
University of North Carolina at Chapel Hill.
ShanghaiTech University.
Stat Sin. 2023 Jan;33(1):27-53. doi: 10.5705/ss.202020.0145.
In modern scientific research, data heterogeneity is commonly observed owing to the abundance of complex data. We propose a factor regression model for data with heterogeneous subpopulations. The proposed model can be represented as a decomposition into heterogeneous and homogeneous terms. The heterogeneous term is driven by latent factors in different subpopulations. The homogeneous term captures common variation in the covariates and shares common regression coefficients across subpopulations. Our proposed model attains a good balance between a global model, which ignores the data heterogeneity, and a group-specific model, which fits each subgroup separately. We prove estimation and prediction consistency for our proposed estimators, and show that they have better convergence rates than those of the group-specific and global models. We show that the extra cost of estimating latent factors is asymptotically negligible and that the minimax rate is still attainable. We further demonstrate the robustness of our proposed method by studying its prediction error under a misspecified group-specific model. Finally, we conduct simulation studies and analyze a data set from the Alzheimer's Disease Neuroimaging Initiative and an aggregated microarray data set to further demonstrate the competitiveness and interpretability of our proposed factor regression model.
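The decomposition described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' estimator; it is one possible reading of the abstract, under several assumptions that do not appear in the source: covariates in group m follow x = B_m f + u, the response is y = f'gamma_m + u'beta + noise with beta shared across groups, the number of factors k is known, factors are estimated by a per-group principal component analysis, and coefficients are fitted by ordinary least squares in a low-dimensional setting. All function and variable names (simulate, fit_factor_regression, B, gamma, beta) are hypothetical.

```python
import numpy as np


def simulate(n_per_group=200, p=20, k=2, n_groups=3, seed=0):
    """Draw mean-zero data from the assumed model
    x = B_m f + u,  y = f' gamma_m + u' beta + eps."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=p)                    # shared coefficients (homogeneous term)
    X_parts, y_parts, g_parts = [], [], []
    for m in range(n_groups):
        B = 2.0 * rng.normal(size=(p, k))        # group-specific loadings
        gamma = 3.0 * rng.normal(size=k)         # group-specific factor coefficients
        F = rng.normal(size=(n_per_group, k))    # latent factors (heterogeneous term)
        U = 0.5 * rng.normal(size=(n_per_group, p))  # idiosyncratic covariate variation
        eps = 0.1 * rng.normal(size=n_per_group)
        X_parts.append(F @ B.T + U)
        y_parts.append(F @ gamma + U @ beta + eps)
        g_parts.append(np.full(n_per_group, m))
    return np.vstack(X_parts), np.concatenate(y_parts), np.concatenate(g_parts), beta


def fit_factor_regression(X, y, groups, k):
    """Estimate factors per group by PCA, then regress y on the
    group-specific factors and the pooled PCA residuals."""
    n, p = X.shape
    labels = np.unique(groups)
    U_hat = np.zeros_like(X)
    F_blocks = np.zeros((n, k * len(labels)))
    for j, m in enumerate(labels):
        idx = groups == m
        Xm = X[idx]                              # simulated data are mean zero, so no centering
        left, s, Vt = np.linalg.svd(Xm, full_matrices=False)
        Fm = left[:, :k] * s[:k]                 # estimated factors for group m
        U_hat[idx] = Xm - Fm @ Vt[:k]            # residual after removing the top-k components
        F_blocks[idx, j * k:(j + 1) * k] = Fm    # each group gets its own factor columns
    Z = np.hstack([F_blocks, U_hat])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    beta_hat = coef[k * len(labels):]            # coefficients shared across groups
    return beta_hat, Z @ coef


X, y, groups, beta = simulate()
beta_hat, fitted = fit_factor_regression(X, y, groups, k=2)
mse_factor = np.mean((y - fitted) ** 2)

# Global model for comparison: one least-squares fit that ignores the groups.
coef_global, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_global = np.mean((y - X @ coef_global) ** 2)

print(f"in-sample MSE, factor regression sketch: {mse_factor:.4f}")
print(f"in-sample MSE, global least squares:     {mse_global:.4f}")
```

Because the pooled fit is a special case of this design matrix, the in-sample error of the sketch cannot exceed that of the global fit. A fair comparison of predictive performance would need held-out data, and a high-dimensional setting would need a penalized fit in place of plain least squares.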