Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics, Chengdu, China.
Institute of Western China Economic Research, Southwestern University of Finance and Economics, Chengdu, China.
Stat Med. 2024 Nov 10;43(25):4836-4849. doi: 10.1002/sim.10213. Epub 2024 Sep 5.
Existing high-dimensional linear factor models do not account for different variable types, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. Yet overdispersion is prevalent in practice, particularly in biomedical and genomics studies. To address this need, we propose an overdispersed generalized factor model (OverGFM) for high-dimensional nonlinear factor analysis of overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that the factors alone cannot explain. This extra term introduces significant computational challenges, because the nonlinear model then involves two high-dimensional latent random matrices. To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. The algorithm provides explicit iterative solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the number of factors, and numerical results demonstrate its effectiveness. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in both estimation accuracy and computational efficiency. We further demonstrate the practical merit of the method by applying it to two genomics datasets. To facilitate its use, we have integrated the implementation of OverGFM into the R package GFM.
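To make the two central ideas concrete, the following is a minimal Python sketch, not the authors' implementation and not the API of the R package GFM. It assumes a Poisson log-linear model in which the linear predictor is an intercept plus a low-rank factor term plus an extra Gaussian error term (the source of overdispersion), and it uses a generic singular value ratio rule: pick the k that maximizes the ratio of the k-th to the (k+1)-th singular value. The exact criterion in the paper may differ. The function names `simulate_overdispersed_counts` and `singular_value_ratio`, and all parameter values, are hypothetical.

```python
import numpy as np


def simulate_overdispersed_counts(n=300, p=40, q=3, sigma_eps=0.7, seed=0):
    """Simulate counts whose log-mean is intercept + factors + extra error.

    Assumed (illustrative) model: X_ij ~ Poisson(exp(mu_j + h_i' b_j + eps_ij)).
    The eps term adds variation that the q factors alone cannot explain.
    """
    rng = np.random.default_rng(seed)
    H = rng.normal(size=(n, q))                     # latent factors (n x q)
    B = rng.normal(scale=0.5, size=(p, q))          # loadings (p x q)
    mu = rng.normal(scale=0.2, size=p)              # variable-specific intercepts
    eps = rng.normal(scale=sigma_eps, size=(n, p))  # overdispersion error term
    eta = mu + H @ B.T + eps                        # linear predictor (n x p)
    X = rng.poisson(np.exp(eta))                    # observed counts
    return X, H, B


def singular_value_ratio(M, q_max):
    """Return (q_hat, ratios) with q_hat = argmax_k s_k / s_{k+1}, k = 1..q_max.

    M is used as given; centre or transform it beforehand if needed.
    """
    s = np.linalg.svd(np.asarray(M, dtype=float), compute_uv=False)
    q_max = min(q_max, len(s) - 1)
    # Floor the denominators to avoid division by zero for rank-deficient input.
    ratios = s[:q_max] / np.maximum(s[1:q_max + 1], 1e-12)
    return int(np.argmax(ratios)) + 1, ratios


if __name__ == "__main__":
    X, H, B = simulate_overdispersed_counts()
    # Crude surrogate for the linear predictor: log(1 + count), column-centred.
    Z = np.log1p(X)
    Z = Z - Z.mean(axis=0)
    q_hat, ratios = singular_value_ratio(Z, q_max=10)
    print("estimated number of factors:", q_hat)
    print("variance / mean per variable (first 5):",
          (X.var(axis=0) / X.mean(axis=0))[:5])
```

Applying the ratio rule to `log1p` counts is only a rough stand-in: in the paper the criterion is tied to the fitted model, whereas here it is applied to a simple transformation of the raw data, so the estimate on simulated data should be read as illustrative.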