High-Dimensional Overdispersed Generalized Factor Model With Application to Single-Cell Sequencing Data Analysis.

Affiliations

Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics, Chengdu, China.

Institute of Western China Economic Research, Southwestern University of Finance and Economics, Chengdu, China.

Publication information

Stat Med. 2024 Nov 10;43(25):4836-4849. doi: 10.1002/sim.10213. Epub 2024 Sep 5.

Abstract

The current high-dimensional linear factor models fail to account for the different types of variables, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. However, overdispersion is prevalent in practical applications, particularly in fields like biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. However, this introduces significant computational challenges due to the involvement of two high-dimensional latent random matrices in the nonlinear model. To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. This algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors. Numerical results demonstrate the effectiveness of this criterion. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in terms of estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two datasets from genomics. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM.
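
The abstract does not spell out the singular value ratio criterion, so the following is only a minimal sketch of the general idea under stated assumptions: take the singular values of a matrix that carries the factor structure, and choose the number of factors at the position where the ratio of consecutive singular values is largest. The function name `singular_value_ratio`, the `q_max` cap, and the toy linear Gaussian data are all illustrative assumptions; they are not the OverGFM procedure or the API of the R package GFM.

```python
import numpy as np

def singular_value_ratio(mat, q_max=None):
    """Illustrative singular value ratio rule (not the paper's exact criterion).

    Returns the index k (1-based) that maximizes s_k / s_{k+1}, where
    s_1 >= s_2 >= ... are the singular values of `mat`, together with
    the ratios themselves.
    """
    s = np.linalg.svd(mat, compute_uv=False)  # singular values, descending
    if q_max is None:
        q_max = len(s) - 1
    q_max = min(q_max, len(s) - 1)
    # Guard against division by a numerically zero singular value.
    ratios = s[:q_max] / np.maximum(s[1:q_max + 1], 1e-12)
    return int(np.argmax(ratios)) + 1, ratios

# Toy example (assumption): a linear low-rank signal plus small Gaussian noise.
rng = np.random.default_rng(0)
n, p, q_true = 200, 50, 3
H = rng.normal(size=(n, q_true))   # latent factors
B = rng.normal(size=(p, q_true))   # loadings
X = H @ B.T + 0.1 * rng.normal(size=(n, p))

q_hat, ratios = singular_value_ratio(X, q_max=10)
print("selected number of factors:", q_hat)
```

In this toy setting the gap between the third and fourth singular values dominates, so the rule picks out the rank of the signal. With mixed-type, overdispersed data the rule would be applied to quantities estimated by the fitted model rather than to the raw data matrix.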
