High-Dimensional Overdispersed Generalized Factor Model With Application to Single-Cell Sequencing Data Analysis.

Author information

Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics, Chengdu, China.

Institute of Western China Economic Research, Southwestern University of Finance and Economics, Chengdu, China.

Publication information

Stat Med. 2024 Nov 10;43(25):4836-4849. doi: 10.1002/sim.10213. Epub 2024 Sep 5.

DOI:10.1002/sim.10213

PMID:39237124

Abstract

The current high-dimensional linear factor models fail to account for the different types of variables, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. However, overdispersion is prevalent in practical applications, particularly in fields like biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. However, this introduces significant computational challenges due to the involvement of two high-dimensional latent random matrices in the nonlinear model. To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. This algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors. Numerical results demonstrate the effectiveness of this criterion. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in terms of estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two datasets from genomics. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM.
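The abstract does not state the model equations, so the following is only a minimal sketch of why an extra error term produces overdispersion. It assumes a single Poisson-type variable with a log link, a factor contribution held fixed at a hypothetical value `eta`, and a Gaussian error term with standard deviation `error_sd`. All function and variable names are illustrative; none of them come from the paper or from the GFM package.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method.

    Adequate for the small rates used here; not meant for large rates.
    """
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(n_cells, linear_predictor, error_sd, rng):
    """Simulate counts whose log-mean is linear_predictor + eps.

    eps ~ N(0, error_sd^2) plays the role of the additional error term
    (an assumption made for illustration); error_sd = 0 gives plain Poisson.
    """
    counts = []
    for _ in range(n_cells):
        eps = rng.gauss(0.0, error_sd) if error_sd > 0 else 0.0
        counts.append(sample_poisson(math.exp(linear_predictor + eps), rng))
    return counts


def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 for a plain Poisson sample."""
    return statistics.pvariance(counts) / statistics.mean(counts)


rng = random.Random(0)
eta = math.log(5.0)  # hypothetical factor part of the linear predictor, held fixed
plain = simulate_counts(20000, eta, error_sd=0.0, rng=rng)
noisy = simulate_counts(20000, eta, error_sd=0.5, rng=rng)
print(round(dispersion_index(plain), 2), round(dispersion_index(noisy), 2))
```

With the factor part held fixed, the counts without the error term have a variance-to-mean ratio near 1, while the counts with the error term have a ratio well above 1. This is the kind of excess variability that, according to the abstract, the factors alone cannot account for.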

Similar articles

High-dimensional covariate-augmented overdispersed Poisson factor model.

Biometrics. 2024 Mar 27;80(2). doi: 10.1093/biomtc/ujae031.

Comparative assessment of parameter estimation methods in the presence of overdispersion: a simulation study.

Math Biosci Eng. 2019 May 16;16(5):4299-4313. doi: 10.3934/mbe.2019214.

A Variational Maximization-Maximization Algorithm for Generalized Linear Mixed Models with Crossed Random Effects.

Psychometrika. 2017 Feb 28. doi: 10.1007/s11336-017-9555-z.

Numerical discretization-based estimation methods for ordinary differential equation models via penalized spline smoothing with applications in biomedical research.

Biometrics. 2012 Jun;68(2):344-52. doi: 10.1111/j.1541-0420.2012.01752.x. Epub 2012 Feb 29.

A comparison study on modeling of clustered and overdispersed count data for multiple comparisons.

J Appl Stat. 2020 Jul 3;48(16):3220-3232. doi: 10.1080/02664763.2020.1788518. eCollection 2021.

SLIVER: Unveiling large scale gene regulatory networks of single-cell transcriptomic data through causal structure learning and modules aggregation.

Comput Biol Med. 2024 Aug;178:108690. doi: 10.1016/j.compbiomed.2024.108690. Epub 2024 Jun 9.

Fast estimation of generalized linear latent variable models for performance and process data with ordinal, continuous, and count observed variables.

Br J Math Stat Psychol. 2024 Nov;77(3):477-507. doi: 10.1111/bmsp.12337. Epub 2024 Feb 12.

Group-representative functional network estimation from multi-subject fMRI data via MRF-based image segmentation.

Comput Methods Programs Biomed. 2019 Oct;179:104976. doi: 10.1016/j.cmpb.2019.07.004. Epub 2019 Jul 19.

Robustness of methods for blinded sample size re-estimation with overdispersed count data.

Stat Med. 2013 Sep 20;32(21):3623-35. doi: 10.1002/sim.5800. Epub 2013 Apr 18.

Cited by

Global Thyroid Cancer Patterns and Predictive Analytics: Integrating Machine Learning for Advanced Diagnostic Modelling.

J Cell Mol Med. 2025 Jul;29(13):e70676. doi: 10.1111/jcmm.70676.
