Kidziński Łukasz, Hui Francis K C, Warton David I, Hastie Trevor
Department of Bioengineering, Stanford University, Stanford, CA 94305, USA.
Research School of Finance, Actuarial Studies and Statistics, The Australian National University, Canberra, ACT 2601, Australia.
J Mach Learn Res. 2022 Nov;23.
Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, classical tools such as factor analysis and principal component analysis offer well-established theory and fast algorithms. Generalized linear latent variable models (GLLVMs) extend such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable than existing algorithms, enabling GLLVM fits to much larger matrices than previously possible. We apply our method to a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.
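To make the fitting idea in the abstract concrete, the following is a minimal sketch, not the authors' published implementation. It assumes a Poisson response with a log link, treats the latent scores as ridge-penalized parameters in the spirit of penalized quasi-likelihood, and alternates damped Newton updates between the latent scores and the loadings. With the canonical log link, the Newton step and the Fisher scoring step coincide. The function names (`poisson_nll`, `fit_poisson_gllvm`), the ridge penalty on the loadings, the step size, and the clipping of the linear predictor are all illustrative assumptions.

```python
import numpy as np

def poisson_nll(Y, eta):
    """Poisson negative log-likelihood, dropping the term that depends only on Y."""
    return float(np.sum(np.exp(eta) - Y * eta))

def fit_poisson_gllvm(Y, d=2, lam=1.0, n_iter=30, step=0.5, seed=0):
    """Alternating damped Newton (= Fisher scoring) updates for a Poisson factor model.

    Model assumed here: log E[Y_ij] = b0_j + z_i . B_j, with ridge penalty lam
    on both the latent scores Z and the loadings B (an illustrative choice).
    """
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    Z = 0.1 * rng.standard_normal((n, d))      # latent scores, one row per unit
    B = 0.1 * rng.standard_normal((p, d))      # loadings, one row per response
    b0 = np.log(Y.mean(axis=0) + 0.1)          # per-response intercepts
    pen = np.diag(np.r_[0.0, np.full(d, lam)]) # intercept is left unpenalized
    history = []

    def linear_predictor():
        # Clipping guards against overflow in exp(); a numerical safeguard only.
        return np.clip(b0 + Z @ B.T, -20.0, 20.0)

    def objective(eta):
        return poisson_nll(Y, eta) + 0.5 * lam * (np.sum(Z**2) + np.sum(B**2))

    for _ in range(n_iter):
        # Update all latent score vectors z_i in one batched solve.
        eta = linear_predictor()
        mu = np.exp(eta)
        history.append(objective(eta))
        grad = (Y - mu) @ B - lam * Z                                  # (n, d)
        info = np.einsum("jk,ij,jl->ikl", B, mu, B) + lam * np.eye(d)  # (n, d, d)
        Z = Z + step * np.linalg.solve(info, grad[..., None])[..., 0]

        # Update intercept and loadings for every response in one batched solve.
        mu = np.exp(linear_predictor())
        X = np.column_stack([np.ones(n), Z])                           # (n, d + 1)
        coef = np.column_stack([b0, B])                                # (p, d + 1)
        grad = (Y - mu).T @ X - coef @ pen                             # (p, d + 1)
        info = np.einsum("ik,ij,il->jkl", X, mu, X) + pen              # (p, d+1, d+1)
        delta = np.linalg.solve(info, grad[..., None])[..., 0]
        b0 = b0 + step * delta[:, 0]
        B = B + step * delta[:, 1:]

    history.append(objective(linear_predictor()))
    return Z, B, b0, history

# Small synthetic example (simulated data, not the dataset from the article).
rng = np.random.default_rng(1)
n, p, d = 200, 30, 2
Z_true = rng.standard_normal((n, d))
B_true = 0.5 * rng.standard_normal((p, d))
Y = rng.poisson(np.exp(1.0 + Z_true @ B_true.T)).astype(float)

Z_hat, B_hat, b0_hat, history = fit_poisson_gllvm(Y, d=d)
print(f"penalized objective: start {history[0]:.1f}, end {history[-1]:.1f}")
```

Each update solves only small d-by-d (or (d+1)-by-(d+1)) linear systems, one per row or column of the data matrix, which is why this style of update is cheap per iteration. The sketch omits a line search, convergence checks, and the identifiability constraints a full implementation would need.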