Scalable estimation and regularization for the logistic normal multinomial model.

Authors

Zhang Jingru, Lin Wei

Affiliation

Center for Statistical Science, School of Mathematical Sciences, Peking University, Beijing, China.

Publication information

Biometrics. 2019 Dec;75(4):1098-1108. doi: 10.1111/biom.13071. Epub 2019 Apr 29.

DOI:10.1111/biom.13071

PMID:31009062

Abstract

Clustered multinomial data are prevalent in a variety of applications such as microbiome studies, where metagenomic sequencing data are summarized as multinomial counts for a large number of bacterial taxa per subject. Count normalization with ad hoc zero adjustment tends to result in poor estimates of abundances for taxa with zero or small counts. To account for heterogeneity and overdispersion in such data, we suggest using the logistic normal multinomial (LNM) model with an arbitrary correlation structure to simultaneously estimate the taxa compositions by borrowing information across subjects. We overcome the computational difficulties in high dimensions by developing a stochastic approximation EM algorithm with Hamiltonian Monte Carlo sampling for scalable parameter estimation in the LNM model. The ill-conditioning problem due to unstructured covariance is further mitigated by a covariance-regularized estimator with a condition number constraint. The advantages of the proposed methods are illustrated through simulations and an application to human gut microbiome data.
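
To make the model named in the abstract concrete, the following is a minimal sketch of drawing counts from a logistic normal multinomial model. It is not the authors' code and does not implement their stochastic approximation EM algorithm or the covariance-regularized estimator. It assumes the common additive log-ratio parameterization with the last taxon as the reference category. The function name `simulate_lnm` and the parameters `n_subjects`, `depth`, `mu`, and `sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_lnm(n_subjects, depth, mu, sigma, seed=0):
    """Draw counts from a logistic normal multinomial model (illustrative sketch).

    Assumes an additive log-ratio parameterization: each subject has a latent
    Gaussian vector of K - 1 log-ratios, with the last taxon as the reference.
    """
    rng = np.random.default_rng(seed)
    # Latent log-ratios, one row per subject, with an unstructured covariance.
    z = rng.multivariate_normal(mu, sigma, size=n_subjects)
    # Append a zero for the reference taxon, then map to the simplex (softmax).
    z_full = np.hstack([z, np.zeros((n_subjects, 1))])
    z_full -= z_full.max(axis=1, keepdims=True)  # guard against overflow in exp
    p = np.exp(z_full)
    p /= p.sum(axis=1, keepdims=True)
    # Multinomial counts for each subject given that subject's composition.
    counts = np.vstack([rng.multinomial(depth, p_i) for p_i in p])
    return counts, p


# Example: 5 subjects, 4 taxa, 1000 reads per subject.
counts, p = simulate_lnm(5, 1000, np.array([2.0, 0.0, -3.0]), np.eye(3))
```

In a simulation of this kind, taxa with a strongly negative mean log-ratio can receive zero or very small counts even though their true proportion is positive, which is the situation in which the abstract says count normalization with ad hoc zero adjustment performs poorly.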
