Zhang Xinyan, Mallick Himel, Tang Zaixiang, Zhang Lei, Cui Xiangqin, Benson Andrew K, Yi Nengjun
Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, 35294-0022, USA.
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA.
BMC Bioinformatics. 2017 Jan 3;18(1):4. doi: 10.1186/s12859-016-1441-7.
Recent advances in next-generation sequencing (NGS) technology enable researchers to collect large volumes of metagenomic sequencing data. These data are a valuable resource for investigating interactions between the microbiome and host environmental/clinical factors. Microbiome count measurements have well-known properties, such as total sequence reads that vary across samples, over-dispersion, and zero-inflation. In addition, microbiome studies usually collect samples with hierarchical structures, which introduce correlation among the samples and thus further complicate the analysis and interpretation of microbiome count data.
In this article, we propose negative binomial mixed models (NBMMs) for detecting associations between the microbiome and host environmental/clinical factors in correlated microbiome count data. Although they do not address zero-inflation, the proposed mixed-effects models account for correlation among the samples by incorporating random effects into the commonly used fixed-effects negative binomial model, and they can efficiently handle over-dispersion and varying total reads. We have developed a flexible and efficient IWLS (Iterative Weighted Least Squares) algorithm that fits the proposed NBMMs by taking advantage of the standard procedure for fitting linear mixed models.
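To make the IWLS idea concrete, the following is a minimal sketch of the fixed-effects building block only: a negative binomial regression with a log link, the log of the total reads as an offset, and a dispersion parameter theta that is assumed known. It is not the BhGLM implementation and does not reproduce the authors' algorithm. The function name fit_nb_iwls, the simulated data, and all parameter values are illustrative assumptions. The random-effects step (fitting a weighted linear mixed model to the working response instead of the plain weighted least squares used here) and the estimation of theta are both omitted.

```python
import numpy as np


def fit_nb_iwls(y, X, offset, theta, n_iter=50, tol=1e-8):
    """Fit a fixed-effects NB log-link regression by IWLS (theta assumed known).

    Model assumed here: E[y] = mu = exp(X @ beta + offset),
    Var[y] = mu + mu**2 / theta.
    """
    beta = np.zeros(X.shape[1])
    eta = np.log(y + 0.5)  # data-based starting value for the linear predictor
    for _ in range(n_iter):
        mu = np.exp(eta)
        # IWLS weights for the log link: (dmu/deta)^2 / Var = mu / (1 + mu/theta)
        w = mu / (1.0 + mu / theta)
        # Working response, with the offset removed so it is regressed on X only
        z = (eta - offset) + (y - mu) / mu
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ beta_new + offset
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Illustrative simulated data (all values are assumptions, not from the paper)
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                     # binary host factor
X = np.column_stack([np.ones(n), group])          # intercept + host factor
total_reads = rng.integers(5000, 20000, n)        # varying sequencing depth
offset = np.log(total_reads)
theta_true = 2.0
mu_true = np.exp(X @ np.array([-4.0, 0.5]) + offset)
# NumPy's (n, p) parameterization with n=theta, p=theta/(theta+mu) has mean mu
y = rng.negative_binomial(theta_true, theta_true / (theta_true + mu_true))

beta_hat = fit_nb_iwls(y, X, offset, theta=theta_true)
print("estimated coefficients:", beta_hat)
```

Using the log total reads as an offset means the coefficients describe relative abundance rather than raw counts, which is how the model can accommodate varying total reads. In a mixed-model version, the weighted least squares solve inside the loop would be replaced by a weighted linear mixed model fit on the same working response and weights.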
We evaluate and demonstrate the proposed method through extensive simulation studies and an application to mouse gut microbiome data. The results show that the proposed method has desirable properties and outperforms previously used methods in terms of both empirical power and Type I error. The method has been incorporated into the freely available R package BhGLM ( http://www.ssg.uab.edu/bhglm/ and http://github.com/abbyyan3/BhGLM ), providing a useful tool for analyzing microbiome data.