Department of Mathematics, Parahyangan Catholic University, Bandung, Indonesia.
Biomedical Data Sciences, section Medical Statistics, Leiden University Medical Centre, Leiden, The Netherlands.
Stat Med. 2019 May 30;38(12):2248-2268. doi: 10.1002/sim.8101. Epub 2019 Feb 13.
Clustered overdispersed multivariate count data are challenging to model because of correlation within and between samples. Typically, the first source of correlation needs to be addressed, but its quantification is of less interest. Here, we focus on the correlation between time points. In addition, the effects of covariates on the multivariate count distribution need to be assessed. To meet these requirements, a regression model based on the Dirichlet-multinomial distribution, which relates covariates to the categorical counts, is extended with random effects to handle the additional clustering. This model is the Dirichlet-multinomial mixed regression model. Alternatively, a negative binomial mixed regression model can be used, with the corresponding likelihood conditioned on the total count. It appears that these two approaches are equivalent when the total count is fixed and independent of the random effects. We consider both subject-specific and category-specific random effects. However, the latter carry a larger computational burden as the number of categories increases. Our work is motivated by microbiome data sets obtained by amplicon sequencing of the bacterial 16S rRNA gene. These data have a compositional structure and are typically overdispersed. The microbiome data set is from an epidemiological study carried out in a helminth-endemic area in Indonesia. The conclusions are as follows: time has no statistically significant effect on microbiome composition, the correlation between subjects is statistically significant, and treatment has a significant effect on the microbiome composition only in infected subjects who remained infected.
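The equivalence mentioned in the abstract can be illustrated numerically in its simplest form. The sketch below is not the authors' implementation: it drops covariates and random effects entirely and assumes independent negative binomial counts that share a common success probability `p` (an assumption made here for illustration). Under that assumption, conditioning on the total count gives the Dirichlet-multinomial probability mass function, and the result does not depend on `p`. The function names `dm_logpmf`, `nb_logpmf`, and `nb_conditional_logpmf` are hypothetical.

```python
from math import lgamma, log, isclose


def dm_logpmf(x, alpha):
    """Log-pmf of the Dirichlet-multinomial with concentration parameters alpha."""
    n = sum(x)
    a = sum(alpha)
    out = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    for xj, aj in zip(x, alpha):
        out += lgamma(xj + aj) - lgamma(aj) - lgamma(xj + 1)
    return out


def nb_logpmf(x, r, p):
    """Log-pmf of a negative binomial with shape r and success probability p."""
    return lgamma(x + r) - lgamma(r) - lgamma(x + 1) + r * log(p) + x * log(1 - p)


def nb_conditional_logpmf(x, alpha, p):
    """Log-pmf of independent NB(alpha_j, p) counts, conditioned on their total.

    Assumes a common p across categories (illustrative assumption), so the
    total is itself NB(sum(alpha), p).
    """
    joint = sum(nb_logpmf(xj, aj, p) for xj, aj in zip(x, alpha))
    total = nb_logpmf(sum(x), sum(alpha), p)
    return joint - total


# Made-up counts for four categories; the two log-pmfs agree for any p in (0, 1).
counts = [3, 0, 5, 2]
alpha = [0.5, 1.2, 2.0, 0.8]
for p in (0.2, 0.5, 0.9):
    assert isclose(dm_logpmf(counts, alpha),
                   nb_conditional_logpmf(counts, alpha, p), abs_tol=1e-9)
```

In a regression setting the concentration parameters would depend on covariates and random effects; the sketch only checks the distributional identity for fixed parameters.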