Mao Jialiang, Ma Li
Department of Statistical Science, Duke University.
Ann Appl Stat. 2022 Sep;16(3):1476-1499. doi: 10.1214/21-aoas1552. Epub 2022 Jul 19.
The study of the human microbiome has attracted substantial interest in recent years, and a common task in analyzing these data is to cluster microbiome compositions into subtypes. This subdivision of samples into subgroups serves as an intermediate step toward personalized diagnosis and treatment. In applying existing clustering methods to modern microbiome studies, including the American Gut Project (AGP) data, we found that this seemingly standard task is in fact very challenging in the microbiome composition context because of several key features of such data. Standard distance-based clustering algorithms generally do not produce reliable results because they do not take into account the heterogeneity of cross-sample variability among the bacterial taxa, while existing model-based approaches are not flexible enough to distinguish complex within-cluster variation from cross-cluster variation. Direct application of such methods generally leads to overly dispersed clusters in the AGP data, and this phenomenon is common in other microbiome data. To overcome these challenges, we introduce Dirichlet-tree multinomial mixtures (DTMM) as a Bayesian generative model for clustering amplicon sequencing data in microbiome studies. DTMM models the microbiome population with a mixture of Dirichlet-tree kernels that utilize the phylogenetic tree to offer a more flexible covariance structure for characterizing within-cluster variation, and it provides a means of identifying a subset of signature taxa that distinguish the clusters. We perform extensive simulation studies to evaluate the performance of DTMM and compare it to state-of-the-art model-based and distance-based clustering methods in the microbiome context, and we carry out a validation study on a publicly available longitudinal data set to confirm the biological relevance of the clusters.
Finally, we report a case study on fecal data from the AGP to identify compositional clusters among individuals with inflammatory bowel disease and diabetes. Among our most interesting findings is that enterotypes (i.e., gut microbiome clusters) are not always defined by the most dominant species, as previous analyses had assumed, but can involve a number of less abundant OTUs that cannot be identified with existing distance-based and model-based approaches.
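To make the generative idea behind a Dirichlet-tree multinomial mixture more concrete, the following is a minimal simulation sketch and not the authors' implementation. It assumes a hypothetical four-leaf tree (`TREE`, `LEAVES`), hypothetical per-node Dirichlet parameters (`ALPHAS`), and illustrative function names. Each sample picks a cluster, draws a branching split at every internal node from that cluster's Dirichlet distribution, multiplies the splits along each root-to-leaf path to get OTU probabilities, and then draws multinomial counts. Here the two clusters differ only at the root split, loosely mirroring the idea that a subset of nodes distinguishes the clusters.

```python
import numpy as np

# Hypothetical toy phylogenetic tree: internal node -> children.
# Listed parent-before-child so a single pass can propagate probabilities.
TREE = {"root": ["A", "B"], "A": ["otu1", "otu2"], "B": ["otu3", "otu4"]}
LEAVES = ["otu1", "otu2", "otu3", "otu4"]

# Hypothetical Dirichlet parameters per cluster and internal node.
# The clusters differ only at the root split (an assumed "signature" node).
ALPHAS = [
    {"root": [8.0, 2.0], "A": [5.0, 5.0], "B": [5.0, 5.0]},
    {"root": [2.0, 8.0], "A": [5.0, 5.0], "B": [5.0, 5.0]},
]


def sample_leaf_probs(alpha, rng):
    """Draw OTU probabilities from a Dirichlet-tree distribution."""
    probs = {"root": 1.0}
    for node, children in TREE.items():
        split = rng.dirichlet(alpha[node])  # branching probabilities at this node
        for child, s in zip(children, split):
            probs[child] = probs[node] * s  # product along the root-to-leaf path
    p = np.array([probs[leaf] for leaf in LEAVES])
    return p / p.sum()  # guard against floating-point drift


def sample_dtmm(n_samples, depth, weights, alphas, rng):
    """Simulate cluster labels and OTU count vectors from the mixture."""
    labels = rng.choice(len(weights), size=n_samples, p=weights)
    counts = np.stack(
        [rng.multinomial(depth, sample_leaf_probs(alphas[k], rng)) for k in labels]
    )
    return labels, counts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels, counts = sample_dtmm(6, 1000, [0.5, 0.5], ALPHAS, rng)
    for k, row in zip(labels, counts):
        print(k, row)
```

Because every sample draws its own splits, counts within a cluster are overdispersed relative to a plain multinomial, and the amount of dispersion can differ from node to node, which is the kind of flexibility in within-cluster variation the abstract attributes to Dirichlet-tree kernels. Inference (recovering the labels and the distinguishing nodes from observed counts) is not shown.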