Department of Statistics, Iowa State University, Ames, IA 50011, USA.
Department of Energy, Joint Genome Institute, Berkeley, CA 94720, USA.
Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac782.
High-throughput sequencing technologies have greatly facilitated microbiome research and have generated a large volume of microbiome data with the potential to answer key questions regarding microbiome assembly, structure and function. Cluster analysis aims to group features that behave similarly across treatments; such grouping helps to highlight the functional relationships among features and may provide biological insights into microbiome networks. However, clustering microbiome data is challenging because the data are sparse and high-dimensional.
We propose a model-based clustering method based on Poisson hurdle models for sparse microbiome count data. We describe an expectation-maximization algorithm and a modified version using simulated annealing to conduct the cluster analysis. Moreover, we provide algorithms for initialization and for choosing the number of clusters. Simulation results demonstrate that our proposed methods provide better clustering results than alternative methods under a variety of settings. We also apply the proposed method to a sorghum rhizosphere microbiome dataset, which yields interesting biological findings.
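For readers unfamiliar with hurdle models, the sketch below shows the standard textbook form of a Poisson hurdle distribution: a zero occurs with some probability, and positive counts follow a zero-truncated Poisson. It is an illustration only and is not taken from the paper or from the PHclust package. The names `hurdle_poisson_pmf`, `hurdle_poisson_loglik`, `p_zero` and `lam` are assumptions made for this example, and the paper's actual parameterization (for instance, cluster-specific or treatment-specific parameters) may differ.

```python
import math


def hurdle_poisson_pmf(y: int, p_zero: float, lam: float) -> float:
    """Return P(Y = y) under a textbook Poisson hurdle model.

    p_zero: probability of observing a zero (the "hurdle" part), in (0, 1).
    lam:    rate of the zero-truncated Poisson for positive counts, > 0.
    Both parameter names are hypothetical, chosen for this sketch.
    """
    if y < 0:
        return 0.0
    if y == 0:
        return p_zero
    # Log of the ordinary Poisson pmf, computed on the log scale for stability.
    log_pois = -lam + y * math.log(lam) - math.lgamma(y + 1)
    # Zero-truncated Poisson: renormalize by P(Poisson > 0) = 1 - exp(-lam).
    return (1.0 - p_zero) * math.exp(log_pois) / (1.0 - math.exp(-lam))


def hurdle_poisson_loglik(counts, p_zero: float, lam: float) -> float:
    """Sum of log-probabilities of the observed counts under the model."""
    return sum(math.log(hurdle_poisson_pmf(y, p_zero, lam)) for y in counts)
```

In a model-based clustering setting, a log-likelihood of this kind would typically be evaluated per cluster inside the E-step of an expectation-maximization loop; how the paper does this in detail is not stated in the abstract.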
The R package PHclust is freely available for download at https://cran.r-project.org/package=PHclust.
Supplementary data are available at Bioinformatics online.