Zhou Huijuan, Chen Jun, Zhang Xianyang
Shanghai University of Finance and Economics, Shanghai, China.
Mayo Clinic, Rochester, Minnesota, USA.
bioRxiv. 2025 May 12:2025.05.08.652808. doi: 10.1101/2025.05.08.652808.
Microbiome sequencing data are inherently sparse and compositional, with excessive zeros arising from biological absence or insufficient sampling. These zeros pose significant challenges for downstream analyses, particularly those that require log-transformation. We introduce BMDD (BiModal Dirichlet Distribution), a novel probabilistic modeling framework for accurate imputation of microbiome sequencing data. Unlike existing imputation approaches that assume unimodal abundance, BMDD captures the bimodal abundance distribution of the taxa via a mixture of Dirichlet priors. It uses variational inference and a scalable expectation-maximization algorithm for efficient imputation. Through simulations and real microbiome datasets, we demonstrate that BMDD outperforms competing methods in reconstructing true abundances and improves the performance of differential abundance analysis. Through multiple posterior samples, BMDD enables robust inference by accounting for uncertainty in zero imputation. Our method offers a principled and computationally efficient solution for analyzing high-dimensional, zero-inflated microbiome sequencing data and is broadly applicable in microbial biomarker discovery and host-microbiome interaction studies. BMDD is available at: https://github.com/zhouhj1994/BMDD.
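The zero problem and the bimodal idea described in the abstract can be illustrated with a small simulation. The sketch below is a toy under stated assumptions, not the BMDD model or the API of the BMDD package. The function names, the two concentration values (0.05 and 5.0), the 50/50 mode probability, and the sequencing depth are all hypothetical choices made for illustration. It generates compositions whose Dirichlet concentration parameters switch between a low and a high mode, shows that the resulting counts contain many zeros, applies a naive pseudocount before a centred log-ratio transform as a baseline, and uses standard Dirichlet-multinomial conjugacy to draw several strictly positive posterior compositions for one sample.

```python
import numpy as np


def simulate_bimodal_counts(n_samples=50, n_taxa=20, depth=2000, seed=0):
    """Toy generator: each taxon in each sample is in a 'low' or 'high' mode."""
    rng = np.random.default_rng(seed)
    # Hypothetical concentration values for the two modes (not from the paper).
    alpha_low, alpha_high = 0.05, 5.0
    high_mode = rng.random((n_samples, n_taxa)) < 0.5
    alpha = np.where(high_mode, alpha_high, alpha_low)
    # Normalised independent gamma draws follow a Dirichlet(alpha) distribution.
    gammas = rng.gamma(shape=alpha, scale=1.0)
    proportions = gammas / gammas.sum(axis=1, keepdims=True)
    # Sequence every sample to the same fixed depth.
    counts = np.stack([rng.multinomial(depth, p) for p in proportions])
    return counts, proportions, alpha


def clr_with_pseudocount(counts, pseudocount=0.5):
    """Naive baseline: add a pseudocount, then centred log-ratio transform."""
    logged = np.log(counts + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)


def posterior_draws(counts_row, alpha_row, n_draws=5, seed=1):
    """Dirichlet-multinomial conjugacy: posterior is Dirichlet(alpha + counts)."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(alpha_row + counts_row, size=n_draws)


counts, proportions, alpha = simulate_bimodal_counts()
print("fraction of zero counts:", (counts == 0).mean())

clr = clr_with_pseudocount(counts)
draws = posterior_draws(counts[0], alpha[0])
print("posterior draws shape:", draws.shape)
```

In this toy the concentration parameters are known, so the posterior is available in closed form. In practice they have to be estimated from the data, which is the role the abstract assigns to variational inference and the expectation-maximization algorithm. Repeating a downstream analysis over several posterior draws, rather than a single imputed table, is one way to carry imputation uncertainty forward.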