Qi Changlu, Cai Yiting, He Guoyou, Qian Kai, Guo Mian, Cheng Liang
College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, HL, China.
Department of Neurosurgery, The Second Affiliated Hospital, Harbin Medical University, Harbin, HL, China.
Gut Microbes. 2025 Dec;17(1):2552347. doi: 10.1080/19490976.2025.2552347. Epub 2025 Sep 1.
Gut microbiota play a crucial role in host physiology, yet the high sparsity of microbiome data, marked by numerous zeros in count matrices, poses major analytical challenges. To address this, we developed mbSparse, an imputation algorithm that relies on deep learning rather than traditional predefined count distributions. mbSparse combines a feature autoencoder, which learns sample representations, with a conditional variational autoencoder (CVAE), which reconstructs the data. Our results show that mbSparse achieves high accuracy, with mean squared error reductions of up to 4.1 compared to existing microbiome methods, even in the presence of outlier samples and varying sequencing depths. In colorectal cancer analysis, mbSparse increases the number of validated disease-associated taxa detected from 7 to 27 and improves predictive accuracy, with the area under the precision-recall curve rising from 0.85 to 0.93. Additionally, mbSparse addresses non-biological zeros, restoring over 88% of removed counts and achieving a Pearson correlation of 0.9354 at a 10% removal rate while preserving essential taxonomic relationships. Finally, our exploration of mbSparse variants shows that the CVAE is critical for accuracy, providing insights for further optimizing microbiome data imputation.
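The removal-and-recovery evaluation mentioned in the abstract (zeroing a fraction of observed counts, imputing, then comparing against the held-out values) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synthetic count matrix, the function names `mask_nonzero`, `placeholder_impute`, and `evaluate`, the log1p scale for the correlation, and the reading of "restored" as "imputed to a nonzero value". The imputer is a deliberately simple stand-in (per-taxon mean of nonzero counts), not mbSparse, and the abstract does not specify how the authors computed their metrics.

```python
import numpy as np

def mask_nonzero(counts, rate, rng):
    """Zero out a random fraction `rate` of the nonzero entries (simulated dropout)."""
    rows, cols = np.nonzero(counts)
    n_mask = int(round(rate * rows.size))
    pick = rng.choice(rows.size, size=n_mask, replace=False)
    masked = counts.copy()
    masked[rows[pick], cols[pick]] = 0
    return masked, (rows[pick], cols[pick])

def placeholder_impute(masked):
    """Stand-in imputer (NOT mbSparse): fill zeros with each taxon's mean nonzero count."""
    filled = masked.astype(float)
    for j in range(filled.shape[1]):
        col = filled[:, j]          # view into `filled`, so writes persist
        nonzero = col[col > 0]
        if nonzero.size:
            col[col == 0] = nonzero.mean()
    return filled

def evaluate(truth, imputed, idx):
    """Return (fraction of masked entries imputed as nonzero, Pearson r on log1p scale)."""
    true_vals = truth[idx].astype(float)
    pred_vals = imputed[idx]
    restored = float(np.mean(pred_vals > 0))
    r = float(np.corrcoef(np.log1p(true_vals), np.log1p(pred_vals))[0, 1])
    return restored, r

rng = np.random.default_rng(0)
# Hypothetical sparse count matrix: 50 samples x 30 taxa, gamma-Poisson draws.
taxon_scale = rng.gamma(shape=2.0, scale=5.0, size=30)
counts = rng.poisson(rng.gamma(shape=0.5, scale=taxon_scale, size=(50, 30)))

masked, idx = mask_nonzero(counts, rate=0.10, rng=rng)  # 10% removal rate
imputed = placeholder_impute(masked)
restored, r = evaluate(counts, imputed, idx)
print(f"restored fraction: {restored:.3f}, Pearson r: {r:.3f}")
```

Swapping `placeholder_impute` for a real imputation method leaves the masking and scoring steps unchanged, which is the point of keeping them separate.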