School of Mathematics and Statistics, Qingdao University, Qingdao, China.
J Mol Biol. 2024 Dec 1;436(23):168841. doi: 10.1016/j.jmb.2024.168841. Epub 2024 Oct 26.
Microbiome research has increasingly underscored the link between microbial composition and human health, with numerous studies reporting strong correlations between microbiome characteristics and various diseases. However, the analysis of microbiome data is frequently compromised by inherent sparsity, that is, a large proportion of observed zeros. These zeros not only skew the abundance distribution of microbial species but also undermine the reliability of scientific conclusions drawn from such data. To address this challenge, we introduce GEMimp, an imputation method designed to make microbiome data analysis more robust. GEMimp leverages the node2vec algorithm, whose random-walk sampling process combines Breadth-First Search (BFS) and Depth-First Search (DFS) strategies. This approach enables GEMimp to learn low-dimensional representations of each taxonomic unit, facilitating accurate reconstruction of their similarity networks. We compare GEMimp against state-of-the-art imputation methods including SAVER, MAGIC and mbImpute. The results show that GEMimp outperforms these methods, achieving the highest Pearson correlation coefficient with the original raw dataset. Furthermore, GEMimp shows notable proficiency in identifying significant taxa, enhancing the detection of disease-related taxa and effectively mitigating the impact of sparsity on both simulated and real-world datasets, such as those pertaining to Type 2 Diabetes (T2D) and Colorectal Cancer (CRC). Together, these findings highlight the effectiveness of GEMimp for the analysis of microbial data. By alleviating sparsity, GEMimp could greatly facilitate downstream analyses and microbiology research more broadly.
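The way node2vec blends BFS-like and DFS-like exploration can be illustrated with a minimal sketch of a single biased random walk. This is not GEMimp's implementation: the abstract does not say how the taxon similarity graph is built or which parameter values are used, so the toy graph, the taxon names, the helper `undirected`, and the values of `p` and `q` below are all hypothetical. In node2vec, the return parameter `p` controls how readily the walk steps back to the previous node, and the in-out parameter `q` controls whether it stays near the previous node (large `q`, BFS-like) or moves away from it (small `q`, DFS-like).

```python
import random


def undirected(edges):
    """Build a symmetric adjacency dict {node: {neighbor: weight}} (hypothetical helper)."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Sample one second-order biased random walk in the style of node2vec."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(graph[cur])
        if not nbrs:
            break  # dead end: stop early
        if len(walk) == 1:
            # First step: no previous node yet, so use the edge weights alone.
            weights = [graph[cur][x] for x in nbrs]
        else:
            prev = walk[-2]
            weights = []
            for x in nbrs:
                w = graph[cur][x]
                if x == prev:
                    weights.append(w / p)  # return to the previous node
                elif x in graph[prev]:
                    weights.append(w)      # stay close to prev (BFS-like)
                else:
                    weights.append(w / q)  # move away from prev (DFS-like)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Toy similarity graph over made-up taxa; weights are illustrative only.
graph = undirected([
    ("taxon_A", "taxon_B", 0.9),
    ("taxon_A", "taxon_C", 0.4),
    ("taxon_B", "taxon_C", 0.7),
    ("taxon_C", "taxon_D", 0.8),
    ("taxon_D", "taxon_E", 0.6),
])

if __name__ == "__main__":
    rng = random.Random(0)
    print("BFS-like:", node2vec_walk(graph, "taxon_A", 8, p=1.0, q=4.0, rng=rng))
    print("DFS-like:", node2vec_walk(graph, "taxon_A", 8, p=1.0, q=0.25, rng=rng))
```

In a full pipeline, many such walks would typically be fed to a skip-gram model to obtain the low-dimensional taxon embeddings; that step is omitted here.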