Department of Computer Science and Technology and Institute of Artificial Intelligence, Tsinghua University, Beijing 100084, China; Sogou Inc., Beijing 100084, China.
Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing 100005, China; Department of Ultrasound, Peking Union Medical College Hospital, Beijing 100005, China.
Genomics Proteomics Bioinformatics. 2021 Oct;19(5):834-847. doi: 10.1016/j.gpb.2020.06.015. Epub 2021 Feb 17.
Identification of significant biological relationships or patterns is central to many metagenomic studies. Methods that estimate association networks have been proposed for this purpose; however, they assume that associations are static, neglecting the fact that relationships in a microbial ecosystem may vary with changes in environmental factors (EFs), which can result in inaccurate estimations. Therefore, in this study, we propose a computational model, called the k-Lognormal-Dirichlet-Multinomial (kLDM) model, which estimates multiple association networks that correspond to specific environmental conditions, and simultaneously infers microbe-microbe and EF-microbe associations for each network. The effectiveness of the kLDM model was demonstrated on synthetic data, a colorectal cancer (CRC) dataset, the Tara Oceans dataset, and the American Gut Project dataset. The results revealed that the widely used Spearman's rank correlation coefficient method performed much worse than the other methods, indicating the importance of separating samples by environmental conditions. Cancer fecal samples were then compared with cancer-free samples, and the networks estimated by kLDM exhibited fewer associations among microbes but stronger associations between specific bacteria, especially five CRC-associated operational taxonomic units, indicating gut microbe translocation in cancer patients. Some EF-dependent associations were then found within a marine eukaryotic community. Finally, the gut microbial heterogeneity of inflammatory bowel disease patients was detected. These results demonstrate that kLDM can elucidate the complex associations within microbial ecosystems. The kLDM program and the R and Python scripts, together with all experimental datasets, are accessible at https://github.com/tinglab/kLDM.git.
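The point about separating samples by environmental condition can be illustrated with a toy simulation. The sketch below is not kLDM and does not use any of the datasets above; it assumes two hypothetical taxa whose log-abundances are positively associated under one condition and negatively associated under another, and it assumes the condition labels are known in advance (kLDM, by contrast, infers the groups). All names, sample sizes, and noise levels are illustrative assumptions.

```python
import numpy as np

def spearman(a, b):
    """Spearman's rho computed as the Pearson correlation of ranks.

    Assumes continuous data with no ties, which holds for the
    simulated values below.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

rng = np.random.default_rng(0)
n = 200  # hypothetical number of samples per condition

# Condition A: the two taxa rise and fall together.
x_a = rng.normal(size=n)
y_a = x_a + rng.normal(scale=0.1, size=n)

# Condition B: the association has the opposite sign.
x_b = rng.normal(size=n)
y_b = -x_b + rng.normal(scale=0.1, size=n)

rho_a = spearman(x_a, y_a)
rho_b = spearman(x_b, y_b)

# Pooling ignores the condition, so the opposite associations cancel.
rho_pooled = spearman(np.concatenate([x_a, x_b]),
                      np.concatenate([y_a, y_b]))

print(f"condition A: {rho_a:+.2f}")
print(f"condition B: {rho_b:+.2f}")
print(f"pooled:      {rho_pooled:+.2f}")
```

Under these assumptions, the within-condition coefficients are strongly positive and strongly negative, while the pooled coefficient is close to zero, so a single static network estimated from all samples would miss both associations.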