State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, 163 Xianlin Avenue, Qixia District, Nanjing 210000, China.
Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing Forestry University, 159 Panlong Road, Xuanwu District, Nanjing 210000, China.
Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae539.
Metagenomic analyses facilitate the exploration of the microbial world, advancing our understanding of microbial roles in ecological and biological processes. A pivotal aspect of metagenomic analysis involves assessing the quality of metagenome-assembled genomes (MAGs), crucial for accurate biological insights. Current machine learning-based methods often treat completeness and contamination prediction as separate tasks, overlooking their inherent relationship and limiting models' generalization. In this study, we present DeepCheck, a multitasking deep learning framework for simultaneous prediction of MAG completeness and contamination. DeepCheck consistently outperforms existing tools in accuracy across various experimental settings and demonstrates comparable speed while maintaining high predictive accuracy even for new lineages. Additionally, we employ interpretable machine learning techniques to identify specific genes and pathways that drive the model's predictions, enabling independent investigation and assessment of these biological elements for deeper insights.
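The multitask idea described in the abstract, one model predicting completeness and contamination together, can be illustrated with a minimal sketch. This is not DeepCheck's actual architecture, which the abstract does not specify. The class `MultiTaskSketch`, the function `joint_loss`, the layer sizes, the sigmoid scaling to a 0-100 range, and the `contamination_weight` parameter are all hypothetical choices made for illustration. The sketch only shows the general pattern of a shared representation feeding two task-specific heads that are scored by a single combined loss.

```python
import math
import random


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(weights, bias, x):
    """Dense layer: one weight row and one bias term per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiTaskSketch:
    """Hypothetical shared-trunk network with two regression heads."""

    def __init__(self, n_features, n_hidden=8, seed=0):
        rng = random.Random(seed)
        # Shared layer: both tasks read the same hidden representation.
        self.shared = init_layer(rng, n_hidden, n_features)
        # Task-specific heads, one scalar output each.
        self.completeness_head = init_layer(rng, 1, n_hidden)
        self.contamination_head = init_layer(rng, 1, n_hidden)

    def predict(self, features):
        hidden = relu(linear(*self.shared, features))
        # Assumption: both outputs are squashed into 0-100 percent.
        completeness = 100.0 * sigmoid(
            linear(*self.completeness_head, hidden)[0])
        contamination = 100.0 * sigmoid(
            linear(*self.contamination_head, hidden)[0])
        return completeness, contamination


def joint_loss(pred, target, contamination_weight=1.0):
    """Squared error on both tasks, summed into one training objective."""
    completeness_error = (pred[0] - target[0]) ** 2
    contamination_error = (pred[1] - target[1]) ** 2
    return completeness_error + contamination_weight * contamination_error


# Usage: a made-up 5-dimensional feature vector (e.g. gene-family counts).
model = MultiTaskSketch(n_features=5)
prediction = model.predict([1.0, 0.0, 2.0, 1.0, 0.0])
loss = joint_loss(prediction, target=(95.0, 2.0))
```

Because the two heads share the hidden layer, a gradient step on the combined loss would update the shared weights using signal from both tasks, which is the usual motivation for treating related targets jointly rather than training two separate models. The sketch omits training entirely and uses untrained random weights, so its predictions are not meaningful.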