Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States.
Stowers Institute for Medical Research, 1000 E 50th St, Kansas City, MO 64110, United States.
Brief Bioinform. 2024 Jul 25;25(5). doi: 10.1093/bib/bbae424.
Accurate taxonomic profiling of microbial taxa in a metagenomic sample is vital for gaining insight into microbial ecology. Recent advances in sequencing technologies have contributed tremendously to understanding these microbes at species resolution through a whole-genome shotgun metagenomic approach. In this study, we developed a new bioinformatics tool, coverage-based analysis for identification of microbiome (CAIM), for accurate taxonomic classification and quantification in both long- and short-read metagenomic samples using an alignment-based method. CAIM relies on two different containment techniques to identify species in metagenomic samples, using their genome coverage information, rather than the traditional relative-abundance approach, to filter out false positives. In addition, we propose a nucleotide-count-based abundance estimation, which yields a lower root mean square error than the traditional read-count approach. We evaluated the performance of CAIM on 28 metagenomic mock communities and 2 synthetic datasets by comparing it with other top-performing tools. CAIM performed consistently better than the other tools across datasets, both in identifying microbial taxa and in estimating relative abundances. CAIM was then applied to a real dataset sequenced on both Nanopore (with and without amplification) and Illumina sequencing platforms, and the resulting taxonomic profiles were highly similar between platforms. Lastly, CAIM was applied to fecal shotgun metagenomic datasets of 232 colorectal cancer patients and 229 controls obtained from 4 different countries, and of 44 primary liver cancer patients and 76 controls. Models using the genome-coverage cutoff outperformed those using relative-abundance cutoffs in discriminating colorectal cancer and primary liver cancer patients from healthy controls, and did so with high-confidence species markers.
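The two ideas in the abstract, filtering species by genome coverage and estimating abundance from aligned nucleotides rather than read counts, can be illustrated with a minimal Python sketch. This is not CAIM's implementation: the function names, the input layout (one `(species, start, end)` tuple per aligned read, half-open coordinates), the use of breadth of coverage as the coverage measure, and the `min_coverage` default of 0.1 are all assumptions made for illustration.

```python
from collections import defaultdict


def breadth_of_coverage(intervals, genome_length):
    """Fraction of genome positions covered by at least one alignment.

    `intervals` holds half-open (start, end) alignment coordinates.
    """
    covered = 0
    current_end = 0  # rightmost position already counted
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


def profile(alignments, genome_lengths, min_coverage=0.1):
    """Return (read_count_abundance, nucleotide_count_abundance).

    Hypothetical helper: species whose breadth of coverage falls below
    `min_coverage` (an assumed cutoff) are dropped as likely false positives.
    """
    per_species = defaultdict(list)
    for species, start, end in alignments:
        per_species[species].append((start, end))

    kept = {
        species: intervals
        for species, intervals in per_species.items()
        if breadth_of_coverage(intervals, genome_lengths[species]) >= min_coverage
    }

    # Read-count approach: every aligned read contributes 1.
    read_counts = {sp: len(iv) for sp, iv in kept.items()}
    # Nucleotide-count approach: every aligned read contributes its aligned length.
    nt_counts = {sp: sum(end - start for start, end in iv) for sp, iv in kept.items()}

    def normalize(counts):
        total = sum(counts.values())
        return {sp: value / total for sp, value in counts.items()} if total else {}

    return normalize(read_counts), normalize(nt_counts)


# Toy example with made-up numbers: species C has only one short hit.
alignments = [
    ("A", 0, 100), ("A", 50, 150), ("A", 100, 400),
    ("B", 0, 50),
    ("C", 0, 10),
]
genome_lengths = {"A": 1000, "B": 100, "C": 1000}
read_abundance, nt_abundance = profile(alignments, genome_lengths)
```

In this toy input, species C covers only 1% of its genome and is filtered out regardless of how many reads the sample contains overall, which is the intuition behind preferring a coverage cutoff to a relative-abundance cutoff. The two abundance estimates also diverge for A and B because the reads differ in aligned length, a situation that is common with long-read data.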