Impact of database choice and confidence score on the performance of taxonomic classification using Kraken2.

Author information

Liu Yunlong, Ghaffari Morteza H, Ma Tao, Tu Yan

Affiliations

Key Laboratory of Feed Biotechnology of the Ministry of Agricultural and Rural Affairs, Institute of Feed Research, Chinese Academy of Agricultural Sciences, Beijing, 100081 China.

Institute of Animal Science, Physiology Unit, University of Bonn, Bonn, 53115 Germany.

Publication information

aBIOTECH. 2024 Jul 31;5(4):465-475. doi: 10.1007/s42994-024-00178-0. eCollection 2024 Dec.

Abstract

Accurate taxonomic classification is essential to understanding microbial diversity and function through metagenomic sequencing. However, this task is complicated by the vast variety of microbial genomes and the computational limitations of bioinformatics tools. The aim of this study was to evaluate the impact of reference database selection and confidence score (CS) settings on the performance of Kraken2, a widely used k-mer-based metagenomic classifier. We generated simulated metagenomic datasets to systematically evaluate how the choice of reference database, from the compact Minikraken v1 to the expansive nt and GTDB r202, and different CS values (from 0 to 1.0) affect the key performance metrics of Kraken2. These metrics include classification rate, precision, recall, F1 score, and the accuracy of estimated versus true bacterial abundance. Our results show that a higher CS, which makes taxonomic classification more stringent by requiring greater k-mer agreement, generally decreases the classification rate. This effect is particularly pronounced for smaller databases such as Minikraken and Standard-16, where no reads could be classified when the CS was above 0.4. In contrast, for larger databases such as Standard, nt, and GTDB r202, precision and F1 scores improved significantly with increasing CS, highlighting their robustness to stringent conditions. Recovery rates were mostly stable, indicating consistent detection of species under different CS settings. Crucially, the results show that a comprehensive reference database combined with a moderate CS (0.2 or 0.4) significantly improves classification accuracy and sensitivity. This finding underscores the need to select database and CS parameters carefully, tailored to the specific scientific question and the available computational resources, to optimize the results of metagenomic analyses.
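
To make the role of the confidence score concrete, the following minimal Python sketch illustrates why a stricter threshold can lower the classification rate, and how precision, recall, and F1 can be computed from detected versus expected taxa. It is an illustration under stated assumptions, not the authors' pipeline: the scoring rule follows the description in the Kraken 2 manual (k-mers assigned within the clade rooted at a label, divided by the k-mers in the read, with the label moved up the tree until the threshold is met), and the toy taxonomy, k-mer counts, function names, and the presence/absence definition of the metrics are all hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def clade_kmer_count(taxon: str, kmer_hits: Dict[str, int],
                     parent: Dict[str, Optional[str]]) -> int:
    """Count k-mers assigned to `taxon` or to any of its descendants."""
    total = 0
    for hit_taxon, count in kmer_hits.items():
        node: Optional[str] = hit_taxon
        while node is not None:
            if node == taxon:
                total += count
                break
            node = parent.get(node)
    return total


def apply_confidence(label: str, kmer_hits: Dict[str, int],
                     parent: Dict[str, Optional[str]],
                     total_kmers: int, threshold: float) -> Optional[str]:
    """Move the label up the tree until its clade score meets the threshold.

    Returns None when no ancestor qualifies (the read is unclassified).
    """
    node: Optional[str] = label
    while node is not None:
        score = clade_kmer_count(node, kmer_hits, parent) / total_kmers
        if score >= threshold:
            return node
        node = parent.get(node)
    return None


def precision_recall_f1(predicted: Set[str],
                        truth: Set[str]) -> Tuple[float, float, float]:
    """Presence/absence metrics over sets of taxa (an assumed definition)."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy taxonomy (child -> parent) and k-mer hits for one read.
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Bacteria",
    "Bacteria": "root",
    "root": None,
}
# 10 k-mers in the read; 6 hit the database, 4 have no match.
KMER_HITS = {"Escherichia coli": 3, "Escherichia": 2, "Bacteria": 1}
TOTAL_KMERS = 10

if __name__ == "__main__":
    for cs in (0.0, 0.2, 0.4, 0.55, 0.8):
        label = apply_confidence("Escherichia coli", KMER_HITS, PARENT,
                                 TOTAL_KMERS, cs)
        print(f"CS={cs}: {label or 'unclassified'}")
```

In this toy read, the species-level label survives at a CS of 0.2, is pushed up to genus at 0.4, and is dropped entirely at 0.8. When a database covers few of a read's k-mers, the scores stay low at every rank, which is consistent with the loss of classified reads reported for the smaller databases.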

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1007/s42994-024-00178-0.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/28d7/11624175/6eb5855c2455/42994_2024_178_Fig1_HTML.jpg
