From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools.

Affiliations

Department of Pharmacology, Faculty of Medicine, Dalhousie University, Halifax, Canada.

Integrated Microbiome Resource (IMR), Dalhousie University, Halifax, Canada.

Publication information

Microb Genom. 2023 Mar;9(3). doi: 10.1099/mgen.0.000949.

DOI:10.1099/mgen.0.000949

PMID:36867161

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10132073/

Abstract

In metagenomic analyses of microbiomes, one of the first steps is usually the taxonomic classification of reads by comparison to a database of previously taxonomically classified genomes. While different studies comparing metagenomic taxonomic classification methods have determined that different tools are 'best', there are two tools that have been used the most to date: Kraken (k-mer-based classification against a user-constructed database) and MetaPhlAn (classification by alignment to clade-specific marker genes), the latest versions of which are Kraken2 and MetaPhlAn 3, respectively. We found large discrepancies in both the proportion of reads that were classified and the number of species that were identified when we used both Kraken2 and MetaPhlAn 3 to classify reads within metagenomes from human-associated or environmental datasets. We then investigated which of these tools would give classifications closest to the real composition of metagenomic samples using a range of simulated and mock samples and examined the combined impact of tool-parameter-database choice on the taxonomic classifications given. This revealed that there may not be a one-size-fits-all 'best' choice. While Kraken2 can achieve better overall performance, with higher precision, recall and F1 scores, as well as alpha- and beta-diversity measures closer to the known composition than MetaPhlAn 3, the computational resources required for this may be prohibitive for many researchers, and the default database and parameters should not be used. We therefore conclude that the best tool-parameter-database choice for a particular application depends on the scientific question of interest, which performance metric is most important for this question and the limit of available computational resources.
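The precision, recall and F1 scores mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a classifier's output and the known sample composition are each reduced to a set of species names. The function name `classification_scores` and the example species are hypothetical, chosen only for illustration, and are not taken from the study.

```python
def classification_scores(predicted, expected):
    """Return (precision, recall, F1) for predicted taxa versus a known composition."""
    predicted, expected = set(predicted), set(expected)
    # True positives: taxa that were both reported and actually present.
    true_pos = len(predicted & expected)
    # Precision: fraction of reported taxa that are really in the sample.
    precision = true_pos / len(predicted) if predicted else 0.0
    # Recall: fraction of taxa in the sample that were reported.
    recall = true_pos / len(expected) if expected else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical mock community and hypothetical classifier output.
known = {"Escherichia coli", "Bacillus subtilis",
         "Listeria monocytogenes", "Pseudomonas aeruginosa"}
called = {"Escherichia coli", "Bacillus subtilis", "Shigella flexneri"}

precision, recall, f1 = classification_scores(called, known)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this toy case, two of the three reported species are correct and two of the four true species are recovered. This shows how a tool can trade precision against recall depending on the database and parameters chosen.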

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cc17/10132073/9ff414fb4d26/mgen-9-00949-g001.jpg
