MNBC: a multithreaded Minimizer-based Naïve Bayes Classifier for improved metagenomic sequence classification.

Author information

National Centre for Animal Disease, Canadian Food Inspection Agency, Lethbridge County, AB, T1J 5R7, Canada.

Saskatoon Research and Development Centre, Agriculture and Agri-Food Canada, Saskatoon, SK, S7N 0X2, Canada.

Publication information

Bioinformatics. 2024 Oct 1;40(10). doi: 10.1093/bioinformatics/btae601.

DOI: 10.1093/bioinformatics/btae601

PMID: 39388213

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11522871/

Abstract

MOTIVATION

State-of-the-art tools for classifying metagenomic sequencing reads provide both rapid and accurate options, although the combination of both in a single tool is a constantly improving area of research. The machine learning-based Naïve Bayes Classifier (NBC) approach provides a theoretical basis for accurate classification of all reads in a sample.

RESULTS

We developed the multithreaded Minimizer-based Naïve Bayes Classifier (MNBC) tool to improve the NBC approach by applying minimizers, as well as plurality voting for closely related classification scores. A standard reference- and test-sequence framework using simulated variable-length reads benchmarked MNBC with six other state-of-the-art tools: MetaMaps, Ganon, Kraken2, KrakenUniq, CLARK, and Centrifuge. We also applied MNBC to the "marine" and "strain-madness" short-read metagenomic datasets in the Critical Assessment of Metagenome Interpretation (CAMI) II challenge using a corresponding database from the time. MNBC efficiently identified reads from unknown microorganisms, and exhibited the highest species- and genus-level precision and recall on short reads, as well as the highest species-level precision on long reads. It also achieved the highest accuracy on the "strain-madness" dataset.
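To make the term "minimizer" concrete, here is a minimal Python sketch of (w, k)-minimizer extraction. It is not taken from MNBC: the function name `minimizers`, the parameters `k` and `w`, and the use of plain lexicographic ordering are assumptions chosen for illustration. Real tools typically order k-mers by a hash and may also account for reverse complements.

```python
def minimizers(seq: str, k: int, w: int) -> set[str]:
    """Return the set of (w, k)-minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer. Lexicographic order is an illustrative assumption.
    """
    # All overlapping k-mers of the sequence, in order.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if len(kmers) < w:
        # Too short for a full window: keep a single minimizer, if any k-mer exists.
        return {min(kmers)} if kmers else set()
    # Slide a window of w k-mers and keep the smallest k-mer of each window.
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


read = "ACGTTGCA"
print(sorted(minimizers(read, k=3, w=3)))  # ['ACG', 'CGT', 'GCA', 'GTT']
```

Because adjacent windows often share the same smallest k-mer, a sequence is usually represented by fewer minimizers than k-mers, which is what makes minimizer-based indexes compact.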

AVAILABILITY AND IMPLEMENTATION

MNBC is freely available at: https://github.com/ComputationalPathogens/MNBC.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2be5/11522871/0cc800709953/btae601f1.jpg

Similar articles

MNBC: a multithreaded Minimizer-based Naïve Bayes Classifier for improved metagenomic sequence classification.

Bioinformatics. 2024 Oct 1;40(10). doi: 10.1093/bioinformatics/btae601.

HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data.

PLoS Comput Biol. 2022 Sep 30;18(9):e1010493. doi: 10.1371/journal.pcbi.1010493. eCollection 2022 Sep.

NBC: the Naive Bayes Classification tool webserver for taxonomic classification of metagenomic reads.

Bioinformatics. 2011 Jan 1;27(1):127-9. doi: 10.1093/bioinformatics/btq619. Epub 2010 Nov 8.

Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps.

Nat Commun. 2019 Jul 11;10(1):3066. doi: 10.1038/s41467-019-10934-2.

Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets.

BMC Bioinformatics. 2022 Dec 13;23(1):541. doi: 10.1186/s12859-022-05103-0.

Sketching and sampling approaches for fast and accurate long read classification.

BMC Bioinformatics. 2022 Oct 31;23(1):452. doi: 10.1186/s12859-022-05014-0.

Mora: abundance aware metagenomic read re-assignment for disentangling similar strains.

BMC Bioinformatics. 2024 Apr 23;25(1):161. doi: 10.1186/s12859-024-05768-9.

Bayexer: an accurate and fast Bayesian demultiplexer for Illumina sequences.

Bioinformatics. 2015 Dec 15;31(24):4000-2. doi: 10.1093/bioinformatics/btv501. Epub 2015 Aug 26.

CSSSCL: a python package that uses combined sequence similarity scores for accurate taxonomic classification of long and short sequence reads.

Bioinformatics. 2016 Feb 1;32(3):453-5. doi: 10.1093/bioinformatics/btv587. Epub 2015 Oct 9.

ganon: precise metagenomics classification against large and up-to-date sets of reference sequences.

Bioinformatics. 2020 Jul 1;36(Suppl_1):i12-i20. doi: 10.1093/bioinformatics/btaa458.

Cited by

Interpretive machine learning predicts short-term mortality risk in elderly sepsis patients.

Front Physiol. 2025 Mar 26;16:1549138. doi: 10.3389/fphys.2025.1549138. eCollection 2025.

References

MT-MAG: Accurate and interpretable machine learning for complete or partial taxonomic assignments of metagenome-assembled genomes.

PLoS One. 2023 Aug 18;18(8):e0283536. doi: 10.1371/journal.pone.0283536. eCollection 2023.

Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4.

Nat Biotechnol. 2023 Nov;41(11):1633-1644. doi: 10.1038/s41587-023-01688-w. Epub 2023 Feb 23.

Critical Assessment of Metagenome Interpretation: the second round of challenges.

Nat Methods. 2022 Apr;19(4):429-440. doi: 10.1038/s41592-022-01431-4. Epub 2022 Apr 8.

mOTUs: Profiling Taxonomic Composition, Transcriptional Activity and Strain Populations of Microbial Communities.

Curr Protoc. 2021 Aug;1(8):e218. doi: 10.1002/cpz1.218.

DeepMicrobes: taxonomic classification for metagenomics with deep learning.

NAR Genom Bioinform. 2020 Feb 19;2(1):lqaa009. doi: 10.1093/nargab/lqaa009. eCollection 2020 Mar.

Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life.

BMC Bioinformatics. 2020 Sep 21;21(1):412. doi: 10.1186/s12859-020-03744-7.

ganon: precise metagenomics classification against large and up-to-date sets of reference sequences.

Bioinformatics. 2020 Jul 1;36(Suppl_1):i12-i20. doi: 10.1093/bioinformatics/btaa458.

LEMMI: a continuous benchmarking platform for metagenomics classifiers.

Genome Res. 2020 Aug;30(8):1208-1216. doi: 10.1101/gr.260398.119. Epub 2020 Jul 2.

Improved metagenomic analysis with Kraken 2.

Genome Biol. 2019 Nov 28;20(1):257. doi: 10.1186/s13059-019-1891-0.

Benchmarking Metagenomics Tools for Taxonomic Classification.

Cell. 2019 Aug 8;178(4):779-794. doi: 10.1016/j.cell.2019.07.010.
