The Naïve Bayes classifier++ for metagenomic taxonomic classification-query evaluation.

Authors

Duan Haozhe Neil, Hearne Gavin, Polikar Robi, Rosen Gail L

Affiliations

Ecological and Evolutionary Signal Processing and Informatics (EESI) Laboratory, Drexel University, Philadelphia, PA 19104, United States.

Signal Processing and Pattern Recognition Laboratory, Electrical and Computer Engineering, Rowan University, Glassboro, NJ 08018, United States.

Publication information

Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae743.

DOI:10.1093/bioinformatics/btae743

PMID:39700412

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11729721/

Abstract

MOTIVATION

This study examines how the query performance of the NBC++ (Incremental Naive Bayes Classifier) program varies with k-mer canonicality, k-mer size, database, and input sample size. We demonstrate that both NBC++ and Kraken2 are influenced by database depth, with macro measures improving as depth increases. However, fully capturing the diversity of life, especially viruses, remains a challenge.

RESULTS

NBC++ can competitively profile the superkingdom content of metagenomic samples using a small training database. NBC++ spends less time training than Kraken2 and can use a fraction of the memory, but at the cost of longer query times. Major NBC++ enhancements include support for canonical k-mer storage (leading to significant storage savings) and adaptable, optimized memory allocation that accelerates query analysis and enables the software to run on nearly any system. Additionally, the output now includes log-likelihood values for each training genome, providing users with valuable confidence information.
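
To illustrate why canonical k-mer storage can save space, the following is a minimal Python sketch, not the NBC++ implementation. It assumes the common definition of a canonical k-mer as the lexicographically smaller of a k-mer and its reverse complement; the function names (`reverse_complement`, `canonical`, `count_kmers`) are hypothetical and chosen only for this example.

```python
from collections import Counter

# Translation table for complementing DNA bases (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    Assumption: this is the usual convention; NBC++ may define it differently.
    """
    return min(kmer, reverse_complement(kmer))


def count_kmers(seq: str, k: int, use_canonical: bool) -> Counter:
    """Count k-mers in seq, optionally merging each k-mer with its reverse complement."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if set(kmer) - set("ACGT"):
            continue
        counts[canonical(kmer) if use_canonical else kmer] += 1
    return counts


if __name__ == "__main__":
    seq = "ACGTTGCA"
    plain = count_kmers(seq, 3, use_canonical=False)
    canon = count_kmers(seq, 3, use_canonical=True)
    print(len(plain), "distinct k-mers without canonical form")
    print(len(canon), "distinct k-mers with canonical form")
```

Because a k-mer and its reverse complement share one entry, a table of canonical k-mers holds fewer distinct keys than a table of raw k-mers, which is the kind of storage saving the abstract describes. The actual saving in NBC++ depends on its on-disk format, which this sketch does not model.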

AVAILABILITY AND IMPLEMENTATION

Source code and Dockerfile are available at http://github.com/EESI/Naive_Bayes.