Duan Haozhe Neil, Hearne Gavin, Polikar Robi, Rosen Gail L
Ecological and Evolutionary Signal Processing and Informatics (EESI) Laboratory, Drexel University, Philadelphia, PA 19104, United States.
Signal Processing and Pattern Recognition Laboratory, Electrical and Computer Engineering, Rowan University, Glassboro, NJ 08018, United States.
Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae743.
This study examines how the query performance of NBC++ (an incremental Naive Bayes Classifier) varies with the use of canonical k-mers, k-mer size, database, and input sample size. We demonstrate that both NBC++ and Kraken2 are influenced by database depth, with macro measures improving as depth increases. However, fully capturing the diversity of life, especially that of viruses, remains a challenge.
NBC++ can competitively profile the superkingdom content of metagenomic samples using a small training database. NBC++ spends less time training and can use a fraction of the memory that Kraken2 requires, but at the cost of longer query times. Major NBC++ enhancements include support for canonical k-mer storage (leading to significant storage savings) and adaptable, optimized memory allocation that accelerates query analysis and enables the software to run on nearly any system. Additionally, the output now includes log-likelihood values for each training genome, providing users with valuable confidence information.
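As an illustration of the two ideas above, the following minimal Python sketch counts canonical k-mers (a k-mer and its reverse complement share one key, which is where the storage savings come from) and scores a read against each training genome with a naive Bayes log-likelihood. It is not the NBC++ implementation: the function names, the toy sequences, and the add-one smoothing are assumptions made for illustration only.

```python
from collections import Counter
from math import log

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(seq: str, k: int, use_canonical: bool = True) -> Counter:
    """Count k-mers in seq, skipping any window with non-ACGT characters."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer) if use_canonical else kmer] += 1
    return counts


def vocabulary_size(k: int, use_canonical: bool = True) -> int:
    """Number of distinct keys: 4^k, or roughly half of that in canonical form."""
    if not use_canonical:
        return 4 ** k
    # Only even-length k-mers can equal their own reverse complement.
    self_paired = 4 ** (k // 2) if k % 2 == 0 else 0
    return (4 ** k + self_paired) // 2


def log_likelihood(read_counts: Counter, genome_counts: Counter, k: int,
                   use_canonical: bool = True) -> float:
    """Naive Bayes log-likelihood of a read given one genome's k-mer profile.

    Add-one (Laplace) smoothing is an assumption of this sketch.
    """
    total = sum(genome_counts.values()) + vocabulary_size(k, use_canonical)
    return sum(n * log((genome_counts.get(kmer, 0) + 1) / total)
               for kmer, n in read_counts.items())


# Hypothetical toy "training genomes" and read, for illustration only.
K = 3
genomes = {"g1": "ACGTACGTACGTAAAC", "g2": "GGGCCCGGGCCCTTTG"}
profiles = {name: count_kmers(seq, K) for name, seq in genomes.items()}

read = "ACGTACG"
read_counts = count_kmers(read, K)
scores = {name: log_likelihood(read_counts, profile, K)
          for name, profile in profiles.items()}
best = max(scores, key=scores.get)
```

Here `scores` holds one log-likelihood per training genome, mirroring the kind of per-genome confidence information described above, and `best` is the genome with the highest value. Because canonical counting maps both strands to the same keys, a sequence and its reverse complement produce identical profiles.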
Source code and Dockerfile are available at http://github.com/EESI/Naive_Bayes.