POSMM: an efficient alignment-free metagenomic profiler that complements alignment-based profiling.

Author information

Burks David J, Pusadkar Vaidehi, Azad Rajeev K

Affiliations

Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, 76203, USA.

Department of Mathematics, University of North Texas, Denton, TX, 76203, USA.

Publication information

Environ Microbiome. 2023 Mar 8;18(1):16. doi: 10.1186/s40793-023-00476-y.

DOI:10.1186/s40793-023-00476-y

PMID:36890583

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9993663/

Abstract

We present POSMM (pronounced 'Possum'), the Python-Optimized Standard Markov Model classifier, a new incarnation of the Markov model approach to metagenomic sequence analysis. Built on top of SMM, a rapid Markov model-based classification algorithm, POSMM reintroduces the high sensitivity associated with alignment-free taxonomic classifiers for probing whole-genome or metagenome datasets of increasingly prohibitive size. Logistic regression models, generated and optimized using the Python sklearn library, transform Markov model probabilities into scores suitable for thresholding. POSMM features a dynamic, database-free approach: models are generated directly from genome FASTA files on each run, making POSMM a valuable accompaniment to many other programs. Combining POSMM with ultrafast classifiers such as Kraken2 leverages their complementary strengths and yields higher overall accuracy in metagenomic sequence classification than either achieves as a standalone classifier. POSMM is a user-friendly and highly adaptable tool designed for broad use by the metagenome scientific community.
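
The scoring idea summarized in the abstract can be illustrated with a minimal sketch. This is not POSMM's code or interface: the function names, the model order of 3, the pseudocount, the toy genomes, and the logistic coefficients `WEIGHT` and `BIAS` are all hypothetical choices made for illustration. In particular, POSMM fits its logistic regression models with sklearn, whereas this sketch substitutes a fixed logistic function with hand-picked coefficients to show only the shape of the transformation from a Markov model likelihood to a score between 0 and 1.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_markov_model(genome, order=3, pseudocount=1.0):
    """Estimate log P(next base | previous `order` bases) from one genome string."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(len(genome) - order):
        context = genome[i:i + order]
        nxt = genome[i + order]
        # Skip windows containing ambiguous bases such as N.
        if set(context + nxt) <= set(ALPHABET):
            counts[context][nxt] += 1.0
    model = {}
    for context in ("".join(p) for p in product(ALPHABET, repeat=order)):
        total = sum(counts[context].values()) + pseudocount * len(ALPHABET)
        model[context] = {
            base: math.log((counts[context][base] + pseudocount) / total)
            for base in ALPHABET
        }
    return model


def score_read(read, model, order=3):
    """Average per-transition log-likelihood of a read under a trained model."""
    total, n = 0.0, 0
    for i in range(len(read) - order):
        context = read[i:i + order]
        nxt = read[i + order]
        if context in model and nxt in model[context]:
            total += model[context][nxt]
            n += 1
    if n == 0:
        raise ValueError("read has no scorable positions")
    return total / n


def logistic_score(x, weight, bias):
    """Map a raw log-likelihood score to (0, 1) so it can be thresholded."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


# Toy stand-ins for genome FASTA sequences (hypothetical data).
genome_a = "ACGT" * 50
genome_b = "AATT" * 50
model_a = train_markov_model(genome_a)
model_b = train_markov_model(genome_b)

read = "ACGTACGTACGT"
score_a = score_read(read, model_a)
score_b = score_read(read, model_b)

# Hypothetical coefficients; a real pipeline would fit these from labeled reads.
WEIGHT, BIAS = 5.0, 5.0
confidence_a = logistic_score(score_a, WEIGHT, BIAS)
confidence_b = logistic_score(score_b, WEIGHT, BIAS)
```

In this toy setup the read is drawn from the same repeat pattern as `genome_a`, so it scores higher under `model_a` than under `model_b`, and a threshold such as 0.5 on the logistic output would accept the first assignment and reject the second.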
