Sapoval Nicolae, Liu Yunxi, Curry Kristen D, Kille Bryce, Huang Wenyu, Kokroko Natalie, Nute Michael G, Tyshaieva Alona, Dilthey Alexander, Molloy Erin K, Treangen Todd J
Department of Computer Science, Rice University, Houston, TX 77005, USA.
Department of Computer Science, University of Maryland, College Park, MD 20742, USA.
bioRxiv. 2024 Aug 25:2024.06.01.596961. doi: 10.1101/2024.06.01.596961.
The advent of long-read sequencing of microbiomes necessitates the development of new taxonomic profilers tailored to long-read shotgun metagenomic datasets. Here, we introduce Lemur and Magnet, a pair of tools optimized for lightweight and accurate taxonomic profiling for long-read shotgun metagenomic datasets. Lemur is a marker-gene-based method that leverages an EM algorithm to reduce false positive calls while preserving true positives; Magnet is a whole-genome read-mapping-based method that provides detailed presence and absence calls for bacterial genomes. We demonstrate that Lemur and Magnet can run in minutes to hours on a laptop with 32 GB of RAM, even for large inputs, a crucial feature given the portability of long-read sequencing machines. Furthermore, the marker gene database used by Lemur is only 4 GB and contains information from over 300,000 RefSeq genomes. Lemur and Magnet are open-source and available at https://github.com/treangenlab/lemur and https://github.com/treangenlab/magnet.
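The abstract states that Lemur uses an EM algorithm to suppress false positive calls but does not give details. As a rough illustration of the general idea only, the sketch below runs a textbook mixture-model EM over a read-by-taxon likelihood matrix. It is not Lemur's actual implementation: the function name `em_abundance`, its parameters, and the toy input are all hypothetical, and the likelihoods are assumed to come from some upstream alignment step.

```python
import numpy as np


def em_abundance(likelihoods, n_iter=200, tol=1e-8):
    """Estimate relative taxon abundances with a generic mixture-model EM.

    likelihoods: (n_reads, n_taxa) array; entry [i, k] is an assumed
    likelihood of read i given taxon k (0 where the read does not map).
    This is an illustrative sketch, not the Lemur implementation.
    """
    L = np.asarray(likelihoods, dtype=float)
    # Drop reads that map to no taxon at all; they carry no information.
    L = L[L.sum(axis=1) > 0]
    n_reads, n_taxa = L.shape

    # Start from a uniform abundance vector.
    theta = np.full(n_taxa, 1.0 / n_taxa)

    for _ in range(n_iter):
        # E-step: responsibility of each taxon for each read.
        weighted = L * theta
        resp = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: abundance is the mean responsibility per taxon.
        new_theta = resp.sum(axis=0) / n_reads

        # Stop once the estimate has essentially stopped changing.
        converged = np.abs(new_theta - theta).sum() < tol
        theta = new_theta
        if converged:
            break

    return theta


if __name__ == "__main__":
    # Toy input: three reads map only to taxon 0, one read is ambiguous.
    toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]
    print(em_abundance(toy))
```

In the toy input, taxon 1 is supported only by a read that taxon 0 explains equally well, so successive iterations shift that read's weight to taxon 0 and the estimated abundance of taxon 1 shrinks toward zero. This is the sense in which an EM step can remove a spurious call while keeping a well-supported one.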