School of Computing, National University of Singapore, Singapore, 117417, Republic of Singapore.
Agency for Science, Technology and Research (A*STAR), Genome Institute of Singapore (GIS), Singapore, 138672, Republic of Singapore.
BMC Bioinformatics. 2024 Apr 16;25(Suppl 1):153. doi: 10.1186/s12859-024-05760-3.
With the rapid increase in throughput of long-read sequencing technologies, recent studies have explored their potential for taxonomic classification by using alignment-based approaches to reduce the impact of higher sequencing error rates. While alignment-based methods are generally slower, k-mer-based taxonomic classifiers can overcome this limitation, potentially at the expense of lower sensitivity for strains and species that are not in the database.
We present MetageNN, a memory-efficient long-read taxonomic classifier that is robust to sequencing errors and missing genomes. MetageNN is a neural network model that uses short k-mer profiles of sequences to reduce the impact of distribution shifts on error-prone long reads. Benchmarking MetageNN against other machine learning approaches for taxonomic classification (GeNet) showed substantial improvements on long-read data (20% improvement in F1 score). On nanopore sequencing data, MetageNN exhibits improved sensitivity when the reference database is incomplete: at the read level, it surpasses the alignment-based tools MetaMaps and MEGAN-LR and the k-mer-based tool Kraken2 by 100%, 36%, and 23%, respectively. At the community level, MetageNN also consistently demonstrated higher sensitivity than these tools. Furthermore, MetageNN requires less than a quarter of the database storage used by Kraken2, MEGAN-LR, and MMseqs2, and is more than 7× faster than MetaMaps and GeNet and more than 2× faster than MEGAN-LR and MMseqs2.
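To make the notion of a short k-mer profile concrete, the following is a minimal sketch of how a read could be turned into a fixed-length vector of normalized k-mer frequencies. It is an illustration only, not MetageNN's actual feature extraction: the function name `kmer_profile`, the default `k=4`, and the choice to skip k-mers containing ambiguous bases are all assumptions made here.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies for seq.

    The output has 4**k entries, ordered lexicographically over A, C, G, T.
    Hypothetical helper for illustration; k=4 is an assumed default.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all valid k-mers so the vector length does not depend on the read.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored (an assumption).
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGTTAGC", k=2)
    print(len(profile), round(sum(profile), 6))
```

A single substituted base changes at most k of the overlapping k-mers in a read, so with small k the profile of a long read shifts only slightly under sequencing errors; this is one intuition for why short k-mer profiles might suit error-prone long reads.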
This proof-of-concept work demonstrates the utility of machine-learning-based methods for taxonomic classification using long reads. MetageNN can be applied to sequences left unclassified by conventional methods, and it offers an alternative, memory-efficient approach to classification that can be optimized further.