PC-mer: An Ultra-fast memory-efficient tool for metagenomics profiling and classification.

Author information

Akbari Rokn Abadi Saeedeh, Mohammadi Amirhossein, Koohi Somayyeh

Affiliation

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.

Publication information

PLoS One. 2024 Aug 1;19(8):e0307279. doi: 10.1371/journal.pone.0307279. eCollection 2024.

Abstract

Feature extraction methods, such as k-mer-based methods, have recently come to play a significant role in approaches to classifying and analyzing metagenomics data. However, they face several bottlenecks, including performance limitations, high memory consumption, and computational overhead. To address these challenges, we developed PC-mer, a novel feature extraction and sequence profiling method for DNA/RNA sequences that takes advantage of the physicochemical properties of nucleotides. Compared with k-mer profiling methods, PC-mer reduces memory usage by a factor of 2k while improving metagenomics classification performance at various taxonomic levels, for both machine learning-based and computational-based methods, and it also achieves a speedup of more than 1000x in the training phase. Evaluating ML-based PC-mer on various datasets confirms that it can achieve 100% accuracy in classifying samples at the class, order, and family levels. Compared with k-mer-based classification methods, it also improves genus-level classification accuracy by more than 14% on the shotgun dataset (reaching 97.5% accuracy) and by more than 5% on the amplicon dataset (reaching 98.6% accuracy). Building on these improvements, we provide two PC-mer-based tools that can replace the popular k-mer-based tools: one for classifying and another for comparing metagenomics data.
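The abstract does not describe how PC-mer encodes sequences, so the following is only a minimal sketch of the general idea that grouping nucleotides by a physicochemical property shrinks a k-mer-style profile. It assumes a single purine/pyrimidine grouping (IUPAC codes R and Y); the function names `kmer_profile` and `reduced_profile` are hypothetical, and this is not the authors' algorithm.

```python
from collections import Counter
from itertools import product

# Assumed grouping: purines (A, G) -> "R", pyrimidines (C, T/U) -> "Y".
# The abstract does not say which properties PC-mer uses; illustrative only.
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def kmer_profile(seq, k, alphabet="ACGT"):
    """Count every length-k window; the vector has len(alphabet)**k entries."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all possible k-mers in a fixed order so vectors are comparable.
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def reduced_profile(seq, k, mapping=PURINE_PYRIMIDINE):
    """Recode the sequence into the two-letter alphabet, then count windows."""
    recoded = "".join(mapping[base] for base in seq.upper())
    return kmer_profile(recoded, k, alphabet="RY")


seq = "ACGTACGTGA"
k = 3
full = kmer_profile(seq, k)        # 4**3 = 64 entries
reduced = reduced_profile(seq, k)  # 2**3 = 8 entries
print(len(full), len(reduced))     # 64 8
```

In this sketch both profiles count the same eight windows of the example sequence, but the recoded profile stores them in 8 slots instead of 64. A two-letter grouping discards information that the full alphabet keeps, so a practical method would presumably combine several groupings or add other features; how PC-mer does this is described in the paper, not here.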
