K2Mem: Discovering Discriminative K-mers From Sequencing Data for Metagenomic Reads Classification.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2022 Jan-Feb;19(1):220-229. doi: 10.1109/TCBB.2021.3117406. Epub 2022 Feb 3.

DOI:10.1109/TCBB.2021.3117406

Abstract

A central problem in analyzing a metagenomic sample is taxonomically annotating its reads to identify the species it contains. Most currently available methods classify reads using a set of reference genomes and their k-mers. While the precision of these methods is close to perfect, their recall (the share of reads actually classified) falls to around 50%. One reason is that the sequences in a sample can differ substantially from the corresponding reference genome; viral genomes, for example, are highly mutated. To address this issue, we study metagenomic read classification by augmenting the reference k-mer library with novel discriminative k-mers drawn from the input sequencing reads. We evaluated performance under different conditions against several other tools, and the results showed an improved F-measure, especially when close reference genomes are not available. Availability: https://github.com.
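The general idea of extending a reference k-mer library with k-mers taken from the reads can be illustrated with a toy sketch. This is not the K2Mem implementation: the function names (`build_library`, `classify`, `augment`), the majority-vote rule, the tiny `K = 4`, and the example sequences are all assumptions made for illustration only.

```python
from collections import Counter

K = 4  # toy k-mer length (assumption); real tools use a much larger k


def kmers(seq, k=K):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_library(references, k=K):
    """Map each reference k-mer to its taxon; k-mers shared by taxa map to None."""
    library = {}
    for taxon, genome in references.items():
        for km in set(kmers(genome, k)):
            if km in library and library[km] != taxon:
                library[km] = None  # ambiguous, not discriminative
            else:
                library[km] = taxon
    return library


def classify(read, library, k=K):
    """Return the taxon with the most k-mer hits, or None if there is no clear winner."""
    hits = Counter()
    for km in kmers(read, k):
        taxon = library.get(km)
        if taxon is not None:
            hits[taxon] += 1
    ranked = hits.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]


def augment(reads, library, k=K):
    """Add k-mers absent from the library that occur only in reads of a single taxon."""
    seen = {}  # novel k-mer -> set of taxa assigned to the reads containing it
    for read in reads:
        taxon = classify(read, library, k)
        if taxon is None:
            continue
        for km in kmers(read, k):
            if km not in library:
                seen.setdefault(km, set()).add(taxon)
    extended = dict(library)
    for km, taxa in seen.items():
        if len(taxa) == 1:
            extended[km] = next(iter(taxa))
    return extended


# Hypothetical toy data: two reference genomes and two reads.
references = {"A": "ACGTACGGTCA", "B": "TTGCAATGCCG"}
reads = [
    "ACGTACGGAGAT",  # starts like genome A, then diverges
    "GGAGATCC",      # shares k-mers only with the divergent part of the first read
]

library = build_library(references)
extended = augment(reads, library)

print(classify(reads[1], library))   # None: no reference k-mer matches
print(classify(reads[1], extended))  # A: matched through k-mers learned from the first read
```

In this sketch the second read is unclassified against the reference library alone and becomes classifiable once the library is extended, which mirrors the recall gain the abstract describes when a sample diverges from its reference genomes.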
