Draesslerová Dominika, Ahmed Omar, Gagie Travis, Holub Jan, Langmead Ben, Manzini Giovanni, Navarro Gonzalo
Czech Technical University in Prague, Czech Republic.
Johns Hopkins University, USA.
Lebniz Int Proc Inform. 2024 Jul;301. doi: 10.4230/LIPIcs.SEA.2024.10. Epub 2024 Jul 11.
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use -mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can ■ build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; ■ for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; ■ find the minimum and maximum values stored in that interval; ■ take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation: ■ a KATKA kernel, which discards characters that are not in the first or last occurrence of any -tuple, for a parameter ; a minimizer digest; ■ a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
对于分类学分类,我们被要求在系统发育树中对基因组进行索引,以便之后在给定一个DNA读数时,我们能够快速选择一个可能包含该读数所源自基因组的小子树。尽管像Kraken这样的流行分类器使用k - mers,但最近的研究表明,使用最大精确匹配(MEMs)可以带来更好的分类效果。例如,我们可以:
■ 基于按从左到右顺序串联的树中的基因组构建一个增强的FM索引;
■ 对于读数中的每个MEM,在后缀数组中找到包含该MEM在那些基因组中出现起始位置的区间;
■ 找到该区间中存储的最小值和最大值;
■ 获取包含这些位置字符的基因组的最低共同祖先(LCA)。然而,只有当树中基因组的总大小相当小时,这个解决方案才可行。在本文中,我们考虑将相同的解决方案应用于基因组串联的三种有损压缩表示:
■ 一个KATKA内核,对于一个参数k,它会丢弃不在任何k元组的首次或末次出现中的字符;
■ 一个最小化器摘要;
■ 一个最小化器摘要的KATKA内核。通过一个测试数据集以及它的这三种表示、模拟读数和各种参数设置,我们检查了有多少读数的最长MEM仅出现在生成这些读数的序列中(“真阳性”读数)。对于某些参数设置,我们在仅略微降低真阳性率的同时实现了显著的压缩。