Schulz Tizian, Medvedev Paul
Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany.
Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, Germany.
Lebniz Int Proc Inform. 2023 Sep;273. doi: 10.4230/LIPIcs.WABI.2023.14. Epub 2023 Aug 29.
Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in time and space, where is the number of -mers inside the sketch of the reference, is the number of -mers inside the read's sketch and is the number of times that -mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm's performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.
给定一个测序读段,读段映射的总体目标是在参考基因组中找到具有“相似序列”的位置。传统上,“相似序列”被定义为具有高比对分数,并且读段映射器被视为解决这个明确定义问题的启发式方法。然而,对于基于草图的映射器,尚未有一个问题表述来明确一个精确的基于草图的映射算法应该解决什么问题。此外,没有基于草图的方法能够找到读段在某个分数阈值以上的所有可能映射位置。在本文中,我们在序列草图层面上阐述了读段映射问题。我们给出了一个精确的动态规划算法,该算法能找到高于给定相似性阈值的所有匹配。它的运行时间和空间复杂度分别为 ,其中 是参考草图中 - 聚体的数量, 是读段草图中 - 聚体的数量, 是模式草图中的 - 聚体在文本草图中出现的次数。我们评估了我们算法在将长读段映射到人类 Y 染色体的 T2T 组装上的性能,在该组装中扩增区域使得找到所有良好映射位置变得很有必要。对于与 minimap2 相当的精度水平,我们算法的召回率为 0.88,而 minimap2 仅为 0.76。