Firtina Can, Park Jisung, Alser Mohammed, Kim Jeremie S, Cali Damla Senol, Shahroodi Taha, Ghiasi Nika Mansouri, Singh Gagandeep, Kanellopoulos Konstantinos, Alkan Can, Mutlu Onur
ETH Zurich, Zurich 8092, Switzerland.
POSTECH, Pohang 37673, Republic of Korea.
NAR Genom Bioinform. 2023 Jan 20;5(1):lqad004. doi: 10.1093/nargab/lqad004. eCollection 2023 Mar.
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce , the first efficient and accurate mechanism that can identify exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, that can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher quality overlaps leading to accurate assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
生成短子序列(称为种子)的哈希值,能够通过将种子与其哈希值进行单次查找匹配,快速识别基因组序列之间的相似性。然而,这些哈希值仅可用于查找精确匹配的种子,因为传统的哈希方法会为不同的种子(包括高度相似的种子)分配不同的哈希值。仅查找精确匹配的种子会导致以下两种情况之一:(i)增加成本高昂的序列比对的使用频率,或者(ii)灵敏度受限。我们引入了BLEND,这是第一种高效且准确的机制,它能够通过单次查找种子的哈希值来识别精确匹配和高度相似的种子,即模糊种子匹配。BLEND(i)利用一种名为SimHash的技术,该技术可以为相似的集合生成相同的哈希值,并且(ii)提供了适当的机制,以便在使用SimHash技术时将种子用作集合,从而有效地找到模糊种子匹配。我们展示了BLEND在用于读取重叠和读取映射时的优势。对于读取重叠,与最先进的工具minimap2相比,BLEND的速度快2.4倍至83.9倍(平均为19.3倍),内存占用低0.9倍至14.1倍(平均为3.8倍),并且能够找到更高质量的重叠,从而实现更准确的组装。对于读取映射,BLEND比minimap2快0.8倍至4.1倍(平均为1.7倍)。源代码可在https://github.com/CMU - SAFARI/BLEND获取。