Li Xiang, Chen Ke, Shao Mingfu
Department of Computer Science and Engineering, The Pennsylvania State University, United States.
Huck Institutes of the Life Science, The Pennsylvania State University, United Statess.
bioRxiv. 2024 Jun 3:2024.05.30.596711. doi: 10.1101/2024.05.30.596711.
Seeding is an essential preparatory step for large-scale sequence comparisons. Substring-based seeding methods such as kmers are ideal for sequences with low error rates but struggle to achieve high sensitivity while maintaining a reasonable precision for error-prone long reads. SubseqHash, a novel subsequence-based seeding method we recently developed, achieves superior accuracy to substring-based methods in seeding sequences with high mutation/error rates, while the only drawback is its computation speed. In this paper, we propose SubseqHash2, an improved algorithm that can compute multiple sets of seeds in one run by defining orders over all length- subsequences and identifying the optimal subsequence under each of the orders in a single dynamic programming framework. The algorithm is further accelerated using SIMD instructions. SubseqHash2 achieves a 10-50× speedup over repeating SubseqHash while maintaining the high accuracy of seeds. We demonstrate that SubseqHash2 drastically outperforms popular substring-based methods including kmers, minimizers, syncmers, and Strobemers for three fundamental applications. In read mapping, SubseqHash2 can generate adequate seed-matches for aligning hard reads that minimap2 fails on. In sequence alignment, SubseqHash2 achieves high coverage of correct seeds and low coverage of incorrect seeds. In overlap detection, seeds produced by SubseqHash2 lead to more correct overlapping pairs at the same false-positive rate. With all the algorithmic breakthroughs of SubseqHash2, we clear the path for the wide adoption of subsequence-based seeds in long-read analysis. SubseqHash2 is available at https://github.com/Shao-Group/SubseqHash2.
种子查找是大规模序列比较的一个基本准备步骤。基于子串的种子查找方法(如kmer)适用于错误率低的序列,但在为易出错的长读段维持合理精度的同时,难以实现高灵敏度。SubseqHash是我们最近开发的一种新颖的基于子序列的种子查找方法,在对具有高突变/错误率的序列进行种子查找时,其准确性优于基于子串的方法,唯一的缺点是其计算速度。在本文中,我们提出了SubseqHash2,这是一种改进算法,通过在所有长度为的子序列上定义顺序,并在单个动态规划框架中确定每个顺序下的最优子序列,从而可以一次运行计算多组种子。该算法使用SIMD指令进一步加速。SubseqHash2在保持种子高精度的同时,比重复运行SubseqHash的速度提高了10 - 50倍。我们证明,对于三个基本应用,SubseqHash2的性能大幅优于包括kmer、minimizer、syncmer和Strobemer在内的流行基于子串的方法。在读取映射中,SubseqHash2可以生成足够的种子匹配,用于比对minimap2无法处理的困难读段。在序列比对中,SubseqHash2实现了正确种子的高覆盖率和错误种子的低覆盖率。在重叠检测中,SubseqHash2生成的种子在相同误报率下能产生更多正确的重叠对。凭借SubseqHash2所有的算法突破,我们为基于子序列的种子在长读段分析中的广泛应用扫清了障碍。SubseqHash2可在https://github.com/Shao-Group/SubseqHash2获取。