Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada.
Faculty of Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
Bioinformatics. 2022 Oct 14;38(20):4812-4813. doi: 10.1093/bioinformatics/btac564.
Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research.
ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism.
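For illustration only, the "conventional hashing with naïve adaptation" baseline mentioned above can be sketched roughly as follows: for every window, drop the don't-care positions of the spaced seed and hash what remains with a general-purpose hash. This is not the ntHash2 algorithm or its API. The function name `spaced_seed_hashes`, the '1'/'0' seed notation, and the choice of BLAKE2b are assumptions made for this sketch.

```python
import hashlib


def spaced_seed_hashes(seq, seed):
    """Yield (position, 64-bit hash) for each window of len(seed) in seq.

    Hypothetical helper for illustration: `seed` is a string of '1'
    (care position) and '0' (don't-care position). Each window is
    rehashed from scratch, which is the naive, non-rolling approach.
    """
    k = len(seed)
    care = [i for i, c in enumerate(seed) if c == "1"]
    for pos in range(len(seq) - k + 1):
        window = seq[pos:pos + k]
        # Keep only the bases at care positions, so mismatches at
        # don't-care positions cannot change the hash value.
        masked = "".join(window[i] for i in care)
        digest = hashlib.blake2b(masked.encode(), digest_size=8).digest()
        yield pos, int.from_bytes(digest, "big")


# Two reads that differ only at the don't-care position (index 2)
# produce the same hash under the seed "1101111".
same = list(spaced_seed_hashes("ACGTACG", "1101111")) == \
       list(spaced_seed_hashes("ACTTACG", "1101111"))
print(same)
```

Because every window is masked and hashed independently, the cost per window grows with the seed length; this is the kind of overhead that a dedicated spaced seed hashing algorithm aims to avoid.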
ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license.
Supplementary data are available at Bioinformatics online.