Wong Johnathan, Kazemi Parham, Coombe Lauren, Warren René L, Birol Inanç
Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada.
bioRxiv. 2023 May 10:2023.05.08.539909. doi: 10.1101/2023.05.08.539909.
k-mer hashing is a common operation in many foundational bioinformatics problems. However, generic string hashing algorithms are not optimized for this application. Strings in bioinformatics use specific alphabets, a trait leveraged for nucleic acid sequences in earlier work. We note that amino acid sequences, with complexities and context that cannot be captured by generic hashing algorithms, can also benefit from a domain-specific hashing algorithm. Such a hashing algorithm can accelerate and improve the sensitivity of bioinformatics applications developed for protein sequences.
Here, we present aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ~10X faster than generic string hashing algorithms in hashing adjacent k-mers.
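To illustrate the idea of recursive (rolling) hashing with multiple levels, here is a minimal Python sketch. It is not the aaHash implementation: the seed values are random placeholders, the rotate-and-XOR update is the generic scheme used by rolling hashes such as ntHash, and the amino acid groupings in `LEVEL2_GROUPS` are hypothetical choices made only for this example. All function names are likewise assumptions.

```python
import random

MASK = (1 << 64) - 1
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical similarity groups for a coarser hash level (illustrative only,
# not the groupings used by aaHash).
LEVEL2_GROUPS = ["ILV", "DE", "KR", "ST", "FWY", "NQ"]


def rol(x, r):
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK


def make_seeds(groups=None, rng_seed=0):
    """Assign a placeholder 64-bit seed to each residue.

    Residues in the same group share one seed, so they become
    indistinguishable at that hash level.
    """
    rng = random.Random(rng_seed)
    seeds = {aa: rng.getrandbits(64) for aa in AMINO_ACIDS}
    for group in groups or []:
        shared = seeds[group[0]]
        for aa in group:
            seeds[aa] = shared
    return seeds


def hash_kmer(kmer, seeds):
    """Hash a k-mer from scratch in O(k)."""
    h = 0
    for aa in kmer:
        h = rol(h, 1) ^ seeds[aa]
    return h


def roll(h, k, out_aa, in_aa, seeds):
    """Derive the next k-mer's hash from the previous one in O(1)."""
    return rol(h, 1) ^ rol(seeds[out_aa], k) ^ seeds[in_aa]


def rolling_hashes(seq, k, seeds):
    """Hash every adjacent k-mer of seq using the O(1) update."""
    if len(seq) < k:
        return []
    h = hash_kmer(seq[:k], seeds)
    hashes = [h]
    for i in range(k, len(seq)):
        h = roll(h, k, seq[i - k], seq[i], seeds)
        hashes.append(h)
    return hashes


level1 = make_seeds()                # every residue distinct
level2 = make_seeds(LEVEL2_GROUPS)   # similar residues share a seed
```

Under these assumptions, `hash_kmer("LKD", level2)` equals `hash_kmer("IRE", level2)` because each position differs only within a group, while the level-1 hashes of the two k-mers differ. The speed advantage for adjacent k-mers comes from `roll`, which does constant work per k-mer instead of rehashing all k residues.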
aaHash is available online at https://github.com/bcgsc/btllib and is free for academic use.