aaHash: recursive amino acid sequence hashing.

Authors

Wong Johnathan, Kazemi Parham, Coombe Lauren, Warren René L, Birol Inanç

Affiliation

Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada.

Publication information

bioRxiv. 2023 May 10:2023.05.08.539909. doi: 10.1101/2023.05.08.539909.

DOI:10.1101/2023.05.08.539909

PMID:37214907

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10197579/

Abstract

MOTIVATION

k-mer hashing is a common operation in many foundational bioinformatics problems. However, generic string hashing algorithms are not optimized for this application. Strings in bioinformatics use specific alphabets, a trait leveraged for nucleic acid sequences in earlier work. We note that amino acid sequences, with complexities and context that cannot be captured by generic hashing algorithms, can also benefit from a domain-specific hashing algorithm. Such a hashing algorithm can accelerate and improve the sensitivity of bioinformatics applications developed for protein sequences.

RESULTS

Here, we present aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ~10X faster than generic string hashing algorithms in hashing adjacent k-mers.
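To make the idea of recursive hashing over adjacent k-mers concrete, the following Python sketch shows a generic cyclic-polynomial rolling hash over the 20-letter amino acid alphabet, plus a second "level" in which similar residues share a seed. It is an illustration under stated assumptions, not the btllib implementation: the abstract does not give the recurrence, the seed tables, or the level definitions, so the random seeds, the residue grouping in `GROUPS`, and all function names here are hypothetical.

```python
import random

MASK64 = (1 << 64) - 1
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rol(x, r):
    """Rotate a 64-bit word left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


# Hypothetical seed tables (NOT the aaHash seeds): one random 64-bit word per
# residue, drawn from a fixed-seed generator so runs are reproducible.
_rng = random.Random(42)
SEEDS_L1 = {aa: _rng.getrandbits(64) for aa in AMINO_ACIDS}

# Hypothetical coarser level: residues placed in the same illustrative group
# share one seed, so substitutions within a group leave the hash unchanged.
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
SEEDS_L2 = {}
for group in GROUPS:
    shared = _rng.getrandbits(64)
    for aa in group:
        SEEDS_L2[aa] = shared


def base_hash(kmer, seeds):
    """Hash a k-mer from scratch in O(k)."""
    k = len(kmer)
    h = 0
    for i, aa in enumerate(kmer):
        h ^= rol(seeds[aa], k - 1 - i)
    return h


def roll_hash(prev, k, out_aa, in_aa, seeds):
    """Derive the next k-mer's hash from the previous one in O(1)."""
    return rol(prev, 1) ^ rol(seeds[out_aa], k) ^ seeds[in_aa]


def kmer_hashes(seq, k, seeds):
    """Yield the hash of every k-mer in seq, rolling after the first."""
    if len(seq) < k:
        return
    h = base_hash(seq[:k], seeds)
    yield h
    for i in range(1, len(seq) - k + 1):
        h = roll_hash(h, k, seq[i - 1], seq[i + k - 1], seeds)
        yield h
```

Under these assumptions, `list(kmer_hashes(seq, k, SEEDS_L1))` matches hashing each window independently with `base_hash`, while only the first window costs O(k). Passing `SEEDS_L2` instead makes k-mers that differ only by a within-group substitution (for example D to E) collide by design, which is one simple way a coarser level could encode biochemical similarity.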

AVAILABILITY AND IMPLEMENTATION

aaHash is available online at https://github.com/bcgsc/btllib and is free for academic use.
