aaHash: recursive amino acid sequence hashing.

Author information

Wong Johnathan, Kazemi Parham, Coombe Lauren, Warren René L, Birol Inanç

Affiliations

Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada.

Publication information

bioRxiv. 2023 May 10:2023.05.08.539909. doi: 10.1101/2023.05.08.539909.

Abstract

MOTIVATION

k-mer hashing is a common operation in many foundational bioinformatics problems. However, generic string hashing algorithms are not optimized for this application. Strings in bioinformatics use specific alphabets, a trait that earlier work leveraged for nucleic acid sequences. We note that amino acid sequences, with complexities and context that generic hashing algorithms cannot capture, can also benefit from a domain-specific hashing algorithm. Such an algorithm can accelerate bioinformatics applications developed for protein sequences and improve their sensitivity.

RESULTS

Here, we present aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ~10X faster than generic string hashing algorithms in hashing adjacent k-mers.
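
To illustrate what recursive hashing of adjacent k-mers means, the following is a minimal sketch of a generic cyclic-polynomial rolling hash over the amino acid alphabet. It is not the aaHash implementation or the btllib API: the function names, the randomly generated seed table, and the `ILV` grouping used to mimic a higher hash level are all hypothetical choices made for this example.

```python
import random

MASK64 = (1 << 64) - 1
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rol(x, r):
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def make_seeds(groups=None, rng_seed=0):
    """Give each residue a random 64-bit seed (hypothetical values).

    Residues listed in the same group share one seed, so they hash
    identically. This mimics a 'hash level' that treats similar
    residues as equivalent; the grouping itself is an assumption.
    """
    rng = random.Random(rng_seed)
    seeds = {aa: rng.getrandbits(64) for aa in AMINO_ACIDS}
    for group in groups or []:
        for aa in group:
            seeds[aa] = seeds[group[0]]
    return seeds


def hash_kmer(kmer, seeds):
    """Hash one k-mer from scratch in O(k)."""
    h = 0
    for aa in kmer:
        h = rol(h, 1) ^ seeds[aa]
    return h


def roll(h, k, out_aa, in_aa, seeds):
    """Derive the next k-mer's hash from the previous one in O(1)."""
    return rol(h, 1) ^ rol(seeds[out_aa], k) ^ seeds[in_aa]


def rolling_hashes(seq, k, seeds):
    """Hash every k-mer of seq, reusing the previous hash at each step."""
    if len(seq) < k:
        return []
    h = hash_kmer(seq[:k], seeds)
    hashes = [h]
    for i in range(k, len(seq)):
        h = roll(h, k, seq[i - k], seq[i], seeds)
        hashes.append(h)
    return hashes
```

Calling `rolling_hashes("MKVLAAGIVG", 4, make_seeds())` yields one hash per k-mer while doing constant work per step after the first, which is where a speed-up over rehashing each k-mer with a generic string hash would come from. Passing `make_seeds(groups=["ILV"])` instead makes k-mers that differ only by I, L, or V collide on purpose.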

AVAILABILITY AND IMPLEMENTATION

aaHash is available online at https://github.com/bcgsc/btllib and is free for academic use.
