Chen Ke, Shao Mingfu
Department of Computer Science and Engineering, The Pennsylvania State University, State College, United States.
Huck Institutes of the Life Sciences, The Pennsylvania State University, State College, United States.
Algorithms Mol Biol. 2023 Jul 24;18(1):7. doi: 10.1186/s13015-023-00234-2.
Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps.
In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be [Formula: see text]-sensitive if any two sequences within an edit distance of [Formula: see text] are mapped into at least one shared bucket, and any two sequences with distance at least [Formula: see text] are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of [Formula: see text] and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal.
These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions.
许多生物信息学应用涉及对一组序列进行分桶,其中每个序列可以被分配到多个桶中。为了实现高灵敏度和高精度,分桶方法需要将相似的序列分配到同一个桶中,同时将不相似的序列分配到不同的桶中。现有的基于k-mer的分桶方法在处理低错误率的测序数据时效率很高,但在处理高错误率的数据时灵敏度会大大降低。局部敏感哈希(Locality-sensitive hashing,LSH)方案能够通过容忍相似序列中的编辑来缓解这个问题,但目前的先进方法仍然存在很大差距。
在本文中,我们通过允许LSH函数将一个序列哈希到多个桶中,对其进行了推广。形式上,如果编辑距离为[公式:见原文]的任意两个序列被映射到至少一个共享桶中,并且距离至少为[公式:见原文]的任意两个序列被映射到不相交的桶子集,则将一个将固定长度的序列映射到桶子集的分桶函数定义为[公式:见原文]敏感的。我们构造了具有各种[公式:见原文]值的局部敏感分桶(Locality-sensitive bucketing,LSB)函数,并分析了它们相对于所需桶的总数以及特定序列被映射到的桶数的效率。我们还证明了这两个参数在不同设置下的下界,并表明我们构造的一些LSB函数是最优的。
这些结果为它们在分析高错误率序列中的实际应用奠定了理论基础,同时也为设计无间隙LSH函数的难度提供了见解。