Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States.
Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad652.
Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise.
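The MinHash scheme described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the distinct hash functions are simulated by XOR-masking Python's built-in string hash, and the sequence names and parameters are made up for the example.

```python
import random

def minhash_signature(seq, k, masks):
    """For each simulated hash function (an XOR mask applied to Python's
    built-in string hash), store the minimum hash value over all k-mers."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return [min(hash(km) ^ m for km in kmers) for m in masks]

def estimate_similarity(sig_a, sig_b):
    """Fraction of hash functions whose min-hashes match between the two
    reads; this approximates the Jaccard similarity of their k-mer sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical usage with 64 random hash functions:
rng = random.Random(0)
masks = [rng.getrandbits(64) for _ in range(64)]
read_a = "ACGTTGCAACGTTGCAACGTTGCA"
read_b = "ACGTTGCAACGTTGCATTTTTTTT"
score = estimate_similarity(minhash_signature(read_a, 8, masks),
                            minhash_signature(read_b, 8, masks))
```

Reads sharing many k-mers are likely to share min-hashes under many of the hash functions, so the match fraction rises with overlap length.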
In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how "lexicographically similar" the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision-recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm, and circumventing the seemingly fundamental O(n²) scaling associated with pairwise similarity search.
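The "lexicographic similarity" comparison can be illustrated as a longest-common-prefix check. This is a minimal sketch of the idea only: here min-hashes are represented as plain strings over the DNA alphabet, and the scoring function is a stand-in; the paper's actual masked hash ordering and search procedure differ.

```python
def prefix_match_length(a, b):
    """Length of the longest common prefix of two min-hash strings.
    A longer prefix match suggests a longer shared substring, so the
    score varies smoothly instead of being a binary match/no-match."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def lexic_similarity(sig_a, sig_b):
    """Score a read pair by the maximum prefix-match length of their
    min-hashes across all hash functions (illustrative scoring only)."""
    return max(prefix_match_length(a, b) for a, b in zip(sig_a, sig_b))
```

Because the score is a match *length* rather than an exact-match indicator, a single signature captures evidence at every effective k simultaneously, which is what removes the precision/recall tradeoff tied to a fixed k.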
LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.