Baharav Tavor Z, Kamath Govinda M, Tse David N, Shomorony Ilan
Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA.
Microsoft Research New England, Cambridge, MA 02142, USA.
Patterns (N Y). 2020 Jul 31;1(6):100081. doi: 10.1016/j.patter.2020.100081. eCollection 2020 Sep 11.
Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise -mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the -mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven -mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores.
成对序列比对通常是基因组分析流程中的一个计算瓶颈,特别是在第三代测序技术的背景下。为了加速这一过程,有时会使用成对的k-mer杰卡德相似度作为比对大小的代理,以便过滤读段对,并采用最小哈希来有效估计这些相似度。然而,当数据集的k-mer分布明显不均匀时(例如,由于GC偏差和重复序列),杰卡德相似度就不再是比对大小的良好代理。在这项工作中,我们引入了一种基于最小哈希的方法来估计比对大小,称为谱杰卡德相似度,它自然地考虑了不均匀的k-mer分布。谱杰卡德相似度是通过对最小哈希碰撞矩阵进行奇异值分解来计算的。我们通过实验表明,这个新指标能为比对大小提供明显更好的估计,并且我们为这些谱相似度分数提供了一个计算高效的估计器。