Department of Computer Science, Rice University, Houston, TX, United States.
Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States.
Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad512.
The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, because they rely on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.
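For concreteness, the exact quantity being estimated can be sketched as follows. This is an illustrative example only, not code from MashMap; the function names `kmers` and `jaccard` are hypothetical, and returning 1.0 for two empty k-mer sets is an assumed convention.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of the two k-mer sets."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    # Assumed convention: two empty k-mer sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


# "ACGT" has 2-mers {AC, CG, GT}; "ACGA" has {AC, CG, GA}.
# They share 2 of 4 distinct 2-mers.
print(jaccard("ACGT", "ACGA", 2))  # 0.5
```

Computing this exactly requires every k-mer of both sequences, which is why sketching schemes that sample only a few k-mers per window are used at scale.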
To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by using a rolling minhash that samples multiple k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
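The idea of sampling several k-mers per window can be illustrated with a naive sketch. The code below is not the MashMap3 implementation, which uses a rolling scheme. It recomputes each window from scratch, and the names `kmer_hash`, `minmer_positions`, and `minhash_jaccard`, the BLAKE2b hash, and the parameters `k`, `w`, and `s` are assumptions made for illustration. A k-mer is kept if it is among the `s` smallest hashes in at least one window of `w` consecutive k-mers; with `s = 1` this reduces to ordinary minimizer winnowing.

```python
import hashlib
import heapq


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minmer_positions(seq, k, w, s):
    """Positions of k-mers that are among the s smallest hashes in some
    window of w consecutive k-mers. Naive O(n * w) sketch; repeated k-mers
    within a window are not deduplicated (ties are broken by position)."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = [(hashes[i], i) for i in range(start, start + w)]
        for _, pos in heapq.nsmallest(s, window):
            selected.add(pos)
    return sorted(selected)


def minhash_jaccard(hashes_a, hashes_b, s):
    """Standard bottom-s minhash estimate of Jaccard similarity:
    among the s smallest hashes of the union, the fraction present in both."""
    sketch_a = set(heapq.nsmallest(s, set(hashes_a)))
    sketch_b = set(heapq.nsmallest(s, set(hashes_b)))
    merged = set(heapq.nsmallest(s, sketch_a | sketch_b))
    return len(merged & sketch_a & sketch_b) / len(merged) if merged else 1.0


seq = "ACGTTGCAAGCTTAGC"
minimizers = minmer_positions(seq, k=3, w=4, s=1)  # s = 1: minimizer scheme
minmers = minmer_positions(seq, k=3, w=4, s=2)     # s = 2: two k-mers per window
print(set(minimizers) <= set(minmers))  # True
```

Because every window contributes its `s` smallest hashes, the sampled k-mers within any window form a bottom-`s` minhash sketch of that window, which is what allows a minhash-style Jaccard estimate to be computed locally.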
MashMap3 is available at https://github.com/marbl/MashMap.