Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation.

Author affiliations

Department of Computer Science, Rice University, Houston, TX, United States.

Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States.

Publication information

Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad512.

Abstract

MOTIVATION

The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, because they rely on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.
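For readers unfamiliar with the quantity being estimated, the following is a minimal sketch of the exact Jaccard similarity between the k-mer sets of two sequences. The function names `kmer_set` and `jaccard` are illustrative and are not taken from MashMap; the sketch ignores reverse complements and computes the exact value rather than a sketch-based estimate.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|.

    Assumption: two empty sets are treated as identical (similarity 1.0).
    """
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)
```

For example, `jaccard(kmer_set("ACGTACGT", 4), kmer_set("ACGTTCGT", 4))` gives 0.125: the two sequences share only the 4-mer `ACGT` out of 8 distinct 4-mers in total.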

RESULTS

To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
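The relationship between the two winnowing schemes can be illustrated with a naive sketch. It assumes that a window consists of `w` consecutive k-mers and that the scheme keeps the `s` k-mers with the smallest hash values in each window, so that `s = 1` corresponds to minimizer winnowing. The function names, the choice of BLAKE2b as the hash, and the position-based tie-breaking are assumptions made for illustration; this re-sorts every window and is not the rolling minhash used in MashMap.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def winnow(seq: str, k: int, w: int, s: int = 1) -> list[int]:
    """Return sorted start positions of the k-mers sampled from seq.

    Each window of w consecutive k-mers contributes its s smallest-hash
    k-mers (ties broken by position). s = 1 gives minimizers; s > 1 gives
    the minmer-style sampling described above.
    """
    if k < 1 or w < 1 or not 1 <= s <= w:
        raise ValueError("require k >= 1, w >= 1 and 1 <= s <= w")
    hashes = [(kmer_hash(seq[i:i + k]), i) for i in range(len(seq) - k + 1)]
    sampled: set[int] = set()
    for start in range(len(hashes) - w + 1):
        # Naive approach: sort the window and keep the s smallest entries.
        window = sorted(hashes[start:start + w])
        sampled.update(pos for _, pos in window[:s])
    return sorted(sampled)
```

With this sketch, `winnow(seq, k, w, 1)` is always a subset of `winnow(seq, k, w, s)` for larger `s`, which reflects the sense in which sampling several k-mers per window generalizes sampling one.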

AVAILABILITY AND IMPLEMENTATION

MashMap3 is available at https://github.com/marbl/MashMap.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4361/10505501/633a021c26da/btad512f1.jpg
