The minimizer Jaccard estimator is biased and inconsistent.

Affiliations

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.

Department of Biology, The Pennsylvania State University, University Park, PA, USA.

Publication information

Bioinformatics. 2022 Jun 24;38(Suppl 1):i169-i176. doi: 10.1093/bioinformatics/btac244.

DOI:10.1093/bioinformatics/btac244
PMID:35758786
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9235516/
Abstract

MOTIVATION

Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences.

RESULTS

We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool.
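
To make the compared quantities concrete, here is a minimal Python sketch of the true k-mer Jaccard similarity and a minimizer-based estimate of it. It is an illustration under stated assumptions, not the authors' code. The function names, the BLAKE2b-based k-mer ordering, the substitution-only mutation model, and the parameter values are all hypothetical choices. The estimator is taken to be the Jaccard similarity of the two minimizer sketches, which is one common reading of the term.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the k-mers of seq in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_hash(kmer):
    """Deterministic stand-in for a random k-mer ordering (an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_sketch(seq, k, w):
    """Set of k-mers with the smallest hash in at least one window of w consecutive k-mers."""
    ks = kmers(seq, k)
    sketch = set()
    for i in range(len(ks) - w + 1):
        sketch.add(min(ks[i:i + w], key=kmer_hash))
    return sketch


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def true_jaccard(s, t, k):
    """Jaccard similarity of the full k-mer sets of s and t."""
    return jaccard(set(kmers(s, k)), set(kmers(t, k)))


def minimizer_jaccard(s, t, k, w):
    """Jaccard similarity of the minimizer sketches of s and t."""
    return jaccard(minimizer_sketch(s, k, w), minimizer_sketch(t, k, w))


def mutate(seq, rate, rng):
    """Substitute each nucleotide independently with probability rate (toy model)."""
    out = []
    for c in seq:
        if rng.random() < rate:
            out.append(rng.choice([x for x in "ACGT" if x != c]))
        else:
            out.append(c)
    return "".join(out)


# Toy comparison on a random sequence and a mutated copy (hypothetical parameters).
rng = random.Random(0)
s = "".join(rng.choice("ACGT") for _ in range(2000))
t = mutate(s, 0.05, rng)
print("true Jaccard:     ", true_jaccard(s, t, k=15))
print("minimizer Jaccard:", minimizer_jaccard(s, t, k=15, w=20))
```

With `w=1` every k-mer is its own minimizer, so the two functions return the same value. Any gap between them for larger `w` on a single pair of sequences mixes bias with sampling noise, so a single run like this only shows how the two quantities are computed and does not reproduce the paper's bias analysis.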

AVAILABILITY AND IMPLEMENTATION

Scripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9410/9235516/817a532fa5f6/btac244f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9410/9235516/609aa24e4769/btac244f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9410/9235516/a2b95afde192/btac244f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9410/9235516/95066d6a5ad7/btac244f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9410/9235516/ea581fd665e9/btac244f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9410/9235516/bd70bd91ac87/btac244f6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9410/9235516/97264a41f292/btac244f7.jpg
