Estimating similarity and distance using FracMinHash.

Authors

Rahman Hera Mahmudur, Koslicki David

Affiliations

School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, USA.

Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, USA.

Publication information

Algorithms Mol Biol. 2025 May 15;20(1):8. doi: 10.1186/s13015-025-00276-8.

DOI:10.1186/s13015-025-00276-8
PMID:40375084
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12082993/
Abstract

MOTIVATION

The increasing number and volume of genomic and metagenomic datasets necessitate scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics are still lacking.

THEORETICAL CONTRIBUTIONS

In this paper, we present a theoretical framework for estimating similarity/distance metrics by using FracMinHash sketches, when the metric is expressible in a certain form. We establish conditions under which such an estimation is sound and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings.
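To make the idea concrete, the following is a minimal sketch of cosine similarity estimated from FracMinHash sketches. It is an illustration under stated assumptions, not the frac-kmc implementation or the paper's exact estimator: the hash is a 64-bit BLAKE2b digest, k-mers are weighted by their abundance, canonical (reverse-complement) k-mers are ignored, and all function names (`kmer_hash`, `kmer_counts`, `frac_minhash`, `cosine`, `mutate`) are hypothetical. A k-mer is kept in the sketch when its hash falls below `scale` times the hash range, so `scale` plays the role of the scale factor s.

```python
import hashlib
import math
import random
from collections import Counter

MAX_HASH = 2**64  # size of the hash range (assumption: 64-bit hash values)


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to an integer in [0, 2**64) using an 8-byte BLAKE2b digest."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every k-mer in seq (no canonicalization, for simplicity)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def frac_minhash(counts: Counter, scale: float) -> dict:
    """Keep the k-mers whose hash is below scale * MAX_HASH, with abundances."""
    threshold = scale * MAX_HASH
    return {kmer: c for kmer, c in counts.items() if kmer_hash(kmer) < threshold}


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse abundance vectors stored as dicts."""
    dot = sum(c * b[kmer] for kmer, c in a.items() if kmer in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return float("nan")  # undefined when a sketch is empty
    return dot / (norm_a * norm_b)


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Substitute each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([x for x in "ACGT" if x != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    genome_a = "".join(rng.choice("ACGT") for _ in range(20000))
    genome_b = mutate(genome_a, 0.01, rng)
    counts_a, counts_b = kmer_counts(genome_a, 21), kmer_counts(genome_b, 21)
    exact = cosine(counts_a, counts_b)
    estimate = cosine(frac_minhash(counts_a, 0.1), frac_minhash(counts_b, 0.1))
    print(f"exact cosine: {exact:.3f}, sketch estimate (s=0.1): {estimate:.3f}")
```

Because both sketches use the same hash and threshold, a shared k-mer is either kept in both sketches or dropped from both, which is why the ratio computed on the sketches can track the ratio on the full k-mer sets. With a very small scale factor the sketches may be nearly empty and the estimate becomes unreliable, which is the situation the recommended minimum s is meant to avoid.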

PRACTICAL CONTRIBUTIONS

We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. frac-kmc is also the first parallel tool for this task, allowing for speeding up sketch generation using multiple CPU cores - an option lacking in existing serialized tools. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise similarity speedily and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.
