Estimating similarity and distance using FracMinHash.

Authors

Rahman Hera Mahmudur, Koslicki David

Affiliations

School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, USA.

Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, USA.

Publication information

Algorithms Mol Biol. 2025 May 15;20(1):8. doi: 10.1186/s13015-025-00276-8.

Abstract

MOTIVATION

The increasing number and volume of genomic and metagenomic data necessitate scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics are still lacking.

THEORETICAL CONTRIBUTIONS

In this paper, we present a theoretical framework for estimating similarity/distance metrics by using FracMinHash sketches, when the metric is expressible in a certain form. We establish conditions under which such an estimation is sound and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings.
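To make the setting concrete, the following minimal Python sketch shows one common formulation of FracMinHash and a plug-in cosine similarity computed directly from two sketches. It rests on several assumptions that are not taken from the paper: the scale factor s is treated as a fraction in (0, 1], BLAKE2b stands in for the hash function, canonical k-mers (merging reverse complements) are omitted, and the function names are hypothetical. It is not the frac-kmc implementation, and it does not reproduce the paper's soundness conditions or its recommended minimum s.

```python
import hashlib
import math
from collections import Counter

MAX_HASH = 2**64  # hash values lie in [0, MAX_HASH)


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash_sketch(seq: str, k: int, s: float) -> Counter:
    """Keep every k-mer whose hash falls below s * MAX_HASH.

    Assumption: s is a fraction in (0, 1]; s = 1 keeps all k-mers.
    The Counter maps each retained hash to its abundance in seq.
    """
    threshold = s * MAX_HASH
    sketch = Counter()
    for i in range(len(seq) - k + 1):
        h = kmer_hash(seq[i:i + k])
        if h < threshold:
            sketch[h] += 1
    return sketch


def cosine_from_sketches(a: Counter, b: Counter) -> float:
    """Plug-in cosine similarity of the abundance vectors in two sketches.

    Returns NaN when either sketch is empty, since the ratio is undefined.
    """
    dot = sum(count * b[h] for h, count in a.items() if h in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return float("nan")
    return dot / (norm_a * norm_b)
```

Because both sketches are built with the same hash function and the same threshold, a k-mer shared by two samples is either retained in both sketches or dropped from both, which is what allows a ratio computed on the sketches to stand in for the ratio on the full k-mer sets. With a small s, a sketch can be empty or nearly so, and the estimate becomes undefined or unstable.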

PRACTICAL CONTRIBUTIONS

We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. frac-kmc is also the first parallel tool for this task: sketch generation can be sped up using multiple CPU cores, an option lacking in existing serial tools. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise similarity quickly and accurately on real data. frac-kmc is freely available at https://github.com/KoslickiLab/frac-kmc/.
