Cosine Similarity Estimation Using FracMinHash: Theoretical Analysis, Safety Conditions, and Implementation.

Authors

Hera Mahmudur Rahman, Koslicki David

Affiliations

School of Electrical Engineering and Computer Science, Pennsylvania State University, USA.

Huck Institutes of the Life Sciences, Pennsylvania State University, USA.

Publication information

bioRxiv. 2024 May 30:2024.05.24.595805. doi: 10.1101/2024.05.24.595805.

Abstract

MOTIVATION

The growing number and volume of genomic and metagenomic datasets necessitate scalable and robust computational models for precise analysis. Sketching techniques that use k-mers from a biological sample have proven useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several applications. Recent studies on FracMinHash established unbiased estimators for the containment and Jaccard indices. However, theoretical investigations of other metrics, such as cosine similarity, are still lacking.

THEORETICAL CONTRIBUTIONS

In this paper, we present a theoretical framework for estimating cosine similarity from FracMinHash sketches. We establish conditions under which this estimation is sound, and recommend a minimum scale factor for accurate results. Experimental evidence supports our theoretical findings.
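To make the quantities above concrete, the following is a minimal sketch of how a cosine similarity estimate might be computed from FracMinHash sketches. It is an illustration under stated assumptions, not the estimator analyzed in the paper or the frac-kmc implementation: the hash function (BLAKE2b truncated to 64 bits), the helper names `kmer_hash`, `frac_minhash_sketch`, and `cosine_similarity`, and the choice to store k-mer abundances and take the cosine of the sketched abundance vectors are all hypothetical choices made here. Canonical k-mers (merging a k-mer with its reverse complement) are omitted for brevity.

```python
import hashlib
import math
from collections import Counter

MAX_HASH = 2**64  # assumed hash range: values fall in [0, MAX_HASH)


def kmer_hash(kmer: str) -> int:
    """Hash a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash_sketch(seq: str, k: int, scale: float) -> Counter:
    """Keep every k-mer whose hash falls below scale * MAX_HASH.

    The sketch maps each retained hash value to its abundance, so that
    an abundance-weighted cosine similarity can be computed later.
    """
    if not 0.0 < scale <= 1.0:
        raise ValueError("scale must be in (0, 1]")
    threshold = scale * MAX_HASH
    sketch = Counter()
    for i in range(len(seq) - k + 1):
        h = kmer_hash(seq[i:i + k])
        if h < threshold:
            sketch[h] += 1
    return sketch


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the two sketched abundance vectors.

    Returns NaN when either sketch is empty, since the ratio is then
    undefined; a very small scale factor makes this more likely.
    """
    dot = sum(count * b[h] for h, count in a.items() if h in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy sequences; real inputs would be genomes or metagenomic reads.
    s1 = "ACGTACGTGGTACCATTGACCGTAGGCTA"
    s2 = "ACGTACGTGGTACCATTGACCGTAGGCTT"
    for scale in (1.0, 0.5):
        x = frac_minhash_sketch(s1, k=5, scale=scale)
        y = frac_minhash_sketch(s2, k=5, scale=scale)
        print(scale, cosine_similarity(x, y))
```

With `scale=1.0` every k-mer is retained and the function returns the exact cosine similarity of the two abundance vectors. Smaller scale factors trade accuracy for sketch size, and an empty sketch leaves the estimate undefined, which is one intuitive reason a minimum scale factor matters.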

PRACTICAL CONTRIBUTIONS

We also present frac-kmc, a fast and efficient program for generating FracMinHash sketches. frac-kmc is the fastest known FracMinHash sketch generator, and sketches computed with it allow pairwise cosine similarity to be estimated quickly and accurately on real data. frac-kmc is freely available at https://github.com/KoslickiLab/frac-kmc/.
