Hera Mahmudur Rahman, Koslicki David
School of Electrical Engineering and Computer Science, Pennsylvania State University, USA.
Huck Institutes of the Life Sciences, Pennsylvania State University, USA.
bioRxiv. 2024 May 30:2024.05.24.595805. doi: 10.1101/2024.05.24.595805.
The increasing number and volume of genomic and metagenomic data necessitate scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash established unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics, such as the cosine similarity, are still lacking.
In this paper, we present a theoretical framework for estimating cosine similarity from FracMinHash sketches. We establish conditions under which this estimation is sound, and recommend a minimum scale factor for accurate results. Experimental evidence supports our theoretical findings.
We also present frac-kmc, a fast and efficient FracMinHash sketch generator. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise cosine similarity quickly and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.
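To make the idea of estimating cosine similarity from FracMinHash sketches concrete, the following is a minimal Python sketch, not the frac-kmc implementation. It assumes that a FracMinHash sketch keeps every k-mer whose hash falls below a fraction (the scale factor) of the hash range, and that cosine similarity is computed on the k-mer count vectors restricted to the retained hashes; the exact estimator and conditions analyzed in the paper may differ. The function names, the choice of BLAKE2b as the hash function, and the parameters `k` and `scale` are illustrative assumptions.

```python
import hashlib
import math
from collections import Counter

MAX_HASH = 2**64  # assumed hash range: values fall in [0, 2**64)


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash_sketch(seq: str, k: int, scale: float) -> Counter:
    """Keep the k-mers whose hash is below scale * MAX_HASH, with their counts."""
    threshold = scale * MAX_HASH
    sketch = Counter()
    for i in range(len(seq) - k + 1):
        h = kmer_hash(seq[i:i + k])
        if h < threshold:
            sketch[h] += 1
    return sketch


def cosine_from_sketches(a: Counter, b: Counter) -> float:
    """Cosine similarity of the count vectors restricted to the sketched hashes."""
    dot = sum(count * b[h] for h, count in a.items() if h in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sketch carries no information; 0.0 is a convention here
    return dot / (norm_a * norm_b)
```

With `scale = 1.0` every k-mer is retained and the function returns the exact cosine similarity of the two count vectors; smaller scale factors trade accuracy for smaller sketches, which is why a minimum scale factor matters for reliable estimates.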