Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash.

Affiliations

Department of Computer Science and Engineering, The Pennsylvania State University, State College, Pennsylvania 16801, USA.

Department of Population Health and Reproduction, University of California, Davis, California 95616, USA.

Publication information

Genome Res. 2023 Jul;33(7):1061-1068. doi: 10.1101/gr.277651.123. Epub 2023 Jun 21.

DOI:10.1101/gr.277651.123
PMID:37344105
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10538494/
Abstract

Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. This approach has been successfully applied to metagenomic taxonomic profiling in the widely used tool sourmash gather. Although experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis to derive various statistics of FracMinHash, and prove that although FracMinHash is not unbiased (in the sense that its expected value is not equal to the quantity it attempts to estimate), this bias is easily corrected for both the containment and Jaccard index versions. Next, we show how FracMinHash can be used to compute point estimates as well as confidence intervals for evolutionary mutation distance between a pair of sequences by assuming a simple mutation model. We also investigate edge cases in which these analyses may fail, so that users of FracMinHash can be effectively warned about the likelihood of such cases. Our analyses show that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely than traditional MinHash, and that the point estimates and confidence intervals perform significantly better in estimating mutation distances.
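The ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the sourmash API. All function names, the BLAKE2b hash, the scale factor of 0.1, and the simulated sequences are assumptions made for illustration. The sketch also assumes that the containment bias is corrected by dividing by 1 - (1 - s)^|A|, where s is the scale factor and |A| is the number of distinct k-mers, and that under a simple mutation model (each base substituted independently with probability p) the containment is approximately (1 - p)^k. Confidence intervals are not computed here.

```python
import hashlib
import random

MAX_HASH = 2 ** 64  # size of the 64-bit hash space assumed in this sketch


def kmers(seq, k):
    """Return the set of distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_sketch(kmer_set, scale):
    """Keep every hash that falls below scale * MAX_HASH."""
    threshold = int(MAX_HASH * scale)
    return {h for h in (hash_kmer(x) for x in kmer_set) if h < threshold}


def containment_estimate(sketch_a, sketch_b, n_kmers_a, scale):
    """Estimate the containment of A in B, with the assumed bias correction."""
    if not sketch_a:
        return 0.0
    raw = len(sketch_a & sketch_b) / len(sketch_a)
    correction = 1.0 - (1.0 - scale) ** n_kmers_a
    return min(1.0, raw / correction)


def mutation_rate_point_estimate(containment, k):
    """Invert containment = (1 - p)^k to obtain a point estimate of p."""
    return 1.0 - containment ** (1.0 / k)


def mutate(seq, p, rng):
    """Substitute each base independently with probability p."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


# Simulated example: a random sequence and a mutated copy of it.
rng = random.Random(0)
k, scale, true_p = 21, 0.1, 0.02
original = "".join(rng.choice("ACGT") for _ in range(20000))
mutated = mutate(original, true_p, rng)

kmers_a, kmers_b = kmers(original, k), kmers(mutated, k)
c_hat = containment_estimate(
    frac_sketch(kmers_a, scale), frac_sketch(kmers_b, scale), len(kmers_a), scale
)
p_hat = mutation_rate_point_estimate(c_hat, k)
print(f"estimated containment: {c_hat:.3f}, estimated mutation rate: {p_hat:.4f}")
```

Because every k-mer is kept independently with probability s, the sketch size grows with the set size. This is the property that lets FracMinHash compare sets of very different sizes, such as a genome against a large metagenome.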
