The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches.

Affiliations

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.

Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.

Publication information

J Comput Biol. 2022 Feb;29(2):155-168. doi: 10.1089/cmb.2021.0431. Epub 2022 Feb 1.

Abstract

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (maximal intervals of mutated k-mers) and oceans (maximal intervals of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
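
The model in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code; the function names (`mutated_kmer_flags`, `count_runs`, `expected_mutated`) and the parameter values are assumptions chosen for illustration. It marks each nucleotide as mutated independently with probability r, flags a k-mer as mutated if any of its k positions is mutated (the no-spurious-matches assumption), and counts islands and oceans as maximal runs. Under this model a k-mer stays nonmutated with probability (1 - r)^k, so the expected number of mutated k-mers among L k-mers is L(1 - (1 - r)^k). The variance and the confidence intervals derived in the paper are not reproduced here.

```python
import random


def mutated_kmer_flags(num_kmers, k, r, rng):
    """Simulate the simple mutation model.

    A sequence with `num_kmers` k-mers has num_kmers + k - 1 nucleotides.
    Each nucleotide mutates independently with probability r. Assuming no
    spurious matches, a k-mer counts as mutated if and only if at least
    one of its k nucleotides mutated.
    """
    num_nucleotides = num_kmers + k - 1
    nucleotide_mutated = [rng.random() < r for _ in range(num_nucleotides)]
    return [any(nucleotide_mutated[i:i + k]) for i in range(num_kmers)]


def count_runs(flags, value):
    """Count maximal runs of `value` in `flags`.

    Runs of True are islands (mutated k-mers);
    runs of False are oceans (nonmutated k-mers).
    """
    runs = 0
    previous = None
    for flag in flags:
        if flag == value and previous != value:
            runs += 1
        previous = flag
    return runs


def expected_mutated(num_kmers, k, r):
    """Expected number of mutated k-mers: L * (1 - (1 - r)^k)."""
    return num_kmers * (1 - (1 - r) ** k)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    L, k, r = 1000, 21, 0.01  # illustrative values, not from the paper
    flags = mutated_kmer_flags(L, k, r, rng)
    print("mutated k-mers:", sum(flags))
    print("expected      :", round(expected_mutated(L, k, r), 1))
    print("islands       :", count_runs(flags, True))
    print("oceans        :", count_runs(flags, False))
```

Because neighboring k-mers share nucleotides, the mutated flags are strongly dependent, so a single run can differ noticeably from the expectation; averaging over many runs brings the count close to it.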

Similar articles

Effective sequence similarity detection with strobemers.
Genome Res. 2021 Nov;31(11):2080-2094. doi: 10.1101/gr.275648.121. Epub 2021 Oct 19.
SAKE: Strobemer-assisted k-mer extraction.
PLoS One. 2023 Nov 29;18(11):e0294415. doi: 10.1371/journal.pone.0294415. eCollection 2023.
KMC 2: fast and resource-frugal k-mer counting.
Bioinformatics. 2015 May 15;31(10):1569-76. doi: 10.1093/bioinformatics/btv022. Epub 2015 Jan 20.

Cited by

Estimating similarity and distance using FracMinHash.
Algorithms Mol Biol. 2025 May 15;20(1):8. doi: 10.1186/s13015-025-00276-8.
Metagenomic functional profiling: to sketch or not to sketch?
Bioinformatics. 2024 Sep 1;40(Suppl 2):ii165-ii173. doi: 10.1093/bioinformatics/btae397.
Exact Sketch-Based Read Mapping.
Leibniz Int Proc Inform. 2023 Sep;273. doi: 10.4230/LIPIcs.WABI.2023.14. Epub 2023 Aug 29.

References

Improved representation of sequence bloom trees.
Bioinformatics. 2020 Feb 1;36(3):721-727. doi: 10.1093/bioinformatics/btz662.
Minimap2: pairwise alignment for nucleotide sequences.
Bioinformatics. 2018 Sep 15;34(18):3094-3100. doi: 10.1093/bioinformatics/bty191.
