Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.
Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.
J Comput Biol. 2022 Feb;29(2):155-168. doi: 10.1089/cmb.2021.0431. Epub 2022 Feb 1.
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
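The mutation model in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code. The function names, the parameter `num_kmers`, and the chosen values are assumptions made for illustration. Because spurious matches are assumed away, a k-mer counts as mutated exactly when at least one of the k positions it covers is mutated, so the nucleotides themselves need not be simulated. Under independent per-position mutation, each k-mer survives with probability (1 - r)^k, which gives an expected count of mutated k-mers of num_kmers * (1 - (1 - r)^k).

```python
import random


def simulate_mutated_kmers(num_kmers, k, r, rng):
    """Return one boolean per k-mer: True if the k-mer is mutated.

    Each of the num_kmers + k - 1 positions mutates independently with
    probability r (hypothetical helper, illustrating the abstract's model).
    """
    n = num_kmers + k - 1
    mutated_pos = [rng.random() < r for _ in range(n)]
    # Sliding window: number of mutated positions covered by the current k-mer.
    count = sum(mutated_pos[:k])
    flags = []
    for i in range(num_kmers):
        flags.append(count > 0)
        if i + k < n:
            count += mutated_pos[i + k] - mutated_pos[i]
    return flags


def count_islands_and_oceans(flags):
    """Count islands (maximal runs of mutated k-mers) and oceans
    (maximal runs of nonmutated k-mers)."""
    islands = oceans = 0
    prev = None
    for f in flags:
        if f != prev:
            if f:
                islands += 1
            else:
                oceans += 1
        prev = f
    return islands, oceans


def expected_mutated(num_kmers, k, r):
    """Expected number of mutated k-mers under independent mutation."""
    return num_kmers * (1 - (1 - r) ** k)


if __name__ == "__main__":
    rng = random.Random(0)
    num_kmers, k, r = 1000, 21, 0.01  # illustrative values only
    flags = simulate_mutated_kmers(num_kmers, k, r, rng)
    islands, oceans = count_islands_and_oceans(flags)
    print("mutated k-mers:", sum(flags))
    print("expected:", expected_mutated(num_kmers, k, r))
    print("islands:", islands, "oceans:", oceans)
```

Averaging `sum(flags)` over many runs should approach the expectation. The spread around it is larger than a binomial model would suggest, since overlapping k-mers share positions and are therefore not independent.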