Wu Haonan, Blanca Antonio, Medvedev Paul
Department of Computer Science and Engineering, The Pennsylvania State University.
Department of Biochemistry and Molecular Biology, The Pennsylvania State University.
bioRxiv. 2025 Jun 25:2025.06.19.660607. doi: 10.1101/2025.06.19.660607.
K-mer-based analysis of genomic data is ubiquitous, but repetitive k-mers continue to undermine the accuracy of many methods. For example, the Mash tool (Ondov et al., 2016) can accurately estimate the substitution rate between two low-repetitive sequences from their k-mer sketches; however, it is inaccurate on repetitive sequences such as the centromere of a human chromosome. Follow-up work by Blanca et al. (2021) modeled how mutations affect k-mer sets under the strong assumptions that the sequence is non-repetitive and that mutations do not create spurious k-mer matches. However, the theoretical foundations for extending an estimator like Mash to repetitive sequences have been lacking. In this work, we relax the non-repetitive assumption and propose a novel estimator for the mutation rate. We derive theoretical bounds on our estimator's bias. Our experiments show that it remains accurate for repetitive genomic sequences, such as the alpha satellite higher order repeats in centromeres. We demonstrate our estimator's robustness across diverse datasets and across ranges of the substitution rate and k-mer size. Finally, we show how sketching can be used to avoid dealing with large k-mer sets while retaining accuracy. Our software is available at https://github.com/medvedevgroup/Repeat-Aware_Substitution_Rate_Estimator.
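To make the problem concrete, the following is a minimal Python sketch of the kind of Mash-style estimate the abstract refers to. It is not the repeat-aware estimator proposed in this work. It applies the published Mash distance formula, D = -(1/k) ln(2j / (1 + j)), to the exact Jaccard index j of two full k-mer sets (no MinHash sketching), and compares a random sequence with a toy tandem repeat. The function names, the toy sequences, and the parameter values (length 20000, repeat unit 50, r = 0.05, k = 21) are assumptions chosen for illustration only.

```python
import math
import random


def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index of two k-mer sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mash_distance(j, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)) (Ondov et al., 2016)."""
    if j <= 0.0:
        return math.inf
    return -math.log(2.0 * j / (1.0 + j)) / k


def substitute(seq, r, rng):
    """Substitute each base independently with probability r."""
    out = []
    for base in seq:
        if rng.random() < r:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def estimate(seq, r, k, rng):
    """Mutate seq at rate r, then estimate the rate from the k-mer sets."""
    mutated = substitute(seq, r, rng)
    return mash_distance(jaccard(kmer_set(seq, k), kmer_set(mutated, k)), k)


def demo(seed=0, length=20000, unit_len=50, r=0.05, k=21):
    """Return estimates for a random sequence and a toy tandem repeat."""
    rng = random.Random(seed)
    random_seq = "".join(rng.choice("ACGT") for _ in range(length))
    # Hypothetical repetitive input: one short random unit copied end to end.
    unit = "".join(rng.choice("ACGT") for _ in range(unit_len))
    repeat_seq = unit * (length // unit_len)
    return estimate(random_seq, r, k, rng), estimate(repeat_seq, r, k, rng)


if __name__ == "__main__":
    d_random, d_repeat = demo()
    print(f"true rate 0.05 | random: {d_random:.4f} | tandem repeat: {d_repeat:.4f}")
```

On the random sequence the estimate should land close to the true rate. On the tandem repeat the original sequence contributes only a handful of distinct k-mers, so the Jaccard index no longer tracks the fraction of mutated positions and the estimate drifts away from r. This is one simple way to see the failure mode on repetitive sequence that motivates the work.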