Suppr超能文献

通过对种子链扩展启发式算法的平均情况分析,证明序列比对器可以在几乎(log)时间内保证准确性。

Proving sequence aligners can guarantee accuracy in almost ( log ) time through an average-case analysis of the seed-chain-extend heuristic.

机构信息

Department of Mathematics, University of Toronto, Toronto, Ontario M5S 2E4, Canada;

Department of Mathematics, University of Toronto, Toronto, Ontario M5S 2E4, Canada.

出版信息

Genome Res. 2023 Jul;33(7):1175-1187. doi: 10.1101/gr.277637.122. Epub 2023 Mar 29.

Abstract

Seed-chain-extend with -mer seeds is a powerful heuristic technique for sequence alignment used by modern sequence aligners. Although effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with -mers Assume we are given a random nucleotide sequence of length ∼ that is indexed (or seeded) and a mutated substring of length ∼ ≤ with mutation rate θ < 0.206. We prove that we can find a = Θ(log ) for the -mer size such that the expected runtime of seed-chain-extend under optimal linear-gap cost chaining and quadratic time gap extension is ( log ), where (θ) < 2.43 · θ holds as a loose bound. The alignment also turns out to be good; we prove that more than [Formula: see text] fraction of the homologous bases is under an optimal chain. We also show that our bounds work when -mers are , that is, only a subset of all -mers is selected, and that sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and on real noisy long-read data and show that our theoretical runtimes can predict real runtimes accurately. We conjecture that our bounds can be improved further, and in particular, (θ) can be further reduced.

摘要

使用 -mer 种子进行种子链扩展是现代序列比对器中用于序列比对的强大启发式技术。尽管在实践中对于运行时和准确性都非常有效,但对于种子链扩展,并没有关于对齐效果的理论保证。在这项工作中,我们为 -mer 种子链扩展的功效提供了第一个严格的界限。假设我们给定了一个长度约为的随机核苷酸序列,该序列被索引(或播种),并且长度约为的突变子串具有突变率θ < 0.206。我们证明,我们可以找到一个大小为的 = Θ(log ),使得在最优线性间隙代价连锁和二次时间间隙扩展下,种子链扩展的预期运行时间为 ( log ),其中 (θ) < 2.43 · θ 作为宽松界限成立。该对齐也很好;我们证明,在最优链下,超过 [公式:见文本]分数的同源碱基是 。我们还表明,当 -mer 是 时,我们的界仍然有效,即仅选择所有 -mer 的一个子集,并且草图化可以减少连锁时间,而不会过多地增加对齐时间或降低准确性,从而证明了草图化作为序列对齐的实用加速技术的有效性。我们在模拟和真实的嘈杂长读数据上验证了我们的结果,并表明我们的理论运行时间可以准确预测实际运行时间。我们推测我们的界限可以进一步改进,特别是 (θ) 可以进一步降低。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1c30/10538486/716fbd51a1c7/1175f01.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验