Suppr超能文献

最短公共超字符串和基因组组装计算复杂度中的相变

Phase transition in the computational complexity of the shortest common superstring and genome assembly.

作者信息

Fernandez L A, Martin-Mayor V, Yllanes D

机构信息

Departamento de Física Teórica, Universidad Complutense, 28040 Madrid, Spain.

Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50018 Zaragoza, Spain.

出版信息

Phys Rev E. 2024 Jan;109(1-1):014133. doi: 10.1103/PhysRevE.109.014133.

Abstract

Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the "easy" phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.

摘要

基因组组装是通过比对和合并短片段(即 reads)来重建长基因序列的过程,已知它是 NP 难问题,无论是作为最短公共超串问题的一个版本,还是采用哈密顿回路公式化表述。也就是说,在最坏情况下,计算时间被认为会随着问题规模呈指数增长。尽管如此,高通量技术和现代算法目前使生物信息学家能够处理包含数十亿条 reads 的数据集。我们运用统计力学方法,通过证明该问题计算复杂度中存在相变,并表明实际实例总是处于“简单”相(可由多项式时间算法求解),来解决这一难题。此外,我们提出了一种马尔可夫链蒙特卡罗方法,在困难模式下它优于常见的确定性算法。

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验