Phase transition in the computational complexity of the shortest common superstring and genome assembly.

Authors

Fernandez L A, Martin-Mayor V, Yllanes D

Affiliations

Departamento de Física Teórica, Universidad Complutense, 28040 Madrid, Spain.

Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50018 Zaragoza, Spain.

Publication information

Phys Rev E. 2024 Jan;109(1-1):014133. doi: 10.1103/PhysRevE.109.014133.

DOI:10.1103/PhysRevE.109.014133

PMID:38366408

Abstract

Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the "easy" phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.
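To make the shortest common superstring formulation concrete, here is a minimal sketch of the textbook greedy-merge heuristic: repeatedly merge the two fragments with the longest suffix-prefix overlap. This is an illustration only. It is not the authors' Markov-chain Monte Carlo method, and the abstract does not say which deterministic algorithms were compared. The function names `overlap` and `greedy_superstring` and the toy reads are assumptions made for this example.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Greedy heuristic for a common superstring (not guaranteed shortest)."""
    # Remove exact duplicates, then reads fully contained in another read.
    frags = list(dict.fromkeys(reads))
    frags = [r for r in frags if not any(r != s and r in s for s in frags)]

    while len(frags) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, j in permutations(range(len(frags)), 2):
            k = overlap(frags[i], frags[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        # Merge the pair, writing the overlapping characters only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)

    return frags[0] if frags else ""


# Hypothetical toy reads sampled from a short made-up sequence.
toy_reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
assembled = greedy_superstring(toy_reads)
```

With long overlaps and high coverage, as in this toy input, the greedy choice is rarely ambiguous. That is in keeping with the abstract's claim that practical instances fall in the "easy" phase, though this sketch does not itself demonstrate the phase transition.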


Similar articles

Phase transition in the computational complexity of the shortest common superstring and genome assembly.

Phys Rev E. 2024 Jan;109(1-1):014133. doi: 10.1103/PhysRevE.109.014133.

Multiple sequence assembly from reads alignable to a common reference genome.

IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1283-95. doi: 10.1109/TCBB.2010.107.

Short superstrings and the structure of overlapping strings.

J Comput Biol. 1995 Summer;2(2):307-32. doi: 10.1089/cmb.1995.2.307.

Optimal algorithms for haplotype assembly from whole-genome sequence data.

Bioinformatics. 2010 Jun 15;26(12):i183-90. doi: 10.1093/bioinformatics/btq215.

A new graph model and algorithms for consistent superstring problems.

Philos Trans A Math Phys Eng Sci. 2014 Apr 21;372(2016):20130134. doi: 10.1098/rsta.2013.0134. Print 2014 May 28.

GenHap: a novel computational method based on genetic algorithms for haplotype assembly.

BMC Bioinformatics. 2019 Apr 18;20(Suppl 4):172. doi: 10.1186/s12859-019-2691-y.

Accelerating MCMC algorithms.

Wiley Interdiscip Rev Comput Stat. 2018 Sep-Oct;10(5):e1435. doi: 10.1002/wics.1435. Epub 2018 Jun 13.

Fast and SNP-aware short read alignment with SALT.

BMC Bioinformatics. 2021 Aug 25;22(Suppl 9):172. doi: 10.1186/s12859-021-04088-6.

A combinatorial approach to the design of vaccines.

J Math Biol. 2015 May;70(6):1327-58. doi: 10.1007/s00285-014-0797-4. Epub 2014 May 25.

Maximum likelihood genome assembly.

J Comput Biol. 2009 Aug;16(8):1101-16. doi: 10.1089/cmb.2009.0047.

Cited by

Using reinforcement learning in genome assembly: in-depth analysis of a Q-learning assembler.

Front Bioinform. 2025 Aug 20;5:1633623. doi: 10.3389/fbinf.2025.1633623. eCollection 2025.
