Shomorony Ilan, Kim Samuel H, Courtade Thomas A, Tse David N C
Department of Electrical Engineering & Computer Sciences, University of California, Berkeley, CA, USA.
Department of Electrical Engineering & Computer Sciences, University of California, Berkeley, CA, USA Department of Electrical Engineering, Stanford University, Stanford, CA, USA.
Bioinformatics. 2016 Sep 1;32(17):i494-i502. doi: 10.1093/bioinformatics/btw450.
In the context of third-generation long-read sequencing technologies, read-overlap-based approaches are expected to play a central role in the assembly step. A fundamental challenge in assembling from a read-overlap graph is that the true sequence corresponds to a Hamiltonian path on the graph, and, under most formulations, the assembly problem becomes NP-hard, restricting practical approaches to heuristics. In this work, we avoid this seemingly fundamental barrier by first setting the computational complexity issue aside, and seeking an algorithm that targets information limits In particular, we consider a basic feasibility question: when does the set of reads contain enough information to allow unambiguous reconstruction of the true sequence?
Based on insights from this information feasibility question, we present an algorithm-the Not-So-Greedy algorithm-to construct a sparse read-overlap graph. Unlike most other assembly algorithms, Not-So-Greedy comes with a performance guarantee: whenever information feasibility conditions are satisfied, the algorithm reduces the assembly problem to an Eulerian path problem on the resulting graph, and can thus be solved in linear time. In practice, this theoretical guarantee translates into assemblies of higher quality. Evaluations on both simulated reads from real genomes and a PacBio Escherichia coli K12 dataset demonstrate that Not-So-Greedy compares favorably with standard string graph approaches in terms of accuracy of the resulting read-overlap graph and contig N50.
Available at github.com/samhykim/nsg
courtade@eecs.berkeley.edu or dntse@stanford.edu
Supplementary data are available at Bioinformatics online.
在第三代长读长测序技术的背景下,基于读段重叠的方法有望在组装步骤中发挥核心作用。从读段重叠图进行组装的一个基本挑战是,真实序列对应于图上的一条哈密顿路径,并且在大多数情况下,组装问题会变成NP难问题,这使得实际方法只能采用启发式算法。在这项工作中,我们先将计算复杂性问题搁置一旁,转而寻求一种针对信息限制的算法,从而避开了这个看似根本性的障碍。具体而言,我们考虑一个基本的可行性问题:读段集合何时包含足够的信息以实现对真实序列的明确重建?
基于对这个信息可行性问题的见解,我们提出了一种算法——非贪婪算法——来构建一个稀疏读段重叠图。与大多数其他组装算法不同,非贪婪算法具有性能保证:只要满足信息可行性条件,该算法就能将组装问题简化为所得图上的欧拉路径问题,因此可以在线性时间内求解。在实际应用中,这种理论保证转化为更高质量的组装结果。对来自真实基因组的模拟读段以及一个PacBio大肠杆菌K12数据集的评估表明,在所得读段重叠图的准确性和重叠群N50方面,非贪婪算法优于标准的弦图方法。
可在github.com/samhykim/nsg获取
courtade@eecs.berkeley.edu或dntse@stanford.edu
补充数据可在《生物信息学》在线获取。