Bonizzoni Paola, Dondi Riccardo, Klau Gunnar W, Pirola Yuri, Pisanti Nadia, Zaccaria Simone
1 Department of Computer Science (DISCO), University of Milano-Bicocca , Milan, Italy .
2 Department of Social and Human Sciences, University of Bergamo , Bergamo, Italy .
J Comput Biol. 2016 Sep;23(9):718-36. doi: 10.1089/cmb.2015.0220. Epub 2016 Jun 9.
In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.
在二倍体基因组中,单倍型组装是一个计算问题,即从可能受测序错误影响的测序读段(称为片段)开始,重建每条染色体的两个亲本拷贝(称为单倍型)。最小错误校正(MEC)是单倍型组装中的一个突出计算问题,给定一组片段,其目标是通过应用最少数量的碱基校正来重建两个单倍型。MEC在计算上难以求解,但已证明一些基于近似或固定参数的方法能够在实际数据上获得准确结果。在这项工作中,我们从近似和固定参数可处理性的角度扩展了当前对MEC计算复杂性的刻画。具体而言,我们表明MEC在常数因子内不可近似,而在输入大小的对数因子内可近似。此外,我们回答了关于经典或实际感兴趣参数(校正总数和片段长度)的固定参数可处理性的开放性问题。此外,我们为该问题的一个变体提出了一种直接的2近似算法,该变体也已应用于聚类数据框架。最后,由于多倍体基因组,如植物和鱼类的基因组,由染色体的两个以上拷贝组成,我们引入了MEC的一种新形式,即k倍体MEC问题,它将传统问题扩展到处理多倍体基因组。我们表明新形式在计算上仍然既困难又难以近似。尽管如此,从参数化的角度来看我们证明该问题对于实际感兴趣的参数(如单倍型数量和覆盖度,或单倍型数量和片段长度)是可处理的。