Zhao Yu-Ying, Wu Ling-Yun, Zhang Ji-Hong, Wang Rui-Sheng, Zhang Xiang-Sun
Institute of Applied Mathematics, Academy of Mathematics and Systems Science, CAS, Beijing 100080, China.
Comput Biol Chem. 2005 Aug;29(4):281-7. doi: 10.1016/j.compbiolchem.2005.05.001.
Given an assembled genome of a diploid organism the haplotype assembly problem can be formulated as retrieval of a pair of haplotypes from a set of aligned weighted SNP fragments. Known computational formulations (models) of this problem are minimum letter flips (MLF) and the weighted minimum letter flips (WMLF; Greenberg et al. (INFORMS J. Comput. 2004, 14, 211-213)). In this paper we show that the general WMLF model is NP-hard even for the gapless case. However the algorithmic solutions for selected variants of WMFL can exist and we propose a heuristic algorithm based on a dynamic clustering technique. We also introduce a new formulation of the haplotype assembly problem that we call COMPLETE WMLF (CWMLF). This model and algorithms for its implementation take into account a simultaneous presence of multiple kinds of data errors. Extensive computational experiments indicate that the algorithmic implementations of the CWMLF model achieve higher accuracy of haplotype reconstruction than the WMLF-based algorithms, which in turn appear to be more accurate than those based on MLF.
对于一个二倍体生物的组装基因组,单倍型组装问题可以表述为从一组比对后的加权单核苷酸多态性(SNP)片段中检索出一对单倍型。这个问题已知的计算形式(模型)是最小字母翻转(MLF)和加权最小字母翻转(WMLF;格林伯格等人,《运筹学与管理科学学会计算杂志》,2004年,第14卷,第211 - 213页)。在本文中,我们表明,即使在无间隙的情况下,一般的WMLF模型也是NP难的。然而,对于WMLF的某些特定变体可能存在算法解决方案,我们提出了一种基于动态聚类技术的启发式算法。我们还引入了一种新的单倍型组装问题的形式,我们称之为完全WMLF(CWMLF)。这个模型及其实现算法考虑了多种数据错误的同时存在。广泛的计算实验表明,CWMLF模型的算法实现比基于WMLF的算法具有更高的单倍型重建准确性,而基于WMLF的算法又似乎比基于MLF的算法更准确。