Wang Rui-Sheng, Wu Ling-Yun, Zhang Xiang-Sun, Chen Luonan
Faculty of Engineering, Osaka Sangyo University, Osaka 574-8530, Japan.
Genome Inform. 2006;17(2):162-71.
Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation and are important for medical diagnosis and for tracking disease genes. A haplotype is the sequence of SNPs on a single copy of a chromosome. Haplotype assembly reconstructs haplotypes from DNA fragments that contain SNPs, following the methodology of shotgun sequence assembly. Conventional combinatorial models each target a particular type of error in the SNP fragments. In contrast, this paper proposes a new statistical model, a Markov chain model, for haplotype assembly based on the information in the SNP fragments. The main advantage of this model over combinatorial ones is that it requires no prior information about the error types in the data. In addition, whereas exact algorithms for most combinatorial models have exponential time complexity, the proposed model can be solved in polynomial time and is therefore efficient for large-scale problems. Experimental results on several data sets illustrate the effectiveness of the new method.
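To make the idea of a Markov chain over SNP sites concrete, the following is a minimal sketch only, not the authors' algorithm, which the abstract does not specify. It assumes that all sites are heterozygous (so the two haplotypes are complements of each other), that fragments are aligned strings over `0`, `1` and `-` (a gap, meaning the site is not covered), and that a first-order chain is enough: for each pair of adjacent sites, the fragments covering both vote on whether the alleles are equal or different, and the more frequent transition is followed. The function name `assemble_haplotypes` and the toy data are hypothetical.

```python
from typing import List, Tuple


def assemble_haplotypes(fragments: List[str]) -> Tuple[str, str]:
    """Toy first-order Markov-chain style assembly (illustrative assumption).

    Each fragment is a string over '0', '1', '-' aligned to the same SNP sites.
    Returns a pair of complementary haplotypes.
    """
    n = len(fragments[0])
    hap = [0]  # fix the first allele arbitrarily; the complement is the other haplotype
    for j in range(n - 1):
        same = diff = 0
        for frag in fragments:
            a, b = frag[j], frag[j + 1]
            if a == "-" or b == "-":
                continue  # fragment does not cover both sites
            if a == b:
                same += 1
            else:
                diff += 1
        # Follow the transition with the higher empirical frequency;
        # ties (including no coverage) default to "same".
        hap.append(hap[-1] if same >= diff else 1 - hap[-1])
    h1 = "".join(str(x) for x in hap)
    h2 = "".join(str(1 - x) for x in hap)
    return h1, h2


# Hypothetical fragments sampled from haplotypes 0101 / 1010;
# the last fragment carries one read error at the third site.
frags = ["010-", "101-", "-101", "-010", "--01", "01--", "011-"]
h1, h2 = assemble_haplotypes(frags)
```

The sketch runs in time proportional to the number of fragments times the number of sites, and it never asks which kind of error a fragment contains: the erroneous fragment is simply outvoted at the affected transition.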