Xie Minzhu, Wang Jianxin, Chen Jianer
School of Information Science and Engineering, Central South University, Changsha 410083, China.
Bioinformatics. 2008 Jul 1;24(13):i105-13. doi: 10.1093/bioinformatics/btn147.
In genetic studies of complex diseases, haplotypes provide more information than genotypes. However, haplotyping is much more difficult than genotyping using biological techniques. Therefore effective computational techniques have been in demand. The individual haplotyping problem is the computational problem of inducing a pair of haplotypes from an individual's aligned SNP fragments. Based on various optimal criteria and including different extra information, many models for the problem have been proposed. Higher accuracy of the models has been an important issue in the study of haplotype reconstruction.
The current article proposes a highly accurate model for the single individual haplotyping problem based on weighted fragments and genotypes with errors. The model is proved to be NP-hard even with gapless fragments. Based on the characteristics of Single Nucleotide Polymorphism (SNP) fragments, a parameterized algorithm of time complexity O(nk(2)2(k(2)) + m log m + mk(1)) is developed, where m is the number of fragments, n is the number of SNP sites, k(1) is the maximum number of SNP sites that a fragment covers (no more than n and usually smaller than 10) and k(2) is the maximum number of the fragments covering a SNP site (usually no more than 19). Extensive experiments show that this model is more accurate in haplotype reconstruction than other models.
The program of the parameterized algorithm can be obtained by sending an email to the corresponding author.
在复杂疾病的基因研究中,单倍型比基因型提供的信息更多。然而,使用生物技术进行单倍型分型比基因分型要困难得多。因此,人们一直需要有效的计算技术。个体单倍型分型问题是从个体的对齐单核苷酸多态性(SNP)片段中推断出一对单倍型的计算问题。基于各种最优标准并包含不同的额外信息,已经提出了许多针对该问题的模型。模型的更高准确性一直是单倍型重建研究中的一个重要问题。
本文基于带误差的加权片段和基因型,为单一个体单倍型分型问题提出了一个高精度模型。即使对于无间隙片段,该模型也被证明是NP难的。基于单核苷酸多态性(SNP)片段的特征,开发了一种时间复杂度为O(nk(2)2(k(2)) + m log m + mk(1))的参数化算法,其中m是片段数量,n是SNP位点数量,k(1)是一个片段覆盖的SNP位点的最大数量(不超过n且通常小于10),k(2)是覆盖一个SNP位点的片段的最大数量(通常不超过19)。大量实验表明,该模型在单倍型重建方面比其他模型更准确。
通过向通讯作者发送电子邮件可以获得参数化算法的程序。