CNR-Istituto di Analisi dei Sistemi e Informatica, A. Ruberti, Roma, Italy.
IEEE/ACM Trans Comput Biol Bioinform. 2010 Jul-Sep;7(3):511-23. doi: 10.1109/TCBB.2008.130.
Haplotype data play a relevant role in several genetic studies, e.g., mapping of complex disease genes, drug design, and evolutionary studies on populations. However, the experimental determination of haplotypes is expensive and time-consuming. This motivates the increasing interest in techniques for inferring haplotype data from genotypes, which can instead be obtained quickly and economically. Several such techniques are based on the maximum parsimony principle, which has been justified by both experimental results and theoretical arguments. However, the problem of haplotype inference by parsimony was shown to be NP-hard, thus limiting the applicability of exact parsimony-based techniques to relatively small data sets. In this paper, we introduce collapse rule, a generalization of the well-known Clark's rule, and describe a new heuristic algorithm for haplotype inference (implemented in a program called CollHaps), based on parsimony and the iterative application of collapse rules. The performance of CollHaps is tested on several data sets. The experiments show that CollHaps enables the user to process large data sets obtaining very "parsimonious" solutions in short processing times. They also show a correlation, especially for large data sets, between parsimony and correct reconstruction, supporting the validity of the parsimony principle to produce accurate solutions.
单体型数据在许多遗传研究中起着重要作用,例如复杂疾病基因的定位、药物设计和群体进化研究。然而,单体型的实验测定既昂贵又耗时。这促使人们越来越关注从基因型推断单体型数据的技术,因为基因型可以快速、经济地获得。有几种基于最大简约原理的此类技术,这一原理已被实验结果和理论论证所证实。然而,单体型简约推断问题被证明是 NP 难的,因此限制了基于精确简约的技术在相对较小数据集上的适用性。在本文中,我们引入了压缩规则,这是广为人知的 Clark 规则的推广,并描述了一种基于简约和压缩规则的迭代应用的新启发式单体型推断算法(在名为 CollHaps 的程序中实现)。在几个数据集上测试了 CollHaps 的性能。实验表明,CollHaps 使用户能够处理大型数据集,并在短时间内获得非常“简约”的解决方案。它们还显示出简约性和正确重建之间的相关性,特别是对于大型数据集,这支持了简约性原则产生准确解决方案的有效性。