Huang Yao-Ting, Zhang Kui, Chen Ting, Chao Kun-Mao
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
BMC Bioinformatics. 2005 Nov 1;6:263. doi: 10.1186/1471-2105-6-263.
Recent studies have shown that the patterns of linkage disequilibrium observed in human populations have a block-like structure, and a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block. In reality, some tag SNPs may be missing, and we may fail to distinguish two distinct haplotypes due to the ambiguity caused by missing data.
We show there exists a subset of SNPs (referred to as robust tag SNPs) which can still distinguish all distinct haplotypes even when some SNPs are missing. The problem of finding minimum robust tag SNPs is shown to be NP-hard. To find robust tag SNPs efficiently, we propose two greedy algorithms and one linear programming relaxation algorithm. The experimental results indicate that (1) the solutions found by these algorithms are quite close to the optimal solution; (2) the genotyping cost saved by using tag SNPs can be as high as 80%; and (3) genotyping additional tag SNPs for tolerating missing data is still cost-effective.
Genotyping robust tag SNPs is more practical than just genotyping the minimum tag SNPs if we can not avoid the occurrence of missing data. Our theoretical analysis and experimental results show that the performance of our algorithms is not only efficient but the solution found is also close to the optimal solution.
最近的研究表明,在人类群体中观察到的连锁不平衡模式具有块状结构,并且一小部分单核苷酸多态性(称为标签单核苷酸多态性)足以区分该块中的每对单倍型模式。在实际情况中,一些标签单核苷酸多态性可能缺失,并且由于缺失数据导致的模糊性,我们可能无法区分两种不同的单倍型。
我们表明存在一个单核苷酸多态性子集(称为稳健标签单核苷酸多态性),即使某些单核苷酸多态性缺失,该子集仍能区分所有不同的单倍型。寻找最小稳健标签单核苷酸多态性的问题被证明是NP难的。为了有效地找到稳健标签单核苷酸多态性,我们提出了两种贪心算法和一种线性规划松弛算法。实验结果表明:(1)这些算法找到的解决方案非常接近最优解;(2)使用标签单核苷酸多态性节省的基因分型成本可高达80%;(3)为容忍缺失数据而对额外的标签单核苷酸多态性进行基因分型仍然具有成本效益。
如果我们无法避免缺失数据的出现,对稳健标签单核苷酸多态性进行基因分型比仅对最小标签单核苷酸多态性进行基因分型更具实用性。我们的理论分析和实验结果表明,我们算法的性能不仅高效,而且找到的解决方案也接近最优解。