选择额外的标签单核苷酸多态性以耐受基因分型中的缺失数据。

Selecting additional tag SNPs for tolerating missing data in genotyping.

作者信息

Huang Yao-Ting, Zhang Kui, Chen Ting, Chao Kun-Mao

机构信息

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

出版信息

BMC Bioinformatics. 2005 Nov 1;6:263. doi: 10.1186/1471-2105-6-263.

BACKGROUND

Recent studies have shown that the patterns of linkage disequilibrium observed in human populations have a block-like structure, and a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block. In reality, some tag SNPs may be missing, and we may fail to distinguish two distinct haplotypes due to the ambiguity caused by missing data.

RESULTS

We show there exists a subset of SNPs (referred to as robust tag SNPs) which can still distinguish all distinct haplotypes even when some SNPs are missing. The problem of finding minimum robust tag SNPs is shown to be NP-hard. To find robust tag SNPs efficiently, we propose two greedy algorithms and one linear programming relaxation algorithm. The experimental results indicate that (1) the solutions found by these algorithms are quite close to the optimal solution; (2) the genotyping cost saved by using tag SNPs can be as high as 80%; and (3) genotyping additional tag SNPs for tolerating missing data is still cost-effective.

CONCLUSION

Genotyping robust tag SNPs is more practical than just genotyping the minimum tag SNPs if we can not avoid the occurrence of missing data. Our theoretical analysis and experimental results show that the performance of our algorithms is not only efficient but the solution found is also close to the optimal solution.

背景

最近的研究表明，在人类群体中观察到的连锁不平衡模式具有块状结构，并且一小部分单核苷酸多态性（称为标签单核苷酸多态性）足以区分该块中的每对单倍型模式。在实际情况中，一些标签单核苷酸多态性可能缺失，并且由于缺失数据导致的模糊性，我们可能无法区分两种不同的单倍型。

结果

我们表明存在一个单核苷酸多态性子集（称为稳健标签单核苷酸多态性），即使某些单核苷酸多态性缺失，该子集仍能区分所有不同的单倍型。寻找最小稳健标签单核苷酸多态性的问题被证明是NP难的。为了有效地找到稳健标签单核苷酸多态性，我们提出了两种贪心算法和一种线性规划松弛算法。实验结果表明：（1）这些算法找到的解决方案非常接近最优解；（2）使用标签单核苷酸多态性节省的基因分型成本可高达80%；（3）为容忍缺失数据而对额外的标签单核苷酸多态性进行基因分型仍然具有成本效益。