Onuki Ritsuko, Yamada Ryo, Yamaguchi Rui, Kanehisa Minoru, Shibuya Tetsuo
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan.
J Comput Biol. 2012 Jan;19(1):55-67. doi: 10.1089/cmb.2010.0227. Epub 2011 Dec 9.
Classifying individuals' genotype data is important in many kinds of biomedical research. Many sophisticated clustering algorithms exist, but most of them require an appropriate similarity measure between the objects to be clustered. Classification of diplotypes therefore depends on an accurate inter-diplotype similarity measure. In this article, we propose a new inter-diplotype similarity measure, which we call the population model-based distance (PMD), so that individuals with diplotype SNP data (i.e., unphased diplotypes) can be clustered with higher accuracy. For unphased diplotypes, the allele sharing distance (ASD) has been the standard measure of genetic distance between the diplotypes of individuals. To achieve higher clustering accuracy, the PMD makes use of a given appropriate population model, information that the ASD does not use. As the population model, we propose a hidden Markov model (HMM)-based model. We call the PMD based on this model the HHD (HIT HMM-based Distance). We demonstrate the impact of the HHD on diplotype classification through comprehensive large-scale experiments over 8930 genome-wide data sets derived from the HapMap SNP database. The experiments revealed that the HHD enables significantly more accurate clustering than the ASD.
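For orientation, the ASD baseline mentioned in the abstract can be sketched as follows. The abstract does not state a formula, so this uses one commonly cited formulation and should be read as an assumption: each unphased genotype is coded as a minor-allele count (0, 1, or 2) per SNP, the per-SNP distance is the absolute difference of the two counts, and the distance between two individuals is the mean over SNPs. The function names `allele_sharing_distance` and `asd_matrix` are hypothetical and are not taken from the paper. The PMD and HHD are not sketched here because the abstract does not describe how they are computed.

```python
def allele_sharing_distance(g1, g2):
    """Mean per-SNP allele sharing distance between two unphased genotypes.

    Assumption: genotypes are sequences of minor-allele counts (0, 1, or 2),
    one entry per SNP, with no missing values.
    """
    if len(g1) != len(g2) or len(g1) == 0:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    # Per-SNP distance: 0 if both alleles are shared, 1 if one, 2 if none.
    return sum(abs(a - b) for a, b in zip(g1, g2)) / len(g1)


def asd_matrix(genotypes):
    """Pairwise ASD matrix (list of lists) for a list of genotype vectors."""
    n = len(genotypes)
    return [
        [allele_sharing_distance(genotypes[i], genotypes[j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three toy individuals typed at four SNPs (illustrative values only).
    individuals = [
        [0, 1, 2, 2],
        [0, 2, 0, 2],
        [0, 1, 2, 1],
    ]
    for row in asd_matrix(individuals):
        print(row)
```

A matrix of this kind is the sort of input that distance-based clustering algorithms expect, so a different measure could in principle be swapped in at that step without changing the clustering procedure itself.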