Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.
BMC Bioinformatics. 2012 Jun 25;13:146. doi: 10.1186/1471-2105-13-146.
Linkage analysis is the first step in the search for a disease gene. Linkage studies have facilitated the identification of several hundred human genes that can harbor mutations leading to a disease phenotype. In this paper, we study an important case in which the sampled individuals are closely related but the pedigree is not given. This situation arises often when the individuals share a common ancestor six or more generations back. To our knowledge, no existing algorithm gives good results for this case.
To solve this problem, we first developed heuristic algorithms for haplotype inference without a given pedigree. We propose a parsimony-based model that can be viewed as an extension of the model first proposed by Dan Gusfield. Our heuristic algorithm uses Clark's inference rule to infer haplotype segments.
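As an illustration of Clark's inference rule in general, and not the segment-based heuristic of this paper, the following minimal sketch resolves whole genotype vectors. The encoding (0 and 1 for the two homozygous states, 2 for a heterozygous site) and the function names `compatible`, `complement`, and `clark_resolve` are assumptions made for this example only.

```python
def compatible(hap, geno):
    """A haplotype fits a genotype if it agrees at every homozygous site."""
    return all(g == 2 or g == h for h, g in zip(hap, geno))


def complement(hap, geno):
    """Given one haplotype of a genotype, derive the other one."""
    return tuple(1 - h if g == 2 else h for h, g in zip(hap, geno))


def clark_resolve(genotypes):
    """Resolve genotypes with Clark's rule (assumed encoding: 0/1 homozygous, 2 heterozygous).

    Returns (resolved, pending): a dict mapping genotype index to a haplotype
    pair, and the list of indices that could not be resolved.
    """
    known = set()
    resolved = {}
    pending = []

    # Seed the pool with genotypes that have at most one heterozygous site,
    # because their haplotype pair is unambiguous.
    for idx, g in enumerate(genotypes):
        if sum(1 for x in g if x == 2) <= 1:
            h1 = tuple(0 if x == 2 else x for x in g)
            h2 = tuple(1 if x == 2 else x for x in g)
            resolved[idx] = (h1, h2)
            known.update((h1, h2))
        else:
            pending.append(idx)

    # Repeatedly explain an ambiguous genotype as a known haplotype plus its
    # complement, and add the complement to the pool of known haplotypes.
    progress = True
    while pending and progress:
        progress = False
        for idx in list(pending):
            g = genotypes[idx]
            match = next((h for h in sorted(known) if compatible(h, g)), None)
            if match is None:
                continue
            mate = complement(match, g)
            resolved[idx] = (match, mate)
            known.add(mate)
            pending.remove(idx)
            progress = True

    return resolved, pending


genotypes = [(0, 1, 0, 0), (2, 2, 1, 0), (2, 1, 2, 0)]
resolved, pending = clark_resolve(genotypes)
```

In this toy input, the second genotype can only be resolved after the third one has contributed a new haplotype to the pool, which shows why the rule is applied repeatedly. The result of Clark's rule depends on the order in which genotypes and haplotypes are tried; the sorted order used here is just one deterministic choice.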
We ran our program on both simulated data and a set of real data from the phase II HapMap database. Experiments show that the program performs well. Recall ranges from 90% to 99% across the tested cases, meaning the program reports 90% or more of the true mutation regions. Precision varies from 29% to 90%. At a precision of 29%, the reported regions are roughly three times the size of the true mutation region, which is still very useful for narrowing down the location of the disease gene. For all tested cases, with about 110,000 SNPs on a chromosome, our program completes the computation within 20 seconds.
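To make the relationship between precision, recall, and region size concrete, here is a small sketch that scores reported regions against true regions by overlap. The abstract does not state how the two measures were computed, so the overlap-based definition, the half-open `(start, end)` SNP index ranges, and the function names are assumptions for illustration.

```python
def covered_sites(regions):
    """Expand half-open (start, end) SNP index ranges into a set of indices."""
    sites = set()
    for start, end in regions:
        sites.update(range(start, end))
    return sites


def precision_recall(reported, truth):
    """Overlap-based precision and recall (an assumed definition).

    precision = overlap / size of reported regions
    recall    = overlap / size of true regions
    """
    reported_sites = covered_sites(reported)
    true_sites = covered_sites(truth)
    overlap = len(reported_sites & true_sites)
    precision = overlap / len(reported_sites) if reported_sites else 0.0
    recall = overlap / len(true_sites) if true_sites else 0.0
    return precision, recall


# Hypothetical example: a true region of 100 SNPs fully contained in a
# reported region of 340 SNPs.
precision, recall = precision_recall(reported=[(50, 390)], truth=[(100, 200)])
```

Under this definition, the hypothetical example gives full recall and a precision of 100/340, about 29%, so a precision near the low end of the reported range corresponds to a reported region a little over three times the length of the true one.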