Sabaa Hadi, Cai Zhipeng, Wang Yining, Goebel Randy, Moore Stephen, Lin Guohui
Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada.
J Bioinform Comput Biol. 2013 Apr;11(2):1350002. doi: 10.1142/S0219720013500029. Epub 2013 Jan 16.
High-throughput single nucleotide polymorphism genotyping assays conveniently produce genotype data for genome-wide genetic linkage and association studies. For pedigree datasets, the unphased genotype data is used to infer the haplotypes for individuals, according to Mendelian inheritance rules. Linkage studies can then locate putative chromosomal regions based on the haplotype allele sharing among the pedigree members and their disease status. Most existing haplotyping programs require rather strict pedigree structures and return a single inferred solution for downstream analysis. In this research, we relax the pedigree structure to contain ungenotyped founders and present a cubic time whole genome haplotyping algorithm to minimize the number of zero-recombination haplotype blocks. With or without explicitly enumerating all the haplotyping solutions, the algorithm determines all distinct haplotype allele identity-by-descent (IBD) sharings among the pedigree members, in linear time in the total number of haplotyping solutions. Our algorithm is implemented as a computer program iBDD. Extensive simulation experiments using 2 sets of 16 pedigree structures from previous studies showed that, in general, there are trillions of haplotyping solutions, but only up to a few thousand distinct haplotype allele IBD sharings. iBDD is able to return all these sharings for downstream genome-wide linkage and association studies.
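The gap between the number of haplotyping solutions and the number of distinct IBD sharings can be illustrated with a toy example. The sketch below is not the iBDD algorithm. It is a brute-force enumeration for a single nuclear family with fully genotyped parents, a single zero-recombination block, and biallelic markers coded 0/1, all of which are simplifying assumptions. The function names (`phasings`, `consistent_solutions`, `ibd_sharing`) are hypothetical and chosen only for illustration.

```python
from itertools import product


def phasings(genotype):
    """Yield every ordered haplotype pair (h0, h1) compatible with an
    unphased genotype, given as one unordered allele pair per marker."""
    per_marker = [sorted({(a, b), (b, a)}) for a, b in genotype]
    for choice in product(*per_marker):
        h0 = tuple(c[0] for c in choice)
        h1 = tuple(c[1] for c in choice)
        yield h0, h1


def consistent_solutions(father, mother, children):
    """Brute-force all zero-recombination solutions for a nuclear family.

    A solution is (father phasing, mother phasing, inheritance), where
    inheritance[i] = (p, q) says child i received paternal haplotype p
    and maternal haplotype q, unchanged across the whole block.
    """
    for f in phasings(father):
        for m in phasings(mother):
            for inh in product(product((0, 1), repeat=2), repeat=len(children)):
                ok = all(
                    sorted((f[p][k], m[q][k])) == sorted(child[k])
                    for (p, q), child in zip(inh, children)
                    for k in range(len(child))
                )
                if ok:  # Mendelian-consistent at every marker
                    yield f, m, inh


def ibd_sharing(inh):
    """For each pair of children, record whether they share the same
    paternal haplotype and whether they share the same maternal one."""
    n = len(inh)
    return tuple(
        (inh[i][0] == inh[j][0], inh[i][1] == inh[j][1])
        for i in range(n)
        for j in range(i + 1, n)
    )


# Two markers; the father is heterozygous at both, the mother homozygous.
father = [(0, 1), (0, 1)]
mother = [(0, 0), (0, 0)]
children = [[(0, 0), (0, 1)], [(0, 1), (0, 0)]]

solutions = list(consistent_solutions(father, mother, children))
sharings = {ibd_sharing(inh) for _, _, inh in solutions}
print(len(solutions), len(sharings))  # prints "8 2"
```

Even in this tiny case, eight haplotyping solutions collapse to two distinct sibling IBD sharings, because relabeling a founder's two haplotypes changes the solution but not who shares what. The enumeration is exponential in the number of heterozygous markers and children, so it only illustrates the counting argument and says nothing about how the cubic-time algorithm in the paper achieves it.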