Yang Shuo, Carmi Shai, Pe'er Itsik
1 Department of Computer Science, Columbia University , New York, New York.
3 Braun School of Public Health, Faculty of Medicine, Hebrew University, Jerusalem, Israel .
J Comput Biol. 2016 Jun;23(6):495-507. doi: 10.1089/cmb.2016.0016. Epub 2016 Apr 22.
The genomes of remotely related individuals occasionally contain long segments that are identical by descent (IBD). Sharing of IBD segments has many applications in population and medical genetics, and it is thus desirable to study their properties in simulations. However, no current method provides a direct, efficient means to extract IBD segments from simulated genealogies. Here, we introduce computationally efficient approaches to extract ground-truth IBD segments from a sequence of genealogies, or equivalently, an ancestral recombination graph. Specifically, we use a two-step scheme, where we first identify putative shared segments by comparing the common ancestors of all pairs of individuals at some distance apart. This reduces the search space considerably, and we then proceed by determining the true IBD status of the candidate segments. Under some assumptions and when allowing a limited resolution of segment lengths, our run-time complexity is reduced from O(n(3) log n) for the naïve algorithm to O(n log n), where n is the number of individuals in the sample.
亲缘关系较远的个体的基因组偶尔会包含通过遗传相同(IBD)的长片段。IBD片段的共享在群体遗传学和医学遗传学中有许多应用,因此在模拟中研究它们的特性是很有必要的。然而,目前没有方法能提供一种直接、有效的手段从模拟系谱中提取IBD片段。在此,我们引入了计算效率高的方法,从一系列系谱(或等效地,从祖先重组图)中提取真实的IBD片段。具体而言,我们使用一种两步方案,首先通过比较一定距离外所有个体对的共同祖先来识别假定的共享片段。这大大减少了搜索空间,然后我们通过确定候选片段的真实IBD状态继续进行。在一些假设下,并且当允许片段长度有有限分辨率时,我们的运行时间复杂度从朴素算法的O(n(3) log n)降低到O(n log n),其中n是样本中的个体数量。