Sapin Emmanuel, Keller Matthew C
Institute for Behavioral Genetics, University of Colorado Boulder, Boulder, CO 80309, USA.
Psychology & Neuroscience Department, University of Colorado Boulder, Boulder, CO, USA.
Bioinformatics. 2021 Aug 9;37(15):2121-2125. doi: 10.1093/bioinformatics/btab084.
Pairwise comparison problems arise in many areas of science. In genomics, datasets are already large and getting larger, and so operations that require pairwise comparisons-either on pairs of SNPs or pairs of individuals-are extremely computationally challenging. We propose a generic algorithm for addressing pairwise comparison problems that breaks a large problem (of order n2 comparisons) into multiple smaller ones (each of order n comparisons), allowing for massive parallelization.
We demonstrated that this approach is very efficient for calling identical by descent (IBD) segments between all pairs of individuals in the UK Biobank dataset, with a 250-fold savings in time and 750-fold savings in memory over the standard approach to detecting such segments across the full dataset. This efficiency should extend to other methods of IBD calling and, more generally, to other pairwise comparison tasks in genomics or other areas of science.
A GitHub page is available at https://github.com/emmanuelsapin with the code to generate data needed for the implementation.
成对比较问题在许多科学领域都会出现。在基因组学中,数据集已经很大且还在不断增大,因此需要成对比较的操作——无论是对单核苷酸多态性(SNP)对还是个体对进行比较——在计算上都极具挑战性。我们提出了一种通用算法来解决成对比较问题,该算法将一个大问题(规模为(n^2)次比较)分解为多个较小的问题(每个规模为(n)次比较),从而实现大规模并行化。
我们证明,对于在英国生物银行数据集中的所有个体对之间调用同源片段(IBD),这种方法非常高效,与在整个数据集上检测此类片段的标准方法相比,时间节省了250倍,内存节省了750倍。这种效率应能扩展到其他IBD调用方法,更广泛地说,还能扩展到基因组学或其他科学领域的其他成对比较任务。