Shemirani Ruhollah, Belbin Gillian M, Avery Christy L, Kenny Eimear E, Gignoux Christopher R, Ambite José Luis
Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA.
Computer Science Department, University of Southern California, Los Angeles, CA, USA.
Nat Commun. 2021 Jun 10;12(1):3546. doi: 10.1038/s41467-021-22910-w.
The ability to identify segments of genomes identical-by-descent (IBD) is part of standard workflows in both statistical and population genetics. However, traditional methods for finding local IBD across all pairs of individuals scale poorly, leading to a lack of adoption in very large-scale datasets. Here, we present iLASH, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to current leading methods and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for millions of individuals. We apply iLASH to the PAGE dataset of 52,000 multi-ethnic participants, including several founder populations with elevated IBD sharing, identifying IBD segments in ~3 minutes per chromosome, compared to over 6 days for a state-of-the-art algorithm. iLASH enables efficient analysis of very large-scale datasets, as we demonstrate by computing IBD across the UK Biobank (500,000 individuals), detecting 12.9 billion pairwise connections.
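To illustrate why all-pairs comparison scales poorly and how similarity detection can avoid it: 500,000 individuals form 500,000 × 499,999 / 2 ≈ 1.25 × 10^11 pairs. The abstract does not specify the technique, so the following is only a minimal sketch, assuming MinHash-based locality-sensitive hashing (as the name iLASH suggests). It is not the authors' implementation. The function names, the string encoding of haplotypes, and the parameters `shingle_len`, `num_hashes`, and `band_size` are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(tokens, num_hashes=20):
    """Return one minimum hash value per seeded hash function (hypothetical helper)."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        ))
    return signature


def candidate_pairs(haplotypes, shingle_len=4, num_hashes=20, band_size=5):
    """Return pairs of haplotype names whose signatures agree on at least one band.

    Assumption: each haplotype is a string of alleles over the same markers.
    """
    buckets = defaultdict(set)
    for name, hap in haplotypes.items():
        # Tag each shingle with its position so only aligned matches count.
        tokens = {
            (i, hap[i:i + shingle_len])
            for i in range(0, len(hap) - shingle_len + 1, shingle_len)
        }
        sig = minhash_signature(tokens, num_hashes)
        # Split the signature into bands; equal bands land in the same bucket.
        for b in range(0, num_hashes, band_size):
            buckets[(b, tuple(sig[b:b + band_size]))].add(name)

    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


if __name__ == "__main__":
    toy = {
        "a": "0110100111010010",
        "b": "0110100111010010",  # identical to "a" over this window
        "c": "1001011000101101",  # differs from "a" at every marker
    }
    print(candidate_pairs(toy))
```

Only haplotypes that fall into a shared bucket are compared further, so the work grows with the number of candidate pairs rather than with the number of all possible pairs. A real pipeline would still need to verify and extend each candidate match into an IBD segment, which this sketch omits.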