Department of Biostatistics, University of Washington, Seattle, WA, USA.
Department of Biostatistics, University of Washington, Seattle, WA, USA; Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA.
Am J Hum Genet. 2024 Apr 4;111(4):691-700. doi: 10.1016/j.ajhg.2024.02.015. Epub 2024 Mar 20.
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more computationally efficient inference of identity by descent (IBD) than approaches that infer pairwise IBD segments and provides locus-specific IBD clusters rather than IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2,900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach. Our IBD clustering method is implemented in the open-source ibd-cluster software package.
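To make the distinction between pairwise IBD segments and locus-specific IBD clusters concrete, the following minimal sketch groups haplotypes into clusters at a single position using a union-find structure. It is an illustration only and is not the algorithm used by ibd-cluster, which by design avoids enumerating pairwise segments. The function name `ibd_clusters_at`, the tuple layout of the segments, and the toy haplotype labels and coordinates are all assumptions made for this example.

```python
from collections import defaultdict


def ibd_clusters_at(position, segments):
    """Group haplotypes that are IBD at `position`.

    `segments` is a hypothetical list of pairwise IBD segments given as
    (hap_a, hap_b, start, end) tuples with inclusive coordinates.
    Haplotypes connected through any chain of segments covering
    `position` end up in the same cluster.
    """
    parent = {}

    def find(x):
        # Register unseen haplotypes as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for hap_a, hap_b, start, end in segments:
        if start <= position <= end:
            root_a, root_b = find(hap_a), find(hap_b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    groups = defaultdict(set)
    for hap in list(parent):
        groups[find(hap)].add(hap)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy data (made up for illustration, not from the study).
segments = [
    ("h1", "h2", 100, 500),
    ("h2", "h3", 300, 900),
    ("h4", "h5", 200, 400),
    ("h1", "h6", 600, 800),
]
clusters_350 = ibd_clusters_at(350, segments)
clusters_700 = ibd_clusters_at(700, segments)
```

A cluster of k haplotypes is reported as k members, whereas the equivalent pairwise representation needs up to k(k-1)/2 segments, which is one way to see why cluster output can scale linearly with the number of individuals.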