23andMe, Inc., Sunnyvale, CA, USA.
Mol Biol Evol. 2021 May 4;38(5):2131-2151. doi: 10.1093/molbev/msaa328.
Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows-Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors, we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally, we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale data sets with millions of samples. Furthermore, we describe a binary file format for TPBWT-compressed haplotypes that enables fast and efficient out-of-sample IBD computation against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis, exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for noncommercial use in the code repository (https://github.com/23andMe/phasedibd, last accessed January 11, 2021).
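The sensitivity of exact-match IBD detection to genotype errors can be illustrated with a toy example. The sketch below is not the TPBWT and not the phasedibd API. It is a naive pairwise scan, with a hypothetical function name (`exact_match_runs`) and made-up 0/1 haplotypes, that reports runs of identical alleles of at least a minimum length. It assumes haplotypes are already phased and encoded as equal-length lists of alleles, and it measures length in sites rather than centimorgans.

```python
from typing import List, Tuple


def exact_match_runs(hap_a: List[int], hap_b: List[int],
                     min_sites: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) site ranges over which the two
    haplotypes carry identical alleles for at least `min_sites` sites.

    Hypothetical helper for illustration only: a naive O(n) pairwise scan,
    not the TPBWT described in the abstract.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")

    runs: List[Tuple[int, int]] = []
    start = None  # index where the current matching run began, if any
    for i, (a, b) in enumerate(zip(hap_a, hap_b)):
        if a == b:
            if start is None:
                start = i
        else:
            # A mismatch closes the current run; keep it only if long enough.
            if start is not None and i - start >= min_sites:
                runs.append((start, i))
            start = None
    # Close a run that extends to the last site.
    if start is not None and len(hap_a) - start >= min_sites:
        runs.append((start, len(hap_a)))
    return runs


# Made-up data: two copies of the same 12-site haplotype, one of which
# has a single simulated genotype error at site 6.
truth = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
clean = list(truth)
noisy = list(truth)
noisy[6] ^= 1  # flip one allele

print(exact_match_runs(truth, clean, min_sites=8))  # one run spanning all sites
print(exact_match_runs(truth, noisy, min_sites=8))  # the error splits the run
```

With a threshold of 8 sites, the single flipped allele splits the shared stretch into two pieces that each fall below the threshold, so nothing is reported. Methods that tolerate genotype and phasing errors aim to avoid this kind of false negative.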