Naseri Ardalan, Yue William, Zhang Shaojie, Zhi Degui
School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston TX 77030, USA.
Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.
bioRxiv. 2023 Jan 10:2023.01.09.523304. doi: 10.1101/2023.01.09.523304.
Rates of recombination events across the genome (genetic maps) are fundamental to genetic research, yet most current studies use only one standard map. There is evidence of population differences in genetic maps, so estimating population-specific maps is of interest. Although the recent availability of biobank-scale data offers this opportunity, current methods do not scale efficiently to very large sample sizes. The most accurate methods are still linkage-disequilibrium (LD)-based methods, which are tractable only for a few hundred samples. In this work, we propose a fast and memory-efficient method for estimating genetic maps from population genotyping data. Our method, FastRecomb, leverages the efficient positional Burrows-Wheeler transform (PBWT) data structure to count IBD segment boundaries as potential recombination events. We use PBWT blocks to avoid redundant counting of pairwise matches, and a panel smoothing technique to reduce noise from errors and recent mutations. In simulations, FastRecomb achieves state-of-the-art performance at 10k resolution, as measured by the correlation coefficient between the estimated map and the ground truth. This is mainly because FastRecomb can effectively take advantage of large panels comprising hundreds of thousands of haplotypes, whereas other methods lack the efficiency to handle such data. We believe further refinement of FastRecomb would deliver more accurate genetic maps for the genetics community.
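The idea of counting match boundaries with PBWT can be illustrated with a minimal sketch. This is not the FastRecomb implementation: it assumes a small binary haplotype panel held in memory, counts only matches between haplotypes that are adjacent in the PBWT order (rather than using PBWT blocks), and omits panel smoothing. The function name `count_match_ends` and the parameter `min_len` are hypothetical, introduced only for this illustration.

```python
def count_match_ends(haps, min_len):
    """Count, per site, matches of at least `min_len` sites that end there.

    haps    : list of equal-length lists of 0/1 alleles (one list per haplotype)
    min_len : minimum match length, in sites, for a match end to be counted
    Only pairs adjacent in the PBWT prefix order are examined (a simplification).
    """
    m, n = len(haps), len(haps[0])
    order = list(range(m))   # positional prefix array: haplotypes sorted by reversed prefix
    div = [0] * m            # divergence array: start site of the match with the previous haplotype
    ends = [0] * n

    for k in range(n):
        # A match between neighbours ends at site k if it is long enough
        # and the two haplotypes carry different alleles at k.
        for i in range(1, m):
            long_enough = k - div[i] >= min_len
            if long_enough and haps[order[i]][k] != haps[order[i - 1]][k]:
                ends[k] += 1

        # Standard PBWT update of the prefix and divergence arrays for site k + 1.
        zeros, ones, div_zeros, div_ones = [], [], [], []
        p = q = k + 1
        for i in range(m):
            p = max(p, div[i])
            q = max(q, div[i])
            h = order[i]
            if haps[h][k] == 0:
                zeros.append(h)
                div_zeros.append(p)
                p = 0
            else:
                ones.append(h)
                div_ones.append(q)
                q = 0
        order = zeros + ones
        div = div_zeros + div_ones

    return ends


# Two haplotypes identical over sites 0-3 that diverge at site 4.
panel = [
    [0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1, 0],
]
print(count_match_ends(panel, min_len=3))  # [0, 0, 0, 0, 1, 0]
```

In this toy setting the per-site counts could be binned and normalized to give a relative rate profile. Turning such counts into a calibrated genetic map would require the additional steps described above.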