Nowbandegani Pouria Salehi, Zhang Shenghan, Hu Haoyang, Li Heng, O'Connor Luke J
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
bioRxiv. 2025 Aug 4:2025.08.04.668502. doi: 10.1101/2025.08.04.668502.
Structural variation causes some human haplotypes to align poorly with the linear reference genome, leading to 'reference bias'. A pangenome reference graph could ameliorate this bias by relating a sample to multiple reference assemblies. However, this approach requires a new definition of a 'genetic variant'. We introduce a definition of pangenome variants and a method, , to identify them. Our approach involves a pangenome that includes all nodes (sequences) of the pangenome graph, but only a subset of its edges; non-reference edges are . Our variants are biallelic and have well-defined positions. Analyzing the Minigraph-Cactus draft human pangenome reference graph, we identified 29.6 million genetic variants. Most variants (99.2%) are small, and most small variants (73.9%) are SNPs. Of the 29.6 million variants, 3.5 million (11.7%) have a reference allele that is not on GRCh38; these variants are difficult to detect without a pangenome reference, and also with existing pangenome-based approaches. They tend to be embedded within tangled, multiallelic regions. We analyze two medically relevant regions, around the HLA-A and RHD genes, and identify thousands of small variants embedded within several large insertions, deletions, and inversions. We release an open-source software tool together with a VCF variant catalogue.