UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
University of Ferrara, Ferrara, Italy.
Nat Methods. 2024 Nov;21(11):2017-2023. doi: 10.1038/s41592-024-02407-2. Epub 2024 Sep 11.
Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with by filtering rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit (https://github.com/vgteam/vg) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. The personalized approach reduces small variant genotyping errors fourfold relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
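The idea of sampling local haplotypes according to k-mer counts in the reads can be illustrated with a toy sketch. This is not the vg implementation: the function names, the scoring rule (+1 for a haplotype-specific k-mer seen in the reads, -1 for one that is absent), and the example sequences are all assumptions made for illustration. It only shows the general shape of the technique, assuming one local block with a handful of candidate haplotypes.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_read_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def select_haplotypes(haplotypes, reads, k=5, n=2):
    """Rank candidate haplotypes of one local block by read k-mer support.

    haplotypes: dict mapping a (hypothetical) haplotype name to its sequence.
    Returns (names of the n best haplotypes, dict of all scores).
    """
    read_counts = count_read_kmers(reads, k)
    kmer_sets = {name: set(kmers(seq, k)) for name, seq in haplotypes.items()}

    # k-mers present in every candidate cannot discriminate between them.
    shared = set.intersection(*kmer_sets.values())

    scores = {}
    for name, kmer_set in kmer_sets.items():
        score = 0
        for km in kmer_set - shared:
            # Assumed scoring rule: reward k-mers seen in the reads,
            # penalize k-mers the reads never show.
            score += 1 if read_counts[km] > 0 else -1
        scores[name] = score

    # Highest score first; ties broken by name for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n], scores


if __name__ == "__main__":
    # Made-up sequences: three haplotypes differing at one position.
    candidates = {
        "hapA": "ACGTACGTTAGC",
        "hapB": "ACGTACCTTAGC",
        "hapC": "ACGTAAGTTAGC",
    }
    sample_reads = ["GTACCTTA", "ACGTACCT", "CCTTAGC"]
    best, all_scores = select_haplotypes(candidates, sample_reads, k=5, n=1)
    print(best, all_scores)
```

A personalized subgraph would then be built from the selected haplotypes in each block, so that variants carried only by the unselected haplotypes no longer attract read mappings. Unlike a global allele-frequency filter, a rare haplotype is kept whenever the reads support it.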