Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland.
Genome Biol. 2023 May 22;24(1):124. doi: 10.1186/s13059-023-02969-y.
Several models and algorithms have been proposed to build pangenomes from multiple input assemblies, but their impact on variant representation, and consequently downstream analyses, is largely unknown.
We create multi-species super-pangenomes using pggb, cactus, and minigraph with the Bos taurus taurus reference sequence and eleven haplotype-resolved assemblies from taurine and indicine cattle, bison, yak, and gaur. We recover 221 k nonredundant structural variations (SVs) from the pangenomes, of which 135 k (61%) are common to all three graphs. SVs derived from assembly-based calling show high agreement with the consensus calls from the pangenomes (96%), but validate only a small proportion of the variations private to each graph. Pggb and cactus, which also incorporate base-level variation, have approximately 95% exact matches with assembly-derived small variant calls, which significantly improves the edit rate when realigning assemblies compared to minigraph. We use the three pangenomes to investigate 9566 variable number tandem repeats (VNTRs), finding that 63% have identical predicted repeat counts in the three graphs, while minigraph can over- or underestimate the count given its approximate coordinate system. We examine a highly variable VNTR locus and show that repeat unit copy number impacts the expression of proximal genes and non-coding RNA.
Our findings indicate good consensus between the three pangenome methods but also show their individual strengths and weaknesses that need to be considered when analysing different types of variants from multiple input assemblies.
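As a rough illustration of the overlap statistic reported above (135 k of 221 k SVs, about 61%, shared by all three graphs), the following minimal sketch counts shared and private calls across call sets. It is not the authors' pipeline: the function name `overlap_summary`, the `(chromosome, position, type)` keys, and the toy calls are hypothetical, and exact key matching is an assumption made for brevity, whereas real SV comparison generally needs tolerance on breakpoints and alleles.

```python
from collections import Counter


def overlap_summary(calls_by_graph):
    """Summarise shared and private calls across named call sets.

    calls_by_graph maps a graph name to a set of hashable SV keys.
    Two calls are treated as the same SV only if their keys are equal
    (an assumption; real comparisons usually allow fuzzy matching).
    """
    all_calls = set().union(*calls_by_graph.values())
    # Number of graphs supporting each distinct call.
    support = Counter(key for calls in calls_by_graph.values() for key in calls)
    n_graphs = len(calls_by_graph)
    shared = {key for key, n in support.items() if n == n_graphs}
    private = {
        name: sum(1 for key in calls if support[key] == 1)
        for name, calls in calls_by_graph.items()
    }
    return {
        "total": len(all_calls),
        "shared": len(shared),
        "shared_fraction": len(shared) / len(all_calls) if all_calls else 0.0,
        "private": private,
    }


# Hypothetical toy calls, keyed by (chromosome, position, SV type).
toy_calls = {
    "pggb": {("1", 100, "DEL"), ("1", 500, "INS"), ("2", 40, "DEL")},
    "cactus": {("1", 100, "DEL"), ("1", 500, "INS"), ("3", 7, "INS")},
    "minigraph": {("1", 100, "DEL"), ("1", 500, "INS"), ("2", 40, "DEL")},
}

summary = overlap_summary(toy_calls)
print(summary)

# The percentage in the abstract follows from the reported counts.
reported_fraction = 135 / 221
print(f"{reported_fraction:.0%}")
```

In the toy data, two of the four distinct calls are supported by every graph and one call is private to the cactus set, mirroring the kind of consensus and graph-private categories the abstract describes.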