Animal Genomics, Eidgenössische Technische Hochschule (ETH) Zürich, 8315 Zürich, Switzerland
Proc Natl Acad Sci U S A. 2021 May 18;118(20). doi: 10.1073/pnas.2101056118.
Many genomic analyses start by aligning sequencing reads to a linear reference genome. However, linear reference genomes are imperfect: they lack millions of bases of unknown relevance and cannot reflect the genetic diversity of populations. This makes reference-guided methods susceptible to reference-allele bias. To overcome such limitations, we build a pangenome from six reference-quality assemblies from taurine and indicine cattle as well as yak. The pangenome contains an additional 70,329,827 bases compared to the reference genome. Our multiassembly approach reveals 30 and 10.1 million bases private to yak and indicine cattle, respectively, and between 3.3 and 4.4 million bases unique to each taurine assembly. Utilizing transcriptomes from 56 cattle, we show that these nonreference sequences encode transcripts that hitherto remained undetected from the reference genome. We uncover genes, primarily encoding proteins contributing to immune response and pathogen-mediated immunomodulation, that are differentially expressed between infected and noninfected cattle and that are also undetectable in the reference genome. Using whole-genome sequencing data of cattle from five breeds, we show that reads that were previously misaligned against the reference genome now align accurately to the pangenome sequences. This enables us to discover 83,250 polymorphic sites that segregate within and between breeds of cattle and capture genetic differentiation across breeds. Our work makes a so-far unused source of variation amenable to genetic investigations and provides methods and a framework for establishing and exploiting a more diverse reference genome.
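The distinction drawn above between bases private to one assembly and nonreference bases shared by several assemblies can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `tally_nonreference_bases`, the assembly labels, and the segment lengths are all hypothetical toy values. It assumes each pangenome segment is already annotated with its length and the set of assemblies that carry it.

```python
from collections import defaultdict


def tally_nonreference_bases(segments, reference="ref"):
    """Sum nonreference segment lengths by the assemblies that carry them.

    segments: iterable of (length_in_bases, set_of_assembly_names).
    Returns (private, shared): `private` maps an assembly name to the bases
    found only in that assembly; `shared` is the total of nonreference bases
    carried by two or more assemblies.
    """
    private = defaultdict(int)
    shared = 0
    for length, carriers in segments:
        if reference in carriers:
            continue  # segment is present in the reference, so it is not novel
        if len(carriers) == 1:
            (name,) = carriers
            private[name] += length
        elif len(carriers) > 1:
            shared += length
    return dict(private), shared


# Hypothetical toy segments; lengths and labels are illustrative only.
toy_segments = [
    (1200, {"ref", "yak"}),       # reference segment, ignored
    (800, {"yak"}),               # private to yak
    (300, {"indicine"}),          # private to indicine
    (500, {"yak", "indicine"}),   # nonreference, shared by two assemblies
    (150, {"taurine_a"}),         # private to one taurine assembly
    (250, {"yak"}),               # private to yak
]

private, shared = tally_nonreference_bases(toy_segments)
print(private, shared)
```

In a real analysis the segment annotations would come from the pangenome graph itself; here they are hard-coded so the bookkeeping is easy to follow.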