Animal Genomics, ETH Zürich, Zürich, Switzerland.
Genome Biol. 2020 Jul 27;21(1):184. doi: 10.1186/s13059-020-02105-0.
The current bovine genomic reference sequence was assembled from a Hereford cow. The resulting linear assembly lacks diversity because it does not contain allelic variation, a drawback of linear references that causes reference allele bias. High nucleotide diversity and the partitioning of individuals into hundreds of breeds make cattle ideally suited to investigate the optimal composition of variation-aware references.
We augment the bovine linear reference sequence (ARS-UCD1.2) with variants filtered for allele frequency in dairy (Brown Swiss, Holstein) and dual-purpose (Fleckvieh, Original Braunvieh) cattle breeds to construct either breed-specific or pan-genome reference graphs using the vg toolkit. We find that read mapping to variation-aware references is more accurate than mapping to the linear reference, provided that pre-selected variants are used to construct the genome graphs. Graphs that contain random variants do not improve read mapping over the linear reference sequence. Breed-specific augmented graphs and pan-genome graphs yield nearly the same improvement in mapping accuracy over the linear reference. We construct a whole-genome graph that contains the Hereford-based reference sequence and 14 million alleles that have an alternate allele frequency greater than 0.03 in the Brown Swiss cattle breed. Our novel variation-aware reference facilitates accurate read mapping and unbiased sequence variant genotyping for SNPs and indels.
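The variant pre-selection step can be illustrated with a minimal sketch. It assumes the candidate variants are stored as VCF text lines whose INFO column carries an `AF` tag holding the alternate allele frequency in the breed of interest; the function names `parse_af` and `filter_by_af` and the toy records are hypothetical and are not taken from the study's pipeline. Only records with an alternate allele frequency above the threshold (0.03, the value stated for the Brown Swiss graph) are retained, and the retained records would then be passed, together with the linear reference, to a graph construction tool such as vg.

```python
from typing import Iterable, Iterator, List


def parse_af(info: str) -> List[float]:
    """Return the alternate allele frequencies found in the AF tag of a VCF INFO string."""
    for field in info.split(";"):
        if field.startswith("AF="):
            # Multi-allelic records list one frequency per alternate allele;
            # "." marks a missing value and is skipped.
            return [float(x) for x in field[3:].split(",") if x != "."]
    return []


def filter_by_af(lines: Iterable[str], min_af: float = 0.03) -> Iterator[str]:
    """Yield header lines plus records with an alternate allele frequency > min_af.

    Simplification (an assumption of this sketch): a multi-allelic record is
    kept whole if any of its alternate alleles passes the threshold.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        columns = line.rstrip("\n").split("\t")
        info = columns[7]  # INFO is the eighth VCF column
        if any(af > min_af for af in parse_af(info)):
            yield line


# Toy input: invented records for illustration only, not data from the study.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t.\tPASS\tAF=0.50\n",
    "1\t200\t.\tC\tT\t.\tPASS\tAF=0.01\n",
    "1\t300\t.\tG\tA,T\t.\tPASS\tAF=0.02,0.10\n",
    "1\t400\t.\tT\tC\t.\tPASS\tAF=0.03\n",
]

kept = list(filter_by_af(example_vcf))
kept_positions = [rec.split("\t")[1] for rec in kept if not rec.startswith("#")]
```

The strict comparison mirrors the "greater than 0.03" wording, so a variant at exactly 0.03 is dropped. A production workflow would more likely use an established VCF tool for this filtering; the sketch only makes the selection criterion explicit.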
We develop the first variation-aware reference graph for an agricultural animal (https://doi.org/10.5281/zenodo.3759712). Our novel reference structure improves sequence read mapping and variant genotyping over the linear reference. Our work is a first step towards the transition from linear to variation-aware reference structures in species with high genetic diversity and many sub-populations.