Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea.
Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea.
Genome Biol. 2022 Sep 27;23(1):204. doi: 10.1186/s13059-022-02765-0.
Many short-read genome assemblies have been found to be incomplete and to contain mis-assemblies. The Vertebrate Genomes Project has been producing new reference genome assemblies with an emphasis on being as complete and error-free as possible, which requires long reads, long-range scaffolding data, new assembly algorithms, and manual curation. A more thorough evaluation of the recent references relative to prior assemblies can provide a detailed overview of the types and magnitude of improvements.
Here we evaluate new vertebrate genome references relative to the previous assemblies for the same species and, in two cases, the same individuals, including a mammal (platypus), two birds (zebra finch, Anna's hummingbird), and a fish (climbing perch). We find that up to 11% of genomic sequence is entirely missing in the previous assemblies. In the Vertebrate Genomes Project zebra finch assembly, we identify eight new GC- and repeat-rich micro-chromosomes with high gene density. The impact of missing sequences is biased towards GC-rich 5'-proximal promoters and 5' exon regions of protein-coding genes and long non-coding RNAs. Between 26 and 60% of genes include structural or sequence errors that could lead to misunderstanding of their function when using the previous genome assemblies.
Our findings reveal novel regulatory landscapes and protein coding sequences that have been greatly underestimated in previous assemblies and are now present in the Vertebrate Genomes Project reference genomes.
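As an illustration of how a figure such as "up to 11% of genomic sequence is entirely missing" can be computed, the sketch below measures the fraction of a new assembly that is not covered by any alignment from a previous assembly. It is a minimal sketch, not the authors' pipeline: the function names, the input layout (a dictionary of chromosome lengths plus half-open, BED-style aligned intervals), and the toy numbers are all assumptions made for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def missing_fraction(chrom_lengths, aligned):
    """Fraction of the new assembly with no alignment from the old assembly.

    chrom_lengths: {chromosome name: length in bp} for the new assembly.
    aligned: {chromosome name: [(start, end), ...]} regions of the new
             assembly covered by the old assembly (hypothetical input,
             e.g. parsed from a whole-genome alignment).
    """
    total = sum(chrom_lengths.values())
    covered = 0
    for chrom, length in chrom_lengths.items():
        for start, end in merge_intervals(aligned.get(chrom, [])):
            # Clip to chromosome bounds; ignore intervals that fall outside.
            covered += max(0, min(end, length) - max(start, 0))
    return (total - covered) / total


# Toy data (invented): one chromosome partly aligned, and one
# micro-chromosome absent from the previous assembly altogether.
lengths = {"chr1": 1000, "micro1": 100}
alignments = {"chr1": [(0, 400), (300, 900)]}
print(f"missing: {missing_fraction(lengths, alignments):.1%}")
```

Merging intervals before summing matters because overlapping alignments would otherwise be counted twice and understate the missing fraction.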