Department of Plant Sciences, University of California, Davis, CA, 95616, USA.
GDEC, Université Clermont Auvergne, INRAE, Clermont-Ferrand, 63000, France.
Plant J. 2021 Jul;107(1):303-314. doi: 10.1111/tpj.15289. Epub 2021 May 16.
Until recently, a reference-quality genome sequence for bread wheat was thought to be beyond the limits of genome sequencing and assembly technology, primarily because of the large genome size and >80% repetitive sequence content. The release of the chromosome-scale 14.5-Gb IWGSC RefSeq v1.0 genome sequence of bread wheat cv. Chinese Spring (CS) was, therefore, a milestone. Here, we used a direct label and stain (DLS) optical map of the CS genome, together with a prior nick, label, repair and stain (NLRS) optical map and sequence contigs assembled from Pacific Biosciences long reads, to refine the v1.0 assembly. Inconsistencies between the sequence and the maps were reconciled and gaps were closed. Gap filling and anchoring of 279 unplaced scaffolds increased the total length of pseudomolecules by 168 Mb (excluding Ns). Positions and orientations were corrected for 233 and 354 scaffolds, respectively, representing 10% of the genome sequence. The accuracy of the remaining 90% of the assembly was validated. As a result of the increased contiguity, the numbers of transposable elements (TEs) and intact TEs are higher in IWGSC RefSeq v2.1 than in v1.0. In total, 98% of the gene models identified in v1.0 were mapped onto the new assembly using a dedicated approach implemented in the MAGAAT pipeline. The number of high-confidence genes on pseudomolecules increased from 105 319 to 105 534. The reconciled assembly enhances the utility of the sequence for genetic mapping, comparative genomics, gene annotation and isolation, and more general studies on the biology of wheat.