Wang Bo, Yang Xiaofei, Jia Yanyan, Xu Yu, Jia Peng, Dang Ningxin, Wang Songbo, Xu Tun, Zhao Xixi, Gao Shenghan, Dong Quanbin, Ye Kai
MOE Key Laboratory for Intelligent Networks & Network Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.
School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.
Genomics Proteomics Bioinformatics. 2022 Feb;20(1):4-13. doi: 10.1016/j.gpb.2021.08.003. Epub 2021 Sep 3.
Arabidopsis thaliana is an important and long-established model species for plant molecular biology, genetics, epigenetics, and genomics. However, the latest version of the reference genome still contains a significant number of missing segments. Here, we report a high-quality and almost complete Col-0 genome assembly with two gaps (named Col-XJTU), produced by combining Oxford Nanopore Technologies ultra-long reads, Pacific Biosciences high-fidelity long reads, and Hi-C data. The total genome assembly size is 133,725,193 bp, introducing 14.6 Mb of novel sequences compared to the TAIR10.1 reference genome. All five chromosomes of the Col-XJTU assembly are highly accurate, with consensus quality (QV) scores > 60 (ranging from 62 to 68), which are higher than those of the TAIR10.1 reference (ranging from 45 to 52). We completely resolved chromosome (Chr) 3 and Chr5 in a telomere-to-telomere manner. Chr4 was completely resolved except for the nucleolar organizing regions, which comprise long repetitive DNA fragments. The Chr1 centromere (CEN1), reportedly around 9 Mb in length, is particularly challenging to assemble due to the presence of tens of thousands of CEN180 satellite repeats. Using cutting-edge sequencing data and novel computational approaches, we assembled a 3.8-Mb-long CEN1 and a 3.5-Mb-long CEN2. We also investigated the structure and epigenetics of the centromeres. Four clusters of CEN180 monomers were detected, and the centromere-specific histone H3-like protein (CENH3) exhibited a strong preference for CEN180 Cluster 3. Moreover, we observed hypomethylation patterns in CENH3-enriched regions. We believe that this high-quality genome assembly, Col-XJTU, will serve as a valuable reference for better understanding the global pattern of centromeric polymorphisms, as well as genetic and epigenetic features in plants.
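To give a sense of scale for the QV ranges quoted above, the following is a minimal sketch that converts a consensus quality score into an approximate per-base error rate. It assumes QV is Phred-scaled (QV = -10 log10 of the error probability), which is the usual convention for assembly consensus quality but is not stated in the abstract; the function names are hypothetical, and applying a single QV to the whole assembly size is for illustration only.

```python
# Assumption: QV is Phred-scaled, i.e. QV = -10 * log10(per-base error probability).
# The abstract does not say how QV was computed; this is an illustrative sketch.

ASSEMBLY_SIZE_BP = 133_725_193  # total Col-XJTU assembly size from the abstract


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled QV into a per-base error probability."""
    return 10 ** (-qv / 10)


def expected_errors(qv: float, length_bp: int) -> float:
    """Expected number of erroneous bases in a sequence of the given length."""
    return qv_to_error_rate(qv) * length_bp


if __name__ == "__main__":
    # QV bounds quoted in the abstract: TAIR10.1 (45 to 52), Col-XJTU (62 to 68).
    for label, qv in [("TAIR10.1 low", 45), ("TAIR10.1 high", 52),
                      ("Col-XJTU low", 62), ("Col-XJTU high", 68)]:
        rate = qv_to_error_rate(qv)
        errors = expected_errors(qv, ASSEMBLY_SIZE_BP)
        print(f"{label}: QV {qv} -> error rate {rate:.2e}, "
              f"about {errors:.0f} expected errors over {ASSEMBLY_SIZE_BP:,} bp")
```

Under this assumption, a QV of 60 corresponds to roughly one error per million bases, so each 10-point gain in QV means a tenfold reduction in the expected error rate.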