KOREF_S1: phased, parental trio-binned Korean reference genome using long reads and Hi-C sequencing methods.

Affiliations

Korean Genomics Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea.

Department of Biomedical Engineering, College of Information and Biotechnology, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea.

Publication information

Gigascience. 2022 Mar 24;11. doi: 10.1093/gigascience/giac022.

DOI:10.1093/gigascience/giac022

PMID:35333300

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8952264/

Abstract

BACKGROUND

KOREF is the Korean reference genome, which was constructed with various sequencing technologies including long reads, short reads, and optical mapping methods. It is also the first East Asian multiomic reference genome accompanied by extensive clinical information, time-series and multiomic data, and parental sequencing data. However, it was still not a chromosome-scale reference. Here, we updated the previous KOREF assembly to a new chromosome-level haploid assembly of KOREF, KOREF_S1v2.1. Oxford Nanopore Technologies (ONT) PromethION, Pacific Biosciences HiFi-CCS, and Hi-C technology were used to build the most accurate East Asian reference assembled so far.

RESULTS

We produced 705 Gb of ONT reads and 114 Gb of Pacific Biosciences HiFi reads, and corrected the ONT reads with the Pacific Biosciences reads. The corrected ultra-long reads had a base error rate of 1.4%, a higher accuracy than that of the previous KOREF_S1v1.0, which was built mainly with short reads. KOREF has parental genome information, and we successfully phased it using a trio-binning method, obtaining a near-complete haploid assembly. The final assembly has a total length of 2.9 Gb with an N50 of 150 Mb, and its longest scaffold covers 97.3% of GRCh38's chromosome 2. In addition, the final assembly shows high base accuracy, with <0.01% base errors.
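For readers less familiar with the two headline metrics above, the following minimal Python sketch shows how an N50 and a Phred-scaled quality value are conventionally computed. It is an illustration only: the function names `n50` and `phred_qv` and the toy scaffold lengths are assumptions made for this example, not the authors' pipeline or data.

```python
import math


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def phred_qv(error_rate):
    """Convert a per-base error rate (a fraction, not a percentage)
    into a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# Hypothetical scaffold lengths in Mb, chosen only to exercise the function.
toy_scaffolds = [150, 120, 90, 60, 30, 10]
print("toy N50 (Mb):", n50(toy_scaffolds))

# 0.01% base errors corresponds to an error rate of 1e-4.
print("QV at 0.01% error:", phred_qv(1e-4))
```

Under this convention, a base error rate below 0.01% corresponds to a quality value above roughly Q40, and an N50 of 150 Mb means that scaffolds of at least 150 Mb account for at least half of the 2.9 Gb assembly.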

CONCLUSIONS

KOREF_S1v2.1 is the first chromosome-scale haploid assembly of the Korean reference genome with high contiguity and accuracy. Our study provides useful resources of the Korean reference genome and demonstrates a new strategy of hybrid assembly that combines ONT's PromethION and PacBio's HiFi-CCS.
