Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 277-8562, Japan.
Bioinformatics. 2023 Jul 1;39(7). doi: 10.1093/bioinformatics/btad398.
Diploid assembly, or determining the sequences of homologous chromosomes separately, is essential to elucidate genetic differences between haplotypes. One approach is to call and phase single nucleotide variants (SNVs) on a reference sequence. However, this approach becomes unstable on large segmental duplications (SDs) or structural variations (SVs) because the alignments of reads deriving from these regions tend to be unreliable. Another approach is to use highly accurate PacBio HiFi reads to output a diploid assembly directly. Nonetheless, HiFi reads cannot phase homozygous regions longer than the reads themselves, and they require Oxford Nanopore Technologies (ONT) reads or Hi-C data to produce a fully phased assembly. Is a single long-read sequencing technology sufficient to create an accurate diploid assembly?
Here, we present JTK, a megabase-scale diploid genome assembler. It first randomly samples kilobase-scale sequences (called 'chunks') from the long reads, phases variants found on them, and produces two haplotypes. The novel idea of JTK is to utilize chunks to capture SNVs and SVs simultaneously. From 60-fold ONT reads of HG002 and a Japanese sample, it fully assembled two haplotypes with approximately 99.9% accuracy on the major histocompatibility complex (MHC) and the leukocyte receptor complex (LRC) regions, which was not possible with the reference-based approach. In addition, in the LRC region of the Japanese sample, JTK output an assembly of better contiguity than those built from high-coverage HiFi+Hi-C. In the coming age of pan-genomics, JTK would complement the reference-based phasing method to assemble the difficult-to-assemble but medically important regions.
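The chunk-sampling step described above can be illustrated with a minimal Python sketch. This is not JTK's implementation: the function name `sample_chunks`, the parameters `chunk_len`, `n_chunks`, and `seed`, and the choice to sample start positions uniformly are all assumptions made for illustration, and the sketch covers only the sampling of chunks, not variant phasing or assembly.

```python
import random


def sample_chunks(reads, chunk_len=2000, n_chunks=100, seed=0):
    """Draw fixed-length substrings ("chunks") at random from long reads.

    Hypothetical helper for illustration only; the names and defaults
    here are assumptions, not taken from JTK.
    """
    rng = random.Random(seed)
    # Only reads at least as long as one chunk can supply a chunk.
    eligible = [r for r in reads if len(r) >= chunk_len]
    if not eligible:
        return []
    # Weight each read by its number of valid start positions, so that
    # every possible chunk position is equally likely to be drawn.
    weights = [len(r) - chunk_len + 1 for r in eligible]
    chunks = []
    for _ in range(n_chunks):
        read = rng.choices(eligible, weights=weights, k=1)[0]
        start = rng.randrange(len(read) - chunk_len + 1)
        chunks.append(read[start:start + chunk_len])
    return chunks


if __name__ == "__main__":
    toy_reads = ["ACGT" * 1000, "GATTACA" * 500, "AC"]
    for chunk in sample_chunks(toy_reads, chunk_len=50, n_chunks=3):
        print(chunk)
```

In a sketch like this, reads shorter than `chunk_len` are simply skipped, and a fixed seed keeps the sampling reproducible between runs.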
JTK is available at https://github.com/ban-m/jtk, and the datasets are available at https://doi.org/10.5281/zenodo.7790310 or JGAS000580 in DDBJ.