Centre of Systems Biology, Biomedical Research Foundation, Academy of Athens, 11527 Athens, Greece.
School of Pharmacy, Aristotle University of Thessaloniki (AUTh), 54124 Thessaloniki, Greece.
Mol Med Rep. 2021 Apr;23(4). doi: 10.3892/mmr.2021.11890. Epub 2021 Feb 4.
Genome assemblers are computational tools that reconstruct genomes from large volumes of primary sequencing data. The quality of a genome assembly is estimated by its contiguity and by the occurrence of misassemblies (duplications, deletions, translocations or inversions). The rapid development of sequencing technologies has enabled the rise of novel genome assembly strategies. The ultimate goal of such strategies is to exploit the features of each sequencing platform in order to offset the weaknesses of each sequencing type and compose a complete and correct genome map. In the present study, the hybrid strategy, which is based on Illumina short paired‑end reads and Nanopore long reads, was benchmarked using the MaSuRCA and Wengan assemblers. Moreover, the long‑read assembly strategy was benchmarked using Canu for Nanopore reads, and Hifiasm and HiCanu for PacBio HiFi reads. The assemblies were performed on a computational cluster with limited computational resources, and their outputs were evaluated in terms of accuracy and computational performance. The PacBio HiFi assembly strategy outperformed the others, while Hi‑C scaffolding, which is based on chromatin 3D structure, is required to increase contiguity, accuracy and completeness when large and complex genomes, such as the human genome, are assembled. Hi‑C data are also necessary when using the hybrid assembly strategy. The results revealed that HiFi sequencing has enabled novel algorithms that require less genome coverage than the other strategies, making assembly a less computationally demanding task. Taken together, these developments may lead to the democratisation of genome assembly projects, which are now within reach of smaller labs with limited technical and financial resources.
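As an illustration of how assembly contiguity is commonly quantified, the following minimal sketch computes N50, the contig length at which contigs of that length or longer cover at least half of the total assembly. The abstract does not state which contiguity metric was used, so N50 is an assumption here, and the function name `n50` and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly, given its contig lengths in bases.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    Illustrative only: the abstract does not name this metric.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")

    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that pushes the sum to half the total.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, not data from the study.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
example_n50 = n50(example_lengths)
```

A higher N50 for the same genome generally indicates a more contiguous assembly, although it says nothing about misassemblies, which have to be assessed separately.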