Zeng Peng, Tian Zunzhe, Han Yuwei, Zhang Weixiong, Zhou Tinggan, Peng Yingmei, Hu Hao, Cai Jing
State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macau, China.
School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, China.
Chin Med. 2022 Aug 9;17(1):94. doi: 10.1186/s13020-022-00644-1.
Many medicinal plants have complex genomes with high ploidy, heterozygosity, and repetitive content, which pose severe challenges for genome sequencing. Long reads from Oxford Nanopore Technologies (ONT) sequencing or Pacific Biosciences Single Molecule, Real-Time (SMRT) sequencing offer great advantages in de novo genome assembly, especially for complex genomes with high heterozygosity and repetitive content. The genomes of multiple allotetraploid species have now been sequenced with long reads. However, we found that a considerable proportion of these genomes (7.9% on average, maximum 23.7%) could not be covered by next-generation sequencing (NGS) reads. These regions uncovered by NGS reads (UCRs) are either questionable, low-quality parts of the assembly or genomic regions that NGS cannot sequence because of sequencing bias. The underlying causes of UCRs in genome assemblies, and solutions to the problem, have not been studied.
In this study, we sequenced the tetraploid genome of Veratrum dahuricum (Turcz.) O. Loes (VDL), a Chinese medicinal plant, on the ONT platform and assembled it with three strategies in parallel. To explore the cause of the UCRs, we compared the quality, coverage, and heterozygosity of the three ONT assemblies with a previously released assembly of the same individual built from PacBio circular consensus sequencing (CCS) reads.
By mapping the NGS reads against the three ONT assemblies and the CCS assembly, we found that NGS reads covered 49.15% to 76.31% of the ONT assemblies, much less than the 99.53% they covered in the CCS assembly. Alignment between the ONT assemblies and the CCS assembly showed that most UCRs can be aligned to the CCS assembly. We therefore conclude that the UCRs in the ONT assemblies are low-quality sequences whose error rate is too high for short reads to align, rather than genomic regions that NGS cannot sequence. Further comparison among intermediate versions of the ONT assemblies showed that the most probable origin of these errors is a combination of artificial errors introduced by "self-correction" and the initial sequencing errors in the long reads. We also found that polishing the ONT assembly with CCS reads corrects these errors efficiently.
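The NGS coverage figures above depend on identifying the assembly positions that no short read covers. The sketch below illustrates one way to derive UCR intervals and the uncovered fraction from per-base depth. It is not the authors' pipeline. It assumes tab-separated input lines of the form contig, 1-based position, depth, with every position listed and sorted (the layout produced by `samtools depth -a`). The function name `find_ucrs` and the `min_depth` threshold are hypothetical names introduced here for illustration.

```python
from typing import Iterable, List, Tuple


def find_ucrs(
    depth_lines: Iterable[str], min_depth: int = 1
) -> Tuple[List[Tuple[str, int, int]], float]:
    """Return (UCR intervals, uncovered fraction) from per-base depth lines.

    Assumption: each line is "contig<TAB>pos<TAB>depth", positions are 1-based,
    sorted, and include zero-depth sites. Intervals are inclusive.
    """
    regions: List[Tuple[str, int, int]] = []
    total = 0
    uncovered = 0
    run = None  # current uncovered run as [contig, start, end]

    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        contig, pos, depth = fields[0], int(fields[1]), int(fields[2])
        total += 1

        if depth < min_depth:
            uncovered += 1
            # Extend the run only if this position is adjacent on the same contig.
            if run and run[0] == contig and run[2] + 1 == pos:
                run[2] = pos
            else:
                if run:
                    regions.append((run[0], run[1], run[2]))
                run = [contig, pos, pos]
        elif run:
            # A covered position closes the current uncovered run.
            regions.append((run[0], run[1], run[2]))
            run = None

    if run:
        regions.append((run[0], run[1], run[2]))

    fraction = uncovered / total if total else 0.0
    return regions, fraction


if __name__ == "__main__":
    # Toy input, not data from the study.
    demo = ["c1\t1\t5", "c1\t2\t0", "c1\t3\t0", "c1\t4\t2", "c2\t1\t0"]
    print(find_ucrs(demo))
```

Raising `min_depth` above 1 would also flag thinly covered positions as UCRs. The source does not state which threshold was used, so the default here is only an assumption.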
By analyzing genome features and read alignments, we found that the high proportion of UCRs in the ONT assembly of VDL is caused by sequencing errors together with additional errors introduced by self-correction. The high error rate of raw ONT reads makes them unsuitable for self-correction before allotetraploid genome assembly, because self-correction introduces artificial errors into > 5% of the UCR sequences. For polyploid genomes, we suggest polishing the assembly with high-precision CCS reads to correct these errors effectively.