Research School of Biology, the Australian National University. 134 Linnaeus Way, Acton, Canberra, ACT, 2601, Australia.
Department of Genetics and Animal Breeding, Faculty of Veterinary Medicine, Chittagong Veterinary and Animal Sciences University. Khulshi, Chattogram, 4225, Bangladesh.
GigaScience. 2020 Jan 1;9(1). doi: 10.1093/gigascience/giz160.
Eucalyptus pauciflora (the snow gum) is a long-lived tree with high economic and ecological importance. Currently, little genomic information is available for E. pauciflora. Here, we sequentially assemble the genome of E. pauciflora with different methods and combine multiple existing and novel approaches to help select the best genome assembly.
We generated high-coverage long-read (Nanopore, 174×) and short-read (Illumina, 228×) data from a single E. pauciflora individual and compared assemblies from 5 assemblers (Canu, SMARTdenovo, Flye, Marvel, and MaSuRCA) using 2 minimum read-length thresholds (1 and 35 kb). A key component of our approach is to hold out a randomly selected ∼10% of both the long and short reads from the assemblies and use them as a validation set for assessing the assemblies. Using this validation set along with a range of existing tools, we compared the assemblies in 8 ways: contig N50, BUSCO scores, LAI (long terminal repeat assembly index) scores, assembly ploidy, base-level error rate, CGAL (computing genome assembly likelihoods) scores, structural variation, and genome sequence similarity. Our results showed that MaSuRCA generated the best assembly, which is 594.87 Mb in size, with a contig N50 of 3.23 Mb and an estimated error rate of ∼0.006 errors per base.
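As an illustration only, the read hold-out and the contig N50 metric described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the pipeline used in the study: the function names `split_holdout` and `contig_n50`, the `holdout_fraction` and `seed` parameters, and the representation of reads as a list of IDs are all hypothetical. For paired-end short reads, one would presumably pass one ID per read pair so that mates stay together.

```python
import random


def split_holdout(read_ids, holdout_fraction=0.10, seed=0):
    """Randomly partition read IDs into (assembly_set, validation_set).

    Hypothetical helper: holds out roughly `holdout_fraction` of the reads
    so they are never seen by the assembler and can be used for validation.
    """
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    shuffled = list(read_ids)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def contig_n50(contig_lengths):
    """Return the contig N50.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    assembly_reads, validation_reads = split_holdout(range(1000))
    print(len(assembly_reads), len(validation_reads))
    print(contig_n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Holding the validation reads out before assembly, rather than sampling them afterwards, is what keeps measures such as base-level error rate independent of the data each assembler was given.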
We report a draft genome of E. pauciflora, which will be a valuable resource for further genomic studies of eucalypts. The approaches we describe for assessing and comparing assemblies should help others choose among the many potential genome assemblies that can be generated from a single dataset.