John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK.
The Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, UK.
Gigascience. 2018 May 1;7(5). doi: 10.1093/gigascience/giy053.
The accurate sequencing and assembly of very large, often polyploid, genomes remains a challenging task, limiting long-range sequence information and phased sequence variation for applications such as plant breeding. The 15-Gb hexaploid bread wheat (Triticum aestivum) genome has been particularly challenging to sequence, and several different approaches have recently generated long-range assemblies. Mapping and understanding the types of assembly errors are important for optimising future sequencing and assembly approaches and for comparative genomics.
Here we use a Fosill 38-kb jumping library to assess medium and longer-range order of different publicly available wheat genome assemblies. Modifications to the Fosill protocol generated longer Illumina sequences and enabled comprehensive genome coverage. Analyses of two independent Bacterial Artificial Chromosome (BAC)-based chromosome-scale assemblies, two independent Illumina whole genome shotgun assemblies, and a hybrid Single Molecule Real Time (SMRT-PacBio) and short read (Illumina) assembly were carried out. We revealed a surprising scale and variety of discrepancies using Fosill mate-pair mapping and validated several of each class. In addition, Fosill mate-pairs were used to scaffold a whole genome Illumina assembly, leading to a 3-fold increase in N50 values.
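The two quantities in this paragraph, mate-pair discrepancies and N50, can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. The `MatePair` record, the category names, the 25% span tolerance, and the rule that concordant mates lie on opposite strands are all assumptions made for illustration; only the 38-kb expected span comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatePair:
    """Hypothetical record of where the two mates of one pair mapped."""
    seq1: str      # name of the assembly sequence hit by mate 1
    pos1: int      # mapping position of mate 1
    strand1: str   # '+' or '-'
    seq2: str
    pos2: int
    strand2: str


def classify_pair(pair, expected_span=38_000, tolerance=0.25):
    """Label one mate pair against an assembly.

    Assumptions (not from the source): concordant mates map to the same
    sequence, on opposite strands, within +/- tolerance of expected_span.
    """
    if pair.seq1 != pair.seq2:
        return "different_sequence"
    if pair.strand1 == pair.strand2:
        return "wrong_orientation"
    span = abs(pair.pos2 - pair.pos1)
    if span < expected_span * (1 - tolerance):
        return "too_close"
    if span > expected_span * (1 + tolerance):
        return "too_far"
    return "concordant"


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one sequence length")


if __name__ == "__main__":
    pairs = [
        MatePair("scaffold_1", 1_000, "+", "scaffold_1", 39_000, "-"),
        MatePair("scaffold_1", 1_000, "+", "scaffold_1", 5_000, "-"),
        MatePair("scaffold_1", 1_000, "+", "scaffold_2", 39_000, "-"),
    ]
    for p in pairs:
        print(classify_pair(p))
    print(n50([1, 2, 3, 4, 10]))
```

Regions of an assembly where many non-concordant pairs cluster are candidates for misassembly, while pairs whose mates land on different sequences are the kind of evidence a scaffolder can use to join them, which is how scaffolding can raise N50.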
Our analyses, using an independent means to validate different wheat genome assemblies, show that whole genome shotgun assemblies based solely on Illumina sequences are significantly more accurate by all measures compared to BAC-based chromosome-scale assemblies and hybrid SMRT-Illumina approaches. Although current whole genome assemblies are reasonably accurate and useful, additional improvements will be needed to generate complete assemblies of wheat genomes using open-source, computationally efficient, and cost-effective methods.