Douglass Alexander P, O'Brien Caoimhe E, Offei Benjamin, Coughlan Aisling Y, Ortiz-Merino Raúl A, Butler Geraldine, Byrne Kevin P, Wolfe Kenneth H
School of Medicine, UCD Conway Institute, University College Dublin, Dublin 4, Ireland.
School of Biomolecular and Biomedical Sciences, UCD Conway Institute, University College Dublin, Dublin 4, Ireland.
G3 (Bethesda). 2019 Mar 7;9(3):879-887. doi: 10.1534/g3.118.200745.
Illumina sequencing has revolutionized yeast genomics, with prices for commercial draft genome sequencing now below $200. The popular SPAdes assembler makes it simple to generate a genome assembly for any yeast species. However, whereas making genome assemblies has become routine, understanding what they contain is still challenging. Here, we show how graphing the information that SPAdes provides about the length and coverage of each scaffold can be used to investigate the nature of an assembly, and to diagnose possible problems. Scaffolds derived from mitochondrial DNA, ribosomal DNA, and yeast plasmids can be identified by their high coverage. Contaminating data, such as cross-contamination from other samples in a multiplex sequencing run, can be identified by its low coverage. Scaffolds derived from the bacteriophage PhiX174 and Lambda DNAs that are frequently used as molecular standards in Illumina protocols can also be detected. Assemblies of yeast genomes with high heterozygosity, such as interspecies hybrids, often contain two types of scaffold: regions of the genome where the two alleles assembled into two separate scaffolds and each has a coverage level C, and regions where the two alleles co-assembled (collapsed) into a single scaffold that has a coverage level 2C. Visualizing the data with Coverage-vs.-Length (CVL) plots, which can be done using Microsoft Excel or Google Sheets, provides a simple method to understand the structure of a genome assembly and detect aberrant scaffolds or contigs. We provide a Python script that allows assemblies to be filtered to remove contaminants identified in CVL plots.
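As an illustration of the idea, the length and coverage values that feed a CVL plot can be read directly from SPAdes scaffold names, which follow the pattern NODE_<n>_length_<bp>_cov_<x>. The following is a minimal sketch and not the script distributed with the paper: the function names, the toy FASTA records, and the coverage cutoffs (10 and 200) are all assumptions chosen for the example. In practice the cutoffs would be picked by eye from the plot.

```python
import re

# SPAdes names scaffolds like "NODE_1_length_250000_cov_52.3".
HEADER_RE = re.compile(r"NODE_(\d+)_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_header(header):
    """Return (length, coverage) taken from a SPAdes-style scaffold name."""
    match = HEADER_RE.search(header)
    if match is None:
        raise ValueError(f"not a SPAdes-style header: {header!r}")
    return int(match.group(2)), float(match.group(3))


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def cvl_points(records):
    """Return (header, length, coverage) per scaffold: the data for a CVL plot."""
    return [(header, *parse_header(header)) for header, _ in records]


def filter_by_coverage(records, min_cov, max_cov):
    """Keep scaffolds whose coverage lies within [min_cov, max_cov]."""
    return [
        (header, seq)
        for header, seq in records
        if min_cov <= parse_header(header)[1] <= max_cov
    ]


# Toy input (made-up values): one nuclear-like scaffold, one high-coverage
# scaffold (e.g. mitochondrial or rDNA), one low-coverage contaminant.
demo_fasta = """>NODE_1_length_12_cov_50.0
ACGTACGTACGT
>NODE_2_length_8_cov_900.5
ACGTACGT
>NODE_3_length_4_cov_1.2
ACGT
""".splitlines()

records = list(read_fasta(demo_fasta))
points = cvl_points(records)
kept = filter_by_coverage(records, min_cov=10, max_cov=200)  # assumed cutoffs

# Tab-separated rows can be pasted into a spreadsheet and drawn as a scatter plot.
for header, length, coverage in points:
    print(f"{header}\t{length}\t{coverage}")
```

Plotting coverage against length on logarithmic axes tends to separate the groups described above, because scaffold lengths and coverage values each span several orders of magnitude. A hard coverage window like the one in filter_by_coverage would also discard mitochondrial, rDNA, and plasmid scaffolds, so whether to apply an upper bound depends on what the assembly is needed for.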