Institute of Evolutionary Biology, Ashworth Laboratories, University of Edinburgh, Edinburgh, UK.
Institute of Evolutionary Biology, Ashworth Laboratories, University of Edinburgh, Edinburgh, UK; Edinburgh Genomics, University of Edinburgh, Edinburgh, UK.
Front Genet. 2013 Nov 29;4:237. doi: 10.3389/fgene.2013.00237. eCollection 2013.
Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment, whether that is ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, and assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programs. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species' genomes, thus improving genome assembly. Using taxon-annotated GC-coverage plots (TAGC plots), we combine numerical indicators (proportion of GC bases and read coverage) with a biological indicator (best-matching sequence in annotated databases) to partition draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimized assembly. We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from https://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer.
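The partitioning idea in the abstract can be illustrated with a minimal Python sketch. This is not the blobology pipeline or Blobsplorer. The `Contig` record, the function names, the GC window, the coverage threshold and the three bin labels are all hypothetical choices made for illustration, and per-contig coverage and best-hit taxon are assumed to have been computed beforehand (for example by read mapping and a database search).

```python
from dataclasses import dataclass


@dataclass
class Contig:
    """Hypothetical record for one draft-assembly contig."""
    name: str
    sequence: str
    coverage: float  # mean read depth, assumed precomputed from read mapping
    taxon: str       # best database hit, assumed precomputed; "no-hit" if none


def gc_proportion(sequence: str) -> float:
    """Proportion of G and C among unambiguous (A, C, G, T) bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def assign_bin(contig, gc_range, min_coverage, target_taxa):
    """Assign a contig to 'target', 'review' or 'other'.

    The rule is an illustrative assumption: a contig is 'target' if it lies
    inside the chosen GC/coverage window and its best hit is a target taxon
    or it has no hit at all; a target-taxon contig outside the window is
    flagged for 'review'; everything else is 'other'.
    """
    gc = gc_proportion(contig.sequence)
    in_window = (gc_range[0] <= gc <= gc_range[1]
                 and contig.coverage >= min_coverage)
    if contig.taxon in target_taxa or contig.taxon == "no-hit":
        if in_window:
            return "target"
        return "review" if contig.taxon in target_taxa else "other"
    return "other"


if __name__ == "__main__":
    # Invented example contigs; the values are not from any real dataset.
    contigs = [
        Contig("c1", "ATATATGC" * 10, 50.0, "Nematoda"),
        Contig("c2", "GCGCGCGCAT" * 10, 400.0, "Proteobacteria"),
        Contig("c3", "ATATGCNN", 45.0, "no-hit"),
    ]
    for c in contigs:
        label = assign_bin(c, gc_range=(0.2, 0.45), min_coverage=20.0,
                           target_taxa={"Nematoda"})
        print(c.name, round(gc_proportion(c.sequence), 3), c.coverage, label)
```

In practice the window would be chosen by inspecting the GC-coverage plot rather than fixed in advance, and the reads mapping to each bin would then be extracted for reassembly.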