Computer Science & Engineering Department, University of Connecticut, Storrs, Connecticut, USA.
National Center for Artificial Intelligence and Big Data Technology, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia.
J Comput Biol. 2020 Sep;27(9):1407-1421. doi: 10.1089/cmb.2019.0505. Epub 2020 Feb 12.
Next-generation sequencing technologies make it possible to quickly and inexpensively generate large numbers of relatively short reads from both the nuclear and mitochondrial DNA (mtDNA) contained in a biological sample. Unfortunately, assembling such whole-genome sequencing (WGS) data with standard de novo assemblers often fails to generate high-quality mitochondrial genome sequences because of the large difference in copy number (and hence sequencing depth) between the mitochondrial and nuclear genomes. Assembly of complete mitochondrial genome sequences is further complicated by the fact that many de novo assemblers are not designed for circular genomes and by the presence of repeats in the mitochondrial genomes of some species. In this article, we describe the Statistical Mitogenome Assembly with RepeaTs (SMART) pipeline for automated assembly of mitochondrial genomes from WGS data. SMART first uses an efficient coverage-based filter to select a subset of reads enriched in mtDNA sequences. Contigs produced by an initial assembly step are filtered using Basic Local Alignment Search Tool (BLAST) searches against a comprehensive mitochondrial genome database and are then used as "baits" for an alignment-based filter that produces the set of reads used in a second de novo assembly and scaffolding step. In the presence of repeats, the possible paths through the assembly graph are evaluated using a maximum likelihood model. Additionally, the assembly process is repeated a user-specified number of times on resampled subsets of reads, and the reconstructed sequences with the highest bootstrap support are selected for annotation. Experiments on WGS data sets from a variety of species show that the SMART pipeline produces complete circular mitochondrial genome sequences with a higher success rate than current state-of-the-art tools, particularly for low-coverage WGS data sets.
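The abstract does not say how the coverage-based filter is implemented. Purely as an illustration of the idea, the minimal sketch below assumes that k-mer counts across all reads can stand in for sequencing depth, so that reads from the high-copy mitochondrial genome show a higher median k-mer count than nuclear reads. The function names, the value of `k`, and the `min_median_count` threshold are hypothetical and are not taken from SMART.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq (no reverse-complement handling, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def filter_high_coverage_reads(reads, k=5, min_median_count=3):
    """Keep reads whose median k-mer count is at least min_median_count.

    Assumption: k-mer abundance over the whole read set approximates depth,
    so high-copy (e.g. mitochondrial) reads pass and low-copy reads do not.
    """
    # Count every k-mer over the full read set.
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = []
    for read in reads:
        read_counts = [counts[km] for km in kmers(read, k)]
        # Reads shorter than k have no k-mers and are dropped.
        if read_counts and median(read_counts) >= min_median_count:
            kept.append(read)
    return kept
```

In practice the threshold would have to be chosen relative to the observed depth distribution of each data set rather than fixed in advance.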