School of Computer Science and Engineering, Pusan National University, Busan, South Korea.
Department of Biomedical Science and Engineering, Konkuk University, Seoul, South Korea.
PLoS One. 2019 Aug 27;14(8):e0221858. doi: 10.1371/journal.pone.0221858. eCollection 2019.
Genomic data have become a major resource for understanding complex mechanisms at fine-scale temporal and spatial resolution in functional and evolutionary genetic studies, including studies of human diseases such as cancer. Recently, a large number of whole genomes from evolving populations of yeast (Saccharomyces cerevisiae strain W303) were sequenced in a time-dependent manner to identify temporal evolutionary patterns. This type of study requires a chromosome-level sequence assembly of the strain or population at time zero, against which the genomes derived later can be compared. However, no fully automated computational approach exists for experimental evolution studies that establishes a chromosome-level genome assembly by exploiting the unique features of the sequencing data.
In this study, we developed a new software pipeline, the integrative meta-assembly pipeline (IMAP), which builds chromosome-level genome sequence assemblies by generating multiple initial assemblies from short-read sequencing data with three de novo assemblers and then combining them. We significantly improved the continuity and accuracy of the genome assembly using a large collection of sequencing data and hybrid assembly approaches. We validated the pipeline by generating chromosome-level assemblies of yeast strains W303 and SK1 and comparing them, using various assembly evaluation metrics, with assemblies built from long-read sequencing. We also constructed chromosome-level sequence assemblies of S. cerevisiae strain Sigma1278b and of three commonly used fungal strains for which long-read sequencing data are not yet available: Aspergillus nidulans A713, Neurospora crassa 73, and Thielavia terrestris CBS 492.74. Finally, we examined the effect of IMAP parameters, such as reference and resolution, on the quality of the final assemblies of yeast strains W303 and SK1.
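As an illustration of what "continuity" can mean in practice, the sketch below computes N50, a widely used contiguity statistic. The abstract does not name the specific evaluation metrics used, so the choice of N50 here is an assumption, and the contig lengths are toy values rather than data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths, not results from the paper):
# the same 32 units of sequence split into many or few pieces.
fragmented = [2, 2, 2, 3, 3, 4, 8, 8]
merged = [16, 16]
print(n50(fragmented), n50(merged))
```

A higher N50 for the same total length indicates that the sequence is held in fewer, longer pieces, which is the direction a chromosome-level assembly moves in.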
We developed a cost-effective pipeline that generates chromosome-level sequence assemblies using only short-read sequencing data. It combines the strengths of reference-guided and meta-assembly approaches. The pipeline is available online at http://github.com/jkimlab/IMAP, together with a Docker image and a Perl script that help users install the IMAP package and several prerequisite programs. Users can use IMAP to easily build a chromosome-level assembly for the genome of their interest.