Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh.
Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh.
Bioinformatics. 2022 Aug 2;38(15):3717-3724. doi: 10.1093/bioinformatics/btac404.
Advances in sequencing technologies have led to the sequencing of the genomes of a multitude of organisms. However, the draft genomes of many of these organisms contain a large number of gaps due to repeats in the genomes, low sequencing coverage and limitations of sequencing technologies. Although several tools exist for filling gaps, many of them do not utilize all the information relevant to gap filling.
Here, we present a probabilistic method for filling gaps in draft genome assemblies using second-generation reads. The method is based on a generative model for sequencing that takes into account information on insert sizes and sequencing errors. Unlike the graph-based methods adopted in the literature, our method is based on the expectation-maximization algorithm. Experiments on real biological datasets show that this novel approach can fill large portions of gaps with a small number of errors and misassemblies compared with other state-of-the-art gap-filling tools.
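To make the idea of expectation-maximization over read placements concrete, the following is a minimal toy sketch in C++. It is not Figbird's actual implementation. The types `Placement` and `Read`, the function `fillGap`, the normal insert-size model with parameters `mu` and `sigma`, the uniform per-base error rate `eps`, the pseudocount and the 0.5 calling threshold are all assumptions made for illustration. Each read has a set of candidate start positions inside the gap; the E-step weights each candidate by its insert-size likelihood and its agreement with the current per-position base probabilities, and the M-step re-estimates those probabilities from the weighted reads.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy types for illustration only; not taken from Figbird.
struct Placement { int start; double insertSize; };  // gap-relative start, implied insert size
struct Read { std::string seq; std::vector<Placement> candidates; };
using Profile = std::vector<std::array<double, 4>>;  // P(A), P(C), P(G), P(T) per gap position

int baseIndex(char c) {
    switch (c) { case 'A': return 0; case 'C': return 1; case 'G': return 2; case 'T': return 3; }
    return -1;
}

// Likelihood of one placement, up to a constant shared by all candidates of the read:
// assumed normal insert-size term times a per-base term with uniform error rate eps.
double placementLikelihood(const Read& r, const Placement& p, const Profile& prof,
                           double mu, double sigma, double eps) {
    const double z = (p.insertSize - mu) / sigma;
    double like = std::exp(-0.5 * z * z);
    for (std::size_t i = 0; i < r.seq.size(); ++i) {
        const int b = baseIndex(r.seq[i]);
        assert(b >= 0);
        const double match = prof[static_cast<std::size_t>(p.start) + i][b];
        like *= match * (1.0 - eps) + (1.0 - match) * (eps / 3.0);
    }
    return like;
}

// Runs a fixed number of EM iterations and returns a consensus string for the gap.
// Positions without confident support are reported as 'N'.
std::string fillGap(int gapLen, const std::vector<Read>& reads,
                    double mu, double sigma, double eps, int iterations) {
    const std::array<double, 4> uniform{0.25, 0.25, 0.25, 0.25};
    const std::array<double, 4> pseudo{0.01, 0.01, 0.01, 0.01};
    Profile prof(static_cast<std::size_t>(gapLen), uniform);
    for (int it = 0; it < iterations; ++it) {
        Profile counts(static_cast<std::size_t>(gapLen), pseudo);
        for (const Read& r : reads) {
            // E-step: posterior weight of each candidate placement of this read.
            std::vector<double> like;
            double total = 0.0;
            for (const Placement& p : r.candidates) {
                assert(p.start >= 0 && p.start + static_cast<int>(r.seq.size()) <= gapLen);
                like.push_back(placementLikelihood(r, p, prof, mu, sigma, eps));
                total += like.back();
            }
            if (total <= 0.0) continue;
            // Accumulate expected base counts, weighted by the posteriors.
            for (std::size_t k = 0; k < r.candidates.size(); ++k) {
                const double w = like[k] / total;
                for (std::size_t i = 0; i < r.seq.size(); ++i)
                    counts[static_cast<std::size_t>(r.candidates[k].start) + i][baseIndex(r.seq[i])] += w;
            }
        }
        // M-step: normalize expected counts into per-position base probabilities.
        for (std::size_t pos = 0; pos < prof.size(); ++pos) {
            const double sum = counts[pos][0] + counts[pos][1] + counts[pos][2] + counts[pos][3];
            for (int b = 0; b < 4; ++b) prof[pos][b] = counts[pos][b] / sum;
        }
    }
    const char bases[4] = {'A', 'C', 'G', 'T'};
    std::string consensus;
    for (const auto& col : prof) {
        int best = 0;
        for (int b = 1; b < 4; ++b) if (col[b] > col[best]) best = b;
        consensus += (col[best] < 0.5) ? 'N' : bases[best];
    }
    return consensus;
}
```

In this sketch, a read whose candidate placements imply very different insert sizes is assigned almost entirely to the placement closest to `mu`, which is one way insert-size information can disambiguate reads that a purely sequence-based placement could not. A real tool would also need to handle reads overlapping the gap flanks, unknown gap lengths and indel errors, none of which are modeled here.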
The method is implemented in C++ as a software tool named 'Filling Gaps by Iterative Read Distribution (Figbird)', which is available at https://github.com/SumitTarafder/Figbird.
Supplementary data are available at Bioinformatics online.