Medvedev Paul, Brudno Michael
Department of Computer Science, University of Toronto , Toronto, Canada.
J Comput Biol. 2009 Aug;16(8):1101-16. doi: 10.1089/cmb.2009.0047.
Whole genome shotgun assembly is the process of taking many short sequenced segments (reads) and reconstructing the genome from which they originated. We demonstrate how the technique of bidirected network flow can be used to explicitly model the double-stranded nature of DNA for genome assembly. By combining an algorithm for the Chinese Postman Problem on bidirected graphs with the construction of a bidirected de Bruijn graph, we are able to find the shortest double-stranded DNA sequence that contains a given set of k-long DNA molecules. This is the first exact polynomial time algorithm for the assembly of a double-stranded genome. Furthermore, we propose a maximum likelihood framework for assembling the genome that is the most likely source of the reads, in lieu of the standard maximum parsimony approach (which finds the shortest genome subject to some constraints). In this setting, we give a bidirected network flow-based algorithm that, by taking advantage of high coverage, accurately estimates the copy counts of repeats in a genome. Our second algorithm combines these predicted copy counts with matepair data in order to assemble the reads into contigs. We run our algorithms on simulated read data from Escherichia coli and predict copy counts with extremely high accuracy, while assembling long contigs.
全基因组鸟枪法组装是将许多短测序片段(读段)进行拼接,并重建其来源基因组的过程。我们展示了如何使用双向网络流技术来明确地为基因组组装建模DNA的双链性质。通过将双向图上的中国邮递员问题算法与双向德布鲁因图的构建相结合,我们能够找到包含给定一组k长度DNA分子的最短双链DNA序列。这是首个用于双链基因组组装的精确多项式时间算法。此外,我们提出了一个用于组装读段最可能来源基因组的最大似然框架,以替代标准的最大简约方法(该方法在某些约束条件下找到最短基因组)。在此框架下,我们给出了一种基于双向网络流的算法,该算法通过利用高覆盖率,准确估计基因组中重复序列的拷贝数。我们的第二个算法将这些预测的拷贝数与配对末端数据相结合,以便将读段组装成重叠群。我们在来自大肠杆菌的模拟读段数据上运行我们的算法,在组装长重叠群的同时,以极高的准确率预测拷贝数。