The fragment assembly string graph.

Author information

Myers Eugene W

Affiliation

Department of Computer Science, University of California Berkeley, CA, USA.

Publication information

Bioinformatics. 2005 Sep 1;21 Suppl 2:ii79-85. doi: 10.1093/bioinformatics/bti1114.

DOI:10.1093/bioinformatics/bti1114

PMID:16204131

Abstract

We present a concept and formalism, the string graph, which represents all that is inferable about a DNA sequence from a collection of shotgun sequencing reads collected from it. We give time and space efficient algorithms for constructing a string graph given the collection of overlaps between the reads and, in particular, present a novel linear expected time algorithm for transitive reduction in this context. The result demonstrates that the decomposition of reads into kmers employed in the de Bruijn graph approach described earlier is not essential, and exposes its close connection to the unitig approach we developed at Celera. This paper is a preliminary piece giving the basic algorithm and results that demonstrate the efficiency and scalability of the method. These ideas are being used to build a next-generation whole genome assembler called BOA (Berkeley Open Assembler) that will easily scale to mammalian genomes.
