Ben-Dor Amir, Karp Richard M, Schwikowski Benno, Shamir Ron
Molecular Diagnostic Department, Analytical Medical Laboratory, Agilent Labs, Agilent Technologies, Palo Alto, CA 94304, USA.
J Comput Biol. 2003;10(3-4):385-98. doi: 10.1089/10665270360688084.
Most shotgun sequencing projects undergo a long and costly finishing phase, in which a partial assembly forms several contigs whose order, orientation, and relative distances are unknown. We propose a new technique that supplements the shotgun assembly data with experimentally simple and commonly used complete restriction digests of the target. By computationally combining information from the contig sequences and the fragment sizes measured for several different enzymes, we seek to form a "scaffold" on which the contigs are placed in their correct orientation, order, and distance. We give a heuristic search algorithm for solving the problem and report promising preliminary simulation results. The key to the success of the search scheme is the very rapid solution of two time-critical subproblems, each of which is solved to optimality in linear time. Our simulations indicate that, with noise levels of about 3% relative error in measuring fragment sizes and using six enzymes, most datasets of 13 contigs spanning 300 kb can be correctly ordered, and the remaining ones have most of their pairs of neighboring contigs correct. Hence, the technique has the potential to provide real help in finishing. Even without closing all gaps, the ability to order and orient the contigs correctly makes the partial assembly both more accessible and more useful for biologists.
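To make the input data concrete, the following is a minimal sketch of how the information a single contig contributes for one enzyme might be computed. It is not the heuristic search described in the abstract. The function names `digest` and `within_tolerance`, the `cut_offset` parameter, and the toy sequence are assumptions introduced for illustration; the 3% tolerance echoes the noise level quoted above.

```python
def digest(seq, site, cut_offset=0):
    """In silico complete digest of one contig (hypothetical helper).

    Returns (left_end, internal, right_end):
      internal  - lengths of fragments lying wholly inside the contig
      left_end  - bases before the first cut (part of a longer, unknown fragment)
      right_end - bases after the last cut (part of a longer, unknown fragment)
    If the site never occurs, returns (None, [], None).
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    if not cuts:
        return None, [], None
    internal = [b - a for a, b in zip(cuts, cuts[1:])]
    return cuts[0], internal, len(seq) - cuts[-1]


def within_tolerance(predicted, measured, rel_err=0.03):
    """True if a predicted size agrees with a measured size within rel_err."""
    return abs(predicted - measured) <= rel_err * measured


# Toy contig with two EcoRI sites (GAATTC, assumed to cut after the G).
contig = "AAGAATTCGGGGGAATTCTT"
left, internal, right = digest(contig, "GAATTC", cut_offset=1)
print(left, internal, right)
```

Internal fragments can be compared directly with measured sizes, while the two end pieces only give lower bounds on the fragments that span the gaps between neighboring contigs. Those spanning fragments are what tie contigs together across several enzymes.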