Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA.
Bioinformatics. 2011 Jan 15;27(2):153-60. doi: 10.1093/bioinformatics/btq646. Epub 2010 Nov 18.
Mired by its connection to a well-known -complete combinatorial optimization problem-namely, the Shortest Common Superstring Problem (SCSP)-historically, the whole-genome sequence assembly (WGSA) problem has been assumed to be amenable only to greedy and heuristic methods. By placing efficiency as their first priority, these methods opted to rely only on local searches, and are thus inherently approximate, ambiguous or error prone, especially, for genomes with complex structures. Furthermore, since choice of the best heuristics depended critically on the properties of (e.g. errors in) the input data and the available long range information, these approaches hindered designing an error free WGSA pipeline.
We dispense with the idea of limiting the solutions to just the approximated ones, and instead favor an approach that could potentially lead to an exhaustive (exponential-time) search of all possible layouts. Its computational complexity thus must be tamed through a constrained search (Branch-and-Bound) and quick identification and pruning of implausible overlays. For his purpose, such a method necessarily relies on a set of score functions (oracles) that can combine different structural properties (e.g. transitivity, coverage, physical maps, etc.). We give a detailed description of this novel assembly framework, referred to as Scoring-and-Unfolding Trimmed Tree Assembler (SUTTA), and present experimental results on several bacterial genomes using next-generation sequencing technology data. We also report experimental evidence that the assembly quality strongly depends on the choice of the minimum overlap parameter k.
SUTTA's binaries are freely available to non-profit institutions for research and educational purposes at http://www.bioinformatics.nyu.edu.
由于与一个众所周知的完全组合优化问题(即最短公共超字符串问题(SCSP))紧密相关,历史上,全基因组序列组装(WGSA)问题一直被认为只能采用贪婪和启发式方法。这些方法将效率作为首要任务,只选择进行局部搜索,因此本质上是近似的、模糊的或容易出错的,尤其是对于具有复杂结构的基因组。此外,由于最佳启发式的选择严重依赖于输入数据(例如错误)和可用长程信息的特性,因此这些方法阻碍了无错误 WGSA 管道的设计。
我们摒弃了将解决方案限制在近似解的想法,而是倾向于采用一种可能导致所有可能布局的穷举(指数时间)搜索的方法。因此,其计算复杂度必须通过约束搜索(分支定界)和快速识别和修剪不合理的覆盖来进行控制。为此,这种方法必须依赖于一组评分函数(oracles),这些函数可以组合不同的结构特性(例如传递性、覆盖范围、物理图谱等)。我们详细描述了这种新的组装框架,称为评分和展开修剪树组装器(SUTTA),并使用下一代测序技术数据在几个细菌基因组上进行了实验结果展示。我们还报告了实验证据,表明组装质量强烈依赖于最小重叠参数 k 的选择。
SUTTA 的二进制文件可供非营利机构免费用于研究和教育目的,可在 http://www.bioinformatics.nyu.edu 上获得。