Ning Kang, Leong Hon Wai
Department of Computer Science, National University of Singapore, Science Drive, Singapore 117543, Singapore.
BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S12. doi: 10.1186/1471-2105-7-S4-S12.
The problem of finding a Shortest Common Supersequence (SCS) of a set of sequences is an important problem with applications in many areas. It is a key problem in biological sequences analysis. The SCS problem is well-known to be NP-complete. Many heuristic algorithms have been proposed. Some heuristics work well on a few long sequences (as in sequence comparison applications); others work well on many short sequences (as in oligo-array synthesis). Unfortunately, most do not work well on large SCS instances where there are many, long sequences.
In this paper, we present a Deposition and Reduction (DR) algorithm for solving large SCS instances of biological sequences. There are two processes in our DR algorithm: deposition process, and reduction process. The deposition process is responsible for generating a small set of common supersequences; and the reduction process shortens these common supersequences by removing some characters while preserving the common supersequence property. Our evaluation on simulated data and real DNA and protein sequences show that our algorithm consistently produces the best results compared to many well-known heuristic algorithms, and especially on large instances.
Our DR algorithm provides a partial answer to the open problem of designing efficient heuristic algorithm for SCS problem on many long sequences. Our algorithm has a bounded approximation ratio. The algorithm is efficient, both in running time and space complexity and our evaluation shows that it is practical even for SCS problems on many long sequences.
寻找一组序列的最短公共超序列(SCS)问题是一个在许多领域都有应用的重要问题。它是生物序列分析中的关键问题。众所周知,SCS问题是NP完全问题。已经提出了许多启发式算法。一些启发式算法在少数长序列上效果良好(如在序列比较应用中);其他算法在许多短序列上效果良好(如在寡核苷酸阵列合成中)。不幸的是,大多数算法在存在许多长序列的大型SCS实例上效果不佳。
在本文中,我们提出了一种用于解决生物序列大型SCS实例的沉积与约简(DR)算法。我们的DR算法有两个过程:沉积过程和约简过程。沉积过程负责生成一小集合的公共超序列;约简过程通过去除一些字符同时保留公共超序列属性来缩短这些公共超序列。我们对模拟数据以及真实DNA和蛋白质序列的评估表明,与许多著名的启发式算法相比,我们的算法始终能产生最佳结果,特别是在大型实例上。
我们的DR算法为设计针对许多长序列的SCS问题的高效启发式算法这一开放问题提供了部分答案。我们的算法具有有界近似比。该算法在运行时间和空间复杂度方面都很高效,并且我们的评估表明,即使对于许多长序列的SCS问题,它也是实用的。