Lehrstuhl für Bioinformatik, Friedrich-Schiller-Universität Jena, Ernst-Abbe-Platz 2, Jena, Germany.
BMC Bioinformatics. 2011 Apr 19;12:106. doi: 10.1186/1471-2105-12-106.
The center string (or closest string) problem is a classic computer science problem with important applications in computational biology. Given k input strings and a distance threshold d, we search for a string within Hamming distance at most d to each input string. This problem is NP complete.
In this paper, we focus on exact methods for the problem that are also swift in application. We first introduce data reduction techniques that allow us to infer that certain instances have no solution, or that a center string must satisfy certain conditions. We describe how to use this information to speed up two previously published search tree algorithms. Then, we describe a novel iterative search strategy that is efficient in practice, where some of our reduction techniques can also be applied. Finally, we present results of an evaluation study for two different data sets from a biological application.
We find that the running time for computing the optimal center string is dominated by the subroutine calls for d = dopt -1 and d = dopt. Our data reduction is very effective for both, either rejecting unsolvable instances or solving trivial positions. We find that this speeds up computations considerably.
中心串(或最近串)问题是计算机科学中的一个经典问题,在计算生物学中有重要的应用。给定 k 个输入字符串和距离阈值 d,我们搜索与每个输入字符串的汉明距离不超过 d 的字符串。这个问题是 NP 完全的。
在本文中,我们专注于该问题的精确方法,这些方法在应用中也很快捷。我们首先介绍数据约简技术,这些技术允许我们推断某些实例没有解决方案,或者中心串必须满足某些条件。我们描述了如何利用这些信息来加速两个之前发表的搜索树算法。然后,我们描述了一种新颖的迭代搜索策略,在实践中非常有效,其中我们的一些约简技术也可以应用。最后,我们给出了来自生物应用的两个不同数据集的评估研究结果。
我们发现计算最优中心串的运行时间主要取决于 d = dopt -1 和 d = dopt 的子例程调用。对于这两种情况,我们的数据约简都非常有效,可以拒绝不可解的实例或解决简单的位置。我们发现这大大加快了计算速度。