David R Cheriton School of Computer Science, University of Waterloo, Waterloo, ON.
BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S55. doi: 10.1186/1471-2105-12-S1-S55.
Given n strings s1, …, sn each of length ℓ and a nonnegative integer d, the CLOSEST STRING problem asks to find a center string s such that none of the input strings has Hamming distance greater than d from s. Finding a common pattern in many--but not necessarily all--input strings is an important task that plays a role in many applications in bioinformatics.
Although the closest string model is robust to the oversampling of strings in the input, it is severely affected by the existence of outliers. We propose a refined model, the closest string with outliers (CSWO) problem, to overcome this limitation. This new model asks for a center string s that is within Hamming distance d to at least n - k of the n input strings, where k is a parameter describing the maximum number of outliers. A CSWO solution not only provides the center string as a representative for the set of strings but also reveals the outliers of the set.We provide fixed parameter algorithms for CSWO when d and k are parameters, for both bounded and unbounded alphabets. We also show that when the alphabet is unbounded the problem is W[1]-hard with respect to n - k, ℓ, and d.
Our refined model abstractly models finding common patterns in several but not all input strings. We initialize the study of the computability of this model and show that it is sensitive to different parameterizations. Lastly, we conclude by suggesting several open problems which warrant further investigation.
给定 n 个长度为 ℓ 的字符串 s1,..., sn 和一个非负整数 d,最近字符串问题要求找到一个中心字符串 s,使得没有输入字符串与 s 的汉明距离大于 d。在许多生物信息学应用中,找到许多(但不一定是所有)输入字符串的共同模式是一项重要任务。
尽管最近字符串模型对输入中字符串的过采样具有鲁棒性,但它受到异常值的存在的严重影响。我们提出了一种改进的模型,即带有异常值的最近字符串(CSWO)问题,以克服这一限制。这个新模型要求一个中心字符串 s,它与 n 个输入字符串中的至少 n - k 个的汉明距离在 d 以内,其中 k 是描述异常值最大数量的参数。CSWO 解决方案不仅提供了作为字符串集代表的中心字符串,还揭示了字符串集的异常值。当 d 和 k 是参数时,我们为 CSWO 提供了具有固定参数的算法,适用于有界和无界字母表。我们还表明,当字母表无界时,该问题相对于 n - k、ℓ 和 d 是 W[1]-hard 的。
我们改进的模型抽象地模拟了在几个而不是所有输入字符串中找到共同模式的问题。我们初始化了对该模型可计算性的研究,并表明它对不同的参数化很敏感。最后,我们提出了几个值得进一步研究的开放性问题。