IEEE/ACM Trans Comput Biol Bioinform. 2020 Nov-Dec;17(6):1946-1954. doi: 10.1109/TCBB.2019.2913834. Epub 2020 Dec 8.
The one-sided Exemplar Adjacency Number (EAN) is a known problem for computing the exemplar similarity between a generic linear genome G with gene duplications and an exemplar genome H (over the same set of n gene families). In this problem, we need to compute an exemplar genome G, which is a permutation obtained from G, such that the number of common adjacencies between G and H is maximized. Unfortunately, the problem is not only NP-hard but also NP-hard to approximate. In this paper, we approach the problem by relaxing the constraint such that a sub-permutation G obtained from G does not have to include all the gene families, but still needs to have a length at least k. Hence G is called a pseudo-exemplar genome. Then, a slightly more general problem (One-sided EAN+) is defined: compute a pseudo-exemplar genome G from G such that the number of common adjacencies between H and G is maximized. Certainly One-sided EAN+ contains One-sided EAN as a special case; moreover, it presents some flexibility in designing algorithms. First, we relax and formulate the One-sided EAN+ problem as the maximum independent set (MIS) on a colored interval graph and hence reduce the appearance of each gene to at most two times. We show that this new relaxation is still NP-complete, though a simple factor-2 approximation algorithm can be designed; moreover, we also prove that the problem cannot be approximated within 2-ε by a local search technique. We then show that this relaxed version is fixed-parameter tractable (FPT). Second, to ensure that each gene appears in G at most once, we use integer linear programming (ILP) to solve this problem. Finally, we implement our algorithm and compare it with the up-to-date software GREDU, with simulated signed and unsigned genomes. It turns out that our algorithm is more stable and can process genomes of length up to 12,000 for signed genomes (while GREDU can falter on such a large signed genome and it cannot handle unsigned genomes at all).
单边范例邻接数(EAN)是计算具有基因重复的通用线性基因组 G 与范例基因组 H 之间范例相似度的已知问题(在相同的 n 个基因家族集合上)。在这个问题中,我们需要计算一个范例基因组 G,它是从 G 中获得的排列,使得 G 和 H 之间的公共邻接数最大化。不幸的是,该问题不仅是 NP 难的,而且难以近似。在本文中,我们通过放宽约束来解决这个问题,使得从 G 获得的子排列 G 不一定包含所有的基因家族,但仍需要至少有 k 个长度。因此,G 被称为伪范例基因组。然后,定义了一个稍微更一般的问题(单边 EAN+):从 G 中计算一个伪范例基因组 G,使得 H 和 G 之间的公共邻接数最大化。当然,单边 EAN+ 是单边 EAN 的一个特例;此外,它在设计算法方面具有一定的灵活性。首先,我们将单边 EAN+ 问题放松并形式化为彩色区间图上的最大独立集(MIS),从而将每个基因的出现次数减少到最多两次。我们表明,尽管可以设计一个简单的因子 2 近似算法,但这种新的放松仍然是 NP 完全的;此外,我们还证明了该问题不能通过局部搜索技术在 2-ε 内逼近。然后,我们证明这个放松的版本是固定参数可处理的(FPT)。其次,为了确保每个基因在 G 中只出现一次,我们使用整数线性规划(ILP)来解决这个问题。最后,我们实现了我们的算法,并与最新的软件 GREDU 进行了比较,使用了模拟的有符号和无符号基因组。结果表明,我们的算法更稳定,可以处理长达 12000 个长度的有符号基因组(而 GREDU 在如此大的有符号基因组上可能会失败,并且根本无法处理无符号基因组)。