IEEE/ACM Trans Comput Biol Bioinform. 2022 Jul-Aug;19(4):2071-2079. doi: 10.1109/TCBB.2021.3083896. Epub 2022 Aug 8.
Scaffold filling is a critical step in DNA assembly. Its basic task is to fill the missing genes (fragments) into an incomplete genome (scaffold) to make it similar to the reference genome. There have been a lot of work under distinct measurements in the literature of genome comparison. For genomes with gene duplications, common string partition reveals the similarity more precisely, since it constructs a one-to-one correspondence between the same segments in the two genomes. In this paper, we adopt duo-preservation as the measurement, which is the complement of common string partition, i.e., the number of duo-preservations added to the number of common strings is exactly the length of a genome. Towards a proper scaffold filling, we just focus on the increased duo-preservations. This problem is called scaffold filling to maximize increased duo-preservations (abbr. SF-MIDP). We show that SF-MIDP is solvable in linear time for a simple version where all the genes of the scaffold are matched in a block-matching, but MAX SNP-complete for the general version, and cannot be approximated within [Formula: see text]. Moreover, we present a basic approximation algorithm of factor 2, by which the optimal solution can be described in a new way, and then, improve the approximation factor to [Formula: see text] via a greedy method.
支架填充是 DNA 组装的关键步骤。它的基本任务是将缺失的基因(片段)填充到不完整的基因组(支架)中,使其与参考基因组相似。在基因组比较的文献中有很多不同的方法进行这项工作。对于具有基因重复的基因组,常见的字符串划分更精确地揭示了相似性,因为它在两个基因组中的相同片段之间建立了一对一的对应关系。在本文中,我们采用二元保存作为度量标准,它是常见字符串划分的补集,即添加的二元保存数与公共字符串数之和恰好是基因组的长度。为了进行适当的支架填充,我们只关注增加的二元保存数。这个问题被称为最大化增加的二元保存数的支架填充(缩写为 SF-MIDP)。我们表明,对于一个简单的版本,其中所有支架的基因都在块匹配中匹配,SF-MIDP 可以在线性时间内求解,但对于一般版本,它是 MAX SNP 完全的,并且不能在 [Formula: see text] 内逼近。此外,我们提出了一个基本的逼近算法,其逼近因子为 2,通过该算法可以以一种新的方式描述最优解,然后通过贪婪方法将逼近因子提高到 [Formula: see text]。