Alexandrino Alexsandro Oliveira, Oliveira Andre Rodrigues, Dias Ulisses, Dias Zanoni
Institute of Computing, University of Campinas, Campinas, Brazil.
School of Technology, University of Campinas, Limeira, Brazil.
J Comput Biol. 2021 Mar;28(3):235-247. doi: 10.1089/cmb.2020.0121. Epub 2020 Oct 20.
The rearrangement distance is a well-known problem in the field of comparative genomics. Given two genomes, the rearrangement distance is the minimum number of rearrangements in a set of allowed rearrangements (rearrangement model), which transforms one genome into the other. In rearrangement distance problems, a genome is modeled as a string, where each element represents a conserved region within the two genomes. When the orientation of the genes is known, it is represented by (plus or minus) signs assigned to the elements of the string. Two of the most studied rearrangements are reversals, which invert a segment of the genome, and transpositions, which exchange the relative positions of two adjacent segments of the genome. The first works in genome rearrangements considered that the genomes being compared had the same genetic material and that rearrangement events were restricted to reversals, transpositions, or both. El-Mabrouk extended the reversal model on signed strings to include the operations of insertion and deletion of segments in the genome, which allowed the comparison of genomes with different genetic material. Other studies also addressed this problem and, recently, this problem was proved to be solvable in polynomial time by Willing et al. For unsigned strings, we still observe a lack of results. That said, in this study we prove that computing the rearrangement distance for the following models is NP-Hard: reversals and indels on unsigned strings; transpositions and indels on unsigned strings; and reversals, transpositions, and indels on signed and unsigned strings. Along with the NP-hardness proofs, we present a 2-approximation algorithm for reversals on unsigned strings and 3-approximation algorithms for the other models.
重排距离是比较基因组学领域一个著名的问题。给定两个基因组,重排距离是一组允许的重排(重排模型)中,将一个基因组转化为另一个基因组所需的最少重排次数。在重排距离问题中,基因组被建模为一个字符串,其中每个元素代表两个基因组中的一个保守区域。当基因的方向已知时,用分配给字符串元素的(正或负)符号表示。研究最多的两种重排是反转,它会反转基因组的一个片段;以及转位,它会交换基因组中两个相邻片段的相对位置。最早关于基因组重排的研究认为,被比较的基因组具有相同的遗传物质,并且重排事件仅限于反转、转位或两者皆有。埃尔 - 马布鲁克将有符号字符串上的反转模型扩展到包括基因组片段的插入和删除操作,这使得能够比较具有不同遗传物质的基因组。其他研究也涉及了这个问题,最近,威林等人证明了这个问题在多项式时间内是可解的。对于无符号字符串,我们仍然缺乏相关结果。也就是说,在本研究中我们证明,计算以下模型的重排距离是NP难的:无符号字符串上的反转和插入缺失;无符号字符串上的转位和插入缺失;以及有符号和无符号字符串上的反转、转位和插入缺失。除了NP难的证明,我们还提出了一种用于无符号字符串上反转的2近似算法以及用于其他模型的3近似算法。