IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2080-2093. doi: 10.1109/TCBB.2021.3077418. Epub 2021 Dec 8.
Genome rearrangements are events that affect large stretches of genomes during evolution. Many mathematical models have been used to estimate the evolutionary distance between two genomes based on genome rearrangements. However, most of them focus only on the order of the genes in a genome and disregard its other important elements. Recently, researchers have shown that considering the regions between each pair of consecutive genes, called intergenic regions, can improve distance estimation on realistic data. Two of the most studied genome rearrangements are the reversal, which inverts a sequence of genes, and the transposition, which occurs when two adjacent gene sequences swap their positions inside the genome. In this work, we study the transposition distance between two genomes while also taking intergenic regions into account, a problem we name Sorting by Intergenic Transpositions. We show that this problem is NP-hard and propose two approximation algorithms, with factors 3.5 and 2.5, considering two distinct definitions for the problem. We also investigate the signed reversal and transposition distance between two genomes considering their intergenic regions. This second problem is called Sorting by Signed Intergenic Reversals and Intergenic Transpositions. We show that this problem is NP-hard and develop two approximation algorithms, with factors 3 and 2.5. We also examine how these algorithms behave when weights are assigned to the genome rearrangements. Finally, we implemented all these algorithms and tested them on real and simulated data.
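To make the two rearrangements concrete, the following minimal Python sketch applies one intergenic transposition and one signed intergenic reversal to a genome stored as a gene list plus a list of intergenic region sizes. The representation, the 0-based half-open indexing, the cut-point parameters `x`, `y`, `z`, and the function names `intergenic_transposition` and `signed_intergenic_reversal` are assumptions made for illustration only. They are not the notation or the approximation algorithms of the paper.

```python
from typing import List, Tuple

Genome = Tuple[List[int], List[int]]


def intergenic_transposition(genes: List[int], ir: List[int],
                             i: int, j: int, k: int,
                             x: int, y: int, z: int) -> Genome:
    """Swap the adjacent segments genes[i:j] and genes[j:k].

    Assumed layout: ir has len(genes) + 1 entries, and ir[p] is the size
    of the intergenic region immediately to the left of genes[p]
    (ir[-1] lies to the right of the last gene). The regions ir[i],
    ir[j], ir[k] are cut after x, y, z nucleotides respectively.
    """
    n = len(genes)
    if len(ir) != n + 1:
        raise ValueError("ir must have len(genes) + 1 entries")
    if not 0 <= i < j < k <= n:
        raise ValueError("need 0 <= i < j < k <= len(genes)")
    if not (0 <= x <= ir[i] and 0 <= y <= ir[j] and 0 <= z <= ir[k]):
        raise ValueError("each cut must fall inside its intergenic region")

    new_genes = genes[:i] + genes[j:k] + genes[i:j] + genes[k:]
    new_ir = (ir[:i]
              + [x + (ir[j] - y)]      # left part of ir[i] + right part of ir[j]
              + ir[j + 1:k]            # regions inside the second segment
              + [z + (ir[i] - x)]      # left part of ir[k] + right part of ir[i]
              + ir[i + 1:j]            # regions inside the first segment
              + [y + (ir[k] - z)]      # left part of ir[j] + right part of ir[k]
              + ir[k + 1:])
    return new_genes, new_ir


def signed_intergenic_reversal(genes: List[int], ir: List[int],
                               i: int, j: int, x: int, y: int) -> Genome:
    """Reverse genes[i:j], flipping the sign of every gene in it.

    The regions ir[i] and ir[j] are cut after x and y nucleotides.
    """
    n = len(genes)
    if len(ir) != n + 1:
        raise ValueError("ir must have len(genes) + 1 entries")
    if not 0 <= i < j <= n:
        raise ValueError("need 0 <= i < j <= len(genes)")
    if not (0 <= x <= ir[i] and 0 <= y <= ir[j]):
        raise ValueError("each cut must fall inside its intergenic region")

    new_genes = genes[:i] + [-g for g in reversed(genes[i:j])] + genes[j:]
    new_ir = (ir[:i]
              + [x + y]                        # the two left parts are joined
              + ir[i + 1:j][::-1]              # inner regions reverse with the genes
              + [(ir[i] - x) + (ir[j] - y)]    # the two right parts are joined
              + ir[j + 1:])
    return new_genes, new_ir


if __name__ == "__main__":
    # One transposition sorts the gene order and redistributes region sizes.
    print(intergenic_transposition([2, 3, 1], [1, 2, 3, 4], 0, 2, 3, 1, 0, 2))
    # -> ([1, 2, 3], [4, 2, 2, 2])

    # One signed reversal fixes both the order and the orientation.
    print(signed_intergenic_reversal([1, -3, -2], [2, 1, 3, 4], 1, 3, 0, 1))
    # -> ([1, 2, 3], [2, 1, 3, 4])
```

Both operations preserve the total amount of intergenic material; they only redistribute it among the regions that were cut. This is why a sorting sequence in this setting has to match the target region sizes as well as the target gene order.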