Oliveira Andre R, Jean Géraldine, Fertin Guillaume, Dias Ulisses, Dias Zanoni
1Institute of Computing, University of Campinas, Campinas, Brazil.
2LS2N, UMR CNRS 6004, University of Nantes, Nantes, France.
Algorithms Mol Biol. 2019 Nov 5;14:21. doi: 10.1186/s13015-019-0156-5. eCollection 2019.
The evolutionary distance between two genomes can be estimated by computing a minimum length sequence of operations, called , that transform one genome into another. Usually, a genome is modeled as an ordered sequence of genes, and most of the studies in the genome rearrangement literature consist in shaping biological scenarios into mathematical models. For instance, allowing different genome rearrangements operations at the same time, adding constraints to these rearrangements (e.g., each rearrangement can affect at most a given number of genes), considering that a rearrangement implies a cost depending on its length rather than a unit cost, etc. Most of the works, however, have overlooked some important features inside genomes, such as the presence of sequences of nucleotides between genes, called .
In this work, we investigate the problem of computing the distance between two genomes, taking into account both gene order and intergenic sizes. The genome rearrangement operations we consider here are constrained types of reversals and transpositions, called (SSRs) and (SSTs), which affect up to two (consecutive) genes. We denote by (SSOs) any SSR or SST. We show 3-approximation algorithms when the orientation of the genes is not considered when we allow SSRs, SSTs, or SSOs, and 5-approximation algorithms when considering the orientation for either SSRs or SSOs. We also show that these algorithms improve their approximation factors when the input permutation has a higher number of inversions, where the approximation factor decreases from 3 to either 2 or 1.5, and from 5 to either 3 or 2.
两个基因组之间的进化距离可以通过计算将一个基因组转化为另一个基因组所需的最小操作序列长度来估计,这种操作序列称为 。通常,基因组被建模为基因的有序序列,基因组重排文献中的大多数研究在于将生物学场景转化为数学模型。例如,允许同时进行不同的基因组重排操作,对这些重排添加约束条件(例如,每次重排最多可影响给定数量的基因),考虑到重排意味着根据其长度而非单位成本的代价等。然而,大多数工作都忽略了基因组内部的一些重要特征,例如基因之间核苷酸序列的存在,称为 。
在这项工作中,我们研究了计算两个基因组之间距离的问题,同时考虑了基因顺序和基因间大小。我们在此考虑的基因组重排操作是受限类型的反转和转座,称为 (SSRs)和 (SSTs),它们最多影响两个(连续)基因。我们用 (SSOs)表示任何SSR或SST。当不考虑基因方向时,我们展示了允许SSR、SST或SSO时的3近似算法,以及考虑SSR或SSO方向时的5近似算法。我们还表明,当输入排列具有更多反转时,这些算法会提高其近似因子,其中近似因子从3降至2或1.5,从5降至3或2。