Mane Aniket C, Lafond Manuel, Feijao Pedro C, Chauve Cedric
1Department of Mathematics, Simon Fraser University, 8888 University Drive, Burnaby, V5A 1S6 Canada.
2Department of Computer Science, Université de Sherbrooke, Boulevard de l'Université, Sherbrooke, J1K 2R1 Canada.
Algorithms Mol Biol. 2020 May 4;15:8. doi: 10.1186/s13015-020-00169-y. eCollection 2020.
In the field of genome rearrangement algorithms, models accounting for gene duplication lead often to hard problems. For example, while computing the pairwise distance is tractable in most duplication-free models, the problem is NP-complete for most extensions of these models accounting for duplicated genes. Moreover, problems involving more than two genomes, such as the genome median and the Small Parsimony problem, are intractable for most duplication-free models, with some exceptions, for example the Single-Cut-or-Join (SCJ) model.
We introduce a variant of the SCJ distance that accounts for duplicated genes, in the context of directed evolution from an ancestral genome to a descendant genome where orthology relations between ancestral genes and their descendant are known. Our model includes two duplication mechanisms: single-gene tandem duplication and the creation of single-gene circular chromosomes. We prove that in this model, computing the directed distance and a parsimonious evolutionary scenario in terms of SCJ and single-gene duplication events can be done in linear time. We also show that the directed median problem is tractable for this distance, while the rooted median problem, where we assume that one of the given genomes is ancestral to the median, is NP-complete. We also describe an Integer Linear Program for solving this problem. We evaluate the directed distance and rooted median algorithms on simulated data.
Our results provide a simple genome rearrangement model, extending the SCJ model to account for single-gene duplications, for which we prove a mix of tractability and hardness results. For the NP-complete rooted median problem, we design a simple Integer Linear Program. Our publicly available implementation of these algorithms for the directed distance and median problems allow to solve efficiently these problems on large instances.
在基因组重排算法领域,考虑基因复制的模型常常会导致难题。例如,虽然在大多数无复制模型中计算成对距离是可处理的,但对于这些考虑了重复基因的模型的大多数扩展而言,该问题是NP完全问题。此外,涉及两个以上基因组的问题,如基因组中位数和小简约问题,对于大多数无复制模型来说是难以处理的,不过也有一些例外,比如单切割或连接(SCJ)模型。
我们引入了一种SCJ距离的变体,该变体在从祖先基因组到后代基因组的定向进化背景下考虑了重复基因,其中祖先基因与其后代之间的直系同源关系是已知的。我们的模型包括两种复制机制:单基因串联复制和单基因环状染色体的产生。我们证明,在这个模型中,根据SCJ和单基因复制事件计算定向距离和简约进化场景可以在线性时间内完成。我们还表明,对于这个距离,定向中位数问题是可处理的,而根中位数问题(即我们假设给定基因组之一是中位数的祖先)是NP完全问题。我们还描述了一个用于解决此问题的整数线性规划。我们在模拟数据上评估了定向距离和根中位数算法。
我们的结果提供了一个简单的基因组重排模型,将SCJ模型扩展到考虑单基因复制,我们证明了其兼具可处理性和难度结果。对于NP完全的根中位数问题,我们设计了一个简单的整数线性规划。我们针对定向距离和中位数问题的这些算法的公开可用实现允许在大型实例上高效地解决这些问题。