Rubert Diego P, Feijão Pedro, Braga Marília Dias Vieira, Stoye Jens, Martinez Fábio Henrique Viduani
Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, MS Brazil.
Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany.
Algorithms Mol Biol. 2017 Mar 9;12:3. doi: 10.1186/s13015-017-0095-y. eCollection 2017.
Rearrangements are large-scale mutations in genomes, responsible for complex changes and structural variations. Most rearrangements that modify the organization of a genome can be represented by the double cut and join (DCJ) operation. Given two balanced genomes, i.e., two genomes that have exactly the same number of occurrences of each gene in each genome, we are interested in the problem of computing the rearrangement distance between them, i.e., finding the minimum number of DCJ operations that transform one genome into the other. This problem is known to be NP-hard.
We propose a linear time approximation algorithm with approximation factor () for the DCJ distance problem, where is the maximum number of occurrences of any gene in the input genomes. Our algorithm works for linear and circular unichromosomal balanced genomes and uses as an intermediate step an ()-approximation for the minimum common string partition problem, which is closely related to the DCJ distance problem.
Experiments on simulated data sets show that our approximation algorithm is very competitive both in efficiency and in quality of the solutions.
重排是基因组中的大规模突变,会导致复杂的变化和结构变异。大多数改变基因组组织的重排都可以用双切割与连接(DCJ)操作来表示。给定两个平衡基因组,即每个基因组中每个基因出现次数完全相同的两个基因组,我们关注计算它们之间重排距离的问题,也就是找到将一个基因组转化为另一个基因组所需的最少DCJ操作次数。已知这个问题是NP难的。
我们针对DCJ距离问题提出了一种近似因子为()的线性时间近似算法,其中是输入基因组中任何基因的最大出现次数。我们的算法适用于线性和环状单染色体平衡基因组,并将最小公共字符串划分问题的()近似作为中间步骤,该问题与DCJ距离问题密切相关。
在模拟数据集上进行的实验表明,我们的近似算法在效率和解决方案质量方面都非常有竞争力。