Lyubetsky Vassily, Gershgorin Roman, Gorbunov Konstantin
Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), Bolshoy Karetny per. 19, build.1, Moscow, 127051, Russia.
Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, Leninskiye Gory 1, Main Building, Moscow, 119991, Russia.
BMC Bioinformatics. 2017 Dec 6;18(1):537. doi: 10.1186/s12859-017-1944-x.
Chromosome structure is a very limited model of the genome including the information about its chromosomes such as their linear or circular organization, the order of genes on them, and the DNA strand encoding a gene. Gene lengths, nucleotide composition, and intergenic regions are ignored. Although highly incomplete, such structure can be used in many cases, e.g., to reconstruct phylogeny and evolutionary events, to identify gene synteny, regulatory elements and promoters (considering highly conserved elements), etc. Three problems are considered; all assume unequal gene content and the presence of gene paralogs. The distance problem is to determine the minimum number of operations required to transform one chromosome structure into another and the corresponding transformation itself including the identification of paralogs in two structures. We use the DCJ model which is one of the most studied combinatorial rearrangement models. Double-, sesqui-, and single-operations as well as deletion and insertion of a chromosome region are considered in the model; the single ones comprise cut and join. In the reconstruction problem, a phylogenetic tree with chromosome structures in the leaves is given. It is necessary to assign the structures to inner nodes of the tree to minimize the sum of distances between terminal structures of each edge and to identify the mutual paralogs in a fairly large set of structures. A linear algorithm is known for the distance problem without paralogs, while the presence of paralogs makes it NP-hard. If paralogs are allowed but the insertion and deletion operations are missing (and special constraints are imposed), the reduction of the distance problem to integer linear programming is known. Apparently, the reconstruction problem is NP-hard even in the absence of paralogs. The problem of contigs is to find the optimal arrangements for each given set of contigs, which also includes the mutual identification of paralogs.
We proved that these problems can be reduced to integer linear programming formulations, which allows an algorithm to redefine the problems to implement a very special case of the integer linear programming tool. The results were tested on synthetic and biological samples.
Three well-known problems were reduced to a very special case of integer linear programming, which is a new method of their solutions. Integer linear programming is clearly among the main computational methods and, as generally accepted, is fast on average; in particular, computation systems specifically targeted at it are available. The challenges are to reduce the size of the corresponding integer linear programming formulations and to incorporate a more detailed biological concept in our model of the reconstruction.
染色体结构是基因组的一种非常有限的模型,它包含有关染色体的信息,例如它们的线性或环状组织、其上基因的顺序以及编码基因的DNA链。基因长度、核苷酸组成和基因间区域被忽略。尽管非常不完整,但这种结构在许多情况下都可以使用,例如,用于重建系统发育和进化事件、识别基因共线性、调控元件和启动子(考虑高度保守的元件)等。考虑了三个问题;所有这些问题都假设基因含量不相等且存在基因旁系同源物。距离问题是确定将一种染色体结构转换为另一种染色体结构所需的最少操作数以及相应的转换本身,包括识别两种结构中的旁系同源物。我们使用DCJ模型,它是研究最多的组合重排模型之一。该模型考虑了双操作、半操作和单操作以及染色体区域的缺失和插入;单操作包括切割和连接。在重建问题中,给出了一个在叶节点具有染色体结构的系统发育树。有必要将结构分配给树的内部节点,以最小化每条边的终端结构之间的距离之和,并在相当大的一组结构中识别相互的旁系同源物。对于没有旁系同源物的距离问题,已知一种线性算法,而旁系同源物的存在使其成为NP难问题。如果允许旁系同源物但缺少插入和缺失操作(并施加特殊约束),则已知将距离问题简化为整数线性规划。显然,即使在没有旁系同源物的情况下,重建问题也是NP难的。重叠群问题是为每个给定的重叠群集找到最优排列,这也包括相互识别旁系同源物。
我们证明了这些问题可以简化为整数线性规划公式,这允许一种算法重新定义这些问题,以实现整数线性规划工具非常特殊的情况。结果在合成样本和生物样本上进行了测试。
三个著名的问题被简化为整数线性规划的一个非常特殊的情况,这是解决这些问题的一种新方法。整数线性规划显然是主要的计算方法之一,并且如普遍接受的那样,平均速度很快;特别是,有专门针对它的计算系统。挑战在于减小相应整数线性规划公式的规模,并在我们的重建模型中纳入更详细的生物学概念。