Bohnenkämper Leonard
Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstraße 25, 33615, Bielefeld, NRW, Germany.
Algorithms Mol Biol. 2024 Feb 27;19(1):8. doi: 10.1186/s13015-024-00253-7.
One of the most fundamental problems in genome rearrangement studies is the (genomic) distance problem. It is typically formulated as finding the minimum number of rearrangements under a model that are needed to transform one genome into the other. A powerful multi-chromosomal model is the Double Cut and Join (DCJ) model.While the DCJ model is not able to deal with some situations that occur in practice, like duplicated or lost regions, it was extended over time to handle these cases. First, it was extended to the DCJ-indel model, solving the issue of lost markers. Later ILP-solutions for so called natural genomes, in which each genomic region may occur an arbitrary number of times, were developed, enabling in theory to solve the distance problem for any pair of genomes. However, some theoretical and practical issues remained unsolved. On the theoretical side of things, there exist two disparate views of the DCJ-indel model, motivated in the same way, but with different conceptualizations that could not be reconciled so far. On the practical side, while ILP solutions for natural genomes typically perform well on telomere to telomere resolved genomes, they have been shown in recent years to quickly loose performance on genomes with a large number of contigs or linear chromosomes. This has been linked to a particular technique, namely capping. Simply put, capping circularizes linear chromosomes by concatenating them during solving time, increasing the solution space of the ILP superexponentially. Recently, we introduced a new conceptualization of the DCJ-indel model within the context of another rearrangement problem. In this manuscript, we will apply this new conceptualization to the distance problem. In doing this, we uncover the relation between the disparate conceptualizations of the DCJ-indel model. We are also able to derive an ILP solution to the distance problem that does not rely on capping. This solution significantly improves upon the performance of previous solutions on genomes with high numbers of contigs while still solving the problem exactly and being competitive in performance otherwise. We demonstrate the performance advantage on simulated genomes as well as showing its practical usefulness in an analysis of 11 Drosophila genomes.
基因组重排研究中最基本的问题之一是(基因组)距离问题。它通常被表述为在一种模型下找到将一个基因组转化为另一个基因组所需的最少重排次数。一种强大的多染色体模型是双切接(DCJ)模型。虽然DCJ模型无法处理实际中出现的一些情况,比如重复或缺失区域,但随着时间的推移它被扩展以处理这些情况。首先,它被扩展为DCJ - 插入缺失模型,解决了标记丢失的问题。后来针对所谓的自然基因组开发了整数线性规划(ILP)解决方案,其中每个基因组区域可能出现任意次数,理论上能够解决任意一对基因组的距离问题。然而,一些理论和实际问题仍未解决。从事物的理论方面来看,对于DCJ - 插入缺失模型存在两种截然不同的观点,它们的动机相同,但概念化方式不同,到目前为止无法调和。在实际方面,虽然针对自然基因组的ILP解决方案在端粒到端粒解析的基因组上通常表现良好,但近年来已表明它们在具有大量重叠群或线性染色体的基因组上会迅速失去性能。这与一种特定技术,即加帽有关。简单地说,加帽是在求解时通过连接线性染色体使其环化,从而使ILP解决方案空间呈超指数增长。最近,我们在另一个重排问题的背景下引入了DCJ - 插入缺失模型的一种新的概念化。在本手稿中,我们将把这种新的概念化应用于距离问题。通过这样做,我们揭示了DCJ - 插入缺失模型不同概念化之间的关系。我们还能够推导出一个不依赖加帽的距离问题的ILP解决方案。该解决方案在具有大量重叠群的基因组上显著提高了先前解决方案的性能,同时仍然能够准确地解决问题,并且在其他方面性能也具有竞争力。我们在模拟基因组上展示了性能优势,并在对11个果蝇基因组的分析中展示了其实际用途。