Natural family-free genomic distance.

Authors

Rubert Diego P, Martinez Fábio V, Braga Marília D V

Affiliations

Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil.

Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany.

Publication

Algorithms Mol Biol. 2021 May 10;16(1):4. doi: 10.1186/s13015-021-00183-8.

Abstract

BACKGROUND

A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of large-scale rearrangements required to transform one given genome into another. Traditional approaches in this area are family-based, i.e., they require the classification of DNA fragments of both genomes into families. Furthermore, the most elementary family-based models, which can compute distances in polynomial time, restrict each family to occur at most once in each genome. In contrast, distance computation in models that allow multifamilies (i.e., families with multiple occurrences) is NP-hard. Very recently, Bohnenkämper et al. (J Comput Biol 28:410-431, 2021) proposed an integer linear programming (ILP) formulation for computing the genomic distance of genomes with multifamilies, allowing structural rearrangements, represented by the generic double cut and join (DCJ) operation, and content-modifying insertions and deletions of DNA segments. This ILP is very efficient, but it must maximize a matching of the genes in each multifamily in order to prevent the free lunch artifact, which would otherwise let empty or almost empty matchings give smaller distances.
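To make the polynomial-time case concrete, here is a minimal sketch of the plain DCJ distance when every family occurs exactly once in each genome. It assumes a simplified setting that is not taken from the paper: each genome is a single circular chromosome, written as a list of signed integers, and both genomes contain the same genes. In that setting the DCJ distance is the number of genes minus the number of cycles in the adjacency graph. The function names `adjacencies` and `dcj_distance` are hypothetical, and the sketch covers neither insertions and deletions nor multifamilies.

```python
def adjacencies(genome):
    """Map each gene extremity to its neighbour in a circular signed gene order.

    Assumption for this sketch: one circular chromosome, genes as signed ints.
    An extremity is (gene, "t") for the tail or (gene, "h") for the head.
    """
    partner = {}
    n = len(genome)
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        right = (abs(a), "h" if a > 0 else "t")  # extremity where gene a ends
        left = (abs(b), "t" if b > 0 else "h")   # extremity where gene b starts
        partner[right] = left
        partner[left] = right
    return partner


def dcj_distance(genome_a, genome_b):
    """DCJ distance between two circular genomes with singleton families."""
    genes_a = sorted(abs(g) for g in genome_a)
    genes_b = sorted(abs(g) for g in genome_b)
    if genes_a != genes_b or len(set(genes_a)) != len(genes_a):
        raise ValueError("genomes must contain the same genes, each exactly once")

    pa, pb = adjacencies(genome_a), adjacencies(genome_b)
    seen = set()
    cycles = 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Alternate between an adjacency of A and an adjacency of B
        # until the walk returns to an extremity already visited.
        while x not in seen:
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    return len(genome_a) - cycles


print(dcj_distance([1, 2, 3], [1, 2, 3]))        # identical genomes
print(dcj_distance([1, 2, 3], [1, -2, 3]))       # one inverted gene
print(dcj_distance([1, 2, 3, 4], [1, 3, 2, 4]))  # two genes swapped
```

Once a gene occurs several times, the extremities no longer pair up uniquely, so a matching between occurrences has to be chosen before a graph like this can be built. That choice is the part the ILP formulations address.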

RESULTS

In this paper, we adopt the alternative family-free setting, which, instead of relying on family classification, simply uses the pairwise similarities between DNA fragments of both genomes to compute their rearrangement distance. We adapted the ILP mentioned above and developed a model in which pairwise similarities are used to assign weights to both matched and unmatched genes, so that an optimal solution does not necessarily maximize the matching. Our model thus yields a natural family-free genomic distance that takes into consideration all given genes, without prior classification into families, and has a search space composed of matchings of any size. In spite of its bigger search space, our ILP seems to be boosted by a reduction in the number of co-optimal solutions due to the weights. Indeed, it converged faster than the original one by Bohnenkämper et al. for instances with the same number of multiple connections. We can handle not only bacterial genomes, but also those of fungi and insects, or sets of chromosomes of mammals and plants. In a comparison study of six fruit fly genomes, we obtained accurate results.
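The difference between a search space of maximum matchings and one of matchings of any size can be seen on a toy similarity graph. The sketch below is only an illustration: the gene names, the similarity values and the helper `all_matchings` are hypothetical, and it does not reproduce the weighting scheme or the ILP of the paper, which the abstract does not specify. It enumerates by brute force every set of similarity edges in which no gene is used twice.

```python
from itertools import combinations


def all_matchings(edges):
    """Yield every matching of any size from a list of (gene_a, gene_b, similarity)."""
    for k in range(len(edges) + 1):
        for subset in combinations(edges, k):
            genes_a = {a for a, _, _ in subset}
            genes_b = {b for _, b, _ in subset}
            # A valid matching uses each gene of either genome at most once.
            if len(genes_a) == k and len(genes_b) == k:
                yield subset


# Hypothetical pairwise similarities between genes of genome A and genome B.
similarities = [("a1", "b1", 0.9), ("a1", "b2", 0.4), ("a2", "b2", 0.8)]

matchings = list(all_matchings(similarities))
largest = max(len(m) for m in matchings)
maximum_matchings = [m for m in matchings if len(m) == largest]

print(len(matchings))          # matchings of any size, including the empty one
print(len(maximum_matchings))  # matchings of maximum size only
```

Even with three similarity edges there are several candidate matchings but only one of maximum size. A family-free model has to score all of them, which is why the weights on matched and unmatched genes matter.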

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8af1/8111734/d937f0877fbd/13015_2021_183_Fig1_HTML.jpg
