Doerr Daniel, Kowada Luis Antonio B, Araujo Eloi, Deshpande Shachi, Dantas Simone, Moret Bernard M E, Stoye Jens
1 École Polytechnique Fédérale de Lausanne , Lausanne, Switzerland .
2 Universidade Federal Fluminense , Niterói, Brazil .
J Comput Biol. 2017 Jun;24(6):616-634. doi: 10.1089/cmb.2017.0065.
Many important questions in molecular biology, evolution, and biomedicine can be addressed by comparative genomic approaches. One of the basic tasks when comparing genomes is the definition of measures of similarity (or dissimilarity) between two genomes, for example, to elucidate the phylogenetic relationships between species. The power of different genome comparison methods varies with the underlying formal model of a genome. The simplest models impose the strong restriction that each genome under study must contain the same genes, each in exactly one copy. More realistic models allow several copies of a gene in a genome. One speaks of gene families, and comparative genomic methods that allow this kind of input are called gene family-based. The most powerful-but also most complex-models avoid this preprocessing of the input data and instead integrate the family assignment within the comparative analysis. Such methods are called gene family-free. In this article, we study an intermediate approach between family-based and family-free genomic similarity measures. Introducing this simpler model, called gene connections, we focus on the combinatorial aspects of gene family-free genome comparison. While in most cases, the computational costs to the general family-free case are the same, we also find an instance where the gene connections model has lower complexity. Within the gene connections model, we define three variants of genomic similarity measures that have different expression powers. We give polynomial-time algorithms for two of them, while we show NP-hardness for the third, most powerful one. We also generalize the measures and algorithms to make them more robust against recent local disruptions in gene order. Our theoretical findings are supported by experimental results, proving the applicability and performance of our newly defined similarity measures.
分子生物学、进化和生物医学中的许多重要问题都可以通过比较基因组方法来解决。比较基因组时的一项基本任务是定义两个基因组之间相似性(或不相似性)的度量,例如,以阐明物种之间的系统发育关系。不同基因组比较方法的能力因基因组的底层形式模型而异。最简单的模型施加了严格的限制,即所研究的每个基因组必须包含相同的基因,且每个基因只有一个拷贝。更现实的模型允许基因组中有一个基因的多个拷贝。人们称之为基因家族,允许这种输入的比较基因组方法称为基于基因家族的方法。最强大但也最复杂的模型避免了对输入数据的这种预处理,而是在比较分析中整合家族分配。这种方法称为无基因家族的方法。在本文中,我们研究了一种介于基于家族和无家族的基因组相似性度量之间的中间方法。引入这个更简单的模型,称为基因连接,我们专注于无基因家族的基因组比较的组合方面。虽然在大多数情况下,与一般的无家族情况相比计算成本相同,但我们也发现了一个实例,其中基因连接模型具有更低的复杂度。在基因连接模型中,我们定义了三种具有不同表达能力的基因组相似性度量变体。我们为其中两种给出了多项式时间算法,而对于第三种最强大的变体,我们证明了它是NP难的。我们还对这些度量和算法进行了推广,使其对最近基因顺序的局部破坏更具鲁棒性。我们的理论发现得到了实验结果的支持,证明了我们新定义的相似性度量的适用性和性能。