有界度相似性图中的彩色直系同源聚类

Colorful orthology clustering in bounded-degree similarity graphs.

作者信息

Sánchez Alitzel López, Lafond Manuel

机构信息

Computer Science Department, Université de Sherbrooke, 2500 Boulevard de l'Université, Sherbrooke, Québec J1K 2R1, Canada.

出版信息

J Bioinform Comput Biol. 2021 Dec;19(6):2140010. doi: 10.1142/S0219720021400102. Epub 2021 Nov 13.

DOI:10.1142/S0219720021400102

PMID:34775924

Abstract

Clustering genes in similarity graphs is a popular approach for orthology prediction. Most algorithms group genes without considering their species, which results in clusters that contain several paralogous genes. Moreover, clustering is known to be problematic when in-paralogs arise from ancient duplications. Recently, we proposed a two-step process that avoids these problems. First, we infer clusters of only orthologs (i.e. with only genes from distinct species), and second, we infer the missing inter-cluster orthologs. In this paper, we focus on the first step, which leads to a problem we call Colorful Clustering. In general, this is as hard as classical clustering. However, in similarity graphs, the number of species is usually small, as well as the neighborhood size of genes in other species. We therefore study the problem of clustering in which the number of colors is bounded by [Formula: see text], and each gene has at most [Formula: see text] neighbors in another species. We show that the well-known formulation remains NP-hard even when [Formula: see text] and [Formula: see text]. We then propose a fixed-parameter algorithm in [Formula: see text] to find the single best cluster in the graph. We implemented this algorithm and included it in the aforementioned two-step approach. Experiments on simulated data show that this approach performs favorably to applying only an unconstrained clustering step.

摘要

在相似性图中对基因进行聚类是一种常用的直系同源预测方法。大多数算法在对基因进行分组时不考虑其所属物种，这导致聚类中包含几个旁系同源基因。此外，当直系同源基因源于古老的复制事件时，聚类会出现问题。最近，我们提出了一个两步过程来避免这些问题。首先，我们推断仅由直系同源基因组成的聚类（即仅包含来自不同物种的基因），其次，我们推断缺失的聚类间直系同源基因。在本文中，我们关注第一步，这导致了一个我们称为彩色聚类的问题。一般来说，这与经典聚类一样困难。然而，在相似性图中，物种数量通常较少，其他物种中基因的邻域大小也较小。因此，我们研究了颜色数量受[公式：见原文]限制且每个基因在另一个物种中最多有[公式：见原文]个邻居的聚类问题。我们表明，即使当[公式：见原文]和[公式：见原文]时，著名的[公式：见原文]公式仍然是NP难的。然后，我们提出了一种关于[公式：见原文]的固定参数算法，以在图中找到单个最佳聚类。我们实现了该算法并将其纳入上述两步方法中。对模拟数据的实验表明，这种方法比仅应用无约束聚类步骤的方法表现更好。