Applied Mathematics Department, Agrocampus Ouest, 65, rue de Saint-Brieuc, Rennes, France.
BMC Bioinformatics. 2013 Feb 7;14:42. doi: 10.1186/1471-2105-14-42.
Gene clustering algorithms are widely used by biologists when analysing omics data. Classical gene clustering strategies rely on expression data only, either directly, as in heatmaps, or indirectly, as in clustering based on coexpression networks. However, these classical strategies may not be sufficient to bring out all potential relationships amongst genes.
We propose a new unsupervised gene clustering algorithm that integrates external biological knowledge, such as Gene Ontology annotations, with expression data. We introduce a new distance between genes that incorporates biological knowledge into the analysis of expression data: two genes are close only if they have both similar expression profiles and similar functional profiles. A classical algorithm (e.g. K-means) is then used to obtain gene clusters. In addition, we propose an automatic procedure for evaluating gene clusters. This procedure is based on two indicators that measure the global coexpression and the biological homogeneity of gene clusters. Each indicator is associated with a hypothesis test, which supplies it with a p-value. Our clustering algorithm is compared with heatmap clustering and with clustering based on gene coexpression networks, on both simulated and real data. In both cases, it outperforms the other methodologies in that it provides the highest proportion of significantly coexpressed and biologically homogeneous gene clusters, which are good candidates for interpretation.
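The idea of a distance that is small only when both the expression profiles and the functional profiles agree can be illustrated with a minimal sketch. The abstract does not give the actual form of the distance, so everything below is an assumption made for illustration: the expression part is taken as a rescaled Pearson correlation, the functional part as a Jaccard distance between sets of annotation terms, and the two are mixed with a hypothetical weight `alpha`. The gene names and annotation terms are toy placeholders.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        # A constant profile has no defined correlation; treat it as uncorrelated.
        return 0.0
    return cov / (sx * sy)


def expression_distance(x, y):
    """Map a correlation in [-1, 1] to a distance in [0, 1] (assumed form)."""
    return (1.0 - pearson(x, y)) / 2.0


def annotation_distance(terms_a, terms_b):
    """Jaccard distance between two sets of annotation terms (assumed form)."""
    union = terms_a | terms_b
    if not union:
        return 0.0
    return 1.0 - len(terms_a & terms_b) / len(union)


def combined_distance(gene_a, gene_b, alpha=0.5):
    """Weighted mix of both distances; `alpha` is a hypothetical tuning weight."""
    d_expr = expression_distance(gene_a["expr"], gene_b["expr"])
    d_annot = annotation_distance(gene_a["terms"], gene_b["terms"])
    return alpha * d_expr + (1.0 - alpha) * d_annot


def distance_matrix(genes, alpha=0.5):
    """Symmetric matrix of combined distances, in the order of `genes`."""
    names = list(genes)
    return [
        [combined_distance(genes[a], genes[b], alpha) for b in names]
        for a in names
    ]


# Toy data: placeholder genes with made-up profiles and annotation terms.
genes = {
    "g1": {"expr": [1.0, 2.0, 3.0, 4.0], "terms": {"GO:A", "GO:B"}},
    "g2": {"expr": [2.0, 4.0, 6.0, 8.0], "terms": {"GO:A", "GO:B"}},
    "g3": {"expr": [1.0, 2.0, 3.0, 4.0], "terms": {"GO:C"}},
    "g4": {"expr": [4.0, 3.0, 2.0, 1.0], "terms": {"GO:C"}},
}

D = distance_matrix(genes)
for name, row in zip(genes, D):
    print(name, [round(d, 3) for d in row])
```

In this toy example, g1 and g3 have identical expression profiles but share no annotation term, so they end up farther apart than g1 and g2, which agree on both. A matrix like `D` could then be passed to any clustering method that accepts precomputed distances; how the distance is combined with K-means in the paper itself is not specified in the abstract.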
Our new clustering algorithm provides a higher proportion of good candidates for interpretation than the methods it was compared with. We therefore expect the interpretation of these clusters to help biologists formulate new hypotheses on the relationships amongst genes.