Nepomuceno Juan A, Troncoso Alicia, Nepomuceno-Chamorro Isabel A, Aguilar-Ruiz Jesús S
1 Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, Avd. Reina Mercedes s/n, Seville, 41012 Spain.
2 Área de Informática, Universidad Pablo de Olavide, Ctra. Utrera km. 1, Seville, 41013 Spain.
BioData Min. 2018 Mar 27;11:4. doi: 10.1186/s13040-018-0165-9. eCollection 2018.
Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. The biological knowledge available in public repositories can now be used to drive these algorithms toward biclusters composed of functionally coherent groups of genes. In addition, a distance between genes can be defined from the information stored about them in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes that establishes their functional similarity. This paper studies a scatter search-based algorithm that optimizes a merit function integrating GO information. This merit function includes a term that incorporates that information through a GO measure.
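The idea of a merit function with a GO term can be illustrated with a minimal sketch. It is not the merit function of the paper: the coherence term used here (the mean squared residue), the additive combination, the `weight` parameter, and all function names are assumptions chosen for illustration. `go_sim` is a hypothetical lookup of precomputed pairwise GO similarities in [0, 1].

```python
from itertools import combinations
from statistics import mean


def mean_squared_residue(expr, genes, samples):
    """Coherence stand-in (assumption): mean squared residue of the
    submatrix defined by `genes` and `samples`. Lower is more coherent."""
    sub = [[expr[g][s] for s in samples] for g in genes]
    row_means = [mean(row) for row in sub]
    col_means = [mean(col) for col in zip(*sub)]
    overall = mean(row_means)  # rows have equal length, so this is the grand mean
    return mean(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(len(genes))
        for j in range(len(samples))
    )


def go_dissimilarity(go_sim, genes):
    """GO term (assumption): one minus the average pairwise GO similarity
    of the genes in the bicluster. Unknown pairs count as similarity 0."""
    pairs = list(combinations(genes, 2))
    if not pairs:
        return 0.0
    return 1.0 - mean(go_sim.get(frozenset(p), 0.0) for p in pairs)


def merit(expr, go_sim, genes, samples, weight=1.0):
    """Hypothetical merit to minimize: expression coherence plus a
    weighted penalty for functionally dissimilar genes."""
    return (mean_squared_residue(expr, genes, samples)
            + weight * go_dissimilarity(go_sim, genes))


# Toy data: g1 and g2 follow the same additive pattern across samples.
expr = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [2.0, 3.0, 4.0],
    "g3": [5.0, 1.0, 9.0],
}
go_sim = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g1", "g3")): 0.1,
    frozenset(("g2", "g3")): 0.2,
}

good = merit(expr, go_sim, ["g1", "g2"], [0, 1, 2])
bad = merit(expr, go_sim, ["g1", "g3"], [0, 1, 2])
```

Under these assumptions, a search procedure such as scatter search would prefer candidate biclusters with a lower merit value, so `good` ranks ahead of `bad` both on expression coherence and on functional similarity.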
The effect of two different gene pairwise GO measures on the performance of the algorithm is analyzed. First, three well-known yeast datasets with approximately one thousand genes are studied. Second, a group of human datasets related to clinical cancer data is also explored by the algorithm. Most of these are high-dimensional datasets composed of a very large number of genes. The resulting biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective.
It can be concluded that integrating biological information improves the performance of the biclustering process. Both GO measures studied improve the results obtained for the yeast datasets. However, for datasets composed of a very large number of genes, only one of them actually improves the algorithm's performance. This second case constitutes a clear option for exploring datasets that are interesting from a clinical point of view.