Parraga-Alava Jorge, Dorn Marcio, Inostroza-Ponta Mario
1Centre for Biotechnology and Bioengineering (CeBiB), Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Av. Ecuador 3659, Santiago, Chile.
2Carrera de Computación, Escuela Superior Politécnica Agropecuaria de Manabí Manuel Félix López, Campus Politécnico Sitio El Limón, Calceta, Ecuador.
BioData Min. 2018 Aug 7;11:16. doi: 10.1186/s13040-018-0178-4. eCollection 2018.
Biologists aim to understand the genetic background of diseases, metabolic disorders or any other genetic condition. Microarrays are one of the main high-throughput technologies for collecting information about the behaviour of genetic information under different conditions. Clustering is one of the main techniques used to analyse these data; it aims to find groups of genes that share some criterion, such as a similar expression profile. However, the problem of finding groups is normally multidimensional, making it necessary to approach clustering as a multi-objective problem in which several cluster validity indexes are optimised simultaneously. These indexes are usually based on criteria like compactness and separation, which may not be sufficient, since they cannot guarantee clusters that have both similar expression patterns and biological coherence.
We propose a Multi-Objective Clustering algorithm Guided by a-Priori Biological Knowledge (MOC-GaPBK) to find clusters of genes with high levels of co-expression and biological coherence, as well as good compactness and separation. Cluster quality indexes are used to simultaneously optimise gene relationships at the expression level and at the level of biological functionality. Our proposal also includes intensification and diversification strategies to improve the search process.
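The idea of scoring one partition against two sources of information at once can be illustrated with a minimal sketch. This is not the MOC-GaPBK implementation: the abstract does not name the cluster quality indexes used, so the mean within-cluster pairwise distance below is a stand-in, and the function names and the two toy distance matrices (one for expression, one for biological knowledge) are hypothetical.

```python
from itertools import combinations


def mean_within_cluster_distance(labels, dist):
    """Average pairwise distance between genes that share a cluster (lower is better).

    Stand-in index chosen for illustration; not the index used in the paper.
    """
    pairs = [(i, j) for i, j in combinations(range(len(labels)), 2)
             if labels[i] == labels[j]]
    if not pairs:
        return 0.0
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def objectives(labels, expr_dist, bio_dist):
    """Score one partition on both objectives (both are minimised)."""
    return (mean_within_cluster_distance(labels, expr_dist),
            mean_within_cluster_distance(labels, bio_dist))


def dominates(a, b):
    """Pareto dominance for minimisation: a is no worse everywhere, better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates, expr_dist, bio_dist):
    """Keep the candidate partitions that no other candidate dominates."""
    scored = [(labels, objectives(labels, expr_dist, bio_dist))
              for labels in candidates]
    return [(labels, score) for labels, score in scored
            if not any(dominates(other, score) for _, other in scored)]


# Hypothetical toy data: four genes, two symmetric distance matrices.
expr_dist = [[0.0, 0.1, 0.9, 0.8],
             [0.1, 0.0, 0.7, 0.9],
             [0.9, 0.7, 0.0, 0.2],
             [0.8, 0.9, 0.2, 0.0]]
bio_dist = [[0.0, 0.6, 0.2, 0.9],
            [0.6, 0.0, 0.8, 0.3],
            [0.2, 0.8, 0.0, 0.7],
            [0.9, 0.3, 0.7, 0.0]]

# Three candidate partitions of the four genes into two clusters.
candidates = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]]
front = pareto_front(candidates, expr_dist, bio_dist)
front_labels = [labels for labels, _ in front]
```

In this toy case the first partition is best on the expression objective and the second is best on the biological objective, so neither dominates the other and both stay on the front, while the third is worse than the first on both objectives and is discarded. A multi-objective search returns such a set of trade-off solutions rather than a single partition.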
The effectiveness of the proposed algorithm is demonstrated on four publicly available datasets. Comparative studies of the use of different objective functions and other widely used microarray clustering techniques are reported. Statistical, visual and biological significance tests are carried out to show the superiority of the proposed algorithm.
Integrating a-priori biological knowledge into a multi-objective approach and using intensification and diversification strategies allow the proposed algorithm to find solutions with higher quality than other microarray clustering techniques available in the literature in terms of co-expression, biological coherence, compactness and separation.