Saha Sriparna, Alok Abhay Kumar, Ekbal Asif
IEEE J Biomed Health Inform. 2016 Jul;20(4):1171-7. doi: 10.1109/JBHI.2015.2451735. Epub 2015 Jul 20.
Studying the patterns hidden in gene-expression data helps in understanding the functionality of genes. Clustering techniques are widely used to identify natural partitionings in gene-expression data. Feature selection is a key issue in constraining dimensionality, because not all features are important from a clustering point of view. Moreover, a limited amount of supervised information can help to fine-tune the obtained clustering solution. In this paper, the problem of simultaneous feature selection and semisupervised clustering is formulated as a multiobjective optimization (MOO) task. AMOSA, a modern simulated annealing-based MOO technique, is used as the underlying optimization method. Features and cluster centers are encoded together in the form of a string, and genes are assigned to clusters using a point symmetry-based distance. Six optimization criteria, based on several internal and external cluster validity indices, are optimized. The supervised information is generated with a popular clustering technique, fuzzy C-means. The appropriate subset of features, the proper number of clusters, and the proper partitioning are determined using the search capability of AMOSA. The effectiveness of the proposed semisupervised clustering technique, Semi-FeaClustMOO, is demonstrated on five publicly available benchmark gene-expression datasets. Comparisons with existing techniques for gene-expression data clustering show the superiority of the proposed technique. Statistical and biological significance tests have also been carried out.
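The string encoding and the symmetry-based assignment step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the layout of the string (a feature mask followed by flattened cluster centers), the function names `decode`, `ps_distance`, and `assign`, and the parameter `knear`. The distance follows one commonly cited form of point symmetry-based distance (mean distance from the reflected point to its nearest neighbours, scaled by the Euclidean distance to the center). The number of clusters is fixed here, whereas the abstract says it is determined by the search, and the AMOSA search and the six objectives are omitted.

```python
import numpy as np


def decode(chromosome, n_features, n_clusters):
    """Split a flat string into a boolean feature mask and cluster centers.

    Assumed layout: the first n_features entries encode the mask
    (values above 0.5 mean "selected"); the rest are the centers.
    """
    mask = chromosome[:n_features] > 0.5
    centers = chromosome[n_features:].reshape(n_clusters, n_features)
    return mask, centers


def ps_distance(x, center, data, knear=2):
    """Assumed form of a point symmetry-based distance of x from center."""
    # Reflect x through the center; a symmetric cluster should contain
    # data points close to this reflected position.
    reflected = 2.0 * center - x
    dists = np.linalg.norm(data - reflected, axis=1)
    # Mean distance from the reflected point to its knear nearest neighbours.
    d_sym = np.sort(dists)[:knear].mean()
    # Scale by the Euclidean distance between x and the center.
    return d_sym * np.linalg.norm(x - center)


def assign(data, chromosome, n_clusters, knear=2):
    """Assign each row of data to the center with the smallest distance,
    using only the features selected by the mask."""
    n_features = data.shape[1]
    mask, centers = decode(chromosome, n_features, n_clusters)
    sub = data[:, mask]
    sub_centers = centers[:, mask]
    labels = np.array([
        int(np.argmin([ps_distance(x, c, sub, knear) for c in sub_centers]))
        for x in sub
    ])
    return mask, labels
```

In this sketch a candidate solution is a single NumPy vector, so a search procedure could perturb the feature mask and the centers together and then score the labels returned by `assign` against whichever validity indices are chosen.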