Department of Electronics and Communication Engineering, National Institute of Technology Calicut, Kerala 673601, India.
Department of Computer Science and Engineering, National Institute of Technology Calicut, Kerala 673601, India.
Comput Biol Med. 2017 Dec 1;91:213-221. doi: 10.1016/j.compbiomed.2017.10.014. Epub 2017 Oct 23.
Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids.
We propose an improved, density-based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select, as the initial centroids, data points that belong to dense regions and are adequately separated in feature space.
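The idea of choosing dense, well-separated points as initial centroids can be illustrated with a minimal sketch. The abstract does not specify how density or separation is measured, so everything below is an assumption made for illustration and not the authors' actual procedure: density is taken as the number of neighbours within a hypothetical `radius`, separation is enforced with a hypothetical `min_sep` threshold, and the function names are invented. Because no step involves randomness, repeated runs on the same data return the same clustering.

```python
import math

def density_based_init(points, k, radius, min_sep):
    """Pick k initial centroids deterministically (illustrative assumption,
    not the published method): prefer points with many neighbours within
    `radius`, and require each pick to lie at least `min_sep` from earlier picks."""
    n = len(points)
    # Density of a point = number of other points within `radius`.
    density = [
        sum(1 for j in range(n) if j != i and math.dist(points[i], points[j]) <= radius)
        for i in range(n)
    ]
    # Highest density first; ties broken by index so the order is reproducible.
    order = sorted(range(n), key=lambda i: (-density[i], i))
    chosen = []
    for i in order:
        if all(math.dist(points[i], points[j]) >= min_sep for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    if len(chosen) < k:
        raise ValueError("could not find k well-separated dense points")
    return [points[i] for i in chosen]

def kmeans(points, centroids, max_iter=100):
    """Standard Lloyd iterations starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two compact groups far apart in a 2-D feature space.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
init = density_based_init(points, k=2, radius=1.5, min_sep=5.0)
labels, centroids = kmeans(points, init)
print(init, labels, centroids)
```

In this sketch `radius` and `min_sep` must be chosen by the user; a practical method would need a principled way to set them, which the abstract does not describe.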
We compared the proposed algorithm with eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm used for cancer data classification, based on their performance on ten cancer gene expression datasets. The proposed algorithm showed better overall performance than the others.
There is a pressing need in the biomedical domain for simple, easy-to-use and more accurate machine learning tools for cancer subtype prediction. The proposed algorithm is simple, easy to use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data.