Liao Longlong, Li Kenli, Li Keqin, Yang Canqun, Tian Qi
College of Computer, National University of Defense Technology, Sanyi Road, Changsha, China.
State Key Laboratory of High Performance Computing, Sanyi Road, Changsha, China.
BMC Syst Biol. 2018 Nov 22;12(Suppl 6):111. doi: 10.1186/s12918-018-0630-6.
Although many bioinformatics datasets are available for clustering, a large share of them are incomplete: some data samples lack attribute values that clustering algorithms need. Many clustering algorithms have been proposed in recent years, but they are usually limited to clustering complete datasets. In addition, conventional clustering algorithms struggle to balance accuracy against efficiency, because many essential parameters are set according to the user's experience.
This paper proposes MKDCI, a Multiple Kernel Density Clustering algorithm for Incomplete datasets. The MKDCI algorithm consists of recovering the missing attribute values of the input data samples, learning an optimally combined kernel for clustering the input dataset, reducing dimensionality with the optimal kernel built from multiple basis kernels, detecting cluster centroids with the Isolation Forests method, assigning samples to clusters of arbitrary shape, and visualizing the results.
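The first two stages of the pipeline described above can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: missing values are filled with column means, the basis kernels are Gaussian kernels with hand-picked bandwidths, and the combination weights are fixed and uniform rather than learned. The function names `impute_column_means`, `gaussian_kernel`, and `combined_kernel_matrix` are hypothetical and are not taken from the paper.

```python
import math


def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column.

    Assumption: column-mean imputation stands in for the recovery step,
    whose actual method is not described in the abstract.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) similarity between two samples for bandwidth sigma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def combined_kernel_matrix(rows, sigmas, weights=None):
    """Weighted sum of Gaussian basis kernels, one per bandwidth in sigmas.

    Assumption: uniform weights are used when none are given. MKDCI learns
    the weights; that learning step is not reproduced here.
    """
    if weights is None:
        weights = [1.0 / len(sigmas)] * len(sigmas)
    n = len(rows)
    return [[sum(w * gaussian_kernel(rows[i], rows[j], s)
                 for w, s in zip(weights, sigmas))
             for j in range(n)]
            for i in range(n)]


# Toy dataset with two missing attribute values (None).
data = [[1.0, 2.0], [None, 2.5], [8.0, None], [9.0, 9.5]]
filled = impute_column_means(data)
K = combined_kernel_matrix(filled, sigmas=[0.5, 1.0, 2.0])
```

The resulting matrix `K` is symmetric with a unit diagonal, which is what a later kernel-based dimensionality reduction step would take as input. Centroid detection with Isolation Forests and cluster assignment are left out of this sketch.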
Extensive experiments on several well-known clustering datasets in the bioinformatics field demonstrate the effectiveness of the proposed MKDCI algorithm. Compared with existing density clustering algorithms and parameter-free clustering algorithms, MKDCI tends to automatically produce clusters of better quality on incomplete bioinformatics datasets.