IEEE Trans Neural Netw Learn Syst. 2017 Dec;28(12):2936-2948. doi: 10.1109/TNNLS.2016.2608354. Epub 2016 Sep 27.
Cluster validation, the process of evaluating the quality of clustering results, plays an important role in practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Unlike previous studies, which mainly focused on attribute-value data, this paper addresses the cluster validation problem for categorical sequences. Evaluating sequence clustering is currently difficult because no internal validation criterion has been defined with regard to the structural features hidden in sequences. To solve this problem, a novel cluster validity index (CVI) is proposed as a function of the clustering; it linearly combines intracluster structural compactness and intercluster structural separation to measure the quality of sequence clusters. A partition-based algorithm for robust clustering of categorical sequences is also proposed; it supplies the new measure with high-quality clustering results through deterministic initialization and the elimination of noise clusters using an information-theoretic method. The new clustering algorithm and the CVI are then combined within the common model selection procedure to determine the number of clusters in categorical sequence sets. A case study on commonly used protein sequences and experimental results on real-world sequence sets from different domains demonstrate the performance of the proposed method.
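The model selection procedure described above (cluster for each candidate number of clusters, score each result with an index that linearly combines compactness and separation, keep the best score) can be illustrated with a minimal Python sketch. This is not the paper's method: the abstract does not give the structural measures, so bigram frequency profiles, Euclidean distance, farthest-first initialization, the `weight` parameter, and all function names here are assumptions made for illustration, and the information-theoretic elimination of noise clusters is omitted.

```python
from collections import Counter
from itertools import combinations
import math

def bigram_profile(seq):
    """Relative bigram frequencies (assumed stand-in for structural features)."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def dist(p, q):
    """Euclidean distance between two sparse profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def centroid(profiles):
    """Component-wise mean of a non-empty list of sparse profiles."""
    keys = set().union(*profiles)
    return {k: sum(p.get(k, 0.0) for p in profiles) / len(profiles) for k in keys}

def cluster(profiles, k, iters=50):
    """Toy partition clustering with deterministic farthest-first initialization."""
    centers = [profiles[0]]
    while len(centers) < k:
        centers.append(max(profiles, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in profiles]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(profiles, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = centroid(members)
    return labels, centers

def cvi(profiles, labels, centers, weight=0.5):
    """Linear combination of separation and compactness; higher is better (assumed)."""
    compactness = sum(dist(p, centers[l]) for p, l in zip(profiles, labels)) / len(profiles)
    pairs = list(combinations(centers, 2))  # requires at least two clusters
    separation = sum(dist(a, b) for a, b in pairs) / len(pairs)
    return weight * separation - (1 - weight) * compactness

def select_k(seqs, k_values):
    """Return (best_k, labels) maximizing the index over the candidate k values."""
    profiles = [bigram_profile(s) for s in seqs]
    best = None
    for k in k_values:
        labels, centers = cluster(profiles, k)
        score = cvi(profiles, labels, centers)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best[1], best[2]

# Hypothetical toy data: two visibly different sequence families.
toy = ["ABABABAB", "ABABAB", "BABABABA", "CDCDCDCD", "DCDCDC", "CDCDCD"]
best_k, best_labels = select_k(toy, range(2, 5))
print(best_k, best_labels)
```

An index of this simple form can favor over-partitioning on real data, since compactness shrinks as the number of clusters grows; the weighting and the two structural terms would need to follow the definitions in the paper itself.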