IEEE Trans Cybern. 2019 May;49(5):1657-1668. doi: 10.1109/TCYB.2018.2809562. Epub 2018 Mar 27.
Temporal data clustering provides underpinning techniques for discovering intrinsic structures, which have proved important for condensing or summarizing information in many fields of information science, ranging from time series analysis to sequential data understanding. In this paper, we propose a novel hidden Markov model (HMM)-based hybrid meta-clustering ensemble with a bi-weighting scheme to solve the problems of initialization and model selection associated with temporal data clustering. To improve the performance of the ensemble techniques, the proposed bi-weighting scheme adaptively examines the partition process and hence optimizes the fusion of consensus functions. Specifically, three consensus functions are used to combine the input partitions, generated by HMM-based K-models under different initializations, into robust candidate consensus partitions. An optimal consensus partition is then selected from the three candidates by a normalized mutual information-based objective function. Finally, the optimal consensus partition is further refined by an HMM-based agglomerative clustering algorithm together with a dendrogram-based similarity partitioning algorithm, so that the number of clusters is determined automatically and adaptively. Extensive experiments on synthetic data, time series, and real-world motion trajectory datasets show that the proposed approach outperforms all the selected benchmarks and hence shows promise for developing improved clustering tools for information analysis and management.
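The selection step described in the abstract, choosing one consensus partition among several candidates by a normalized mutual information (NMI)-based objective, can be illustrated with a minimal sketch. The abstract does not give the exact objective, so the code below assumes a common formulation: each candidate is scored by its average NMI against the input partitions, with NMI normalized by the geometric mean of the two entropies. The function names (`nmi`, `select_consensus`), the toy label lists, and the candidate names are hypothetical, and the HMM-based K-models clustering, the three consensus functions, and the bi-weighting scheme are not implemented here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a flat list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI normalized by sqrt(H(a) * H(b)); this normalization is an assumption."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # A single-cluster labeling carries no information about the other one.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


def select_consensus(candidates, input_partitions):
    """Return (name, score) of the candidate with the highest average NMI
    against the input partitions (assumed objective, not the paper's exact one)."""
    def score(labels):
        return sum(nmi(labels, p) for p in input_partitions) / len(input_partitions)

    best = max(candidates, key=lambda name: score(candidates[name]))
    return best, score(candidates[best])


# Toy input partitions of six sequences, standing in for runs of a base
# clusterer under different initializations (hypothetical data).
input_partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping as above, labels swapped
    [0, 0, 1, 1, 1, 1],
]

# Three hypothetical candidate consensus partitions.
candidates = {
    "candidate_a": [0, 0, 0, 1, 1, 1],
    "candidate_b": [0, 1, 0, 1, 0, 1],
    "candidate_c": [0, 0, 0, 0, 0, 0],
}

best_name, best_score = select_consensus(candidates, input_partitions)
print(best_name, round(best_score, 3))
```

Because NMI is invariant to label permutation, the first two input partitions count as identical groupings, which is why no label alignment is needed before comparing a candidate with the inputs.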