Ghassempour Shima, Girosi Federico, Maeder Anthony
School of Computing, Engineering and Mathematics, University of Western Sydney, Campbelltown, NSW 2751, Australia.
Centre for Health Research, University of Western Sydney, Campbelltown, NSW 2751, Australia.
Int J Environ Res Public Health. 2014 Mar 6;11(3):2741-63. doi: 10.3390/ijerph110302741.
In this paper we describe an algorithm for clustering multivariate time series whose variables take both categorical and continuous values. Time series of this type are frequent in health care, where they represent the health trajectories of individuals. The problem is challenging because categorical variables make it difficult to define a meaningful distance between trajectories. We propose an approach based on Hidden Markov Models (HMMs): we first map each trajectory into an HMM, then define a suitable distance between HMMs, and finally cluster the HMMs with a method based on a distance matrix. We test our approach on a simulated, but realistic, data set of 1,255 trajectories of individuals aged 45 and over, on a synthetic validation set with known clustering structure, and on a smaller set of 268 trajectories extracted from the longitudinal Health and Retirement Survey. The proposed method can be implemented quite simply using standard packages in R and Matlab. It may be a good candidate for solving the difficult problem of clustering multivariate time series with categorical variables, because it relies on tools that do not require advanced statistical knowledge and are therefore accessible to a wide range of researchers.
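The three steps named in the abstract (trajectory to HMM, distance between HMMs, clustering from a distance matrix) can be illustrated with a minimal Python sketch. Everything specific here is an assumption, because the abstract does not give these details: the HMM parameters are made-up toy values standing in for models that would normally be fitted to each trajectory with a standard package, the distance is a symmetrised per-observation log-likelihood difference (one common choice, not necessarily the one used in the paper), and the clustering step is a simple threshold rule on the distance matrix. The sketch also covers only a single binary categorical variable, not the mixed categorical and continuous case.

```python
import math

def log_likelihood(hmm, obs):
    """Log P(obs | hmm) for a discrete HMM, via the scaled forward algorithm."""
    pi, trans, emit = hmm
    n = len(pi)
    alpha = [pi[s] * emit[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)              # P(o_t | o_1..o_{t-1})
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

def hmm_distance(hmm_i, obs_i, hmm_j, obs_j):
    """Symmetrised log-likelihood distance (an assumed choice of distance)."""
    d_ij = (log_likelihood(hmm_i, obs_i) - log_likelihood(hmm_j, obs_i)) / len(obs_i)
    d_ji = (log_likelihood(hmm_j, obs_j) - log_likelihood(hmm_i, obs_j)) / len(obs_j)
    # Clipped at zero: the toy parameters below are not maximum-likelihood fits,
    # so the raw value could be slightly negative.
    return max(0.0, 0.5 * (d_ij + d_ji))

def threshold_clusters(dist, threshold):
    """Group items that are linked by a chain of distances <= threshold."""
    n = len(dist)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and dist[i][j] <= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Hypothetical binary trajectories (0 = healthy, 1 = ill), one per individual.
trajectories = [
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 1, 1, 1],
]

# Step 1 (assumed already done): one HMM per trajectory as (pi, trans, emit).
# These numbers are illustrative, not fitted.
trans = [[0.9, 0.1], [0.1, 0.9]]
hmms = [
    ([0.5, 0.5], trans, [[0.90, 0.10], [0.80, 0.20]]),
    ([0.5, 0.5], trans, [[0.85, 0.15], [0.75, 0.25]]),
    ([0.5, 0.5], trans, [[0.10, 0.90], [0.20, 0.80]]),
    ([0.5, 0.5], trans, [[0.15, 0.85], [0.25, 0.75]]),
]

# Step 2: pairwise distance matrix between the HMMs.
n = len(hmms)
dist = [[hmm_distance(hmms[i], trajectories[i], hmms[j], trajectories[j])
         for j in range(n)] for i in range(n)]

# Step 3: cluster from the distance matrix alone.
labels = threshold_clusters(dist, threshold=0.5)
print(labels)
```

Because the clustering step sees only the distance matrix, it could be swapped for any other distance-based method (for example hierarchical clustering or partitioning around medoids) without touching the first two steps.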