Clustering of recurrent events data.

Author information

Babykina G, Vandewalle V, Carretero-Bravo J

Affiliations

ULR 2694 - METRICS - Évaluation des Technologies de Santé et des Pratiques Médicales, CHU Lille, Université de Lille, Lille, France.

Université Côte d'Azur, Inria, CNRS, Laboratoire J.A.Dieudonné, Maasai team, Nice, France.

Publication information

J Appl Stat. 2025 Jan 28;52(11):2031-2059. doi: 10.1080/02664763.2025.2452966. eCollection 2025.

DOI:10.1080/02664763.2025.2452966

PMID:40904952

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12404095/

Abstract

Data are now often timestamped, so when analysing events that may occur several times (recurrent events), it is desirable to model the whole dynamics of the counting process rather than to focus on the total number of events. Such data arise in hospital readmissions, disease recurrences and repeated failures of industrial systems. Recurrent events can be analysed in the counting process framework, as in the Andersen-Gill model, which assumes, as in the Cox model, that the event intensity depends on time and on covariates. However, observed covariates are often insufficient to explain the heterogeneity observed in the data. We propose a mixture model for recurrent events that accounts for unobserved heterogeneity and clusters individuals, that is, performs an unsupervised classification that partitions the heterogeneous data according to unobserved (latent) variables. Within each cluster, the intensity of the recurrent event process is specified parametrically and adjusted for covariates. Model parameters are estimated by maximum likelihood using the EM algorithm, and the BIC is used to choose the number of clusters. The feasibility of the model is checked on simulated data. Real data on hospital readmissions of elderly people, which motivated the development of the proposed clustering model, are then analysed. The results give a detailed picture of the recurrent event process in each cluster.
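To make the estimation scheme concrete, here is a minimal sketch of EM plus BIC for a mixture of recurrent event processes. It is not the authors' implementation and assumes a much simpler model than the one described above: each cluster has a constant (homogeneous Poisson) intensity and no covariates, so only the event count and follow-up length of each individual enter the likelihood. The function name `fit_poisson_mixture`, its arguments and the initialisation are hypothetical, and the BIC is written in the lower-is-better convention, which may differ from the paper's.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_poisson_mixture(counts, exposures, n_clusters, n_iter=200):
    """EM for a mixture of homogeneous Poisson processes (illustrative sketch).

    counts[i]    : number of events observed for individual i
    exposures[i] : follow-up length for individual i
    """
    n = len(counts)
    pooled = sum(counts) / sum(exposures)
    # Assumed initialisation: spread starting rates around the pooled rate.
    rates = [pooled * (0.5 + (k + 0.5) / n_clusters) for k in range(n_clusters)]
    weights = [1.0 / n_clusters] * n_clusters

    def log_joint(i):
        # log(pi_k) + n_i * log(lambda_k) - lambda_k * tau_i for each cluster k
        return [
            math.log(weights[k])
            + counts[i] * math.log(rates[k])
            - rates[k] * exposures[i]
            for k in range(n_clusters)
        ]

    for _ in range(n_iter):
        # E-step: posterior cluster membership probabilities.
        resp = []
        for i in range(n):
            lj = log_joint(i)
            norm = _log_sum_exp(lj)
            resp.append([math.exp(v - norm) for v in lj])
        # M-step: closed-form updates (weighted events / weighted time at risk).
        new_weights, new_rates = [], []
        for k in range(n_clusters):
            mass = sum(resp[i][k] for i in range(n))
            events = sum(resp[i][k] * counts[i] for i in range(n))
            at_risk = sum(resp[i][k] * exposures[i] for i in range(n))
            new_weights.append(max(mass / n, 1e-12))
            new_rates.append(max(events / max(at_risk, 1e-12), 1e-12))
        total = sum(new_weights)
        weights = [w / total for w in new_weights]
        rates = new_rates

    log_lik = sum(_log_sum_exp(log_joint(i)) for i in range(n))
    n_params = 2 * n_clusters - 1  # K rates + (K - 1) free mixing proportions
    bic = -2.0 * log_lik + n_params * math.log(n)
    return {"weights": weights, "rates": rates, "log_lik": log_lik, "bic": bic}
```

Under these assumptions, one would fit the model for several candidate values of `n_clusters` and keep the one with the smallest `bic`. A time-dependent, covariate-adjusted intensity, as in the abstract, would generally replace the closed-form M-step with a numerical maximisation.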
