Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA.
Stat Med. 2024 Nov 20;43(26):4913-4927. doi: 10.1002/sim.10192. Epub 2024 Sep 9.
Clustering functional data aims to identify distinct functional patterns over the entire domain, but this can be challenging because phase variability distorts the observed patterns. Curve registration can be used to remove this variability, but determining the appropriate level of warping flexibility can be complicated. Curve registration also requires a target to which a functional object is aligned, typically the cross-sectional mean of the functional objects within the same cluster. However, this mean is unknown prior to clustering. Furthermore, there is a trade-off between flexible warping and the number of resulting clusters. Removing more phase variability through curve registration leaves less variation in the functional data, resulting in a smaller number of clusters. Thus, the optimal number of clusters and the warping flexibility cannot be uniquely identified. We propose using external information to solve this identification issue. We define a cross-validated Kullback-Leibler information criterion to select the number of clusters and the warping penalty. The criterion is derived from the predictive classification likelihood, considers the joint distribution of the functional data and the external variable, and penalizes uncertainty in cluster membership. We evaluate our method through simulation and apply it to electrocardiographic data collected in the Chronic Renal Insufficiency Cohort study. We identify two distinct clusters of electrocardiogram (ECG) profiles, with the second cluster exhibiting ST segment depression, an indication of cardiac ischemia, compared to the normal ECG profiles in the first cluster.
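The abstract's remark that phase variability distorts observed patterns, and that registration needs a target such as the cross-sectional mean, can be illustrated with a toy sketch. This is not the authors' method: the curve shape, the shift distribution, and the helper names `simulate_curves` and `shift_register` are all assumptions made for illustration, and the "registration" here is a simple peak alignment by circular shift rather than a penalized warping.

```python
import numpy as np


def simulate_curves(n_curves=50, n_points=201, shift_sd=0.08, seed=0):
    """Hypothetical data: identical Gaussian bumps with random time shifts."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_points)
    shifts = rng.normal(0.0, shift_sd, size=n_curves)
    # Each row is one curve; only the bump location (phase) varies.
    curves = np.exp(-0.5 * ((t[None, :] - 0.5 - shifts[:, None]) / 0.05) ** 2)
    return t, curves


def shift_register(curves):
    """Toy registration: move each curve's peak to the centre of the grid.

    np.roll shifts circularly; the bump tails are close to zero here,
    so the wrap-around is negligible for this illustration.
    """
    centre = curves.shape[1] // 2
    aligned = np.empty_like(curves)
    for i, y in enumerate(curves):
        aligned[i] = np.roll(y, centre - int(np.argmax(y)))
    return aligned


t, curves = simulate_curves()
raw_mean = curves.mean(axis=0)                      # cross-sectional mean before alignment
aligned_mean = shift_register(curves).mean(axis=0)  # cross-sectional mean after alignment

print("peak of raw mean:    ", raw_mean.max())
print("peak of aligned mean:", aligned_mean.max())
```

In this sketch the unaligned mean is a flattened, widened version of the individual bumps, while the aligned mean recovers their height. A rigid shift removes all phase variation in this toy setting; with flexible warping, how much variation to remove is exactly the tuning question the abstract raises.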