A self-supervised learning-based approach to clustering multivariate time-series data with missing values (SLAC-Time): An application to TBI phenotyping.

Affiliations

Department of Systems and Industrial Engineering, University of Arizona, Tucson, AZ, USA.

College of Medicine, University of Cincinnati, Cincinnati, OH, USA.

Publication information

J Biomed Inform. 2023 Jul;143:104401. doi: 10.1016/j.jbi.2023.104401. Epub 2023 May 22.

DOI:10.1016/j.jbi.2023.104401

PMID:37225066

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10527271/

Abstract

Self-supervised learning approaches provide a promising direction for clustering multivariate time-series data. However, real-world time-series data often include missing values, and existing approaches require imputing missing values before clustering, which may incur extensive computation, introduce noise, and lead to invalid interpretations. To address these challenges, we present a Self-supervised Learning-based Approach to Clustering multivariate Time-series data with missing values (SLAC-Time). SLAC-Time is a Transformer-based clustering method that uses time-series forecasting as a proxy task for leveraging unlabeled data and learning more robust time-series representations. This method jointly learns the neural network parameters and the cluster assignments of the learned representations. It iteratively clusters the learned representations with the K-means method and then uses the resulting cluster assignments as pseudo-labels to update the model parameters. To evaluate our proposed approach, we applied it to clustering and phenotyping Traumatic Brain Injury (TBI) patients in the Transforming Research and Clinical Knowledge in Traumatic Brain Injury (TRACK-TBI) study. Clinical data associated with TBI patients are often measured over time and represented as time-series variables characterized by missing values and irregular time intervals. Our experiments demonstrate that SLAC-Time outperforms the baseline K-means clustering algorithm in terms of silhouette coefficient, Calinski-Harabasz index, Dunn index, and Davies-Bouldin index. We identified three TBI phenotypes that are distinct from one another in terms of clinically significant variables as well as clinical outcomes, including the Extended Glasgow Outcome Scale (GOSE) score, Intensive Care Unit (ICU) length of stay, and mortality rate. The experiments show that the TBI phenotypes identified by SLAC-Time can potentially be used for developing targeted clinical trials and therapeutic strategies.
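
The alternation described in the abstract (cluster the learned representations with K-means, then use the cluster assignments as pseudo-labels to update the model) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a tiny linear-tanh encoder standing in for the Transformer, a classification head that is re-initialized each round, and fixed-length feature vectors as input. It omits the forecasting proxy task and any handling of missing values or irregular sampling. All names (`kmeans`, `pseudo_label_clustering`, `d_rep`, `rounds`, `steps`, `lr`) are hypothetical.

```python
import numpy as np


def kmeans(Z, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns integer labels of shape (n,)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = Z[labels == j].mean(axis=0)
    return labels


def pseudo_label_clustering(X, k, d_rep=4, rounds=5, steps=50, lr=0.1, seed=0):
    """Alternate K-means on representations with pseudo-label training."""
    rng = np.random.default_rng(seed)
    # Assumption: a linear-tanh encoder replaces the Transformer encoder.
    W = rng.normal(scale=0.1, size=(X.shape[1], d_rep))
    for _ in range(rounds):
        # Step 1: cluster the current representations.
        labels = kmeans(np.tanh(X @ W), k, seed=seed)
        Y = np.eye(k)[labels]  # one-hot pseudo-labels
        # Assumption: fresh head each round, since cluster indices are arbitrary.
        V = rng.normal(scale=0.1, size=(d_rep, k))
        # Step 2: gradient descent on cross-entropy against the pseudo-labels.
        for _ in range(steps):
            Z = np.tanh(X @ W)
            logits = Z @ V
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)
            G = (P - Y) / len(X)                  # dLoss/dlogits
            dV = Z.T @ G
            dW = X.T @ ((G @ V.T) * (1 - Z ** 2))  # backprop through tanh
            V -= lr * dV
            W -= lr * dW
    return kmeans(np.tanh(X @ W), k, seed=seed), W


# Usage on synthetic data (random features, purely illustrative).
X = np.random.default_rng(1).normal(size=(60, 6))
labels, W = pseudo_label_clustering(X, k=3)
print(np.bincount(labels, minlength=3))  # cluster sizes
```

In this sketch the pseudo-labels are recomputed at the start of every round, so the encoder and the cluster assignments are refined together rather than in a single pass.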
