Unsupervised clustering of longitudinal clinical measurements in electronic health records.

Authors

Mariam Arshiya, Javidi Hamed, Zabor Emily C, Zhao Ran, Radivoyevitch Tomas, Rotroff Daniel M

Affiliations

Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, United States of America.

Center for Quantitative Metabolic Research, Cleveland Clinic, Cleveland, Ohio, United States of America.

Publication information

PLOS Digit Health. 2024 Oct 15;3(10):e0000628. doi: 10.1371/journal.pdig.0000628. eCollection 2024 Oct.

DOI:10.1371/journal.pdig.0000628

PMID:39405315

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11478862/

Abstract

Longitudinal electronic health records (EHR) can be utilized to identify patterns of disease development and progression in real-world settings. Unsupervised temporal matching algorithms are being repurposed to EHR from signal processing- and protein-sequence alignment tasks where they have shown immense promise for gaining insight into disease. The robustness of these algorithms for classifying EHR clinical data remains to be determined. Timeseries compiled from clinical measurements, such as blood pressure, have far more irregularity in sampling and missingness than the data for which these algorithms were developed, necessitating a systematic evaluation of these methods. We applied 30 state-of-the-art unsupervised machine learning algorithms to 6,912 systematically generated simulated clinical datasets across five parameters. These algorithms included eight temporal matching algorithms with fourteen partitional and eight fuzzy clustering methods. Nemenyi tests were used to determine differences in accuracy using the Adjusted Rand Index (ARI). Dynamic time warping and its lower-bound variants had the highest accuracies across all cohorts (median ARI>0.70). All 30 methods were better at discriminating classes with differences in magnitude compared to differences in trajectory shapes. Missingness impacted accuracies only when classes were different by trajectory shape. The method with the highest ARI was then used to cluster a large pediatric metabolic syndrome (MetS) cohort (N = 43,426). We identified three unique childhood BMI patterns with high average cluster consensus (>70%). The algorithm identified a cluster with consistently high BMI which had the greatest risk of MetS, consistent with prior literature (OR = 4.87, 95% CI: 3.93-6.12). While these algorithms have been shown to have similar accuracies for regular timeseries, their accuracies in clinical applications vary substantially in discriminating differences in shape and especially with moderate to high missingness (>10%). This systematic assessment also shows that the most robust algorithms tested here can derive meaningful insights from longitudinal clinical data.
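For readers unfamiliar with the two quantities the abstract leans on, the following is a minimal illustrative sketch, not the authors' implementation. It shows the textbook dynamic time warping (DTW) distance, which can compare series of unequal length, and the Adjusted Rand Index (ARI) computed from a contingency table. The function names `dtw_distance` and `adjusted_rand_index`, the absolute-difference local cost, and the toy inputs are assumptions made for illustration; the paper's lower-bound DTW variants, clustering methods, and simulation settings are not reproduced here.

```python
from math import comb, inf


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW with absolute-difference local cost.

    Assumption: unconstrained warping (no window), which is only one of
    several DTW configurations in use.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts of the contingency table.

    Assumption: both label sequences have the same length, at least 2.
    """
    n = len(labels_true)
    cells, rows, cols = {}, {}, {}
    for t, p in zip(labels_true, labels_pred):
        cells[(t, p)] = cells.get((t, p), 0) + 1
        rows[t] = rows.get(t, 0) + 1
        cols[p] = cols.get(p, 0) + 1
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy usage: two trajectories with the same shape but different sampling density.
sparse = [1, 2, 3]
dense = [1, 2, 2, 3]
print(dtw_distance(sparse, dense))
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

In this sketch a repeated measurement does not add to the DTW distance because warping can map both copies to the same point, which is one reason elastic measures are attractive for irregularly sampled clinical series. ARI is invariant to cluster label permutation and is corrected for chance, so a value near 0 indicates agreement no better than random assignment.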

