Unsupervised clustering for sepsis identification in large-scale patient data: a model development and validation study.

Authors

Li Na, Riazi Kiarash, Pan Jie, Thavorn Kednapa, Ziegler Jennifer, Rochwerg Bram, Quan Hude, Prescott Hallie C, Dodek Peter M, Li Bing, Gervais Alain, Garland Allan

Affiliations

Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, CWPH 5E34, 3280 Hospital Dr. NW, Calgary, AB, T2N 4Z6, Canada.

Centre for Health Informatics, University of Calgary, Alberta, Canada.

Publication

Intensive Care Med Exp. 2025 Mar 20;13(1):37. doi: 10.1186/s40635-025-00744-w.

DOI:10.1186/s40635-025-00744-w

PMID:40111645

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11925832/

Abstract

BACKGROUND

Sepsis is a major global health problem, yet it lacks a true reference standard for case identification, which complicates epidemiologic surveillance. Consensus definitions have changed multiple times, clinicians struggle to identify sepsis at the bedside, and differing identification algorithms generate wide variation in incidence rates. The two current identification approaches rely either on codes from administrative data or on electronic health record (EHR)-based algorithms such as the Centers for Disease Control and Prevention (CDC) Adult Sepsis Event (ASE); both have limitations. Our primary purpose here is to report initial steps in developing a novel approach to identifying sepsis using unsupervised clustering methods. Secondarily, we report a preliminary analysis of the resulting clusters, using identification by ASE criteria as a familiar comparator.

METHODS

This retrospective cohort study used hospital administrative and EHR data on adults admitted to intensive care units (ICUs) at five Canadian medical centres (2015-2017), with split development and validation cohorts. After preprocessing 592 variables (demographics, encounter characteristics, diagnoses, medications, laboratory tests, and clinical management) and applying data reduction, we presented 55 principal components to eight different clustering algorithms. An automated elbow method determined the optimal number of clusters, and the optimal algorithm was selected based on clustering metrics for consistency, separation, distribution and stability. Cluster membership in the validation cohort was assigned using an XGBoost model trained to predict cluster membership in the development cohort. For cluster analysis, we prospectively subdivided clusters by their fractions meeting ASE criteria (≥ 50% ASE-majority clusters vs. ASE-minority clusters), and compared their characteristics.
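Two steps of this pipeline, choosing the number of clusters with an automated elbow rule and splitting clusters at the 50% ASE threshold, can be illustrated with a minimal sketch. The abstract does not say which elbow procedure was used, so the "largest gap above the diagonal" heuristic below is an assumption, and the function names `find_elbow` and `split_by_ase` are hypothetical. The example uses only the Python standard library and toy inputs, not study data.

```python
from collections import defaultdict


def find_elbow(ks, scores):
    """Pick the elbow of a decreasing score curve (e.g. within-cluster error).

    Assumed heuristic: rescale both axes to [0, 1] and return the k whose
    point lies farthest above the diagonal. This is one common automated
    elbow rule, not necessarily the one used in the study.
    """
    k_span = ks[-1] - ks[0]
    s_span = scores[0] - scores[-1]
    best_k, best_gap = ks[0], float("-inf")
    for k, s in zip(ks, scores):
        x = (k - ks[0]) / k_span      # fraction of the k range covered
        y = (scores[0] - s) / s_span  # fraction of the total drop achieved
        gap = y - x
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


def split_by_ase(cluster_ids, ase_positive, threshold=0.5):
    """Split clusters into ASE-majority (fraction >= threshold) and ASE-minority."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [n_total, n_ase_positive]
    for cluster, flag in zip(cluster_ids, ase_positive):
        counts[cluster][0] += 1
        counts[cluster][1] += int(flag)
    majority = {c for c, (n, pos) in counts.items() if pos / n >= threshold}
    minority = set(counts) - majority
    return majority, minority


# Toy inputs for illustration only.
toy_ks = [1, 2, 3, 4, 5, 6]
toy_scores = [100, 60, 30, 25, 22, 20]
chosen_k = find_elbow(toy_ks, toy_scores)

toy_clusters = [0, 0, 1, 1, 1, 2, 2]
toy_ase = [1, 1, 0, 0, 1, 1, 0]
ase_majority, ase_minority = split_by_ase(toy_clusters, toy_ase)
```

In the toy data, a cluster in which exactly half the members are ASE (+) counts as ASE-majority, matching the "≥ 50%" rule stated above.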

RESULTS

There were 3660 patients in the development cohort and 3012 in the validation cohort, of whom 21.5% (development) and 19.1% (validation) were ASE (+). The Robust and Sparse K-means Clustering (RSKC) method performed best. In the development cohort, it identified 48 clusters of hospitalizations; 11 ASE-majority clusters contained 22.4% of all patients but 77.8% of all ASE (+) patients. Of the 209 ASE (-) patients in the ASE-majority clusters, 34.9% met more liberal ASE criteria for sepsis. Findings were consistent in the validation cohort.
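The development-cohort percentages can be cross-checked against one another with simple arithmetic. The sketch below reconstructs approximate patient counts from the rounded percentages reported above; because the inputs are rounded, the derived counts are estimates and may differ from the true counts by a patient or two. The variable names are illustrative.

```python
n_dev = 3660                                   # development cohort size

ase_pos = round(0.215 * n_dev)                 # ASE (+) patients
in_majority = round(0.224 * n_dev)             # patients in the 11 ASE-majority clusters
ase_pos_in_majority = round(0.778 * ase_pos)   # ASE (+) patients in those clusters

# ASE (-) patients in ASE-majority clusters, to compare with the reported 209.
ase_neg_in_majority = in_majority - ase_pos_in_majority

# Of the reported 209, the share meeting more liberal ASE criteria.
liberal_ase = round(0.349 * 209)

print(ase_pos, in_majority, ase_pos_in_majority, ase_neg_in_majority, liberal_ase)
```

The reconstructed count of ASE (-) patients in ASE-majority clusters comes out within one patient of the reported 209, which is the agreement expected from percentages rounded to one decimal place.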

CONCLUSIONS

Unsupervised clustering applied to diverse, large-scale medical data offers a promising approach to the identification of sepsis phenotypes for epidemiological surveillance.
