Robust extraction of pneumonia-associated clinical states from electronic health records.

Author information

Department of Engineering Sciences and Applied Math, Northwestern University, Evanston, IL 60208.

Interdisciplinary Biological Sciences Program, Northwestern University, Evanston, IL 60208.

Publication information

Proc Natl Acad Sci U S A. 2024 Nov 5;121(45):e2417688121. doi: 10.1073/pnas.2417688121. Epub 2024 Oct 30.

DOI:10.1073/pnas.2417688121

PMID:39475648

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11551366/

Abstract

Mining of electronic health records (EHR) promises to automate the identification of comprehensive disease phenotypes. However, the realization of this promise is hindered by the unavailability of generalizable ground-truth information, data incompleteness and heterogeneity, and the lack of generalization to multiple cohorts. We present here a data-driven approach to identify clinical states that we implement for 585 critical care patients with suspected pneumonia recruited by the SCRIPT study, which we compare to and integrate with 9,918 pneumonia patients from the MIMIC-IV dataset. We extract and curate from their structured EHRs a primary set of clinical features (53 and 59 features for SCRIPT and MIMIC-IV, respectively), including disease severity scores, vital signs, and so on, with varying degrees of completeness. We aggregate irregular time series to daily frequency, resulting in 12,495 and 94,684 patient-day pairs for SCRIPT and MIMIC-IV, respectively. We define a "common-sense" ground truth that we then use in a semisupervised pipeline to optimize choices for data preprocessing and to reduce the feature space to four principal components. We describe and validate an ensemble-based clustering method that enables us to robustly identify five clinical states, and use a Gaussian mixture model to quantify uncertainty in cluster assignment. Demonstrating the clinical relevance of the identified states, we find that three states are strongly associated with disease outcomes (dying vs. recovering), while the other two reflect disease etiology. The outcome-associated clinical states provide significantly increased discrimination of mortality rates over standard approaches.
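
The daily aggregation step mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the record layout, the feature names, the function name `aggregate_daily`, and the use of the median as the daily summary are all assumptions made for illustration, since the abstract does not specify how measurements within a day are combined.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import median


def aggregate_daily(records):
    """Collapse irregular (patient_id, timestamp, feature, value) records
    into one value per patient, calendar day, and feature.

    The median is an assumed summary statistic, chosen only for illustration.
    """
    # Collect every measurement taken for a patient/day/feature combination.
    buckets = defaultdict(list)
    for patient_id, timestamp, feature, value in records:
        day = datetime.fromisoformat(timestamp).date()
        buckets[(patient_id, day, feature)].append(value)

    # One row per patient-day pair; features never measured that day stay absent.
    daily = defaultdict(dict)
    for (patient_id, day, feature), values in buckets.items():
        daily[(patient_id, day)][feature] = median(values)
    return dict(daily)


# Hypothetical records with irregular sampling times.
records = [
    ("p1", "2024-01-01T03:15:00", "heart_rate", 88.0),
    ("p1", "2024-01-01T11:40:00", "heart_rate", 96.0),
    ("p1", "2024-01-01T22:05:00", "heart_rate", 104.0),
    ("p1", "2024-01-02T08:30:00", "heart_rate", 90.0),
    ("p1", "2024-01-02T08:30:00", "temperature", 38.2),
    ("p2", "2024-01-01T09:00:00", "temperature", 37.1),
]

daily = aggregate_daily(records)
print(len(daily))                         # number of patient-day pairs
print(daily[("p1", date(2024, 1, 1))])    # daily summary for one patient-day
```

In this toy input, features that were not measured on a given day are simply missing from that patient-day row, which mirrors the incompleteness the abstract describes. Any imputation, scaling, or dimensionality reduction would be a separate downstream step.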
