Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
Institute of Lung Health and Immunity and Comprehensive Pneumology Center with the CPC-M bioArchive; Helmholtz Zentrum Munich; member of the German Center for Lung Research (DZL), Munich, Germany.
Nat Med. 2024 Nov;30(11):3369-3380. doi: 10.1038/s41591-024-03214-0. Epub 2024 Sep 12.
With progressive digitalization of healthcare systems worldwide, large-scale collection of electronic health records (EHRs) has become commonplace. However, an extensible framework for comprehensive exploratory analysis that accounts for data heterogeneity is missing. Here we introduce ehrapy, a modular open-source Python framework designed for exploratory analysis of heterogeneous epidemiology and EHR data. ehrapy incorporates a series of analytical steps, from data extraction and quality control to the generation of low-dimensional representations. Complemented by rich statistical modules, ehrapy facilitates associating patients with disease states, differential comparison between patient clusters, survival analysis, trajectory inference, causal inference and more. Leveraging ontologies, ehrapy further enables data sharing and training EHR deep learning models, paving the way for foundational models in biomedical research. We demonstrate ehrapy's features in six distinct examples. We applied ehrapy to stratify patients affected by unspecified pneumonia into finer-grained phenotypes. Furthermore, we reveal biomarkers for significant differences in survival among these groups. Additionally, we quantify medication-class effects of pneumonia medications on length of stay. We further leveraged ehrapy to analyze cardiovascular risks across different data modalities. We reconstructed disease state trajectories in patients with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) based on imaging data. Finally, we conducted a case study to demonstrate how ehrapy can detect and mitigate biases in EHR data. ehrapy, thus, provides a framework that we envision will standardize analysis pipelines on EHR data and serve as a cornerstone for the community.
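Survival analysis, one of the statistical modules named in the abstract, can be illustrated with a Kaplan-Meier estimator. The following is a minimal standard-library sketch, not ehrapy's API: the function name `kaplan_meier`, its arguments, and the toy follow-up data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Estimate survival probabilities at each distinct event time.

    durations: follow-up time per patient (hypothetical toy input)
    events:    1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (time, survival_probability) tuples.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    # Number of observed events at each time point.
    observed = Counter(t for t, e in zip(durations, events) if e)
    # Every patient leaves the risk set at their own follow-up time,
    # whether the event was observed or the record was censored.
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = observed.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy cohort: six patients, two of them censored (event flag 0).
toy_durations = [5, 8, 8, 12, 15, 20]
toy_events = [1, 1, 0, 1, 0, 1]
for time, prob in kaplan_meier(toy_durations, toy_events):
    print(f"t={time}: S(t)={prob:.3f}")
```

Comparing such curves between patient clusters is the kind of question the abstract describes for the pneumonia phenotypes. In practice, a dedicated survival library would be preferable to this hand-rolled estimator.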