Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data.

Authors

Tang Shengpu, Davarmanesh Parmida, Song Yanmeng, Koutra Danai, Sjoding Michael W, Wiens Jenna

Affiliations

Department of Electrical Engineering and Computer Science, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, USA.

Department of Mathematics, University of Michigan, Ann Arbor, USA.

Publication information

J Am Med Inform Assoc. 2020 Dec 9;27(12):1921-1934. doi: 10.1093/jamia/ocaa139.

DOI:10.1093/jamia/ocaa139

PMID:33040151

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7727385/

Abstract

OBJECTIVE

In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied; such preprocessing requires substantial effort and can be labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the EHR.

MATERIALS AND METHODS

Largely data-driven, FIDDLE systematically transforms structured EHR data into feature vectors, limiting the number of decisions a user must make while incorporating good practices from the literature. To demonstrate its utility and flexibility, we conducted a proof-of-concept experiment in which we applied FIDDLE to 2 publicly available EHR data sets collected from intensive care units: MIMIC-III and the eICU Collaborative Research Database. We trained different ML models to predict 3 clinically important outcomes: in-hospital mortality, acute respiratory failure, and shock. We evaluated models using the area under the receiver operating characteristics curve (AUROC), and compared it to several baselines.
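To make the idea of turning structured EHR records into feature vectors concrete, here is a minimal sketch in plain Python. It is not FIDDLE's implementation or API: the record layout `(patient_id, time_h, variable, value)`, the function name `bin_records`, and the choice to keep the last value per time bin with a 0/1 missingness flag are all assumptions made for illustration.

```python
def bin_records(records, window_hours, bin_hours):
    """Turn long-format rows into one fixed-length feature vector per patient.

    records: iterable of (patient_id, time_h, variable, value) tuples
             (a hypothetical layout, not FIDDLE's actual input schema).
    Each variable gets one slot per time bin holding the last observed value
    in that bin, followed by a 0/1 missingness indicator.
    """
    records = list(records)
    n_bins = int(window_hours // bin_hours)
    horizon = n_bins * bin_hours  # ignore any partial bin at the end
    variables = sorted({var for _, _, var, _ in records})
    patients = sorted({pid for pid, _, _, _ in records})

    # latest[(patient, variable, bin)] = (time, value) of the most recent row
    latest = {}
    for pid, t, var, value in records:
        if not 0 <= t < horizon:
            continue  # outside the observation window
        key = (pid, var, int(t // bin_hours))
        if key not in latest or t >= latest[key][0]:
            latest[key] = (t, value)

    # Column names, in the same order the values are emitted below.
    names = []
    for var in variables:
        for b in range(n_bins):
            names += [f"{var}_bin{b}_value", f"{var}_bin{b}_missing"]

    features = {}
    for pid in patients:
        row = []
        for var in variables:
            for b in range(n_bins):
                hit = latest.get((pid, var, b))
                # Placeholder 0.0 plus a flag when nothing was recorded.
                row += [hit[1], 0] if hit else [0.0, 1]
        features[pid] = row
    return names, features


# Toy example: heart-rate readings for two patients, 8-hour window, 4-hour bins.
toy = [
    ("a", 0.5, "hr", 80.0),
    ("a", 0.9, "hr", 90.0),
    ("a", 5.0, "hr", 70.0),
    ("b", 1.0, "hr", 60.0),
]
names, features = bin_records(toy, window_hours=8, bin_hours=4)
```

Every patient ends up with a vector of the same length regardless of how often each variable was measured, which is the property a downstream model needs. The window length and bin size here play the role of the kind of user-defined arguments the abstract mentions, though the actual argument names in FIDDLE may differ.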

RESULTS

Across tasks, FIDDLE extracted 2,528 to 7,403 features from MIMIC-III and eICU, respectively. On all tasks, FIDDLE-based models achieved good discriminative performance, with AUROCs of 0.757-0.886, comparable to the performance of MIMIC-Extract, a preprocessing pipeline designed specifically for MIMIC-III. Furthermore, our results showed that FIDDLE is generalizable across different prediction times, ML algorithms, and data sets, while being relatively robust to different settings of user-defined arguments.
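For readers unfamiliar with the metric, the AUROC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition in plain Python. It is an illustrative helper (the name `auroc` and the toy data are assumptions), not the evaluation code used in the paper.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 outcomes; scores: predicted risk for each case.
    Quadratic in the number of cases, so suited to small examples only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
example = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported range of 0.757 to 0.886.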

CONCLUSIONS

FIDDLE, an open-source preprocessing pipeline, facilitates applying ML to structured EHR data. By accelerating and standardizing labor-intensive preprocessing, FIDDLE can help stimulate progress in building clinically useful ML tools for EHR data.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/16b3/7727385/0ec740a4a5a9/ocaa139f1.jpg

Similar articles

A Multidatabase ExTRaction PipEline (METRE) for facile cross validation in critical care research.

J Biomed Inform. 2023 May;141:104356. doi: 10.1016/j.jbi.2023.104356. Epub 2023 Apr 5.

Open Source Infrastructure for Health Care Data Integration and Machine Learning Analyses.

JCO Clin Cancer Inform. 2019 Aug;3:1-16. doi: 10.1200/CCI.18.00132.

Machine learning-based prediction of clinical outcomes after traumatic brain injury: Hidden information of early physiological time series.

CNS Neurosci Ther. 2024 Jul;30(7):e14848. doi: 10.1111/cns.14848.

Dynamic ElecTronic hEalth reCord deTection (DETECT) of individuals at risk of a first episode of psychosis: a case-control development and validation study.

Lancet Digit Health. 2020 May;2(5):e229-e239. doi: 10.1016/S2589-7500(20)30024-8. Epub 2020 Mar 26.

Combining chest X-rays and electronic health record (EHR) data using machine learning to diagnose acute respiratory failure.

J Am Med Inform Assoc. 2022 May 11;29(6):1060-1068. doi: 10.1093/jamia/ocac030.

EHR-QC: A streamlined pipeline for automated electronic health records standardisation and preprocessing to predict clinical outcomes.

J Biomed Inform. 2023 Nov;147:104509. doi: 10.1016/j.jbi.2023.104509. Epub 2023 Oct 11.

Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records.

Lancet Digit Health. 2020 Apr;2(4):e179-e191. doi: 10.1016/S2589-7500(20)30018-2. Epub 2020 Mar 12.

Unstructured clinical notes within the 24 hours since admission predict short, mid & long-term mortality in adult ICU patients.

PLoS One. 2022 Jan 6;17(1):e0262182. doi: 10.1371/journal.pone.0262182. eCollection 2022.

Mortality prediction for patients with acute respiratory distress syndrome based on machine learning: a population-based study.

Ann Transl Med. 2021 May;9(9):794. doi: 10.21037/atm-20-6624.

Cited by

Developing and validating machine learning models to predict next-day extubation.

Sci Rep. 2025 Jul 29;15(1):27552. doi: 10.1038/s41598-025-12264-4.

EHRchitect: An open-source software tool for medical event sequences data extraction from Electronic Health Records.

J Clin Transl Sci. 2025 Mar 26;9(1):e79. doi: 10.1017/cts.2025.55. eCollection 2025.

Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models.

KDD. 2024 Aug;2024:4607-4618. doi: 10.1145/3637528.3671836. Epub 2024 Aug 24.

Unity in Diversity: Collaborative Pre-training Across Multimodal Medical Sources.

Proc Conf Assoc Comput Linguist Meet. 2024 Aug;2024(Volume 1 Long Papers):3644-3656. doi: 10.18653/v1/2024.acl-long.199.

Recent Advances in Predictive Modeling with Electronic Health Records.

IJCAI (U S). 2024 Aug;2024:8272-8280. doi: 10.24963/ijcai.2024/914.

Reformulating patient stratification for targeting interventions by accounting for severity of downstream outcomes resulting from disease onset: a case study in sepsis.

J Am Med Inform Assoc. 2025 May 1;32(5):905-913. doi: 10.1093/jamia/ocaf036.

Learning and diSentangling patient static information from time-series Electronic hEalth Records (STEER).

PLOS Digit Health. 2024 Oct 21;3(10):e0000640. doi: 10.1371/journal.pdig.0000640. eCollection 2024 Oct.

Knowledge abstraction and filtering based federated learning over heterogeneous data views in healthcare.

NPJ Digit Med. 2024 Oct 16;7(1):283. doi: 10.1038/s41746-024-01272-9.

Automated Fusion of Multimodal Electronic Health Records for Better Medical Predictions.

Proc SIAM Int Conf Data Min. 2024;2024:361-369. doi: 10.1137/1.9781611978032.41.

An open-source framework for end-to-end analysis of electronic health record data.

Nat Med. 2024 Nov;30(11):3369-3380. doi: 10.1038/s41591-024-03214-0. Epub 2024 Sep 12.

References

Do no harm: a roadmap for responsible machine learning for health care.

Nat Med. 2019 Sep;25(9):1337-1340. doi: 10.1038/s41591-019-0548-6. Epub 2019 Aug 19.

A clinically applicable approach to continuous prediction of future acute kidney injury.

Nature. 2019 Aug;572(7767):116-119. doi: 10.1038/s41586-019-1390-1. Epub 2019 Jul 31.

Scalable and accurate deep learning with electronic health records.

NPJ Digit Med. 2018 May 8;1:18. doi: 10.1038/s41746-018-0029-1. eCollection 2018.

Multitask learning and benchmarking with clinical time series data.

Sci Data. 2019 Jun 17;6(1):96. doi: 10.1038/s41597-019-0103-9.

Using Machine Learning and the Electronic Health Record to Predict Complicated Infection.

Open Forum Infect Dis. 2019 Apr 20;6(5):ofz186. doi: 10.1093/ofid/ofz186. eCollection 2019 May.

Machine learning for patient risk stratification for acute respiratory distress syndrome.

PLoS One. 2019 Mar 28;14(3):e0214465. doi: 10.1371/journal.pone.0214465. eCollection 2019.

ShortFuse: Biomedical Time Series Representations in the Presence of Structured Information.

Proc Mach Learn Res. 2017 Aug;68:59-74.

The eICU Collaborative Research Database, a freely available multi-center database for critical care research.

Sci Data. 2018 Sep 11;5:180178. doi: 10.1038/sdata.2018.178.

Benchmarking deep learning models on large healthcare datasets.

J Biomed Inform. 2018 Jul;83:112-134. doi: 10.1016/j.jbi.2018.04.007. Epub 2018 Jun 5.

Leveraging Clinical Time-Series Data for Prediction: A Cautionary Tale.

AMIA Annu Symp Proc. 2018 Apr 16;2017:1571-1580. eCollection 2017.
