Tang Shengpu, Davarmanesh Parmida, Song Yanmeng, Koutra Danai, Sjoding Michael W, Wiens Jenna
Department of Electrical Engineering and Computer Science, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, USA.
Department of Mathematics, University of Michigan, Ann Arbor, USA.
J Am Med Inform Assoc. 2020 Dec 9;27(12):1921-1934. doi: 10.1093/jamia/ocaa139.
Applying machine learning (ML) to electronic health record (EHR) data requires many decisions before any model is trained, and this preprocessing is labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the EHR.
Largely data-driven, FIDDLE systematically transforms structured EHR data into feature vectors, limiting the number of decisions a user must make while incorporating good practices from the literature. To demonstrate its utility and flexibility, we conducted a proof-of-concept experiment in which we applied FIDDLE to 2 publicly available EHR data sets collected from intensive care units: MIMIC-III and the eICU Collaborative Research Database. We trained different ML models to predict 3 clinically important outcomes: in-hospital mortality, acute respiratory failure, and shock. We evaluated the models using the area under the receiver operating characteristic curve (AUROC) and compared their performance against several baselines.
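As an illustration of what transforming structured EHR data into feature vectors can involve, the following is a minimal sketch and not FIDDLE's actual implementation. It assumes a hypothetical long-format input of (patient_id, time_in_hours, variable, value) tuples, averages the values within fixed-width time bins, and adds a missingness indicator for each variable. The function name `bin_records`, the input layout, and the choice of mean aggregation with zero-filling are all assumptions made for this example.

```python
from collections import defaultdict


def bin_records(records, horizon_hours, bin_hours, variables):
    """Turn long-format records into per-patient, time-binned feature rows.

    records: list of (patient_id, t_hours, variable, value) tuples
             (a hypothetical layout chosen for this sketch).
    Returns {patient_id: [row_for_bin_0, row_for_bin_1, ...]}, where each row
    holds, for every variable in order, [mean_value, missing_flag].
    """
    n_bins = int(horizon_hours // bin_hours)

    # Collect all values observed for each (patient, bin, variable) cell.
    cells = defaultdict(list)
    for pid, t, var, value in records:
        if 0 <= t < n_bins * bin_hours and var in variables:
            cells[(pid, int(t // bin_hours), var)].append(value)

    features = {}
    for pid in sorted({r[0] for r in records}):
        rows = []
        for b in range(n_bins):
            row = []
            for var in variables:
                values = cells.get((pid, b, var))
                if values:
                    # Observed: mean of the values, missing flag = 0.
                    row += [sum(values) / len(values), 0]
                else:
                    # Not observed: placeholder 0.0, missing flag = 1.
                    row += [0.0, 1]
            rows.append(row)
        features[pid] = rows
    return features


if __name__ == "__main__":
    demo = [
        ("a", 0.5, "hr", 80.0),
        ("a", 0.7, "hr", 90.0),
        ("a", 2.5, "spo2", 97.0),
        ("b", 1.2, "hr", 60.0),
    ]
    for pid, rows in bin_records(demo, 4, 2, ["hr", "spo2"]).items():
        print(pid, rows)
```

The explicit missingness flag lets a downstream model distinguish "not measured" from "measured as zero", which is one common way to handle irregularly sampled clinical data.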
Across tasks, FIDDLE extracted between 2,528 and 7,403 features from MIMIC-III and eICU. On all tasks, FIDDLE-based models achieved good discriminative performance, with AUROCs of 0.757-0.886, comparable to the performance of MIMIC-Extract, a preprocessing pipeline designed specifically for MIMIC-III. Furthermore, our results showed that FIDDLE generalizes across different prediction times, ML algorithms, and data sets, while being relatively robust to different settings of user-defined arguments.
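For readers less familiar with the metric reported above, AUROC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. It is an illustrative helper (the name `auroc` is an assumption) and not the evaluation code used in the study.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels: sequence of 0/1 outcomes; scores: predicted risk per case.
    Quadratic in the number of cases, so suitable only for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of positive from negative cases.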
FIDDLE, an open-source preprocessing pipeline, facilitates applying ML to structured EHR data. By accelerating and standardizing labor-intensive preprocessing, FIDDLE can help stimulate progress in building clinically useful ML tools for EHR data.