An effective multi-step feature selection framework for clinical outcome prediction using electronic medical records.

Author information

Wang Hongnian, Zhang Mingyang, Mai Liyi, Li Xin, Bellou Abdelouahab, Wu Lijuan

Affiliations

School of Management, Jinan University, Guangzhou, 510632, China.

Key Laboratory of Digital-Intelligent Disease Surveillance and Health Governance, North Sichuan Medical College, Nanchong, 637100, China.

Publication information

BMC Med Inform Decis Mak. 2025 Feb 17;25(1):84. doi: 10.1186/s12911-025-02922-y.

DOI:10.1186/s12911-025-02922-y

PMID:39962480

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11834488/

Abstract

BACKGROUND

Identifying key variables is essential for developing clinical outcome prediction models based on high-dimensional electronic medical records (EMR). However, despite the abundance of feature selection (FS) methods available, challenges remain in choosing the most appropriate method, deciding how many top-ranked variables to include, and ensuring these selections are meaningful from a medical perspective.

METHODS

We developed a practical multi-step FS framework that integrates data-driven statistical inference with a knowledge verification strategy. This framework was validated using two distinct EMR datasets targeting different clinical outcomes. The first cohort, sourced from the Medical Information Mart for Intensive Care III (MIMIC-III), focused on predicting acute kidney injury (AKI) in ICU patients. The second cohort, drawn from the MIMIC-IV Emergency Department (MIMIC-IV-ED), aimed to estimate in-hospital mortality (IHM) for patients transferred from the ED to the ICU. We employed various machine learning (ML) methods and conducted a comparative analysis considering accuracy, stability, similarity, and interpretability. The effectiveness of our FS framework was evaluated using discrimination and calibration metrics, with SHapley Additive exPlanations (SHAP) applied to enhance the interpretability of model decisions.
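The abstract names stability (across sample variations) and inter-method similarity as comparison criteria but does not say how they are computed. As a minimal sketch, assuming both are measured as the mean pairwise Jaccard index between top-k feature subsets, the idea could look like the following. The function names and the toy rankings are hypothetical and are not taken from the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index between two feature collections (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(rankings, k):
    """Average Jaccard index over all pairs of top-k subsets.

    `rankings` is a list of feature-name lists, each ordered from most to
    least important. Passing rankings from resampled data gives a stability
    score; passing rankings from different FS methods gives a similarity score.
    """
    subsets = [ranking[:k] for ranking in rankings]
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two rankings")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy rankings with placeholder feature names (illustrative only).
resample_rankings = [
    ["feat_a", "feat_b", "feat_c", "feat_d"],
    ["feat_a", "feat_c", "feat_b", "feat_e"],
    ["feat_b", "feat_a", "feat_d", "feat_c"],
]

stability_at_2 = mean_pairwise_jaccard(resample_rankings, 2)
stability_at_3 = mean_pairwise_jaccard(resample_rankings, 3)
```

The same function serves both criteria because each reduces to comparing sets of selected features; only the source of the rankings differs.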

RESULTS

Cohort 1 comprised 48,780 ICU encounters, of which 8,883 (18.21%) developed AKI. Cohort 2 included 29,197 transfers from the ED to the ICU, with 3,219 (11.03%) resulting in IHM. Among the ten ML methods evaluated, the tree-based ensemble method achieved the highest accuracy. As the number of top-ranking features increased, the models' accuracy began to stabilize, while feature subset stability (considering sample variations) and inter-method feature similarity reached optimal levels, confirming the validity of the FS framework. The integration of interpretative methods and expert knowledge in the final step further improved feature interpretability. The FS framework effectively reduced the number of features (e.g., from 380 to 35 for Cohort 1, and from 273 to 54 for Cohort 2) without significantly affecting prediction performance (DeLong test, p > 0.05).
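One way to read "accuracy began to stabilize" is as a plateau rule for choosing how many top-ranked features to keep. The sketch below is an assumption, not the authors' procedure: it picks the smallest k whose AUC lies within a fixed tolerance of the best AUC observed. The tolerance is a simple stand-in and is not the DeLong test reported in the abstract; the function name and the AUC values are made up for illustration.

```python
def smallest_k_at_plateau(auc_by_k, tol=0.005):
    """Return the smallest k whose AUC is within `tol` of the best AUC.

    `auc_by_k` maps a feature count k to the AUC of a model trained on the
    top-k features. `tol` is a hypothetical tolerance, not a statistical test.
    """
    if not auc_by_k:
        raise ValueError("auc_by_k must not be empty")
    best = max(auc_by_k.values())
    for k in sorted(auc_by_k):
        if best - auc_by_k[k] <= tol:
            return k


# Made-up AUC values for illustration only (not results from the study).
toy_auc = {10: 0.70, 20: 0.76, 40: 0.80, 80: 0.803, 160: 0.804}

k_loose = smallest_k_at_plateau(toy_auc, tol=0.005)
k_strict = smallest_k_at_plateau(toy_auc, tol=0.002)
```

A tighter tolerance keeps more features, which makes the trade-off between dimensionality and discrimination explicit before any formal comparison of ROC curves.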

CONCLUSION

The multi-step FS method developed in this study successfully reduces the dimensionality of features in EMR while preserving the accuracy of clinical outcome prediction. Furthermore, it improves the interpretability of risk factors by incorporating expert knowledge validation.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a073/11834488/bfbbda330656/12911_2025_2922_Fig1_HTML.jpg
