An effective multi-step feature selection framework for clinical outcome prediction using electronic medical records.
Author information
Wang Hongnian, Zhang Mingyang, Mai Liyi, Li Xin, Bellou Abdelouahab, Wu Lijuan
Affiliations
School of Management, Jinan University, Guangzhou, 510632, China.
Key Laboratory of Digital-Intelligent Disease Surveillance and Health Governance, North Sichuan Medical College, Nanchong, 637100, China.
Publication information
BMC Med Inform Decis Mak. 2025 Feb 17;25(1):84. doi: 10.1186/s12911-025-02922-y.
BACKGROUND
Identifying key variables is essential for developing clinical outcome prediction models based on high-dimensional electronic medical records (EMR). However, despite the abundance of feature selection (FS) methods available, challenges remain in choosing the most appropriate method, deciding how many top-ranked variables to include, and ensuring these selections are meaningful from a medical perspective.
METHODS
We developed a practical multi-step FS framework that integrates data-driven statistical inference with a knowledge verification strategy. The framework was validated on two distinct EMR datasets targeting different clinical outcomes. The first cohort, sourced from the Medical Information Mart for Intensive Care III (MIMIC-III), focused on predicting acute kidney injury (AKI) in ICU patients. The second cohort, drawn from the MIMIC-IV Emergency Department (MIMIC-IV-ED) database, aimed to estimate in-hospital mortality (IHM) for patients transferred from the ED to the ICU. We applied multiple machine learning (ML) methods and compared them on accuracy, stability, similarity, and interpretability. The effectiveness of the FS framework was evaluated using discrimination and calibration metrics, with SHapley Additive exPlanations (SHAP) applied to enhance the interpretability of model decisions.
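The stability and similarity comparisons described above can be illustrated with a small sketch. The abstract does not name the metric used, so the mean pairwise Jaccard index below is an assumption, and the feature names and rankings are hypothetical placeholders rather than study data. The same function could be applied to rankings from resampled data (stability) or to rankings from different FS methods (similarity).

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(rankings, k):
    """Average Jaccard index over all pairs of top-k feature subsets."""
    subsets = [ranking[:k] for ranking in rankings]
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature rankings from three resamples of the same cohort
# (illustrative names only; not taken from the study).
resample_rankings = [
    ["creatinine", "urea", "age", "lactate", "sodium"],
    ["creatinine", "age", "urea", "sodium", "lactate"],
    ["urea", "creatinine", "lactate", "age", "sodium"],
]

stability_at_2 = mean_pairwise_jaccard(resample_rankings, 2)
stability_at_3 = mean_pairwise_jaccard(resample_rankings, 3)
stability_at_5 = mean_pairwise_jaccard(resample_rankings, 5)
```

In this toy example the top-k subsets agree more closely as k grows, which is the kind of trend one would inspect when deciding how many top-ranked features to keep.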
RESULTS
Cohort 1 comprised 48,780 ICU encounters, of which 8,883 (18.21%) developed AKI. Cohort 2 included 29,197 transfers from the ED to the ICU, of which 3,219 (11.03%) resulted in IHM. Among the ten ML methods evaluated, the tree-based ensemble method achieved the highest accuracy. As the number of top-ranked features increased, model accuracy began to stabilize, while feature subset stability (under sample variation) and inter-method feature similarity reached optimal levels, supporting the validity of the FS framework. Integrating interpretative methods and expert knowledge in the final step further improved feature interpretability. The FS framework substantially reduced the number of features (from 380 to 35 for Cohort 1, and from 273 to 54 for Cohort 2) without significantly affecting prediction performance (DeLong test, p > 0.05).
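One way to turn "accuracy began to stabilize" into a concrete cut-off is to pick the smallest feature count whose discrimination is within a small tolerance of the best observed value. The sketch below assumes that rule; the tolerance, the function name `smallest_plateau_k`, and the AUC values are all hypothetical and are not the study's procedure or results. The final lines only restate the reported feature counts as percentage reductions.

```python
def smallest_plateau_k(auc_by_k, tolerance=0.005):
    """Return the smallest k whose AUC is within `tolerance` of the best AUC."""
    best = max(auc_by_k.values())
    for k in sorted(auc_by_k):
        if best - auc_by_k[k] <= tolerance:
            return k


def percent_reduction(before, after):
    """Percentage of features removed when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical AUC values by number of top-ranked features
# (illustrative only; not reported in the abstract).
auc_by_k = {10: 0.71, 20: 0.76, 30: 0.788, 40: 0.790, 80: 0.791}
chosen_k = smallest_plateau_k(auc_by_k)

# Feature counts as reported in the abstract.
cohort1_reduction = percent_reduction(380, 35)  # roughly 90.8% of features removed
cohort2_reduction = percent_reduction(273, 54)  # roughly 80.2% of features removed
```

A tolerance-based rule is only one option; a statistical comparison of the reduced and full models, such as the DeLong test mentioned above, would be needed to support a claim of no significant performance loss.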
CONCLUSION
The multi-step FS method developed in this study successfully reduces the dimensionality of features in EMR while preserving the accuracy of clinical outcome prediction. Furthermore, it improves the interpretability of risk factors by incorporating expert knowledge validation.