Kim Ji Yoon
Ewha Womans University College of Medicine, Seoul, Korea.
Ewha Med J. 2025 Apr;48(2):e31. doi: 10.12771/emj.2025.00297. Epub 2025 Apr 15.
This study aimed to leverage Shapley additive explanation (SHAP)-based feature engineering to predict appendix cancer. Traditional models often lack transparency, hindering clinical adoption. We propose a framework that integrates SHAP for feature selection, construction, and weighting to enhance accuracy and clinical relevance.
Data from the Kaggle Appendix Cancer Prediction dataset (260,000 samples, 21 features) were used in this prediction study conducted from January through March 2025, in accordance with TRIPOD-AI guidelines. Preprocessing involved label encoding, SMOTE (synthetic minority over-sampling technique) to address class imbalance, and an 80:20 train-test split. Baseline models (random forest, XGBoost, LightGBM) were compared; LightGBM was selected for its superior performance (accuracy=0.8794). SHAP analysis identified key features and guided 3 engineering steps: selection of the top 15 features, construction of interaction-based features (e.g., chronic severity), and feature weighting based on SHAP values. Performance was evaluated using accuracy, precision, recall, and F1-score.
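The three SHAP-guided steps described above (selection, construction, weighting) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the importance values are invented, `age` and `noise` are hypothetical feature names, and the weighting rule (scaling each feature by its share of the total mean |SHAP|) is only one plausible reading of "feature weighting based on SHAP values". In practice the mean |SHAP| values would come from a SHAP explainer fitted to the trained LightGBM model.

```python
from typing import Dict, List


def top_k_features(mean_abs_shap: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the largest mean |SHAP| value."""
    ranked = sorted(mean_abs_shap, key=mean_abs_shap.get, reverse=True)
    return ranked[:k]


def shap_weights(mean_abs_shap: Dict[str, float], features: List[str]) -> Dict[str, float]:
    """Normalize mean |SHAP| over the selected features so the weights sum to 1.

    Assumed weighting rule; the study's actual scheme is not stated in the abstract.
    """
    total = sum(mean_abs_shap[f] for f in features)
    return {f: mean_abs_shap[f] / total for f in features}


def apply_weights(row: Dict[str, float], weights: Dict[str, float]) -> Dict[str, float]:
    """Scale each selected feature value by its weight; unselected features are dropped."""
    return {f: row[f] * w for f, w in weights.items()}


# Hypothetical mean |SHAP| values (not taken from the study).
importance = {
    "red_blood_cell_count": 0.40,
    "chronic_severity": 0.30,
    "age": 0.20,
    "noise": 0.10,
}

selected = top_k_features(importance, k=3)   # the study kept the top 15 of 21
weights = shap_weights(importance, selected)

# One hypothetical, already-encoded sample.
row = {"red_blood_cell_count": 4.5, "chronic_severity": 2.0, "age": 60.0, "noise": 1.0}
weighted_row = apply_weights(row, weights)
```

Here `selected` keeps the three highest-ranked features and drops `noise`; the weighted row would then be passed to the model in place of the raw features.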
Four LightGBM model configurations were evaluated: baseline (accuracy=0.8794, F1-score=0.8691), feature selection (accuracy=0.8968, F1-score=0.8860), feature construction (accuracy=0.8980, F1-score=0.8872), and feature weighting (accuracy=0.8986, F1-score=0.8877). Every SHAP-based configuration outperformed the baseline, and feature weighting achieved the highest precision (0.9940). SHAP analysis showed that key features (e.g., red blood cell count and chronic severity) drove the predictions, so the model remained interpretable.
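The reported metrics are related by the usual definitions, which a short sketch makes concrete. The confusion-matrix counts below are invented for illustration and are not from the study. The recall figure is not reported in the abstract; it is derived here under the assumption that the F1-score (0.8877) and precision (0.9940) of the feature weighting configuration refer to the same class and that F1 is the standard harmonic mean of precision and recall.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def recall_from_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    return f1 * precision / (2 * precision - f1)


# Toy counts (hypothetical, for illustration only).
toy = classification_metrics(tp=80, fp=5, fn=20, tn=95)

# Recall implied by the reported precision and F1 of the feature weighting model,
# under the assumptions stated above.
implied_recall = recall_from_f1(precision=0.9940, f1=0.8877)
```

Under those assumptions the implied recall is roughly 0.80, which would mean the high precision comes with a noticeably lower sensitivity.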
The SHAP-based framework improved the accuracy of appendix cancer prediction with LightGBM from 0.8794 to 0.8986 (F1-score, 0.8691 to 0.8877) while keeping the model transparent. This approach bridges the gap between predictive power and clinical interpretability, offering a scalable model for rare disease prediction. Future validation with real-world data is recommended to ensure generalizability.