Ganie Shahid Mohammad, Dutta Pramanik Pijush Kanti, Zhao Zhongming
AI Research Centre, Department of Analytics, Woxsen University, Hyderabad 502345, India.
School of Computer Science and Engineering, Galgotias University, Greater Noida 203201, India.
Bioengineering (Basel). 2025 Apr 29;12(5):472. doi: 10.3390/bioengineering12050472.
Background/Objectives: Cancer is a leading cause of death worldwide, and its early detection is crucial for improving patient outcomes. This study aimed to develop and evaluate ensemble learning models, specifically stacking, for the accurate prediction of lung, breast, and cervical cancers using lifestyle and clinical data. Methods: Twelve base learners were trained on datasets for lung, breast, and cervical cancer. Stacking ensemble models were then built from these base learners. The models were evaluated on accuracy, precision, recall, F1-score, AUC-ROC, Matthews correlation coefficient (MCC), and Cohen's kappa. An explainable AI technique, SHAP, was used to interpret model predictions. Results: The stacking ensemble model outperformed the individual base learners across all three cancer types. Averaged over the three cancer datasets, it achieved 99.28% accuracy, 99.55% precision, 97.56% recall, and a 98.49% F1-score. Similarly high performance was observed for AUC, kappa, and MCC. The SHAP analysis revealed the most influential features for each cancer type: fatigue and alcohol consumption for lung cancer; worst concave points, mean concave points, and worst perimeter for breast cancer; and the Schiller test for cervical cancer. Conclusions: The stacking-based multi-cancer prediction model demonstrated superior accuracy and interpretability compared with traditional models. Combining diverse base learners with explainable AI offers both predictive power and transparency in clinical applications. Key demographic and clinical features driving cancer risk were also identified. Further research should validate the model on more diverse populations and cancer types.
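The stacking setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn, uses only three base learners rather than twelve, picks logistic regression as the meta-learner, and runs on scikit-learn's bundled Wisconsin breast cancer dataset (chosen because its feature names match those in the SHAP findings; whether it is the exact dataset used in the study is an assumption). The function name `evaluate_stacking` is hypothetical, and the SHAP step is omitted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_stacking(random_state=0):
    """Train a small stacking ensemble and return the abstract's metrics."""
    # Assumption: the bundled Wisconsin dataset stands in for the study's data.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state)

    # Assumption: three illustrative base learners (the study used twelve).
    base_learners = [
        ("logreg", make_pipeline(StandardScaler(),
                                 LogisticRegression(max_iter=1000))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("rf", RandomForestClassifier(n_estimators=100,
                                      random_state=random_state)),
    ]

    # The meta-learner is fit on out-of-fold predictions of the base learners
    # (cv=5), so it never sees predictions made on a learner's own training rows.
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )
    stack.fit(X_train, y_train)

    y_pred = stack.predict(X_test)
    y_score = stack.predict_proba(X_test)[:, 1]  # probability of class 1

    # Note: in this dataset class 1 is "benign", so precision, recall, and F1
    # below are reported with benign as the positive class.
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, y_score),
        "mcc": matthews_corrcoef(y_test, y_pred),
        "kappa": cohen_kappa_score(y_test, y_pred),
    }


metrics = evaluate_stacking()
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The numbers this sketch prints are not expected to reproduce the figures reported above, since the learners, preprocessing, and data split all differ. To compare against individual base learners, each entry of `base_learners` could be fit and scored on the same split.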