Saadh Mohamed J, Ahmed Hanan Hassan, Kareem Radhwan Abdul, Yadav Anupam, Ganesan Subbulakshmi, Shankhyan Aman, Sharma Girish Chandra, Naidu K Satyam, Rakhmatullaev Akmal, Sameer Hayder Naji, Yaseen Ahmed, Athab Zainab H, Adil Mohaned, Farhood Bagher
Faculty of Pharmacy, Middle East University, Amman, 11831, Jordan.
College of Pharmacy, Alnoor University, Mosul, Iraq.
Discov Oncol. 2025 Mar 17;16(1):334. doi: 10.1007/s12672-025-02111-3.
This study proposes an advanced machine learning (ML) framework for breast cancer diagnostics by integrating transcriptomic profiling with optimized feature selection and classification techniques.
A dataset of 1759 samples (987 breast cancer patients, 772 healthy controls) was analyzed, with Recursive Feature Elimination, Boruta, and ElasticNet used for feature selection. Dimensionality reduction techniques, including Non-Negative Matrix Factorization (NMF), Autoencoders, and transformer-based embeddings (BioBERT, DNABERT), were applied to enhance model interpretability. The classifiers (XGBoost, LightGBM, ensemble voting, Multi-Layer Perceptron, and Stacking) were trained and tuned using grid search and cross-validation. Models were evaluated using accuracy, area under the ROC curve (AUC), Matthews correlation coefficient (MCC), Kappa score, and ROC and precision-recall (PR) curves, with external validation performed on an independent dataset of 175 samples.
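The workflow described above (feature selection, dimensionality reduction, a grid-searched classifier, held-out evaluation) can be sketched with scikit-learn. This is only an illustration under several assumptions, not the authors' implementation: the data are synthetic (`make_classification`), `GradientBoostingClassifier` stands in for XGBoost/LightGBM, and the function name `run_sketch`, the feature counts, and the parameter grid are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import NMF
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def run_sketch(seed=0):
    # Synthetic stand-in for an expression matrix (assumption: not the study data).
    X, y = make_classification(n_samples=300, n_features=40,
                               n_informative=8, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)

    # Soft-voting ensemble; gradient boosting is a stand-in for XGBoost/LightGBM.
    voter = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("gb", GradientBoostingClassifier(n_estimators=50, random_state=seed)),
        ],
        voting="soft",
    )

    pipe = Pipeline([
        # NMF needs non-negative input; clip=True keeps test data in [0, 1].
        ("scale", MinMaxScaler(clip=True)),
        # Recursive Feature Elimination down to 15 features (hypothetical count).
        ("rfe", RFE(LogisticRegression(max_iter=1000),
                    n_features_to_select=15, step=5)),
        ("nmf", NMF(max_iter=500, random_state=seed)),
        ("clf", voter),
    ])

    # Grid search with 3-fold cross-validation over the NMF rank.
    search = GridSearchCV(pipe, {"nmf__n_components": [4, 8]},
                          cv=3, scoring="roc_auc")
    search.fit(X_train, y_train)

    # Evaluate the refit best pipeline on the held-out split.
    y_pred = search.predict(X_test)
    y_prob = search.predict_proba(X_test)[:, 1]
    return {
        "best_params": search.best_params_,
        "accuracy": accuracy_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
        "mcc": matthews_corrcoef(y_test, y_pred),
    }


if __name__ == "__main__":
    print(run_sketch())
```

Placing every step inside one `Pipeline` means feature selection and NMF are refit within each cross-validation fold, which avoids leaking information from validation folds into the selected features.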
XGBoost and LightGBM achieved the highest test accuracies (0.91 and 0.90, respectively) and AUC values (up to 0.92), particularly when combined with NMF and BioBERT. The ensemble voting method achieved the best accuracy on the external dataset (0.92), supporting its robustness. Transformer-based embeddings and advanced feature selection techniques significantly improved model performance compared to conventional approaches such as PCA and Decision Trees.
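For readers less familiar with the agreement metrics reported alongside accuracy, the following minimal sketch computes accuracy, MCC, and Cohen's kappa from a binary confusion matrix. It assumes the abstract's "Kappa score" refers to Cohen's kappa. The function name `binary_metrics` and the counts in the usage line are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, MCC, and Cohen's kappa from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n

    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {"accuracy": accuracy, "mcc": mcc, "kappa": kappa}


# Made-up counts for illustration only (not results from the paper).
example = binary_metrics(tp=90, fp=10, fn=10, tn=90)
print(example)
```

Unlike raw accuracy, MCC and kappa both account for the class balance, which matters when patient and control groups differ in size, as they do in the 987 versus 772 split described above.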
The proposed ML framework enhances diagnostic accuracy and interpretability, demonstrating strong generalizability on an external dataset. These findings highlight its potential for precision oncology and personalized breast cancer diagnostics.