Ahmed Kazi Arman, Humaira Israt, Khan Ashiqur Rahman, Hasan Md Shamim, Islam Mukitul, Roy Anik, Karim Mehrab, Uddin Mezbah, Mohammad Ashique, Xames Md Doulotuzzaman
Department of Industrial and Production Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh.
Department of Biomedical Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh.
PLoS One. 2025 Jun 18;20(6):e0326221. doi: 10.1371/journal.pone.0326221. eCollection 2025.
Breast cancer is a major global health concern, with rising incidence and mortality. Current diagnostic methods face challenges that call for improved approaches. This study applies a range of machine learning (ML) algorithms, including KNN, SVM, ANN, RF, and XGBoost, together with ensemble models, AutoML, and deep learning (DL) techniques, to improve breast cancer diagnosis. The objective is to compare the efficiency and accuracy of these models on original and synthetic datasets. The methodology comprises three phases, each with two stages. In the first stage of each phase, multiple ML models were trained and evaluated with stratified K-fold cross-validation. In the second stage, DL-based and AutoML-based ensemble strategies were applied to improve prediction accuracy. In the second and third phases, synthetic data generation methods such as Gaussian Copula and TVAE were used. The KNN model outperformed the others on the original dataset, while the AutoML approach using H2OXGBoost on synthetic data also showed high accuracy. These findings underscore the effectiveness of traditional ML models and AutoML in predicting breast cancer. The study also demonstrated the potential of synthetic data generation methods to improve prediction performance, which can aid decision-making in the diagnosis and treatment of breast cancer.
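The first stage of each phase, stratified K-fold cross-validation of a classifier such as KNN, can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the helper names (`stratified_k_fold`, `knn_predict`, `cross_validate`), the choice of 5 folds and 5 neighbours, and the two-cluster toy data are all assumptions made for illustration, and the abstract does not state the study's dataset or hyperparameters. In practice one would more likely use scikit-learn's `StratifiedKFold` and `KNeighborsClassifier`.

```python
import random
from collections import Counter, defaultdict
from math import dist


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def knn_predict(train_X, train_y, x, n_neighbors=5):
    """Predict by majority vote among the n_neighbors closest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], x))
    votes = Counter(label for _, label in nearest[:n_neighbors])
    return votes.most_common(1)[0][0]


def cross_validate(X, y, k=5, n_neighbors=5):
    """Return the per-fold accuracy of KNN under stratified K-fold CV."""
    scores = []
    for train, test in stratified_k_fold(y, k):
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        hits = sum(
            knn_predict(train_X, train_y, X[i], n_neighbors) == y[i] for i in test
        )
        scores.append(hits / len(test))
    return scores


# Toy stand-in data (an assumption, NOT the study's dataset):
# 60 points of class 0 around (0, 0) and 40 points of class 1 around (4, 4).
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
X += [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(40)]
y = [0] * 60 + [1] * 40

scores = cross_validate(X, y)
print("per-fold accuracy:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Stratification matters here because diagnostic datasets are often class-imbalanced: dealing each class separately into the folds keeps the benign-to-malignant ratio of every test fold close to that of the full dataset, so per-fold accuracies are comparable.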