Department of Data Science, Loyola College, Chennai, 600 034, India.
ICMR-National Institute for Research in Tuberculosis, Chennai, 600 031, India.
Sci Rep. 2023 Apr 1;13(1):5362. doi: 10.1038/s41598-023-32029-1.
Breast cancer is the most common cancer in women worldwide and the leading cause of cancer death among women. The aim of this research is to classify the survival status (alive or dead) of breast cancer patients using the Surveillance, Epidemiology, and End Results (SEER) dataset. Because they can handle very large data sets systematically, machine learning and deep learning have been widely employed in biomedical research to address diverse classification problems. Pre-processing the data makes it possible to visualize and analyze it in support of important decisions. This research presents a feasible machine learning-based approach for classifying the SEER breast cancer dataset. A two-step feature selection method based on Variance Threshold and Principal Component Analysis was employed to select features from the dataset. After feature selection, the dataset was classified using supervised and ensemble learning techniques: AdaBoost, XGBoost, Gradient Boosting, Naive Bayes and Decision Tree. The performance of these algorithms was examined using both a train-test split and k-fold cross-validation. The Decision Tree achieved an accuracy of 98% under both the train-test split and cross-validation. In this study, the Decision Tree algorithm outperformed the other supervised and ensemble learning approaches on the SEER breast cancer dataset.
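The evaluation protocol described above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic rather than SEER records, a one-split decision stump stands in for the full Decision Tree, the Principal Component Analysis step is omitted, and all function names (`variance_threshold`, `fit_stump`, `holdout_accuracy`, `kfold_accuracy`) are hypothetical. It uses only the Python standard library and shows a variance-threshold filter followed by accuracy estimates from a single train-test split and from k-fold cross-validation.

```python
import random
from statistics import pvariance


def variance_threshold(rows, threshold=0.0):
    """Return indices of columns whose population variance exceeds threshold."""
    return [j for j in range(len(rows[0]))
            if pvariance([r[j] for r in rows]) > threshold]


def fit_stump(X, y):
    """Fit a one-split stump (stand-in for a decision tree) by training accuracy."""
    best = (-1.0, 0, 0.0, 1)  # (accuracy, feature, cut, label predicted above cut)
    for j in range(len(X[0])):
        for cut in sorted({row[j] for row in X}):
            for hi in (0, 1):
                preds = [hi if row[j] > cut else 1 - hi for row in X]
                acc = sum(p == t for p, t in zip(preds, y)) / len(y)
                if acc > best[0]:
                    best = (acc, j, cut, hi)
    _, j, cut, hi = best
    return lambda row: hi if row[j] > cut else 1 - hi


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == t for row, t in zip(X, y)) / len(y)


def holdout_accuracy(X, y, test_fraction=0.25, seed=0):
    """Accuracy on one shuffled train-test split."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    model = fit_stump([X[i] for i in train], [y[i] for i in train])
    return accuracy(model, [X[i] for i in test], [y[i] for i in test])


def kfold_accuracy(X, y, k=5, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_stump([X[i] for i in train], [y[i] for i in train])
        scores.append(accuracy(model, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(scores) / k


# Synthetic data (assumption): column 0 is informative, column 1 is constant,
# column 2 is noise. The label is 1 when column 0 exceeds 0.5.
rng = random.Random(42)
rows = [[rng.random(), 1.0, rng.random()] for _ in range(200)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

keep = variance_threshold(rows)            # drops the constant column
X = [[r[j] for j in keep] for r in rows]

holdout = holdout_accuracy(X, labels)
cv = kfold_accuracy(X, labels, k=5)
print("kept columns:", keep)
print("hold-out accuracy:", holdout)
print("5-fold mean accuracy:", cv)
```

Reporting both figures, as the abstract does, is a useful check: a hold-out score that differs sharply from the cross-validated mean suggests the single split was unrepresentative.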