School of Science, Hubei University of Technology, Wuhan 430000, China.
Comput Intell Neurosci. 2023 Jan 11;2023:6530719. doi: 10.1155/2023/6530719. eCollection 2023.
Breast cancer is among the most common and deadly cancers worldwide. Based on machine learning algorithms such as XGBoost, random forest, logistic regression, and K-nearest neighbor, this paper establishes different models to classify and predict breast cancer, so as to provide a reference for its early diagnosis. Recall reflects the probability of correctly detecting malignant cases in medical diagnosis and is therefore of particular importance for breast cancer classification, so this article takes recall as the primary evaluation index and also considers precision, accuracy, and F1-score to evaluate and compare the prediction performance of each model. To eliminate the influence of differing feature scales on model performance, the data are standardized. To find an optimal feature subset and improve model accuracy, 15 features were selected as model inputs through the Pearson correlation test. The K-nearest neighbor model selects the optimal K value by cross-validation, with recall as the evaluation index. To address the imbalance between positive and negative samples, stratified sampling is used to draw the training and test sets in proportion to each class. The experimental results show that the performance of the same model varies under different train/test splits (8:2 and 7:3). Comparative analysis shows that the XGBoost model established in this paper (with an 8:2 train/test split) performs best, achieving recall, precision, accuracy, and F1-score of 1.00, 0.960, 0.974, and 0.980, respectively.
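As a rough illustration of the pipeline described in the abstract (standardization, Pearson-based screening of 15 features, a stratified 8:2 split, recall-driven selection of K, and an XGBoost classifier), the following Python sketch uses scikit-learn and xgboost. The scikit-learn Wisconsin breast cancer dataset, the top-|r| screening rule, the K grid, and all hyperparameters are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of the workflow outlined in the abstract, assuming the
# scikit-learn Wisconsin breast cancer data as a stand-in for the paper's dataset.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score, accuracy_score, f1_score
from xgboost import XGBClassifier

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series((data.target == 0).astype(int))   # relabel so 1 = malignant, 0 = benign

# Pearson correlation screening: keep the 15 features most correlated with the label
# (assumed top-|r| rule; the paper reports 15 features but not the exact criterion).
top15 = X.corrwith(y).abs().sort_values(ascending=False).index[:15]
X = X[top15]

# Standardize to remove the influence of differing feature scales.
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=top15)

# Stratified 8:2 split so the malignant/benign ratio is preserved in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# KNN: pick K by cross-validation with recall (malignant detection) as the score.
knn = GridSearchCV(KNeighborsClassifier(),
                   {"n_neighbors": list(range(1, 21))},
                   scoring="recall", cv=5).fit(X_tr, y_tr)

# XGBoost with default settings (the paper does not list its hyperparameters).
xgb = XGBClassifier(eval_metric="logloss").fit(X_tr, y_tr)

for name, model in [("KNN", knn), ("XGBoost", xgb)]:
    pred = model.predict(X_te)
    print(f"{name}: recall={recall_score(y_te, pred):.3f} "
          f"precision={precision_score(y_te, pred):.3f} "
          f"accuracy={accuracy_score(y_te, pred):.3f} "
          f"F1={f1_score(y_te, pred):.3f}")
```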