Al-Azzam Nosayba, Shatnawi Ibrahem
Department of Physiology and Biochemistry, Faculty of Medicine, Jordan University of Science and Technology, Irbid, 22110, Jordan.
Independent Researcher in Data Analytics, Jordan.
Ann Med Surg (Lond). 2021 Jan 8;62:53-64. doi: 10.1016/j.amsu.2020.12.043. eCollection 2021 Feb.
Breast cancer is the most common cancer in US women and the second leading cause of cancer death among women.
The objective was to compare and evaluate the performance and accuracy of key supervised and semi-supervised machine learning algorithms for breast cancer prediction.
We used nine machine learning classification algorithms for supervised learning (SL) and semi-supervised learning (SSL): 1) logistic regression; 2) Gaussian naive Bayes; 3) linear support vector machine; 4) RBF support vector machine; 5) decision tree; 6) random forest; 7) XGBoost; 8) gradient boosting; 9) k-nearest neighbors (KNN). The Wisconsin Diagnosis Cancer dataset was used to train and test these models. To ensure the robustness of the models, we applied K-fold cross-validation and optimized the hyperparameters. We evaluated and compared the models using accuracy, precision, recall, F1-score, and ROC curves.
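The comparison described above can be illustrated with a minimal sketch. It is not the authors' code. It assumes scikit-learn and its bundled `load_breast_cancer` data (a copy of the Wisconsin diagnostic breast cancer data), uses logistic regression as the only classifier, and uses self-training via `SelfTrainingClassifier` as the SSL technique, since the abstract does not name the SSL method that was applied. The function names `make_model` and `compare_sl_ssl`, the 75/25 split, the 5 folds, and the 0.9 confidence threshold are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import SelfTrainingClassifier


def make_model():
    # Hypothetical helper: feature scaling followed by logistic regression.
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


def compare_sl_ssl(seed=0):
    """Return (cv_acc, sl_acc, ssl_acc) for one train/test split."""
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    # K-fold cross-validation on the training portion (K=5 is an assumption).
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    cv_acc = cross_val_score(
        make_model(), X_train, y_train, cv=cv, scoring="accuracy"
    ).mean()

    # Supervised learning: every training label is available.
    sl = make_model().fit(X_train, y_train)
    sl_acc = accuracy_score(y_test, sl.predict(X_test))

    # Semi-supervised learning: hide half of the training labels.
    # scikit-learn marks unlabeled samples with the label -1.
    rng = np.random.default_rng(seed)
    y_partial = y_train.copy()
    hidden = rng.choice(len(y_train), size=len(y_train) // 2, replace=False)
    y_partial[hidden] = -1

    # Self-training: the model pseudo-labels the unlabeled samples it is
    # confident about (probability >= 0.9) and refits on the enlarged set.
    ssl = SelfTrainingClassifier(make_model(), threshold=0.9)
    ssl.fit(X_train, y_partial)
    ssl_acc = accuracy_score(y_test, ssl.predict(X_test))

    return cv_acc, sl_acc, ssl_acc


if __name__ == "__main__":
    cv_acc, sl_acc, ssl_acc = compare_sl_ssl()
    print(f"CV accuracy (SL):   {cv_acc:.3f}")
    print(f"Test accuracy (SL): {sl_acc:.3f}")
    print(f"Test accuracy (SSL): {ssl_acc:.3f}")
```

The other metrics named in the abstract (precision, recall, F1-score, ROC curves) are available in `sklearn.metrics` and could be computed on the same predictions. Swapping `make_model` for another classifier that exposes `predict_proba` would extend the sketch to the other algorithms in the list.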
The results of all models are encouraging under both SL and SSL. SSL achieved high accuracy (90%-98%) with just half of the training data. The KNN model for SL and logistic regression for SSL achieved the highest accuracy, 98%.
The accuracies of the SSL algorithms are very close to those of the SL algorithms. The accuracies of all models are in the range of 91%-98%. SSL is a promising and competitive approach to this problem. Using a small sample of labeled data and low computational power, SSL is fully capable of replacing SL algorithms in diagnosing tumor type.