Information Systems Department, Suez Canal University, Ismailia 41522, Egypt.
ScientificWorldJournal. 2022 Aug 9;2022:1056490. doi: 10.1155/2022/1056490. eCollection 2022.
Cancer is a deadly disease caused by rapid and uncontrolled cell growth. In this article, a machine learning (ML) algorithm is proposed to diagnose different cancers from big data. The algorithm comprises a two-stage hybrid feature selection. In the first stage, an overall ranker combines the results of three filter-based feature evaluation methods, namely chi-squared, F-statistic, and mutual information (MI), and the features are ordered according to this combined ranking. In the second stage, a modified wrapper-based sequential forward selection is used to find the optimal feature subset, with support vector machine (SVM), decision tree (DT), random forest (RF), and k-nearest neighbor (KNN) classifiers as the ML models. To examine the proposed algorithm, experiments were carried out on four cancer microarray datasets using 10-fold cross-validation and hyperparameter tuning. The performance of the algorithm is evaluated by its diagnostic accuracy. The results indicate that for the leukemia dataset, both the SVM and KNN models reach the highest accuracy, 100%, using only 5 features. For the ovarian cancer dataset, the SVM model achieves the highest accuracy, 100%, using only 6 features. For the small round blue cell tumor (SRBCT) dataset, the SVM model again achieves the highest accuracy, 100%, using only 8 features. For the lung cancer dataset, the SVM model achieves the highest accuracy, 99.57%, using 19 features. Compared with other algorithms, the proposed algorithm gives superior results in terms of both the number of selected features and diagnostic accuracy.
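The two-stage idea described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the three filter results are combined or how the sequential forward selection is modified, so the sketch assumes a mean-rank combination and a simplified forward pass that tests features in ranked order. The function names, the toy data, the hard-coded filter scores, and the leave-one-out 1-nearest-neighbor evaluator (standing in for the tuned classifiers and 10-fold cross-validation) are all hypothetical. In practice the filter scores could come from functions such as scikit-learn's `chi2`, `f_classif`, and `mutual_info_classif`.

```python
from statistics import mean


def rank_positions(scores):
    """Rank of each feature (0 = best) for higher-is-better filter scores."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for position, j in enumerate(order):
        ranks[j] = position
    return ranks


def overall_ranking(score_lists):
    """Stage 1 (assumed rule): order features by mean rank across filters."""
    per_filter = [rank_positions(s) for s in score_lists]
    n_features = len(score_lists[0])
    mean_rank = [mean(r[j] for r in per_filter) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: mean_rank[j])


def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier.

    A stand-in evaluator; the paper uses tuned SVM, DT, RF, and KNN
    models with 10-fold cross-validation.
    """
    correct = 0
    for i, xi in enumerate(X):
        best_label, best_dist = None, float("inf")
        for k, xk in enumerate(X):
            if k == i:
                continue
            dist = sum((xi[f] - xk[f]) ** 2 for f in features)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def forward_select(X, y, ranked, evaluate, max_features=None):
    """Stage 2 (simplified): keep a ranked feature only if accuracy improves."""
    selected, best = [], 0.0
    for j in ranked[:max_features]:
        score = evaluate(X, y, selected + [j])
        if score > best:
            selected, best = selected + [j], score
        if best == 1.0:  # nothing left to gain
            break
    return selected, best


# Toy data (hypothetical): feature 0 separates the classes, the rest are noise.
X = [
    [0.1, 5.0, 1.0, 3.0],
    [0.2, 1.0, 2.0, 3.1],
    [0.0, 4.0, 1.5, 2.9],
    [0.9, 4.5, 1.2, 3.0],
    [1.0, 1.5, 1.8, 3.2],
    [0.8, 5.5, 1.1, 2.8],
]
y = [0, 0, 0, 1, 1, 1]

# Hypothetical per-feature scores from three filters (higher = more relevant).
chi2_scores = [9.0, 0.5, 0.2, 0.1]
f_scores = [40.0, 0.1, 0.3, 0.05]
mi_scores = [0.6, 0.05, 0.0, 0.02]

ranked = overall_ranking([chi2_scores, f_scores, mi_scores])
selected, best = forward_select(X, y, ranked, loo_1nn_accuracy)
print("ranked features:", ranked)
print("selected subset:", selected, "accuracy:", best)
```

Ranking first and searching second keeps the wrapper stage cheap: the classifier is retrained once per candidate feature in ranked order rather than once per possible subset, which matters for microarray data with thousands of genes and few samples.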