Department of Industrial and Systems Engineering, Dongguk University-Seoul, Seoul 04620, Korea.
Department of Software, Sejong University, Seoul 05006, Korea.
Sensors (Basel). 2020 May 15;20(10):2809. doi: 10.3390/s20102809.
Globally, cervical cancer remains one of the most prevalent cancers in women. Identifying which risk factors matter most is therefore necessary for classifying potential patients. The present work proposes a cervical cancer prediction model (CCPM) that offers early prediction of cervical cancer using risk factors as inputs. The CCPM first removes outliers using an outlier detection method, either density-based spatial clustering of applications with noise (DBSCAN) or isolation forest (iForest). It then balances the dataset by increasing the number of cases, using either the synthetic minority over-sampling technique (SMOTE) or SMOTE with Tomek links (SMOTETomek). Finally, it employs random forest (RF) as the classifier. The CCPM therefore comprises four scenarios: (1) DBSCAN + SMOTETomek + RF, (2) DBSCAN + SMOTE + RF, (3) iForest + SMOTETomek + RF, and (4) iForest + SMOTE + RF. A dataset of 858 potential patients was used to validate the performance of the proposed method. We found that the combinations of iForest with SMOTE and iForest with SMOTETomek performed better than DBSCAN with SMOTE and DBSCAN with SMOTETomek. We also observed that RF performed best among several popular machine learning classifiers. Furthermore, the proposed CCPM showed better accuracy than previously proposed methods for predicting cervical cancer. In addition, a mobile application that collects cervical cancer risk factor data and returns the CCPM results was developed to support prompt and appropriate action at the initial stage of cervical cancer.