Department of Software Engineering (SWE), Daffodil International University (DIU), Sukrabad, Dhaka, 1207, Bangladesh.
Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada; Group of Biophotomatiχ, Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh.
Comput Biol Med. 2021 Dec;139:104985. doi: 10.1016/j.compbiomed.2021.104985. Epub 2021 Oct 28.
Cervical cancer (CC) is one of the most common cancers in women and remains a significant cause of mortality, particularly in less developed countries, although it can be treated effectively if detected at an early stage. This study aimed to identify efficient machine-learning-based classification models for detecting early-stage CC from clinical data. We obtained a CC dataset from the Kaggle data repository that contained four class attributes: biopsy, cytology, Hinselmann, and Schiller. The dataset was split into four datasets, one per class attribute. Three feature transformation methods (logarithmic, sine function, and Z-score) were applied to these datasets. Several supervised machine learning algorithms were assessed for their classification performance. A Random Tree (RT) algorithm provided the best classification accuracy for the biopsy (98.33%) and cytology (98.65%) data, whereas Random Forest (RF) and Instance-Based K-nearest neighbor (IBk) provided the best performance for the Hinselmann (99.16%) and Schiller (98.58%) data, respectively. Among the feature transformation methods, logarithmic gave the best performance for the biopsy dataset, whereas the sine function was superior for cytology. Both the logarithmic and sine functions performed best for the Hinselmann dataset, while Z-score was best for the Schiller dataset. Various feature selection techniques (FSTs) were applied to the transformed datasets to identify and prioritize important risk factors. The outcomes of this study indicate that, with appropriate system design and tuning, machine learning classification methods can detect CC accurately and efficiently in its early stages using clinical data.
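The three feature transformations named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so the details here are assumptions: the logarithmic transform is taken as ln(1 + x) so that zero-valued features stay defined, the sine transform is applied elementwise in radians, and the Z-score uses the population standard deviation. The function names and the sample values are hypothetical and are not taken from the study or its dataset.

```python
import math
from statistics import mean, pstdev


def log_transform(values):
    """Logarithmic transform, assumed here to be ln(1 + x) so zeros are allowed."""
    return [math.log1p(v) for v in values]


def sine_transform(values):
    """Sine transform, assumed to be an elementwise sin(x) with x in radians."""
    return [math.sin(v) for v in values]


def z_score_transform(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up numeric feature column for illustration only (not from the CC dataset).
sample_feature = [18.0, 25.0, 34.0, 41.0, 52.0]

transformed = {
    "log": log_transform(sample_feature),
    "sine": sine_transform(sample_feature),
    "z_score": z_score_transform(sample_feature),
}

for name, column in transformed.items():
    print(name, [round(v, 3) for v in column])
```

In a workflow like the one described, each transformed copy of a dataset would then be passed to the same set of classifiers so that their accuracies can be compared per transformation.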