Öznacar Tuğçe, Güler Tunç
Department of Biostatistics, Ankara Medipol University, Ankara 06570, Turkey.
Department of Medical Oncology, Park Hayat Hospital, Afyonkarahisar 03100, Turkey.
Life (Basel). 2025 Apr 3;15(4):594. doi: 10.3390/life15040594.
Ovarian cancer remains one of the most commonly diagnosed gynecological cancers, and early detection is critical for improving survival. This study assesses feature extraction and feature selection across various machine learning techniques, with the aim of improving the modelling of ovarian cancer. By eliminating irrelevant features, this approach could guide clinicians towards more accurate results and improve diagnostic precision.
This study included patients with and without ovarian cancer, yielding a dataset of 50 independent variables (features). Eight machine learning algorithms (Random Forest, XGBoost, CatBoost, Decision Tree, K-Nearest Neighbors, Naive Bayes, Gradient Boosting, and Support Vector Machine) were evaluated alongside four feature selection techniques: Boruta, principal component analysis (PCA), recursive feature elimination (RFE), and mutual information (MI). Performance metrics were compared to identify the best combination for diagnosis.
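The pairing of a feature selection technique with a classifier can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn, uses synthetic data from `make_classification` in place of the study dataset, covers only two of the four selectors (RFE and MI) with a single classifier (Random Forest), and the choice of 10 retained features, 5-fold cross-validation, and all hyperparameters are arbitrary assumptions.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the study data (assumption): 50 features, binary outcome.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=8, n_redundant=4, random_state=0
)


def make_forest():
    """Return a small Random Forest; the settings are illustrative only."""
    return RandomForestClassifier(n_estimators=50, random_state=0)


# Two of the four selection techniques named in the abstract.
selectors = {
    "RFE": RFE(make_forest(), n_features_to_select=10, step=5),
    "MI": SelectKBest(partial(mutual_info_classif, random_state=0), k=10),
}

results = {}
for name, selector in selectors.items():
    # Selection sits inside the pipeline so it is refit on each training fold,
    # which keeps the held-out fold from influencing which features are kept.
    pipe = Pipeline([("select", selector), ("clf", make_forest())])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: mean cross-validated AUC = {results[name]:.3f}")
```

Placing the selector inside the `Pipeline` is a design choice made here to avoid selection bias; the abstract does not state how selection and validation were combined in the study.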
The results were obtained with a substantially reduced number of features. Random Forest and CatBoost differed significantly in performance from the other algorithms (AUC 0.94 and 0.95, respectively). Among the feature selection methods, Boruta and RFE consistently yielded higher AUC and accuracy scores than the others.
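For readers less familiar with the metric, AUC lies on a 0 to 1 scale: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal pure-Python sketch of that definition follows; the function name `roc_auc` and the toy labels and scores are hypothetical and do not come from the study.

```python
def roc_auc(labels, scores):
    """Compute AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data): 1 = cancer, 0 = no cancer.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; library implementations typically use a sort-based computation instead.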
This study highlights the importance of choosing appropriate machine learning algorithms and feature selection techniques for ovarian cancer diagnosis. Boruta and RFE showed high accuracy. By reducing the number of features from 50 to the most relevant ones, clinicians can make more precise diagnoses, enhance patient outcomes, and reduce unnecessary tests.