Kanaan Soha, Altamimi Ahmad, Qattous Hazem, Rbeihat Haitham
Princess Sumaya University for Technology (PSUT), Amman, Jordan.
Department of Software Engineering, Princess Sumaya University for Technology (PSUT), Amman, Jordan.
Comput Biol Med. 2025 Jun;191:110184. doi: 10.1016/j.compbiomed.2025.110184. Epub 2025 Apr 17.
Colorectal cancer (CRC) ranks as the third most prevalent cancer worldwide, posing significant public health challenges. Late-stage detection often results in poor treatment outcomes, elevating mortality rates. The economic and psychological burdens of CRC treatment underscore the need for early detection.
This study aims to enhance the early detection of colorectal cancer by applying machine learning (ML) algorithms to non-invasive features. The focus is on constructing a comprehensive dataset, analyzing non-invasive features, and developing predictive models that reduce the need for invasive procedures such as colonoscopy. Because it relies only on non-invasive, easily accessible data, such a model could be applied widely without the risks associated with invasive procedures.
A retrospective dataset of 400 patients was sourced from the colorectal cancer unit of Royal Medical Services (2021-2022). The dataset included demographic data, imaging reports, laboratory results, and clinical evaluations. The study comprised three experiments in which five ML models (K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), and Naïve Bayes (NB)) were trained on the collected dataset and on a public dataset to assess generalizability. The first experiment used all 35 features with each ML algorithm. The second experiment used only the most informative features. The third experiment validated the models on a public dataset, with Phase I including all data and Phase II excluding records with missing values.
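The comparison of the five classifiers described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: it assumes scikit-learn, uses a synthetic dataset from `make_classification` that only matches the reported size (400 rows, 35 features), and picks default hyperparameters and 5-fold cross-validation, none of which are stated in the abstract. The helper name `compare_models` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the clinical data (assumption): only the shape,
# 400 patients x 35 features, comes from the abstract.
X, y = make_classification(
    n_samples=400, n_features=35, n_informative=10, random_state=0
)

# The five model families named in the abstract, with default settings
# (assumption). KNN and SVM are distance-based, so they get feature scaling.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
}


def compare_models(X, y, cv=5):
    """Return the mean cross-validated accuracy for each model."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


scores = compare_models(X, y)
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Accuracies on this synthetic data say nothing about the reported results; the sketch only shows how the same data and evaluation protocol can be shared across all five models so that their scores are comparable.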
The Random Forest (RF) algorithm consistently outperformed the other models, achieving an accuracy of 95.8% in the first experiment and 96.5% in the second. On the public dataset, RF accuracy was 66.0% in Phase I and 68.9% in Phase II. Conversely, the KNN algorithm had the lowest accuracy in all experiments.
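The two validation phases differ in how incomplete records are treated. The sketch below illustrates one way to set that up, under stated assumptions: the data are synthetic, scikit-learn and NumPy are used, and Phase I fills missing values with a median imputer. The abstract says only that Phase I included all data, so the imputation step is a guess rather than the authors' method. Phase II keeps only complete rows, as described.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with about 10% of entries knocked out (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan

# Phase I: keep every record. Median imputation is an assumed way of
# making incomplete rows usable; the abstract does not specify one.
phase1_model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
phase1_acc = cross_val_score(phase1_model, X_missing, y, cv=5).mean()

# Phase II: exclude any record that has a missing value.
complete = ~np.isnan(X_missing).any(axis=1)
X_complete, y_complete = X_missing[complete], y[complete]
phase2_model = RandomForestClassifier(n_estimators=100, random_state=0)
phase2_acc = cross_val_score(phase2_model, X_complete, y_complete, cv=5).mean()

print(f"Phase I  (all rows, n={len(y)}): {phase1_acc:.3f}")
print(f"Phase II (complete rows, n={len(y_complete)}): {phase2_acc:.3f}")
```

Dropping incomplete rows shrinks the sample, so a difference between the two phases can reflect both cleaner inputs and a changed patient mix.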
This study highlights the effectiveness of ML in early CRC detection using non-invasive techniques. The RF model demonstrated superior accuracy, suggesting its potential application in clinical settings. The research contributes valuable insights into CRC detection within the local context and emphasizes the broader applicability of ML in improving cancer diagnosis and personalized treatment.