Eledkawy Amr, Hamza Taher, El-Metwally Sara
Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt.
Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt.
BioData Min. 2025 Apr 11;18(1):29. doi: 10.1186/s13040-025-00439-8.
Millions of people die from cancer every year. Early cancer detection is crucial for ensuring higher survival rates, as it provides an opportunity for timely medical interventions. This paper proposes a multi-level cancer classification system that uses plasma cfDNA/ctDNA mutations and protein biomarkers to identify seven distinct cancer types: colorectal, breast, upper gastrointestinal, lung, pancreas, ovarian, and liver.
The proposed system employs a multi-stage binary classification framework in which each stage is customized for a specific cancer type. Features are selected by majority vote across six feature selectors: Information Value, Chi-Square, Random Forest Feature Importance, Extra Tree Feature Importance, Recursive Feature Elimination, and L1 Regularization. After feature selection, classifiers (eXtreme Gradient Boosting, Random Forest, Extra Tree, and Quadratic Discriminant Analysis) are customized for each cancer type, either individually or in an ensemble soft voting setup, to optimize predictive accuracy. The proposed system outperformed previously published results, achieving an AUC of 98.2% and an accuracy of 96.21%. To ensure reproducibility of the results, the trained models and the dataset used in this study are publicly available in the GitHub repository (https://github.com/SaraEl-Metwally/Towards-Precision-Oncology).
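The two voting steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the strict-majority threshold (more than half of the selectors), the 0.5 decision threshold, and the placeholder feature names and probabilities are all assumptions made for illustration. In practice each selector and classifier would come from a machine learning library and be fitted on the study data.

```python
from collections import Counter


def majority_vote_features(selections, min_votes=None):
    """Keep features chosen by at least min_votes selectors.

    selections: one collection of feature names per selector.
    min_votes defaults to a strict majority (assumed, not from the paper).
    """
    if min_votes is None:
        min_votes = len(selections) // 2 + 1
    # Count each feature at most once per selector.
    votes = Counter(f for chosen in selections for f in set(chosen))
    return sorted(f for f, n in votes.items() if n >= min_votes)


def soft_vote(probabilities, threshold=0.5):
    """Average positive-class probabilities and threshold the mean."""
    mean_p = sum(probabilities) / len(probabilities)
    return mean_p, mean_p >= threshold


# Hypothetical selections from six selectors for one cancer-type stage.
selections = [
    {"protein_A", "protein_B", "mutation_X"},  # Information Value
    {"protein_A", "protein_B"},                # Chi-Square
    {"protein_A", "mutation_X", "protein_C"},  # Random Forest importance
    {"protein_A", "protein_B", "protein_C"},   # Extra Tree importance
    {"protein_A", "mutation_X"},               # Recursive Feature Elimination
    {"protein_A", "protein_B"},                # L1 regularization
]
selected = majority_vote_features(selections)

# Hypothetical positive-class probabilities from four classifiers
# for a single sample at this stage.
mean_p, is_positive = soft_vote([0.9, 0.6, 0.7, 0.4])
```

Here a feature survives only if at least four of the six selectors pick it, and the soft vote averages probabilities rather than counting hard class labels, so a confident classifier can outweigh an uncertain one.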
The identified biomarkers enhance the interpretability of the diagnosis, facilitating more informed decision-making. The system's performance underscores its effectiveness in tissue localization, contributing to improved patient outcomes through timely medical interventions.