Predicting pathological response to neoadjuvant chemoradiotherapy in locally advanced rectal cancer with two step feature selection and ensemble learning.

Authors

Qian Changshun, Yang Shuxin, Chen Yijing, Ge Ran, Shi Fangmin, Liu Chengnan, Wang Hui, Guo You

Affiliations

School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, 341000, China.

Medical Big Data and Bioinformatics Research Centre, First Affiliated Hospital of Gannan Medical University, Ganzhou, 341000, China.

Publication information

Sci Rep. 2025 Mar 22;15(1):9936. doi: 10.1038/s41598-025-94337-y.

Abstract

Patients with locally advanced rectal cancer (LARC) vary substantially in their response to neoadjuvant chemoradiotherapy (nCRT), and the distribution of responses is markedly imbalanced, which makes treatment response difficult to predict. This study aims to identify effective predictive biomarkers and to develop an ensemble learning-based model for assessing the response of LARC patients to nCRT. A two-step feature selection method was developed to identify predictive biomarkers: stable reversal gene pairs were derived from within-sample relative expression orderings (REOs) in LARC patients undergoing nCRT. Preliminary screening used four methods (MDFS, Boruta, MCFS, and VSOLassoBag) to form a candidate feature set. Secondary screening ranked these features by permutation importance and applied Incremental Feature Selection (IFS) with an Extreme Gradient Boosting (XGBoost) classifier to determine the final predictive gene pairs. The ensemble model BoostForest, which combines boosting and bagging, served as the predictive framework, with SHAP employed for interpretability. Two-step feature selection yielded a 32-gene pair signature (32-GPS) as the final predictive biomarker. In the test set, the model achieved an area under the precision-recall curve (AUPRC) of 0.983 and an accuracy of 0.988. In the validation cohort, the AUPRC was 0.785 and the accuracy was 0.898, lower than on the test set but still indicating strong model performance. BoostForest also achieved better overall performance than Random Forest, Support Vector Machine (SVM), and XGBoost. To evaluate the effectiveness of the 32-GPS, its performance was compared with two alternative feature sets: the lasso-gene pair signature (lasso-GPS), derived through lasso regression, and the 15-shared gene pair signature (15-SGPS), consisting of the gene pairs identified by all four feature selection methods. The 32-GPS outperformed both. In summary, the two-step feature selection method identified robust predictive biomarkers, and BoostForest outperformed Random Forest, SVM, and XGBoost in classification performance and predictive capability.
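The idea of a reversal gene pair can be illustrated with a small sketch. The abstract does not state the criterion used to call a pair "stable", so the 80% threshold below, the function names `reversal_pairs` and `reo_features`, and the toy expression values are all assumptions made for illustration, not the authors' procedure. A pair is kept here when one ordering dominates in responders and the opposite ordering dominates in non-responders; each sample is then encoded by the within-sample ordering of those pairs.

```python
from itertools import combinations


def fraction_higher(samples, high, low):
    """Fraction of samples in which gene `high` is expressed strictly above gene `low`."""
    return sum(s[high] > s[low] for s in samples) / len(samples)


def reversal_pairs(responders, non_responders, genes, min_fraction=0.8):
    """Find gene pairs whose expression ordering flips between the two groups.

    Returns pairs (a, b) oriented so that a > b holds in at least
    `min_fraction` of responders and b > a holds in at least
    `min_fraction` of non-responders. The threshold is an assumed value.
    """
    selected = []
    for a, b in combinations(genes, 2):
        if (fraction_higher(responders, a, b) >= min_fraction
                and fraction_higher(non_responders, b, a) >= min_fraction):
            selected.append((a, b))
        elif (fraction_higher(responders, b, a) >= min_fraction
                and fraction_higher(non_responders, a, b) >= min_fraction):
            selected.append((b, a))
    return selected


def reo_features(sample, pairs):
    """Encode one sample as 0/1 flags: 1 where the first gene of a pair is higher."""
    return [1 if sample[a] > sample[b] else 0 for a, b in pairs]


# Toy expression profiles (hypothetical values, one dict per patient).
responders = [
    {"G1": 5.0, "G2": 1.0, "G3": 3.0},
    {"G1": 7.0, "G2": 2.0, "G3": 2.5},
]
non_responders = [
    {"G1": 1.0, "G2": 4.0, "G3": 0.5},
    {"G1": 0.5, "G2": 6.0, "G3": 2.0},
]

pairs = reversal_pairs(responders, non_responders, ["G1", "G2", "G3"])
new_patient = {"G1": 6.0, "G2": 1.5, "G3": 2.0}
features = reo_features(new_patient, pairs)
```

Because each feature depends only on the ordering of two genes within a single sample, the encoding does not change under any monotone rescaling of that sample's expression values. The resulting binary vectors are the kind of input that a later feature selection and classification stage would consume.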

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8e63/11929819/2c88a4ed941f/41598_2025_94337_Fig1_HTML.jpg
