School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai, China.
Lab Invest. 2024 Mar;104(3):100320. doi: 10.1016/j.labinv.2023.100320. Epub 2023 Dec 28.
Despite the use of machine learning tools, it remains challenging to model cause-specific deaths in colorectal cancer (CRC) patients properly and to choose appropriate treatments. Here, we propose a feature selection framework, union with recursive feature elimination (U-RFE), to select the union feature sets that are crucial in CRC progression-specific mortality using The Cancer Genome Atlas (TCGA) dataset. Based on the union feature sets, we compared the performance of 5 classification algorithms, namely logistic regression (LR), support vector machines (SVM), random forest (RF), eXtreme gradient boosting (XGBoost), and Stacking, to identify the best model for classifying deaths into 4 categories. In the first stage of U-RFE, LR, SVM, and RF were used as base estimators to obtain feature subsets of equal size whose members were not all identical. Union analysis of the subsets was then performed to determine the final union feature set, effectively combining the advantages of the different algorithms. We found that the U-RFE framework could improve the performance of various models. Stacking outperformed LR, SVM, RF, and XGBoost in most scenarios. When the target feature number of the RFE was set to 50 and the union feature set contained 298 deterministic features, the Stacking model achieved F1_weighted, Recall_weighted, Precision_weighted, Accuracy, and Matthews correlation coefficient values of 0.851, 0.864, 0.854, 0.864, and 0.717, respectively. The performance on the minority categories was also significantly improved. Therefore, this recursive feature elimination-based approach to feature selection improves classification performance for CRC deaths using clinical and omics data, and for other data with high feature redundancy and class imbalance.
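The first stage described above (one RFE run per base estimator, followed by a union of the selected subsets) can be illustrated with a minimal sketch. It assumes scikit-learn, a synthetic 4-class dataset in place of the TCGA data, and default-style hyperparameters; the function name `union_rfe`, the linear SVM kernel, and all parameter values are illustrative choices, not details taken from the paper. Note also that this simplified version can return at most 3 × `n_target` features, whereas the abstract reports a 298-feature union at a target of 50, so the full framework must differ from this sketch in ways the abstract does not spell out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def union_rfe(X, y, n_target, step=1):
    """Run RFE once per base estimator; return (union indices, per-estimator subsets).

    Hypothetical helper: the estimator settings below are assumptions.
    """
    base_estimators = {
        "LR": LogisticRegression(max_iter=1000),
        # A linear kernel is assumed so that RFE can rank features via coef_.
        "SVM": SVC(kernel="linear"),
        "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    }
    subsets = {}
    for name, estimator in base_estimators.items():
        selector = RFE(estimator, n_features_to_select=n_target, step=step)
        selector.fit(X, y)
        # support_ is a boolean mask over the columns that survived elimination.
        subsets[name] = {i for i, keep in enumerate(selector.support_) if keep}
    # Union analysis: keep every feature chosen by at least one estimator.
    union = sorted(set().union(*subsets.values()))
    return union, subsets


# Synthetic stand-in for a 4-category outcome (not the TCGA data).
X, y = make_classification(
    n_samples=200,
    n_features=30,
    n_informative=8,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

union, subsets = union_rfe(X, y, n_target=5)
X_union = X[:, union]  # reduced matrix to pass on to downstream classifiers

for name, chosen in subsets.items():
    print(name, sorted(chosen))
print("union size:", len(union))
```

Each subset has exactly `n_target` members, but the three estimators rank features differently, so the union is typically larger than any single subset. The reduced matrix `X_union` is what a downstream classifier (for example, a stacking ensemble) would be trained on.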