Department of Electrical, Systems and Automatic Engineering, Universidad de León, Campus de Vegazana s/n, León 24071, Spain.
Grupo Investigación Interacciones Gen-Ambiente y Salud (GIIGAS), Centro de Investigación Biomédica en Red (CIBER), Spain.
Comput Methods Programs Biomed. 2019 Aug;177:219-229. doi: 10.1016/j.cmpb.2019.06.001. Epub 2019 Jun 4.
Risk prediction models aim to identify people at higher risk of developing a target disease. Feature selection is particularly important both to improve prediction model performance while avoiding overfitting and to identify the leading cancer risk (and protective) factors. Assessing the stability of feature selection/ranking algorithms becomes an important issue when the aim is to analyze the features with the greatest predictive power.
This work focuses on colorectal cancer, assessing several feature ranking algorithms in terms of performance for a set of risk prediction models (Neural Networks, Support Vector Machines (SVM), Logistic Regression, k-Nearest Neighbors and Boosted Trees). Additionally, their robustness is evaluated following a conventional approach with scalar stability metrics and a visual approach proposed in this work, which examines both the similarity among feature ranking techniques and their individual stability. A comparative analysis is carried out between the most relevant features found in this study and the features provided by experts according to state-of-the-art knowledge.
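As an illustration of what a scalar stability metric can look like, the sketch below scores a ranking algorithm by the mean pairwise Jaccard index of its top-k feature sets across repeated runs (for example, on different resamples of the data). This is an assumption made for illustration: the abstract does not say which stability metrics the study uses, and the feature names and rankings here are hypothetical.

```python
import itertools


def jaccard(a, b):
    """Jaccard index of two feature collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def top_k_stability(rankings, k):
    """Mean pairwise Jaccard index between the top-k features of each ranking.

    `rankings` is a list of feature rankings (best feature first), one per run.
    A value of 1.0 means every run selected the same top-k set.
    """
    tops = [ranking[:k] for ranking in rankings]
    pairs = list(itertools.combinations(tops, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical rankings from three runs of two ranking algorithms.
stable_runs = [
    ["age", "bmi", "alcohol", "fiber"],
    ["bmi", "age", "alcohol", "fiber"],
    ["age", "bmi", "fiber", "alcohol"],
]
unstable_runs = [
    ["age", "bmi", "alcohol", "fiber"],
    ["fiber", "alcohol", "bmi", "age"],
    ["alcohol", "age", "fiber", "bmi"],
]

stable_score = top_k_stability(stable_runs, k=2)
unstable_score = top_k_stability(unstable_runs, k=2)
print(f"stable algorithm:   {stable_score:.3f}")
print(f"unstable algorithm: {unstable_score:.3f}")
```

A set-based metric like this ignores the order within the top k; a rank-correlation measure would be the natural alternative when the ordering itself matters.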
The two best results in terms of Area Under the ROC Curve (AUC) are achieved by an SVM classifier using the top 41 features selected by the SVM wrapper approach (AUC=0.693) and by Logistic Regression using the top 40 features selected by the Pearson correlation coefficient (AUC=0.689). Experiments showed that feature selection improves classification performance, with AUC gains of 3.9% and 1.9% for the SVM and Logistic Regression classifiers, respectively, relative to the results obtained with the full feature set. The visual approach proposed in this work shows that the Neural Network-based wrapper ranking is the most unstable, while the Random Forest ranking is the most stable.
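To make the two ingredients of these results concrete, the following sketch ranks features by the absolute Pearson correlation with the class label and computes AUC as the probability that a randomly chosen case scores higher than a randomly chosen control. It is a toy example under stated assumptions, not the study's pipeline: the data and feature names are invented, and a single feature is used directly as the score, whereas the study trains classifiers on the top-ranked features.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(columns, labels):
    """Return feature names sorted by |Pearson r| with the label, best first."""
    return sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], labels)),
        reverse=True,
    )


def auc(scores, labels):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = case, 0 = control.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "f_noise": [2, 1, 2, 1, 2, 1],
    "f_weak": [1, 3, 2, 3, 2, 4],
    "f_signal": [1, 2, 3, 4, 5, 6],
}

ranking = rank_features(columns, labels)
aucs = {name: auc(columns[name], labels) for name in ranking}
for name in ranking:
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

Keeping only the first k names of `ranking` before fitting a classifier is the filter-style selection step; a wrapper approach would instead rank features using the classifier itself.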
This study demonstrates that stability and model performance should be studied jointly: Random Forest turned out to be the most stable algorithm but was outperformed by others in terms of model performance, whereas the SVM wrapper and the Pearson correlation coefficient are moderately stable while achieving good model performance.