Hassan S Sakira, Ruusuvuori Pekka, Latonen Leena, Huttunen Heikki
Department of Signal Processing, Tampere University of Technology, Tampere, Finland.
Pori Department, Tampere University of Technology, Pori, Finland; BioMediTech, University of Tampere, Tampere, Finland.
Cancer Inform. 2016 Apr 10;14(Suppl 5):75-85. doi: 10.4137/CIN.S30795. eCollection 2015.
In this paper, we study the problem of feature selection in cancer-related machine learning tasks. In particular, we study the accuracy and stability of different feature selection approaches within simple machine learning pipelines. Earlier studies have shown that, in certain cases, detection accuracy can easily reach 100% given enough training data. Here, however, we concentrate on simplifying the classification models and seek feature selection approaches that remain reliable even with extremely small sample sizes. We show that as many as 50% of the features can be discarded without compromising prediction accuracy. Moreover, we study the model selection problem along the ℓ1 regularization path of logistic regression classifiers. To this end, we compare a more traditional cross-validation approach with a recently proposed Bayesian error estimator.
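To make the model selection problem concrete, the following is a minimal sketch of choosing one model along an ℓ1 regularization path of logistic regression by cross-validation. It is not the pipeline used in the paper: the synthetic data, the proximal gradient solver, the λ grid, and all function names (`fit_l1_logreg`, `regularization_path`, `cv_errors`) are assumptions made for illustration, and the Bayesian error estimator is not reproduced here.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_l1_logreg(X, y, lam, w, b, step=0.25, iters=150):
    """Proximal gradient descent on mean log-loss + lam * ||w||_1 (bias unpenalized)."""
    n, d = len(X), len(w)
    w = list(w)
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            r = (predict_proba(w, b, xi) - yi) / n
            grad_b += r
            for j in range(d):
                grad_w[j] += r * xi[j]
        # Gradient step, then soft-thresholding, which sets small weights exactly to zero.
        for j in range(d):
            v = w[j] - step * grad_w[j]
            w[j] = math.copysign(max(abs(v) - step * lam, 0.0), v)
        b -= step * grad_b
    return w, b

def regularization_path(X, y, lambdas):
    """Fit one model per lambda (largest first), warm-starting from the previous fit."""
    w, b, path = [0.0] * len(X[0]), 0.0, []
    for lam in lambdas:
        w, b = fit_l1_logreg(X, y, lam, w, b)
        path.append((list(w), b))
    return path

def error_rate(w, b, X, y):
    wrong = sum((predict_proba(w, b, x) >= 0.5) != (yi == 1) for x, yi in zip(X, y))
    return wrong / len(y)

def cv_errors(X, y, lambdas, k=5):
    """Mean held-out error for each lambda over k folds."""
    totals = [0.0] * len(lambdas)
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        test = [i for i in range(len(y)) if i % k == fold]
        path = regularization_path([X[i] for i in train], [y[i] for i in train], lambdas)
        for m, (w, b) in enumerate(path):
            totals[m] += error_rate(w, b, [X[i] for i in test], [y[i] for i in test]) / k
    return totals

# Synthetic small-sample data (an assumption): 40 samples, 10 features,
# of which only the first two carry class information.
rng = random.Random(0)
n, d = 40, 10
y = [i % 2 for i in range(n)]
X = [[rng.gauss(0.0, 1.0) + ((2 * yi - 1) if j < 2 else 0.0) for j in range(d)] for yi in y]

# Largest useful penalty for balanced classes: at this value the all-zero model is optimal.
lam_max = max(abs(sum(x[j] * (0.5 - yi) for x, yi in zip(X, y))) / n for j in range(d))
lambdas = [lam_max * 0.5 ** m for m in range(8)]

path = regularization_path(X, y, lambdas)
nonzero = [sum(wj != 0.0 for wj in w) for w, _ in path]
errors = cv_errors(X, y, lambdas)
best = min(range(len(lambdas)), key=lambda m: errors[m])  # ties go to the sparser model

print("nonzero weights along the path:", nonzero)
print("CV error along the path:", [round(e, 3) for e in errors])
print("selected lambda:", lambdas[best])
```

Each point on the path corresponds to a different subset of selected features, so picking λ is at the same time a feature selection decision. With very few samples the cross-validated error itself is noisy, which is the setting in which an alternative error estimator becomes interesting.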