Umeå Plant Science Center, Department of Plant Physiology, Umeå University, 901 87 Umeå, Sweden.
BMC Bioinformatics. 2011 Oct 7;12:390. doi: 10.1186/1471-2105-12-390.
Machine learning is a powerful approach for describing and predicting classes in microarray data. Although several comparative studies have investigated the relative performance of various machine learning methods, these often do not account for the fact that performance (e.g. error rate) is a result of a series of analysis steps of which the most important are data normalization, gene selection and machine learning.
In this study, we used seven previously published cancer-related microarray data sets to compare the effects on classification performance of five normalization methods, three gene selection methods with 21 different numbers of selected genes, and eight machine learning methods. Performance in terms of error rate was rigorously estimated by repeatedly applying a double cross-validation approach. Since performance varies greatly between data sets, we devised an analysis method that first compares methods within individual data sets and then visualizes the comparisons across data sets. We discovered both well-performing individual methods and synergies between different methods.
Support Vector Machines with a radial basis kernel, linear kernel or polynomial kernel of degree 2 all performed consistently well across data sets. We show that there is a synergistic relationship between these methods, gene selection based on the t-test, and the selection of a relatively high number of genes. We also find that these methods benefit significantly from using normalized data, although it is hard to draw general conclusions about the relative performance of different normalization procedures.
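The double cross-validation and t-test gene selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the data are synthetic, normalization is omitted, and a nearest-centroid classifier stands in for the Support Vector Machines so that the example needs only the Python standard library. The inner loop picks the number of genes using the training samples alone, and the outer loop estimates the error rate on samples that played no part in that choice. The sketch assumes two classes labelled 0 and 1, with at least two samples of each class in every training fold.

```python
import random
from statistics import mean, variance

def t_scores(X, y):
    """Absolute Welch t-statistic of each gene (column) between classes 0 and 1."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
        scores.append(abs(mean(a) - mean(b)) / se if se > 0 else 0.0)
    return scores

def select_genes(X, y, k):
    """Indices of the k genes with the largest absolute t-statistic."""
    scores = t_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def fit_predict(X_tr, y_tr, X_te, genes):
    """Nearest-centroid classifier on the selected genes (stand-in for an SVM)."""
    cents = {}
    for c in (0, 1):
        rows = [r for r, label in zip(X_tr, y_tr) if label == c]
        cents[c] = [mean([r[j] for r in rows]) for j in genes]
    preds = []
    for r in X_te:
        dist = {c: sum((r[j] - m) ** 2 for j, m in zip(genes, cents[c])) for c in (0, 1)}
        preds.append(min(dist, key=dist.get))
    return preds

def folds(y, k, rng):
    """Stratified folds: shuffle, group by class, then deal indices round-robin."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    idx.sort(key=lambda i: y[i])  # stable sort keeps the shuffled order within a class
    return [idx[i::k] for i in range(k)]

def cv_error(X, y, choose_k, n_folds, rng):
    """Cross-validated error rate; choose_k(X_tr, y_tr) returns the number of genes."""
    errors = 0
    for test in folds(y, n_folds, rng):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        # Gene selection uses training samples only, so the test fold stays unseen.
        genes = select_genes(X_tr, y_tr, choose_k(X_tr, y_tr))
        preds = fit_predict(X_tr, y_tr, [X[i] for i in test], genes)
        errors += sum(p != y[i] for p, i in zip(preds, test))
    return errors / len(y)

def double_cv_error(X, y, gene_grid, outer_k=5, inner_k=3, seed=0):
    """Outer loop estimates error; inner loop tunes the number of selected genes."""
    rng = random.Random(seed)

    def tune(X_tr, y_tr):
        return min(gene_grid, key=lambda g: cv_error(X_tr, y_tr, lambda *_: g, inner_k, rng))

    return cv_error(X, y, tune, outer_k, rng)

def make_toy_data(n_per_class=20, n_genes=30, n_informative=5, shift=3.0, seed=1):
    """Synthetic expression matrix: the first n_informative genes differ between classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
                      for j in range(n_genes)])
            y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    print("double CV error rate:", double_cv_error(X, y, gene_grid=[2, 5, 10]))
```

Repeating the whole procedure with different seeds and averaging the resulting error rates would correspond to the repeated application mentioned in the abstract. Selecting genes inside each training fold, rather than once on the full data set, is what keeps the outer error estimate from being optimistically biased.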