Statnikov Alexander, Wang Lily, Aliferis Constantin F
Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA.
BMC Bioinformatics. 2008 Jul 22;9:319. doi: 10.1186/1471-2105-9-319.
Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology, with several molecular signatures on their way toward clinical deployment. Developing the best possible molecular signatures for patient care requires the most accurate classification algorithms available for microarray gene expression data. A large body of literature to date suggests that support vector machines can be considered "best of class" algorithms for classifying such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain.
In the present paper we identify methodological biases in prior work comparing random forests and support vector machines, and we conduct a new, rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underline the importance of sound research design in benchmarking and comparing bioinformatics algorithms.
We found that, both on average and in the majority of microarray datasets, support vector machines outperform random forests, whether no gene selection is performed or one of several popular gene selection methods is used.
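The two summary views mentioned above ("on average" and "in the majority of datasets") can be illustrated with a minimal sketch. It assumes each classifier already has one performance score per dataset, where higher is better. The function name `compare_paired`, the dataset labels, and the toy scores are hypothetical placeholders chosen for illustration; they are not the paper's code or results.

```python
from statistics import mean


def compare_paired(scores_a, scores_b):
    """Summarize two classifiers scored on the same datasets.

    scores_a and scores_b map a dataset name to one performance score
    (higher is better). Only datasets present in both are compared.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    if not shared:
        raise ValueError("the two score sets share no datasets")
    # Per-dataset difference: positive means classifier A scored higher.
    diffs = [scores_a[name] - scores_b[name] for name in shared]
    return {
        "n_datasets": len(shared),
        "mean_diff": mean(diffs),              # the "on average" view
        "wins_a": sum(d > 0 for d in diffs),   # the "majority of datasets" view
        "wins_b": sum(d < 0 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
    }


if __name__ == "__main__":
    # Toy placeholder scores, NOT values from the study.
    svm = {"toy_1": 0.875, "toy_2": 0.75, "toy_3": 0.5, "toy_4": 0.625}
    rf = {"toy_1": 0.75, "toy_2": 0.75, "toy_3": 0.625, "toy_4": 0.5}
    print(compare_paired(svm, rf))
```

Reporting both the mean difference and the win count guards against a single dataset with a large gap dominating the conclusion.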