Guo Yu, Balasubramanian Raji
BG Medicine, Inc., USA.
Int J Biostat. 2012 Jun 28;8(1):Article 17. doi: 10.1515/1557-4679.1373.
A central challenge in high dimensional data settings in biomedical investigations involves the estimation of an optimal prediction algorithm to distinguish between different disease phenotypes. A significant complicating aspect in these analyses can be attributed to the presence of features that exhibit statistical interactions. Indeed, in several clinical investigations such as genetic studies of complex diseases, it is of interest to specifically identify such features. In this paper, we compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in settings involving high dimensional datasets including statistically interacting feature subsets. We evaluate the performance of these classifiers under conditions of varying sample size, levels of signal-to-noise ratio and strength of statistical interactions among features. We summarize two datasets from studies in diabetes and cardiovascular disease involving gene expression, metabolomics and proteomics measurements and compare results obtained using the four classifiers.
Simulation studies revealed that the classifier Prediction Analysis for Microarrays had the highest classification accuracy when noise and statistical interactions were absent and feature distributions were multivariate Gaussian within each class. In the presence of statistical interactions and modest effect sizes, and in the absence of noise, Support Vector Machines achieved the best performance, followed closely by Random Forests. Random Forests was optimal in settings that included both significant levels of high dimensional noise features and statistical interactions between biomarker pairs. The data applications revealed similar trends in the relative performances of each classifier.
Random Forests had the highest classification accuracy among the four classifiers and was successful in incorporating interaction effects between features in the presence of noise in high dimensional datasets.
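The qualitative effect of statistical interactions and noise features on a classifier can be illustrated with a small toy simulation. The sketch below is not the authors' simulation design and does not implement any of the four classifiers as used in the paper. It assumes a pure two-feature interaction (the class is determined by the sign of the product of two Gaussian features), a plain nearest-centroid rule without shrinkage as a rough stand-in for centroid-based methods, and a basic k-nearest-neighbors rule. All function names, sample sizes, and the number of noise features are hypothetical choices made for illustration.

```python
import random

def simulate(n, n_noise, rng):
    """Draw n samples: two interacting signal features plus n_noise noise features."""
    X, y = [], []
    for _ in range(n):
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        # Assumption: the class depends only on the interaction (sign of a*b),
        # so neither feature is informative on its own.
        label = 1 if a * b > 0 else 0
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]
        X.append([a, b] + noise)
        y.append(label)
    return X, y

def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

def centroid_predict(X_train, y_train, X_test):
    """Assign each test point to the class with the nearest mean vector (no shrinkage)."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in X_test]

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training points."""
    preds = []
    for x in X_test:
        order = sorted(range(len(X_train)), key=lambda i: sq_dist(x, X_train[i]))
        votes = sum(y_train[i] for i in order[:k])
        preds.append(1 if 2 * votes > k else 0)
    return preds

def accuracy(y_true, y_pred):
    """Fraction of test labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def run(n_noise, seed=0, n_train=300, n_test=300):
    """Simulate train and test sets and report test accuracy for both rules."""
    rng = random.Random(seed)
    X_train, y_train = simulate(n_train, n_noise, rng)
    X_test, y_test = simulate(n_test, n_noise, rng)
    return {
        "centroid": accuracy(y_test, centroid_predict(X_train, y_train, X_test)),
        "knn": accuracy(y_test, knn_predict(X_train, y_train, X_test)),
    }

clean = run(n_noise=0)    # interaction only, no noise features
noisy = run(n_noise=50)   # same interaction buried among 50 noise features
print("no noise features:", clean)
print("50 noise features:", noisy)
```

In this toy setting the centroid rule is expected to stay near chance, because the class means coincide when the signal is a pure interaction, while the distance-based rule captures the interaction without noise features but degrades once they dominate the distance. This mirrors only the general pattern reported above, not its magnitude.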