Comparative evaluation of classifiers in the presence of statistical interactions between features in high dimensional data settings.

Authors

Guo Yu, Balasubramanian Raji

Affiliation

BG Medicine, Inc., USA.

Publication information

Int J Biostat. 2012 Jun 28;8(1):Article 17. doi: 10.1515/1557-4679.1373.

DOI:10.1515/1557-4679.1373

PMID:22752837

Abstract

BACKGROUND

A central challenge in high dimensional data settings in biomedical investigations involves the estimation of an optimal prediction algorithm to distinguish between different disease phenotypes. A significant complicating aspect in these analyses can be attributed to the presence of features that exhibit statistical interactions. Indeed, in several clinical investigations such as genetic studies of complex diseases, it is of interest to specifically identify such features. In this paper, we compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in settings involving high dimensional datasets including statistically interacting feature subsets. We evaluate the performance of these classifiers under conditions of varying sample size, levels of signal-to-noise ratio and strength of statistical interactions among features. We summarize two datasets from studies in diabetes and cardiovascular disease involving gene expression, metabolomics and proteomics measurements and compare results obtained using the four classifiers.
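As a rough illustration of what a statistical interaction between features means for a classifier, the sketch below simulates a pure XOR-type interaction between two features, optionally pads it with noise features, and compares two simple classifiers. It is not the authors' simulation design: the cluster layout, the noise standard deviations, the sample sizes, and every function name (`simulate`, `centroid_predict`, `knn_predict`, `run`) are assumptions made for this example. The centroid classifier is an unshrunken simplification of the nearest-shrunken-centroid idea behind Prediction Analysis for Microarrays, and Random Forests and Support Vector Machines are left out to keep the example dependency-free.

```python
import random

def simulate(n, n_noise, rng, sd=0.4):
    """Draw n samples whose class depends only on the sign product of two features."""
    X, y = [], []
    for _ in range(n):
        s1, s2 = rng.choice([-1, 1]), rng.choice([-1, 1])
        signal = [s1 + rng.gauss(0, sd), s2 + rng.gauss(0, sd)]
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]  # uninformative features
        X.append(signal + noise)
        y.append(1 if s1 * s2 > 0 else 0)  # pure interaction: no marginal effect
    return X, y

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid_predict(X_train, y_train, X_test):
    """Assign each test point to the class with the nearest mean vector."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in X_test]

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training points (k odd, labels 0/1)."""
    preds = []
    for x in X_test:
        order = sorted(range(len(X_train)), key=lambda i: sq_dist(x, X_train[i]))
        votes = sum(y_train[i] for i in order[:k])
        preds.append(1 if 2 * votes > k else 0)
    return preds

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def run(n_noise, seed=0, n_train=200, n_test=200):
    """Simulate one training and one test set and score both classifiers."""
    rng = random.Random(seed)
    X_tr, y_tr = simulate(n_train, n_noise, rng)
    X_te, y_te = simulate(n_test, n_noise, rng)
    return {
        "centroid": accuracy(y_te, centroid_predict(X_tr, y_tr, X_te)),
        "knn": accuracy(y_te, knn_predict(X_tr, y_tr, X_te)),
    }

clean = run(n_noise=0)
noisy = run(n_noise=50)
print("no noise features:", clean)
print("50 noise features:", noisy)
```

Because both class means sit near the origin in this toy setup, a centroid rule has almost nothing to separate on, whereas a local method such as K-Nearest Neighbors can pick up the interaction until uninformative features dominate the distance computation. Exact accuracies depend on the seed and the assumed parameters.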

RESULTS

Simulation studies revealed that Prediction Analysis for Microarrays had the highest classification accuracy when there was no noise, there were no statistical interactions, and feature distributions were multivariate Gaussian within each class. In the presence of statistical interactions and modest effect sizes but no noise, Support Vector Machines achieved the best performance, followed closely by Random Forests. Random Forests was optimal in settings that included both significant levels of high dimensional noise features and statistical interactions between biomarker pairs. The data applications revealed similar trends in the relative performance of the classifiers.

CONCLUSION

Random Forests had the highest classification accuracy among the four classifiers and was successful in incorporating interaction effects between features in the presence of noise in high dimensional datasets.

Similar articles

Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms.

BMC Bioinformatics. 2010 Sep 3;11:447. doi: 10.1186/1471-2105-11-447.

Bias in error estimation when using cross-validation for model selection.

BMC Bioinformatics. 2006 Feb 23;7:91. doi: 10.1186/1471-2105-7-91.

Class-imbalanced classifiers for high-dimensional data.

Brief Bioinform. 2013 Jan;14(1):13-26. doi: 10.1093/bib/bbs006. Epub 2012 Mar 9.

Applying one-vs-one and one-vs-all classifiers in k-nearest neighbour method and support vector machines to an otoneurological multi-class problem.

Stud Health Technol Inform. 2011;169:579-83.

Optimal number of features as a function of sample size for various classification rules.

Bioinformatics. 2005 Apr 15;21(8):1509-15. doi: 10.1093/bioinformatics/bti171. Epub 2004 Nov 30.

Gene expression profile class prediction using linear Bayesian classifiers.

Comput Biol Med. 2007 Dec;37(12):1690-9. doi: 10.1016/j.compbiomed.2007.04.001. Epub 2007 May 22.

Development of biomarker classifiers from high-dimensional data.

Brief Bioinform. 2009 Sep;10(5):537-46. doi: 10.1093/bib/bbp016. Epub 2009 Apr 3.

An efficient statistical feature selection approach for classification of gene expression data.

J Biomed Inform. 2011 Aug;44(4):529-35. doi: 10.1016/j.jbi.2011.01.001. Epub 2011 Jan 15.

Medical data mining by fuzzy modeling with selected features.

Artif Intell Med. 2008 Jul;43(3):195-206. doi: 10.1016/j.artmed.2008.04.004. Epub 2008 Jun 5.

Cited by

A Modified Random Survival Forests Algorithm for High Dimensional Predictors and Self-Reported Outcomes.

J Comput Graph Stat. 2018;27(4):763-772. doi: 10.1080/10618600.2018.1474115. Epub 2018 Aug 20.

Bayesian Variable Selection Methods for Matched Case-Control Studies.

Int J Biostat. 2017 Jan 31;13(1):/j/ijb.2017.13.issue-1/ijb-2016-0043/ijb-2016-0043.xml. doi: 10.1515/ijb-2016-0043.

Feature Selection Methods for Early Predictive Biomarker Discovery Using Untargeted Metabolomic Data.

Front Mol Biosci. 2016 Jul 8;3:30. doi: 10.3389/fmolb.2016.00030. eCollection 2016.
