Lu Danny, Weljie Aalim, de Leon Alexander R, McConnell Yarrow, Bathe Oliver F, Kopciuk Karen
Sick Kids Research Institute, 555 University Avenue, Toronto, ON, M5G 1X8, Canada.
Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, 10-113 Translational Research Center, 3400 Civic Center Blvd, Bldg 421, Philadelphia, PA, 19104, USA.
BMC Res Notes. 2017 Apr 4;10(1):143. doi: 10.1186/s13104-017-2461-8.
Variable selection is frequently carried out during the analysis of many types of high-dimensional data, including metabolomics data. This study compared the predictive performance of four variable selection methods combined with stability-based selection, a new secondary selection method implemented in the R package BioMark. Two of these methods were also evaluated using the better-known false discovery rate (FDR).
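The idea behind stability-based selection can be illustrated with a minimal Python sketch. It is not the BioMark implementation: the function names, the Welch t statistic as the primary ranking method, and the settings `n_rounds`, `keep_frac`, `top_k` and `min_freq` are all assumptions chosen for illustration. A variable is kept only if it ranks among the top few in a large share of random subsamples of the data.

```python
import random
import statistics


def welch_t(a, b):
    """Welch two-sample t statistic for two lists of numbers."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def stability_select(group_a, group_b, n_rounds=100, keep_frac=0.7,
                     top_k=2, min_freq=0.8, seed=0):
    """Return indices of variables ranked in the top_k by |t| in at least
    min_freq of the subsampling rounds. Rows are samples, columns variables.
    All default settings are illustrative assumptions."""
    rng = random.Random(seed)
    n_vars = len(group_a[0])
    counts = [0] * n_vars
    for _ in range(n_rounds):
        # Draw a random subset of samples from each group, without replacement.
        sub_a = rng.sample(group_a, max(2, int(keep_frac * len(group_a))))
        sub_b = rng.sample(group_b, max(2, int(keep_frac * len(group_b))))
        scores = [abs(welch_t([r[j] for r in sub_a], [r[j] for r in sub_b]))
                  for j in range(n_vars)]
        ranked = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    return [j for j in range(n_vars) if counts[j] / n_rounds >= min_freq]


# Toy data (hypothetical): 20 samples per group, 10 variables,
# with only variables 0 and 1 shifted in group B.
data_rng = random.Random(1)
n_per_group, n_vars = 20, 10
group_a = [[data_rng.gauss(0, 1) for _ in range(n_vars)]
           for _ in range(n_per_group)]
group_b = [[data_rng.gauss(5 if j < 2 else 0, 1) for j in range(n_vars)]
           for _ in range(n_per_group)]

selected = stability_select(group_a, group_b)
print(selected)
```

Counting how often a variable is selected, rather than relying on a single fit, is what makes the result less sensitive to the particular samples in the study.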
Simulation studies varied factors relevant to biological data studies, with results based on the median of 200 partial area under the receiver operating characteristic curve (pAUC) values. No single method performed best across all factor settings, but Student's t test, combined with either stability selection or FDR adjustment, and the variable importance in projection (VIP) scores from partial least squares regression models, obtained using a stability-based approach, tended to perform well in most settings. Similar results were found with a real spiked-in metabolomics dataset. Group sample size, group effect size, number of significant variables, and correlation structure were the most important factors, whereas the percentage of significant variables was the least important.
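For readers unfamiliar with the performance measure, the following is a minimal Python sketch of a partial AUC: the area under the ROC curve restricted to low false positive rates. The cut-off `max_fpr=0.1`, the function names, and the use of 0/1 labels are assumptions for illustration; the abstract does not state the cut-off or the exact procedure used in the study.

```python
def roc_points(labels, scores):
    """ROC curve as (fpr, tpr) points; labels are 1 (positive) or 0 (negative)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for k, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if k + 1 == len(order) or scores[order[k + 1]] != scores[i]:
            points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(labels, scores, max_fpr=0.1):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    pts = roc_points(labels, scores)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the TPR at the cut-off and stop there.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical example: two true biomarkers (label 1) scored above
# two noise variables (label 0).
labels = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.2, 0.1]
print(partial_auc(labels, scores, max_fpr=0.5))
```

In a simulation, one such value would be computed per replicate and the replicates summarised with `statistics.median`.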
Researchers can improve prediction scores for their study data by choosing VIP scores based on stability variable selection over the other approaches when the number of variables is small to modest, and by increasing the number of samples, even moderately. When the number of variables is high and there is block correlation among the significant variables (i.e., true biomarkers), the FDR-adjusted Student's t test performs best. The R package BioMark is an easy-to-use, open-source program for variable selection that had excellent performance characteristics for the purposes of this study.