Neonatal Research Centre, Health Research Institute La Fé, 46009 Valencia, Spain.
Talanta. 2013 Nov 15;116:835-40. doi: 10.1016/j.talanta.2013.07.048. Epub 2013 Aug 9.
Variable subset selection is often mandatory in high-throughput metabolomics and proteomics. However, depending on the variable-to-sample ratio, variable selection is highly susceptible to chance correlations. When the selection is performed on the entire data set and no external validation set is available, cross-validation estimates of the predictive capability of PLSDA models built after feature selection are overly optimistic. In this work, a simulation of the statistical null hypothesis is proposed to test whether the discrimination capability of a PLSDA model after variable selection, as estimated by cross-validation, is statistically higher than that attributable to chance correlations in the original data set. The statistical significance of the PLSDA CV figures of merit obtained after variable selection is expressed as p-values calculated with a permutation test that includes the variable selection step. The reliability of the approach is evaluated using two variable selection methods on experimental and simulated data sets with and without induced class differences. The proposed approach can be considered a useful tool when no external validation set is available, and it provides a straightforward way to evaluate differences between variable selection methods.
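The permutation scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: to stay within the Python standard library, it swaps in a nearest-centroid classifier for PLSDA, a standardized mean-difference filter for the variable selection methods, and leave-one-out accuracy as the CV figure of merit. The function names, the parameter `k`, and the "+1" correction in the p-value are assumptions made for illustration. The point it tries to show is that variable selection is repeated inside every label permutation, so the null distribution carries the same selection-induced optimism as the observed value.

```python
import random
import statistics


def _variance(values):
    # Population variance, computed with floats.
    m = statistics.fmean(values)
    return statistics.fmean([(v - m) ** 2 for v in values])


def select_variables(X, y, k):
    """Hypothetical filter: keep the k variables with the largest
    standardized difference between the two class means."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        spread = ((_variance(a) + _variance(b)) / 2) ** 0.5 or 1.0
        scores.append(abs(statistics.fmean(a) - statistics.fmean(b)) / spread)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def cv_accuracy(X, y, cols):
    """Leave-one-out accuracy of a nearest-centroid classifier
    (a stand-in for PLSDA) restricted to the selected columns."""
    hits = 0
    for i in range(len(X)):
        dist = {}
        for label in (0, 1):
            rows = [X[n] for n in range(len(X)) if n != i and y[n] == label]
            centroid = [statistics.fmean([r[j] for r in rows]) for j in cols]
            dist[label] = sum((X[i][j] - c) ** 2 for j, c in zip(cols, centroid))
        hits += min(dist, key=dist.get) == y[i]
    return hits / len(X)


def permutation_p_value(X, y, k, n_perm=200, seed=0):
    """Permutation test that includes the variable selection step."""
    rng = random.Random(seed)
    # Observed figure of merit: selection on the entire set, then CV.
    observed = cv_accuracy(X, y, select_variables(X, y, k))
    exceed = 0
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)
        # Selection is redone on the permuted labels before CV.
        null = cv_accuracy(X, y_perm, select_variables(X, y_perm, k))
        exceed += null >= observed
    # The +1 terms count the observed labelling as one permutation (assumed convention).
    return observed, (exceed + 1) / (n_perm + 1)


# Simulated data without induced class differences: 16 samples, 40 noise variables.
rng = random.Random(42)
y_demo = [0] * 8 + [1] * 8
X_demo = [[rng.gauss(0, 1) for _ in range(40)] for _ in y_demo]
acc, p = permutation_p_value(X_demo, y_demo, k=3, n_perm=100)
print(f"CV accuracy after selection: {acc:.2f}, permutation p-value: {p:.3f}")
```

On pure-noise data such as `X_demo`, the CV accuracy alone may look encouraging because the selection step has already seen every label. The p-value places that accuracy against values obtained the same way from permuted labels, which is the comparison the abstract argues for when no external validation set exists.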