Department of Molecular Sciences, Swedish University of Agricultural Sciences, Uppsala SE-750 07, Sweden.
Department of Biology and Biological Engineering, Food and Nutrition Science, Chalmers University of Technology, Gothenburg SE-412 96, Sweden.
Bioinformatics. 2019 Mar 15;35(6):972-980. doi: 10.1093/bioinformatics/bty710.
Validation of variable selection and predictive performance is crucial in the construction of robust multivariate models that generalize well, minimize overfitting and facilitate interpretation of results. Inappropriate variable selection instead leads to selection bias, thereby increasing the risk of model overfitting and false positive discoveries. Although several algorithms exist to identify a minimal set of most informative variables (i.e. the minimal-optimal problem), few can select all variables related to the research question (i.e. the all-relevant problem). Robust algorithms combining identification of both minimal-optimal and all-relevant variables with proper cross-validation are urgently needed.
We developed the MUVR algorithm to improve predictive performance and minimize overfitting and false positives in multivariate analysis. In the MUVR algorithm, minimal variable selection is achieved by performing recursive variable elimination within a repeated double cross-validation (rdCV) procedure. The algorithm supports partial least squares and random forest modelling, and simultaneously identifies minimal-optimal and all-relevant variable sets for regression, classification and multilevel analyses. On three authentic omics datasets, MUVR yielded parsimonious models with minimal overfitting and improved model performance compared with state-of-the-art rdCV. Moreover, MUVR showed advantages over other variable selection algorithms, namely Boruta and VSURF, including a validation scheme integrated with variable selection and wider applicability.
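The core idea described above, recursive variable elimination nested inside the outer loop of a repeated double cross-validation, can be sketched as follows. This is an illustrative Python/scikit-learn sketch, not the authors' R implementation: the function name `rdcv_rfe`, the parameter defaults, and the use of the random forest out-of-bag score as a stand-in for MUVR's inner cross-validation loop are all assumptions made for brevity.

```python
# Illustrative sketch of recursive variable elimination inside repeated
# double cross-validation (rdCV), in the spirit of MUVR. NOT the authors'
# code: the OOB score replaces the inner CV loop for simplicity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def rdcv_rfe(X, y, n_rep=2, n_outer=4, var_ratio=0.75):
    """Return per-variable selection frequency and mean outer-loop accuracy."""
    n_vars = X.shape[1]
    freq = np.zeros(n_vars)   # how often each variable survives selection
    accs = []                 # outer test-segment accuracies
    for rep in range(n_rep):
        outer = StratifiedKFold(n_outer, shuffle=True, random_state=rep)
        for tr, te in outer.split(X, y):
            keep = np.arange(n_vars)
            best_keep, best_score = keep.copy(), -np.inf
            while True:
                # Fit on the outer training segment; OOB score stands in
                # for the inner validation loop of the real algorithm.
                rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                                            random_state=rep)
                rf.fit(X[tr][:, keep], y[tr])
                if rf.oob_score_ > best_score:
                    best_score, best_keep = rf.oob_score_, keep.copy()
                if len(keep) <= 2:
                    break
                # Keep the top var_ratio fraction by importance, drop the rest.
                order = np.argsort(rf.feature_importances_)[::-1]
                n_keep = max(2, int(round(len(keep) * var_ratio)))
                if n_keep == len(keep):
                    n_keep = len(keep) - 1
                keep = keep[order[:n_keep]]
            freq[best_keep] += 1
            # Validate the selected subset on the untouched outer test segment.
            rf = RandomForestClassifier(n_estimators=100, random_state=rep)
            rf.fit(X[tr][:, best_keep], y[tr])
            accs.append(rf.score(X[te][:, best_keep], y[te]))
    return freq / (n_rep * n_outer), float(np.mean(accs))


# Toy usage on simulated data with 5 informative variables out of 30.
X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           random_state=1)
sel_freq, outer_acc = rdcv_rfe(X, y)
```

The key property the sketch preserves is that variable elimination happens strictly inside each outer training segment, so the outer test segments never influence selection; this separation is what keeps the reported performance estimate free of selection bias.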
Algorithms, data, scripts and tutorial are open source and available as an R package ('MUVR') at https://gitlab.com/CarlBrunius/MUVR.git.
Supplementary data are available at Bioinformatics online.