Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel, Germany.
Bioinformatics. 2019 Oct 1;35(19):3663-3671. doi: 10.1093/bioinformatics/btz149.
The machine learning approach random forest has been applied successfully to omics data, such as gene expression data, both for classification or regression and for selecting variables that are important for prediction. However, the complex relationships between predictor variables, in particular between causal predictor variables, make the results of currently applied variable selection techniques difficult to interpret.
Here we propose a new variable selection approach called surrogate minimal depth (SMD) that incorporates surrogate variables into the concept of minimal depth (MD) variable importance. Applying SMD, we show that simulated correlation patterns can be reconstructed and that the increased consideration of variable relationships improves variable selection. When compared with existing state-of-the-art methods and MD, SMD has higher empirical power to identify causal variables while the resulting variable lists are equally stable. In conclusion, SMD is a promising approach to get more insight into the complex interplay of predictor variables and outcome in a high-dimensional data setting.
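To illustrate the idea, the following is a minimal sketch of how a minimal depth could be computed for a single tree, with and without surrogate variables. It is not the authors' implementation: the `Node` class, the `minimal_depths` function, and the toy tree are hypothetical and introduced only for illustration. The sketch assumes that each node stores one primary split variable plus a list of surrogate variables, and that the depth of a variable is the depth of the shallowest node at which it appears (root depth 0). How depths are aggregated across the trees of a forest and how a selection threshold is chosen are not covered here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical tree node: one primary split variable plus optional surrogates."""
    primary: str
    surrogates: List[str] = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def minimal_depths(root: Node, use_surrogates: bool = False) -> Dict[str, int]:
    """Return, per variable, the depth of the shallowest node that uses it.

    With use_surrogates=False only primary split variables count (MD-style).
    With use_surrogates=True surrogate variables of a node count as well
    (the SMD-style idea, under the assumptions stated above).
    """
    depths: Dict[str, int] = {}
    stack = [(root, 0)]  # iterative traversal: (node, depth of node)
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        variables = [node.primary]
        if use_surrogates:
            variables += node.surrogates
        for name in variables:
            # Keep the smallest depth seen so far for each variable.
            if name not in depths or depth < depths[name]:
                depths[name] = depth
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return depths


# Toy tree (made up): the root splits on x1 with surrogate x2,
# the left child splits on x3 with surrogate x4, the right child splits on x2.
tree = Node(
    primary="x1",
    surrogates=["x2"],
    left=Node(primary="x3", surrogates=["x4"]),
    right=Node(primary="x2"),
)

md = minimal_depths(tree, use_surrogates=False)
smd = minimal_depths(tree, use_surrogates=True)
```

In this toy tree, `x2` moves from depth 1 to depth 0 once surrogates are counted, and `x4`, which is never a primary split variable, receives a depth at all. This is the sense in which considering surrogates can bring variables that are related to strong predictors into the selection.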
Code is available at https://github.com/StephanSeifert/SurrogateMinimalDepth.
Supplementary data are available at Bioinformatics online.