Evaluation of variable selection methods for random forests and omics data sets.

Affiliations

Institute of Clinical Molecular Biology, Kiel University, Germany.

Institute of Medical Informatics and Statistics, Kiel University, Germany.

Publication information

Brief Bioinform. 2019 Mar 22;20(2):492-503. doi: 10.1093/bib/bbx124.

Abstract

Machine learning methods and in particular random forests are promising approaches for prediction based on high dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the objective is the identification of involved variables to find active networks and pathways, approaches that aim to select all relevant variables should be preferred. We evaluated several variable selection procedures based on simulated data as well as publicly available experimental methylation and gene expression data. Our comparison included the Boruta algorithm, the Vita method, recurrent relative variable importance, a permutation approach and its parametric variant (Altmann) as well as recursive feature elimination (RFE). In our simulation studies, Boruta was the most powerful approach, followed closely by the Vita method. Both approaches demonstrated similar stability in variable selection, while Vita was the most robust approach under a pure null model without any predictor variables related to the outcome. In the analysis of the different experimental data sets, Vita demonstrated slightly better stability in variable selection and was less computationally intensive than Boruta. In conclusion, we recommend the Boruta and Vita approaches for the analysis of high-dimensional data sets. Vita is considerably faster than Boruta and thus more suitable for large data sets, but only Boruta can also be applied in low-dimensional settings.
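
The permutation approach mentioned in the abstract can be illustrated with a small sketch. The general idea, as commonly described, is to shuffle the outcome many times, recompute each variable's importance on the shuffled data, and report how often the null importance reaches the observed one. The code below is only an illustration under stated assumptions: it uses the absolute Pearson correlation as a stand-in for a random forest importance measure, and the function names, toy data, and number of permutations are hypothetical and not taken from the paper.

```python
import random
import statistics


def abs_correlation(x, y):
    """Stand-in importance score (assumption): absolute Pearson correlation.

    A real analysis would use a random forest importance measure instead.
    """
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_p_values(columns, outcome, n_perm=200, seed=0):
    """Empirical p-value per variable from a permuted-outcome null distribution."""
    rng = random.Random(seed)
    observed = [abs_correlation(col, outcome) for col in columns]
    exceed = [0] * len(columns)
    shuffled = list(outcome)
    for _ in range(n_perm):
        # Shuffling the outcome breaks any link between predictors and outcome.
        rng.shuffle(shuffled)
        for j, col in enumerate(columns):
            if abs_correlation(col, shuffled) >= observed[j]:
                exceed[j] += 1
    # Add-one correction so that no p-value is exactly zero.
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Toy data (hypothetical): one variable related to the outcome, one pure noise.
data_rng = random.Random(1)
n_samples = 100
signal = [data_rng.gauss(0, 1) for _ in range(n_samples)]
noise = [data_rng.gauss(0, 1) for _ in range(n_samples)]
outcome = [s + data_rng.gauss(0, 0.5) for s in signal]

p_values = permutation_p_values([signal, noise], outcome)
print(p_values)
```

Variables whose p-value falls below a chosen threshold would be kept. With a random forest in place of the correlation score, every permutation requires refitting the forest, which is why permutation-based procedures of this kind can become expensive on large omics data sets.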

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3f34/6433899/9072ab28e8ca/bbx124f1.jpg
