Surrogate minimal depth as an importance measure for variables in random forests.

Affiliation

Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel, Germany.

Publication information

Bioinformatics. 2019 Oct 1;35(19):3663-3671. doi: 10.1093/bioinformatics/btz149.

DOI: 10.1093/bioinformatics/btz149

PMID: 30824905

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6761946/

Abstract

MOTIVATION

It has been shown that the machine learning approach random forest can be successfully applied to omics data, such as gene expression data, for classification or regression and to select variables that are important for prediction. However, the complex relationships between predictor variables, in particular between causal predictor variables, make the interpretation of currently applied variable selection techniques difficult.

RESULTS

Here we propose a new variable selection approach called surrogate minimal depth (SMD) that incorporates surrogate variables into the concept of minimal depth (MD) variable importance. Applying SMD, we show that simulated correlation patterns can be reconstructed and that the increased consideration of variable relationships improves variable selection. When compared with existing state-of-the-art methods and MD, SMD has higher empirical power to identify causal variables while the resulting variable lists are equally stable. In conclusion, SMD is a promising approach to get more insight into the complex interplay of predictor variables and outcome in a high-dimensional data setting.
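The idea of extending minimal depth with surrogate variables can be illustrated with a small sketch. This is not the SurrogateMinimalDepth package's code or API. It assumes a hypothetical `Node` structure in which each internal node stores one primary split variable and a list of surrogate variables, and it assumes that a variable's surrogate minimal depth in a tree is the shallowest layer at which it appears as either a primary or a surrogate split variable (root = layer 0). Averaging over the trees of a forest and thresholding are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical internal tree node (illustration only, not a library type)."""
    split_var: str                                        # primary split variable
    surrogates: List[str] = field(default_factory=list)   # surrogate split variables
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def minimal_depths(root: Node, use_surrogates: bool = False) -> Dict[str, int]:
    """Return the shallowest layer at which each variable appears in one tree.

    With use_surrogates=False only primary split variables count (plain MD).
    With use_surrogates=True surrogate variables count as well (assumed SMD rule).
    """
    depths: Dict[str, int] = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        names = [node.split_var]
        if use_surrogates:
            names += node.surrogates
        for name in names:
            # keep the smallest depth seen so far for this variable
            depths[name] = min(depth, depths.get(name, depth))
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return depths


# Toy tree: x2 is a surrogate for x1 at the root and a primary split one layer down.
tree = Node(
    "x1", ["x2"],
    left=Node("x3", ["x4"]),
    right=Node("x2", ["x1"]),
)

md = minimal_depths(tree)                        # primary splits only
smd = minimal_depths(tree, use_surrogates=True)  # primary and surrogate splits
```

In this toy tree, `x2` moves from layer 1 to layer 0 once surrogates are counted, and `x4`, which never appears as a primary split, receives a depth at all. That is the kind of effect the abstract refers to as increased consideration of variable relationships.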

AVAILABILITY AND IMPLEMENTATION

https://github.com/StephanSeifert/SurrogateMinimalDepth.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
