LAVASET: Latent Variable Stochastic Ensemble of Trees. An ensemble method for correlated datasets with spatial, spectral, and temporal dependencies.

Affiliations

Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion, and Reproduction, Faculty of Medicine, Imperial College London, London W12 0NN, United Kingdom.

Faculty of Medicine, National Heart & Lung Institute, Imperial College London, London W12 0NN, United Kingdom.

Publication information

Bioinformatics. 2024 Mar 4;40(3). doi: 10.1093/bioinformatics/btae101.

DOI: 10.1093/bioinformatics/btae101

PMID: 38383048

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11212485/

Abstract

MOTIVATION

Random forests (RFs) can deal with a large number of variables, achieve reasonable prediction scores, and yield highly interpretable feature importance values. As such, RFs are appropriate models for feature selection and further dimension reduction. However, RFs are often not appropriate for correlated datasets due to their mode of selecting individual features for splitting. Addressing correlation relationships in high-dimensional datasets is imperative for reducing the number of variables that are assigned high importance, hence making the dimension reduction most efficient. Here, we propose the LAtent VAriable Stochastic Ensemble of Trees (LAVASET) method that derives latent variables based on the distance characteristics of each feature and aims to incorporate the correlation factor in the splitting step.
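The idea of splitting on a latent variable rather than on a single feature can be sketched as follows. This is a minimal illustration and not the LAVASET package's API. The abstract does not state how the latent variable is computed, so the sketch assumes it is the first principal component of a feature and its k nearest neighbours, with neighbours chosen by distance along a 1D axis. The function names (`neighbourhood_latent`, `gini`, `best_split`), the parameter `k`, and the simulated data are all hypothetical.

```python
import numpy as np


def neighbourhood_latent(X, feature, positions, k=2):
    """Project a feature and its k nearest neighbours onto their first PC.

    Assumption: neighbours are picked by absolute distance between
    1D feature positions (e.g. spectral bins or time points).
    """
    dist = np.abs(positions - positions[feature])
    # The feature itself has distance 0, so it is always included.
    neighbours = np.sort(np.argsort(dist, kind="stable")[: k + 1])
    block = X[:, neighbours] - X[:, neighbours].mean(axis=0)
    # Rows of vt are the principal axes of the centred block.
    _, _, vt = np.linalg.svd(block, full_matrices=False)
    return neighbours, block @ vt[0]


def gini(y):
    """Gini impurity for binary labels coded as 0/1."""
    p = np.mean(y)
    return 2.0 * p * (1.0 - p)


def best_split(z, y):
    """Threshold on z that minimises the weighted Gini impurity."""
    order = np.argsort(z)
    z, y = z[order], y[order]
    best_score, best_threshold = np.inf, None
    for i in range(1, z.size):
        if z[i] == z[i - 1]:
            continue  # no threshold fits between equal values
        w = i / z.size
        score = w * gini(y[:i]) + (1.0 - w) * gini(y[i:])
        if score < best_score:
            best_score, best_threshold = score, 0.5 * (z[i] + z[i - 1])
    return best_score, best_threshold


# Simulated example: five adjacent features (10..14) carry the class signal.
rng = np.random.default_rng(0)
n, p = 200, 30
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, 10:15] += 1.5 * y[:, None]
positions = np.arange(p, dtype=float)

neighbours, z = neighbourhood_latent(X, feature=12, positions=positions, k=2)
impurity, threshold = best_split(z, y)
```

In this sketch a tree node would evaluate `best_split` on `z` instead of on the raw column `X[:, 12]`, so any importance gained at that node could be credited to every feature in `neighbours` rather than to one column alone.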

RESULTS

Without compromising on performance in the majority of examples, LAVASET outperforms RF by accurately determining feature importance across all correlated variables and ensuring proper distribution of importance values. LAVASET yields mostly non-inferior prediction accuracies to traditional RFs when tested in simulated and real 1D datasets, as well as more complex and high-dimensional 3D datatypes. Unlike traditional RFs, LAVASET is unaffected by single 'important' noisy features (false positives), as it considers the local neighbourhood. LAVASET, therefore, highlights neighbourhoods of features, reflecting real signals that collectively impact the model's predictive ability.

AVAILABILITY AND IMPLEMENTATION

LAVASET is freely available as a standalone package from https://github.com/melkasapi/LAVASET.

Cited by

Topological embedding and directional feature importance in ensemble classifiers for multi-class classification.

Comput Struct Biotechnol J. 2024 Nov 13;23:4108-4123. doi: 10.1016/j.csbj.2024.11.013. eCollection 2024 Dec.
