Provable Boolean interaction recovery from tree ensemble obtained via random forests.

Author information

Department of Statistics, University of California, Berkeley, CA 94720.

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.

Publication information

Proc Natl Acad Sci U S A. 2022 May 31;119(22):e2118636119. doi: 10.1073/pnas.2118636119. Epub 2022 May 24.

DOI: 10.1073/pnas.2118636119

PMID: 35609192

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9295780/

Abstract

Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
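
To make the LSS description concrete, the following is a minimal sketch of a regression function built as a linear combination of piecewise constant Boolean interaction terms, plus a small simulator. It is not the authors' code: the feature indices, thresholds, coefficients, noise level, uniform feature distribution, and the helper names `boolean_term`, `lss_mean`, and `simulate` are all illustrative assumptions, since the abstract gives no specific values.

```python
import random

# Hypothetical LSS-style model; every number below is an illustrative assumption.
# Each interaction is (coefficient, list of signed features), where a signed
# feature is (feature index, threshold, sign): sign=+1 means x[k] > threshold,
# sign=-1 means x[k] <= threshold.
INTERCEPT = 0.0
INTERACTIONS = [
    (2.0, [(0, 0.5, +1), (1, 0.5, +1)]),   # 2.0 * 1(x0 > 0.5) * 1(x1 > 0.5)
    (-1.0, [(2, 0.3, -1), (3, 0.7, +1)]),  # -1.0 * 1(x2 <= 0.3) * 1(x3 > 0.7)
]


def boolean_term(x, signed_features):
    """Return 1 if every signed threshold condition holds, else 0."""
    for k, threshold, sign in signed_features:
        active = x[k] > threshold if sign > 0 else x[k] <= threshold
        if not active:
            return 0
    return 1


def lss_mean(x):
    """Piecewise constant regression function: intercept plus weighted Boolean terms."""
    return INTERCEPT + sum(beta * boolean_term(x, s) for beta, s in INTERACTIONS)


def simulate(n, p=10, noise_sd=0.1, seed=0):
    """Draw n samples with features uniform on [0, 1] (an assumption) and Gaussian noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.random() for _ in range(p)]
        y = lss_mean(x) + rng.gauss(0.0, noise_sd)
        rows.append((x, y))
    return rows
```

Under these assumed settings, the response depends only on features 0 to 3, and the remaining features are pure noise. An interaction discovery procedure of the kind the abstract describes would be expected to recover the two signed feature sets rather than the coefficients themselves.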

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0677/9295780/1874445eb127/pnas.2118636119fig01.jpg

Similar articles

Provable Boolean interaction recovery from tree ensemble obtained via random forests.

Proc Natl Acad Sci U S A. 2022 May 31;119(22):e2118636119. doi: 10.1073/pnas.2118636119. Epub 2022 May 24.

Iterative random forests to discover predictive and stable high-order interactions.

Proc Natl Acad Sci U S A. 2018 Feb 20;115(8):1943-1948. doi: 10.1073/pnas.1711236115. Epub 2018 Jan 19.

LANDMark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data.

BMC Bioinformatics. 2022 Mar 31;23(1):110. doi: 10.1186/s12859-022-04631-z.

Network inference with ensembles of bi-clustering trees.

BMC Bioinformatics. 2019 Oct 28;20(1):525. doi: 10.1186/s12859-019-3104-y.

Predictive modeling of blood pressure during hemodialysis: a comparison of linear model, random forest, support vector regression, XGBoost, LASSO regression and ensemble method.

Comput Methods Programs Biomed. 2020 Oct;195:105536. doi: 10.1016/j.cmpb.2020.105536. Epub 2020 May 22.

MediBoost: a Patient Stratification Tool for Interpretable Decision Making in the Era of Precision Medicine.

Sci Rep. 2016 Nov 30;6:37854. doi: 10.1038/srep37854.

A Novel Consistent Random Forest Framework: Bernoulli Random Forests.

IEEE Trans Neural Netw Learn Syst. 2018 Aug;29(8):3510-3523. doi: 10.1109/TNNLS.2017.2729778. Epub 2017 Aug 15.

Random forests ensemble classifier trained with data resampling strategy to improve cardiac arrhythmia diagnosis.

Comput Biol Med. 2011 May;41(5):265-71. doi: 10.1016/j.compbiomed.2011.03.001. Epub 2011 Mar 17.

Comparison of the performance of decision tree (DT) algorithms and extreme learning machine (ELM) model in the prediction of water quality of the Upper Green River watershed.

Water Environ Res. 2021 Nov;93(11):2360-2373. doi: 10.1002/wer.1642. Epub 2021 Oct 4.

Predicting Health Material Accessibility: Development of Machine Learning Algorithms.

JMIR Med Inform. 2021 Sep 1;9(9):e29175. doi: 10.2196/29175.

Cited by

Fast Interpretable Greedy-Tree Sums.

Proc Natl Acad Sci U S A. 2025 Feb 18;122(7):e2310151122. doi: 10.1073/pnas.2310151122. Epub 2025 Feb 14.

Learning epistatic polygenic phenotypes with Boolean interactions.

PLoS One. 2024 Apr 16;19(4):e0298906. doi: 10.1371/journal.pone.0298906. eCollection 2024.

Machine learning-based dynamic prediction of lateral lymph node metastasis in patients with papillary thyroid cancer.

Front Endocrinol (Lausanne). 2022 Oct 10;13:1019037. doi: 10.3389/fendo.2022.1019037. eCollection 2022.

References

Conditional permutation importance revisited.

BMC Bioinformatics. 2020 Jul 14;21(1):307. doi: 10.1186/s12859-020-03622-2.

Veridical data science.

Proc Natl Acad Sci U S A. 2020 Feb 25;117(8):3920-3929. doi: 10.1073/pnas.1901326117. Epub 2020 Feb 13.

A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks.

Genes (Basel). 2019 Dec 2;10(12):996. doi: 10.3390/genes10120996.

Formal Hypothesis Tests for Additive Structure in Random Forests.

J Comput Graph Stat. 2017;26(3):589-597. doi: 10.1080/10618600.2016.1256817. Epub 2017 Apr 17.

Bias in the intervention in prediction measure in random forests: illustrations and recommendations.

Bioinformatics. 2019 Jul 1;35(13):2343-2345. doi: 10.1093/bioinformatics/bty959.

The revival of the Gini importance?

Bioinformatics. 2018 Nov 1;34(21):3711-3718. doi: 10.1093/bioinformatics/bty373.

Iterative random forests to discover predictive and stable high-order interactions.

Proc Natl Acad Sci U S A. 2018 Feb 20;115(8):1943-1948. doi: 10.1073/pnas.1711236115. Epub 2018 Jan 19.

Integrative annotation of chromatin elements from ENCODE data.

Nucleic Acids Res. 2013 Jan;41(2):827-41. doi: 10.1093/nar/gks1284. Epub 2012 Dec 5.

Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Brief Bioinform. 2013 May;14(3):315-26. doi: 10.1093/bib/bbs034. Epub 2012 Jul 10.

Random forests for genomic data analysis.

Genomics. 2012 Jun;99(6):323-9. doi: 10.1016/j.ygeno.2012.04.003. Epub 2012 Apr 21.
