
Provable Boolean interaction recovery from tree ensemble obtained via random forests.

Affiliations

Department of Statistics, University of California, Berkeley, CA 94720.

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.

Publication information

Proc Natl Acad Sci U S A. 2022 May 31;119(22):e2118636119. doi: 10.1073/pnas.2118636119. Epub 2022 May 24.

DOI:10.1073/pnas.2118636119
PMID:35609192
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9295780/
Abstract

Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequentially, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
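The two objects named in the abstract can be illustrated with a small sketch. It is not the authors' code. The first function evaluates a regression function that is a linear combination of piecewise constant Boolean interaction terms, which is how the abstract describes the LSS model; the thresholds, signs, uniform features, and Gaussian noise are assumptions made for illustration. The second function is a simplified stand-in for the idea behind DWP: it weights each root-to-leaf path of a single tree by 2^(-depth) and reports the weighted share of paths that contain every signed feature in a given set. The names `lss_regression_function`, `simulate_lss`, and `depth_weighted_cooccurrence`, and the path representation, are hypothetical; the abstract does not give the exact definition used in the paper.

```python
import numpy as np


def lss_regression_function(X, interactions, coefs, intercept=0.0):
    """Evaluate a linear combination of Boolean interaction terms.

    interactions: one list per interaction, holding (feature, threshold, sign)
    triples. sign = +1 means the factor is active when X[:, feature] > threshold,
    sign = -1 means it is active when X[:, feature] <= threshold.
    (Illustrative parameterization, assumed rather than taken from the paper.)
    """
    X = np.asarray(X, dtype=float)
    out = np.full(X.shape[0], intercept, dtype=float)
    for beta, terms in zip(coefs, interactions):
        active = np.ones(X.shape[0], dtype=bool)
        for k, gamma, sign in terms:
            active &= (X[:, k] > gamma) if sign > 0 else (X[:, k] <= gamma)
        out += beta * active  # each term is piecewise constant: beta or 0
    return out


def simulate_lss(n, p, interactions, coefs, noise_sd=0.1, seed=0):
    """Draw uniform features and a noisy response (assumed noise model)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, p))
    y = lss_regression_function(X, interactions, coefs)
    y = y + rng.normal(0.0, noise_sd, size=n)
    return X, y


def depth_weighted_cooccurrence(paths, signed_set):
    """Weighted share of root-to-leaf paths that contain all of signed_set.

    paths: one list per leaf of the (feature, sign) splits on its path.
    Each path gets weight 2 ** (-depth). Simplified proxy, not the paper's DWP.
    """
    target = set(signed_set)
    total = sum(2.0 ** -len(path) for path in paths)
    hit = sum(2.0 ** -len(path) for path in paths if target <= set(path))
    return hit / total


# Two interactions: (X0 > 0.5 AND X1 > 0.5) with weight 2, (X2 <= 0.5) with weight 1.
interactions = [[(0, 0.5, +1), (1, 0.5, +1)], [(2, 0.5, -1)]]
coefs = [2.0, 1.0]
X_demo = np.array([[0.9, 0.9, 0.1], [0.9, 0.1, 0.1], [0.1, 0.1, 0.9]])
print(lss_regression_function(X_demo, interactions, coefs))

# Toy tree: root splits on feature 0; its right child splits on feature 1.
toy_paths = [[(0, -1)], [(0, +1), (1, -1)], [(0, +1), (1, +1)]]
print(depth_weighted_cooccurrence(toy_paths, {(0, +1), (1, +1)}))
```

In the toy tree, the signed pair {(0, +1), (1, +1)} lies on one path of depth 2, so the proxy equals 0.25. To extend this to an ensemble, one would average the quantity over trees, again as an assumption rather than the paper's procedure.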


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0677/9295780/1874445eb127/pnas.2118636119fig01.jpg


