Provable Boolean interaction recovery from tree ensemble obtained via random forests.

Affiliations

Department of Statistics, University of California, Berkeley, CA 94720.

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.

Publication information

Proc Natl Acad Sci U S A. 2022 May 31;119(22):e2118636119. doi: 10.1073/pnas.2118636119. Epub 2022 May 24.

Abstract

Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery, which is central to advancing functional genomics and precision medicine. However, theoretical studies of how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently the features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
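
One way to read the phrase "a linear combination of piecewise constant Boolean interaction terms" is sketched below in plain Python. This is an illustration only, not the authors' code: the names `INTERACTIONS`, `INTERCEPT`, `boolean_term`, `lss_mean`, and `simulate` are hypothetical, and the feature indices, thresholds, signs, coefficients, and uniform feature distribution are assumptions chosen for the example rather than values taken from the paper.

```python
import random

# Assumed, illustrative LSS-style model: each interaction is a coefficient
# times a product of threshold indicators on a small set of signed features.
# A "+" sign means the indicator is 1(x[k] > threshold); "-" means 1(x[k] <= threshold).
INTERACTIONS = [
    # (coefficient, [(feature index, threshold, sign), ...])
    (2.0, [(0, 0.5, "+"), (1, 0.5, "+")]),
    (-1.0, [(2, 0.3, "-"), (3, 0.7, "+")]),
]
INTERCEPT = 0.5


def boolean_term(x, signed_features):
    """Return 1 if every signed threshold condition holds for x, else 0."""
    for index, threshold, sign in signed_features:
        above = x[index] > threshold
        if above != (sign == "+"):
            return 0
    return 1


def lss_mean(x):
    """Piecewise constant regression function: intercept plus weighted Boolean terms."""
    return INTERCEPT + sum(
        coef * boolean_term(x, feats) for coef, feats in INTERACTIONS
    )


def simulate(n, n_features=10, noise_sd=0.1, seed=0):
    """Draw n (x, y) pairs with uniform features and Gaussian noise (both assumed)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(n_features)]
        y = lss_mean(x) + rng.gauss(0.0, noise_sd)
        data.append((x, y))
    return data
```

Under these assumptions the response surface is flat except on the boxes where all conditions of an interaction hold, which is the kind of thresholding behavior the abstract describes. Data drawn with `simulate` could then be passed to any random forest implementation to explore which features co-occur along decision paths.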

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0677/9295780/1874445eb127/pnas.2118636119fig01.jpg
