A computationally efficient approach to false discovery rate control and power maximisation via randomisation and mirror statistic.

Authors

Molinari Marco, Thoresen Magne

Affiliation

Department of Biostatistics, University of Oslo, Oslo, Norway.

Publication details

Stat Methods Med Res. 2025 Jun;34(6):1233-1253. doi: 10.1177/09622802251329768. Epub 2025 Mar 31.

DOI:10.1177/09622802251329768

PMID:40165448

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12209545/

Abstract

Simultaneously performing variable selection and inference in high-dimensional regression models is an open challenge in statistics and machine learning. The growing number of available variables calls for statistical procedures that accurately select the most important predictors in a high-dimensional space while controlling the false discovery rate (FDR) of the selection. In this paper, we propose combining the Mirror Statistic approach to FDR control with outcome randomisation to maximise the statistical power of the variable selection procedure, measured through the true positive rate. Through extensive simulations, we show how the proposed strategy combines the benefits of the two techniques. The Mirror Statistic is a flexible method for controlling the FDR that requires only mild model assumptions, but it needs two sets of independent regression coefficient estimates, usually obtained by splitting the original dataset. Outcome randomisation is an alternative to data splitting that generates two independent outcomes, which can then be used to estimate the coefficients that go into the construction of the Mirror Statistic. The combination of these two approaches provides increased testing power in a number of scenarios, such as highly correlated covariates and high percentages of active variables. Moreover, it scales to very high-dimensional problems, since the algorithm has a low memory footprint and requires only a single run on the full dataset, as opposed to iterative alternatives such as multiple data splitting.
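
As a rough illustration of how the two ingredients in the abstract could fit together, the sketch below shows (1) a simple outcome randomisation, (2) a mirror statistic built from two coefficient vectors, and (3) a data-driven cutoff for a target FDR level q. It is not the authors' implementation. The assumptions are: Gaussian noise with a known standard deviation `sigma`; the symmetric randomisation y + w and y - w, which gives independent noise terms only under that Gaussian assumption; and the form sign(b1 * b2) * (|b1| + |b2|) for the statistic, one common choice in the data-splitting literature. All function names are hypothetical, and the regression step that turns each randomised outcome into coefficient estimates is deliberately left out.

```python
import random


def randomise_outcome(y, sigma, rng):
    """Return two randomised outcomes, y + w and y - w, with w ~ N(0, sigma^2).

    Assumption: the noise in y is Gaussian with known standard deviation
    sigma. Under that assumption the noise terms of the two outputs are
    independent, so no observations have to be set aside as in data splitting.
    """
    w = [rng.gauss(0.0, sigma) for _ in y]
    y_plus = [yi + wi for yi, wi in zip(y, w)]
    y_minus = [yi - wi for yi, wi in zip(y, w)]
    return y_plus, y_minus


def mirror_statistics(beta_1, beta_2):
    """Compute M_j = sign(b1_j * b2_j) * (|b1_j| + |b2_j|) for each variable.

    Large positive values suggest a real signal (both estimates agree in
    sign and are large); null variables are roughly symmetric around zero.
    """
    stats = []
    for b1, b2 in zip(beta_1, beta_2):
        prod = b1 * b2
        sign = (prod > 0) - (prod < 0)  # -1, 0 or 1
        stats.append(sign * (abs(b1) + abs(b2)))
    return stats


def select_variables(stats, q):
    """Select indices with M_j above the smallest cutoff t such that
    #{M_j < -t} / max(#{M_j > t}, 1) <= q (an estimate of the FDP at t)."""
    for t in sorted(abs(m) for m in stats if m != 0):
        n_neg = sum(m < -t for m in stats)
        n_pos = sum(m > t for m in stats)
        if n_neg / max(n_pos, 1) <= q:
            return [j for j, m in enumerate(stats) if m > t]
    return []
```

In use, one would fit the same regression model once to each of the two randomised outcomes, pass the two coefficient vectors to `mirror_statistics`, and then call `select_variables` with the desired FDR level.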
