Sood Anav, Hastie Trevor
Department of Statistics, Stanford University, Sequoia Hall, 390 Jane Stanford Way, Stanford, CA 94305, USA.
J R Stat Soc Series B Stat Methodol. 2025 May 16. doi: 10.1093/jrsssb/qkaf023.
We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as column subset selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of principal variables. This paper shows that these two approaches are equivalent and, moreover, that both can be viewed as maximum-likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the ratio of the number of variables to the sample size converges to a constant. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
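As a rough illustration of the CSS objective described in the abstract (not the paper's maximum-likelihood estimator), the sketch below greedily selects columns that minimize the squared Frobenius norm of the residual after projecting the data onto the chosen columns. The greedy residual-deflation heuristic and all function names here are illustrative assumptions, not the authors' method.

```python
import numpy as np

def greedy_css(X, k):
    """Greedily pick k columns of X (n x p) whose span best reconstructs X,
    i.e. approximately minimize ||X - P_S X||_F^2 over subsets S of size k,
    where P_S projects onto the selected columns. Illustrative sketch only."""
    n, p = X.shape
    selected = []
    residual = X.copy()
    for _ in range(k):
        # Score each column by how much residual energy it explains:
        # ||R^T r_j|| / ||r_j|| for residual column r_j.
        col_norms = np.linalg.norm(residual, axis=0)
        col_norms[col_norms == 0] = np.inf  # zero columns score 0, never reselected
        scores = np.linalg.norm(residual.T @ residual, axis=0) / col_norms
        j = int(np.argmax(scores))
        selected.append(j)
        # Deflate: remove the chosen column's contribution from the residual.
        c = residual[:, j:j + 1]
        residual = residual - c @ (c.T @ residual) / (c.T @ c)
    return selected
```

Under the paper's equivalence, the subset that minimizes this reconstruction error coincides with the information-maximizing set of principal variables; an exact search over subsets is combinatorial, so greedy deflation is a common practical surrogate.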