Sood Anav, Hastie Trevor
Department of Statistics, Stanford University, Sequoia Hall, 390 Jane Stanford Way, Stanford, CA 94305, USA.
J R Stat Soc Series B Stat Methodol. 2025 May 16. doi: 10.1093/jrsssb/qkaf023.
We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as column subset selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of principal variables. This paper shows that these two approaches are equivalent and, moreover, that both can be viewed as maximum-likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the ratio of the number of variables to the sample size converges to a constant. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
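As a rough illustration of the CSS objective described in the abstract (not the paper's maximum-likelihood estimator), the sketch below greedily selects columns that minimize the squared Frobenius norm of the residual after projecting the data onto the chosen columns. The greedy residual-deflation heuristic and all function names here are illustrative assumptions, not the authors' method.

```python
import numpy as np

def greedy_css(X, k):
    """Greedily pick k columns of X (n x p) whose span best reconstructs X,
    i.e. approximately minimize ||X - P_S X||_F^2 over subsets S of size k,
    where P_S projects onto the selected columns. Illustrative sketch only."""
    n, p = X.shape
    selected = []
    residual = X.copy()
    for _ in range(k):
        # Score each column by how much residual energy it explains:
        # ||R^T r_j|| / ||r_j|| for residual column r_j.
        col_norms = np.linalg.norm(residual, axis=0)
        col_norms[col_norms == 0] = np.inf  # zero columns score 0, never reselected
        scores = np.linalg.norm(residual.T @ residual, axis=0) / col_norms
        j = int(np.argmax(scores))
        selected.append(j)
        # Deflate: remove the chosen column's contribution from the residual.
        c = residual[:, j:j + 1]
        residual = residual - c @ (c.T @ residual) / (c.T @ c)
    return selected
```

Under the paper's equivalence, the subset that minimizes this reconstruction error coincides with the information-maximizing set of principal variables; an exact search over subsets is combinatorial, so greedy deflation is a common practical surrogate.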