High-dimensional principal component analysis with heterogeneous missingness.

Authors

Zhu Ziwei, Wang Tengyao, Samworth Richard J

Affiliations

Statistical Laboratory, University of Cambridge, Cambridge, UK.

Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA.

Publication information

J R Stat Soc Series B Stat Methodol. 2022 Nov;84(5):2000-2031. doi: 10.1111/rssb.12550. Epub 2022 Nov 20.

DOI:10.1111/rssb.12550
PMID:37065873
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10098677/
Abstract

We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.
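
The iteration described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`opw_init`, `prime_pca_sketch`, `subspace_error`), the fixed iteration count, and the rule of skipping rows with fewer than K observed entries are all assumptions made for illustration, and any row-selection safeguards or stopping criteria used in the paper are omitted. The sketch starts from a simplified observed-proportion weighted covariance estimate, imputes each row's missing entries by a least-squares fit of its observed entries onto the current estimate, and then takes the leading right singular vectors of the imputed matrix.

```python
import numpy as np


def opw_init(Y, mask, K):
    """Simplified OPW-style initializer (assumption: centred data).

    Each entry of the zero-filled Gram matrix is divided by the number of
    rows in which both coordinates were observed, then the top-K
    eigenvectors of the resulting symmetric matrix are returned.
    """
    Y0 = np.where(mask, Y, 0.0)            # zero-fill missing entries
    M = mask.astype(float)
    counts = M.T @ M                       # joint observation counts (d x d)
    gram = Y0.T @ Y0
    S = np.where(counts > 0, gram / np.maximum(counts, 1.0), 0.0)
    S = (S + S.T) / 2.0
    _, eigvecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    return eigvecs[:, -K:]


def prime_pca_sketch(Y, mask, K, n_iter=100):
    """Iterative projection-and-SVD refinement (illustrative only)."""
    n, _ = Y.shape
    V = opw_init(Y, mask, K)
    for _ in range(n_iter):
        Y_hat = np.where(mask, Y, 0.0)
        for i in range(n):
            obs = mask[i]
            if obs.sum() < K:              # too few entries to fit K scores
                continue                   # (hypothetical rule: leave zeros)
            # Least-squares scores from the observed coordinates only.
            u, *_ = np.linalg.lstsq(V[obs], Y[i, obs], rcond=None)
            # Impute the missing coordinates from the fitted scores.
            Y_hat[i, ~obs] = V[~obs] @ u
        # Update the estimate: leading right singular space of imputed data.
        _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
        V = Vt[:K].T
    return V


def subspace_error(V, V_true):
    """Spectral norm of the part of V_true lying outside span(V)."""
    return np.linalg.norm(V_true - V @ (V.T @ V_true), 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, K = 200, 20, 2
    V_true, _ = np.linalg.qr(rng.normal(size=(d, K)))
    Y_full = (rng.normal(size=(n, K)) * np.array([3.0, 2.0])) @ V_true.T
    # Heterogeneous missingness: each column has its own observation rate.
    mask = rng.random((n, d)) < np.linspace(0.5, 0.95, d)
    Y = np.where(mask, Y_full, np.nan)
    print("OPW error:    ", subspace_error(opw_init(Y, mask, K), V_true))
    print("refined error:", subspace_error(prime_pca_sketch(Y, mask, K), V_true))
```

The demo uses noiseless, exactly rank-K data with column-specific observation probabilities, which is the setting in which the abstract states that the OPW estimator does not give exact recovery while the iterative refinement converges geometrically.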

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f592/10098677/ef4ea83e6e9f/RSSB-84-2000-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f592/10098677/f0d8c1cd63f7/RSSB-84-2000-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f592/10098677/2a81ec06c39f/RSSB-84-2000-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f592/10098677/ce237d65df30/RSSB-84-2000-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f592/10098677/50840412a3d3/RSSB-84-2000-g001.jpg
