Paul Debolina, Chakraborty Saptarshi, Das Swagatam
IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16788-16800. doi: 10.1109/TNNLS.2023.3298011. Epub 2024 Oct 29.
Principal component analysis (PCA) is a fundamental tool for data visualization, denoising, and dimensionality reduction, widely used in statistics, machine learning, computer vision, and related fields. However, PCA is well known to be sensitive to outliers and often fails to detect the true underlying low-dimensional structure of the dataset. Following the Median of Means (MoM) philosophy, recent supervised learning methods have shown great success in handling outlying observations without much compromise to their large-sample theoretical properties. This article proposes a PCA procedure based on the MoM principle. The proposed method, termed MoMPCA, is not only computationally appealing but also achieves optimal convergence rates under minimal assumptions. In particular, we derive nonasymptotic error bounds for the obtained solution with the aid of Rademacher complexities, while making no assumptions whatsoever about the outlying observations. The derived concentration results do not depend on the dimension, since the analysis is carried out in a separable Hilbert space; they depend only on the fourth moment of the underlying distribution in the corresponding norm. The efficacy of the proposal is demonstrated through simulations and real-data applications.
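The abstract does not spell out the estimation procedure, but the MoM principle it invokes admits a short illustration: partition the sample into blocks, score a candidate projection by the median of the blockwise mean reconstruction errors, and minimize that robust objective. Below is a minimal NumPy sketch under these assumptions; the function name mom_pca, the block count, the robust centering step, and the projected-gradient scheme are all illustrative choices, not the authors' exact algorithm.

```python
import numpy as np

def mom_pca(X, k, n_blocks=10, n_iter=200, lr=0.1, seed=0):
    """Median-of-Means PCA sketch (illustrative, not the paper's algorithm).

    Seeks a d x k orthonormal U by projected gradient descent on the
    median of the blockwise mean reconstruction errors ||x - U U^T x||^2.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xc = X - np.median(X, axis=0)                # robust centering (assumption)
    blocks = np.array_split(rng.permutation(n), n_blocks)

    U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal start
    for _ in range(n_iter):
        R = Xc - (Xc @ U) @ U.T                  # residuals under projection U
        losses = np.array([np.mean(np.sum(R[b] ** 2, axis=1)) for b in blocks])
        b = blocks[np.argsort(losses)[n_blocks // 2]]   # pick the median block
        grad = -2.0 * (Xc[b].T @ (Xc[b] @ U)) / len(b)  # gradient of its mean loss
        U, _ = np.linalg.qr(U - lr * grad)       # step, then retract to orthonormality
    return U
```

On a contaminated sample, U = mom_pca(X, k=2) should stay close to the principal subspace of the clean data, whereas an eigendecomposition of the raw covariance can be pulled arbitrarily far by a few outliers; taking the median over blocks is what confines each outlier's influence to the blocks it lands in.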