Jolliffe Ian T, Cadima Jorge
College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter, UK.
Secção de Matemática (DCEB), Instituto Superior de Agronomia, Universidade de Lisboa, Tapada da Ajuda, Lisboa 1340-017, Portugal; Centro de Estatística e Aplicações da Universidade de Lisboa (CEAUL), Lisboa, Portugal
Philos Trans A Math Phys Eng Sci. 2016 Apr 13;374(2065):20150202. doi: 10.1098/rsta.2015.0202.
Large datasets are increasingly common and are often difficult to interpret. Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance. Finding these new variables, the principal components, reduces to solving an eigenvalue/eigenvector problem. The new variables are defined by the dataset at hand rather than a priori, which makes PCA an adaptive data analysis technique. It is adaptive in another sense too, since variants of the technique have been developed that are tailored to different data types and structures. This article will begin by introducing the basic ideas of PCA, discussing what it can and cannot do. It will then describe some variants of PCA and their application.
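The eigenvalue/eigenvector formulation mentioned above can be illustrated with a minimal sketch. It assumes NumPy and a small synthetic dataset; the function name `pca`, the variable names, and the data are hypothetical choices made for illustration and are not taken from the article. The sketch centres the data, forms the sample covariance matrix, and takes its leading eigenvectors as the directions of successively maximal variance.

```python
import numpy as np

def pca(X, n_components):
    """Illustrative PCA via eigendecomposition of the sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                # centre each variable (column)
    S = np.cov(Xc, rowvar=False)           # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                  # the new variables (principal components)
    return eigvals, eigvecs, scores

# Synthetic data (an assumption for this sketch): 200 observations of
# 3 correlated variables driven by 2 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 1.0, 0.5],
                   [0.0, 1.0, -1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

variances, loadings, scores = pca(X, n_components=2)

# Share of total variance retained by the two components.
total_variance = np.trace(np.cov(X, rowvar=False))
print("explained variance ratio:", variances / total_variance)
```

The covariance matrix of `scores` is diagonal, with the retained eigenvalues on the diagonal, which is the sense in which the new variables are uncorrelated and ordered by variance. Working from the covariance matrix is only one option; standardizing the variables first (equivalently, using the correlation matrix) is a common alternative when variables are measured on different scales.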