Palese Luigi Leonardo
University of Bari "Aldo Moro", Department of Basic Medical Sciences, Neurosciences and Sense Organs (SMBNOS), Bari 70124, Italy.
Comput Biol Chem. 2018 Apr;73:57-64. doi: 10.1016/j.compbiolchem.2018.01.009. Epub 2018 Feb 2.
Principal component analysis (PCA) is a widespread technique for data analysis that relies on the covariance/correlation matrix of the analyzed data. However, to work properly with high-dimensional data sets, PCA imposes severe mathematical constraints on the minimum number of distinct replicates, or samples, that must be included in the analysis. Generally, improper sampling is due to a small number of data points relative to the number of degrees of freedom that characterize the ensemble. In the life sciences it is often important to have an algorithm that can accept poorly dimensioned data sets, including degenerate ones. Here a new random projection algorithm is proposed, in which a random symmetric matrix serves as a surrogate for the covariance/correlation matrix of PCA while maintaining its data clustering capacity. We demonstrate that what matters for the clustering efficiency of PCA is not the exact form of the covariance/correlation matrix, but simply its symmetry.
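The idea of swapping the covariance/correlation matrix for a random symmetric one can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything specific here is an assumption made for illustration: the function name `random_symmetric_projection`, the Gaussian entries of the random matrix, the symmetrization step, and the choice to keep the eigenvectors with the largest eigenvalues. It should not be read as the published method.

```python
import numpy as np

def random_symmetric_projection(X, n_components=2, seed=0):
    """Project samples (rows of X) onto eigenvectors of a random symmetric matrix.

    Hypothetical illustration: in PCA the eigenvectors would come from the
    covariance/correlation matrix of X; here they come from a random
    symmetric matrix of the same size (an assumed construction).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)            # center each variable, as in PCA
    d = Xc.shape[1]                    # number of variables

    A = rng.standard_normal((d, d))    # assumed: Gaussian random entries
    S = (A + A.T) / 2.0                # symmetrize: S equals its transpose

    # eigh is meant for symmetric matrices; it returns real eigenvalues in
    # ascending order and orthonormal eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]              # d x n_components projection basis

    return Xc @ W                      # n_samples x n_components scores

# Poorly dimensioned toy data: 10 samples, 50 variables, two groups.
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=(5, 50))
group_b = rng.normal(3.0, 1.0, size=(5, 50))
X = np.vstack([group_a, group_b])

Y = random_symmetric_projection(X, n_components=2)
print(Y.shape)  # (10, 2)
```

Because the basis does not depend on the data, this sketch needs no minimum number of samples, which is the property the abstract highlights for poorly dimensioned data sets. Whether the groups separate in the projected coordinates depends on the data and on the random draw.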