Leif E. Peterson
Department of Medicine, Baylor College of Medicine, One Baylor Plaza ST-924, Houston, TX 77030, USA.
Comput Methods Programs Biomed. 2003 Feb;70(2):107-19. doi: 10.1016/s0169-2607(02)00009-3.
Principal components analysis (PCA) is useful for reproducing the total variation among hundreds or thousands of continuously scaled variables with a much smaller number of unobservable variables called 'latent factors'. The CLUSFAVOR computer program was used to implement PCA to identify groups of genes with similar expression profiles among the large number of genes used on DNA microarrays. This paper describes the principal components solution to the factor model of the correlation matrix R, the calculation of the eigenvalues and eigenvectors of R, the extraction of factors, the calculation of factor loadings, and the identification of genes with similar loading patterns, from which groups of genes with similar expression profiles are constructed. With regard to extraction of factors, it was found that more than 90% of the total variance in the input data could be accounted for by extracting factors whose eigenvalues exceed unity. Bipolar factors, which contain both strong positive and strong negative loadings, can also be used to identify two distinct groups of genes, since the expression profiles of genes that load positively on a factor are unlike those of genes that load negatively on the same factor. While PCA does not provide the absolute answer to a multidimensional problem, it can nevertheless provide a heuristic with which natural groupings of genes with similar expression profiles can be assembled. Whereas cluster analysis essentially generates a single dendrogram (tree diagram) containing every gene in the input data, PCA can be used to assemble gene expression profiles that correlate strongly with the latent factors accounting for a majority of the total variance. Example results from CLUSFAVOR program runs are provided.
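The steps named in the abstract (eigendecomposition of R, extraction of factors with eigenvalues above unity, factor loadings, and grouping by loading pattern) can be illustrated with a minimal NumPy sketch. This is not the CLUSFAVOR implementation. The function names, the synthetic data, and the 0.5 loading cutoff are assumptions made only for illustration, and genes are assumed to be the columns (variables) of the input matrix.

```python
import numpy as np


def pca_factor_loadings(X, min_eigenvalue=1.0):
    """Principal components solution for the correlation matrix R.

    X has one row per array (observation) and one column per gene.
    Returns all eigenvalues (descending), the loadings of the retained
    factors, and the fraction of total variance those factors explain.
    """
    R = np.corrcoef(X, rowvar=False)          # gene-by-gene correlations
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]         # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > min_eigenvalue           # "eigenvalues exceed unity" rule
    # Loading = eigenvector scaled by sqrt(eigenvalue), i.e. the
    # correlation between each gene and each retained factor.
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    explained = eigvals[keep].sum() / eigvals.sum()
    return eigvals, loadings, explained


def group_genes(loadings, threshold=0.5):
    """Assign each gene to the factor it loads on most strongly.

    The threshold is a hypothetical cutoff. Genes are keyed by
    (factor index, sign), so a bipolar factor yields two groups.
    """
    groups = {}
    for gene, row in enumerate(loadings):
        k = int(np.argmax(np.abs(row)))
        if abs(row[k]) >= threshold:
            sign = "+" if row[k] > 0 else "-"
            groups.setdefault((k, sign), []).append(gene)
    return groups


def make_demo_data(n_arrays=40, noise=0.1, seed=0):
    """Synthetic data: 8 genes driven by two uncorrelated profiles."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_arrays) / n_arrays
    a = np.sin(2 * np.pi * t)
    b = np.cos(2 * np.pi * t)
    # Genes 0-2 follow a, genes 3-4 follow -a (bipolar), genes 5-7 follow b.
    columns = [a, a, a, -a, -a, b, b, b]
    X = np.column_stack(columns)
    return X + noise * rng.standard_normal(X.shape)


X = make_demo_data()
eigvals, loadings, explained = pca_factor_loadings(X)
print("retained factors:", loadings.shape[1])
print("fraction of variance explained:", round(float(explained), 3))
print("groups:", group_genes(loadings))
```

Because the sign of an eigenvector is arbitrary, which half of a bipolar factor is labeled "+" can differ between runs or libraries; only the split into two opposed groups is meaningful.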