Yeung K Y, Ruzzo W L
Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98195, USA.
Bioinformatics. 2001 Sep;17(9):763-74. doi: 10.1093/bioinformatics/17.9.763.
There is a great need to develop analytical methodology to analyze and to exploit the information contained in gene expression data. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for analysis of gene expression data. Other classical techniques, such as principal component analysis (PCA), have also been applied to analyze gene expression data. Using different data analysis techniques and different clustering algorithms to analyze the same data set can lead to very different conclusions. Our goal is to study the effectiveness of principal components (PCs) in capturing cluster structure. Specifically, using both real and synthetic gene expression data sets, we compared the quality of clusters obtained from the original data to the quality of clusters obtained after projecting onto subsets of the principal component axes.
Our empirical study showed that clustering with the PCs instead of the original variables does not necessarily improve, and often degrades, cluster quality. In particular, the first few PCs (which contain most of the variation in the data) do not necessarily capture most of the cluster structure. We also showed that clustering with PCs has a different impact on different algorithms and different similarity metrics. Overall, we would not recommend PCA before clustering except in special circumstances.
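The comparison described above can be illustrated with a small sketch. It is not the authors' experimental setup: the abstract does not name the clustering algorithms, similarity metrics, or quality measure used, so k-means with Euclidean distance and the adjusted Rand index are assumptions made here for illustration. The two-column synthetic data set is likewise hypothetical, built so that the highest-variance direction carries no cluster signal. All function names (`pca_scores`, `kmeans_labels`, `adjusted_rand_index`, `run_demo`) are made up for this example.

```python
import numpy as np


def pca_scores(X):
    """Project centered data onto all principal axes, ordered by decreasing variance."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes; singular values are sorted in descending order.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt.T


def kmeans_labels(X, k, n_init=20, n_iter=100, seed=0):
    """Plain Lloyd's k-means with random restarts; returns the lowest-cost labeling."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Recompute assignments and cost against the final centers.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        cost = dist[np.arange(len(X)), labels].sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings (1 = identical partitions)."""
    _, ai = np.unique(np.asarray(a).ravel(), return_inverse=True)
    _, bi = np.unique(np.asarray(b).ravel(), return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
    np.add.at(table, (ai, bi), 1)  # contingency table of the two labelings

    def pairs(x):
        return x * (x - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(ai))
    maximum = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (maximum - expected)


def run_demo(n_per_cluster=1000, seed=0):
    """Compare cluster quality on the original data versus single PCs (toy data)."""
    rng = np.random.default_rng(seed)
    truth = np.repeat([0, 1], n_per_cluster)
    n = truth.size
    # Assumed toy data. Column 0: pure noise with the largest variance (about 1.27).
    noise = rng.normal(0.0, np.sqrt(1.27), size=n)
    # Column 1: two tight groups near -1 and +1 (overall variance about 1.01).
    signal = np.where(truth == 0, -1.0, 1.0) + rng.normal(0.0, 0.1, size=n)
    X = np.column_stack([noise, signal])
    scores = pca_scores(X)
    return {
        "original": adjusted_rand_index(truth, kmeans_labels(X, 2)),
        "pc1_only": adjusted_rand_index(truth, kmeans_labels(scores[:, :1], 2)),
        "pc2_only": adjusted_rand_index(truth, kmeans_labels(scores[:, 1:2], 2)),
    }


if __name__ == "__main__":
    for name, ari in run_demo().items():
        print(f"{name}: adjusted Rand index = {ari:.3f}")
```

In this construction the first PC aligns mostly with the noise column, so clustering on it alone should recover little of the true grouping, while the original data and the second PC should recover it well. This is only one contrived case, and it says nothing about how often the effect occurs in real gene expression data.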