Sanguinetti Guido, Milo Marta, Rattray Magnus, Lawrence Neil D
Department of Computer Science, Regent Court, 211 Portobello Road, Sheffield S1 4DP, UK.
Bioinformatics. 2005 Oct 1;21(19):3748-54. doi: 10.1093/bioinformatics/bti617. Epub 2005 Aug 9.
Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques for the analysis of high-dimensional datasets. In its standard form, however, it does not take into account any error measures associated with the data points beyond standard spherical noise. This indiscriminate treatment is one of its main weaknesses when it is applied to biological data with inherently large variability, such as expression levels measured with microarrays. Methods now exist for extracting credibility intervals from the probe-level analysis of cDNA and oligonucleotide microarray experiments. These credibility intervals are gene and experiment specific, and can be propagated through an appropriate probabilistic downstream analysis.
We propose a new model-based approach to PCA that takes into account the variances associated with each gene in each experiment. We develop an efficient EM algorithm to estimate the parameters of the new model. The model provides significantly better results than standard PCA while remaining computationally reasonable. We show how the model can be used to 'denoise' a microarray dataset, leading to improved expression profiles and tighter clustering across profiles. The probabilistic nature of the model means that the correct number of principal components is obtained automatically.
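As a rough illustration of the denoising idea, the sketch below assumes a linear-Gaussian latent variable model in which each observed profile `y` has its own known measurement variances `s2` added to a shared noise variance `sigma2`. The function name `denoise_profile`, the parameter names, and the toy values are hypothetical. The parameters `W`, `mu` and `sigma2` are taken as given rather than estimated by EM, so this is not the authors' implementation, only the standard posterior-mean computation for such a model.

```python
import numpy as np


def denoise_profile(y, s2, W, mu, sigma2):
    """Posterior-mean reconstruction of one profile (illustrative sketch).

    Assumed model: y = W x + mu + eps, with x ~ N(0, I) and
    eps ~ N(0, diag(s2) + sigma2 * I).

    y      : (d,) observed values for one profile
    s2     : (d,) known measurement variances for this profile
    W      : (d, q) loading matrix (assumed already estimated)
    mu     : (d,) mean vector
    sigma2 : scalar shared noise variance
    """
    noise_var = s2 + sigma2                   # diagonal of the total noise covariance
    Wt_Sinv = W.T / noise_var                 # W^T S^{-1}, shape (q, d)
    M = np.eye(W.shape[1]) + Wt_Sinv @ W      # posterior precision of the latent x
    x_mean = np.linalg.solve(M, Wt_Sinv @ (y - mu))  # posterior mean of x
    return W @ x_mean + mu, x_mean


if __name__ == "__main__":
    # Toy example: two measurements of the same underlying quantity,
    # the second one reported with a very large variance.
    W = np.array([[1.0], [1.0]])
    mu = np.zeros(2)
    y = np.array([2.0, 100.0])
    s2 = np.array([0.01, 1e6])
    y_hat, x_hat = denoise_profile(y, s2, W, mu, sigma2=0.0)
    print(y_hat, x_hat)
```

In this toy case the unreliable second measurement is effectively ignored, and the reconstruction is driven by the precise first one. Standard PCA, which treats both measurements as equally noisy, has no mechanism for this kind of weighting.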