Zapala Matthew A, Schork Nicholas J
Biomedical Sciences Graduate Program and the Polymorphism Research Laboratory, Department of Psychiatry, Moores UCSD Cancer Center, Center for Human Genetics and Genomics, University of California at San Diego, La Jolla, CA 92093, USA.
Proc Natl Acad Sci U S A. 2006 Dec 19;103(51):19430-5. doi: 10.1073/pnas.0609333103. Epub 2006 Dec 4.
A fundamental step in the analysis of gene expression and other high-dimensional genomic data is the calculation of the similarity or distance between pairs of individual samples in a study. If one has collected N total samples and assayed the expression level of G genes on those samples, then an N x N similarity matrix can be formed that reflects the correlation or similarity of the samples with respect to the expression values over the G genes. This matrix can then be examined for patterns via standard data reduction and cluster analysis techniques. We consider an alternative to conventional data reduction and cluster analyses of similarity matrices that is rooted in traditional linear models. This analysis method allows predictor variables collected on the samples to be related to variation in the pairwise similarity/distance values reflected in the matrix. The proposed multivariate method avoids the need for reducing the dimensions of a similarity matrix, can be used to assess relationships between the genes used to construct the matrix and additional information collected on the samples under study, and can be used to analyze individual genes or groups of genes identified in different ways. The technique can be used with any high-dimensional assay or data type and is ideally suited for testing subsets of genes defined by their participation in a biochemical pathway or other a priori grouping. We showcase the methodology using three published gene expression data sets.
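The abstract gives no formulas, so the following is only a minimal sketch of how predictors might be related to an N × N distance matrix. It assumes a correlation-based distance, a Gower-centered matrix, a pseudo-F statistic in the style of distance-based regression, and a permutation test for significance. The function names (`correlation_distance`, `pseudo_f`, `permutation_test`) and all parameter choices are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def correlation_distance(expr):
    """Build an N x N distance matrix from an N-samples x G-genes array.

    Assumption: distance is sqrt(2 * (1 - Pearson r)) between sample profiles.
    """
    r = np.corrcoef(expr)
    # Clip guards against tiny negative values caused by floating-point error.
    return np.sqrt(np.clip(2.0 * (1.0 - r), 0.0, None))


def pseudo_f(dist, predictors):
    """Pseudo-F statistic relating predictors to variation in the distances."""
    n = dist.shape[0]
    predictors = np.asarray(predictors, dtype=float).reshape(n, -1)

    # Gower-centered matrix built from the squared distances.
    a = -0.5 * dist ** 2
    centering = np.eye(n) - np.ones((n, n)) / n
    g = centering @ a @ centering

    # Hat (projection) matrix of the design matrix, intercept included.
    x = np.column_stack([np.ones(n), predictors])
    h = x @ np.linalg.pinv(x)
    m = np.linalg.matrix_rank(x)

    resid = np.eye(n) - h
    explained = np.trace(h @ g @ h) / (m - 1)
    unexplained = np.trace(resid @ g @ resid) / (n - m)
    return explained / unexplained


def permutation_test(dist, predictors, n_perm=999, seed=0):
    """Return the observed pseudo-F and a permutation p-value.

    Sample labels of the predictors are shuffled while the distance
    matrix is held fixed.
    """
    n = dist.shape[0]
    rng = np.random.default_rng(seed)
    predictors = np.asarray(predictors, dtype=float).reshape(n, -1)
    observed = pseudo_f(dist, predictors)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        if pseudo_f(dist, predictors[perm]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

To test a pathway or other a priori gene subset under this sketch, one would pass only the corresponding columns of the expression array to `correlation_distance` and then call `permutation_test` with the sample-level predictors. No dimension reduction of the matrix is needed.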