Wagner Florian
Graduate Program in Computational Biology & Bioinformatics, Duke University, Durham, NC, United States of America.
Center for Genomic and Computational Biology, Duke University, Durham, NC, United States of America.
PLoS One. 2015 Nov 17;10(11):e0143196. doi: 10.1371/journal.pone.0143196. eCollection 2015.
Genome-wide expression profiling is a widely used approach for characterizing heterogeneous populations of cells, tissues, biopsies, or other biological specimens. The exploratory analysis of such data typically relies on generic unsupervised methods, e.g., principal component analysis (PCA) or hierarchical clustering. However, generic methods fail to exploit prior knowledge about the molecular functions of genes. Here, I introduce GO-PCA, an unsupervised method that combines PCA with nonparametric GO enrichment analysis in order to systematically search for sets of genes that are both strongly correlated and closely functionally related. These gene sets are then used to automatically generate expression signatures with functional labels, which collectively aim to provide a readily interpretable representation of biologically relevant similarities and differences. The robustness of the results can be assessed by bootstrapping.
I first applied GO-PCA to datasets containing diverse hematopoietic cell types from human and mouse, respectively. In both cases, GO-PCA generated a small number of signatures that represented the majority of lineages present, and whose labels reflected their respective biological characteristics. I then applied GO-PCA to human glioblastoma (GBM) data, and recovered signatures associated with four out of five previously defined GBM subtypes. My results demonstrate that GO-PCA is a powerful and versatile exploratory method that reduces an expression matrix containing thousands of genes to a much smaller set of interpretable signatures. In this way, GO-PCA aims to facilitate hypothesis generation, design of further analyses, and functional comparisons across datasets.
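The general idea of combining PCA with enrichment testing can be illustrated with a minimal sketch. This is not the GO-PCA implementation: a plain hypergeometric test on the top-ranked loadings stands in for the nonparametric enrichment test named above, and the function names (`hypergeom_sf`, `top_gene_enrichment`, `go_pca_sketch`), the parameters `top_k` and `alpha`, and the toy gene sets are all hypothetical choices made for illustration. The expression matrix is assumed to be genes by samples.

```python
import math
import numpy as np


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the set."""
    total = math.comb(N, n)
    upper = min(K, n)
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def top_gene_enrichment(loadings, gene_sets, top_k=20, alpha=0.01):
    """Test each gene set for over-representation among the top_k loadings."""
    order = np.argsort(-loadings)  # gene indices, largest loading first
    top = set(order[:top_k].tolist())
    hits = {}
    for label, members in gene_sets.items():
        members = set(members)
        overlap = top & members
        p = hypergeom_sf(len(overlap), loadings.size, len(members), top_k)
        if overlap and p <= alpha:
            hits[label] = (sorted(overlap), p)
    return hits


def go_pca_sketch(expr, gene_sets, n_components=2, top_k=20, alpha=0.01):
    """Return {label: signature}, one value per sample (assumed simplification)."""
    centered = expr - expr.mean(axis=1, keepdims=True)
    # PCA via SVD: each column of u holds the gene loadings of one component.
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    std = centered.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant genes
    z = centered / std
    signatures = {}
    for pc in range(n_components):
        for sign in (1, -1):  # test both ends of each component
            hits = top_gene_enrichment(sign * u[:, pc], gene_sets, top_k, alpha)
            for label, (genes, _p) in hits.items():
                # Signature: mean standardized expression of the enriched genes.
                signatures.setdefault(label, z[genes].mean(axis=0))
    return signatures


# Toy data: 200 genes, 12 samples; genes 0-14 are elevated in the first six samples.
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 12))
expr[:15, :6] += 4.0
gene_sets = {"toy_term_A": range(15), "toy_term_B": range(100, 115)}
sigs = go_pca_sketch(expr, gene_sets)
```

In this toy setup, the co-expressed genes dominate the first principal component, so `toy_term_A` should yield a signature that separates the first six samples from the rest. Real data would call for an actual GO annotation, multiple-testing control, and the bootstrapping step mentioned above, none of which this sketch attempts.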