Klomp Jeff A, Furge Kyle A
Center for Cancer Genomics and Computational Biology, Van Andel Research Institute, Grand Rapids, MI, USA.
BMC Res Notes. 2012 Jul 23;5:370. doi: 10.1186/1756-0500-5-370.
High-throughput methods that assign a cellular or physiological function to each gene product are useful for understanding the roles of genes that have not been extensively characterized by molecular or genetic approaches. One method to infer gene function is "guilt-by-association", in which the expression pattern of a poorly characterized gene is shown to co-vary with the expression of better-characterized genes. The function of the poorly characterized gene is then inferred from the known function(s) of the well-described genes. For example, genes co-expressed with transcripts that vary during the cell cycle, development, environmental stress, and oncogenesis have been implicated in those processes.
While examining the expression characteristics of several poorly characterized genes, we noted that we could associate each gene with a cellular phenotype by correlating its expression changes with gene set enrichment scores computed for individual samples. We evaluated the effectiveness of this approach using a modest-sized gene expression data set (expO) and a compendium of gene expression phenotypes (MSigDB v3.0). We found that the transcripts that correlated best with enrichment in mitochondrial and lysosomal gene sets were mostly related to those processes (89/100 and 44/50, respectively). The reciprocal evaluation, in which gene sets are ranked by how well their enrichment correlates with an individual gene's expression, also reflected known associations for genes that are prominent in the biomedical literature (16/19). In evaluating the model, we also found that 4% of the genome encodes proteins that are associated with small molecule and small peptide signal transduction gene sets, implicating a large number of genes in both internal and external environmental sensing.
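The two rankings described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the per-sample enrichment score here is simply the mean z-score of a gene set's members in each sample, a stand-in chosen for brevity, and Pearson correlation is assumed as the correlation measure. All function names and the data layout (a genes-by-samples matrix, a dict of gene sets) are hypothetical.

```python
import numpy as np

def sample_enrichment_scores(expr, gene_sets, gene_index):
    """Per-sample score for each gene set (assumed stand-in: mean member z-score).

    expr: genes x samples array; gene_sets: {set name: [gene names]};
    gene_index: {gene name: row in expr}.
    """
    expr = np.asarray(expr, dtype=float)
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0                      # leave constant genes at z = 0
    z = (expr - mu) / sd
    scores = {}
    for name, members in gene_sets.items():
        rows = [gene_index[g] for g in members if g in gene_index]
        if rows:                           # skip sets with no measured genes
            scores[name] = z[rows].mean(axis=0)
    return scores

def pearson_rows(matrix, vector):
    """Pearson correlation of every row of `matrix` with `vector`."""
    m = matrix - matrix.mean(axis=1, keepdims=True)
    v = vector - vector.mean()
    denom = np.sqrt((m ** 2).sum(axis=1) * (v ** 2).sum())
    denom[denom == 0] = np.inf             # constant input gives r = 0
    return (m @ v) / denom

def rank_genes_for_set(expr, genes, set_scores):
    """Genes ordered by correlation with one gene set's per-sample scores."""
    r = pearson_rows(np.asarray(expr, dtype=float), set_scores)
    return [(genes[i], float(r[i])) for i in np.argsort(-r)]

def rank_sets_for_gene(expr, gene_index, gene, scores):
    """Reciprocal ranking: gene sets ordered by correlation with one gene."""
    x = np.asarray(expr, dtype=float)[gene_index[gene]]
    names = list(scores)
    mat = np.vstack([scores[n] for n in names])
    r = pearson_rows(mat, x)
    return [(names[i], float(r[i])) for i in np.argsort(-r)]
```

In this sketch a gene that belongs to a set contributes to that set's score, which inflates its own correlation; a leave-one-out score would avoid that, at the cost of recomputing the score for each member gene.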
Our results show that this approach is useful for inferring the functions of disparate sets of genes. The method mirrors the experimental approaches used by others to associate individual genes with defined gene expression changes. Moreover, beyond discovering genes related to a cellular process, the approach can be used to discover meaningful expression phenotypes in a compendium that are associated with a given gene. Its effectiveness, versatility, and breadth make it applicable in a variety of contexts and with a variety of downstream analyses.