Department of Statistics, National Chengchi University, Taiwan.
Gene. 2013 Apr 10;518(1):179-86. doi: 10.1016/j.gene.2012.11.034. Epub 2012 Dec 6.
In DNA microarray studies, gene-set analysis (GSA) has become the focus of gene expression data analysis. GSA utilizes the gene expression profiles of functionally related gene sets in Gene Ontology (GO) categories or a priori defined biological classes to assess the significance of gene sets associated with clinical outcomes or phenotypes. Many statistical approaches have been proposed to determine whether such functionally related gene sets are differentially expressed (enrichment and/or deletion) across phenotypes. However, little attention has been given to the discriminatory power of gene sets and the classification of patients. In this study, we propose a method of gene set analysis in which gene sets are used to build classifiers of patients based on the Random Forest (RF) algorithm. An empirical p-value for the classifier's observed out-of-bag (OOB) error rate, obtained with an adequate resampling method, is introduced to identify differentially expressed gene sets. In addition, we discuss the impacts and correlations of genes within each gene set based on the measures of variable importance in the RF algorithm. Significant classifications are reported and visualized together with the underlying gene sets and their contribution to the phenotypes of interest. Numerical studies using both synthesized data and a series of publicly available gene expression data sets are conducted to evaluate the performance of the proposed methods. Compared with other hypothesis testing approaches, our proposed methods are reliable and successful in identifying enriched gene sets and in discovering the contributions of genes within a gene set. The classification results of identified gene sets can provide a valuable alternative to gene set testing for revealing unknown, biologically relevant classes of samples or patients.
In summary, our proposed method allows one to simultaneously assess the discriminatory ability of gene sets and the importance of genes for interpretation of data in complex biological systems. The classifications of biologically defined gene sets can reveal the underlying interactions of gene sets associated with the phenotypes, and provide an insightful complement to conventional gene set analyses.
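The classification-based test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's RandomForestClassifier as the RF implementation, and it assumes label permutation as the resampling scheme, since the abstract only says "an adequate resampling method". The function names (oob_error, gene_set_p_value) and the toy data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def oob_error(X, y, n_trees=100, seed=0):
    """Fit a random forest on one gene set; return (OOB error, variable importances)."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=seed)
    rf.fit(X, y)
    # oob_score_ is OOB accuracy, so the OOB error rate is its complement.
    return 1.0 - rf.oob_score_, rf.feature_importances_


def gene_set_p_value(X, y, n_perm=49, n_trees=100, seed=0):
    """Empirical p-value of the observed OOB error under label permutation.

    Assumption: the null distribution is built by shuffling phenotype labels;
    the abstract does not specify the resampling scheme.
    """
    rng = np.random.default_rng(seed)
    observed, importances = oob_error(X, y, n_trees, seed)
    null_errors = np.array([
        oob_error(X, rng.permutation(y), n_trees, seed)[0]
        for _ in range(n_perm)
    ])
    # Low error is evidence of discriminatory power, so count null errors
    # at least as small as the observed one (with the usual +1 correction).
    p_value = (1 + np.sum(null_errors <= observed)) / (n_perm + 1)
    return observed, p_value, importances


# Toy data (hypothetical): 40 samples, one gene set of 8 genes,
# of which only the first two differ between the two phenotypes.
data_rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = data_rng.normal(size=(40, 8))
X[y == 1, :2] += 2.0

observed_error, p_value, importances = gene_set_p_value(X, y)
print(f"OOB error = {observed_error:.3f}, empirical p = {p_value:.3f}")
print("genes ranked by importance:", np.argsort(importances)[::-1])
```

In this sketch a gene set is called significant when few permuted-label forests reach an OOB error as low as the observed one, and the importance ranking gives a rough view of which genes in the set drive the classification. The smallest attainable p-value is 1/(n_perm + 1), so n_perm would need to be larger when many gene sets are tested and a multiplicity correction is applied.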