Random forests-based differential analysis of gene sets for gene expression data.

Affiliation

Department of Statistics, National Chengchi University, Taiwan.

Publication information

Gene. 2013 Apr 10;518(1):179-86. doi: 10.1016/j.gene.2012.11.034. Epub 2012 Dec 6.

DOI:10.1016/j.gene.2012.11.034

PMID:23219997

Abstract

In DNA microarray studies, gene-set analysis (GSA) has become the focus of gene expression data analysis. GSA utilizes the gene expression profiles of functionally related gene sets in Gene Ontology (GO) categories or a priori defined biological classes to assess the significance of gene sets associated with clinical outcomes or phenotypes. Many statistical approaches have been proposed to determine whether such functionally related gene sets are differentially expressed (enrichment and/or deletion) across phenotypes. However, little attention has been given to the discriminatory power of gene sets and the classification of patients. In this study, we propose a method of gene set analysis in which gene sets are used to classify patients with the Random Forest (RF) algorithm. An empirical p-value for the observed out-of-bag (OOB) error rate of the classifier, obtained with a suitable resampling method, is introduced to identify differentially expressed gene sets. In addition, we discuss the impacts and correlations of genes within each gene set based on the measures of variable importance in the RF algorithm. Significant classifications are reported and visualized together with the underlying gene sets and their contribution to the phenotypes of interest. Numerical studies using both synthetic data and a series of publicly available gene expression data sets are conducted to evaluate the performance of the proposed methods. Compared with other hypothesis testing approaches, our proposed methods are reliable and successful in identifying enriched gene sets and in discovering the contributions of genes within a gene set. The classification results of identified gene sets can provide a valuable alternative to gene set testing to reveal the unknown, biologically relevant classes of samples or patients.
In summary, our proposed method allows one to simultaneously assess the discriminatory ability of gene sets and the importance of genes for interpretation of data in complex biological systems. The classifications of biologically defined gene sets can reveal the underlying interactions of gene sets associated with the phenotypes, and provide an insightful complement to conventional gene set analyses.
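The procedure described in the abstract (an RF classifier built on one gene set, its OOB error rate, and an empirical p-value from resampling) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation. It assumes that the resampling step is a permutation of the phenotype labels, that scikit-learn's `RandomForestClassifier` stands in for the RF algorithm, and that the impurity-based `feature_importances_` stands in for the variable importance measures; the abstract does not specify any of these. The function names `oob_error` and `gene_set_p_value` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def oob_error(X, y, n_trees=100, seed=0):
    """Fit a random forest on one gene set; return (OOB error rate, fitted forest)."""
    rf = RandomForestClassifier(
        n_estimators=n_trees, bootstrap=True, oob_score=True, random_state=seed
    )
    rf.fit(X, y)
    # oob_score_ is the OOB accuracy, so the OOB error rate is its complement.
    return 1.0 - rf.oob_score_, rf


def gene_set_p_value(X, y, n_perm=99, n_trees=100, seed=0):
    """Empirical p-value of the observed OOB error under permuted labels.

    Assumption: the null distribution is built by shuffling phenotype labels,
    which breaks any association between the gene set and the phenotype.
    """
    rng = np.random.default_rng(seed)
    observed, rf = oob_error(X, y, n_trees, seed)
    null_errors = np.array(
        [oob_error(X, rng.permutation(y), n_trees, seed)[0] for _ in range(n_perm)]
    )
    # Add-one correction: the observed labelling counts as one permutation,
    # so the p-value can never be exactly zero.
    p_value = (1 + np.sum(null_errors <= observed)) / (n_perm + 1)
    # Impurity-based importance of each gene in the set (assumed stand-in).
    return observed, p_value, rf.feature_importances_


# Toy data (hypothetical): 40 samples, a gene set of 10 genes,
# with the first 3 genes shifted upward in phenotype class 1.
data_rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = data_rng.normal(size=(40, 10))
X[y == 1, :3] += 2.0

observed, p_value, importance = gene_set_p_value(X, y, n_perm=49, n_trees=50)
print(f"OOB error: {observed:.3f}, empirical p-value: {p_value:.3f}")
print("Genes ranked by importance:", np.argsort(importance)[::-1])
```

A low OOB error paired with a small empirical p-value suggests that the gene set discriminates between the phenotypes better than chance, and the importance ranking indicates which genes drive that discrimination. With `n_perm` permutations the smallest attainable p-value is `1 / (n_perm + 1)`, so more permutations are needed when many gene sets are tested and a multiple-testing correction is applied.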

Similar articles

Random forests-based differential analysis of gene sets for gene expression data.

Gene. 2013 Apr 10;518(1):179-86. doi: 10.1016/j.gene.2012.11.034. Epub 2012 Dec 6.

A SATS algorithm for jointly identifying multiple differentially expressed gene sets.

Stat Med. 2011 Jul 20;30(16):2028-39. doi: 10.1002/sim.4235. Epub 2011 Apr 7.

Annotation-based distance measures for patient subgroup discovery in clinical microarray studies.

Bioinformatics. 2007 Sep 1;23(17):2256-64. doi: 10.1093/bioinformatics/btm322. Epub 2007 Jun 22.

Gene expression analysis in clear cell renal cell carcinoma using gene set enrichment analysis for biostatistical management.

BJU Int. 2011 Jul;108(2 Pt 2):E29-35. doi: 10.1111/j.1464-410X.2010.09794.x. Epub 2011 Mar 16.

Assessment of gene set analysis methods based on microarray data.

Gene. 2014 Jan 25;534(2):383-9. doi: 10.1016/j.gene.2013.08.063. Epub 2013 Sep 3.

ADGO: analysis of differentially expressed gene sets using composite GO annotation.

Bioinformatics. 2006 Sep 15;22(18):2249-53. doi: 10.1093/bioinformatics/btl378. Epub 2006 Jul 12.

SEGS: search for enriched gene sets in microarray data.

J Biomed Inform. 2008 Aug;41(4):588-601. doi: 10.1016/j.jbi.2007.12.001. Epub 2007 Dec 15.

Inference of combinatorial Boolean rules of synergistic gene sets from cancer microarray datasets.

Bioinformatics. 2010 Jun 15;26(12):1506-12. doi: 10.1093/bioinformatics/btq207. Epub 2010 Apr 21.

Challenges in projecting clustering results across gene expression-profiling datasets.

J Natl Cancer Inst. 2007 Nov 21;99(22):1715-23. doi: 10.1093/jnci/djm216. Epub 2007 Nov 13.

Comparison of seven methods for producing Affymetrix expression scores based on False Discovery Rates in disease profiling data.

BMC Bioinformatics. 2005 Feb 10;6:26. doi: 10.1186/1471-2105-6-26.

Cited by

Molecular clustering based on gene set expression and its relationship with prognosis in patients with lung adenocarcinoma.

J Thorac Dis. 2022 May;14(5):1638-1650. doi: 10.21037/jtd-22-557.

binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions.

BMC Bioinformatics. 2020 Aug 28;21(1):374. doi: 10.1186/s12859-020-03718-9.

Non-destructive monitoring of netted muskmelon quality based on its external phenotype using Random Forest.

PLoS One. 2019 Aug 19;14(8):e0221259. doi: 10.1371/journal.pone.0221259. eCollection 2019.

Prognostic value of cancer antigen -125 for lung adenocarcinoma patients with brain metastasis: A random survival forest prognostic model.

Sci Rep. 2018 Apr 4;8(1):5670. doi: 10.1038/s41598-018-23946-7.

GOexpress: an R/Bioconductor package for the identification and visualisation of robust gene ontology signatures through supervised learning of gene expression data.

BMC Bioinformatics. 2016 Mar 11;17:126. doi: 10.1186/s12859-016-0971-3.

Towards understanding the breast cancer epigenome: a comparison of genome-wide DNA methylation and gene expression data.

Oncotarget. 2016 Jan 19;7(3):3002-17. doi: 10.18632/oncotarget.6503.

MAVTgsa: an R package for gene set (enrichment) analysis.

Biomed Res Int. 2014;2014:346074. doi: 10.1155/2014/346074. Epub 2014 Jul 3.

Machine learning approaches distinguish multiple stress conditions using stress-responsive genes and identify candidate genes for broad resistance in rice.

Plant Physiol. 2014 Jan;164(1):481-95. doi: 10.1104/pp.113.225862. Epub 2013 Nov 14.
