Hu Jianchang, Szymczak Silke
Institute of Medical Biometry and Statistics, University of Lübeck, Ratzeburger Allee 160, Lübeck, 23562, Germany.
BioData Min. 2024 Apr 16;17(1):10. doi: 10.1186/s13040-024-00361-5.
Gene network information is believed to benefit disease module and pathway identification, but it has not been explicitly used in the standard random forest (RF) algorithm for gene expression data analysis. We investigate the performance of a network-guided RF in which the network information is summarized into sampling probabilities for the predictor variables, and these probabilities are then used in constructing the RF.
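To make the idea of a network-derived sampling probability concrete, the following is a minimal sketch, not the authors' implementation. It assumes, purely for illustration, that each gene's weight is its network degree plus a smoothing constant; the abstract does not say how the network is actually summarized. The names `sampling_probabilities`, `draw_candidates`, `smoothing`, and `mtry` are hypothetical.

```python
import numpy as np


def sampling_probabilities(adjacency, smoothing=1.0):
    """Turn a gene network into per-gene sampling probabilities.

    Assumption (not from the source): weight = degree + smoothing,
    normalised so that the probabilities sum to 1.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    degree = adjacency.sum(axis=1)
    weights = degree + smoothing
    return weights / weights.sum()


def draw_candidates(probs, mtry, rng):
    """Draw `mtry` distinct predictor indices, weighted by `probs`.

    In a network-guided forest, a draw like this would replace the
    uniform choice of candidate predictors.
    """
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


# Toy network: gene 0 is a hub linked to genes 1..4; gene 5 is isolated.
adjacency = np.zeros((6, 6))
adjacency[0, 1:5] = 1
adjacency[1:5, 0] = 1

probs = sampling_probabilities(adjacency)

# Count how often each gene is offered as a candidate predictor.
rng = np.random.default_rng(0)
counts = np.zeros(6, dtype=int)
for _ in range(2000):
    counts[draw_candidates(probs, mtry=2, rng=rng)] += 1
```

Under this assumed weighting, the hub gene is offered as a candidate far more often than the isolated gene, whether or not it is related to the outcome. This is one plausible route to the spurious selection of hub genes reported in the results.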
Our simulation results suggest that network-guided RF does not provide better disease prediction than the standard RF. For disease gene discovery, network-guided RF identifies disease genes more accurately if they form one or more modules. However, when disease status is independent of the genes in the given network, using network information can produce spurious gene selection results, especially for hub genes. Our empirical analysis of two balanced breast cancer datasets (microarray and RNA-Seq) from The Cancer Genome Atlas (TCGA), classifying progesterone receptor (PR) status, also demonstrates that network-guided RF can identify genes from PGR-related pathways, yielding a better-connected module of identified genes.
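One way to quantify how well connected a set of identified genes is can be sketched as follows. The metric here, the fraction of selected genes that fall in the largest connected component of the induced subnetwork, is an assumption chosen for illustration; the abstract does not state which connectivity measure was used. The gene labels and the function name `largest_component_fraction` are hypothetical.

```python
from collections import deque


def largest_component_fraction(edges, selected):
    """Fraction of selected genes in the largest connected component
    of the subnetwork induced by the selected genes."""
    selected = set(selected)
    if not selected:
        return 0.0

    # Keep only edges whose two endpoints were both selected.
    neighbours = {gene: set() for gene in selected}
    for a, b in edges:
        if a in selected and b in selected:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    best = 0
    for start in selected:
        if start in seen:
            continue
        # Breadth-first search to measure this component's size.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        best = max(best, size)
    return best / len(selected)


# Hypothetical toy network and two hypothetical gene selections.
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g5", "g6")]
connected = largest_component_fraction(edges, ["g1", "g2", "g3"])
scattered = largest_component_fraction(edges, ["g1", "g3", "g5"])
```

In this toy case the first selection forms a single module, while the second consists of three genes with no edges among them, so the first scores higher on this measure.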
Gene networks can provide additional information to aid gene expression analysis for disease module and pathway identification. However, they need to be used with caution, and the results need to be validated to guard against spurious gene selection. More robust approaches to incorporating such information into RF construction also warrant further study.