Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing 100871, China.
J Genet Genomics. 2018 Sep 20;45(9):489-504. doi: 10.1016/j.jgg.2018.08.002. Epub 2018 Sep 13.
Gene set enrichment (GSE) analyses play an important role in the interpretation of large-scale transcriptome datasets. Because the many available GSE tools perform inconsistently, obtaining optimal results from any single tool is challenging, which motivates integrating multiple GSE tools into one ensemble method. Several existing ensemble methods, however, produce several different scores for sorting pathways, and it is difficult for users to choose the single ensemble score that gives the best final results. Here, we develop an ensemble method using a machine learning approach, called Combined Gene set analysis incorporating Prioritization and Sensitivity (CGPS), that integrates the results of nine prominent GSE tools into a single ensemble score (R score) for sorting pathways. Moreover, to the best of our knowledge, CGPS is the first GSE ensemble method built on a priori knowledge of pathways and phenotypes. In an evaluation of 120 simulated datasets and 45 real datasets, we demonstrate that sorting pathways by the R score prioritizes relevant pathways better than 10 widely used individual methods and five types of ensemble scores from two ensemble methods. Additionally, we apply CGPS to expression data involving panobinostat, an anticancer drug used to treat multiple myeloma. The results identify cell processes associated with cancer, such as the p53 signaling pathway (hsa04115); by contrast, two ensemble methods (EnrichmentBrowser and EGSEA) rank this pathway outside the top 20, which may cause users to miss it in their analyses. We show that this method, which is based on a priori knowledge, can capture valuable biological information from numerous types of gene set collections, such as KEGG pathways, GO terms, Reactome, and BioCarta. CGPS is publicly available as standalone source code at ftp://ftp.cbi.pku.edu.cn/pub/CGPS_download/cgps-1.0.0.tar.gz.
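To make the idea of sorting pathways by one ensemble score concrete, the following minimal sketch combines per-tool p-values into a single ordering by averaging each pathway's rank across tools. This is a plain mean-rank aggregation chosen for illustration; it is not the CGPS R score, which the abstract describes as coming from a machine learning approach. The function name `rank_pathways`, the tool names, and all p-values are hypothetical, and the sketch assumes every tool reports the same set of pathways.

```python
from statistics import mean


def rank_pathways(p_values_by_tool):
    """Return pathway IDs sorted by mean rank across tools (best first).

    p_values_by_tool maps a tool name to a dict of pathway ID -> p-value.
    Assumption: every tool reports the same set of pathways.
    """
    per_pathway_ranks = {}
    for p_values in p_values_by_tool.values():
        # Rank 1 is the smallest p-value within a tool; ties break on pathway ID.
        ordered = sorted(p_values, key=lambda pid: (p_values[pid], pid))
        for rank, pid in enumerate(ordered, start=1):
            per_pathway_ranks.setdefault(pid, []).append(rank)
    # The illustrative ensemble score is the mean of the per-tool ranks.
    mean_rank = {pid: mean(ranks) for pid, ranks in per_pathway_ranks.items()}
    return sorted(mean_rank, key=lambda pid: (mean_rank[pid], pid))


# Made-up p-values for three hypothetical tools; not results from the paper.
example = {
    "tool_a": {"hsa04115": 0.001, "hsa04110": 0.020, "hsa00010": 0.500},
    "tool_b": {"hsa04115": 0.030, "hsa04110": 0.010, "hsa00010": 0.400},
    "tool_c": {"hsa04115": 0.002, "hsa04110": 0.050, "hsa00010": 0.300},
}
ranking = rank_pathways(example)
```

In this toy input the tools disagree on the top pathway, yet the aggregated ordering still yields one list for the user to read, which is the practical benefit a single ensemble score is meant to provide.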