Rue-Albrecht Kévin, McGettigan Paul A, Hernández Belinda, Nalpas Nicolas C, Magee David A, Parnell Andrew C, Gordon Stephen V, MacHugh David E
Animal Genomics Laboratory, UCD School of Agriculture and Food Science, University College Dublin, Dublin 4, Ireland.
Centre for Pharmacology and Therapeutics, Division of Experimental Medicine, Imperial College London, Hammersmith Hospital, London, W12 0NN, UK.
BMC Bioinformatics. 2016 Mar 11;17:126. doi: 10.1186/s12859-016-0971-3.
Identification of gene expression profiles that differentiate experimental groups is critical for discovery and analysis of key molecular pathways and also for selection of robust diagnostic or prognostic biomarkers. While integration of differential expression statistics has been used to refine gene set enrichment analyses, such approaches are typically limited to single gene lists resulting from simple two-group comparisons or time-series analyses. In contrast, functional class scoring and machine learning approaches provide powerful alternative methods to leverage molecular measurements for pathway analyses, and to compare continuous and multi-level categorical factors.
We introduce GOexpress, a software package for scoring and summarising the capacity of gene ontology features to simultaneously classify samples from multiple experimental groups. GOexpress integrates normalised gene expression data (e.g., from microarray and RNA-seq experiments) and phenotypic information of individual samples with gene ontology annotations to derive a ranking of genes and gene ontology terms using a supervised learning approach. The default random forest algorithm allows interactions between all experimental factors, and competitive scoring of expressed genes to evaluate their relative importance in classifying predefined groups of samples.
GOexpress enables rapid identification and visualisation of ontology-related gene panels that robustly classify groups of samples and supports both categorical (e.g., infection status, treatment) and continuous (e.g., time-series, drug concentrations) experimental factors. The use of standard Bioconductor extension packages and publicly available gene ontology annotations facilitates straightforward integration of GOexpress within existing computational biology pipelines.
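As a rough illustration of how a per-gene ranking can be summarised at the gene ontology level, the sketch below averages gene scores and ranks over the genes annotated to each term. It is not the GOexpress implementation or API: the function names, gene names, term identifiers and importance values are all hypothetical, the scores stand in for the variable importance a random forest would produce, and averaging over annotated genes is an assumed summarisation rule.

```python
from statistics import mean


def rank_genes(scores):
    """Map each gene to its rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=lambda gene: scores[gene], reverse=True)
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def score_go_terms(scores, annotations):
    """Summarise gene-level scores per ontology term (assumed rule: mean).

    scores      -- {gene: importance}; hypothetical stand-in for random
                   forest variable importance
    annotations -- {term: [genes]}; hypothetical ontology annotations
    Returns a list of (term, summary) pairs sorted by average rank.
    """
    ranks = rank_genes(scores)
    summary = {}
    for term, genes in annotations.items():
        # Assumption: genes without expression data are simply ignored.
        measured = [gene for gene in genes if gene in scores]
        if not measured:
            continue
        summary[term] = {
            "ave_score": mean(scores[gene] for gene in measured),
            "ave_rank": mean(ranks[gene] for gene in measured),
            "n_genes": len(measured),
        }
    return sorted(summary.items(), key=lambda item: item[1]["ave_rank"])


# Hypothetical toy inputs, not taken from the paper.
example_scores = {"geneA": 0.9, "geneB": 0.5, "geneC": 0.1, "geneD": 0.3}
example_annotations = {
    "GO:term1": ["geneA", "geneB"],
    "GO:term2": ["geneC", "geneD"],
    "GO:term3": ["geneA", "geneX"],  # geneX has no expression data
}
example_ranking = score_go_terms(example_scores, example_annotations)
```

With these toy inputs, `example_ranking` lists `GO:term3` first, because its only measured gene holds the top rank, followed by `GO:term1` and `GO:term2`. Ranking terms by average rank rather than average score is one possible design choice; it keeps a single extreme score from dominating a term.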