Lu Liangqun, Townsend Kevin A, Daigle Bernie J
Department of Biological Sciences, University of Memphis, Memphis, USA.
Department of Computer Science, University of Memphis, Memphis, USA.
BMC Bioinformatics. 2021 Feb 3;22(1):44. doi: 10.1186/s12859-020-03932-5.
Differential expression and feature selection analyses are essential steps for the development of accurate diagnostic/prognostic classifiers of complicated human diseases using transcriptomics data. These steps are particularly challenging due to the curse of dimensionality and the presence of technical and biological noise. A promising strategy for overcoming these challenges is the incorporation of pre-existing transcriptomics data in the identification of differentially expressed (DE) genes. This approach has the potential to improve the quality of selected genes, increase classification performance, and enhance biological interpretability. While a number of methods have been developed that use pre-existing data for differential expression analysis, existing methods do not leverage the identities of experimental conditions to create a robust metric for identifying DE genes.
In this study, we propose a novel differential expression and feature selection method, GEOlimma, which combines pre-existing microarray data from the Gene Expression Omnibus (GEO) with the widely applied Limma method for differential expression analysis. We first quantified differential gene expression across 2481 pairwise comparisons from 602 curated GEO Datasets and converted the resulting differential expression frequencies to DE prior probabilities. Genes with high DE prior probabilities were enriched in cell growth and death, signal transduction, and cancer-related biological pathways, while genes with low prior probabilities were enriched in sensory system pathways. We then applied GEOlimma to four differential expression comparisons within two human disease datasets and performed differential expression, feature selection, and supervised classification analyses. Our results suggest that GEOlimma provides greater experimental power to detect DE genes than Limma, owing to its increased effective sample size. Furthermore, in a supervised classification analysis using GEOlimma as a feature selection method, we observed classification performance similar to or better than that of Limma on small, noisy subsets of an asthma dataset.
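The idea of turning DE frequencies into gene-specific prior probabilities can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pseudocount smoothing, and the clipping bounds are assumptions made for illustration. The re-weighting step relies only on the fact that a log posterior odds of differential expression (such as Limma's B-statistic, which by default assumes a prior DE proportion of 0.01 for every gene) is the sum of the prior log-odds and a data-dependent term, so swapping in a gene-specific prior amounts to shifting the score by the difference in prior log-odds.

```python
import numpy as np


def de_prior_probabilities(de_calls, pseudocount=1.0, floor=1e-4, ceil=1 - 1e-4):
    """Estimate a per-gene DE prior from historical DE calls.

    de_calls: (n_genes, n_comparisons) boolean array, True where a gene was
    called DE in a past comparison. The pseudocount and the clipping bounds
    are illustrative assumptions, not values taken from the paper.
    """
    de_calls = np.asarray(de_calls, dtype=bool)
    n_comparisons = de_calls.shape[1]
    # Smoothed DE frequency keeps every prior strictly between 0 and 1.
    freq = (de_calls.sum(axis=1) + pseudocount) / (n_comparisons + 2 * pseudocount)
    return np.clip(freq, floor, ceil)


def logit(p):
    """Log-odds of a probability."""
    p = np.asarray(p, dtype=float)
    return np.log(p) - np.log1p(-p)


def reweight_log_odds(b_stat, gene_prior, default_prior=0.01):
    """Replace a uniform prior with gene-specific priors in a log posterior odds.

    b_stat: log posterior odds computed under `default_prior` for every gene.
    Returns the log posterior odds under `gene_prior` instead.
    """
    b_stat = np.asarray(b_stat, dtype=float)
    return b_stat - logit(default_prior) + logit(gene_prior)


# Toy example: 3 genes observed across 4 historical comparisons.
history = np.array([
    [True, True, True, True],      # frequently DE
    [True, False, True, False],    # sometimes DE
    [False, False, False, False],  # never DE
])
priors = de_prior_probabilities(history)

# Identical evidence from the new experiment for all three genes.
b_uniform = np.zeros(3)
adjusted = reweight_log_odds(b_uniform, priors)
print(priors)
print(adjusted)
```

With identical evidence from the new experiment, the genes are ranked purely by their historical DE frequency, which shows how pre-existing data can break ties or shift rankings when the new dataset is small or noisy.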
Our results demonstrate that GEOlimma is a more effective method for differential gene expression and feature selection analyses than the standard Limma method. Because it operates on gene-level differential expression, GEOlimma also has the potential to be applied to other high-throughput biological datasets.