IEEE J Biomed Health Inform. 2020 Jan;24(1):311-318. doi: 10.1109/JBHI.2019.2896144. Epub 2019 Jan 30.
The Gene Expression Omnibus (GEO) repository hosts an exponentially growing number of gene expression studies. The expression data and the associated metadata provide an abundant resource for knowledge discovery. Each study in GEO focuses on the gene expression perturbation of a specific subject (e.g., a gene, drug, or disease). Identifying those subjects and the associations among them would benefit further in-depth studies, but neither can be inferred directly from the studies. A unified representation of those subjects (i.e., gene expression signatures) is therefore desirable. We developed GESgnExt for the automatic construction of gene expression signatures. The resulting 6542 signatures are built on 1934 series and 35 919 samples from GEO. To evaluate its significance, we calculated the similarities among those signatures and compared the discovered associations against existing interaction databases. The signatures connect genes, drugs, and diseases, covering most of the experimentally validated interactions. In addition, we discovered 3307 novel signatures and their related associations, complementing existing signature knowledge. The biomedical relevance of GESgnExt is further demonstrated in multiple case studies, providing mechanistic insights into its knowledge discovery process.