Sun Duanchen, Liu Yinliang, Zhang Xiang-Sun, Wu Ling-Yun
Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China.
National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing, 100190, China.
BMC Syst Biol. 2017 Sep 21;11(Suppl 4):75. doi: 10.1186/s12918-017-0456-7.
High-throughput experimental techniques have improved dramatically and been widely applied over the past decades. However, biological interpretation of high-throughput experimental results, such as differentially expressed gene sets derived from microarray or RNA-seq experiments, remains a challenging task. Gene Ontology (GO) is commonly used in functional enrichment studies. The GO terms identified by current functional enrichment analysis tools often include direct parent or descendant terms of one another in the GO hierarchy. Such highly redundant terms make it difficult for users to analyze the underlying biological processes.
In this paper, we propose NetGen, a novel network-based probabilistic generative model for functional enrichment analysis. NetGen explicitly uses an additional protein-protein interaction (PPI) network to assist the identification of significantly enriched GO terms. In simulation studies, NetGen outperformed existing methods. Its effectiveness was further explored on four real datasets. Notably, for each disease NetGen identified several GO terms that were not directly linked to the active gene list; according to the curated literature, these terms were closely related to the corresponding diseases. NetGen is implemented in the R package CopTea, publicly available on GitHub ( http://github.com/wulingyun/CopTea/ ).
Our procedure yields more reasonable and interpretable functional enrichment results. As a novel term combination-based functional enrichment analysis method, NetGen is complementary to current individual term-based methods and can help to explore the underlying pathogenesis of complex diseases.
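To make the contrast with individual term-based methods concrete, the following is a minimal sketch of the conventional one-term-at-a-time approach, here a one-sided hypergeometric test. It is not NetGen and does not use the CopTea package; the gene identifiers, term names, and function names are hypothetical and chosen only for illustration. In this toy setup a child term and its parent term both receive small p-values from the same overlapping genes, which is the kind of redundancy described above.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_universe, n_term, n_active, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    n_universe: number of genes in the background
    n_term:     number of background genes annotated to the term
    n_active:   number of genes in the active (e.g. differentially expressed) list
    n_overlap:  number of active genes annotated to the term
    """
    max_overlap = min(n_term, n_active)
    total = comb(n_universe, n_active)
    # math.comb returns 0 when k > n, so impossible overlaps contribute nothing.
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_active - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def enrich_each_term(active, annotations, universe):
    """Test every term independently, ignoring the GO hierarchy (toy baseline)."""
    active = active & universe
    p_values = {}
    for term, genes in annotations.items():
        genes = genes & universe
        overlap = len(active & genes)
        p_values[term] = hypergeom_enrichment_pvalue(
            len(universe), len(genes), len(active), overlap
        )
    return p_values


# Hypothetical toy data: 20 background genes and three terms.
# "parent_term" contains every gene of "child_term", as in a GO parent-child pair.
universe = {f"g{i}" for i in range(1, 21)}
annotations = {
    "child_term": {"g1", "g2", "g3", "g4"},
    "parent_term": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "unrelated_term": {"g10", "g11", "g12", "g13"},
}
active = {"g1", "g2", "g3", "g4", "g7"}

p_values = enrich_each_term(active, annotations, universe)
for term, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(f"{term}: p = {p:.5f}")
```

Because each term is scored in isolation, the parent term is reported alongside its child even though it adds little new information. A term combination-based method instead evaluates sets of terms jointly.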