Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA.
BMC Bioinformatics. 2020 Apr 10;21(1):139. doi: 10.1186/s12859-020-3447-4.
Functional enrichment of genes and pathways based on Gene Ontology (GO) has been widely used to describe the results of various -omics analyses. GO terms that are statistically overrepresented within a large gene set are typically used to describe the main functional attributes of that set. However, these lists of overrepresented GO terms are often too large and contain redundant, overlapping GO terms, which hinders informative functional interpretation.
We developed GOMCL to reduce redundancy and summarize lists of GO terms effectively and informatively. This lightweight Python toolkit efficiently identifies clusters within a list of GO terms using the Markov Clustering (MCL) algorithm, based on the overlap of gene members between GO terms. GOMCL facilitates biological interpretation of a large number of GO terms by condensing them into GO clusters representing non-overlapping functional themes. It can visualize GO clusters as a heatmap, as networks based on either overlap of members or hierarchy among GO terms, and as tables with depth and cluster information for each GO term. Each GO cluster generated by GOMCL can be evaluated and further divided into non-overlapping sub-clusters using the GOMCL-sub module. The outputs from both GOMCL and GOMCL-sub can be imported into Cytoscape for additional visualization.
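The clustering idea described above can be illustrated with a minimal sketch. This is not GOMCL's source code: the function names, the choice of the overlap coefficient as the similarity measure, the 0.5 cutoff, the inflation value of 2.0, and the toy term names are all assumptions made for illustration. The sketch builds a term-by-term similarity matrix from shared gene members and then runs a plain Markov Clustering loop (expansion followed by inflation) in pure Python.

```python
def overlap_coefficient(a, b):
    """Shared genes divided by the size of the smaller gene set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Square matrix product, used for the MCL expansion step."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(sim, inflation=2.0, max_iter=100, tol=1e-9):
    """Plain MCL on a symmetric similarity matrix; returns sets of indices."""
    n = len(sim)
    # Self-loops are added before normalization, as is common for MCL.
    m = normalize_columns(
        [[1.0 if i == j else sim[i][j] for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        expanded = matmul(m, m)                          # expansion
        inflated = normalize_columns(                    # inflation
            [[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that keeps non-zero mass is an attractor; the columns with
    # non-zero entries in that row are the members of its cluster.
    clusters = []
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return clusters


def cluster_go_terms(term_genes, cutoff=0.5, inflation=2.0):
    """Cluster GO terms by gene-member overlap (hypothetical helper)."""
    terms = sorted(term_genes)
    n = len(terms)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = overlap_coefficient(term_genes[terms[i]], term_genes[terms[j]])
            if s >= cutoff:  # similarities below the cutoff are dropped
                sim[i][j] = sim[j][i] = s
    return sorted(sorted(terms[k] for k in c) for c in mcl(sim, inflation))


# Toy input with made-up term and gene names.
toy_terms = {
    "term_A": {"g1", "g2", "g3", "g4"},
    "term_B": {"g1", "g2", "g3"},
    "term_C": {"g2", "g3", "g4", "g5"},
    "term_D": {"g10", "g11", "g12"},
    "term_E": {"g10", "g11", "g13"},
}
print(cluster_go_terms(toy_terms))
```

In this sketch a higher inflation value tends to produce more, smaller clusters, and a higher cutoff removes weak overlaps before clustering. A Jaccard index could be substituted for the overlap coefficient without changing the rest of the code.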
GOMCL is a convenient toolkit to cluster, evaluate, and extract non-redundant associations of Gene Ontology-based functions. GOMCL helps researchers reduce the time spent on manual curation of large lists of GO terms, minimize biases introduced by redundant GO terms in data interpretation, and batch-process multiple GO enrichment datasets. A user guide, a test dataset, and the source code of GOMCL are available at https://github.com/Guannan-Wang/GOMCL and www.lsugenomics.org.