Salem Saeed, Alokshiya Mohammed, Hasan Mohammad Al
North Dakota State University, Fargo, ND, 58102, USA.
Indiana University-Purdue University Indianapolis, Indianapolis, IN, 46202, USA.
BioData Min. 2021 Mar 16;14(1):19. doi: 10.1186/s13040-021-00250-1.
Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Maximal frequent subgraphs are a representative set of frequent subgraphs; A frequent subgraph is maximal if it does not have a super-graph that is frequent. In the bioinformatics discipline, methodologies for mining frequent and/or maximal frequent subgraphs can be used to discover interesting network motifs that elucidate complex interactions among genes, reflected through the edges of the frequent subnetworks. Further study of frequent coexpression subnetworks enhances the discovery of biological modules and biological signatures for gene expression and disease classification.
We propose a reverse search algorithm, called RASMA, for mining frequent and maximal frequent subgraphs in a given collection of graphs. A key innovation in RASMA is a connected subgraph enumerator that uses a reverse-search strategy to enumerate connected subgraphs of an undirected graph. Using this enumeration strategy, RASMA obtains all maximal frequent subgraphs very efficiently. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for the maximal frequent subgraphs, RASMA employs several pruning strategies that substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs.
Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are frequent are highly enriched with known biological ontologies.
给定一组基因上的共表达网络集合,识别频繁出现的子网络是一个重要的研究问题,即挖掘频繁子图。最大频繁子图是频繁子图的一个代表性集合;如果一个频繁子图没有频繁的超图,那么它就是最大的。在生物信息学领域,挖掘频繁和/或最大频繁子图的方法可用于发现有趣的网络基序,这些基序通过频繁子网络的边反映基因之间的复杂相互作用。对频繁共表达子网络的进一步研究有助于发现基因表达和疾病分类的生物模块和生物特征。
我们提出了一种反向搜索算法,称为RASMA,用于在给定的图集合中挖掘频繁和最大频繁子图。RASMA的一个关键创新是一个连通子图枚举器,它使用反向搜索策略来枚举无向图的连通子图。使用这种枚举策略,RASMA能够非常高效地获得所有最大频繁子图。为了克服在挖掘最大频繁子图时枚举所有频繁子图的计算量过大的任务,RASMA采用了几种剪枝策略,显著提高了其整体运行时性能。实验结果表明,在大型基因共表达网络上,该算法能够有效地挖掘出与生物学相关的最大频繁子图。
从多个基因表达实验中提取反复出现的基因共表达子网络,有助于发现功能模块和子网络生物标志物。我们提出了一种用于挖掘最大频繁子网络的反向搜索算法。对提取的最大频繁子网络的富集分析表明,频繁出现的子网络高度富集了已知的生物学本体。