IEEE/ACM Trans Comput Biol Bioinform. 2017 Nov-Dec;14(6):1302-1315. doi: 10.1109/TCBB.2016.2586046. Epub 2016 Jun 28.
With recent advances in high-throughput cell biology, the amount of cellular biological data has grown drastically. Such data are often modeled as graphs (also called networks), and studying these graphs can lead to new insights into molecule-level organization. One way to understand their structure is to analyze the smaller components that constitute them, namely network motifs and graphlets. Graphlets are particularly well suited to comparing networks and assessing their similarity because of the rich topological information they offer. However, they are almost always used as small undirected graphs of up to five nodes, which limits their applicability to directed networks. Many interesting biological networks, such as metabolic, cell signaling, and transcriptional regulatory networks, are intrinsically directional, and using metrics that ignore edge direction may gravely hinder information extraction. Our main purpose in this work is to extend the applicability of graphlets to directed networks by taking edge direction into account, thus providing a powerful basis for the analysis of directed biological networks. We tested our approach on two network sets, one composed of synthetic graphs and another of real directed biological networks, and verified that the networks were more accurately grouped using directed graphlets than undirected graphlets. It is also evident that directed graphlets offer substantially more topological information than simple graph metrics such as degree distribution or reciprocity. Enumerating graphlets in large networks is, however, a computationally demanding task. Our implementation addresses this concern by using a state-of-the-art data structure, the g-trie, which greatly reduces the necessary computation. We compared our tool to other state-of-the-art methods and verified that it is the fastest general tool for graphlet counting.
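To illustrate why ignoring edge direction can discard topological information, the following is a minimal sketch that counts induced, weakly connected 3-node subgraph types in a small graph, with and without direction. It is not the authors' tool: it uses brute-force enumeration rather than a g-trie, it counts subgraph types only (not graphlet orbits), and the function names `canonical_form`, `is_weakly_connected`, and `count_subgraphs` are hypothetical names chosen for this example.

```python
from collections import Counter
from itertools import combinations, permutations


def canonical_form(nodes, edges):
    """Return a label shared by all isomorphic directed subgraphs on `nodes`.

    Brute force: try every node ordering and keep the smallest flattened
    adjacency matrix (diagonal skipped). Only practical for tiny subgraphs.
    """
    best = None
    for perm in permutations(nodes):
        bits = tuple(
            1 if (u, v) in edges else 0
            for u in perm
            for v in perm
            if u != v
        )
        if best is None or bits < best:
            best = bits
    return best


def is_weakly_connected(nodes, edges):
    """Check connectivity of the induced subgraph, ignoring edge direction."""
    nodes = list(nodes)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v in nodes:
            if v not in seen and ((u, v) in edges or (v, u) in edges):
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def count_subgraphs(edge_list, k=3, directed=True):
    """Count induced, weakly connected k-node subgraphs by isomorphism class."""
    edges = set(edge_list)
    if not directed:
        # Model an undirected graph by adding the reverse of every edge.
        edges |= {(v, u) for (u, v) in edges}
    nodes = sorted({n for e in edges for n in e})
    counts = Counter()
    for subset in combinations(nodes, k):
        if is_weakly_connected(subset, edges):
            counts[canonical_form(subset, edges)] += 1
    return counts


# Two toy 3-node networks with the same undirected skeleton (a triangle).
feed_forward_loop = [(0, 1), (1, 2), (0, 2)]
directed_cycle = [(0, 1), (1, 2), (2, 0)]

same_undirected = (
    count_subgraphs(feed_forward_loop, directed=False)
    == count_subgraphs(directed_cycle, directed=False)
)
same_directed = count_subgraphs(feed_forward_loop) == count_subgraphs(directed_cycle)
print(same_undirected, same_directed)
```

In this toy case the feed-forward loop and the directed cycle cannot be told apart once direction is dropped, while the directed counts separate them. The enumeration cost grows quickly with network size and subgraph size, which is the computational burden that the abstract says the g-trie is used to reduce.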