Alon Noga, Dao Phuong, Hajirasouliha Iman, Hormozdiari Fereydoun, Sahinalp S Cenk
School of Mathematical Sciences, Tel Aviv University, Ramat Aviv, Israel.
Bioinformatics. 2008 Jul 1;24(13):i241-9. doi: 10.1093/bioinformatics/btn163.
Protein-protein interaction (PPI) networks of many organisms share global topological features such as degree distribution, k-hop reachability, betweenness and closeness. Yet, some of these networks can differ significantly from the others in terms of local structures: e.g. the number of specific network motifs can vary significantly among PPI networks. Counting the number of network motifs provides a major challenge to compare biomolecular networks. Recently developed algorithms have been able to count the number of induced occurrences of subgraphs with k < or = 7 vertices. Yet no practical algorithm exists for counting non-induced occurrences, or counting subgraphs with k > or = 8 vertices. Counting non-induced occurrences of network motifs is not only challenging but also quite desirable as available PPI networks include several false interactions and miss many others. In this article, we show how to apply the 'color coding' technique for counting non-induced occurrences of subgraph topologies in the form of trees and bounded treewidth subgraphs. Our algorithm can count all occurrences of motif G' with k vertices in a network G with n vertices in time polynomial with n, provided k = O(log n). We use our algorithm to obtain 'treelet' distributions for k < or = 10 of available PPI networks of unicellular organisms (Saccharomyces cerevisiae Escherichia coli and Helicobacter Pyloris), which are all quite similar, and a multicellular organism (Caenorhabditis elegans) which is significantly different. Furthermore, the treelet distribution of the unicellular organisms are similar to that obtained by the 'duplication model' but are quite different from that of the 'preferential attachment model'. The treelet distribution is robust w.r.t. sparsification with bait/edge coverage of 70% but differences can be observed when bait/edge coverage drops to 50%.
许多生物体的蛋白质-蛋白质相互作用(PPI)网络具有一些全局拓扑特征,如度分布、k跳可达性、介数和紧密性。然而,其中一些网络在局部结构方面可能与其他网络有显著差异:例如,特定网络基序的数量在PPI网络之间可能有很大差异。计算网络基序的数量是比较生物分子网络的一项重大挑战。最近开发的算法能够计算具有k≤7个顶点的子图的诱导出现次数。然而,对于计算非诱导出现次数或计算具有k≥8个顶点的子图,尚无实用算法。计算网络基序的非诱导出现次数不仅具有挑战性,而且非常必要,因为现有的PPI网络包含一些错误的相互作用,并且遗漏了许多其他相互作用。在本文中,我们展示了如何应用“颜色编码”技术来计算树和有界树宽子图形式的子图拓扑的非诱导出现次数。我们的算法可以在时间复杂度为n的多项式时间内,计算出具有n个顶点的网络G中具有k个顶点的基序G'的所有出现次数,前提是k = O(log n)。我们使用我们的算法获得了单细胞生物(酿酒酵母、大肠杆菌和幽门螺杆菌)和多细胞生物(秀丽隐杆线虫)现有PPI网络中k≤10的“小树”分布,其中单细胞生物的分布都非常相似,而多细胞生物的分布则有显著差异。此外,单细胞生物的小树分布与通过“复制模型”获得的分布相似,但与“优先连接模型”的分布有很大不同。小树分布对于诱饵/边覆盖率为70%的稀疏化是稳健的,但当诱饵/边覆盖率降至50%时,可以观察到差异。