Hou Lin, Wang Lin, Berg Arthur, Qian Minping, Zhu Yunping, Li Fangting, Deng Minghua
LMAM, School of Mathematical Sciences, Peking University, Beijing 100871, China.
Front Biosci (Elite Ed). 2012 Jan 1;4(6):2150-61. doi: 10.2741/e532.
The goal of network clustering algorithms is to detect dense clusters in a network, which provides a first step towards understanding large-scale biological networks. With numerous recent advances in biotechnology, large-scale genetic interaction data are widely available, but there is limited understanding of which clustering algorithms are most effective. To address this problem, we conducted a systematic study to compare and evaluate six clustering algorithms for analyzing genetic interaction networks, and investigated the factors that influence the choice of algorithm. The algorithms considered in this comparison are hierarchical clustering, topological overlap matrix, bi-clustering, Markov clustering, Bayesian discriminant analysis based community detection, and the variational Bayes approach to modularity. Both experimentally identified and synthetically constructed networks were used in this comparison. The accuracy of each algorithm was measured by the Jaccard index between predicted gene modules and benchmark gene sets. The results suggest that the best choice of algorithm depends on the network topology and the evaluation criteria. Hierarchical clustering was best at predicting protein complexes; Bayesian discriminant analysis based community detection performed best on epistatic miniarray profile (EMAP) datasets; the variational Bayes approach to modularity was noticeably better than the other algorithms on genome-scale networks.
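The Jaccard index used as the accuracy measure is the size of the intersection of two sets divided by the size of their union. The following is a minimal Python sketch of that calculation, not the authors' evaluation code. The abstract does not say how predicted modules are paired with benchmark gene sets, so scoring each module against its best-matching benchmark set is an assumption made here, and the function names and gene names are hypothetical.

```python
def jaccard_index(predicted, benchmark):
    """Return |A ∩ B| / |A ∪ B| for two collections of gene identifiers."""
    a, b = set(predicted), set(benchmark)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets score 0.
        return 0.0
    return len(a & b) / len(union)


def best_match_scores(modules, benchmarks):
    """Score each predicted module by its highest Jaccard index
    against any benchmark set (an assumed matching rule)."""
    return [
        max((jaccard_index(m, b) for b in benchmarks), default=0.0)
        for m in modules
    ]


# Hypothetical gene names, for illustration only.
predicted_modules = [{"geneA", "geneB", "geneC"}, {"geneD", "geneE"}]
benchmark_sets = [{"geneA", "geneB"}, {"geneD", "geneE", "geneF", "geneG"}]

scores = best_match_scores(predicted_modules, benchmark_sets)
```

In this toy input the first module shares two of three genes in the union with its closest benchmark set, and the second shares two of four, so a module that adds spurious genes and one that misses true members are both penalized.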