CASS: A distributed network clustering algorithm based on structure similarity for large-scale network.

Affiliations

Department of Computer Science, Yonsei University, Seoul, South Korea.

Korea Institute of Science and Technology Information, Daejeon, South Korea.

Publication information

PLoS One. 2018 Oct 10;13(10):e0203670. doi: 10.1371/journal.pone.0203670. eCollection 2018.

DOI:10.1371/journal.pone.0203670

PMID:30303961

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6179193/

Abstract

As networks grow, analyzing large-scale network data is becoming increasingly important, and network clustering algorithms are a useful tool for that analysis. Most research on conventional network clustering algorithms targets a single-machine environment rather than a parallel one. However, these algorithms cannot analyze large-scale network data because of memory limits. As a solution, we propose a network clustering algorithm for large-scale network data analysis on Apache Spark, changing the paradigm of the conventional clustering algorithm to make it efficient in the Apache Spark environment. We also apply optimization approaches such as a Bloom filter and shuffle selection to reduce memory usage and execution time. By evaluating the proposed algorithm with the average normalized cut, we confirmed that it can analyze diverse large-scale network datasets, including biological, co-authorship, internet topology, and social networks. Experimental results show that the proposed algorithm finds more accurate clusters than the compared algorithms while using less memory. Furthermore, we confirm the effectiveness of the proposed optimization approaches and the scalability of the proposed algorithm. In addition, we validate that the clusters found by the proposed algorithm can represent biologically meaningful functions.
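The abstract names the average normalized cut as its evaluation measure without defining it. The sketch below uses one common textbook definition, which may differ in detail from the paper's: for each cluster C, the number of edges leaving C divided by the sum of the degrees of the nodes in C, averaged over all clusters. The function name `average_normalized_cut`, the edge-list input, and the assumption of disjoint clusters on an undirected graph are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def average_normalized_cut(edges, clusters):
    """Mean over clusters of cut(C, V - C) / vol(C) for an undirected graph.

    edges    -- iterable of (u, v) pairs, each undirected edge listed once
    clusters -- list of disjoint node sets (assumed, not checked)
    """
    # Map each node to the index of the cluster that contains it.
    label = {}
    for idx, members in enumerate(clusters):
        for node in members:
            label[node] = idx

    cut = defaultdict(int)  # number of edges leaving each cluster
    vol = defaultdict(int)  # sum of node degrees inside each cluster

    for u, v in edges:
        cu, cv = label.get(u), label.get(v)
        # Every edge endpoint adds one to the degree sum of its cluster.
        if cu is not None:
            vol[cu] += 1
        if cv is not None:
            vol[cv] += 1
        # An edge whose endpoints lie in different clusters is a cut edge.
        if cu != cv:
            if cu is not None:
                cut[cu] += 1
            if cv is not None:
                cut[cv] += 1

    # Skip clusters with no incident edges to avoid dividing by zero.
    scores = [cut[i] / vol[i] for i in range(len(clusters)) if vol[i] > 0]
    return sum(scores) / len(scores) if scores else 0.0


# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
print(average_normalized_cut(edges, [{1, 2, 3}, {4, 5, 6}]))
```

Under this definition, lower values indicate clusters that are better separated from the rest of the network. In the example, each triangle has one outgoing edge and a degree sum of 7, so each cluster scores 1/7.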

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a90f/6179193/0a4b98dd4d09/pone.0203670.g001.jpg
