PACC：在 Hadoop 和 Spark 上进行大规模连通分量计算。

PACC: Large scale connected component computation on Hadoop and Spark.

机构信息

Kookmin University, Seoul, Republic of Korea.

Carnegie Mellon University, Pittsburgh, PA, United States of America.

出版信息

PLoS One. 2020 Mar 18;15(3):e0229936. doi: 10.1371/journal.pone.0229936. eCollection 2020.

DOI:10.1371/journal.pone.0229936

PMID:32187232

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC7080249/

Abstract

A connected component in a graph is a set of nodes linked to each other by paths. The problem of finding connected components has been applied to diverse graph analysis tasks such as graph partitioning, graph compression, and pattern recognition. Several distributed algorithms have been proposed to find connected components in enormous graphs. Ironically, the distributed algorithms do not scale enough due to unnecessary data IO & processing, massive intermediate data, numerous rounds of computations, and load balancing issues. In this paper, we propose a fast and scalable distributed algorithm PACC (Partition-Aware Connected Components) for connected component computation based on three key techniques: two-step processing of partitioning & computation, edge filtering, and sketching. PACC considerably shrinks the size of intermediate data, the size of input graph, and the number of rounds without suffering from load balancing issues. PACC performs 2.9 to 10.7 times faster on real-world graphs compared to the state-of-the-art MapReduce and Spark algorithms.

摘要

图中的连通分量是指通过路径相互连接的节点集。发现连通分量的问题已经应用于各种图分析任务，如图划分、图压缩和模式识别。已经提出了几种分布式算法来在巨大的图中找到连通分量。具有讽刺意味的是，由于不必要的数据 I/O 和处理、大量的中间数据、多轮计算和负载平衡问题，分布式算法的扩展性不够。在本文中，我们提出了一种快速且可扩展的基于三个关键技术的分布式算法 PACC（分区感知连通分量）用于连通分量计算：分区和计算的两步处理、边过滤和速写。PACC 大大缩小了中间数据、输入图的大小和轮数，而不会出现负载平衡问题。与最先进的 MapReduce 和 Spark 算法相比，PACC 在真实图上的性能提高了 2.9 到 10.7 倍。