DHPV：一种用于大规模图分区的分布式算法。

DHPV: a distributed algorithm for large-scale graph partitioning.

作者信息

Adoni Wilfried Yves Hamilton, Nahhal Tarik, Krichen Moez, El Byed Abdeltif, Assayad Ismail

机构信息

LIMSAD Laboratory, Faculty of sciences, Hassan II University of Casablanca, Casablanca, Morocco.

Faculty of CSIT, Albaha University, Al Bahah, Saudi Arabia.

出版信息

J Big Data. 2020;7(1):76. doi: 10.1186/s40537-020-00357-y. Epub 2020 Sep 16.

DOI:10.1186/s40537-020-00357-y

PMID:32953386

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC7493068/

Abstract

Big graphs are part of the movement of "Not Only SQL" databases (also called NoSQL) focusing on the relationships between data, rather than the values themselves. The data is stored in vertices while the edges model the interactions or relationships between these data. They offer flexibility in handling data that is strongly connected to each other. The analysis of a big graph generally involves exploring all of its vertices. Thus, this operation is costly in time and resources because big graphs are generally composed of millions of vertices connected through billions of edges. Consequently, the graph algorithms are expansive compared to the size of the big graph, and are therefore ineffective for data exploration. Thus, partitioning the graph stands out as an efficient and less expensive alternative for exploring a big graph. This technique consists in partitioning the graph into a set of k sub-graphs in order to reduce the complexity of the queries. Nevertheless, it presents many challenges because it is an NP-complete problem. In this article, we present DPHV (Distributed Placement of Hub-Vertices) an efficient parallel and distributed heuristic for large-scale graph partitioning. An application on a real-world graphs demonstrates the feasibility and reliability of our method. The experiments carried on a 10-nodes Spark cluster proved that the proposed methodology achieves significant gain in term of time and outperforms JA-BE-JA, Greedy, DFEP.

摘要

大图是“非关系型数据库”（也称为NoSQL）发展趋势的一部分，这类数据库关注数据之间的关系，而非数据值本身。数据存储在顶点中，而边则对这些数据之间的交互或关系进行建模。它们在处理相互紧密关联的数据时具有灵活性。对大图的分析通常涉及探索其所有顶点。因此，此操作在时间和资源方面成本高昂，因为大图通常由数百万个通过数十亿条边连接的顶点组成。因此，与大图的规模相比，图算法的开销很大，因此对于数据探索效率不高。因此，对图进行分区成为探索大图的一种高效且成本较低的替代方法。该技术在于将图划分为一组k个子图，以降低查询的复杂度。然而，这带来了许多挑战，因为它是一个NP完全问题。在本文中，我们提出了DPHV（中心顶点的分布式放置），这是一种用于大规模图分区的高效并行分布式启发式算法。在实际图上的应用证明了我们方法的可行性和可靠性。在一个10节点的Spark集群上进行的实验证明，所提出的方法在时间方面取得了显著提升，并且优于JA - BE - JA、贪心算法、DFEP。