A MapReduce-Based Approach for Fast Connected Components Detection from Large-Scale Networks.

Author information

Bhat Sajid Yousuf, Abulaish Muhammad

Affiliations

Department of Computer Science, University of Kashmir, Srinagar, Jammu and Kashmir, India.

Department of Computer Science, South Asian University, New Delhi, India.

Publication information

Big Data. 2024 Jan 29. doi: 10.1089/big.2022.0264.

DOI:10.1089/big.2022.0264

PMID:38285477

Abstract

Owing to the increasing size of real-world networks, processing them with classical techniques has become infeasible. The storage and central processing unit time required to process large networks are far beyond the capabilities of a single high-end computing machine. Moreover, real-world network data are generally distributed in nature because they are collected and stored on distributed platforms. This has popularized the use of MapReduce, a distributed data processing framework, for analyzing real-world network data. Existing MapReduce-based methods for connected components detection mainly struggle to minimize the number of MapReduce rounds and the amount of data generated and forwarded to subsequent rounds. This article presents an efficient MapReduce-based approach for finding connected components that does not forward the complete set of connected components to subsequent rounds; instead, it writes each component to the Hadoop Distributed File System as soon as it is found, which reduces the amount of data forwarded to subsequent rounds. The article also presents an application of the proposed method to contact tracing. The proposed method is evaluated on several network data sets and compared with two state-of-the-art methods. The empirical results show that the proposed method performs significantly better than both and scales to finding connected components in large-scale networks.
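The idea of writing out finished components early, rather than carrying them through every round, can be illustrated with a small single-machine sketch. This is not the authors' algorithm, which the abstract does not specify in detail. It is a plain-Python simulation that assumes a simple min-label propagation scheme, uses an in-memory list as a stand-in for the Hadoop Distributed File System, and uses a hypothetical function name, `connected_components_rounds`.

```python
from collections import defaultdict


def connected_components_rounds(edges):
    """Simulate round-based min-label propagation with early output.

    Illustrative assumption only: each loop iteration stands in for one
    MapReduce round, and `finished` stands in for HDFS output.
    Returns (list of components, number of rounds used).
    """
    # Build an undirected adjacency list from the edge list.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    label = {n: n for n in adj}  # labels of nodes still "in flight"
    finished = []                # components already written out
    rounds = 0

    while label:
        rounds += 1

        # Step 1: every active node adopts the smallest label among
        # itself and its neighbours (computed from the previous round).
        label = {
            n: min([label[n]] + [label[m] for m in adj[n]])
            for n in label
        }

        # Step 2: group nodes by label. A group whose members have no
        # neighbour carrying a different label is a complete component.
        groups = defaultdict(list)
        for n, l in label.items():
            groups[l].append(n)

        done = set()
        for l, members in groups.items():
            if all(label[m] == l for n in members for m in adj[n]):
                finished.append(sorted(members))  # "write to HDFS"
                done.add(l)

        # Completed components are dropped, so later rounds carry
        # only the nodes that are still unresolved.
        label = {n: l for n, l in label.items() if l not in done}

    return finished, rounds
```

In this sketch, a short component such as a single edge is emitted after the first round, while a long path keeps iterating alone; the data carried into each subsequent round shrinks as components complete.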
