College of Cyber Security and Computer, Hebei University, Baoding 071000, China.
Key Laboratory of High Trusted Information System in Hebei Province (Hebei University), Baoding 071000, China.
Sensors (Basel). 2022 May 19;22(10):3856. doi: 10.3390/s22103856.
A crucial step in improving data quality is discovering semantic relationships between data. Functional dependencies are rules that describe such semantic relationships in relational databases and have recently been applied to improve data quality. However, traditional functional dependency discovery algorithms may produce errors when applied to distributed data and cannot scale to large-scale data. To address these problems, we propose a novel distributed functional dependency discovery algorithm based on Apache Spark that can efficiently discover functional dependencies in large-scale data. The basic idea is to use data redistribution to discover functional dependencies in parallel on multiple nodes. The algorithm uses a sampling approach to quickly eliminate invalid functional dependencies and a greedy task assignment strategy to balance the load across nodes. In addition, a prefix tree stores intermediate computation results during validation to avoid repeated computation of equivalence classes. Experimental results on real and synthetic datasets show that the proposed algorithm is more efficient than existing methods while maintaining accuracy.
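To illustrate the validation step the abstract refers to, the following is a minimal, hypothetical sketch (not the paper's implementation) of checking a single functional dependency via equivalence classes: an FD X → A holds exactly when every group of tuples sharing the same X-values agrees on A. A distributed version would run such checks per partition after redistributing rows by their X-values.

```python
from collections import defaultdict

def fd_holds(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds over rows.

    rows: list of tuples; lhs, rhs: lists of column indices.
    The FD holds iff every equivalence class of the lhs attributes
    (tuples sharing the same lhs values) agrees on the rhs values.
    """
    seen = {}  # lhs-values -> rhs-values observed for that class
    for row in rows:
        key = tuple(row[i] for i in lhs)
        val = tuple(row[i] for i in rhs)
        if key in seen:
            if seen[key] != val:
                # two tuples agree on lhs but differ on rhs: FD violated
                return False
        else:
            seen[key] = val
    return True

# toy relation: (employee, department, manager)
rows = [
    ("alice", "sales", "dana"),
    ("bob",   "sales", "dana"),
    ("carol", "hr",    "erin"),
]
print(fd_holds(rows, lhs=[1], rhs=[2]))  # department -> manager: True
print(fd_holds(rows, lhs=[2], rhs=[0]))  # manager -> employee: False
```

Caching the `seen` dictionaries (or the equivalence classes they induce) for shared attribute prefixes is what a prefix-tree store, as described in the abstract, would avoid recomputing across candidate FDs.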