
An Efficient and Scalable Algorithm to Mine Functional Dependencies from Distributed Big Data.

Affiliations

College of Cyber Security and Computer, Hebei University, Baoding 071000, China.

Key Laboratory of High Trusted Information System in Hebei Province (Hebei University), Baoding 071000, China.

Publication information

Sensors (Basel). 2022 May 19;22(10):3856. doi: 10.3390/s22103856.

DOI: 10.3390/s22103856
PMID: 35632261
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9142976/
Abstract

A crucial step in improving data quality is discovering semantic relationships between data. Functional dependencies are rules that describe semantic relationships between data in relational databases, and they have recently been applied to improve data quality. However, traditional functional dependency discovery algorithms can produce incorrect results when applied to distributed data and do not scale to large datasets. To solve these problems, we propose a novel distributed functional dependency discovery algorithm based on Apache Spark that can effectively discover functional dependencies in large-scale data. The basic idea is to redistribute the data so that functional dependencies can be discovered in parallel on multiple nodes. The algorithm uses sampling to quickly remove invalid functional dependencies and a greedy task assignment strategy to balance the load. In addition, a prefix tree stores intermediate results during validation to avoid recomputing equivalence classes. Experimental results on real and synthetic datasets show that the proposed algorithm is more efficient than existing methods while preserving accuracy.
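
The ideas named in the abstract can be illustrated with a small single-machine sketch. It is not the paper's Spark implementation: the function names (`fd_holds`, `prune_with_sample`, `greedy_assign`), the row-as-dict representation, and the per-task cost values are all assumptions made for illustration, and the prefix-tree cache is omitted. The sketch shows what it means for a functional dependency to hold, why a sample can safely discard invalid candidates (a violation found in a sample is also a violation in the full data), and one common greedy load-balancing heuristic (largest task first, assigned to the least loaded worker).

```python
from collections import defaultdict
import heapq
import random


def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs holds: rows that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = row[rhs]
        # setdefault stores the first rhs value seen for this lhs group.
        if seen.setdefault(key, val) != val:
            return False
    return True


def prune_with_sample(rows, candidates, sample_size, seed=0):
    """Drop candidates violated on a random sample.

    Survivors are only 'not yet refuted' and still need a full validation.
    """
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    return [(lhs, rhs) for lhs, rhs in candidates if fd_holds(sample, lhs, rhs)]


def greedy_assign(task_costs, n_workers):
    """Assign tasks, largest estimated cost first, to the least loaded worker."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = defaultdict(list)
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    return dict(assignment)


if __name__ == "__main__":
    rows = [
        {"zip": "071000", "city": "Baoding", "name": "A"},
        {"zip": "071000", "city": "Baoding", "name": "B"},
        {"zip": "100000", "city": "Beijing", "name": "A"},
    ]
    candidates = [(("zip",), "city"), (("name",), "city")]
    print(prune_with_sample(rows, candidates, sample_size=10))
    print(greedy_assign({"t1": 5, "t2": 3, "t3": 3, "t4": 1}, n_workers=2))
```

In a distributed setting, a check like `fd_holds` is only correct if all rows sharing the same left-hand-side values end up on the same node, which is the role data redistribution plays in the approach the abstract describes.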
