
FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data.

Authors

Buranosky Matt, Stellnberger Elmar, Pfaff Emily, Diaz-Sanchez David, Ward-Caviness Cavin

Affiliations

National Health and Environmental Effects Research Laboratory, United States Environmental Protection Agency, Chapel Hill, NC, USA.

University of Klagenfurt, Klagenfurt, Austria.

Publication information

F1000Res. 2018 Oct 19;7:1667. doi: 10.12688/f1000research.16483.2. eCollection 2018.

DOI: 10.12688/f1000research.16483.2
PMID: 31069050
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6489977/

Abstract

Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys from them. The runtime and memory costs associated with seven published FD discovery algorithms are given with an overview of their theoretical foundations. Previous research establishes that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (> 100,000 rows) and few columns (< 14 columns). This puts it in a special position to rule mine clinical and demographic datasets, which often consist of long and narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method used. FDTool is a re-implementation of FD_Mine with additional features added to improve performance and automate typical processes in database architecture. The experimental results of applying FDTool to 13 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time it takes for the code to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than does row count. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.
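
To make the two central notions in the abstract concrete, the following is a minimal brute-force sketch of checking whether a functional dependency holds in a small table and of enumerating minimal candidate keys. It is not FD_Mine and not FDTool's code; the function names `fd_holds` and `candidate_keys` and the sample rows are hypothetical, chosen only for illustration, and the exhaustive search omits the pruning that the paper describes.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """Return True if every value combination of lhs maps to one value of rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores val on first sight and returns the stored value after
        if seen.setdefault(key, val) != val:
            return False
    return True


def candidate_keys(rows, attributes):
    """Exhaustively find minimal attribute sets that determine all attributes."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if fd_holds(rows, combo, attributes):
                keys.append(combo)
    return keys


# Hypothetical sample data, not taken from the paper
rows = [
    {"id": 1, "zip": "27514", "city": "Chapel Hill"},
    {"id": 2, "zip": "27514", "city": "Chapel Hill"},
    {"id": 3, "zip": "27701", "city": "Durham"},
]

print(fd_holds(rows, ["zip"], ["city"]))   # zip -> city
print(fd_holds(rows, ["city"], ["id"]))    # city -> id
print(candidate_keys(rows, ["id", "zip", "city"]))
```

On this sample, `zip` determines `city`, `city` does not determine `id`, and `id` is the only candidate key. Because the search visits every attribute subset in the worst case, its cost grows exponentially with the number of columns, which is consistent with the abstract's observation that attribute count affects runtime far more than row count.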


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/02e1/6584966/47523c919b41/f1000research-7-21548-g0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/02e1/6584966/1e9660580c6c/f1000research-7-21548-g0000.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/02e1/6584966/cf49d0e1a421/f1000research-7-21548-g0001.jpg

Similar articles

1
FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data.
F1000Res. 2018 Oct 19;7:1667. doi: 10.12688/f1000research.16483.2. eCollection 2018.
2
An Efficient and Scalable Algorithm to Mine Functional Dependencies from Distributed Big Data.
Sensors (Basel). 2022 May 19;22(10):3856. doi: 10.3390/s22103856.
3
PARM--an efficient algorithm to mine association rules from spatial data.
IEEE Trans Syst Man Cybern B Cybern. 2008 Dec;38(6):1513-24. doi: 10.1109/TSMCB.2008.927730.
4
Efficient mining differential co-expression biclusters in microarray datasets.
Gene. 2013 Apr 10;518(1):59-69. doi: 10.1016/j.gene.2012.11.085. Epub 2012 Dec 28.
5
Mining approximate temporal functional dependencies with pure temporal grouping in clinical databases.
Comput Biol Med. 2015 Jul;62:306-24. doi: 10.1016/j.compbiomed.2014.08.004. Epub 2014 Aug 21.
6
Attribute clustering for grouping, selection, and classification of gene expression data.
IEEE/ACM Trans Comput Biol Bioinform. 2005 Apr-Jun;2(2):83-101. doi: 10.1109/TCBB.2005.17.
7
Mining of high utility-probability sequential patterns from uncertain databases.
PLoS One. 2017 Jul 25;12(7):e0180931. doi: 10.1371/journal.pone.0180931. eCollection 2017.
8
Fast Utility Mining on Sequence Data.
IEEE Trans Cybern. 2021 Feb;51(2):487-500. doi: 10.1109/TCYB.2020.2970176. Epub 2021 Jan 15.
9
Minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers.
Bioinformatics. 2013 Feb 1;29(3):407-8. doi: 10.1093/bioinformatics/bts707. Epub 2012 Dec 14.
10
Distributed sequence alignment applications for the public computing architecture.
IEEE Trans Nanobioscience. 2008 Mar;7(1):35-43. doi: 10.1109/TNB.2008.2000148.

Cited by

1
Associations Between Long-Term Fine Particulate Matter Exposure and Mortality in Heart Failure Patients.
J Am Heart Assoc. 2020 Mar 17;9(6):e012517. doi: 10.1161/JAHA.119.012517. Epub 2020 Mar 16.