Buranosky Matt, Stellnberger Elmar, Pfaff Emily, Diaz-Sanchez David, Ward-Caviness Cavin
National Health and Environmental Effects Research Laboratory, United States Environmental Protection Agency, Chapel Hill, NC, USA.
University of Klagenfurt, Klagenfurt, Austria.
F1000Res. 2018 Oct 19;7:1667. doi: 10.12688/f1000research.16483.2. eCollection 2018.
Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command-line Python application that discovers minimal FDs in tabular datasets and infers equivalent attribute sets and candidate keys from them. We report the runtime and memory costs of seven published FD discovery algorithms, together with an overview of their theoretical foundations. Previous research establishes that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (> 100,000 rows) and few columns (< 14 columns). This makes it well suited to rule mining in clinical and demographic datasets, which often consist of long, narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method it uses. FDTool is a re-implementation of FD_Mine with additional features that improve performance and automate typical processes in database architecture. The experimental results of applying FDTool to 13 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time it takes for the code to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than does the row count. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.
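To make the notions of a minimal FD and a candidate key concrete, the following is a minimal brute-force sketch in Python. It is not FD_Mine and not the FDTool code or interface: the function names (`holds`, `minimal_fds`, `candidate_keys`) and the toy table are hypothetical, the search applies no equivalence pruning, FDs with an empty left-hand side are ignored, and candidate keys are found by checking uniqueness of projections (assuming the table has no duplicate rows) rather than inferred from the FDs.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the attributes in lhs functionally determine rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # An FD is violated when the same lhs value maps to two rhs values.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def minimal_fds(rows, attrs):
    """Level-wise search for FDs whose left-hand side is minimal."""
    found = []
    for size in range(1, len(attrs)):
        for lhs in combinations(attrs, size):
            for rhs in attrs:
                if rhs in lhs:
                    continue  # trivial dependency
                # Skip lhs if a subset of it already determines rhs.
                if any(r == rhs and set(l) <= set(lhs) for l, r in found):
                    continue
                if holds(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found


def candidate_keys(rows, attrs):
    """Minimal attribute sets whose projection is unique for every row."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # a proper subset is already a key
            projection = {tuple(r[a] for a in combo) for r in rows}
            if len(projection) == len(rows):
                keys.append(combo)
    return keys


# Hypothetical toy table, used only for illustration.
attrs = ["id", "zip", "city"]
rows = [
    {"id": 1, "zip": "27514", "city": "Chapel Hill"},
    {"id": 2, "zip": "27514", "city": "Chapel Hill"},
    {"id": 3, "zip": "27701", "city": "Durham"},
]

fds = minimal_fds(rows, attrs)
keys = candidate_keys(rows, attrs)
```

On this toy table, `zip` and `city` determine each other, which illustrates what is meant by equivalent attribute sets, and `id` alone is the only candidate key. The sketch enumerates every attribute combination, so the number of checks grows exponentially with the number of columns, while each check is a single pass over the rows.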