Băncioiu Camil, Brad Remus
Department of Computer Science and Electrical Engineering, Lucian Blaga University of Sibiu, 550024 Sibiu, Romania.
Entropy (Basel). 2021 Nov 12;23(11):1501. doi: 10.3390/e23111501.
This article presents a novel and remarkably efficient method of computing the statistical G-test made possible by exploiting a connection with the fundamental elements of information theory: by writing the statistic as a sum of joint entropy terms, its computation is decomposed into easily reusable partial results with no change in the resulting value. This method greatly improves the efficiency of applications that perform a series of G-tests on permutations of the same features, such as feature selection and causal inference applications because this decomposition allows for an intensive reuse of these partial results. The efficiency of this method is demonstrated by implementing it as part of an experiment involving IPC-MB, an efficient Markov blanket discovery algorithm, applicable both as a feature selection algorithm and as a causal inference method. The results show outstanding efficiency gains for IPC-MB when the G-test is computed with the proposed method, compared to the unoptimized G-test, but also when compared to IPC-MB++, a variant of IPC-MB which is enhanced with an AD-tree, both static and dynamic. Even if this proposed method of computing the G-test is presented here in the context of IPC-MB, it is in fact bound neither to IPC-MB in particular, nor to feature selection or causal inference applications in general, because this method targets the information-theoretic concept that underlies the G-test, namely conditional mutual information. This aspect grants it wide applicability in data sciences.
本文提出了一种新颖且极为高效的计算统计G检验的方法,该方法通过利用与信息论基本元素的联系得以实现:将统计量写成联合熵项的和,其计算被分解为易于重复使用的部分结果,而结果值不变。这种方法极大地提高了对相同特征排列进行一系列G检验的应用的效率,例如特征选择和因果推断应用,因为这种分解允许对这些部分结果进行大量重复使用。通过将其作为涉及IPC-MB(一种高效的马尔可夫毯发现算法,既可用作特征选择算法,也可用作因果推断方法)的实验的一部分来实现,证明了该方法的效率。结果表明,与未优化的G检验相比,当使用所提出的方法计算G检验时,IPC-MB具有显著的效率提升,而且与IPC-MB++(IPC-MB的一个变体,通过静态和动态AD树进行增强)相比也是如此。即使这里提出的计算G检验的方法是在IPC-MB的背景下呈现的,但实际上它既不特别局限于IPC-MB,也不一般地局限于特征选择或因果推断应用,因为该方法针对的是G检验背后的信息论概念,即条件互信息。这一特性使其在数据科学中具有广泛的适用性。