Hernández Damián G, Samengo Inés
Department of Medical Physics, Centro Atómico Bariloche and Instituto Balseiro, 8400 San Carlos de Bariloche, Argentina.
Entropy (Basel). 2019 Jun 25;21(6):623. doi: 10.3390/e21060623.
Determining the strength of nonlinear, statistical dependencies between two variables is a crucial matter in many research fields. The established measure for quantifying such relations is the mutual information. However, estimating mutual information from limited samples is a challenging task. Since the mutual information is the difference of two entropies, the existing Bayesian estimators of entropy may be used to estimate information. This procedure, however, is still biased in the severely undersampled regime. Here, we propose an alternative estimator that is applicable to those cases in which the marginal distribution of one of the two variables (the one with minimal entropy) is well sampled. The other variable, as well as the joint and conditional distributions, can be severely undersampled. We obtain a consistent estimator that presents very low bias, outperforming previous methods even when the sampled data contain few coincidences. As with other Bayesian estimators, our proposal focuses on the strength of the interaction between the two variables, without seeking to model the specific way in which they are related. A distinctive property of our method is that the main data statistic determining the amount of mutual information is the inhomogeneity of the conditional distribution of the low-entropy variable in those states in which the large-entropy variable registers coincidences.
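To illustrate the decomposition the abstract relies on, here is a minimal sketch (not the paper's Bayesian estimator) of the naive plug-in estimate, which writes the mutual information as a difference of entropies, I(X;Y) = H(X) + H(Y) − H(X,Y), computed from empirical frequencies. The function name and interface are illustrative; this maximum-likelihood estimator is precisely the kind that is strongly biased in the undersampled regime the paper addresses.

```python
from collections import Counter
from math import log

def plugin_mutual_information(pairs):
    """Naive plug-in estimate of I(X;Y) in bits from paired samples.

    Illustrative only: the empirical (maximum-likelihood) entropies used
    here are known to be severely biased when the joint distribution is
    undersampled, which is the problem the paper's estimator targets.
    """
    n = len(pairs)

    def entropy(counts):
        # Empirical Shannon entropy in bits from a Counter of observations.
        return -sum((c / n) * log(c / n, 2) for c in counts.values())

    h_x = entropy(Counter(x for x, _ in pairs))
    h_y = entropy(Counter(y for _, y in pairs))
    h_xy = entropy(Counter(pairs))
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return h_x + h_y - h_xy

# A deterministic relation X = Y over two equiprobable states carries 1 bit.
print(plugin_mutual_information([(0, 0), (1, 1)] * 50))  # → 1.0
```

With truly independent, well-sampled pairs the same function returns 0 bits; the bias discussed in the abstract appears when the number of samples is small relative to the number of joint states.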