Hernández Damián G, Roman Ahmed, Nemenman Ilya
Department of Physics, Emory University, Atlanta, Georgia, USA.
Department of Medical Physics, Centro Atómico Bariloche and Instituto Balseiro, 8400 San Carlos de Bariloche, Argentina.
Phys Rev E. 2023 Jul;108(1-1):014101. doi: 10.1103/PhysRevE.108.014101.
A fundamental problem in the analysis of complex systems is getting a reliable estimate of the entropy of their probability distributions over the state space. This is difficult because unsampled states can contribute substantially to the entropy, while they do not contribute to the maximum likelihood estimator of entropy, which replaces probabilities by the observed frequencies. Bayesian estimators overcome this obstacle by introducing a model of the low-probability tail of the probability distribution. Which statistical features of the observed data determine the model of the tail, and hence the output of such estimators, remains unclear. Here we show that well-known entropy estimators for probability distributions on discrete state spaces model the structure of the low-probability tail based largely on a few statistics of the data: the sample size, the maximum likelihood estimate, the number of coincidences among the samples, and the dispersion of the coincidences. We derive approximate analytical entropy estimators for undersampled distributions based on these statistics, and we use the results to propose an intuitive understanding of how the Bayesian entropy estimators work.
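To make the quantities named in the abstract concrete, here is a minimal Python sketch that computes them for a sample from an undersampled discrete distribution: the sample size, the maximum likelihood (plug-in) entropy estimate, the number of coincidences among the samples, and a simple measure of how those coincidences are spread across states. The function name, the Dirichlet toy distribution, and the specific dispersion measure are illustrative assumptions, not definitions taken from the paper.

```python
import numpy as np
from collections import Counter

def summary_statistics(samples):
    """Illustrative summary statistics of a discrete sample.

    `samples` is a sequence of discrete state labels. The exact
    definition of the dispersion below is an illustrative choice,
    not necessarily the one used in the paper.
    """
    counts = np.array(list(Counter(samples).values()), dtype=float)
    n = counts.sum()                        # sample size N
    freqs = counts / n
    # Maximum likelihood ("plug-in") entropy: probabilities replaced
    # by observed frequencies; unsampled states contribute nothing.
    h_ml = -np.sum(freqs * np.log(freqs))
    # Coincidences: samples that repeat an already observed state,
    # i.e. Delta = N - (number of distinct states seen).
    coincidences = n - len(counts)
    # One simple measure of how the coincidences are distributed
    # across states: the variance of the per-state repeat counts.
    repeats = counts[counts > 1] - 1
    dispersion = repeats.var() if repeats.size > 0 else 0.0
    return {"N": int(n),
            "H_ML_nats": h_ml,
            "coincidences": int(coincidences),
            "coincidence_dispersion": dispersion}

# Example: an undersampled distribution over many states.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.full(1000, 0.02))      # sparse "true" distribution
samples = rng.choice(1000, size=100, p=p)   # N = 100 << 1000 states
print(summary_statistics(samples))
```

In such an undersampled regime the plug-in entropy is typically biased low, which is why Bayesian estimators that exploit the coincidence statistics above are needed.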