Provencher Langlois Gabriel, Buch Jatan, Darbon Jérôme
Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA.
Department of Earth and Environmental Engineering, Columbia University, New York, NY 10027, USA.
Entropy (Basel). 2024 Aug 15;26(8):691. doi: 10.3390/e26080691.
Maximum entropy (MaxEnt) models are a class of statistical models that use the maximum entropy principle to estimate probability distributions from data. Due to the size of modern data sets, MaxEnt models need efficient optimization algorithms to scale well for big data applications. State-of-the-art algorithms for MaxEnt models, however, were not originally designed to handle big data sets; these algorithms either rely on technical devices that may yield unreliable numerical results, scale poorly, or require smoothness assumptions that many practical MaxEnt models lack. In this paper, we present novel optimization algorithms that overcome the shortcomings of state-of-the-art algorithms for training large-scale, non-smooth MaxEnt models. Our proposed first-order algorithms leverage the Kullback-Leibler divergence to train large-scale and non-smooth MaxEnt models efficiently. For a MaxEnt model with a discrete probability distribution over n elements built from samples, each containing m features, the stepsize parameter estimation and iterations in our algorithms scale on the order of O(mn) operations and can be trivially parallelized. Moreover, the strong convexity of the Kullback-Leibler divergence with respect to the ℓ1 norm allows for larger stepsize parameters, thereby speeding up the convergence rate of our algorithms. To illustrate the efficiency of our novel algorithms, we consider the problem of estimating probabilities of fire occurrences as a function of ecological features in the Western US MTBS-Interagency wildfire data set. Our numerical results show that our algorithms outperform the state of the art by one order of magnitude and yield results that agree with physical models of wildfire occurrence and with previous statistical analyses of wildfire drivers.
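To make the computational claims above concrete, the following Python sketch may help. It is not the authors' KL-divergence-based algorithm: it applies the classical proximal gradient method (ISTA) to an ℓ1-regularized MaxEnt dual, which exposes the same O(mn) per-iteration structure (one product F w and one product F^T p per step) and the same non-smooth ℓ1 term, handled here by soft-thresholding. All names (solve_l1_maxent, F, p_emp, t, tau) are illustrative assumptions, not identifiers from the paper.

import numpy as np

def solve_l1_maxent(F, p_emp, t=0.01, tau=None, iters=500):
    # Hypothetical sketch, NOT the paper's algorithm: ISTA on the
    # l1-regularized MaxEnt dual
    #     min_w  log sum_j exp((F w)_j) - <F^T p_emp, w> + t * ||w||_1,
    # where F is an (n, m) matrix of m feature values on n discrete
    # elements and p_emp is an empirical distribution over the elements.
    n, m = F.shape
    f_bar = F.T @ p_emp                 # empirical feature means, O(mn) once
    if tau is None:
        # conservative stepsize from the l2 Lipschitz bound on the
        # gradient of the smooth term: 1 / ||F||_2^2
        tau = 1.0 / (np.linalg.norm(F, 2) ** 2)
    w = np.zeros(m)
    for _ in range(iters):
        z = F @ w                       # O(mn)
        z -= z.max()                    # numerical stabilization of softmax
        p = np.exp(z)
        p /= p.sum()                    # current MaxEnt (Gibbs) distribution
        grad = F.T @ p - f_bar          # O(mn) gradient of the smooth term
        w -= tau * grad                 # forward (gradient) step
        w = np.sign(w) * np.maximum(np.abs(w) - tau * t, 0.0)  # l1 prox
    z = F @ w
    z -= z.max()
    p = np.exp(z)
    p /= p.sum()                        # distribution induced by final w
    return w, p

# Tiny usage example on synthetic data:
rng = np.random.default_rng(0)
F = rng.standard_normal((1000, 20))     # n = 1000 elements, m = 20 features
p_emp = rng.random(1000)
p_emp /= p_emp.sum()
w, p = solve_l1_maxent(F, p_emp, t=0.05)

The conservative stepsize 1/||F||_2^2 above comes from measuring distances in the ℓ2 norm. As the abstract notes, the paper's algorithms instead exploit the Kullback-Leibler divergence, which by Pinsker's inequality is strongly convex with respect to the ℓ1 norm, and this is what admits the larger stepsize parameters.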