Jedynak Bruno M, Khudanpur Sanjeev
Département de Mathématiques, Université des Sciences et Technologies de Lille, France.
Neural Comput. 2005 Jul;17(7):1508-30. doi: 10.1162/0899766053723078.
We propose a new method for estimating the probability mass function (pmf) of a discrete, finite random variable from a small sample. We focus on the observed counts, the number of times each value appears in the sample, and define the maximum likelihood set (MLS) as the set of pmfs that put more mass on the observed counts than on any other count vector possible for the same sample size. We characterize the MLS in detail in this article. We show that the MLS is a diamond-shaped subset of the probability simplex [0,1]^k bounded by at most k × (k-1) hyperplanes, where k is the number of possible values of the random variable. The MLS always contains the empirical distribution, as well as a family of Bayesian estimators based on Dirichlet priors, including the well-known Laplace (add-one) estimator. We propose to select from the MLS the pmf that is closest to a fixed pmf encoding prior knowledge. When Kullback-Leibler divergence is used for this selection, the optimization problem reduces to minimizing a convex function over a domain defined by linear inequalities, for which standard numerical procedures are available. We apply this estimator to language modeling, using Zipf's law to encode prior knowledge, and show that it achieves state-of-the-art results while being conceptually simpler than most competing methods.
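One way to read the hyperplane bound above: a pmf p lies in the MLS exactly when the multinomial likelihood of the observed counts under p is not exceeded by that of any count vector obtained by moving one observation from bin i to bin j, which works out to the pairwise linear constraints counts[i]*p[j] <= (counts[j]+1)*p[i]. The following minimal sketch (not the paper's code; the constraint form is our reading of the abstract's k × (k-1) hyperplane characterization) tests MLS membership and checks the claims that the empirical and Laplace estimators always belong to it:

```python
from itertools import permutations

def in_mls(p, counts, eps=1e-12):
    """Test whether pmf p lies in the maximum likelihood set (MLS)
    for the observed counts.

    Comparing the multinomial likelihood of the observed counts with
    that of any count vector differing by moving one observation from
    bin i to bin j yields the pairwise linear constraints
        counts[i] * p[j] <= (counts[j] + 1) * p[i],
    at most k*(k-1) hyperplanes for k bins.
    """
    return all(
        counts[i] * p[j] <= (counts[j] + 1) * p[i] + eps
        for i, j in permutations(range(len(counts)), 2)
    )

counts = [3, 1, 0]          # a sample of size n = 4 over k = 3 values
n, k = sum(counts), len(counts)

empirical = [c / n for c in counts]               # relative frequencies
laplace   = [(c + 1) / (n + k) for c in counts]   # add-one smoothing

print(in_mls(empirical, counts))        # True: empirical pmf is in the MLS
print(in_mls(laplace, counts))          # True: so is the Laplace estimator
print(in_mls([0.1, 0.1, 0.8], counts))  # False: too little mass on frequent values
```

Substituting the empirical pmf p_i = c_i/n into the constraint gives c_j <= c_j + 1, and the Laplace estimator gives c_i <= c_i + 1, so both hold for any sample, consistent with the abstract's membership claims.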