Lebanon Guy
Department of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN 47907, USA.
IEEE Trans Pattern Anal Mach Intell. 2006 Apr;28(4):497-508. doi: 10.1109/TPAMI.2006.77.
Many algorithms in machine learning rely on being given a good distance metric over the input space. Rather than using a default metric such as the Euclidean metric, it is desirable to obtain a metric based on the provided data. We consider the problem of learning a Riemannian metric associated with a given differentiable manifold and a set of points. Our approach to the problem involves choosing a metric from a parametric family by maximizing the inverse volume of a given data set of points. From a statistical perspective, this is related to maximum likelihood under a model that assigns probabilities inversely proportional to the Riemannian volume element. We discuss in detail learning a metric on the multinomial simplex, where the metric candidates are pull-back metrics of the Fisher information under a Lie group of transformations. When applied to text document classification, the resulting geodesic distances resemble, but outperform, the tfidf cosine similarity measure.
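On the multinomial simplex, the Fisher information metric admits a closed-form geodesic distance, d(p, q) = 2 arccos(Σᵢ √(pᵢ qᵢ)), which can be compared directly against cosine similarity on term-frequency vectors. The sketch below illustrates this baseline geometry only; it does not implement the paper's learned pull-back metric, whose transformation parameters come from the inverse-volume maximization:

```python
import math

def fisher_geodesic(p, q):
    """Fisher-information geodesic distance on the multinomial simplex:
    d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i))."""
    s = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Clamp against floating-point overshoot before taking arccos.
    return 2.0 * math.acos(min(1.0, s))

def cosine_distance(u, v):
    """1 - cosine similarity, e.g. between tf-idf document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

# Two hypothetical documents as normalized term-frequency vectors
# (points on the 2-simplex):
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(fisher_geodesic(p, q), cosine_distance(p, q))
```

Both distances are zero for identical documents and grow as the word distributions diverge; the paper's learned metric replaces the fixed Fisher geometry above with a data-dependent pull-back metric.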