Metric learning for text documents.

Author information

Lebanon Guy

Affiliation

Department of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN 47907, USA.

Publication information

IEEE Trans Pattern Anal Mach Intell. 2006 Apr;28(4):497-508. doi: 10.1109/TPAMI.2006.77.

Abstract

Many algorithms in machine learning rely on being given a good distance metric over the input space. Rather than using a default metric such as the Euclidean metric, it is desirable to obtain a metric based on the provided data. We consider the problem of learning a Riemannian metric associated with a given differentiable manifold and a set of points. Our approach to the problem involves choosing a metric from a parametric family by maximizing the inverse volume of a given data set of points. From a statistical perspective, it is related to maximum likelihood under a model that assigns probabilities inversely proportional to the Riemannian volume element. We discuss in detail learning a metric on the multinomial simplex, where the metric candidates are pull-back metrics of the Fisher information under a Lie group of transformations. When applied to text document classification, the resulting geodesic distances resemble, but outperform, the tf-idf cosine similarity measure.
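
To make the geometry in the abstract concrete, here is a minimal Python sketch of a Fisher geodesic distance on the multinomial simplex and of a pulled-back version of it. The closed form 2·arccos(Σ√(pᵢqᵢ)) is the standard Fisher geodesic distance on the simplex. The `reweight` map (scale each coordinate by a positive weight, then renormalize) is an assumption chosen for illustration; the abstract does not specify the transformation group, and the weights `lam` here are hand-picked rather than learned by the volume-based criterion. All function names are hypothetical.

```python
import math


def fisher_geodesic(p, q):
    """Geodesic distance between two points of the simplex under the
    Fisher information metric: 2 * arccos(sum_i sqrt(p_i * q_i))."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))


def reweight(p, lam):
    """Assumed transformation (illustrative only): multiply each
    coordinate by a positive weight and renormalize onto the simplex."""
    z = sum(a * w for a, w in zip(p, lam))
    return [a * w / z for a, w in zip(p, lam)]


def pullback_distance(p, q, lam):
    """Distance induced by pulling the Fisher metric back through
    `reweight`: transform both points, then measure as usual."""
    return fisher_geodesic(reweight(p, lam), reweight(q, lam))


# Two toy "documents" as normalized term-frequency vectors.
doc_a = [0.5, 0.3, 0.2]
doc_b = [0.2, 0.3, 0.5]
print(fisher_geodesic(doc_a, doc_b))
print(pullback_distance(doc_a, doc_b, [1.0, 0.1, 1.0]))
```

With all weights equal, `pullback_distance` reduces to the plain Fisher geodesic distance. Unequal weights change how much each term contributes, which is loosely analogous to the term weighting in tf-idf.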
