Lebanon Guy
Department of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN 47907, USA.
IEEE Trans Pattern Anal Mach Intell. 2006 Apr;28(4):497-508. doi: 10.1109/TPAMI.2006.77.
Many algorithms in machine learning rely on being given a good distance metric over the input space. Rather than using a default metric such as the Euclidean metric, it is desirable to obtain a metric based on the provided data. We consider the problem of learning a Riemannian metric associated with a given differentiable manifold and a set of points. Our approach to the problem involves choosing a metric from a parametric family by maximizing the inverse volume of a given data set of points. From a statistical perspective, this is related to maximum likelihood under a model that assigns probabilities inversely proportional to the Riemannian volume element. We discuss in detail learning a metric on the multinomial simplex, where the metric candidates are pull-back metrics of the Fisher information under a Lie group of transformations. When applied to text document classification, the resulting geodesic distances resemble, but outperform, the tfidf cosine similarity measure.
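On the multinomial simplex, the Fisher information metric admits a closed-form geodesic distance, d(p, q) = 2 arccos(Σᵢ √(pᵢ qᵢ)), which can be compared directly against cosine similarity on term-frequency vectors. The sketch below illustrates this baseline geometry only; it does not implement the paper's learned pull-back metric, whose transformation parameters come from the inverse-volume maximization:

```python
import math

def fisher_geodesic(p, q):
    """Fisher-information geodesic distance on the multinomial simplex:
    d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i))."""
    s = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Clamp against floating-point overshoot before taking arccos.
    return 2.0 * math.acos(min(1.0, s))

def cosine_distance(u, v):
    """1 - cosine similarity, e.g. between tf-idf document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

# Two hypothetical documents as normalized term-frequency vectors
# (points on the 2-simplex):
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(fisher_geodesic(p, q), cosine_distance(p, q))
```

Both distances are zero for identical documents and grow as the word distributions diverge; the paper's learned metric replaces the fixed Fisher geometry above with a data-dependent pull-back metric.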