Identifying missing dictionary entries with frequency-conserving context models.

Author information

Williams Jake Ryland, Clark Eric M, Bagrow James P, Danforth Christopher M, Dodds Peter Sheridan

Affiliations

Department of Mathematics & Statistics, Vermont Complex Systems Center, Computational Story Lab, and The Vermont Advanced Computing Core, The University of Vermont, Burlington, Vermont 05401, USA.

Publication information

Phys Rev E Stat Nonlin Soft Matter Phys. 2015 Oct;92(4):042808. doi: 10.1103/PhysRevE.92.042808. Epub 2015 Oct 12.

Abstract

In an effort to better understand meaning from natural language texts, we explore methods aimed at organizing lexical objects into contexts. A number of these methods fall into a family defined by word ordering. Unlike demographic or spatial partitions of data, these collocation models are of special importance for their universal applicability. While we are interested here in text and have framed our treatment accordingly, our work is potentially applicable to other areas of research (e.g., speech, genomics, and mobility patterns) where one has ordered categorical data (e.g., sounds, genes, and locations). Our approach focuses on the phrase (whether a single word or larger) as the primary meaning-bearing lexical unit and object of study. To do so, we employ our previously developed framework for generating word-conserving phrase-frequency data. Upon training our model with the Wiktionary, an extensive, online, collaborative, and open-source dictionary that contains over 100,000 phrasal definitions, we develop highly effective filters for the identification of meaningful, missing phrase entries. With our predictions we then engage the editorial community of the Wiktionary and propose short lists of potential missing entries for definition, developing a breakthrough lexical extraction technique and expanding our knowledge of the defined English lexicon of phrases.
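The abstract does not spell out how "word-conserving phrase-frequency data" are generated. As a rough illustration only, one simple scheme with this property is to cut a token sequence at random gaps and count the resulting phrases, so that every word lands in exactly one phrase. The sketch below assumes that scheme; the cut probability `q`, the function names, and the `*` wildcard used to form contexts are illustrative choices, not details taken from the abstract.

```python
import random
from collections import Counter


def random_partition(tokens, q=0.5, rng=None):
    """Split tokens into contiguous phrases, cutting after each token
    with probability q (hypothetical parameter). Every token ends up
    in exactly one phrase, so word counts are conserved."""
    rng = rng or random.Random()
    phrases, current = [], []
    for tok in tokens:
        current.append(tok)
        if rng.random() < q:
            phrases.append(tuple(current))
            current = []
    if current:  # flush the trailing phrase if no cut closed it
        phrases.append(tuple(current))
    return phrases


def phrase_counts(tokens, q=0.5, seed=0):
    """Count phrases from one random partition of the token sequence."""
    return Counter(random_partition(tokens, q, random.Random(seed)))


def contexts(phrase):
    """All patterns obtained by replacing one word of the phrase with
    a wildcard (an assumed notion of 'context' for illustration)."""
    return [phrase[:i] + ("*",) + phrase[i + 1:] for i in range(len(phrase))]


tokens = "new york city is in new york state".split()
counts = phrase_counts(tokens)
# Summing phrase length times phrase count recovers the token total.
total_words = sum(len(p) * c for p, c in counts.items())
```

Whatever phrases a given partition produces, `total_words` equals `len(tokens)`, which is the conservation property the phrase "word-conserving" suggests. Grouping phrases by shared wildcard patterns from `contexts` is one way such counts could feed a context model.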

