Scale-Dependent Relationships in Natural Language.

Author Information

Sarkar Aakash, Howard Marc W

Affiliations

Department of Psychological and Brain Sciences, Boston University.

Publication Information

Comput Brain Behav. 2021 Jun;4:164-177. doi: 10.1007/s42113-020-00094-8. Epub 2021 Jan 4.

Abstract

Language, like other natural sequences, exhibits statistical dependencies at a wide range of scales (Lin & Tegmark, 2016). However, many statistical learning models applied to language impose a sampling scale while extracting statistical structure. For instance, Word2Vec creates vector embeddings by sampling context in a window around each word, the size of which defines a strong scale; relationships over much larger temporal scales would be invisible to the algorithm. This paper examines the family of Word2Vec embeddings generated while systematically manipulating the size of the context window. The primary result is that different linguistic relationships are preferentially encoded at different scales. Different scales emphasize different syntactic and semantic relations between words, as assessed both by analogical reasoning tasks in the Google Analogies test set and human similarity rating datasets WordSim-353 and SimLex-999. Moreover, the neighborhoods of a given word in the embeddings change considerably depending on the scale. These results suggest that sampling at any individual scale can only identify a subset of the meaningful relationships a word might have, and point toward the importance of developing scale-free models of semantic meaning.
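To make the manipulation concrete, here is a minimal sketch of training Word2Vec embeddings at several context-window sizes and inspecting how a word's neighborhood shifts with scale. It assumes the gensim library (version 4 or later); the toy corpus, the window sizes, and the probe word are illustrative placeholders, not the authors' actual pipeline.

    # Minimal sketch (assumed setup, not the authors' code): train one
    # skip-gram Word2Vec model per context-window size and compare the
    # nearest neighbors of the same word across scales.
    from gensim.models import Word2Vec

    # Toy placeholder corpus; a real run would use a large tokenized corpus.
    corpus = [
        ["language", "exhibits", "statistical", "dependencies", "at", "many", "scales"],
        ["word2vec", "samples", "context", "in", "a", "window", "around", "each", "word"],
    ]

    models = {}
    for window in (2, 5, 10, 25, 50):  # the sampling scale being manipulated
        models[window] = Word2Vec(
            sentences=corpus,
            vector_size=100,   # embedding dimensionality
            window=window,     # context-window size, in words
            min_count=1,       # keep rare words in this toy corpus
            sg=1,              # skip-gram variant
            seed=1,
            workers=1,
        )

    # With a realistic corpus, the neighborhood of a given word changes
    # considerably between small and large windows.
    for window in (2, 50):
        print(window, models[window].wv.most_similar("word", topn=3))

Each model's keyed vectors could then be scored per scale with gensim's evaluate_word_analogies (Google Analogies) and evaluate_word_pairs (WordSim-353, SimLex-999) helpers, paralleling the evaluations reported in the abstract.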
