Sarkar Aakash, Howard Marc W
Department of Psychological and Brain Sciences, Boston University.
Comput Brain Behav. 2021 Jun;4:164-177. doi: 10.1007/s42113-020-00094-8. Epub 2021 Jan 4.
Language, like other natural sequences, exhibits statistical dependencies at a wide range of scales (Lin & Tegmark, 2016). However, many statistical learning models applied to language impose a sampling scale while extracting statistical structure. For instance, Word2Vec constructs vector embeddings by sampling the context in a window around each word; the window size imposes a single, fixed scale, so relationships over much larger temporal scales are invisible to the algorithm. This paper examines the family of Word2Vec embeddings generated while systematically manipulating the size of the context window. The primary result is that different linguistic relationships are preferentially encoded at different scales: different window sizes emphasize different syntactic and semantic relations between words, as assessed both by the analogical reasoning tasks in the Google Analogies test set and by the human similarity-rating datasets WordSim-353 and SimLex-999. Moreover, the nearest neighborhood of a given word in the embedding space changes considerably depending on the scale. These results suggest that sampling at any single scale can identify only a subset of the meaningful relationships a word might have, and they point toward the importance of developing scale-free models of semantic meaning.
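To make the manipulation concrete, the following is a minimal sketch (not the authors' code) of training Word2Vec at several context-window sizes and inspecting how a word's nearest neighbors shift with scale. It assumes gensim >= 4.0; the 'text8' corpus, the window sizes, and the probe word "king" are illustrative choices, not taken from the paper.

    # Sketch: vary the Word2Vec context window and compare nearest neighbors.
    # Assumptions: gensim >= 4.0 installed; 'text8' is downloaded on first use.
    import gensim.downloader as api
    from gensim.models import Word2Vec

    corpus = list(api.load("text8"))      # tokenized sentences from a small Wikipedia dump

    window_sizes = [2, 5, 10, 25]         # the sampling scales being manipulated
    for w in window_sizes:
        model = Word2Vec(corpus, vector_size=100, window=w,
                         min_count=5, sg=1, workers=4, epochs=3)
        neighbors = [word for word, _ in model.wv.most_similar("king", topn=5)]
        print(f"window={w:>2}: {neighbors}")

Under this setup, the printed neighbor lists typically differ across window sizes, illustrating the paper's point that any single sampling scale captures only some of a word's meaningful relationships.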