Sarkar Aakash, Howard Marc W
Department of Psychological and Brain Sciences, Boston University.
Comput Brain Behav. 2021 Jun;4:164-177. doi: 10.1007/s42113-020-00094-8. Epub 2021 Jan 4.
Language, like other natural sequences, exhibits statistical dependencies at a wide range of scales (Lin & Tegmark, 2016). However, many statistical learning models applied to language impose a sampling scale while extracting statistical structure. For instance, Word2Vec constructs vector embeddings by sampling the context in a window around each word; the window size imposes a single, fixed scale, so relationships over much larger temporal scales are invisible to the algorithm. This paper examines the family of Word2Vec embeddings generated while systematically manipulating the size of the context window. The primary result is that different linguistic relationships are preferentially encoded at different scales: different window sizes emphasize different syntactic and semantic relations between words, as assessed both by the analogical reasoning tasks in the Google Analogies test set and by the human similarity-rating datasets WordSim-353 and SimLex-999. Moreover, the nearest neighborhood of a given word in the embedding space changes considerably depending on the scale. These results suggest that sampling at any single scale can identify only a subset of the meaningful relationships a word might have, and they point toward the importance of developing scale-free models of semantic meaning.
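To make the manipulation concrete, the following is a minimal sketch (not the authors' code) of training Word2Vec at several context-window sizes and inspecting how a word's nearest neighbors shift with scale. It assumes gensim >= 4.0; the 'text8' corpus, the window sizes, and the probe word "king" are illustrative choices, not taken from the paper.

    # Sketch: vary the Word2Vec context window and compare nearest neighbors.
    # Assumptions: gensim >= 4.0 installed; 'text8' is downloaded on first use.
    import gensim.downloader as api
    from gensim.models import Word2Vec

    corpus = list(api.load("text8"))      # tokenized sentences from a small Wikipedia dump

    window_sizes = [2, 5, 10, 25]         # the sampling scales being manipulated
    for w in window_sizes:
        model = Word2Vec(corpus, vector_size=100, window=w,
                         min_count=5, sg=1, workers=4, epochs=3)
        neighbors = [word for word, _ in model.wv.most_similar("king", topn=5)]
        print(f"window={w:>2}: {neighbors}")

Under this setup, the printed neighbor lists typically differ across window sizes, illustrating the paper's point that any single sampling scale captures only some of a word's meaningful relationships.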