

Context Matters: Recovering Human Semantic Structure from Machine Learning Analysis of Large-Scale Text Corpora.

Affiliations

Princeton Neuroscience Institute & Department of Psychology, Princeton University.

Department of Psychology, Yale University.

Publication Information

Cogn Sci. 2022 Feb;46(2):e13085. doi: 10.1111/cogs.13085.

DOI: 10.1111/cogs.13085
PMID: 35146779
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9285590/
Abstract

Applying machine learning algorithms to automatically infer relationships between concepts from large-scale collections of documents presents a unique opportunity to investigate at scale how human semantic knowledge is organized, how people use it to make fundamental judgments ("How similar are cats and bears?"), and how these judgments depend on the features that describe concepts (e.g., size, furriness). However, efforts to date have exhibited a substantial discrepancy between algorithm predictions and human empirical judgments. Here, we introduce a novel approach to generating embeddings for this purpose motivated by the idea that semantic context plays a critical role in human judgment. We leverage this idea by constraining the topic or domain from which documents used for generating embeddings are drawn (e.g., referring to the natural world vs. transportation apparatus). Specifically, we trained state-of-the-art machine learning algorithms using contextually-constrained text corpora (domain-specific subsets of Wikipedia articles, 50+ million words each) and showed that this procedure greatly improved predictions of empirical similarity judgments and feature ratings of contextually relevant concepts. Furthermore, we describe a novel, computationally tractable method for improving predictions of contextually-unconstrained embedding models based on dimensionality reduction of their internal representation to a small number of contextually relevant semantic features. By improving the correspondence between predictions derived automatically by machine learning methods using vast amounts of data and more limited, but direct empirical measurements of human judgments, our approach may help leverage the availability of online corpora to better understand the structure of human semantic representations and how people make judgments based on those.
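The abstract describes two operations that can be made concrete: scoring concept relatedness with embedding similarity, and reducing an embedding to a small number of contextually relevant feature dimensions before comparing. The sketch below illustrates both with hand-picked toy vectors and feature directions; the vectors, the `features` basis, and all values are hypothetical stand-ins, not the paper's trained models or data.

```python
import numpy as np

# Toy word vectors standing in for trained embeddings (hypothetical values;
# the paper trains models on domain-specific Wikipedia subsets of 50M+ words).
embeddings = {
    "cat":  np.array([0.9, 0.8, 0.1, 0.3]),
    "bear": np.array([0.8, 0.2, 0.9, 0.4]),
    "bus":  np.array([0.1, 0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity, a standard proxy for semantic relatedness."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A small basis of contextually relevant feature directions (the paper's
# examples are features like size and furriness); chosen by hand here.
features = np.array([
    [1.0, 0.0, 0.0, 0.0],   # stands in for a "furriness" direction
    [0.0, 1.0, 1.0, 0.0],   # stands in for a "size" direction
])

def project(v, basis):
    """Reduce an embedding to its coordinates along the feature directions."""
    return basis @ v

# Similarity in the full space vs. in the reduced, context-relevant space.
raw = cosine(embeddings["cat"], embeddings["bear"])
reduced = cosine(project(embeddings["cat"], features),
                 project(embeddings["bear"], features))
print(f"raw similarity:     {raw:.3f}")
print(f"reduced similarity: {reduced:.3f}")
```

In the paper, the reduced similarities (not these toy numbers) are what get compared against empirical human judgments; the claim is that restricting comparison to context-relevant dimensions improves that correspondence.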


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e86/9285590/91f89d3d6faa/COGS-46-0-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e86/9285590/dc5099d40d13/COGS-46-0-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e86/9285590/0fba444a558e/COGS-46-0-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e86/9285590/bbd38bcf130e/COGS-46-0-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e86/9285590/45d9b51b48b7/COGS-46-0-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e86/9285590/790b8e9e96ab/COGS-46-0-g006.jpg

Similar Articles

1
Context Matters: Recovering Human Semantic Structure from Machine Learning Analysis of Large-Scale Text Corpora.
Cogn Sci. 2022 Feb;46(2):e13085. doi: 10.1111/cogs.13085.
2
A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
3
Neural sentence embedding models for semantic similarity estimation in the biomedical domain.
BMC Bioinformatics. 2019 Apr 11;20(1):178. doi: 10.1186/s12859-019-2789-2.
4
Effective type label-based synergistic representation learning for biomedical event trigger detection.
BMC Bioinformatics. 2024 Jul 31;25(1):251. doi: 10.1186/s12859-024-05851-1.
5
HPO2Vec+: Leveraging heterogeneous knowledge resources to enrich node embeddings for the Human Phenotype Ontology.
J Biomed Inform. 2019 Aug;96:103246. doi: 10.1016/j.jbi.2019.103246. Epub 2019 Jun 27.
6
Constructing Semantic Models From Words, Images, and Emojis.
Cogn Sci. 2020 Apr;44(4):e12830. doi: 10.1111/cogs.12830.
7
Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records.
BMC Med Inform Decis Mak. 2020 Apr 30;20(Suppl 1):73. doi: 10.1186/s12911-020-1044-0.
8
BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.
PLoS Comput Biol. 2020 Apr 23;16(4):e1007617. doi: 10.1371/journal.pcbi.1007617. eCollection 2020 Apr.
9
Multi-Ontology Refined Embeddings (MORE): A hybrid multi-ontology and corpus-based semantic representation model for biomedical concepts.
J Biomed Inform. 2020 Nov;111:103581. doi: 10.1016/j.jbi.2020.103581. Epub 2020 Oct 1.
10
Large scale biomedical texts classification: a kNN and an ESA-based approaches.
J Biomed Semantics. 2016 Jun 16;7:40. doi: 10.1186/s13326-016-0073-1.

Cited By

1
THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior.
Elife. 2023 Feb 27;12:e82580. doi: 10.7554/eLife.82580.
2
Beyond the Benchmarks: Toward Human-Like Lexical Representations.
Front Artif Intell. 2022 May 24;5:796741. doi: 10.3389/frai.2022.796741. eCollection 2022.
3
Semantic projection recovers rich human knowledge of multiple object features from word embeddings.
Nat Hum Behav. 2022 Jul;6(7):975-987. doi: 10.1038/s41562-022-01316-8. Epub 2022 Apr 14.

References

1
Semantic projection recovers rich human knowledge of multiple object features from word embeddings.
Nat Hum Behav. 2022 Jul;6(7):975-987. doi: 10.1038/s41562-022-01316-8. Epub 2022 Apr 14.
2
Similarity Judgment Within and Across Categories: A Comprehensive Model Comparison.
Cogn Sci. 2021 Aug;45(8):e13030. doi: 10.1111/cogs.13030.
3
Behavioral correlates of cortical semantic representations modeled by word vectors.
PLoS Comput Biol. 2021 Jun 23;17(6):e1009138. doi: 10.1371/journal.pcbi.1009138. eCollection 2021 Jun.
4
Revealing the multidimensional mental representations of natural objects underlying human similarity judgements.
Nat Hum Behav. 2020 Nov;4(11):1173-1185. doi: 10.1038/s41562-020-00951-3. Epub 2020 Oct 12.
5
Evaluating (and Improving) the Correspondence Between Deep Neural Networks and Human Representations.
Cogn Sci. 2018 Nov;42(8):2648-2669. doi: 10.1111/cogs.12670. Epub 2018 Sep 3.
6
Toward a universal decoder of linguistic meaning from brain activation.
Nat Commun. 2018 Mar 6;9(1):963. doi: 10.1038/s41467-018-03068-4.
7
Tests of an exemplar-memory model of classification learning in a high-dimensional natural-science category domain.
J Exp Psychol Gen. 2018 Mar;147(3):328-353. doi: 10.1037/xge0000369. Epub 2017 Oct 23.
8
Semantics derived automatically from language corpora contain human-like biases.
Science. 2017 Apr 14;356(6334):183-186. doi: 10.1126/science.aal4230.
9
The sequence of study changes what information is attended to, encoded, and remembered during category learning.
J Exp Psychol Learn Mem Cogn. 2017 Nov;43(11):1699-1719. doi: 10.1037/xlm0000406. Epub 2017 Mar 23.
10
The neural and computational bases of semantic cognition.
Nat Rev Neurosci. 2017 Jan;18(1):42-55. doi: 10.1038/nrn.2016.150. Epub 2016 Nov 24.
11
A comparative evaluation of off-the-shelf distributed semantic representations for modelling behavioural data.
Cogn Neuropsychol. 2016 May-Jun;33(3-4):175-90. doi: 10.1080/02643294.2016.1176907.