Liu Wei, Xiang Ming, Ding Nai
Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou, China.
Department of Linguistics, The University of Chicago, Chicago, IL, USA.
Nat Hum Behav. 2025 Sep 10. doi: 10.1038/s41562-025-02297-0.
Understanding how sentences are represented in the human brain, as well as in large language models (LLMs), poses a substantial challenge for cognitive science. Here we develop a one-shot learning task to investigate whether humans and LLMs encode tree-structured constituents within sentences. Participants (N = 372 in total; native speakers of Chinese or English, including Chinese–English bilinguals) and LLMs (for example, ChatGPT) were asked to infer which words should be deleted from a sentence. Both groups tended to delete constituents rather than non-constituent word strings, following rules specific to Chinese and English, respectively. These results cannot be explained by models that rely only on word properties and word positions. Crucially, the underlying constituency tree structure can be successfully reconstructed from the word strings deleted by either humans or LLMs. Altogether, these results demonstrate that latent tree-structured sentence representations emerge in both humans and LLMs.
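The reconstruction claim in the abstract rests on a simple structural fact: each deleted word string defines a span of word positions, and a collection of spans determines a unique tree exactly when every pair of spans is either nested or disjoint (a laminar family). The sketch below illustrates this idea only; the function name, dictionary-based node representation, and failure behavior are our own assumptions, not the paper's actual reconstruction procedure.

```python
def build_tree(n, spans):
    """Reconstruct a constituency tree from a set of word spans.

    Spans are half-open (start, end) word-index intervals over a sentence
    of n words. If every pair of spans is nested or disjoint, the spans
    define a unique tree rooted at the whole sentence; if any two spans
    cross, reconstruction fails with a ValueError.
    """
    # Sort by start ascending, then end descending, so each span is
    # visited after every span that contains it.
    ordered = sorted(set(spans) | {(0, n)}, key=lambda s: (s[0], -s[1]))
    root = {"span": ordered[0], "children": []}
    stack = [root]  # current chain of open (ancestor) constituents
    for s in ordered[1:]:
        # Close ancestors that end before this span starts.
        while s[0] >= stack[-1]["span"][1]:
            stack.pop()
        parent = stack[-1]
        if s[1] > parent["span"][1]:
            raise ValueError(f"spans {s} and {parent['span']} cross: "
                             "not tree-structured")
        node = {"span": s, "children": []}
        parent["children"].append(node)
        stack.append(node)
    return root
```

For the six-word sentence "the cat sat on the mat", deletions of "the cat" (0, 2), "on the mat" (3, 6), and "the mat" (4, 6) nest cleanly and yield a tree, whereas crossing deletions such as (0, 3) and (2, 5) are rejected, mirroring the abstract's contrast between constituent and non-constituent word strings.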