Department of Computer Science, Tufts University.
Department of Psychology, Tufts University.
Cogn Sci. 2024 Nov;48(11):e70013. doi: 10.1111/cogs.70013.
Transformer-based Large Language Models (LLMs) have recently increased in popularity, in part due to their impressive performance on a number of language tasks. While LLMs can produce human-like writing, the extent to which these models can learn to predict spoken language in natural interaction remains unclear. This is a nontrivial question, as spoken and written language differ in syntax, pragmatics, and the norms that interlocutors follow. Previous work suggests that while LLMs may develop an understanding of linguistic rules based on statistical regularities, they fail to acquire the knowledge required for language use. This implies that LLMs may not learn the normative structure underlying interactive spoken language, but may instead only model superficial regularities in speech. In this paper, we aim to evaluate LLMs as models of spoken dialogue. Specifically, we investigate whether LLMs can learn that the identity of a speaker in spoken dialogue influences what is likely to be said. To answer this question, we first fine-tuned two variants of a specific LLM (GPT-2) on transcripts of natural spoken dialogue in English. Then, we used these models to compute surprisal values for two-turn sequences with the same first turn but different second-turn speakers and compared the output to human behavioral data. While the predictability of words in all fine-tuned models was influenced by speaker identity information, the models did not replicate humans' use of this information. Our findings suggest that although LLMs may learn to generate text conforming to normative linguistic structure, they do not (yet) faithfully replicate human behavior in natural conversation.
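The abstract's key measure is surprisal: the negative log probability a language model assigns to each word of a second turn, given the preceding first turn. The sketch below illustrates how such per-token surprisal can be computed with GPT-2 via the Hugging Face transformers library; it is not the authors' code. The speaker-tag format ("A:", "B:"), the example utterances, and the use of the base "gpt2" checkpoint (rather than the paper's dialogue fine-tuned variants) are assumptions made for illustration.

```python
# Minimal sketch: mean surprisal (bits per token) of a second turn given a first turn,
# using GPT-2 from Hugging Face transformers. Not the authors' implementation.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2"  # placeholder; a checkpoint fine-tuned on dialogue transcripts would go here
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

def turn_surprisal(context: str, second_turn: str) -> float:
    """Mean surprisal (in bits per token) of second_turn given the first-turn context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    turn_ids = tokenizer(second_turn, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, turn_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # log-probability of each token given all preceding tokens
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_logp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the positions that belong to the second turn
    second_turn_logp = token_logp[0, ctx_ids.shape[1] - 1:]
    return (-second_turn_logp / torch.log(torch.tensor(2.0))).mean().item()

# Hypothetical example: same first turn, different speaker tag on the second turn
first_turn = "A: Did you catch the game last night?"
print(turn_surprisal(first_turn, " B: Yeah, it went to overtime."))
print(turn_surprisal(first_turn, " A: Yeah, it went to overtime."))
```

Under the paper's design, comparing such surprisal values across second turns that differ only in the speaker tag probes whether the model's word predictions are sensitive to speaker identity.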