Johns Brendan T, Jamieson Randall K
Department of Communicative Disorders and Sciences, University at Buffalo.
Department of Psychology, University of Manitoba.
Cogn Sci. 2018 May;42(4):1360-1374. doi: 10.1111/cogs.12583. Epub 2018 Jan 22.
The collection of very large text sources has revolutionized the study of natural language, leading to the development of several models of language learning and distributional semantics that extract sophisticated semantic representations of words based on the statistical redundancies contained within natural language (e.g., Griffiths, Steyvers, & Tenenbaum; Jones & Mewhort; Landauer & Dumais; Mikolov, Sutskever, Chen, Corrado, & Dean). The models treat knowledge as an interaction of processing mechanisms and the structure of language experience. But language experience is often treated agnostically. We report a distributional semantic analysis showing that written language in fiction books varies appreciably between books from different genres, between books from the same genre, and even between books written by the same author. Given that current theories assume that word knowledge reflects an interaction between processing mechanisms and the language environment, the analysis shows the need for the field to engage in more deliberate consideration and curation of the corpora used in computational studies of natural language processing.