Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands.
Princeton University, Princeton, NJ, 08544, USA.
Behav Res Methods. 2021 Apr;53(2):629-655. doi: 10.3758/s13428-020-01406-3.
This paper introduces a novel collection of word embeddings, numerical representations of lexical semantics, in 55 languages, trained on a large corpus of pseudo-conversational speech transcriptions from television shows and movies. The embeddings were trained on the OpenSubtitles corpus using the fastText implementation of the skipgram algorithm. Performance comparable to (and in some cases exceeding) that of embeddings trained on non-conversational (Wikipedia) text is reported on standard benchmark evaluation datasets. A novel evaluation method of particular relevance to psycholinguists is also introduced: prediction of experimental lexical norms in multiple languages. The models, as well as code for reproducing the models and all analyses reported in this paper (implemented as a user-friendly Python package), are freely available at: https://github.com/jvparidon/subs2vec.
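As a rough sketch only (not the paper's actual pipeline, which is provided in the subs2vec repository linked above), embeddings of this kind could be trained with the fastText Python bindings along the following lines; the input file name, dimensionality, and preprocessing are assumptions here, not details taken from the abstract.

    import fasttext

    # Assumed input: a plain-text file with one tokenized subtitle line per row,
    # e.g. extracted from the OpenSubtitles corpus (file name is hypothetical).
    model = fasttext.train_unsupervised(
        "opensubtitles_en.txt",
        model="skipgram",  # skipgram architecture, as named in the abstract
        dim=300,           # embedding dimensionality (an assumption)
        minCount=5,        # ignore very rare tokens
    )

    # Inspect a word vector and save the trained model.
    print(model.get_word_vector("hello")[:5])
    model.save_model("subs_skipgram_en.bin")

The released vectors can also be used without any retraining, for example by loading them with gensim's KeyedVectors.load_word2vec_format, assuming they are distributed in the standard word2vec text format.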