subs2vec: Word embeddings from subtitles in 55 languages.

Affiliations

Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands.

Princeton University, Princeton, NJ, 08544, USA.

Publication information

Behav Res Methods. 2021 Apr;53(2):629-655. doi: 10.3758/s13428-020-01406-3.

Abstract

This paper introduces a novel collection of word embeddings, numerical representations of lexical semantics, in 55 languages, trained on a large corpus of pseudo-conversational speech transcriptions from television shows and movies. The embeddings were trained on the OpenSubtitles corpus using the fastText implementation of the skipgram algorithm. Performance comparable with (and in some cases exceeding) embeddings trained on non-conversational (Wikipedia) text is reported on standard benchmark evaluation datasets. A novel evaluation method of particular relevance to psycholinguists is also introduced: prediction of experimental lexical norms in multiple languages. The models, as well as code for reproducing the models and all analyses reported in this paper (implemented as a user-friendly Python package), are freely available at https://github.com/jvparidon/subs2vec.
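As a rough illustration of how word embeddings of this kind are typically used, the sketch below reads vectors in the plain-text `.vec` layout that fastText writes (a header line with word count and dimensionality, then one word and its vector per line) and compares words by cosine similarity. It is a minimal sketch, not the subs2vec package's own API: the assumption that the vectors are available in this text layout, the helper names `load_vec` and `cosine`, and the three toy vectors are all hypothetical and chosen only for the example.

```python
import io
import math

# Toy stand-in for a fastText-style .vec text file (hypothetical values):
# header line "<n_words> <dim>", then one word and its vector per line.
TOY_VEC = """3 3
cat 1.0 0.0 0.0
dog 0.8 0.6 0.0
car 0.0 0.0 1.0
"""


def load_vec(handle):
    """Read word vectors from a .vec-style text stream into a dict."""
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vectors = load_vec(io.StringIO(TOY_VEC))
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
print(sim_cat_dog, sim_cat_car)
```

With real embeddings, the same `load_vec` call could be pointed at an opened `.vec` file instead of the in-memory string, provided the file follows the layout assumed here.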
