On the advantages of word frequency and contextual diversity measures extracted from subtitles: The case of Portuguese.

Authors

Soares Ana Paula, Machado João, Costa Ana, Iriarte Álvaro, Simões Alberto, de Almeida José João, Comesaña Montserrat, Perea Manuel

Affiliations

Human Cognition Lab, CIPsi, School of Psychology, University of Minho, Minho, Portugal.

Publication information

Q J Exp Psychol (Hove). 2015;68(4):680-96. doi: 10.1080/17470218.2014.964271. Epub 2014 Nov 7.

Abstract

We examined the potential advantage of lexical databases based on subtitles and present SUBTLEX-PT, a new lexical database for 132,710 Portuguese words obtained from a 78-million-word corpus of film and television series subtitles, offering word frequency and contextual diversity measures. Additionally, we validated SUBTLEX-PT with a lexical decision study involving 1920 Portuguese words (and 1920 nonwords) of different lengths in letters (M = 6.89, SD = 2.10) and syllables (M = 2.99, SD = 0.94). Multiple regression analyses on latency and accuracy data were conducted to compare the proportion of variance explained by the Portuguese subtitle word frequency measures with that accounted for by a recent written-word frequency database (Procura-PALavras; P-PAL; Soares, Iriarte, et al., 2014). Like its international counterparts, SUBTLEX-PT explains approximately 15% more of the variance in the lexical decision performance of young adults than the P-PAL database. Moreover, in line with recent studies, contextual diversity accounted for approximately 2% more of the variance in participants' reading performance than the raw frequency counts obtained from subtitles. SUBTLEX-PT is freely available for research purposes (at http://p-pal.di.uminho.pt/about/databases).
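The two measures named in the abstract can be illustrated with a minimal sketch. It assumes the usual SUBTLEX-style definitions, which the abstract itself does not spell out: word frequency as occurrences per million tokens, and contextual diversity as the percentage of subtitle files in which a word appears at least once. The function name `subtitle_measures`, the tokenizer, and the toy corpus are hypothetical and are not taken from SUBTLEX-PT.

```python
import re
from collections import Counter


def subtitle_measures(documents):
    """Return {word: (frequency_per_million, contextual_diversity_percent)}.

    Assumption: each string in `documents` is the full text of one
    subtitle file (one film or episode).
    """
    token_counts = Counter()  # total occurrences of each word
    doc_counts = Counter()    # number of files containing each word
    for text in documents:
        # Simplistic tokenizer: runs of Unicode letters, lowercased.
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        token_counts.update(tokens)
        doc_counts.update(set(tokens))
    total_tokens = sum(token_counts.values())
    n_docs = len(documents)
    return {
        word: (
            token_counts[word] * 1_000_000 / total_tokens,
            100 * doc_counts[word] / n_docs,
        )
        for word in token_counts
    }


# Hypothetical toy corpus of four "subtitle files".
toy_corpus = [
    "o gato dorme",
    "o gato come",
    "o gato",
    "casa casa casa casa",
]
measures = subtitle_measures(toy_corpus)
```

In this toy corpus "casa" occurs more often than "gato" (4 versus 3 tokens) but only in one file, so it has the higher frequency and the lower contextual diversity. That dissociation is what lets the two measures explain different amounts of variance in a regression.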

