

SUBTLEX-CH: Chinese word and character frequencies based on film subtitles.

Affiliation

Department of Experimental Psychology, Ghent University, Ghent, Belgium.

Publication information

PLoS One. 2010 Jun 2;5(6):e10729. doi: 10.1371/journal.pone.0010729.

Abstract

BACKGROUND

Word frequency is the most important variable in language research. However, despite the growing interest in the Chinese language, only a few sources of word frequency measures are available to researchers, and their quality falls short of what researchers working on other languages are used to.

METHODOLOGY

Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.

CONCLUSIONS

Our results confirm that word frequencies based on subtitles are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. In addition, our database is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
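To make the measures named above concrete, the following is a minimal sketch of how word frequency, character frequency, and contextual diversity could be computed from a subtitle corpus. It is not the authors' pipeline. It assumes each film's subtitles have already been segmented into words (Chinese text has no spaces, so segmentation is a separate step), and the names `frequency_table`, `per_million`, `cd`, and `cd_pct` are hypothetical, chosen only for illustration. Contextual diversity is taken here to mean the number of films in which a word occurs.

```python
from collections import Counter


def frequency_table(documents):
    """Build word and character frequency measures.

    documents: list of token lists, one list per film.
    Assumption: tokens are already word-segmented.
    """
    word_counts = Counter()  # total occurrences of each word
    doc_counts = Counter()   # number of films containing each word
    char_counts = Counter()  # total occurrences of each character

    for tokens in documents:
        word_counts.update(tokens)
        doc_counts.update(set(tokens))  # count each word once per film
        for token in tokens:
            char_counts.update(token)   # iterates over the characters

    total_words = sum(word_counts.values())
    table = {
        word: {
            "count": count,
            "per_million": count * 1_000_000 / total_words,
            "cd": doc_counts[word],
            "cd_pct": 100 * doc_counts[word] / len(documents),
        }
        for word, count in word_counts.items()
    }
    return table, char_counts


# Toy corpus of two "films" (illustrative data, not from SUBTLEX-CH).
films = [["我", "喜欢", "电影"], ["我", "看", "电影", "电影"]]
table, chars = frequency_table(films)
print(table["电影"])
```

Counting a word once per film (via `set`) is what separates contextual diversity from raw frequency: a word repeated many times in a single film gets a high count but a low `cd`.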


