SUBTLEX-CH: Chinese word and character frequencies based on film subtitles.

Affiliation

Department of Experimental Psychology, Ghent University, Ghent, Belgium.

Publication information

PLoS One. 2010 Jun 2;5(6):e10729. doi: 10.1371/journal.pone.0010729.

DOI:10.1371/journal.pone.0010729
PMID:20532192
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2880003/
Abstract

BACKGROUND

Word frequency is the most important variable in language research. However, despite the growing interest in the Chinese language, there are only a few sources of word frequency measures available to researchers, and the quality is less than what researchers in other languages are used to.

METHODOLOGY

Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.
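As a rough illustration of how word and character frequencies can be derived from a subtitle corpus, the sketch below counts tokens and normalizes the counts to occurrences per million. It assumes the subtitles have already been segmented into words upstream; the function name `frequencies_per_million` and the toy input are hypothetical and are not taken from the paper.

```python
from collections import Counter


def frequencies_per_million(segmented_lines):
    """Return (word, character) frequencies per million tokens.

    `segmented_lines` is an iterable of lists of word tokens. Word
    segmentation is assumed to have been done beforehand.
    """
    word_counts = Counter()
    char_counts = Counter()
    for tokens in segmented_lines:
        word_counts.update(tokens)
        for token in tokens:
            # Updating a Counter with a string counts each character.
            char_counts.update(token)

    total_words = sum(word_counts.values())
    total_chars = sum(char_counts.values())
    word_fpm = {w: c * 1_000_000 / total_words for w, c in word_counts.items()}
    char_fpm = {ch: c * 1_000_000 / total_chars for ch, c in char_counts.items()}
    return word_fpm, char_fpm


# Toy, made-up input: two already segmented subtitle lines.
lines = [["我", "喜欢", "电影"], ["我", "看", "电影"]]
word_fpm, char_fpm = frequencies_per_million(lines)
```

Normalizing to a per-million scale makes counts comparable across corpora of different sizes, which matters when subtitle-based frequencies are compared with frequencies from written texts.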

CONCLUSIONS

Our results confirm that word frequencies based on subtitles are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. In addition, our database is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
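Contextual diversity is commonly operationalized as the number (or percentage) of documents in which a word occurs, regardless of how often it occurs in each. The sketch below applies that definition with films as the documents. This operationalization is an assumption here, since the abstract does not spell it out, and the function name `contextual_diversity` and the toy data are hypothetical.

```python
from collections import Counter


def contextual_diversity(films):
    """Map each word to (number of films containing it, percentage of films).

    `films` maps a film identifier to its list of word tokens.
    """
    film_counts = Counter()
    for tokens in films.values():
        # Use a set so a word counts once per film, however often it occurs.
        film_counts.update(set(tokens))

    n_films = len(films)
    return {w: (c, 100.0 * c / n_films) for w, c in film_counts.items()}


# Toy, made-up input: two films with segmented subtitles.
films = {
    "film_a": ["我", "喜欢", "电影", "我"],
    "film_b": ["我", "看", "书"],
}
cd = contextual_diversity(films)
```

In this toy input, "我" occurs three times overall but its contextual diversity is 2 because it appears in two films, which shows how the measure differs from a raw frequency count.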

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f02c/2880003/47728db906fe/pone.0010729.g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f02c/2880003/125a67d93e5d/pone.0010729.g002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f02c/2880003/47ea859665d7/pone.0010729.g003.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f02c/2880003/50eb07418c95/pone.0010729.g004.jpg

Similar articles

1
SUBTLEX-CH: Chinese word and character frequencies based on film subtitles.
PLoS One. 2010 Jun 2;5(6):e10729. doi: 10.1371/journal.pone.0010729.
2
SUBTLEX-NL: a new measure for Dutch word frequency based on film subtitles.
Behav Res Methods. 2010 Aug;42(3):643-50. doi: 10.3758/BRM.42.3.643.
3
SUBTLEX-CY: A new word frequency database for Welsh.
Q J Exp Psychol (Hove). 2024 May;77(5):1052-1067. doi: 10.1177/17470218231190315. Epub 2023 Aug 30.
4
SUBTLEX-UK: a new and improved word frequency database for British English.
Q J Exp Psychol (Hove). 2014;67(6):1176-90. doi: 10.1080/17470218.2013.850521. Epub 2014 Jan 13.
5
On the advantages of word frequency and contextual diversity measures extracted from subtitles: The case of Portuguese.
Q J Exp Psychol (Hove). 2015;68(4):680-96. doi: 10.1080/17470218.2014.964271. Epub 2014 Nov 7.
6
Assessing the usefulness of Google Books' word frequencies for psycholinguistic research on word processing.
Front Psychol. 2011 Mar 2;2:27. doi: 10.3389/fpsyg.2011.00027. eCollection 2011.
7
SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan.
Behav Res Methods. 2020 Feb;52(1):360-375. doi: 10.3758/s13428-019-01233-1.
8
Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English.
Behav Res Methods. 2009 Nov;41(4):977-90. doi: 10.3758/BRM.41.4.977.
9
Subtitle-based word frequencies as the best estimate of reading behavior: the case of Greek.
Front Psychol. 2010 Dec 21;1:218. doi: 10.3389/fpsyg.2010.00218. eCollection 2010.
10
Subtlex-pl: subtitle-based word frequency estimates for Polish.
Behav Res Methods. 2015 Jun;47(2):471-83. doi: 10.3758/s13428-014-0489-4.

Cited by

1
Language switching during production: The influence of preceding exposure to other bilinguals in different switching contexts.
Mem Cognit. 2025 Sep 9. doi: 10.3758/s13421-025-01787-w.
2
Syntactic Information Extraction in the Parafovea: Evidence from Two-Character Phrases in Chinese.
Behav Sci (Basel). 2025 Jul 10;15(7):935. doi: 10.3390/bs15070935.
3
The Character Position Encoding of Parafoveal Semantic Previews Is Flexible in Chinese Reading.
Behav Sci (Basel). 2025 Jul 4;15(7):907. doi: 10.3390/bs15070907.
4
The Influence of Judgments of Learning on Collaborative Memory for Items and Sequences.
Behav Sci (Basel). 2025 Jul 3;15(7):905. doi: 10.3390/bs15070905.
5
An extended Chinese social evaluative word list.
Behav Res Methods. 2025 Jul 25;57(9):236. doi: 10.3758/s13428-025-02760-w.
6
Information-theoretic measures for mapping regularities between orthography and phonology: A comprehensive quantification and validation in the Chinese writing system.
Behav Res Methods. 2025 Jul 25;57(9):232. doi: 10.3758/s13428-025-02721-3.
7
Simplified Chinese lexicon project: A lexical decision database with 8105 characters and 4864 pseudocharacters.
Behav Res Methods. 2025 Jun 23;57(7):206. doi: 10.3758/s13428-025-02701-7.
8
Chipola: A Chinese Podcast Lexical Database for capturing spoken language nuances and predicting behavioral data.
Behav Res Methods. 2025 May 8;57(6):166. doi: 10.3758/s13428-025-02697-0.
9
Bilingual Proficiency Effects on Word Recall and Recognition.
Behav Sci (Basel). 2025 Mar 28;15(4):437. doi: 10.3390/bs15040437.
10
Lexical decision times for nouns from the Croatian Psycholinguistic Database.
Behav Res Methods. 2025 Apr 25;57(6):156. doi: 10.3758/s13428-025-02676-5.

References

1
Morphological structure in the Arabic mental lexicon: Parallels between standard and dialectal Arabic.
Lang Cogn Process. 2013 Dec;28(10):1453-1473. doi: 10.1080/01690965.2012.719629. Epub 2012 Oct 31.
2
SUBTLEX-NL: a new measure for Dutch word frequency based on film subtitles.
Behav Res Methods. 2010 Aug;42(3):643-50. doi: 10.3758/BRM.42.3.643.
3
Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English.
Behav Res Methods. 2009 Nov;41(4):977-90. doi: 10.3758/BRM.41.4.977.
4
Reading spaced and unspaced Chinese text: evidence from eye movements.
J Exp Psychol Hum Percept Perform. 2008 Oct;34(5):1277-87. doi: 10.1037/0096-1523.34.5.1277.
5
The English Lexicon Project.
Behav Res Methods. 2007 Aug;39(3):445-59. doi: 10.3758/bf03193014.
6
Word naming and psycholinguistic norms: Chinese.
Behav Res Methods. 2007 May;39(2):192-8. doi: 10.3758/bf03193147.
7
Contextual diversity, not word frequency, determines word-naming and lexical decision times.
Psychol Sci. 2006 Sep;17(9):814-23. doi: 10.1111/j.1467-9280.2006.01787.x.
8
Visual word recognition of single-syllable words.
J Exp Psychol Gen. 2004 Jun;133(2):283-316. doi: 10.1037/0096-3445.133.2.283.
9
ERP evidence for the time course of graphic, phonological, and semantic information in Chinese meaning and pronunciation decisions.
J Exp Psychol Learn Mem Cogn. 2003 Nov;29(6):1231-47. doi: 10.1037/0278-7393.29.6.1231.