Dealing with zero word frequencies: a review of the existing rules of thumb and a suggestion for an evidence-based choice.

Affiliation

Department of Experimental Psychology, Ghent University, H. Dunantlaan 2, 9000, Ghent, Belgium.

Publication information

Behav Res Methods. 2013 Jun;45(2):422-30. doi: 10.3758/s13428-012-0270-5.

DOI: 10.3758/s13428-012-0270-5
PMID: 23055175

Abstract

In a critical review of the heuristics used to deal with zero word frequencies, we show that four are suboptimal, one is good, and one may be acceptable. The four suboptimal strategies are discarding words with zero frequencies, giving words with zero frequencies a very low frequency, adding 1 to the frequency per million, and making use of the Good-Turing algorithm. The good algorithm is the Laplace transformation, which consists of adding 1 to each frequency count and increasing the total corpus size by the number of word types observed. A strategy that may be acceptable is to guess the frequency of absent words on the basis of other corpora and then increasing the total corpus size by the estimated summed frequency of the missing words. A comparison with the lexical decision times of the English Lexicon Project and the British Lexicon Project suggests that the Laplace transformation gives the most useful estimates (in addition to being easy to calculate). Therefore, we recommend it to researchers.
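
As a rough illustration of the Laplace transformation described in the abstract (add 1 to each frequency count, and increase the total corpus size by the number of word types observed), the calculation might be sketched as follows. The function name, the toy corpus, and the per-million scaling are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter


def laplace_frequency_per_million(counts, word):
    """Laplace-adjusted frequency per million words for `word`.

    `counts` maps each observed word type to its raw corpus count.
    Hypothetical helper for illustration only, not the authors' code.
    """
    n_tokens = sum(counts.values())   # total corpus size (tokens)
    n_types = len(counts)             # number of word types observed
    adjusted_count = counts.get(word, 0) + 1   # add 1 to the count
    adjusted_size = n_tokens + n_types         # enlarge the corpus size
    return adjusted_count / adjusted_size * 1_000_000


# Toy corpus (an assumption for the example): 8 tokens, 6 types.
counts = Counter("the cat sat on the mat the end".split())

seen = laplace_frequency_per_million(counts, "the")    # (3 + 1) / (8 + 6)
unseen = laplace_frequency_per_million(counts, "dog")  # (0 + 1) / (8 + 6)
print(f"the: {seen:.1f} per million, dog: {unseen:.1f} per million")
```

Under this sketch a word that is absent from the corpus still receives a small non-zero estimate, so it can be log-transformed together with the observed words instead of being discarded.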

Similar articles

1. Dealing with zero word frequencies: a review of the existing rules of thumb and a suggestion for an evidence-based choice.
   Behav Res Methods. 2013 Jun;45(2):422-30. doi: 10.3758/s13428-012-0270-5.
2. The word frequency effect: a review of recent developments and implications for the choice of frequency estimates in German.
   Exp Psychol. 2011;58(5):412-24. doi: 10.1027/1618-3169/a000123.
3. SUBTLEX-UK: a new and improved word frequency database for British English.
   Q J Exp Psychol (Hove). 2014;67(6):1176-90. doi: 10.1080/17470218.2013.850521. Epub 2014 Jan 13.
4. Social Media and Language Processing: How Facebook and Twitter Provide the Best Frequency Estimates for Studying Word Recognition.
   Cogn Sci. 2017 May;41(4):976-995. doi: 10.1111/cogs.12392. Epub 2016 Aug 1.
5. Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English.
   Behav Res Methods. 2009 Nov;41(4):977-90. doi: 10.3758/BRM.41.4.977.
6. Subtlex-pl: subtitle-based word frequency estimates for Polish.
   Behav Res Methods. 2015 Jun;47(2):471-83. doi: 10.3758/s13428-014-0489-4.
7. A database of 629 English compound words: ratings of familiarity, lexeme meaning dominance, semantic transparency, age of acquisition, imageability, and sensory experience.
   Behav Res Methods. 2015 Dec;47(4):1004-1019. doi: 10.3758/s13428-014-0523-6.
8. The impact of word prevalence on lexical decision times: Evidence from the Dutch Lexicon Project 2.
   J Exp Psychol Hum Percept Perform. 2016 Mar;42(3):441-58. doi: 10.1037/xhp0000159. Epub 2015 Oct 26.
9. Spoken word frequency counts based on 1.6 million words in American English.
   Behav Res Methods. 2007 Nov;39(4):1025-8. doi: 10.3758/bf03193000.
10. Just Google It: An Approach on Word Frequencies Based on Online Search Result.
    J Gen Psychol. 2018 Apr-Jun;145(2):170-182. doi: 10.1080/00221309.2018.1459451. Epub 2018 May 14.

Cited by

1. Taboo language across the globe: A multi-lab study.
   Behav Res Methods. 2024 Apr;56(4):3794-3813. doi: 10.3758/s13428-024-02376-6. Epub 2024 May 9.
2. The Children and Young People's Books Lexicon (CYP-LEX): A large-scale lexical database of books read by children and young people in the United Kingdom.
   Q J Exp Psychol (Hove). 2024 Dec;77(12):2418-2438. doi: 10.1177/17470218241229694. Epub 2024 Mar 12.
3. Frequency effects in Spanish phonological speech errors: Weak sources in the context of weak syllables and words.
   Appl Psycholinguist. 2023 Sep;44(5):722-749. doi: 10.1017/s0142716423000231. Epub 2023 May 4.
4. Dynamics of Functional Networks for Syllable and Word-Level Processing.
   Neurobiol Lang (Camb). 2023 Mar 8;4(1):120-144. doi: 10.1162/nol_a_00089. eCollection 2023.
5. PHOR-in-One: A multilingual lexical database with PHonological, ORthographic and PHonographic word similarity estimates in four languages.
   Behav Res Methods. 2023 Oct;55(7):3699-3725. doi: 10.3758/s13428-022-01985-3. Epub 2022 Nov 7.
6. Dopamine-Related Reduction of Semantic Spreading Activation in Patients With Parkinson's Disease.
   Front Hum Neurosci. 2022 Mar 31;16:837122. doi: 10.3389/fnhum.2022.837122. eCollection 2022.
7. Lexical-semantic search related to side of onset and putamen volume in Parkinson's disease.
   Brain Lang. 2020 Oct;209:104841. doi: 10.1016/j.bandl.2020.104841. Epub 2020 Aug 17.
8. Understanding Karma Police: The Perceived Plausibility of Noun Compounds as Predicted by Distributional Models of Semantic Representation.
   PLoS One. 2016 Oct 12;11(10):e0163200. doi: 10.1371/journal.pone.0163200. eCollection 2016.