Department of Experimental Psychology, Ghent University, H. Dunantlaan 2, 9000, Ghent, Belgium.
Behav Res Methods. 2013 Jun;45(2):422-30. doi: 10.3758/s13428-012-0270-5.
In a critical review of the heuristics used to deal with zero word frequencies, we show that four are suboptimal, one is good, and one may be acceptable. The four suboptimal strategies are discarding words with zero frequencies, giving words with zero frequencies a very low frequency, adding 1 to the frequency per million, and making use of the Good-Turing algorithm. The good algorithm is the Laplace transformation, which consists of adding 1 to each frequency count and increasing the total corpus size by the number of word types observed. A strategy that may be acceptable is to guess the frequency of absent words on the basis of other corpora and then to increase the total corpus size by the estimated summed frequency of the missing words. A comparison with the lexical decision times of the English Lexicon Project and the British Lexicon Project suggests that the Laplace transformation gives the most useful estimates (in addition to being easy to calculate). Therefore, we recommend it to researchers.
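As a concrete illustration of the recommended Laplace transformation (add 1 to each raw count, enlarge the corpus size by the number of observed word types), the following minimal Python sketch computes a smoothed frequency per million. The corpus size, type count, and raw counts are hypothetical placeholders, not values from the paper.

```python
# Sketch of the Laplace (add-one) transformation described in the abstract.
# All numbers below are illustrative assumptions, not data from the study.

def laplace_frequency_per_million(raw_count: int,
                                  corpus_tokens: int,
                                  corpus_types: int) -> float:
    """Smoothed frequency per million words: add 1 to the raw count and
    add the number of observed word types to the total corpus size."""
    return (raw_count + 1) / (corpus_tokens + corpus_types) * 1_000_000


# Hypothetical corpus: 51 million tokens, 500,000 distinct word types.
CORPUS_TOKENS = 51_000_000
CORPUS_TYPES = 500_000

# An unseen word no longer gets a frequency of zero...
print(laplace_frequency_per_million(0, CORPUS_TOKENS, CORPUS_TYPES))
# ...and observed words are shifted only slightly.
print(laplace_frequency_per_million(1200, CORPUS_TOKENS, CORPUS_TYPES))
```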