More data trumps smarter algorithms: comparing pointwise mutual information with latent semantic analysis.

Author information

Recchia Gabriel, Jones Michael N

Affiliation

Cognitive Science Program, Indiana University, Bloomington, Indiana 47406-7512, USA.

Publication information

Behav Res Methods. 2009 Aug;41(3):647-56. doi: 10.3758/BRM.41.3.647.

PMID:19587174

Abstract

Computational models of lexical semantics, such as latent semantic analysis, can automatically generate semantic similarity measures between words from statistical redundancies in text. These measures are useful for experimental stimulus selection and for evaluating a model's cognitive plausibility as a mechanism that people might use to organize meaning in memory. Although humans are exposed to enormous quantities of speech, practical constraints limit the amount of data that many current computational models can learn from. We follow up on previous work evaluating a simple metric of pointwise mutual information. Controlling for confounds in previous work, we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models. We also present a simple tool for building simple and scalable models from large corpora quickly and efficiently.
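For readers unfamiliar with the metric, the following is a minimal sketch of pointwise mutual information in its textbook form, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The abstract does not specify how the probabilities are estimated, so this sketch assumes document-level co-occurrence counts; the function name `pmi_table` and the toy corpus are hypothetical and are not taken from the authors' tool.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Build a pmi(w1, w2) function from document-level co-occurrence counts.

    Assumption: p(w) is the fraction of documents containing w, and
    p(w1, w2) is the fraction of documents containing both words.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing each unordered word pair
    for doc in documents:
        vocab = set(doc.lower().split())
        word_df.update(vocab)
        pair_df.update(frozenset(p) for p in combinations(sorted(vocab), 2))

    def pmi(w1, w2):
        joint = pair_df[frozenset((w1, w2))]
        if joint == 0:
            # The words never co-occur in this corpus.
            return float("-inf")
        # log2( p(w1, w2) / (p(w1) * p(w2)) ), with the counts simplified.
        return math.log2(joint * n_docs / (word_df[w1] * word_df[w2]))

    return pmi


# Hypothetical toy corpus, for illustration only.
toy_docs = ["cat dog", "cat dog", "fish bird", "fish tree"]
pmi = pmi_table(toy_docs)
print(pmi("cat", "dog"))
print(pmi("cat", "fish"))
```

Because the estimate relies only on counts, it can be accumulated in a single pass over a corpus, which is one reason a metric of this kind scales readily to very large amounts of text.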
