Performance impact of stop lists and morphological decomposition on word-word corpus-based semantic space models.

Authors

Keith Jeff, Westbury Chris, Goldman James

Affiliation

University of Alberta, Edmonton, Alberta, Canada.

Publication information

Behav Res Methods. 2015 Sep;47(3):666-84. doi: 10.3758/s13428-015-0614-z.

Abstract

Corpus-based semantic space models, which primarily rely on lexical co-occurrence statistics, have proven effective in modeling and predicting human behavior in a number of experimental paradigms that explore semantic memory representation. The most widely studied extant models, however, are strongly influenced by orthographic word frequency (e.g., Shaoul & Westbury, Behavior Research Methods, 38, 190-195, 2006). This implies that high-frequency closed-class words can bias co-occurrence statistics. Because these closed-class words are purported to carry primarily syntactic, rather than semantic, information, the performance of corpus-based semantic space models may be improved by excluding closed-class words (using stop lists) from co-occurrence statistics, while retaining their syntactic information through other means (e.g., part-of-speech tagging and/or affixes from inflected word forms). Additionally, very little work has been done to explore the effect of applying morphological decomposition to the inflected forms of words in corpora prior to compiling co-occurrence statistics, despite (controversial) evidence that humans perform early morphological decomposition in semantic processing. In this study, we explored the impact of these factors on corpus-based semantic space models. Our results indicate that morphological decomposition appears to significantly improve performance in word-word co-occurrence semantic space models, providing some support for the claim that sublexical information (specifically, word morphology) plays a role in lexical semantic processing. An overall decrease in performance was observed in models employing stop lists (e.g., excluding closed-class words). Furthermore, we found some evidence that weakens the claim that closed-class words supply primarily syntactic information in word-word co-occurrence semantic space models.
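To make the two manipulations discussed in the abstract concrete, the following is a minimal Python sketch of word-word co-occurrence counting with an optional stop list and an optional, very crude morphological decomposition. It is an illustration under stated assumptions, not the study's pipeline: the stop list, the suffix inventory, the window size, and the function names `decompose` and `cooccurrence_counts` are all hypothetical, and naive suffix stripping stands in for whatever decomposition procedure the authors actually used.

```python
from collections import Counter

# Hypothetical toy stop list; the study's actual lists are not given in the abstract.
STOP_WORDS = {"the", "a", "of", "on"}
# Hypothetical suffix inventory for a crude decomposition; not the study's method.
SUFFIXES = ("ing", "ed", "s")


def decompose(token):
    """Split a token into [stem, '+suffix'] using naive suffix stripping."""
    for suffix in SUFFIXES:
        # Require a stem of at least three characters to avoid splitting short words.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return [token[: -len(suffix)], "+" + suffix]
    return [token]


def cooccurrence_counts(tokens, window=2, stop_words=None, use_morphology=False):
    """Count co-occurrences of each target with words up to `window` positions ahead."""
    if use_morphology:
        # Replace each inflected form with its stem followed by an affix token.
        tokens = [part for tok in tokens for part in decompose(tok)]
    if stop_words:
        # Drop stop-listed words before counting, so they never enter a window.
        tokens = [tok for tok in tokens if tok not in stop_words]
    counts = Counter()
    for i, target in enumerate(tokens):
        for context in tokens[i + 1 : i + 1 + window]:
            counts[(target, context)] += 1
    return counts


tokens = "the cat jumped on the dogs".split()
baseline = cooccurrence_counts(tokens)
filtered = cooccurrence_counts(tokens, stop_words=STOP_WORDS)
decomposed = cooccurrence_counts(tokens, use_morphology=True)
```

In this toy sentence, removing stop words pulls "cat" and "dogs" into the same window, which they do not share in the baseline, while decomposition turns "jumped" into a stem plus an affix token, so the affix can retain some of the syntactic information that a stop list would otherwise discard.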
