Department of Psychology, Emmanuel College, 400 The Fenway, Boston, MA 02115, USA.
Behav Res Methods. 2011 Mar;43(1):77-88. doi: 10.3758/s13428-010-0042-z.
In this article, we introduce a software package that applies a corpus-based algorithm to derive semantic representations of words. The algorithm analyzes contextual information extracted from a text corpus, specifically word co-occurrences in a large-scale electronic database of text. A target word is represented by combining two averages: the average of all words that precede the target in the corpus and the average of all words that follow it. The semantic representations of the target words can be further processed by a self-organizing map (SOM; Kohonen, Self-organizing maps, 2001), an unsupervised neural network model that provides efficient data extraction and representation. Because the SOM preserves topography, it projects the statistical structure of the context onto a 2-D space, so that words with similar meanings cluster together and form groups that correspond to lexically meaningful categories. Such a representation system has applications in a variety of contexts, including computational modeling of language acquisition and processing. In this report, we present specific examples from two languages (English and Chinese) to demonstrate how the method is applied to extract the semantic representations of words.
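The contextual representation described in the abstract can be illustrated with a minimal sketch. It is not the package's implementation. It assumes that each context word is coded as a one-hot vector over the vocabulary, that only the immediately adjacent word on each side counts as context, and that the two averages are combined by concatenation. The function name `context_vectors` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def context_vectors(tokens):
    """Return (vocab, vectors) for a tokenized corpus.

    Each vector is the average one-hot code of the words immediately
    preceding the target, concatenated with the average one-hot code of
    the words immediately following it. One-hot coding, the one-word
    window, and concatenation are assumptions made for illustration.
    """
    vocab = sorted(set(tokens))
    before = defaultdict(Counter)  # target -> counts of preceding words
    after = defaultdict(Counter)   # target -> counts of following words

    # Walk over adjacent pairs once, recording both directions.
    for left, right in zip(tokens, tokens[1:]):
        after[left][right] += 1
        before[right][left] += 1

    def mean_one_hot(counts):
        # Averaging one-hot vectors yields relative frequencies.
        total = sum(counts.values())
        return [counts[w] / total if total else 0.0 for w in vocab]

    vectors = {w: mean_one_hot(before[w]) + mean_one_hot(after[w]) for w in vocab}
    return vocab, vectors


# Toy corpus (hypothetical): "cat" and "dog" occur in the same contexts.
tokens = "the cat sat the dog sat".split()
vocab, vectors = context_vectors(tokens)
print(vocab)
print(vectors["cat"] == vectors["dog"])
```

In this toy corpus, "cat" and "dog" share the same neighbors and therefore receive identical vectors, which is the property that lets a subsequent SOM place words with similar meanings close together on the map.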