Andrews Mark, Vigliocco Gabriella, Vinson David
Cognitive, Perceptual and Brain Sciences, University College London, London, UK.
Psychol Rev. 2009 Jul;116(3):463-98. doi: 10.1037/a0016261.
The authors identify 2 major types of statistical data from which semantic representations can be learned. These are denoted as experiential data and distributional data. Experiential data are derived by way of experience with the physical world and comprise the sensory-motor data obtained through sense receptors. Distributional data, by contrast, describe the statistical distribution of words across spoken and written language. The authors claim that experiential and distributional data represent distinct data types and that each is a nontrivial source of semantic information. Their theoretical proposal is that human semantic representations are derived from an optimal statistical combination of these 2 data types. Using a Bayesian probabilistic model, they demonstrate how word meanings can be learned by treating experiential and distributional data as a single joint distribution and learning the statistical structure that underlies it. The semantic representations that are learned in this manner are measurably more realistic, as verified by comparison to a set of human-based measures of semantic representation, than those available from either data type individually or from both sources independently. This is not a result of merely using quantitatively more data, but rather it is because experiential and distributional data are qualitatively distinct, yet intercorrelated, types of data. The semantic representations that are learned are based on statistical structures that exist both within and between the experiential and distributional data types.
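The idea of treating the two data types as a single joint distribution can be illustrated with a toy calculation. The sketch below is not the authors' model: it assumes a simple shared latent class z that generates both a word w (the distributional side) and a feature f (the experiential side), so that P(w, f) = sum over z of P(z) P(w|z) P(f|z). All probability values and the function names `joint_word_feature` and `mutual_information` are hypothetical, chosen only for illustration. A positive mutual information between w and f shows that the joint distribution carries structure that is lost when the two sources are modeled independently.

```python
import numpy as np

# Hypothetical toy parameters (not taken from the paper):
# 2 latent classes, 3 words (distributional side), 2 features (experiential side).
p_z = np.array([0.5, 0.5])
p_w_given_z = np.array([[0.8, 0.1, 0.1],
                        [0.1, 0.1, 0.8]])
p_f_given_z = np.array([[0.9, 0.1],
                        [0.2, 0.8]])


def joint_word_feature(p_z, p_w_given_z, p_f_given_z):
    """Return P(w, f) = sum_z P(z) P(w|z) P(f|z) as a (words x features) array."""
    return np.einsum("z,zw,zf->wf", p_z, p_w_given_z, p_f_given_z)


def mutual_information(joint):
    """Mutual information (in nats) between the row and column variables."""
    p_w = joint.sum(axis=1, keepdims=True)   # marginal over words
    p_f = joint.sum(axis=0, keepdims=True)   # marginal over features
    independent = p_w * p_f                  # what an independence model would assume
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / independent[mask])))


joint = joint_word_feature(p_z, p_w_given_z, p_f_given_z)
mi = mutual_information(joint)
print("P(w, f):")
print(joint)
print("Mutual information between words and features (nats):", mi)
```

In this toy setup the mutual information is zero only if words and features are statistically independent; because both depend on the same latent class, it is strictly positive, which is the sense in which the two data types are "intercorrelated" here.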