Department of Informatics & Artificial Intelligence eXploration Research Center, The University of Electro-Communications.
Cogn Sci. 2020 Jun;44(6):e12844. doi: 10.1111/cogs.12844.
Distributional semantic models, or word embeddings, are pervasively used in both cognitive modeling and practical applications because of their remarkable ability to represent the meanings of words. However, relatively little effort has been made to explore what types of information are encoded in distributional word vectors. Knowing what knowledge is embedded in word vectors is important for cognitive modeling with distributional semantic models. In this paper, we therefore attempt to identify the knowledge encoded in word vectors through a computational experiment using Binder et al.'s (2016) featural conceptual representations, which are based on neurobiologically motivated attributes. In the experiment, these conceptual vectors are predicted from text-based word vectors using a neural network and a linear transformation, and prediction performance is compared across types of information. The analysis demonstrates that abstract information is generally predicted more accurately from word vectors than perceptual and spatiotemporal information, and that prediction accuracy is particularly high for cognitive and social information. Emotional information is also found to be successfully predicted for abstract words. These results indicate that language can be a major source of knowledge about abstract attributes, and they support the recent view that emphasizes the importance of language for abstract concepts. Furthermore, we show that word vectors can capture some types of perceptual and spatiotemporal information about concrete concepts and some relevant word categories. This suggests that language statistics can encode more perceptual knowledge than is often expected.
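The mapping described in the abstract can be illustrated with a minimal sketch. This is not the paper's code: the embedding dimensionality, the number of words, and the synthetic data are all hypothetical, and only the linear-transformation baseline (fit by least squares) is shown, not the neural network. The 65 feature dimensions stand in for Binder et al.'s (2016) attribute ratings.

```python
import numpy as np

# Illustrative sketch (not the authors' implementation): predict featural
# conceptual vectors from text-based word vectors with a linear map, then
# score per-feature prediction accuracy on held-out words.

rng = np.random.default_rng(0)

n_words, emb_dim, n_features = 200, 50, 65  # hypothetical sizes
word_vecs = rng.normal(size=(n_words, emb_dim))  # stand-in word embeddings
true_map = rng.normal(size=(emb_dim, n_features))
# Synthetic "conceptual vectors": a linear function of the embeddings plus noise.
feature_vecs = word_vecs @ true_map + 0.1 * rng.normal(size=(n_words, n_features))

# Train/test split; fit W minimizing ||X_tr @ W - Y_tr||^2.
X_tr, X_te = word_vecs[:150], word_vecs[150:]
Y_tr, Y_te = feature_vecs[:150], feature_vecs[150:]
W, *_ = np.linalg.lstsq(X_tr, Y_tr, rcond=None)

# Per-feature accuracy: correlation between predicted and true ratings on
# held-out words; in the paper, such scores are compared across feature types
# (perceptual, spatiotemporal, cognitive, social, emotional, ...).
Y_pred = X_te @ W
corrs = [np.corrcoef(Y_pred[:, j], Y_te[:, j])[0, 1] for j in range(n_features)]
print(round(float(np.mean(corrs)), 2))
```

With real data, each feature column would hold human attribute ratings for the word set, and the per-feature correlations would be grouped by attribute type to compare how well language-derived vectors encode each kind of information.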