Nedumpozhimana Vasudevan, Klubička Filip, Kelleher John D
ADAPT Centre, Technological University Dublin, Dublin, Ireland.
Front Artif Intell. 2022 Mar 14;5:813967. doi: 10.3389/frai.2022.813967. eCollection 2022.
This article examines the basis of Natural Language Understanding in transformer-based language models, such as BERT. It does this through a case study on idiom token classification. We use idiom token identification as the basis for our analysis because of the variety of information types that have previously been explored in the literature for this task, including topic, lexical, and syntactic features. This variety of relevant information types means that the task of idiom token identification enables us to explore the forms of linguistic information that a BERT language model captures and encodes in its representations. The core of this article presents three experiments. The first experiment analyzes the effectiveness of BERT sentence embeddings for creating a general idiom token identification model, and the results indicate that BERT sentence embeddings outperform Skip-Thought. In the second and third experiments, we use the game-theoretic concept of Shapley values to rank the usefulness of individual idiomatic expressions for model training, and we use this ranking to analyze the types of information that the model finds useful. We find that a combination of idiom-intrinsic and topic-based properties contributes to an expression's usefulness for idiom token identification. Overall, our results indicate that BERT efficiently encodes a variety of information, ranging from topic through lexical and syntactic information. Based on these results, we argue that, notwithstanding recent criticisms of language-model-based semantics, the ability of BERT to efficiently encode a variety of linguistic information types does represent a significant step forward in natural language understanding.
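The Shapley-value ranking described above treats each idiomatic expression as a "player" whose value is its average marginal contribution to model performance across all possible subsets of training expressions. A minimal sketch of this idea follows, using Monte Carlo permutation sampling to estimate Shapley values. The expressions, the `toy_utility` function (standing in for validation accuracy after training on a subset), and the individual contribution numbers are all illustrative assumptions, not data from the article:

```python
import random
from typing import Callable, Dict, FrozenSet, List


def shapley_values(players: List[str],
                   utility: Callable[[FrozenSet[str]], float],
                   n_permutations: int = 2000,
                   seed: int = 0) -> Dict[str, float]:
    """Monte Carlo estimate of each player's Shapley value.

    For each sampled permutation of the players, a player's marginal
    contribution is utility(prefix + player) - utility(prefix); averaging
    these contributions over permutations approximates the Shapley value.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(n_permutations):
        order = players[:]
        rng.shuffle(order)
        coalition: set = set()
        prev = utility(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = utility(frozenset(coalition))
            totals[p] += cur - prev
            prev = cur
    return {p: t / n_permutations for p, t in totals.items()}


# Hypothetical per-expression contributions to validation accuracy.
contrib = {"kick the bucket": 0.05, "spill the beans": 0.12, "hit the road": 0.02}


def toy_utility(coalition: FrozenSet[str]) -> float:
    # Stand-in for "validation accuracy of a classifier trained on these
    # expressions": a base accuracy plus each expression's contribution.
    return 0.5 + sum(contrib[p] for p in coalition)


vals = shapley_values(list(contrib), toy_utility)
ranked = sorted(vals, key=vals.get, reverse=True)
```

Because the toy utility is additive, the estimated Shapley values recover the individual contributions exactly; with a real, non-additive utility (e.g. retraining a classifier per subset), the permutation sampling averages out interaction effects between expressions.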