Meng Yu, Huang Jiaxin, Wang Guangyuan, Wang Zihan, Zhang Chao, Han Jiawei
Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL, United States.
School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, GA, United States.
Front Big Data. 2020 Mar 11;3:9. doi: 10.3389/fdata.2020.00009. eCollection 2020.
Word embedding has benefited a broad spectrum of text analysis tasks by learning distributed word representations that encode word semantics. Word representations are typically learned by modeling the local contexts of words, under the assumption that words sharing similar surrounding words are semantically close. We argue that local contexts can only partially define word semantics in unsupervised word embedding learning. Global contexts, referring to broader semantic units such as the document or paragraph in which a word appears, capture different aspects of word semantics and complement local contexts. We propose two simple yet effective unsupervised word embedding models that jointly model local and global contexts to learn word representations. We provide theoretical interpretations of the proposed models, showing how local and global contexts are jointly modeled under an assumed generative relationship between words and contexts. We conduct a thorough evaluation on a wide range of benchmark datasets. Our quantitative analysis and case studies show that, despite their simplicity, the two proposed models achieve superior performance on word similarity and text classification tasks.
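To make the local-plus-global idea concrete, below is a minimal Python/NumPy sketch of one way to combine a skip-gram-style local objective (predicting surrounding words within a window) with a document-level global term (predicting the document a word appears in). It is an illustrative assumption, not the two models proposed in the paper; the toy corpus, hyperparameters, negative-sampling scheme, and update rules are all hypothetical.

import numpy as np

# Toy corpus (hypothetical): each document is a list of tokens.
corpus = [
    ["word", "embeddings", "encode", "word", "semantics"],
    ["local", "contexts", "and", "global", "contexts", "define", "semantics"],
]

vocab = sorted({w for doc in corpus for w in doc})
w2i = {w: i for i, w in enumerate(vocab)}

dim, window, lr, epochs = 16, 2, 0.05, 200
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(len(vocab), dim))   # target word vectors
C = rng.normal(scale=0.1, size=(len(vocab), dim))   # local-context word vectors
D = rng.normal(scale=0.1, size=(len(corpus), dim))  # global (document) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(epochs):
    for d_idx, doc in enumerate(corpus):
        for pos, word in enumerate(doc):
            wi = w2i[word]
            # Local objective: skip-gram over a sliding window,
            # with one random negative sample per positive pair.
            lo, hi = max(0, pos - window), min(len(doc), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos == pos:
                    continue
                ci = w2i[doc[ctx_pos]]
                w_old = W[wi].copy()
                g = (1.0 - sigmoid(w_old @ C[ci])) * lr   # pull positive pair together
                W[wi] += g * C[ci]
                C[ci] += g * w_old
                ni = int(rng.integers(len(vocab)))        # push a random word away
                g = -sigmoid(w_old @ C[ni]) * lr
                W[wi] += g * C[ni]
                C[ni] += g * w_old
            # Global objective: also pull the word toward its document vector.
            w_old = W[wi].copy()
            g = (1.0 - sigmoid(w_old @ D[d_idx])) * lr
            W[wi] += g * D[d_idx]
            D[d_idx] += g * w_old

# Each word vector is now shaped by both its local windows and its documents.
print(W[w2i["semantics"]][:4])

In this sketch the global term simply treats the enclosing document as an extra "context" for every word occurrence; the paper's actual models and their generative interpretation should be consulted for the precise objectives.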