A method for constructing word sense embeddings based on word sense induction.

Authors

Sun Yujia, Platoš Jan

Affiliations

Department of Computer Science, Technical University of Ostrava, 17. Listopadu 2172/15, 70800, Ostrava-Poruba, Czech Republic.

Institute of Network Information Security, Hebei GEO University, No. 136 East Huai'an Road, Shijiazhuang, 050031, Hebei, China.

Publication

Sci Rep. 2023 Aug 9;13(1):12945. doi: 10.1038/s41598-023-40062-3.

Abstract

Polysemy is an inherent characteristic of natural language. In order to make it easier to distinguish between different senses of polysemous words, we propose a method for encoding multiple different senses of polysemous words using a single vector. The method first uses a two-layer bidirectional long short-term memory neural network and a self-attention mechanism to extract the contextual information of polysemous words. Then, a K-means algorithm, which is improved by optimizing the density peaks clustering algorithm based on cosine similarity, is applied to perform word sense induction on the contextual information of polysemous words. Finally, the method constructs the corresponding word sense embedded representations of the polysemous words. The results of the experiments demonstrate that the proposed method produces better word sense induction than Euclidean distance, Pearson correlation, and KL-divergence and more accurate word sense embeddings than mean shift, DBSCAN, spectral clustering, and agglomerative clustering.
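The clustering step described in the abstract (K-means seeded by a density-peaks procedure over cosine similarity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `cutoff` density threshold, the ρ·δ peak-scoring rule, and the spherical (cosine-based) K-means update are choices made for the sketch.

```python
import numpy as np

def cosine_sim(X):
    """Pairwise cosine similarity of the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def density_peak_seeds(X, k, cutoff=0.5):
    """Pick k initial centroids with a density-peaks heuristic:
    rho  = number of neighbours whose cosine similarity exceeds `cutoff`;
    delta = cosine distance to the nearest point of higher (or equal,
            earlier-ranked) density. Points scoring high on rho * delta
    are taken as cluster centres."""
    S = cosine_sim(X)
    D = 1.0 - S                         # cosine distance
    rho = (S > cutoff).sum(axis=1) - 1  # exclude self-similarity
    order = np.argsort(-rho)            # indices, densest first
    n = len(X)
    delta = np.zeros(n)
    delta[order[0]] = D[order[0]].max() # densest point: farthest distance
    for pos in range(1, n):
        i = order[pos]
        delta[i] = D[i, order[:pos]].min()
    return np.argsort(rho * delta)[-k:]

def kmeans_cosine(X, k, iters=50):
    """Spherical K-means: assign by cosine similarity, re-normalise
    centroids each iteration; seeded by the density-peaks heuristic."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = Xn[density_peak_seeds(Xn, k)]
    for _ in range(iters):
        labels = (Xn @ C.T).argmax(axis=1)
        for j in range(k):
            members = Xn[labels == j]
            if len(members):
                c = members.mean(axis=0)
                C[j] = c / np.linalg.norm(c)
    return labels, C
```

On contextual vectors of a polysemous word, each recovered cluster would correspond to one induced sense, and the normalised centroid serves as that sense's embedding.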

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3504/10412592/9be0a511054f/41598_2023_40062_Fig1_HTML.jpg
