Improving the state-of-the-art in Thai semantic similarity using distributional semantics and ontological information.

Affiliations

Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang (KMITL), Bangkok, Thailand.

Faculty of Software Engineering and Computer Systems, ITMO University, St. Petersburg, Russia.

Publication Info

PLoS One. 2021 Feb 17;16(2):e0246751. doi: 10.1371/journal.pone.0246751. eCollection 2021.

Abstract

Research into semantic similarity has a long history in lexical semantics, and it has applications in many natural language processing (NLP) tasks such as word sense disambiguation and machine translation. The task of calculating semantic similarity is usually presented in the form of datasets which contain word pairs and a human-assigned similarity score. Algorithms are then evaluated by their ability to approximate the gold-standard similarity scores. Many such datasets, with different characteristics, have been created for the English language. Recently, four of those were transformed into Thai-language versions, namely WordSim-353, SimLex-999, SemEval-2017-500, and R&G-65. Given those four datasets, in this work we aim to improve the previous baseline evaluations for Thai semantic similarity and solve challenges of unsegmented Asian languages (particularly the high fraction of out-of-vocabulary (OOV) dataset terms). To this end we apply and integrate different strategies to compute similarity, including traditional word-level embeddings, subword-unit embeddings, and ontological or hybrid sources like WordNet and ConceptNet. With our best model, which combines self-trained fastText subword embeddings with ConceptNet Numberbatch, we managed to raise the state-of-the-art, measured with the harmonic mean of Pearson and Spearman ρ, by a large margin: from 0.356 to 0.688 for TH-WordSim-353, from 0.286 to 0.769 for TH-SemEval-500, from 0.397 to 0.717 for TH-SimLex-999, and from 0.505 to 0.901 for TWS-65.
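The evaluation metric described above (the harmonic mean of Pearson r and Spearman ρ between model scores and human gold-standard ratings) can be sketched in plain Python. The toy word-pair ratings below are hypothetical and for illustration only; the rank computation assumes no tied values, which real evaluation code (e.g. scipy.stats) handles more carefully.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank sequences.
    Simplified: assumes no ties among the values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

def harmonic_mean_corr(gold, pred):
    """Combined score used to compare models: harmonic mean of r and rho."""
    r = pearson(gold, pred)
    rho = spearman(gold, pred)
    return 2 * r * rho / (r + rho)

# Hypothetical human ratings for five word pairs vs. model cosine similarities.
gold = [9.0, 7.5, 3.2, 1.0, 5.5]
pred = [0.92, 0.70, 0.35, 0.08, 0.51]
score = harmonic_mean_corr(gold, pred)
```

The harmonic mean penalizes a model that does well on only one of the two correlations, so a high combined score requires both a good linear fit (Pearson) and a good ranking (Spearman).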

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f61a/7888635/658807301759/pone.0246751.g001.jpg
