Semantic textual similarity for modern standard and dialectal Arabic using transfer learning.

Author information

Department of Computer Engineering, College of Computer and Information Sciences (CCIS), King Saud University, Riyadh, Saudi Arabia.

Center of Smart Robotics Research, College of Computer and Information Science, King Saud University, Riyadh, Saudi Arabia.

Publication information

PLoS One. 2022 Aug 11;17(8):e0272991. doi: 10.1371/journal.pone.0272991. eCollection 2022.

Abstract

Semantic Textual Similarity (STS) is the task of identifying the degree of semantic similarity between two sentences, whether they are in the same language or in different languages. STS is an important task in natural language processing, with applications in information retrieval, machine translation, plagiarism detection, document categorization, semantic search, and conversational systems. For languages such as English, the availability of STS training and evaluation data has led to high-performing systems that achieve above 80% correlation with human judgment. Unfortunately, such data resources are not available for many languages, including Arabic. To overcome this challenge, this paper proposes three approaches to building effective Arabic STS models. The first evaluates the use of automatic machine translation to convert English STS data into Arabic for fine-tuning. The second interleaves Arabic models with English data resources. The third fine-tunes knowledge distillation-based models on a proposed translated dataset to boost their performance in Arabic. With very limited resources consisting of just a few hundred Arabic STS sentence pairs, we achieved a correlation of 81% on the standard STS 2017 Arabic evaluation set. We also extended the Arabic models to two local dialects, Egyptian (EG) and Saudi Arabian (SA), reaching correlation scores of 77.5% for the EG dialect and 76% for the SA dialect, evaluated on dialectal conversions of the same standard STS 2017 Arabic set.
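The "correlation with human judgment" figures quoted in the abstract can be made concrete with a small sketch. It assumes that a model scores each sentence pair by the cosine similarity of two sentence embeddings, and that the reported metric is Pearson correlation against gold scores on a 0 to 5 scale, as is usual for STS benchmarks. The abstract confirms neither detail, and the function names and toy vectors below are hypothetical illustrations rather than the authors' code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def evaluate_sts(embedding_pairs, gold_scores):
    """Score each pair by cosine similarity, then correlate with gold labels."""
    predicted = [cosine_similarity(u, v) for u, v in embedding_pairs]
    return pearson(predicted, gold_scores)


if __name__ == "__main__":
    # Hypothetical 3-dimensional "sentence embeddings"; a real system would
    # obtain these from a fine-tuned sentence encoder.
    pairs = [
        ([1.0, 0.0, 0.0], [1.0, 0.1, 0.0]),  # near-paraphrases
        ([1.0, 1.0, 0.0], [1.0, 0.0, 1.0]),  # partially related
        ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # unrelated
    ]
    gold = [4.8, 2.5, 0.2]  # made-up human similarity judgments (0 to 5)
    print(f"Pearson correlation: {evaluate_sts(pairs, gold):.3f}")
```

Under these assumptions, a result such as the 81% quoted in the abstract would correspond to a Pearson coefficient of about 0.81 between the model's scores and the human labels on the evaluation set.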

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6be4/9371328/03b728f3f82e/pone.0272991.g001.jpg
