Semantic textual similarity for modern standard and dialectal Arabic using transfer learning.

Affiliations

Department of Computer Engineering, College of Computer and Information Sciences (CCIS), King Saud University, Riyadh, Saudi Arabia.

Center of Smart Robotics Research, College of Computer and Information Science, King Saud University, Riyadh, Saudi Arabia.

Publication information

PLoS One. 2022 Aug 11;17(8):e0272991. doi: 10.1371/journal.pone.0272991. eCollection 2022.

DOI:10.1371/journal.pone.0272991
PMID:35951673
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9371328/
Abstract

Semantic Textual Similarity (STS) is the task of identifying the semantic correlation between two sentences of the same or different languages. STS is an important task in natural language processing because it has many applications in different domains such as information retrieval, machine translation, plagiarism detection, document categorization, semantic search, and conversational systems. The availability of STS training and evaluation data resources for some languages such as English has led to good performance systems that achieve above 80% correlation with human judgment. Unfortunately, such required STS data resources are not available for many languages like Arabic. To overcome this challenge, this paper proposes three different approaches to generate effective STS Arabic models. The first one is based on evaluating the use of automatic machine translation for English STS data to Arabic to be used in fine-tuning. The second approach is based on the interleaving of Arabic models with English data resources. The third approach is based on fine-tuning the knowledge distillation-based models to boost their performance in Arabic using a proposed translated dataset. With very limited resources consisting of just a few hundred Arabic STS sentence pairs, we managed to achieve a score of 81% correlation, evaluated using the standard STS 2017 Arabic evaluation set. Also, we managed to extend the Arabic models to process two local dialects, Egyptian (EG) and Saudi Arabian (SA), with a correlation score of 77.5% for EG dialect and 76% for the SA dialect evaluated using dialectal conversion from the same standard STS 2017 Arabic set.
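The correlation scores quoted in the abstract compare model-predicted similarity with human-assigned scores. A minimal sketch of that kind of evaluation follows, assuming the metric is Pearson correlation (the usual choice for STS benchmarks; the abstract itself only says "correlation") and that predictions are cosine similarities between sentence embeddings. The toy vectors, gold scores, and function names are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy "sentence embeddings" for three sentence pairs.
# A real system would obtain these from a fine-tuned encoder.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction: high similarity
    ([1.0, 0.0], [1.0, 1.0]),  # partially aligned
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal: no similarity
]

# Hypothetical human similarity ratings on a 0-5 scale (assumed scale).
gold = [5.0, 3.0, 0.0]

predicted = [cosine(u, v) for u, v in pairs]
score = pearson(predicted, gold)
print(f"Pearson r = {score:.3f}")
```

Pearson correlation is invariant to linear rescaling, so cosine similarities in [-1, 1] can be compared directly with ratings on a different scale without normalizing either side.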


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6be4/9371328/80429e2f72bc/pone.0272991.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6be4/9371328/03b728f3f82e/pone.0272991.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6be4/9371328/19a5c89b676b/pone.0272991.g002.jpg

Similar articles

1
Semantic textual similarity for modern standard and dialectal Arabic using transfer learning.
PLoS One. 2022 Aug 11;17(8):e0272991. doi: 10.1371/journal.pone.0272991. eCollection 2022.
2
Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT.
Methods Inf Med. 2021 Jun;60(S 01):e56-e64. doi: 10.1055/s-0041-1731390. Epub 2021 Jul 8.
3
The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview.
JMIR Med Inform. 2020 Nov 27;8(11):e23375. doi: 10.2196/23375.
4
Improving neural machine translation for low resource languages through non-parallel corpora: a case study of Egyptian dialect to modern standard Arabic translation.
Sci Rep. 2024 Jan 27;14(1):2265. doi: 10.1038/s41598-023-51090-4.
5
Predicting Semantic Similarity Between Clinical Sentence Pairs Using Transformer Models: Evaluation and Representational Analysis.
JMIR Med Inform. 2021 May 26;9(5):e23099. doi: 10.2196/23099.
6
BIOSSES: a semantic sentence similarity estimation system for the biomedical domain.
Bioinformatics. 2017 Jul 15;33(14):i49-i58. doi: 10.1093/bioinformatics/btx238.
7
BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1844-1855. doi: 10.1093/jamia/ocae029.
8
Is literary Arabic a second language for native Arab speakers?: Evidence from semantic priming study.
J Psycholinguist Res. 2005 Jan;34(1):51-70. doi: 10.1007/s10936-005-3631-8.
9
ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations.
Data Brief. 2024 Feb 29;54:110271. doi: 10.1016/j.dib.2024.110271. eCollection 2024 Jun.
10
A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
