Arab2Vec: An Arabic word embedding model for use in Twitter NLP applications.

Authors

Hamdy Abdelrahman, Youssef Ayman, Ryan Conor

Affiliations

The Open University, Milton Keynes, United Kingdom.

Department of Computers and Systems, Electronics Research Institute, Cairo, Egypt.

Publication information

PLoS One. 2025 Aug 29;20(8):e0328369. doi: 10.1371/journal.pone.0328369. eCollection 2025.

DOI:10.1371/journal.pone.0328369
PMID:40880504
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12396693/
Abstract

The analysis of Arabic Twitter data sets is a highly active research topic, particularly since the outbreak of COVID-19 and subsequent attempts to understand public sentiment related to the pandemic. This activity is partially driven by the high number of Arabic Twitter users, around 164 million. Word embedding models are a vital tool for analysing Twitter data sets, as they are considered one of the essential methods of transforming words into numbers that can be processed using machine learning (ML) algorithms. In this work, we introduce a new model, Arab2Vec, that can be used in Twitter-based natural language processing (NLP) applications. Arab2Vec was constructed using a vast data set of approximately 186,000,000 tweets from 2008 to 2021 from all Arabic Twitter sources. This makes Arab2Vec the most up-to-date word embedding model researchers can use for Twitter-based applications. The model is compared with existing models from the literature. The reported results demonstrate superior performance regarding the number of recognised words and F1 score for classification tasks with known data sets and the ability to work with emojis. We also incorporate skip-grams with negative sampling, an approach that other Arabic models haven't previously used. Nine versions of Arab2Vec are produced; these models differ regarding available features, the number of words trained on, speed, etc. This paper provides Arab2Vec as an open-source project for users to employ in research. It describes the data collection methods, the data pre-processing and cleaning step, the effort to build these nine models, and experiments to validate them qualitatively and quantitatively.
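
The abstract mentions skip-grams with negative sampling without describing the mechanics. The following is a minimal, generic sketch of that technique in plain Python. It is not the authors' training code: the function names (`skipgram_pairs`, `sgns_loss`, `sgns_step`), the window size, the learning rate, and the toy vectors are all assumptions made for illustration.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (center, context) pair."""
    # Pull the true context word towards the center word...
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    # ...and push each sampled "noise" word away from it.
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss


def sgns_step(center_vec, context_vec, negative_vecs, lr=0.1):
    """Apply one in-place SGD update to the vectors involved in one pair."""
    grad_center = [0.0] * len(center_vec)
    labelled = [(context_vec, 1.0)] + [(neg, 0.0) for neg in negative_vecs]
    for vec, label in labelled:
        # Derivative of the loss with respect to the score dot(center, vec).
        g = sigmoid(dot(center_vec, vec)) - label
        for k in range(len(vec)):
            grad_center[k] += g * vec[k]      # uses vec[k] before its update
            vec[k] -= lr * g * center_vec[k]  # update the output vector
    for k in range(len(center_vec)):
        center_vec[k] -= lr * grad_center[k]  # update the center vector last


# Toy usage with hypothetical 2-dimensional vectors.
pairs = skipgram_pairs(["w1", "w2", "w3"], window=1)
center = [0.1, -0.2]
context = [0.3, 0.1]
negatives = [[-0.1, 0.2], [0.2, 0.2]]
loss_before = sgns_loss(center, context, negatives)
sgns_step(center, context, negatives, lr=0.1)
loss_after = sgns_loss(center, context, negatives)
```

In practice the negative words are drawn at random from a frequency-based noise distribution and the update is repeated over every pair in the corpus; the sketch fixes the negatives by hand so that a single step is easy to follow.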
