Al-Shameri Noora, Al-Khalifa Hend
Information Technology Department, King Saud University, Riyadh, Saudi Arabia.
Data Brief. 2024 Oct 10;57:111004. doi: 10.1016/j.dib.2024.111004. eCollection 2024 Dec.
The Arabic paraphrased parallel dataset supports NLP and other language-related applications by drawing on data from diverse sources and expanding it through data augmentation. It can benefit machine translation, text summarization, and sentiment analysis, and it supports better understanding and processing of the Arabic language. It is also a useful resource for improving educational materials, optimizing search engines, and supporting content creation across fields. Because it aids semantic analysis of context and meaning, it is valuable for domain-specific applications. The main aim of building the dataset is to generate paraphrased sentences through synthetic augmentation using back translation, addressing the shortage of research and datasets on paraphrase generation in Arabic. The process involves collecting sentences from various sources, then preprocessing and evaluating them to ensure reliability and usefulness. This systematic approach aims to produce a robust Arabic paraphrased dataset that can be used across NLP tasks and can foster further work in Arabic language processing.
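As a rough illustration of the back translation idea mentioned above, the following Python sketch sends each sentence through a pivot language and back, then keeps the result as a paraphrase candidate. It is not the authors' pipeline: the choice of English as the pivot, the `back_translate` and `normalize` helpers, the filter that drops unchanged round trips, and the toy dictionary "translators" are all assumptions made for illustration. In practice the two callables would wrap real machine translation systems.

```python
from typing import Callable, Iterable, List, Tuple

# A translator is any function mapping a sentence to its translation.
Translator = Callable[[str], str]


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivially different strings compare equal."""
    return " ".join(text.split())


def back_translate(
    sentences: Iterable[str],
    to_pivot: Translator,
    from_pivot: Translator,
) -> List[Tuple[str, str]]:
    """Return (source, paraphrase) pairs from a round trip through a pivot language."""
    pairs: List[Tuple[str, str]] = []
    for sentence in sentences:
        source = normalize(sentence)
        if not source:
            continue  # skip empty or whitespace-only lines
        paraphrase = normalize(from_pivot(to_pivot(source)))
        # Assumed filter: keep only round trips that actually changed the wording.
        if paraphrase and paraphrase != source:
            pairs.append((source, paraphrase))
    return pairs


# Toy stand-ins for real translation models (hypothetical, illustration only).
AR_TO_EN = {"الطقس جميل اليوم": "the weather is nice today", "شكرا": "thanks"}
EN_TO_AR = {"the weather is nice today": "الجو لطيف اليوم", "thanks": "شكرا"}


def toy_ar_to_en(text: str) -> str:
    return AR_TO_EN.get(text, text)


def toy_en_to_ar(text: str) -> str:
    return EN_TO_AR.get(text, text)


pairs = back_translate(
    ["الطقس  جميل اليوم", "شكرا", "   "],
    toy_ar_to_en,
    toy_en_to_ar,
)
for source, paraphrase in pairs:
    print(source, "->", paraphrase)
```

Passing the translators in as plain callables keeps the sketch independent of any particular translation service. A real dataset build would also need the preprocessing and evaluation steps the abstract describes, which are not modelled here.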