Arabic punctuation dataset.

Authors

Yagi Sane, Elnagar Ashraf, Yaghi Esra

Affiliations

Department of Foreign Languages, University of Sharjah, the United Arab Emirates.

Department of Computer Science, University of Sharjah, the United Arab Emirates.

Publication information

Data Brief. 2024 Feb 1;53:110118. doi: 10.1016/j.dib.2024.110118. eCollection 2024 Apr.

DOI:10.1016/j.dib.2024.110118

PMID:38348323

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10859254/

Abstract

Arabic, unlike many languages, suffers from punctuation inconsistency, which poses a significant obstacle for Natural Language Processing (NLP). To address this, we present the Arabic Punctuation Dataset (APD), a large collection of annotated Modern Standard Arabic (MSA) texts designed to train machine learning models in sentence boundary identification and punctuation prediction. APD leverages the "theme-rheme completion" principle, a grammatical feature closely linked to consistent punctuation placement. The collection encompasses 312 million words in approximately 12 million sentences and comprises three diverse components. Arabic Book Chapters (ABC): manually annotated non-fiction book excerpts that constitute a gold-standard reference. Complete Book Translations (CBT): parallel English-Arabic book translations with aligned sentence endings, ideal for machine translation training. Scrambled Sentences from the Arabic Component of the United Nations Parallel Corpus (SSAC-UNPC): jumbled sentences for training models in automatic punctuation restoration. Beyond NLP, APD serves as a valuable resource for linguistics research, language learning, and real-time subtitling. Its authentic, grammar-based approach can enhance the readability and clarity of machine-generated text, opening doors for applications such as automatic speech recognition, text summarization, and machine translation.
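As an illustration of how punctuated text of this kind might be turned into training examples for punctuation prediction and sentence boundary identification, here is a minimal Python sketch. It is not the authors' pipeline: the punctuation inventory, the "O" label for "no mark", the function names, and the sample sentence are all assumptions made for this example, and nothing here reflects APD's actual file format.

```python
# Assumed punctuation inventory (not taken from the APD paper):
# Arabic comma U+060C, Arabic semicolon U+061B, Arabic question mark U+061F,
# plus the full stop, exclamation mark, and colon.
PUNCT = "\u060c\u061b\u061f.!:"
SENTENCE_END = "\u061f.!"
NO_PUNCT = "O"  # hypothetical label for "no mark follows this word"


def to_word_label_pairs(text):
    """Return (word, label) pairs; label is the first mark following the word."""
    pairs = []
    for token in text.split():
        word = token.rstrip(PUNCT)
        trailing = token[len(word):]
        if not word:
            # A free-standing mark labels the previous word if it has no label yet.
            if pairs and pairs[-1][1] == NO_PUNCT:
                pairs[-1] = (pairs[-1][0], trailing[0])
            continue
        pairs.append((word, trailing[0] if trailing else NO_PUNCT))
    return pairs


def count_sentences(pairs):
    """Count words whose label is a sentence-ending mark."""
    return sum(1 for _, label in pairs if label in SENTENCE_END)


# Made-up sample; the Arabic comma and question mark are written as escapes
# so the code points are unambiguous.
sample = "ذهب الولد إلى المدرسة\u060c ثم عاد. هل وصل\u061f"
pairs = to_word_label_pairs(sample)
words = [w for w, _ in pairs]    # unpunctuated model input
labels = [l for _, l in pairs]   # per-word punctuation targets
```

A model trained on such pairs would receive `words` as input and predict `labels`; sentence boundaries then fall wherever a sentence-ending label is predicted.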
