Yagi Sane, Elnagar Ashraf, Yaghi Esra
Department of Foreign Languages, University of Sharjah, the United Arab Emirates.
Department of Computer Science, University of Sharjah, the United Arab Emirates.
Data Brief. 2024 Feb 1;53:110118. doi: 10.1016/j.dib.2024.110118. eCollection 2024 Apr.
Unlike many languages, Arabic is punctuated inconsistently, which poses a significant obstacle for Natural Language Processing (NLP). To address this, we present the Arabic Punctuation Dataset (APD), a large annotated collection of Modern Standard Arabic (MSA) texts designed to train machine learning models for sentence boundary identification and punctuation prediction. APD builds on the "theme-rheme completion" principle, a grammatical feature closely linked to consistent punctuation placement. The dataset encompasses 312 million words in approximately 12 million sentences and comprises three components: (1) Arabic Book Chapters (ABC): manually annotated non-fiction book excerpts that constitute a gold-standard reference; (2) Complete Book Translations (CBT): parallel English-Arabic book translations with aligned sentence endings, suited to machine translation training; and (3) Scrambled Sentences from the Arabic Component of the United Nations Parallel Corpus (SSAC-UNPC): jumbled sentences for training models in automatic punctuation restoration. Beyond NLP, APD is a resource for linguistics research, language learning, and real-time subtitling. Its authentic, grammar-based approach can improve the readability and clarity of machine-generated text in applications such as automatic speech recognition, text summarization, and machine translation.
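To make the two target tasks concrete, the sketch below shows one common way to turn punctuated text into training data for punctuation prediction and sentence boundary identification: each token is paired with the punctuation mark that follows it. This is an illustration only, not the authors' pipeline. The label set, the "O" (no punctuation) tag, and the function names `to_labeled_tokens` and `sentence_boundaries` are assumptions, and the abstract does not describe APD's file format.

```python
# Marks treated as labels. This set is an assumption for illustration,
# not APD's documented tag set. It includes the Arabic comma (U+060C),
# Arabic semicolon (U+061B), and Arabic question mark (U+061F).
PUNCT = {".", "،", "؛", "؟", "!", ":"}


def to_labeled_tokens(text):
    """Split punctuated text into (token, label) pairs.

    The label is the first punctuation mark that follows the token,
    or "O" when the token is not followed by punctuation.
    """
    pairs = []
    for raw in text.split():
        # Peel trailing punctuation marks off the whitespace-delimited token.
        trailing = ""
        while raw and raw[-1] in PUNCT:
            trailing = raw[-1] + trailing
            raw = raw[:-1]
        if raw:
            pairs.append((raw, trailing[:1] or "O"))
        elif pairs and trailing and pairs[-1][1] == "O":
            # A stand-alone mark is attached to the previous token.
            pairs[-1] = (pairs[-1][0], trailing[:1])
    return pairs


def sentence_boundaries(pairs, enders=(".", "؟", "!")):
    """Return the indices of tokens whose label ends a sentence."""
    return [i for i, (_, label) in enumerate(pairs) if label in enders]


if __name__ == "__main__":
    sample = "ذهب الولد إلى المدرسة، ثم عاد. هل رأيته؟"
    labeled = to_labeled_tokens(sample)
    for token, label in labeled:
        print(token, label)
    print(sentence_boundaries(labeled))
```

A model trained on such pairs sees only the tokens and predicts the labels; the sentence-ending labels then give the sentence boundaries directly.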