Arabic punctuation dataset.

Author information

Yagi Sane, Elnagar Ashraf, Yaghi Esra

Affiliations

Department of Foreign Languages, University of Sharjah, the United Arab Emirates.

Department of Computer Science, University of Sharjah, the United Arab Emirates.

Publication information

Data Brief. 2024 Feb 1;53:110118. doi: 10.1016/j.dib.2024.110118. eCollection 2024 Apr.

Abstract

Arabic, unlike many languages, suffers from punctuation inconsistency, which poses a significant obstacle for Natural Language Processing (NLP). To address this, we present the Arabic Punctuation Dataset (APD), a large collection of annotated Modern Standard Arabic (MSA) texts designed to train machine learning models in sentence boundary identification and punctuation prediction. APD leverages the "theme-rheme completion" principle, a grammatical feature closely linked to consistent punctuation placement. The collection encompasses 312 million words in approximately 12 million sentences and comprises three components: (1) Arabic Book Chapters (ABC), manually annotated non-fiction book excerpts that constitute a gold-standard reference; (2) Complete Book Translations (CBT), parallel English-Arabic book translations with aligned sentence endings, well suited to machine translation training; and (3) Scrambled Sentences from the Arabic Component of the United Nations Parallel Corpus (SSAC-UNPC), jumbled sentences for training models in automatic punctuation restoration. Beyond NLP, APD serves as a valuable resource for linguistics research, language learning, and real-time subtitling. Its authentic, grammar-based approach can enhance the readability and clarity of machine-generated text, opening doors for applications such as automatic speech recognition, text summarization, and machine translation.
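As a rough illustration of how punctuation-annotated text of this kind might be turned into training examples for punctuation prediction, the sketch below splits a sentence into (word, following mark) pairs. The function name `to_token_labels`, the "O" label for "no mark", and the set of marks in `PUNCT` are all assumptions made for this example; they are not taken from APD or its documentation.

```python
# Marks treated as labels. This set is an assumption for illustration,
# not the punctuation inventory used by APD.
PUNCT = "،؛؟.!:,;?"


def to_token_labels(text):
    """Split punctuated text into (word, following-mark) pairs.

    The label "O" means that no punctuation mark follows the word.
    Only the first mark of a trailing run is kept.
    """
    pairs = []
    for chunk in text.split():
        word = chunk.rstrip(PUNCT)
        mark = chunk[len(word):]
        if not word:
            # A mark separated from its word by whitespace is attached
            # to the previous word if that word has no label yet.
            if pairs and pairs[-1][1] == "O":
                pairs[-1] = (pairs[-1][0], mark[0])
            continue
        pairs.append((word, mark[0] if mark else "O"))
    return pairs


# Example with a short Arabic sentence ("The boy went to school, then returned.")
pairs = to_token_labels("ذهب الولد إلى المدرسة، ثم عاد.")
model_input = [word for word, _ in pairs]   # unpunctuated words
targets = [label for _, label in pairs]     # mark to predict after each word
```

A model for punctuation restoration would then be trained to predict `targets` from `model_input`; a sentence boundary can be read off wherever the predicted label is a sentence-final mark.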
