Alrehili Ahlam, Alhothali Areej
Department of Computer Sciences, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia.
Department of Computer Science, College of Computing and Informatics, Saudi Electronic University, Riyadh, Saudi Arabia.
PeerJ Comput Sci. 2025 Mar 31;11:e2724. doi: 10.7717/peerj-cs.2724. eCollection 2025.
In natural language processing (NLP), text data are often augmented to overcome sample-size constraints. Scarce and low-quality data present particular challenges when learning in low-resource domains, and increasing the sample size is a natural and widely used strategy for alleviating them. Data-augmentation techniques are also commonly used in languages with rich data resources to address problems such as exposure bias. In this study, we focus on Arabic, both to increase the available sample size and to support grammatical error correction. Arabic is considered a low-resource language for grammatical error correction (GEC), despite being widely used by Arabs and non-Arabs alike because of its close connection to Islam. This study therefore develops an Arabic GEC corpus, called "Tibyan", using ChatGPT. ChatGPT serves as a data-augmentation tool: given error-free sentences extracted from Arabic books, called guide sentences, it generates matching sentences containing grammatical errors, producing pairs of erroneous and correct Arabic sentences. Building the corpus involved multiple steps, including collecting and pre-processing paired Arabic texts from various sources, such as books and open-access corpora. We then used ChatGPT to generate a parallel corpus from the collected text, using the guide sentences to produce counterparts with multiple types of errors. Linguistic experts reviewed and validated the automatically generated sentences to ensure the corrected side was accurate and error-free, and the corpus was refined iteratively based on their feedback. Finally, we used the Arabic Error Type Annotation tool (ARETA) to analyze the error types in the Tibyan corpus. Errors account for 49% of the corpus and span seven types: orthography, morphology, syntax, semantics, punctuation, merge, and split. The Tibyan corpus contains approximately 600K tokens.
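The augmentation step described in the abstract — generating an erroneous counterpart for each error-free guide sentence — could be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the prompt wording, the `augment` helper, and the model name are assumptions; only the seven error types come from the abstract.

```python
# Hypothetical sketch of ChatGPT-based GEC data augmentation:
# for each error-free Arabic "guide" sentence, ask the model to
# produce a version containing one specific error type, yielding
# (erroneous, correct) pairs for a parallel corpus.

# The seven error types reported for the Tibyan corpus.
ERROR_TYPES = [
    "orthography", "morphology", "syntax",
    "semantics", "punctuation", "merge", "split",
]

def build_augmentation_prompt(guide_sentence: str, error_type: str) -> str:
    """Compose a prompt (illustrative wording) asking the model to
    inject one error type into an otherwise correct Arabic sentence."""
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type}")
    return (
        "Rewrite the following correct Arabic sentence so that it "
        f"contains a {error_type} error, changing nothing else. "
        "Return only the rewritten sentence.\n"
        f"Sentence: {guide_sentence}"
    )

def augment(guide_sentences, client, model="gpt-3.5-turbo"):
    """Yield (erroneous, correct) pairs. `client` is assumed to be an
    OpenAI-compatible chat client; the wiring here is hypothetical."""
    for sentence in guide_sentences:
        for error_type in ERROR_TYPES:
            prompt = build_augmentation_prompt(sentence, error_type)
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            erroneous = response.choices[0].message.content.strip()
            yield erroneous, sentence
```

In the paper's workflow, the generated pairs would then pass through expert review before being annotated with ARETA; that validation loop is outside this sketch.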