Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction.

Authors

Alrehili Ahlam, Alhothali Areej

Affiliations

Department of Computer Sciences, Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia.

Department of Computer Science, College of Computing and Informatics, Saudi Electronic University, Riyadh, Saudi Arabia.

Publication information

PeerJ Comput Sci. 2025 Mar 31;11:e2724. doi: 10.7717/peerj-cs.2724. eCollection 2025.

DOI:10.7717/peerj-cs.2724
PMID:40567667
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12190623/
Abstract

Natural language processing (NLP) augments text data to overcome sample size constraints. Scarce and low-quality data present particular challenges when learning from these domains. Increasing the sample size is a natural and widely used strategy for alleviating these challenges. Moreover, data-augmentation techniques are commonly used in languages with rich data resources to address problems such as exposure bias. In this study, we chose Arabic to increase the sample size and correct grammatical errors. Arabic is considered one of the languages with limited resources for grammatical error correction (GEC) despite being one of the most popular among Arabs and non-Arabs because of its close connection to Islam. Therefore, this study aims to develop an Arabic corpus called "Tibyan" for grammatical error correction using ChatGPT. ChatGPT is used as a data augmenter tool based on a pair of Arabic sentences containing grammatical errors matched with a sentence free of errors extracted from Arabic books, called guide sentences. Multiple steps were involved in establishing our corpus, including collecting and pre-processing a pair of Arabic texts from various sources, such as books and open-access corpora. We then used ChatGPT to generate a parallel corpus based on the text collected previously, as a guide for generating sentences with multiple types of errors. By engaging linguistic experts to review and validate the automatically generated sentences, we ensured they were correct and error-free. The corpus was validated and refined iteratively based on feedback provided by linguistic experts to improve its accuracy. Finally, we used the Arabic Error Type Annotation tool (ARETA) to analyze the types of errors in the Tibyan corpus. Our corpus contained 49% of errors, including seven types: orthography, morphology, syntax, semantics, punctuation, merge, and split. The Tibyan corpus contains approximately 600 K tokens.
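
The pipeline described in the abstract (guide sentences in, generated erroneous sentences out, then an error-type analysis) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `GuidePair` record, the prompt wording in `build_prompt`, and the `error_type_shares` helper are hypothetical and are not taken from the paper, from ChatGPT's interface, or from ARETA. Only the seven error-type names come from the abstract. No model is called; the sketch only assembles a prompt string and tallies labels that an annotation step would have produced.

```python
from collections import Counter
from dataclasses import dataclass

# The seven error types named in the abstract.
ERROR_TYPES = (
    "orthography", "morphology", "syntax", "semantics",
    "punctuation", "merge", "split",
)


@dataclass(frozen=True)
class GuidePair:
    """Hypothetical record: an erroneous sentence matched with its correction."""
    erroneous: str
    corrected: str


def build_prompt(guide: GuidePair, clean_sentence: str, error_type: str) -> str:
    """Assemble an illustrative augmentation prompt (the wording is an assumption)."""
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type}")
    return (
        "Example of an Arabic sentence with an error and its correction:\n"
        f"Erroneous: {guide.erroneous}\n"
        f"Corrected: {guide.corrected}\n"
        f"Introduce one {error_type} error into this sentence:\n"
        f"{clean_sentence}"
    )


def error_type_shares(labels: list[str]) -> dict[str, float]:
    """Return each error type's share of all labelled errors."""
    unknown = set(labels) - set(ERROR_TYPES)
    if unknown:
        raise ValueError(f"unknown error types: {sorted(unknown)}")
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {t: 0.0 for t in ERROR_TYPES}
    return {t: counts[t] / total for t in ERROR_TYPES}
```

A caller would pass the prompt to whatever generation service it uses, have experts review the output, and then feed the resulting error labels to `error_type_shares` to check how evenly the seven types are covered.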

