In the heart of Swahili: An exploration of data collection methods and corpus curation for natural language processing.

Authors

Masua Bernard, Masasi Noel

Affiliation

College of Information and Communication Technologies (CoICT), University of Dar Es Salaam, Ali Hassan Mwinyi Road, Kijitonyama campus, Dar Es Salaam TZ 33335, Tanzania.

Publication information

Data Brief. 2024 Jul 17;55:110751. doi: 10.1016/j.dib.2024.110751. eCollection 2024 Aug.

DOI:10.1016/j.dib.2024.110751

PMID:39234059

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11372376/

Abstract

The Swahili corpus is a dataset generated by collecting written Kiswahili sentences from different sectors that deal with Kiswahili documents. A corpus of the target language is needed in Natural Language Processing (NLP) tasks to fit an algorithm to that language before a model is trained. The generated Swahili corpus contains 1,693,228 sentences with 39,639,824 words and 871,452 unique words. The corpus was exported in text file format with a storage size of 168 MB. The sentences were collected from different sources in the following categories: Health (AFYA), Business and Industries (BIASHARA), Parliament (BUNGE), Religion (DINI), Education (ELIMU), News (HABARI), Agriculture (KILIMO), Social Media (MITANDAO), Non-Governmental Organizations (MASHIRIKA YA KIRAIA), Government (SERIKALI), Laws (SHERIA) and Politics (SIASA). This article describes the systematic data collection process employed to create the Swahili corpus from multiple public websites and reports. Compiling the corpus involved a meticulous and comprehensive approach to ensure the representation of diverse linguistic contexts and topics relevant to the Swahili language. The data collection process began with the identification of suitable sources across various domains, including news articles, health publications, online forums, and governmental public reports. Websites and platforms with publicly available Swahili content were systematically crawled and archived to capture a broad spectrum of linguistic expressions. Special attention was given to reputable sources to maintain the authenticity and linguistic richness of the corpus. The inclusion of diverse sources ensures that the corpus reflects the linguistic nuances inherent in different contexts and registers within the Swahili language. Additionally, efforts were made to incorporate variations in domain dialects, acknowledging the linguistic diversity present in Swahili.
The potential for reusing this Swahili corpus is vast. Researchers, linguists, and language enthusiasts can leverage the diverse and extensive dataset for a multitude of applications, including NLP tasks such as sentiment analysis, textual data clustering, classification tasks and machine translation. The corpus can serve as training data for developing and evaluating NLP algorithms, including part-of-speech tagging and named entity recognition. Text mining techniques can also be applied to the corpus to enable researchers to extract valuable insights, identify patterns, and discover knowledge from large textual datasets.
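
The headline figures in the abstract (sentences, words, unique words) are the kind of statistics that can be recomputed from a plain-text export. The sketch below shows one possible way to do this. It is not the authors' procedure: it assumes one sentence per line, lowercases tokens before counting, and treats a word as a run of letters with optional internal apostrophes (so that forms such as "ng'ombe" stay whole). The function name `corpus_stats` and the sample sentences are illustrative only, and a different tokenization rule would give different counts.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters, optionally joined by
# internal apostrophes (e.g. "ng'ombe"). Digits and punctuation are skipped.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def corpus_stats(lines):
    """Count sentences, words and unique words in an iterable of lines.

    Assumption: each non-empty line holds exactly one sentence.
    """
    sentences = 0
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are not counted as sentences
        sentences += 1
        counts.update(tok.lower() for tok in TOKEN_RE.findall(line))
    return {
        "sentences": sentences,
        "words": sum(counts.values()),
        "unique_words": len(counts),
    }


# Illustrative in-memory sample, not taken from the corpus itself.
sample = ["Habari ya asubuhi.", "", "Habari za kilimo na afya."]
print(corpus_stats(sample))
```

Because `corpus_stats` accepts any iterable of strings, an open file handle can be passed directly, which avoids loading a 168 MB file into memory at once.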

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc01/11372376/8d327470198a/gr1.jpg
