In the heart of Swahili: An exploration of data collection methods and corpus curation for natural language processing.

Author information

Masua Bernard, Masasi Noel

Affiliation

College of Information and Communication Technologies (CoICT), University of Dar Es Salaam, Ali Hassan Mwinyi Road, Kijitonyama campus, Dar Es Salaam TZ 33335, Tanzania.

Publication information

Data Brief. 2024 Jul 17;55:110751. doi: 10.1016/j.dib.2024.110751. eCollection 2024 Aug.

Abstract

The Swahili corpus is a dataset generated by collecting written Kiswahili sentences from different sectors that deal with Kiswahili documents. In Natural Language Processing (NLP) tasks, a corpus of the target language is needed so that an algorithm can be fitted to that language before a model is trained. The generated Swahili corpus contains 1,693,228 sentences with 39,639,824 words, of which 871,452 are unique. The corpus was exported as a plain-text file with a storage size of 168 MB. The sentences were collected from different sources in the following categories: Health (AFYA), Business and Industries (BIASHARA), Parliament (BUNGE), Religion (DINI), Education (ELIMU), News (HABARI), Agriculture (KILIMO), Social Media (MITANDAO), Non-Governmental Organizations (MASHIRIKA YA KIRAIA), Government (SERIKALI), Laws (SHERIA) and Politics (SIASA). This article describes the systematic data collection process employed to create a Swahili corpus derived from multiple public websites and reports. Compiling the corpus involved a meticulous and comprehensive approach to ensure that diverse linguistic contexts and topics relevant to the Swahili language are represented. Data collection began with the identification of suitable sources across various domains, including news articles, health publications, online forums, and governmental public reports. Websites and platforms with publicly available Swahili content were systematically crawled and archived to capture a broad spectrum of linguistic expressions. Special attention was given to reputable sources to maintain the authenticity and linguistic richness of the corpus. The inclusion of diverse sources ensures that the corpus reflects the linguistic nuances inherent in different contexts and registers of Swahili. Efforts were also made to incorporate variations in domain dialects, acknowledging the linguistic diversity present in Swahili.
The potential for reusing this Swahili corpus is vast. Researchers, linguists, and language enthusiasts can use the diverse and extensive dataset for many applications, including NLP tasks such as sentiment analysis, textual data clustering, classification, and machine translation. The corpus can serve as training data for developing and evaluating NLP algorithms, including part-of-speech tagging and named entity recognition. Text mining techniques can also be applied to the corpus, enabling researchers to extract valuable insights, identify patterns, and discover knowledge from large textual datasets.
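The headline figures in the abstract (sentence count, word count, unique-word count) are the kind of statistics that can be recomputed from a plain-text export. The following is only a minimal sketch, not the authors' procedure: it assumes one sentence per line, whitespace tokenization, and lowercasing before counting, none of which the abstract specifies. The function name `corpus_stats` and the sample sentences are hypothetical.

```python
from collections import Counter
import io


def corpus_stats(lines):
    """Count sentences, word tokens, and unique words.

    Assumptions (not stated in the source): each non-empty line is one
    sentence, tokens are separated by whitespace, and words are
    lowercased before counting.
    """
    sentences = 0
    vocab = Counter()
    for line in lines:
        tokens = line.lower().split()
        if not tokens:  # skip blank lines
            continue
        sentences += 1
        vocab.update(tokens)
    return {
        "sentences": sentences,
        "words": sum(vocab.values()),
        "unique_words": len(vocab),
    }


# Tiny in-memory stand-in for the exported text file (hypothetical content).
sample = io.StringIO("Habari za asubuhi\nAfya ni muhimu\n\nHabari za jioni\n")
stats = corpus_stats(sample)
print(stats)
```

For a real file, the same function could be called on an open file handle, for example `with open(path, encoding="utf-8") as f: corpus_stats(f)`. Because it reads line by line, a file of the stated size should not need to be loaded into memory at once. Counts obtained this way may differ from the published ones if the authors tokenized or normalized the text differently.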


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc01/11372376/8d327470198a/gr1.jpg
