Gaustad Tanja, Puttkammer Martin J
Centre for Text Technology, North-West University, South Africa.
Data Brief. 2022 Feb 25;41:107994. doi: 10.1016/j.dib.2022.107994. eCollection 2022 Apr.
This data article presents a linguistically annotated data set for four official South African languages with a conjunctive orthography: isiNdebele, isiXhosa, isiZulu and Siswati. The data set is parallel across all four languages and can be used for language-specific and cross-language development and evaluation of core Natural Language Processing (NLP) technologies, as well as for corpus linguistic studies. The article describes how the data was collected and what types of text it contains, and gives details, with an example, of the three types of linguistic annotation added (morphology, part-of-speech and lemmas).