Ziyaden Atabay, Yelenov Amir, Hajiyev Fuad, Rustamov Samir, Pak Alexandr
Kazakh-British Technical University, Almaty, Kazakhstan.
Institute of Information and Computational Technologies, Almaty, Kazakhstan.
PeerJ Comput Sci. 2024 Mar 29;10:e1974. doi: 10.7717/peerj-cs.1974. eCollection 2024.
BACKGROUND: In the domain of natural language processing (NLP), the development and success of advanced language models are predominantly anchored in the richness of available linguistic resources. Languages such as Azerbaijani, which is classified as low-resource, often face challenges arising from limited labeled datasets, consequently hindering effective model training. METHODOLOGY: The primary objective of this study was to enhance the effectiveness and generalization capabilities of news text classification models using text augmentation techniques. In this study, we address the problem of working with a low-resource language through translation-based augmentation, using Facebook's mBart50 model, the Google Translate API, and a combination of the two, thereby expanding the amount of usable text. RESULTS: The experimental outcomes reveal a promising uptick in classification performance when models are trained on the augmented dataset compared with their counterparts trained on the original data. This investigation underscores the potential of combined data augmentation strategies to bolster the NLP capabilities of underrepresented languages. As a result of our research, we have published our labeled text classification dataset and a pre-trained RoBERTa model for the Azerbaijani language.
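A minimal sketch of the translation-based augmentation described in the methodology, assuming the Hugging Face facebook/mbart-large-50-many-to-many-mmt checkpoint and its az_AZ/en_XX language codes; the round trip through English shown here is one plausible configuration, not necessarily the exact pipeline the authors used.

```python
# Back-translation augmentation sketch with mBART-50 (assumed checkpoint,
# not confirmed as the paper's exact setup).
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

MODEL = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(MODEL)
model = MBartForConditionalGeneration.from_pretrained(MODEL)

def translate(text: str, src: str, tgt: str) -> str:
    """Translate `text` from `src` to `tgt` with mBART-50 language codes."""
    tokenizer.src_lang = src
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt],
        max_length=512,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

def back_translate(text: str) -> str:
    """Create an augmented Azerbaijani variant via a round trip through English."""
    english = translate(text, src="az_AZ", tgt="en_XX")
    return translate(english, src="en_XX", tgt="az_AZ")

# Each augmented sentence keeps the original label, doubling the labeled data.
print(back_translate("Bu, Azərbaycan dilində nümunə cümlədir."))
```

The Google Translate API step mentioned in the abstract would slot in the same way, replacing either leg of the round trip with an API call, so the two systems can be mixed for additional lexical diversity.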