
Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages.

Authors

Ziyaden Atabay, Yelenov Amir, Hajiyev Fuad, Rustamov Samir, Pak Alexandr

Affiliations

Kazakh-British Technical University, Almaty, Kazakhstan.

Institute of Information and Computational Technologies, Almaty, Kazakhstan.

Publication

PeerJ Comput Sci. 2024 Mar 29;10:e1974. doi: 10.7717/peerj-cs.1974. eCollection 2024.


DOI: 10.7717/peerj-cs.1974
PMID: 38660166
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11041965/
Abstract

BACKGROUND: In the domain of natural language processing (NLP), the development and success of advanced language models are predominantly anchored in the richness of available linguistic resources. Languages such as Azerbaijani, which is classified as low-resource, often face challenges arising from limited labeled datasets, consequently hindering effective model training. METHODOLOGY: The primary objective of this study was to enhance the effectiveness and generalization capabilities of news text classification models using text augmentation techniques. We address the scarcity of labeled data through translation-based augmentation, using the Facebook mBART50 model, the Google Translate API, and a combination of the two, thereby expanding the available training text. RESULTS: The experimental outcomes reveal a promising uptick in classification performance when models are trained on the augmented dataset compared with their counterparts using the original data. This investigation underscores the immense potential of combined data augmentation strategies to bolster the NLP capabilities of underrepresented languages. As a result of this research, we have published our labeled text classification dataset and a pre-trained RoBERTa model for the Azerbaijani language.
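The translation-based augmentation strategy the abstract describes can be sketched as follows. This is a minimal sketch with hypothetical helper names: in practice the `to_pivot` and `from_pivot` callables would wrap mBART50 and/or the Google Translate API (not shown here), round-tripping each Azerbaijani sample through English to produce a labeled paraphrase.

```python
# Back-translation augmentation sketch (hypothetical helper names; the
# translator callables stand in for mBART50 or the Google Translate API).

def back_translate(text, to_pivot, from_pivot):
    """Round-trip a sentence through a pivot language (e.g. az -> en -> az).

    The imperfect round trip yields a paraphrase of the original text,
    which can be added to the training set as a new sample.
    """
    return from_pivot(to_pivot(text))

def augment_dataset(samples, to_pivot, from_pivot):
    """Return the original (text, label) pairs plus one back-translated
    paraphrase per sample, preserving each sample's class label."""
    augmented = list(samples)
    for text, label in samples:
        augmented.append((back_translate(text, to_pivot, from_pivot), label))
    return augmented
```

The combined strategy reported in the paper could then mix pipelines, for example using mBART50 for the forward direction and Google Translate for the return direction, so the two systems' differing outputs produce more varied paraphrases.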


Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdc1/11041965/081c51035cf4/peerj-cs-10-1974-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdc1/11041965/3f7a6d2d8e6f/peerj-cs-10-1974-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdc1/11041965/de37998103a7/peerj-cs-10-1974-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdc1/11041965/72ecb457d050/peerj-cs-10-1974-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdc1/11041965/ed21185136c8/peerj-cs-10-1974-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdc1/11041965/95496f86eed9/peerj-cs-10-1974-g006.jpg

Similar Articles

[1]
Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages.

PeerJ Comput Sci. 2024-3-29

[2]
Building lexicon-based sentiment analysis model for low-resource languages.

MethodsX. 2023-10-22

[3]
Annotated dataset creation through large language models for non-english medical NLP.

J Biomed Inform. 2023-9

[4]
Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports.

BMC Bioinformatics. 2023-9-2

[5]
Enhancing African low-resource languages: Swahili data for language modelling.

Data Brief. 2020-6-30

[6]
Detection of offensive terms in resource-poor language using machine learning algorithms.

PeerJ Comput Sci. 2023-8-29

[7]
Neural machine translation of clinical text: an empirical investigation into multilingual pre-trained language models and transfer-learning.

Front Digit Health. 2024-2-26

[8]
Natural Language Processing Applications in the Clinical Neurosciences: A Machine Learning Augmented Systematic Review.

Acta Neurochir Suppl. 2022

[9]
Towards Transfer Learning Techniques-BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study.

Sensors (Basel). 2022-10-26

[10]
BTSD: A curated transformation of sentence dataset for text classification in Bangla language.

Data Brief. 2023-7-24

Cited By

[1]
An adaptive fusion-based data augmentation method for abstract dialogue summarization.

PeerJ Comput Sci. 2025-4-18

References

[1]
Abstractive text summarization of low-resourced languages using deep learning.

PeerJ Comput Sci. 2023-1-13

[2]
Fake news detection in Urdu language using machine learning.

PeerJ Comput Sci. 2023-5-23

[3]
The neural machine translation models for the low-resource Kazakh-English language pair.

PeerJ Comput Sci. 2023-2-8

[4]
Addressing religious hate online: from taxonomy creation to automated detection.

PeerJ Comput Sci. 2022-12-15

[5]
BengSentiLex and BengSwearLex: creating lexicons for sentiment analysis and profanity detection in low-resource Bengali language.

PeerJ Comput Sci. 2021-11-16
