

Comparison of pre-trained language models in terms of carbon emissions, time and accuracy in multi-label text classification using AutoML.

Authors

Savci Pinar, Das Bihter

Affiliations

Arçelik A.Ş. Karaağaç Caddesi 2-6, Sütlüce Beyoğlu 34445 Istanbul, Turkey.

Department of Software Engineering, Technology Faculty, Firat University, 23119, Elazig, Turkey.

Publication

Heliyon. 2023 May 1;9(5):e15670. doi: 10.1016/j.heliyon.2023.e15670. eCollection 2023 May.

Abstract

Since Turkish is an agglutinative language and contains reduplications, idioms, and metaphors, Turkish texts are sources of information with extremely rich meanings. For this reason, processing and classifying Turkish texts according to their characteristics is both time-consuming and difficult. In this study, we compared the performance of pre-trained language models for multi-label text classification using Autotrain on a 250K-sample Turkish dataset that we created. The results showed that the BERTurk (uncased, 128k) language model achieved the highest accuracy on the dataset, with a training time of 66 min and markedly lower CO2 emissions than the other models. The ConvBERTurk mC4 (uncased) model was the second-best-performing model. This study provides a deeper understanding of the capabilities of pre-trained Turkish language models in machine learning.
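The abstract does not specify how multi-label performance was scored. As a minimal sketch of two metrics commonly used for this task, the snippet below implements subset accuracy and micro-averaged F1 in plain Python; the label names and predictions are hypothetical and not from the paper.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, label) decisions."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        tp += len(t & p)   # labels correctly predicted
        fp += len(p - t)   # labels predicted but absent
        fn += len(t - p)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical label sets for three documents
y_true = [{"economy"}, {"sports", "health"}, {"politics"}]
y_pred = [{"economy"}, {"sports"}, {"politics", "economy"}]

print(subset_accuracy(y_true, y_pred))  # 1 of 3 exact matches
print(micro_f1(y_true, y_pred))
```

Subset accuracy is strict (a sample only counts if every label matches), while micro-F1 credits partially correct predictions, so the two together give a fuller picture when comparing models.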


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/40a8/10176029/09706150b220/gr1.jpg
