Savci Pinar, Das Bihter
Arçelik A.Ş. Karaağaç Caddesi 2-6, Sütlüce Beyoğlu 34445 Istanbul, Turkey.
Department of Software Engineering, Technology Faculty, Firat University, 23119, Elazig, Turkey.
Heliyon. 2023 May 1;9(5):e15670. doi: 10.1016/j.heliyon.2023.e15670. eCollection 2023 May.
Since Turkish is an agglutinative language rich in reduplications, idioms, and metaphors, Turkish texts are sources of information with extremely rich meanings. For this reason, processing and classifying Turkish texts according to their characteristics is both time-consuming and difficult. In this study, we compared the performance of pre-trained language models for multi-text classification using AutoTrain on a 250K Turkish dataset that we created. The results showed that the BERTurk (uncased, 128k) language model achieved higher accuracy on the dataset than the other models, with a training time of 66 min and considerably lower CO2 emissions. The ConvBERTurk mC4 (uncased) model was the second-best-performing model. This study provides a deeper understanding of the capabilities of pre-trained Turkish language models in machine learning.
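A minimal sketch of how a BERTurk checkpoint can be fine-tuned for Turkish text classification, using the Hugging Face Transformers Trainer API rather than the authors' AutoTrain setup. The checkpoint dbmdz/bert-base-turkish-128k-uncased on the Hugging Face Hub corresponds to BERTurk (uncased, 128k); the CSV file names, label count, and hyperparameters below are illustrative assumptions, not values from the paper.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# BERTurk (uncased, 128k vocabulary) as published on the Hugging Face Hub.
checkpoint = "dbmdz/bert-base-turkish-128k-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Hypothetical CSV files with "text" and "label" columns stand in for the
# authors' 250K Turkish dataset, which is not specified in the abstract.
data = load_dataset(
    "csv", data_files={"train": "train.csv", "validation": "val.csv"}
)

def tokenize(batch):
    # Truncate long documents; padding is applied dynamically by the Trainer.
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

# num_labels=5 is an assumed class count for illustration only.
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=5
)

args = TrainingArguments(
    output_dir="berturk-text-cls",
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
print(trainer.evaluate())
```

AutoTrain automates the model selection and hyperparameter choices that this sketch hard-codes, which is why the paper can compare several pre-trained checkpoints under a uniform fine-tuning procedure.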