Leveraging large language models for spelling correction in Turkish.

Author information

Guzel Turhan Ceren

Affiliations

Department of Computer Engineering, Gazi University, Ankara, Turkey.

Department of Cognitive Robotics, Delft University of Technology, Delft, Netherlands.

Publication information

PeerJ Comput Sci. 2025 Jun 16;11:e2889. doi: 10.7717/peerj-cs.2889. eCollection 2025.

DOI:10.7717/peerj-cs.2889

PMID:40567745

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12192738/

Abstract

The field of natural language processing (NLP) has progressed rapidly, particularly with the rise of large language models (LLMs), which enhance our understanding of the intrinsic structures of languages in a cross-linguistic manner for complex NLP tasks. However, misspellings commonly found in human-written text degrade the language understanding of LLMs across NLP tasks, as well as in spelling-sensitive applications such as auto-proofreading and chatbots. This study therefore focuses on spelling correction in Turkish, an agglutinative language whose nature makes the task significantly more challenging. To address this, the research introduces a novel dataset, NoisyWikiTr, to explore encoder-only models based on bidirectional encoder representations from transformers (BERT) alongside existing auto-correction tools. As far as is known, this is the first study in which BERT-based encoder-only models are presented as subword prediction models and encoder-decoder models based on the T5 (Text-to-Text Transfer Transformer) architecture are fine-tuned for this task in Turkish. A comprehensive comparison of these models highlights the advantages of context-based approaches over traditional, context-free auto-correction tools. The findings also reveal that, among LLMs, a language-specific sequence-to-sequence model outperforms both cross-lingual sequence-to-sequence models and encoder-only models in handling realistic misspellings.
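To make the contrast with "traditional, context-free auto-correction tools" concrete, the following is a minimal sketch of one such approach: a dictionary lookup by edit distance that corrects each token in isolation. It is not the method or any tool evaluated in the study. The function names, the tiny vocabulary, and the distance threshold are all hypothetical and chosen only for illustration.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_word(word: str, vocabulary: Iterable[str], max_distance: int = 2) -> str:
    """Return the closest vocabulary entry, or the word unchanged if none is close.

    Ties are broken alphabetically, because a context-free corrector has
    no other signal to choose between equally distant candidates.
    """
    candidates = list(vocabulary)
    if not candidates:
        return word
    best = min(candidates, key=lambda v: (levenshtein(word, v), v))
    return best if levenshtein(word, best) <= max_distance else word


def correct_sentence(sentence: str, vocabulary: Iterable[str]) -> str:
    """Correct each whitespace-separated token independently (no context used)."""
    vocab = list(vocabulary)
    return " ".join(correct_word(tok, vocab) for tok in sentence.split())


# Hypothetical toy vocabulary; a real tool would need a full lexicon.
VOCAB = ["ev", "evler", "evlerimizden", "kar", "kor", "kitap"]

if __name__ == "__main__":
    print(correct_word("evlerimzden", VOCAB))  # one missing letter
    print(correct_word("kxr", VOCAB))          # equidistant from "kar" and "kor"
```

In this sketch, "kxr" is one edit away from both "kar" (snow) and "kor" (ember), so the choice falls to an arbitrary tie-break; a context-based model could instead use the surrounding words. The vocabulary also has to list every inflected surface form explicitly (for example "evlerimizden", "from our houses"), which hints at why agglutinative morphology strains purely dictionary-based correction.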
