Leveraging large language models for spelling correction in Turkish.

Author Information

Guzel Turhan Ceren

Affiliations

Department of Computer Engineering, Gazi University, Ankara, Turkey.

Department of Cognitive Robotics, Delft University of Technology, Delft, Netherlands.

Publication Information

PeerJ Comput Sci. 2025 Jun 16;11:e2889. doi: 10.7717/peerj-cs.2889. eCollection 2025.

DOI:10.7717/peerj-cs.2889
PMID:40567745
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12192738/
Abstract

The field of natural language processing (NLP) has rapidly progressed, particularly with the rise of large language models (LLMs), which enhance our understanding of the intrinsic structures of languages in a cross-linguistic manner for complex NLP tasks. However, commonly encountered misspellings in human-written texts adversely affect language understanding for LLMs for various NLP tasks as well as misspelling applications such as auto-proofreading and chatbots. Therefore, this study focuses on the task of spelling correction in the agglutinative language Turkish, where its nature makes spell correction significantly more challenging. To address this, the research introduces a novel dataset, referred to as NoisyWikiTr, to explore encoder-only models based on bidirectional encoder representations from transformers (BERT) and existing auto-correction tools. For the first time in this study, as far as is known, encoder-only models based on BERT are presented as subword prediction models, and encoder-decoder models based on text-cleaning (Text-to-Text Transfer Transformer) architecture are fine-tuned for this task in Turkish. A comprehensive comparison of these models highlights the advantages of context-based approaches over traditional, context-free auto-correction tools. The findings also reveal that among LLMs, a language-specific sequence-to-sequence model outperforms both cross-lingual sequence-to-sequence models and encoder-only models in handling realistic misspellings.
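The abstract introduces NoisyWikiTr, a dataset of realistic misspellings paired with clean text. The paper's actual noising procedure is not described here; the sketch below is a hypothetical illustration of how (noisy, clean) training pairs for a sequence-to-sequence corrector might be generated by injecting character-level edits. The edit types, rates, and sample sentence are assumptions, not the authors' method.

```python
import random

# Character-level edit operations commonly used to simulate typos.
EDITS = ("delete", "insert", "substitute", "transpose")
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_noise(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Inject random character edits into `text` at roughly `rate` per letter."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(EDITS)
            if op == "delete":
                pass  # drop the character entirely
            elif op == "insert":
                out.extend([c, rng.choice(ALPHABET)])
            elif op == "substitute":
                out.append(rng.choice(ALPHABET))
            elif op == "transpose" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])  # swap with the next character
                i += 1
            else:
                out.append(c)  # transpose at end of string: keep as-is
        else:
            out.append(c)
        i += 1
    return "".join(out)

# Hypothetical example sentence; the (noisy, clean) tuple is the usual
# (input, target) shape for fine-tuning a sequence-to-sequence corrector.
clean = "dogal dil isleme modelleri yazim hatalarina duyarlidir"
noisy = add_noise(clean, rate=0.15)
pair = (noisy, clean)
```

With `rate=0.0` the function is the identity, which makes the noise level easy to control when building a benchmark at several severity tiers.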

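The abstract contrasts context-based models with traditional, context-free auto-correction tools. A minimal sketch of the latter kind of baseline: rank dictionary words by Levenshtein edit distance to the misspelled token, ignoring sentence context entirely. The tiny Turkish lexicon is a made-up example, not from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(token: str, lexicon: list) -> str:
    """Pick the lexicon word closest to `token`; ties broken alphabetically."""
    return min(lexicon, key=lambda w: (levenshtein(token, w), w))

lexicon = ["kitap", "kalem", "okul", "kedi"]
print(correct("kitapp", lexicon))  # → kitap
```

Because the candidate ranking looks only at surface edit distance, such a tool cannot distinguish between two valid words that are both one edit away from the typo; resolving that ambiguity requires sentence context, which is exactly where the BERT- and T5-style models studied here have the advantage.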

Figures (PMC12192738):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/abb7/12192738/c3f725ae21e5/peerj-cs-11-2889-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/abb7/12192738/54b7278d4dcf/peerj-cs-11-2889-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/abb7/12192738/d182baaaeb28/peerj-cs-11-2889-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/abb7/12192738/ecf1070be27c/peerj-cs-11-2889-g004.jpg

Similar Articles

1. Leveraging large language models for spelling correction in Turkish.
PeerJ Comput Sci. 2025 Jun 16;11:e2889. doi: 10.7717/peerj-cs.2889. eCollection 2025.
2. Detecting Redundant Health Survey Questions by Using Language-Agnostic Bidirectional Encoder Representations From Transformers Sentence Embedding: Algorithm Development Study.
JMIR Med Inform. 2025 Jun 10;13:e71687. doi: 10.2196/71687.
3. Enhancing Pulmonary Disease Prediction Using Large Language Models With Feature Summarization and Hybrid Retrieval-Augmented Generation: Multicenter Methodological Study Based on Radiology Report.
J Med Internet Res. 2025 Jun 11;27:e72638. doi: 10.2196/72638.
4. Large Language Model Architectures in Health Care: Scoping Review of Research Perspectives.
J Med Internet Res. 2025 Jun 19;27:e70315. doi: 10.2196/70315.
5. Trajectory-Ordered Objectives for Self-Supervised Representation Learning of Temporal Healthcare Data Using Transformers: Model Development and Evaluation Study.
JMIR Med Inform. 2025 Jun 4;13:e68138. doi: 10.2196/68138.
6. Sentiment Analysis Using a Large Language Model-Based Approach to Detect Opioids Mixed With Other Substances Via Social Media: Method Development and Validation.
JMIR Infodemiology. 2025 Jun 19;5:e70525. doi: 10.2196/70525.
7. Text intelligent correction in English translation: A study on integrating models with dependency attention mechanism.
PLoS One. 2025 Jun 24;20(6):e0319690. doi: 10.1371/journal.pone.0319690. eCollection 2025.
8. From BERT to generative AI - Comparing encoder-only vs. large language models in a cohort of lung cancer patients for named entity recognition in unstructured medical reports.
Comput Biol Med. 2025 Sep;195:110665. doi: 10.1016/j.compbiomed.2025.110665. Epub 2025 Jun 24.
9. Applications of Large Language Models in the Field of Suicide Prevention: Scoping Review.
J Med Internet Res. 2025 Jan 23;27:e63126. doi: 10.2196/63126.
10. Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.
Cochrane Database Syst Rev. 2022 May 20;5(5):CD013665. doi: 10.1002/14651858.CD013665.pub3.
